Thread: Parallel Index Scans

Parallel Index Scans

From
Amit Kapila
Date:
As of now, the driving table for a parallel query is accessed by
parallel sequential scan, which limits its usage to a certain degree.
Parallelising index scans would extend parallel query to many more
cases.  This patch enables parallelism for btree scans.  Supporting
parallel index scans for other index types like hash, gist, and
spgist can be done as separate patches.

The basic idea is quite similar to parallel heap scans: each worker
(including the leader whenever possible) scans a block and then gets
the next block that needs to be scanned.  The parallelism is
implemented at the leaf level of the btree.  The first worker to
start a btree scan descends to a leaf, while the others wait until it
has reached the leaf.  After reading the leaf block, that worker sets
the next block to be read, wakes the first worker waiting to scan the
next block, and proceeds to scan tuples from the block it has read.
Similarly, each worker, after reading a block, sets the next block to
be read and wakes up the first waiting worker.  This is achieved by
using the condition variable patch [1] proposed by Robert.
Parallelism is supported for both forward and backward scans.
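The handoff described above can be sketched as a small state machine.  This is an illustrative sketch only: the type and function names (SharedBTScan, scan_seize, scan_release) are invented for the example; the real patch keeps this state in shared memory guarded by a lock, and waiting workers sleep on a condition variable rather than asserting.

```c
#include <assert.h>
#include <stdbool.h>

#define P_NONE 0                /* sentinel: no next leaf page */

typedef enum
{
    BTPARALLEL_NOT_INITIALIZED, /* no worker has descended to a leaf yet */
    BTPARALLEL_ADVANCING,       /* a worker is computing the next page */
    BTPARALLEL_IDLE,            /* next page is published, free to seize */
    BTPARALLEL_DONE             /* scan has reached the end */
} BTParallelState;

typedef struct
{
    BTParallelState state;
    unsigned    next_page;      /* next leaf page workers should read */
} SharedBTScan;

/*
 * A worker "seizes" the scan: it takes the published next page and marks
 * the scan ADVANCING so that other workers wait until a new page is
 * published.  Returns false when the scan is finished.
 */
static bool
scan_seize(SharedBTScan *shared, unsigned *page)
{
    if (shared->state == BTPARALLEL_DONE)
        return false;
    /* real code: sleep on a condition variable while ADVANCING */
    assert(shared->state == BTPARALLEL_IDLE);
    *page = shared->next_page;
    shared->state = BTPARALLEL_ADVANCING;
    return true;
}

/*
 * After reading its page, a worker publishes the next page to read and
 * wakes the first waiting worker (in the real code, via a condition
 * variable signal).
 */
static void
scan_release(SharedBTScan *shared, unsigned next_page)
{
    shared->next_page = next_page;
    shared->state = (next_page == P_NONE) ? BTPARALLEL_DONE : BTPARALLEL_IDLE;
}
```

A worker loop would then alternate scan_seize/scan_release until seize returns false, at which point that worker is done with the scan.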

The optimizer chooses parallelism based on the number of pages in the
index relation, and the CPU cost of evaluating the rows is divided
equally among workers.  The Index Scan node is made parallel-aware and
can be used beneath Gather as shown below:

Current Plan for Index Scans
----------------------------------------
 Index Scan using idx2 on test  (cost=0.42..7378.96 rows=2433 width=29)
   Index Cond: (c < 10)


Parallel version of plan
----------------------------------
 Gather  (cost=1000.42..1243.40 rows=2433 width=29)
   Workers Planned: 1
   ->  Parallel Index Scan using idx2 on test  (cost=0.42..0.10 rows=1431 width=29)
         Index Cond: (c < 10)


Parallel index scans can be used in parallelising aggregate queries as
well.  For example, given a query like:

select count(*) from t1 where c1 > 1000 and c1 < 1100 and c2 = 'aaa' group by c2;

the below forms of parallel plans are possible:

 Finalize HashAggregate
   Group Key: c2
   ->  Gather
         Workers Planned: 1
         ->  Partial HashAggregate
               Group Key: c2
               ->  Parallel Index Scan using idx_t1_partial on t1
                     Index Cond: ((c1 > 1000) AND (c1 < 1100))
                     Filter: (c2 = 'aaa'::bpchar)

OR

Finalize GroupAggregate
   Group Key: c2
   ->  Sort
         ->  Gather
               Workers Planned: 1
               ->  Partial GroupAggregate
                     Group Key: c2
                     ->  Parallel Index Scan using idx_t1_partial on t1
                           Index Cond: ((c1 > 1000) AND (c1 < 1100))
                           Filter: (c2 = 'aaa'::bpchar)

In the second plan (GroupAggregate), the Sort + Gather step would be
replaced with GatherMerge once we have a GatherMerge node as proposed
by Rushabh [2].  Note that the above examples are just meant to
explain the usage of parallel index scan; actual plans will be
selected based on cost.

Performance tests
----------------------------
This test has been performed on a community machine (hydra, POWER-7).

Initialize pgbench with 3000 scale factor (./pgbench -i -s 3000 postgres)

Count the rows in pgbench_accounts based on values of aid and bid

Serial plan
------------------
set max_parallel_workers_per_gather=0;

postgres=# explain analyze select count(aid) from pgbench_accounts
where aid > 1000 and aid < 90000000 and bid > 800 and bid < 900;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4714590.52..4714590.53 rows=1 width=8) (actual time=35684.425..35684.425 rows=1 loops=1)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..4707458.12 rows=2852961 width=4) (actual time=29210.743..34385.271 rows=9900000 loops=1)
         Index Cond: ((aid > 1000) AND (aid < 90000000))
         Filter: ((bid > 800) AND (bid < 900))
         Rows Removed by Filter: 80098999
 Planning time: 0.183 ms
 Execution time: 35684.459 ms
(7 rows)


Parallel Plan
-------------------
set max_parallel_workers_per_gather=2;

postgres=# explain analyze select count(aid) from pgbench_accounts
where aid > 1000 and aid < 90000000 and bid > 800 and bid < 900;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3924773.13..3924773.14 rows=1 width=8) (actual time=15033.105..15033.105 rows=1 loops=1)
   ->  Gather  (cost=3924772.92..3924773.12 rows=2 width=8) (actual time=15032.986..15033.093 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=3923772.92..3923772.92 rows=1 width=8) (actual time=15030.354..15030.354 rows=1 loops=3)
               ->  Parallel Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..3920801.08 rows=1188734 width=4) (actual time=12476.068..14600.410 rows=3300000 loops=3)
                     Index Cond: ((aid > 1000) AND (aid < 90000000))
                     Filter: ((bid > 800) AND (bid < 900))
                     Rows Removed by Filter: 26699666
 Planning time: 0.244 ms
 Execution time: 15036.081 ms
(11 rows)

The above is the median of 3 runs; all runs gave almost the same
execution time.  Here, we can notice that execution time is reduced by
more than half with two workers, and I have tested with four workers,
where the time is reduced to one-fourth (9128.420 ms) of the serial
plan.  I think these results are quite similar to what we got for
parallel sequential scans.  Another thing to note is that
parallelising index scans is more beneficial if there is a Filter
which removes many rows fetched from the Index Scan, or if the Filter
is costly (for example, a filter containing a costly function
execution).  This observation is also quite similar to what we have
observed with Parallel Sequential Scans.

I think we can parallelise Index Only Scans as well, but I have not
evaluated that; it can certainly be done as a separate patch in the
future.

Contributions
--------------------
First patch (parallel_index_scan_v1.patch) implements parallelism at
the IndexAM level - Rahila Syed and Amit Kapila, based on design
inputs and suggestions by Robert Haas.
Second patch (parallel_index_opt_exec_support_v1.patch) provides
optimizer and executor support for parallel index scans - Amit Kapila.

To use these patches, first apply the condition variable patch [1],
then parallel_index_scan_v1.patch, and then
parallel_index_opt_exec_support_v1.patch.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAGPqQf09oPX-cQRpBKS0Gq49Z%2Bm6KBxgxd_p9gX8CKk_d75HoQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Oct 13, 2016 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As of now, the driving table for parallel query is accessed by
> parallel sequential scan which limits its usage to a certain degree.
> Parallelising index scans would further increase the usage of parallel
> query in many more cases.  This patch enables the parallelism for the
> btree scans.  Supporting parallel index scan for other index types
> like hash, gist, spgist can be done as separate patches.
>

I would like to have input on the method of selecting parallel
workers for scanning an index.  Currently the patch selects the number
of workers based on the size of the index relation, and the upper
limit on parallel workers is max_parallel_workers_per_gather.  This is
quite similar to what we do for parallel sequential scan, except that
for parallel seq. scan we use the parallel_workers option if provided
by the user during Create Table.  The user can provide the
parallel_workers option as below:

Create Table .... With (parallel_workers = 4);

Is it desirable to have a similar option for parallel index scans? If
yes, what should the interface be?  One possible way could be to
allow the user to provide it during Create Index as below:

Create Index .... With (parallel_workers = 4);

If the above syntax looks sensible, then we might need to think about
what should be used for parallel index builds.  It seems to me that
the parallel tuple sort patch [1] proposed by Peter G. uses the above
syntax for getting the parallel workers input from the user for
parallel index builds.

Another point which needs some thought is whether it is a good idea to
use index relation size to calculate parallel workers for an index
scan.  I think ideally for index scans it should be based on the
number of pages to be fetched/scanned from the index.


[1] - https://www.postgresql.org/message-id/CAM3SWZTmkOFEiCDpUNaO4n9-1xcmWP-1NXmT7h0Pu3gM2YuHvg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scans

From
Rahila Syed
Date:
>Another point which needs some thoughts is whether it is good idea to
>use index relation size to calculate parallel workers for index scan.
>I think ideally for index scans it should be based on number of pages
>to be fetched/scanned from index.
IIUC, it's not possible to know the exact number of pages scanned from
an index in advance.
What we are essentially making parallel is the scan of the leaf pages,
so it would make sense to base the number of workers on the number of
leaf pages.  Having said that, I think it will not make much
difference compared to the existing method, because currently the
total number of index pages is used to calculate the number of
workers.  As far as I understand, in large indexes the difference
between the number of leaf pages and the total number of pages is not
significant; in other words, internal pages form a small fraction of
the total pages.  Also, the calculation is based on the log of the
number of pages, so it will make even less difference.
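To make the log-scale point concrete, here is a rough sketch of the kind of heuristic being discussed, modeled on what parallel sequential scan does: one more worker each time the relation size triples past a minimum threshold, capped by max_parallel_workers_per_gather.  The function name and the 512-page threshold are assumptions for illustration, not the patch's actual values.

```c
#include <assert.h>

/*
 * Illustrative size-based worker heuristic: 0 workers below a minimum
 * page threshold, then one additional worker each time the page count
 * triples, capped at max_workers.  Because the result grows with the
 * log of the page count, using leaf pages instead of total pages would
 * rarely change the answer.
 */
static int
workers_for_pages(double index_pages, int max_workers)
{
    const double min_pages = 512.0; /* assumed threshold for the example */
    double      limit = min_pages * 3.0;
    int         workers;

    if (index_pages < min_pages)
        return 0;
    workers = 1;
    while (index_pages >= limit && workers < max_workers)
    {
        workers++;
        limit *= 3.0;
    }
    return workers;
}
```

For instance, tripling a 100,000-page index to 300,000 pages adds only one worker under this scheme, which is why the leaf-vs-total distinction washes out.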

Thank you,
Rahila Syed







Re: Parallel Index Scans

From
Amit Kapila
Date:
On Tue, Oct 18, 2016 at 4:08 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>Another point which needs some thoughts is whether it is good idea to
>>use index relation size to calculate parallel workers for index scan.
>>I think ideally for index scans it should be based on number of pages
>>to be fetched/scanned from index.
> IIUC, its not possible to know the exact number of pages scanned from an
> index
> in advance.

We can't find the exact number of index pages to be scanned, but I
think we can find the estimated number of pages to be fetched (refer
to cost_index).
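For illustration, the estimate being referred to is roughly of the following shape: selectivity times total index pages, clamped to at least one page.  This is a simplified sketch with an invented function name, not the server's actual cost_index/genericcostestimate logic, which is considerably more involved.

```c
#include <assert.h>

/*
 * Simplified sketch: estimated index pages fetched is the index
 * selectivity times the total index pages, clamped to a minimum of one.
 * Illustrative only; the real planner estimate accounts for much more.
 */
static double
estimated_index_pages_fetched(double index_selectivity, double index_pages)
{
    double pages = index_selectivity * index_pages;

    return (pages < 1.0) ? 1.0 : pages;
}
```

An estimate of this shape could then feed the worker-count heuristic instead of the raw relation size.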

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scans

From
Peter Geoghegan
Date:
On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Create Index .... With (parallel_workers = 4);
>
> If above syntax looks sensible, then we might need to think what
> should be used for parallel index build.  It seems to me that parallel
> tuple sort patch [1] proposed by Peter G. is using above syntax for
> getting the parallel workers input from user for parallel index
> builds.

Apparently you see a similar issue with other major database systems,
where similar storage parameters are kind of "overloaded" like this
(they are used both by index creation and by the optimizer in
considering whether it should use a parallel index scan). That can be
a kind of gotcha for their users, but maybe it's still worth it. In
any case, the complaints I saw about that were from users who used
parallel CREATE INDEX with the equivalent of my parallel_workers index
storage parameter, and then unexpectedly found this also forced the
use of parallel index scan. Not the other way around.

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.

-- 
Peter Geoghegan



Re: Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Oct 20, 2016 at 7:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Create Index .... With (parallel_workers = 4);
>>
>> If above syntax looks sensible, then we might need to think what
>> should be used for parallel index build.  It seems to me that parallel
>> tuple sort patch [1] proposed by Peter G. is using above syntax for
>> getting the parallel workers input from user for parallel index
>> builds.
>
> Apparently you see a similar issue with other major database systems,
> where similar storage parameter things are kind of "overloaded" like
> this (they are used by both index creation, and by the optimizer in
> considering whether it should use a parallel index scan). That can be
> a kind of a gotcha for their users, but maybe it's still worth it.
>

I have also checked and found that you are right.  SQL Server uses the
max degree of parallelism (MAXDOP) parameter, which I think is common
for all SQL statements.

> In
> any case, the complaints I saw about that were from users who used
> parallel CREATE INDEX with the equivalent of my parallel_workers index
> storage parameter, and then unexpectedly found this also forced the
> use of parallel index scan. Not the other way around.
>

I can understand that it can be confusing to users, so another option
could be to provide separate parameters like parallel_workers_build
and parallel_workers, where the first is used for index builds and the
second for scans.  My personal opinion is to have one parameter, so
that users have one less thing to learn about parallelism.

> Ideally, the parallel_workers storage parameter will rarely be
> necessary because the optimizer will generally do the right thing in
> all case.
>

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want a parameter similar to the one for
parallel table scans, so it could be handy for them.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scans

From
Peter Geoghegan
Date:
On Wed, Oct 19, 2016 at 8:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have also checked and found that you are right.  In SQL Server, they
> are using max degree of parallelism (MAXDOP) parameter which is I
> think is common for all the sql statements.

It's not just that one that does things this way, for what it's worth.

> I can understand that it can be confusing to users, so other option
> could be to provide separate parameters like parallel_workers_build
> and parallel_workers where first can be used for index build and
> second can be used for scan.  My personal opinion is to have one
> parameter, so that users have one less thing to learn about
> parallelism.

That's my first instinct too, but I don't really have an opinion yet.

I think that this is the kind of thing where it could make sense to
take a "wait and see" approach, and then make a firm decision
immediately prior to beta. This is what we did in deciding the name of
and fine details around what ultimately became the
max_parallel_workers_per_gather GUC (plus related GUCs and storage
parameters).

-- 
Peter Geoghegan



Re: Parallel Index Scans

From
Robert Haas
Date:
On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Ideally, the parallel_workers storage parameter will rarely be
>> necessary because the optimizer will generally do the right thing in
>> all case.
>
> Yeah, we can choose not to provide any parameter for parallel index
> scans, but some users might want to have a parameter similar to
> parallel table scans, so it could be handy for them to use.

I think the parallel_workers reloption should override the degree of
parallelism for any sort of parallel scan on that table.  Had I
intended it to apply only to sequential scans, I would have named it
differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Oct 20, 2016 at 10:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Ideally, the parallel_workers storage parameter will rarely be
>>> necessary because the optimizer will generally do the right thing in
>>> all case.
>>
>> Yeah, we can choose not to provide any parameter for parallel index
>> scans, but some users might want to have a parameter similar to
>> parallel table scans, so it could be handy for them to use.
>
> I think the parallel_workers reloption should override the degree of
> parallelism for any sort of parallel scan on that table.  Had I
> intended it to apply only to sequential scans, I would have named it
> differently.
>

I think there is a big difference in the size of the relation to scan
between a parallel sequential scan and a parallel (range) index scan,
which could make it difficult for the user to choose the value of this
parameter.  Why do you think the parallel_workers reloption should
suffice for all types of scans on a table?  The only justification I
can think of is that fewer config knobs make life easier.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scans

From
Robert Haas
Date:
On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think the parallel_workers reloption should override the degree of
>> parallelism for any sort of parallel scan on that table.  Had I
>> intended it to apply only to sequential scans, I would have named it
>> differently.
>
> I think there is big difference of size of relation to scan between
> parallel sequential scan and parallel (range) index scan which could
> make it difficult for user to choose the value of this parameter.  Why
> do you think that the parallel_workers reloption should suffice all
> type of scans for a table?  I could only think of providing it based
> on thinking that lesser config knobs makes life easier.

Well, we could do that, but it would be fairly complicated and it
doesn't seem to me to be the right place to focus our efforts.  I'd
rather try to figure out some way to make the planner smarter, because
even if users can override the number of workers on a
per-table-per-scan-type basis, they're probably still going to find
using parallel query pretty frustrating unless we make the
number-of-workers formula smarter than it is today.  Anyway, even if
we do decide to add more reloptions than just parallel_degree someday,
couldn't that be left for a separate patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I think the parallel_workers reloption should override the degree of
>>> parallelism for any sort of parallel scan on that table.  Had I
>>> intended it to apply only to sequential scans, I would have named it
>>> differently.
>>
>> I think there is big difference of size of relation to scan between
>> parallel sequential scan and parallel (range) index scan which could
>> make it difficult for user to choose the value of this parameter.  Why
>> do you think that the parallel_workers reloption should suffice all
>> type of scans for a table?  I could only think of providing it based
>> on thinking that lesser config knobs makes life easier.
>
> Well, we could do that, but it would be fairly complicated and it
> doesn't seem to me to be the right place to focus our efforts.  I'd
> rather try to figure out some way to make the planner smarter, because
> even if users can override the number of workers on a
> per-table-per-scan-type basis, they're probably still going to find
> using parallel query pretty frustrating unless we make the
> number-of-workers formula smarter than it is today.  Anyway, even if
> we do decide to add more reloptions than just parallel_degree someday,
> couldn't that be left for a separate patch?
>

That makes sense to me.  As of now, the patch doesn't consider
reloptions for parallel index scans.  So I think we can leave it as it
is, and later, as a separate patch, decide whether the table's
reloption or a separate reloption for the index would be the better
choice.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Oct 22, 2016 at 9:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I have rebased the patch (parallel_index_scan_v2) on top of the latest
commit e8ac886c (condition variables).  I have removed the usage of
ConditionVariablePrepareToSleep as it is no longer mandatory.  I have
also updated the docs for the wait event introduced by this patch
(thanks to Dilip for noticing it).  There is no change in the
parallel_index_opt_exec_support patch, but I am attaching it here for
easier reference.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel Index Scans

From
Haribabu Kommi
Date:



Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel Index Scans

From
Rafia Sabih
Date:
Hello,
On evaluating parallel index scans on TPC-H benchmark queries, I came across some interesting results.

For scale factor 20, queries 4, 6, and 14 give significant performance improvements with parallel index scans:
Q  | Head | PI
4   | 14     | 11
6   | 27     |  9
14 | 20     | 12

To confirm that the proposed patch is scalable, I tested it at scale factor 300.  There, some queries switched to bitmap index scans instead of parallel index scans, but other queries still gave significant performance improvements:
Q  | Head  | PI
4   | 207    | 168
14 | 2662  | 1576
15 | 847    | 190

All the performance numbers given above are in seconds. The experimental setup used in this exercise is as follows:
Server parameter settings: 
work_mem = 64 MB, 
max_parallel_workers_per_gather = 4, 
random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost, 
shared_buffers = 1 GB

Logical schema: Some additional indexes were created to ensure the use of indexes, 
on lineitem table -- l_shipdate, l_returnflag, l_shipmode, 
on orders table -- o_comment, o_orderdate, and 
on customer table -- c_mktsegment.

Machine used: IBM Power, 4 socket machine, 512 GB RAM

The main observations about the utility of this patch concern the availability of appropriate indexes and choosing a suitable value of random_page_cost based on the RAM and DB sizes.  For example, in these experiments I ensured a warm-cache environment, so giving a higher value to random_page_cost than seq_page_cost does not make much sense and would inhibit the use of indexes.  Also, the value of this parameter needs to be calibrated based on the underlying hardware; there is recent work in this direction that gives a mechanism to do this calibration offline, and they experimented with PostgreSQL parameters as well [1].

Please find the attached files for a detailed look at these results.  The file pi_perf_tpch.ods gives the performance numbers and the graphs for both scale factors.  The attached zip folder gives the explain analyze output for these queries both on head and with the parallel index patch.





--
Regards,
Rafia Sabih
Attachment

Re: [HACKERS] Parallel Index Scans

From
Anastasia Lubennikova
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or anything that requires more testing?
I'm willing to do a review. I haven't done benchmarks yet, but I've read the patch, and here are some
notes and questions about it.

I saw the discussion about parameters in the thread above, and I agree that we'd better concentrate
on the patch itself and add them later if necessary.

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?

+    if (scan->xs_temp_snap)
+        UnregisterSnapshot(scan->xs_snapshot);

I must say that I'm quite new to all this parallel stuff. If you give me a link
on where to read about snapshots for parallel workers, my review will be more helpful.
Anyway, it would be great to have more comments about it in the code.

2. Would you mind renaming 'amestimateparallelscan' to, say, 'amparallelscan_spacerequired'
or something like that? As far as I understand there is nothing to estimate; we know this size
for sure. I guess you've chosen this name because of 'heap_parallelscan_estimate',
but now it looks similar to 'amestimate', which refers to index scan cost for the optimizer.
That leads to the next question.

3. Are there any changes in cost estimation? I didn't find related changes in the patch.
A parallel scan is expected to be faster, and the optimizer definitely should know that.

4. +    uint8        ps_pageStatus;    /* state of scan, see below */
There is no description below. I'd make the comment more helpful:
/* state of scan. See possible flag values in nbtree.h */
And why do you call it pageStatus? What does it have to do with a page?

5. The comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for current scankeys and for that we return block number as P_NONE."
What is the reason to check (blkno == P_NONE) after checking (status == false) in _bt_first() (see code below)? If the comment is correct, we'll never reach _bt_parallel_done().

+        blkno = _bt_parallel_seize(scan, &status);
+        if (status == false)
+        {
+            BTScanPosInvalidate(so->currPos);
+            return false;
+        }
+        else if (blkno == P_NONE)
+        {
+            _bt_parallel_done(scan);
+            BTScanPosInvalidate(so->currPos);
+            return false;
+        }

6. To avoid code duplication, I would wrap this into a function:

+    /* initialize moreLeft/moreRight appropriately for scan direction */
+    if (ScanDirectionIsForward(dir))
+    {
+        so->currPos.moreLeft = false;
+        so->currPos.moreRight = true;
+    }
+    else
+    {
+        so->currPos.moreLeft = true;
+        so->currPos.moreRight = false;
+    }
+    so->numKilled = 0;            /* just paranoia */
+    so->markItemIndex = -1;        /* ditto */

And after that we can also get rid of _bt_parallel_readpage(), which only
brings another level of indirection to the code.
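Concretely, item 6's suggestion amounts to a helper along these lines.  This is a sketch against stripped-down stand-ins for the server's BTScanOpaque and ScanDirection types; the helper name is just a suggestion, not anything in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the server's types, for illustration only. */
typedef enum { BackwardScanDirection = -1, ForwardScanDirection = 1 } ScanDirection;

typedef struct
{
    bool moreLeft;
    bool moreRight;
    int  numKilled;
    int  markItemIndex;
} BTScanPosStub;

/*
 * Factor the duplicated per-scan initialization out of _bt_first() and
 * _bt_parallel_readpage() into one place.
 */
static void
bt_init_scan_pos(BTScanPosStub *pos, ScanDirection dir)
{
    /* initialize moreLeft/moreRight appropriately for scan direction */
    if (dir == ForwardScanDirection)
    {
        pos->moreLeft = false;
        pos->moreRight = true;
    }
    else
    {
        pos->moreLeft = true;
        pos->moreRight = false;
    }
    pos->numKilled = 0;         /* just paranoia */
    pos->markItemIndex = -1;    /* ditto */
}
```

Both call sites would then shrink to a single call, and the direction logic lives in one place.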

7. Just a couple of typos I've noticed:

* Below flags are used indicate the state of parallel scan.
  -> Below flags are used TO indicate the state of parallel scan.

* On success, release lock and pin on buffer on success.
  -> On success, release lock and pin on buffer.

8. I didn't find a description of the feature in documentation.
Probably we need to add a paragraph to the "Parallel Query" chapter. 

I will send another review of the performance by the end of the week.

The new status of this patch is: Waiting on Author

Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
Thanks for reviewing!  A few quick thoughts from me, since I wrote a
bunch of the design for this patch.

On Wed, Dec 21, 2016 at 10:16 AM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?
>
> +       if (scan->xs_temp_snap)
> +               UnregisterSnapshot(scan->xs_snapshot);
>
> I must say that I'm quite new with all this parallel stuff. If you give me a link,
> where to read about snapshots for parallel workers, my review will be more helpful.
> Anyway, it would be great to have more comments about it in the code.

I suspect it would be better to keep those two things formally
separate, even though they may always be the same right now.

> 2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
> or something like this? As far as I understand there is nothing to estimate, we know this size
> for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
> But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
> That leads to the next question.

"estimate" is being used this way quite widely now, in places like
ExecParallelEstimate.  So if we're going to change the terminology we
should do it broadly.

> 3. Are there any changes in cost estimation? I didn't find related changes in the patch.
> Parallel scan is expected to be faster and optimizer definitely should know that.

Generally the way that's reflected in the optimizer is by giving the
parallel scan a lower row count.  See cost_seqscan() for an
example.
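As a sketch of that derating (modeled on the leader-contribution heuristic in cost_seqscan(); the 0.3 constant is that heuristic's, and none of this is the patch's actual code):

```c
#include <assert.h>

/* Each worker is assumed to handle an equal share of the rows; the leader
 * also participates, but contributes less as it spends more time servicing
 * workers (heuristically, 30% less per planned worker). */
static double
parallel_divisor(int nworkers)
{
    double divisor = nworkers;
    double leader_contribution = 1.0 - (0.3 * nworkers);

    if (leader_contribution > 0.0)
        divisor += leader_contribution;
    return divisor;
}

/* Row count charged to each participant of the parallel scan. */
static double
parallel_rows(double total_rows, int nworkers)
{
    return total_rows / parallel_divisor(nworkers);
}
```

With one planned worker, 2433 total rows come out to roughly 1431 rows per participant, which lines up with the rows=1431 shown in the example parallel plan at the top of the thread.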

In general, you'll probably find a lot of parallels between this patch
and ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, which is probably a good
thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            tested, passed
>
> Hi, thank you for the patch.
> Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?
>

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

> I'm willing to do a review.

Thanks, that will be helpful.


> I saw the discussion about parameters in the thread above. And I agree that we'd better concentrate
> on the patch itself and add them later if necessary.
>
> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?
>
> +       if (scan->xs_temp_snap)
> +               UnregisterSnapshot(scan->xs_snapshot);
>

I agree with what Rober has told in his reply.  We do same way for
heap, refer heap_endscan().

> I must say that I'm quite new with all this parallel stuff. If you give me a link,
> where to read about snapshots for parallel workers, my review will be more helpful.
>

You can read transam/README.parallel.  Refer to the "State Sharing"
portion of that README to learn more about it.

> Anyway, it would be great to have more comments about it in the code.
>

We share the snapshot to ensure that reads in both the master backend
and the worker backends use the same snapshot.  There is no harm in
adding comments, but I think it is better to be consistent with
similar heapam code.  After reading README.parallel, if you still feel
that we should add more comments in the code, then we can definitely
do that.

> 2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
> or something like this?

Sure, I am open to other names, but IMHO, let's keep "estimate" in the
name to keep it consistent with other parallel stuff. Refer
execParallel.c to see how widely this word is used.

> As far as I understand there is nothing to estimate, we know this size
> for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
> But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
> That leads to the next question.
>

Do you mean 'amcostestimate'?  If you want we can rename it
amparallelscanestimate to be consistent with amcostestimate.

> 3. Are there any changes in cost estimation?
>

Yes.

> I didn't find related changes in the patch.
> Parallel scan is expected to be faster and optimizer definitely should know that.
>

You can find the relevant changes in
parallel_index_opt_exec_support_v2.patch, refer cost_index().

> 4. +    uint8           ps_pageStatus;  /* state of scan, see below */
> There is no description below. I'd make the comment more helpful:
> /* state of scan. See possible flags values in nbtree.h */

makes sense. Will change.

> And why do you call it pageStatus? What does it have to do with page?
>

During the scan, this tells us whether the next page is available for
scan.  Another option could be to name it scanStatus, but I am not sure
that is better.  Do you think it will be clear if we add a comment like
"indicates whether next page is available for scan" to this variable?
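To make the "page status" reading concrete, the shared field could encode a small state machine along these lines (the names mirror the BTPARALLEL_* flags being discussed, but are a guess for illustration, not necessarily the patch's):

```c
#include <assert.h>

/* Possible states of a parallel btree scan's shared ps_pageStatus field. */
typedef enum
{
    BTPARALLEL_NOT_INITIALIZED, /* no participant has started the scan yet */
    BTPARALLEL_ADVANCING,       /* one participant is finding the next page */
    BTPARALLEL_IDLE,            /* next page is set and free for the taking */
    BTPARALLEL_DONE             /* no more pages; scan has ended */
} BTPS_State;
```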

> 5. Comment for _bt_parallel_seize() says:
> "False indicates that we have reached the end of scan for
>  current scankeys and for that we return block number as P_NONE."
>
>  What is the reason to check (blkno == P_NONE) after checking (status == false)
>  in _bt_first() (see code below)? If comment is correct
>  we'll never reach _bt_parallel_done()
>
> +               blkno = _bt_parallel_seize(scan, &status);
> +               if (status == false)
> +               {
> +                       BTScanPosInvalidate(so->currPos);
> +                       return false;
> +               }
> +               else if (blkno == P_NONE)
> +               {
> +                       _bt_parallel_done(scan);
> +                       BTScanPosInvalidate(so->currPos);
> +                       return false;
> +               }
>

The first time the master backend or a worker hits the last page (calls
this API), it will return P_NONE, and after that when any worker tries
to fetch the next page, it will return status as false.  I think we can
expand the comment to explain this clearly.  Let me know if you need
more clarification, and I can explain it in detail.
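A toy model of that contract (all names and the shared-state shape here are illustrative, not the patch's code) may make it concrete: the first participant to run off the end sees status == true with blkno == P_NONE and is the one that calls _bt_parallel_done(); every later caller gets status == false immediately, which is why _bt_first() needs both checks.

```c
#include <assert.h>
#include <stdbool.h>

#define P_NONE 0                /* "no page", as in nbtree */

/* Toy shared state for one parallel scan (the real state lives in shared
 * memory and is protected by a lock/condition variable). */
typedef struct
{
    bool scan_done;             /* set once some participant saw the end */
    unsigned int next_page;     /* next leaf page, or P_NONE at the end */
} ToyParallelScan;

/* Returns the page to scan; *status == false means another participant
 * already finished the scan for the current scan keys. */
static unsigned int
toy_parallel_seize(ToyParallelScan *ps, bool *status)
{
    if (ps->scan_done)
    {
        *status = false;
        return P_NONE;
    }
    *status = true;
    return ps->next_page;       /* P_NONE here means we hit the end first */
}

/* The participant that saw P_NONE with status == true reports completion. */
static void
toy_parallel_done(ToyParallelScan *ps)
{
    ps->scan_done = true;
}
```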

> 6. To avoid code duplication, I would wrap this into the function
>
> +       /* initialize moreLeft/moreRight appropriately for scan direction */
> +       if (ScanDirectionIsForward(dir))
> +       {
> +               so->currPos.moreLeft = false;
> +               so->currPos.moreRight = true;
> +       }
> +       else
> +       {
> +               so->currPos.moreLeft = true;
> +               so->currPos.moreRight = false;
> +       }
> +       so->numKilled = 0;                      /* just paranoia */
> +       so->markItemIndex = -1;         /* ditto */
>

Okay, I think we can write a separate function (probably inline
function) for above.

> And after that we can also get rid of _bt_parallel_readpage() which only
> bring another level of indirection to the code.
>

See, this function is responsible for multiple actions like
initializing moreLeft/moreRight positions, reading the next page, and
dropping the lock/pin.  So replicating all these actions in the caller
will make the caller's code less readable than it is now.  Consider
this point and let me know your view on the same.

> 7. Just a couple of typos I've noticed:
>
>  * Below flags are used indicate the state of parallel scan.
>  * Below flags are used TO indicate the state of parallel scan.
>
> * On success, release lock and pin on buffer on success.
> * On success release lock and pin on buffer.
>

Will fix.

> 8. I didn't find a description of the feature in documentation.
> Probably we need to add a paragraph to the "Parallel Query" chapter.
>

Yes, I am aware of that and I think it makes sense to add it now
rather than waiting until the end.

> I will send another review of performance until the end of the week.
>

Okay, you can refer Rafia's mail above for non-default settings she
has used in her performance tests with TPC-H.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Dec 22, 2016 at 9:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
> <lubennikovaav@gmail.com> wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:       tested, passed
>> Spec compliant:           tested, passed
>> Documentation:            tested, passed
>>
>> Hi, thank you for the patch.
>> Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?
>>
>
> I think you can focus on the handling of array scan keys for testing.
> In general, one of my colleagues has shown interest in testing this
> patch and I think he has tested as well but never posted his findings.
> I will request him to share his findings and what kind of tests he has
> done, if any.
>
> I'm willing to do a review.
>
> Thanks, that will be helpful.
>
>
>> I saw the discussion about parameters in the thread above. And I agree that we'd better concentrate
>> on the patch itself and add them later if necessary.
>>
>> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?
>>
>> +       if (scan->xs_temp_snap)
>> +               UnregisterSnapshot(scan->xs_snapshot);
>>
>
> I agree with what Rober has told in his reply.
>

Typo.
/Rober/Robert Haas

Thanks to Michael Paquier for noticing it and informing me offline.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
tushar
Date:
On 12/22/2016 09:49 AM, Amit Kapila wrote:
> I think you can focus on the handling of array scan keys for testing.
> In general, one of my colleagues has shown interest in testing this
> patch and I think he has tested as well but never posted his findings.
> I will request him to share his findings and what kind of tests he has
> done, if any.
Sure, we (Prabhat and I) have done some testing of this feature 
internally but never published the test scripts on this forum. PFA the 
SQL scripts (along with the expected .out files) we used for testing, 
for your ready reference.

In addition, we generated an LCOV (code coverage) report and 
compared the files changed by the "Parallel index scan" patch.
You can see the numbers for "with patch" vs. "without patch" (.pdf 
file is attached)

-- 
regards,tushar


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
tushar
Date:
On 12/22/2016 01:35 PM, tushar wrote:
> On 12/22/2016 09:49 AM, Amit Kapila wrote:
>> I think you can focus on the handling of array scan keys for testing.
>> In general, one of my colleagues has shown interest in testing this
>> patch and I think he has tested as well but never posted his findings.
>> I will request him to share his findings and what kind of tests he has
>> done, if any.
> Sure, We (Prabhat and I) have done some testing for this feature 
> internally but never published the test-scripts on this forum. PFA the 
> sql scripts ( along with the expected .out files) we have used for 
> testing for your ready reference.
>
> In addition we had generated the LCOV (code coverage) report and 
> compared the files which are changed for the "Parallel index scan" patch.
> You can see the numbers for  "with patch" V/s "Without patch" (.pdf 
> file is attached)
>
In addition to that, we ran sqlsmith against the PG v10 + PIS (parallel 
index scan) patches and found a crash, but it occurs on plain PG v10 
(without applying any patches) as well:

postgres=# select
             70 as c0,
             pg_catalog.has_server_privilege(
               cast(ref_0.indexdef as text),
               cast(cast(coalesce((select name from pg_catalog.pg_settings
                 limit 1 offset 16), null) as text) as text)) as c1,
             pg_catalog.pg_export_snapshot() as c2,
             ref_0.indexdef as c3,
             ref_0.indexname as c4
           from pg_catalog.pg_indexes as ref_0
           where (ref_0.tablespace = ref_0.tablespace) or (46 = 22)
           limit 103;
TRAP: FailedAssertion("!(keylen < 64)", File: "hashfunc.c", Line: 139)
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
 
The connection to the server was lost. Attempting reset:
2016-12-23 11:19:50.627 IST [2314] LOG:  server process (PID 2322) was
terminated by signal 6: Aborted
2016-12-23 11:19:50.627 IST [2314] DETAIL:  Failed process was running:
select 70 as c0, pg_catalog.has_server_privilege(cast(ref_0.indexdef as
text), cast(cast(coalesce((select name from pg_catalog.pg_settings limit 1
offset 16), null) as text) as text)) as c1, pg_catalog.pg_export_snapshot()
as c2, ref_0.indexdef as c3, ref_0.indexname as c4 from
pg_catalog.pg_indexes as ref_0 where (ref_0.tablespace = ref_0.tablespace)
or (46 = 22) limit 103;
 
2016-12-23 11:19:50.627 IST [2314] LOG:  terminating any other active 
server processes
2016-12-23 11:19:50.627 IST [2319] WARNING:  terminating connection 
because of crash of another server process
2016-12-23 11:19:50.627 IST [2319] DETAIL:  The postmaster has commanded 
this server process to roll back the current transaction and exit, 
because another server process exited abnormally and possibly corrupted 
shared memory.
2016-12-23 11:19:50.627 IST [2319] HINT:  In a moment you should be able 
to reconnect to the database and repeat your command.
2016-12-23 11:19:50.629 IST [2323] FATAL:  the database system is in 
recovery mode
Failed.
!> 2016-12-23 11:19:50.629 IST [2314] LOG:  all server processes 
terminated; reinitializing
2016-12-23 11:19:50.658 IST [2324] LOG:  database system was 
interrupted; last known up at 2016-12-23 11:19:47 IST
2016-12-23 11:19:50.810 IST [2324] LOG:  database system was not 
properly shut down; automatic recovery in progress
2016-12-23 11:19:50.812 IST [2324] LOG:  invalid record length at 
0/155E408: wanted 24, got 0
2016-12-23 11:19:50.812 IST [2324] LOG:  redo is not required
2016-12-23 11:19:50.819 IST [2324] LOG:  MultiXact member wraparound 
protections are now enabled
2016-12-23 11:19:50.822 IST [2314] LOG:  database system is ready to 
accept connections
2016-12-23 11:19:50.822 IST [2328] LOG:  autovacuum launcher started

-- 
regards,tushar




Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Fri, Dec 23, 2016 at 1:35 AM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> In addition to that, we  run the sqlsmith against PG v10+PIS (parallel index
> scan) patches and found a crash  but that is coming on plain  PG v10
> (without applying any patches) as well

So why are you reporting it here rather than on a separate thread?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
tushar
Date:
On 12/23/2016 05:38 PM, Robert Haas wrote:
> So why are you reporting it here rather than on a separate thread?
We found it while testing parallel index scan, and later it turned out 
to be a crash in general.
Sure, makes sense, will do that.

-- 
regards,tushar




Re: [HACKERS] Parallel Index Scans

From
Rahila Syed
Date:
>> 5. Comment for _bt_parallel_seize() says:
>> "False indicates that we have reached the end of scan for
>>  current scankeys and for that we return block number as P_NONE."
>>
>>  What is the reason to check (blkno == P_NONE) after checking (status == false)
>>  in _bt_first() (see code below)? If comment is correct
>>  we'll never reach _bt_parallel_done()
>>
>> +               blkno = _bt_parallel_seize(scan, &status);
>> +               if (status == false)
>> +               {
>> +                       BTScanPosInvalidate(so->currPos);
>> +                       return false;
>> +               }
>> +               else if (blkno == P_NONE)
>> +               {
>> +                       _bt_parallel_done(scan);
>> +                       BTScanPosInvalidate(so->currPos);
>> +                       return false;
>> +               }
>>

>The first time master backend or worker hits last page (calls this
>API), it will return P_NONE and after that any worker tries to fetch
>next page, it will return status as false.  I think we can expand a
>comment to explain it clearly.  Let me know, if you need more
>clarification, I can explain it in detail.

Probably this was confusing because we have not mentioned
that P_NONE can be returned even when status = TRUE and
not just when status is false.

I think the comment above the function can be modified as follows:

+ /*
+ * True indicates that the block number returned is either valid including P_NONE
+ * and scan is continued or block number is invalid and scan has just
+ * begun.

Thank you,
Rahila Syed



Re: [HACKERS] Parallel Index Scans

From
Anastasia Lubennikova
Date:
22.12.2016 07:19, Amit Kapila:
> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
> <lubennikovaav@gmail.com> wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:       tested, passed
>> Spec compliant:           tested, passed
>> Documentation:            tested, passed
>>
>> Hi, thank you for the patch.
>> Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?
>>
> I think you can focus on the handling of array scan keys for testing.
> In general, one of my colleagues has shown interest in testing this
> patch and I think he has tested as well but never posted his findings.
> I will request him to share his findings and what kind of tests he has
> done, if any.


Please check the code related to buffer locking and pinning once again.
I got the following warning. Here are the steps to reproduce it
(the config is default except "autovacuum = off"):

pgbench -i -s 100 test
pgbench -c 10 -T 120 test
    SELECT count(aid) FROM pgbench_accounts
    WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
WARNING:  buffer refcount leak: [8297] (rel=base/12289/16459,
blockNum=2469, flags=0x93800000, refcount=1 1)
 count
-------
     0
(1 row)

postgres=# select 16459::regclass;
       regclass
-----------------------
 pgbench_accounts_pkey


>> 2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
>> or something like this?
> Sure, I am open to other names, but IMHO, lets keep "estimate" in the
> name to keep it consistent with other parallel stuff. Refer
> execParallel.c to see how widely this word is used.
>
>> As far as I understand there is nothing to estimate, we know this size
>> for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
>> But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
>> That leads to the next question.
>>
> Do you mean 'amcostestimate'?  If you want we can rename it
> amparallelscanestimate to be consistent with amcostestimate.

I think that 'amparallelscanestimate' seems less ambiguous than 
amestimateparallelscan.
But it's up to you. There are enough comments to understand the purpose 
of this field.
>
>> And why do you call it pageStatus? What does it have to do with page?
>>
> During scan this tells us whether next page is available for scan.
> Another option could be to name it as scanStatus, but not sure if that
> is better.  Do you think if we add a comment like "indicates whether
> next page is available for scan" for this variable then it will be
> clear?

Yes, I think it describes the flag better.
>> 5. Comment for _bt_parallel_seize() says:
>> "False indicates that we have reached the end of scan for
>>   current scankeys and for that we return block number as P_NONE."
>>
>>   What is the reason to check (blkno == P_NONE) after checking (status == false)
>>   in _bt_first() (see code below)? If comment is correct
>>   we'll never reach _bt_parallel_done()
>>
>> +               blkno = _bt_parallel_seize(scan, &status);
>> +               if (status == false)
>> +               {
>> +                       BTScanPosInvalidate(so->currPos);
>> +                       return false;
>> +               }
>> +               else if (blkno == P_NONE)
>> +               {
>> +                       _bt_parallel_done(scan);
>> +                       BTScanPosInvalidate(so->currPos);
>> +                       return false;
>> +               }
>>
> The first time master backend or worker hits last page (calls this
> API), it will return P_NONE and after that any worker tries to fetch
> next page, it will return status as false.  I think we can expand a
> comment to explain it clearly.  Let me know, if you need more
> clarification, I can explain it in detail.
>
Got it,
I think you can add this explanation to the comment for 
_bt_parallel_seize().

>> 6. To avoid code duplication, I would wrap this into the function
>>
>> +       /* initialize moreLeft/moreRight appropriately for scan direction */
>> +       if (ScanDirectionIsForward(dir))
>> +       {
>> +               so->currPos.moreLeft = false;
>> +               so->currPos.moreRight = true;
>> +       }
>> +       else
>> +       {
>> +               so->currPos.moreLeft = true;
>> +               so->currPos.moreRight = false;
>> +       }
>> +       so->numKilled = 0;                      /* just paranoia */
>> +       so->markItemIndex = -1;         /* ditto */
>>
> Okay, I think we can write a separate function (probably inline
> function) for above.
>
>> And after that we can also get rid of _bt_parallel_readpage() which only
>> bring another level of indirection to the code.
>>
> See, this function is responsible for multiple actions like
> initializing moreLeft/moreRight positions, reading next page, dropping
> the lock/pin.  So replicating all these actions in the caller will
> make the code in caller less readable as compared to now.  Consider
> this point and let me know your view on same.

Thank you for the clarification; now I agree with your implementation.
I just missed that we also handle the lock in this function.


Performance results with 2 parallel workers are about 1.5-3 times 
better, just like in your tests.
So, no doubt, this feature will be useful.
But I'm trying to find the worst cases for this feature. And I suppose 
we should test parallel index scans with concurrent insertions. The 
more parallel readers we have, the higher the concurrency.
I doubt that it can significantly decrease performance, because the 
number of parallel readers is not that big, but it is worth testing.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Dec 23, 2016 at 6:42 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 22.12.2016 07:19, Amit Kapila:
>>
>> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
>> <lubennikovaav@gmail.com> wrote:
>>>
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world:  tested, passed
>>> Implements feature:       tested, passed
>>> Spec compliant:           tested, passed
>>> Documentation:            tested, passed
>>>
>>> Hi, thank you for the patch.
>>> Results are very promising. Do you see any drawbacks of this feature or
>>> something that requires more testing?
>>>
>> I think you can focus on the handling of array scan keys for testing.
>> In general, one of my colleagues has shown interest in testing this
>> patch and I think he has tested as well but never posted his findings.
>> I will request him to share his findings and what kind of tests he has
>> done, if any.
>
>
>
> Check please code related to buffer locking and pinning once again.
> I got the warning. Here are the steps to reproduce it:
> Except "autovacuum = off" config is default.
>
> pgbench -i -s 100 test
> pgbench -c 10 -T 120 test
>
>     SELECT count(aid) FROM pgbench_accounts
>     WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
> WARNING:  buffer refcount leak: [8297] (rel=base/12289/16459, blockNum=2469,
> flags=0x93800000, refcount=1 1)
>  count
>

A similar problem occurred while testing the "parallel index only
scan" patch, and Rafia has included the fix in her patch [1], which
ideally should be included in this patch, so I have copied the fix
from her patch.  Apart from that, I observed that a similar problem can
happen for backward scans, so I have fixed that as well.

>
>>> 2. Don't you mind to rename 'amestimateparallelscan' to let's say
>>> 'amparallelscan_spacerequired'
>>> or something like this?
>>
>> Sure, I am open to other names, but IMHO, lets keep "estimate" in the
>> name to keep it consistent with other parallel stuff. Refer
>> execParallel.c to see how widely this word is used.
>>
>>> As far as I understand there is nothing to estimate, we know this size
>>> for sure. I guess that you've chosen this name because of
>>> 'heap_parallelscan_estimate'.
>>> But now it looks similar to 'amestimate' which refers to indexscan cost
>>> for optimizer.
>>> That leads to the next question.
>>>
>> Do you mean 'amcostestimate'?  If you want we can rename it
>> amparallelscanestimate to be consistent with amcostestimate.
>
>
> I think that 'amparallelscanestimate' seems less ambiguous than
> amestimateparallelscan.
> But it's up to you. There are enough comments to understand the purpose of
> this field.
>

Okay, then let's leave it as it is, because we also have
aminitparallelscan, which would need to be renamed to
amparallelscaninit if we renamed amestimateparallelscan.

>>
>>> And why do you call it pageStatus? What does it have to do with page?
>>>
>> During the scan this tells us whether the next page is available for
>> scan.  Another option could be to name it scanStatus, but I am not sure
>> that is better.  Do you think that if we add a comment like "indicates
>> whether next page is available for scan" for this variable then it will
>> be clear?
>
>
> Yes, I think it describes the flag better.

Changed as per above suggestion.

>>>
>>> 5. Comment for _bt_parallel_seize() says:
>>> "False indicates that we have reached the end of scan for
>>>   current scankeys and for that we return block number as P_NONE."
>>>
>>>   What is the reason to check (blkno == P_NONE) after checking (status ==
>>> false)
>>>   in _bt_first() (see code below)? If comment is correct
>>>   we'll never reach _bt_parallel_done()
>>>
>>> +               blkno = _bt_parallel_seize(scan, &status);
>>> +               if (status == false)
>>> +               {
>>> +                       BTScanPosInvalidate(so->currPos);
>>> +                       return false;
>>> +               }
>>> +               else if (blkno == P_NONE)
>>> +               {
>>> +                       _bt_parallel_done(scan);
>>> +                       BTScanPosInvalidate(so->currPos);
>>> +                       return false;
>>> +               }
>>>
>> The first time the master backend or a worker hits the last page (i.e.,
>> calls this API), it will return P_NONE, and after that, when any worker
>> tries to fetch the next page, it will return status as false.  I think
>> we can expand the comment to explain this clearly.  Let me know if you
>> need more clarification and I can explain it in detail.
>>
> Got it,
> I think you can add this explanation to the comment for
> _bt_parallel_seize().
>

Expanded the comment as discussed.
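To make the handoff concrete, here is a hypothetical, simplified model of the contract being documented (the names, states, and logic below are illustrative stand-ins, not the patch's actual code): the first caller that observes the end of the scan gets the sentinel page P_NONE with status = true, and every later caller gets status = false.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sentinel, mimicking nbtree's P_NONE. */
#define P_NONE 0xFFFFFFFFu

typedef enum
{
	SCAN_IDLE,					/* a page number is available to hand out */
	SCAN_DONE					/* some worker has already hit the last page */
} ScanState;

typedef struct
{
	ScanState	state;
	unsigned	next_page;		/* next page to scan, or P_NONE at the end */
} SharedScan;

/*
 * Simplified stand-in for _bt_parallel_seize(): hand out the next page.
 * The first caller to see next_page == P_NONE gets it with *status = true
 * and marks the scan done; all subsequent callers get *status = false.
 */
static unsigned
seize(SharedScan *shared, bool *status)
{
	if (shared->state == SCAN_DONE)
	{
		*status = false;		/* scan already finished by another worker */
		return P_NONE;
	}
	*status = true;
	if (shared->next_page == P_NONE)
		shared->state = SCAN_DONE;	/* we are the worker that hit the end */
	return shared->next_page;
}
```

In this model, the worker that receives (P_NONE, true) is the one responsible for ending the scan (the _bt_parallel_done() call in the _bt_first() hunk quoted above); everyone else sees status = false and simply gives up.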

>>> 6. To avoid code duplication, I would wrap this into the function
>>>
>>> +       /* initialize moreLeft/moreRight appropriately for scan direction
>>> */
>>> +       if (ScanDirectionIsForward(dir))
>>> +       {
>>> +               so->currPos.moreLeft = false;
>>> +               so->currPos.moreRight = true;
>>> +       }
>>> +       else
>>> +       {
>>> +               so->currPos.moreLeft = true;
>>> +               so->currPos.moreRight = false;
>>> +       }
>>> +       so->numKilled = 0;                      /* just paranoia */
>>> +       so->markItemIndex = -1;         /* ditto */
>>>
>> Okay, I think we can write a separate function (probably inline
>> function) for above.
>>

Added the above code in a separate inline function.
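For illustration, a minimal sketch of what such an inline helper could look like; the struct and function names here are hypothetical stand-ins for the patch's BTScanOpaque fields and ScanDirectionIsForward() test.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the scan-position fields in BTScanOpaque. */
typedef struct
{
	bool		moreLeft;
	bool		moreRight;
	int			numKilled;
	int			markItemIndex;
} ScanPos;

/* Initialize moreLeft/moreRight appropriately for the scan direction,
 * mirroring the duplicated block quoted above. */
static inline void
scanpos_init(ScanPos *pos, bool forward)
{
	pos->moreLeft = !forward;
	pos->moreRight = forward;
	pos->numKilled = 0;			/* just paranoia */
	pos->markItemIndex = -1;	/* ditto */
}
```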

>>> And after that we can also get rid of _bt_parallel_readpage() which only
>>> bring another level of indirection to the code.
>>>
>> See, this function is responsible for multiple actions like
>> initializing moreLeft/moreRight positions, reading next page, dropping
>> the lock/pin.  So replicating all these actions in the caller will
>> make the code in caller less readable as compared to now.  Consider
>> this point and let me know your view on same.
>
>
> Thank you for clarification, now I agree with your implementation.
> I've just missed that we also handle lock in this function.
>

Okay.

> 7. Just a couple of typos I've noticed:
>
>  * Below flags are used indicate the state of parallel scan.
>  * Below flags are used TO indicate the state of parallel scan.
>
> * On success, release lock and pin on buffer on success.
> * On success release lock and pin on buffer.
>

Fixed.

> 8. I didn't find a description of the feature in documentation.
> Probably we need to add a paragraph to the "Parallel Query" chapter.
>

Added the description in parallel_index_opt_exec_support_v3.patch.

>
> Performance results with 2 parallel workers are about 1.5-3 times better,
> just like in your tests.
> So, no doubt, this feature will be useful.

Thanks for the tests.

> But I'm trying to find the worst cases for this feature. And I suppose we
> should test parallel index scans with concurrent insertions. The more
> parallel readers we have, the higher the concurrency.
> I doubt that it can significantly decrease performance, because the number
> of parallel readers is not that big,
>

I am not sure such a test is meaningful for this patch, because
parallelism is generally used for large data reads, and in such cases
there are usually not many concurrent writes.


Thanks for your valuable inputs.


[1] - https://www.postgresql.org/message-id/CAOGQiiNx4Ra9A-RyxjrgECownmVJ64EVpVgfN8ACR-MLupGnng%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Dec 23, 2016 at 5:48 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>> 5. Comment for _bt_parallel_seize() says:
>>> "False indicates that we have reached the end of scan for
>>>  current scankeys and for that we return block number as P_NONE."
>>>
>>>  What is the reason to check (blkno == P_NONE) after checking (status ==
>>> false)
>>>  in _bt_first() (see code below)? If comment is correct
>>>  we'll never reach _bt_parallel_done()
>>>
>>> +               blkno = _bt_parallel_seize(scan, &status);
>>> +               if (status == false)
>>> +               {
>>> +                       BTScanPosInvalidate(so->currPos);
>>> +                       return false;
>>> +               }
>>> +               else if (blkno == P_NONE)
>>> +               {
>>> +                       _bt_parallel_done(scan);
>>> +                       BTScanPosInvalidate(so->currPos);
>>> +                       return false;
>>> +               }
>>>
>>The first time master backend or worker hits last page (calls this
>>API), it will return P_NONE and after that any worker tries to fetch
>>next page, it will return status as false.  I think we can expand a
>>comment to explain it clearly.  Let me know, if you need more
>>clarification, I can explain it in detail.
>
> Probably this was confusing because we have not mentioned
> that P_NONE can be returned even when status = TRUE and
> not just when status is false.
>
> I think, the comment above the function can be modified as follows,
>
> + /*
> + * True indicates that the block number returned is either valid including
> P_NONE
> + * and scan is continued or block number is invalid and scan has just
> + * begun.
>

I think the modification (including P_NONE and scan is continued)
suggested by you can confuse the reader, because if the returned block
number is P_NONE, then we don't continue the scan.  I have used
slightly different wording in the patch I just posted; please check
and see if that looks fine to you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Anastasia Lubennikova
Date:
27.12.2016 17:33, Amit Kapila:
> On Fri, Dec 23, 2016 at 6:42 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 22.12.2016 07:19, Amit Kapila:
>>> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
>>> <lubennikovaav@gmail.com> wrote:
>>>> The following review has been posted through the commitfest application:
>>>> make installcheck-world:  tested, passed
>>>> Implements feature:       tested, passed
>>>> Spec compliant:           tested, passed
>>>> Documentation:            tested, passed
>>>>
>>>> Hi, thank you for the patch.
>>>> Results are very promising. Do you see any drawbacks of this feature or
>>>> something that requires more testing?
>>>>
>>> I think you can focus on the handling of array scan keys for testing.
>>> In general, one of my colleagues has shown interest in testing this
>>> patch and I think he has tested as well but never posted his findings.
>>> I will request him to share his findings and what kind of tests he has
>>> done, if any.
>>>
>> Please check the code related to buffer locking and pinning once again.
>> I got the warning. Here are the steps to reproduce it:
>> The config is default, except for "autovacuum = off".
>>
>> pgbench -i -s 100 test
>> pgbench -c 10 -T 120 test
>>
>>     SELECT count(aid) FROM pgbench_accounts
>>     WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
>> WARNING:  buffer refcount leak: [8297] (rel=base/12289/16459, blockNum=2469,
>> flags=0x93800000, refcount=1 1)
>>  count
>>
> A similar problem occurred while testing the "parallel index only
> scan" patch, and Rafia included the fix in her patch [1]; that fix
> ideally belongs in this patch as well, so I have copied it from hers.
> Apart from that, I observed that a similar problem can happen for
> backward scans, so I have fixed that too.

I confirm that this problem is solved.

>> But I'm trying to find the worst cases for this feature. And I suppose we
>> should test parallel index scans with concurrent insertions. The more
>> parallel readers we have, the higher the concurrency.
>> I doubt that it can significantly decrease performance, because the number
>> of parallel readers is not that big,
>>
> I am not sure such a test is meaningful for this patch, because
> parallelism is generally used for large data reads, and in such cases
> there are usually not many concurrent writes.

I didn't find any case of noticeable performance degradation,
so set status to "Ready for committer".
Thank you for this patch.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Fri, Jan 13, 2017 at 9:28 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> I didn't find any case of noticeable performance degradation,
> so set status to "Ready for committer".

The very first hunk of doc changes looks like it makes the whitespace
totally wrong - surely it can't be right to have 0-space indents in C
code.

+    The <literal>index_size</> parameter indicate the size of generic parallel

indicate -> indicates
size of generic -> size of the generic

+   index-type-specific parallel information which will be stored immediatedly

Typo.

+   Initialize the parallel index scan state.  It will be used to initialize
+   index-type-specific parallel information which will be stored immediatedly
+   after generic parallel information required for parallel index scans.  The
+   required state information will be set in <literal>target</>.
+  </para>
+
+   <para>
+     The <function>aminitparallelscan</> and
<function>amestimateparallelscan</>
+     functions need only be provided if the access method supports
<quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its
<structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>

Inconsistent indentation.  <quote> seems like a strange choice of tag.

+    /* amestimateparallelscan is optional; assume no-op if not
provided by AM */

The fact that amestimateparallelscan is optional even when parallel
index scans are supported is undocumented.  Similarly for the other
functions, which also seem to be optional but not documented as such.
The code and documentation need to match.

+    void       *amtarget = (char *) ((void *) target) + offset;

Passing an unaligned pointer to the AM sounds like a recipe for
crashes on obscure platforms that can't tolerate alignment violations,
and possibly bad performance on others.  I'd arrange to MAXALIGN the size
of the generic structure in index_parallelscan_estimate and
index_parallelscan_initialize.  Also, why pass the size of the generic
structure to the AM-specific estimate routine, anyway?  It can't
legally return a smaller value, and we can add_size() just as well as
the AM-specific code.  Wouldn't it make more sense for the AM-specific
code to return the amount of space that is needed for AM-specific
stuff, and let the generic code deal with the generic stuff?
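To make the alignment arithmetic concrete, a simplified sketch of the suggested layout (the MAXALIGN definition here is a stand-in assuming an 8-byte maximum alignment; the real macro in c.h is platform-dependent, and `am_state_offset` is a hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's MAXALIGN, assuming MAXIMUM_ALIGNOF == 8.
 * Rounds a length up to the next multiple of the maximum alignment. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Offset at which the AM-specific parallel state begins: the size of the
 * generic parallel-scan structure rounded up, so the pointer handed to
 * the AM is always suitably aligned no matter the generic struct's size.
 */
static size_t
am_state_offset(size_t generic_size)
{
	return MAXALIGN(generic_size);
}
```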

+ *    status - True indicates that the block number returned is valid and scan
+ *             is continued or block number is invalid and scan has just begun
+ *             or block number is P_NONE and scan is finished.  False indicates
+ *             that we have reached the end of scan for current
scankeys and for
+ *             that we return block number as P_NONE.

It is hard to parse a sentence with that many "and" and "or" clauses,
especially since English, unlike C, does not have strict operator
precedence rules. Perhaps you could reword to make it more clear.
Also, does that survive pgindent with that indentation?

+    BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+                                                (void *) scan->parallel_scan,
+                                             scan->parallel_scan->ps_offset);

You could avoid these uncomfortable line breaks by declaring the
variable on one line and the initializing it on a separate line.

+        SpinLockAcquire(&btscan->ps_mutex);
+        pageStatus = btscan->ps_pageStatus;
+        if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+            *status = false;
+        else if (pageStatus == BTPARALLEL_DONE)
+            *status = false;
+        else if (pageStatus != BTPARALLEL_ADVANCING)
+        {
+            btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+            nextPage = btscan->ps_nextPage;
+            exit_loop = true;
+        }
+        SpinLockRelease(&btscan->ps_mutex);

IMHO, this needs comments.

+ * It updates the value of next page that allows parallel scan to move forward
+ * or backward depending on scan direction.  It wakes up one of the sleeping
+ * workers.

This construction is commonly used in India but sounds awkward to
other English-speakers, or at least to me.  You can either drop the
word "it" and just start with the verb "Updates the value of ..." or
you can replace the first instance of "It" with "This function".
Although actually, I think this whole comment needs rewriting.  Maybe
something like "Save information about scan position and wake up next
worker to continue scan."

+ * This must be called when there are no pages left to scan. Notify end of
+ * parallel scan to all the other workers associated with this scan.

Suggest: When there are no pages left to scan, this function should be
called to notify other workers.  Otherwise, they might wait forever
for the scan to advance to the next page.

+        if (status == false)

if (!status) is usually preferred for bools.  (Multiple instances.)

+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04

Let's change this to an enum.  We can keep the names of the members
as-is, just use typedef enum { ... } instead of #defines.

+#define OffsetToPointer(base, offset)\
+((void *)((char *)base + offset))

Blech.  Aside from the bad formatting, this is an awfully generic
thing to stick into relscan.h.  I'm not sure we should have it at all,
but certainly not in this file.

+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+    BlockNumber ps_nextPage;    /* latest or next page to be scanned */
+    uint8        ps_pageStatus;    /* indicates whether next page is available
+                                 * for scan. see nbtree.h for possible states
+                                 * of parallel scan. */
+    int            ps_arrayKeyCount;        /* count indicating number of array
+                                         * scan keys processed by parallel
+                                         * scan */
+    slock_t        ps_mutex;        /* protects above variables */
+    ConditionVariable cv;        /* used to synchronize parallel scan */
+}    BTParallelScanDescData;

Why are the states declared in a separate header file from the variable
that uses them?  Let's put them all in the same place.

Why do all of these fields except for the last one have a ps_ prefix,
but the last one doesn't?

I assume "ps" stands for "parallel scan" but maybe "btps" would be
better since this is btree-specific.

ps_nextPage sometimes contains something other than the next page, so
maybe we should choose a different name, like ps_currentPage or
ps_scanPage.

This is not a totally complete review - there are some things I have
deeper questions about and need to examine more closely - but let's
get the simple stuff tidied up first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 13, 2017 at 7:58 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 27.12.2016 17:33, Amit Kapila:
>
>
>> A similar problem occurred while testing the "parallel index only
>> scan" patch, and Rafia included the fix in her patch [1]; that fix
>> ideally belongs in this patch as well, so I have copied it from hers.
>> Apart from that, I observed that a similar problem can happen for
>> backward scans, so I have fixed that too.
>>
> I confirm that this problem is solved.
>
>>> But I'm trying to find the worst cases for this feature. And I suppose we
>>> should test parallel index scans with concurrent insertions. The more
>>> parallel readers we have, the higher the concurrency.
>>> I doubt that it can significantly decrease performance, because the number
>>> of parallel readers is not that big,
>>>
>> I am not sure such a test is meaningful for this patch, because
>> parallelism is generally used for large data reads, and in such cases
>> there are usually not many concurrent writes.
>>
> I didn't find any case of noticeable performance degradation,
> so set status to "Ready for committer".
>

Thank you for the review!


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 13, 2017 at 11:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 9:28 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> I didn't find any case of noticeable performance degradation,
>> so set status to "Ready for committer".
>
> The very first hunk of doc changes looks like it makes the whitespace
> totally wrong - surely it can't be right to have 0-space indents in C
> code.
>

Fixed.

> +    The <literal>index_size</> parameter indicate the size of generic parallel
>
> indicate -> indicates
> size of generic -> size of the generic
>

Fixed.

> +   index-type-specific parallel information which will be stored immediatedly
>
> Typo.
>

Fixed.

> +   Initialize the parallel index scan state.  It will be used to initialize
> +   index-type-specific parallel information which will be stored immediatedly
> +   after generic parallel information required for parallel index scans.  The
> +   required state information will be set in <literal>target</>.
> +  </para>
> +
> +   <para>
> +     The <function>aminitparallelscan</> and
> <function>amestimateparallelscan</>
> +     functions need only be provided if the access method supports
> <quote>parallel</>
> +     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
> +     <structfield>amestimateparallelscan</> fields in its
> <structname>IndexAmRoutine</>
> +     struct must be set to NULL.
> +   </para>
>
> Inconsistent indentation.

Fixed.

>  <quote> seems like a strange choice of tag.
>

I have seen that <quote> is used in indexam.sgml in multiple places to
refer to "bitmap" and "plain" index scans, so I thought of using the
same for "parallel" index scans.

> +    /* amestimateparallelscan is optional; assume no-op if not
> provided by AM */
>
> The fact that amestimateparallelscan is optional even when parallel
> index scans are supported is undocumented.
>

Okay, I have added that information in docs.

>  Similarly for the other
> functions, which also seem to be optional but not documented as such.
> The code and documentation need to match.
>

All the functions introduced by this patch are documented in
indexam.sgml as optional.  I am not sure which other place you are
expecting an update.

> +    void       *amtarget = (char *) ((void *) target) + offset;
>
> Passing an unaligned pointer to the AM sounds like a recipe for
> crashes on obscure platforms that can't tolerate alignment violations,
> and possibly bad performance on others.  I'd arrange to MAXALIGN the size
> of the generic structure in index_parallelscan_estimate and
> index_parallelscan_initialize.

Right, changed as per suggestion.

>  Also, why pass the size of the generic
> structure to the AM-specific estimate routine, anyway?  It can't
> legally return a smaller value, and we can add_size() just as well as
> the AM-specific code.  Wouldn't it make more sense for the AM-specific
> code to return the amount of space that is needed for AM-specific
> stuff, and let the generic code deal with the generic stuff?
>

Makes sense, so changed accordingly.

> + *    status - True indicates that the block number returned is valid and scan
> + *             is continued or block number is invalid and scan has just begun
> + *             or block number is P_NONE and scan is finished.  False indicates
> + *             that we have reached the end of scan for current
> scankeys and for
> + *             that we return block number as P_NONE.
>
> It is hard to parse a sentence with that many "and" and "or" clauses,
> especially since English, unlike C, does not have strict operator
> precedence rules. Perhaps you could reword to make it more clear.
>

Okay, I have changed the comment.

> Also, does that survive pgindent with that indentation?
>

Yes.

> +    BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
> +                                                (void *) scan->parallel_scan,
> +                                             scan->parallel_scan->ps_offset);
>
> You could avoid these uncomfortable line breaks by declaring the
> variable on one line and the initializing it on a separate line.
>

Okay, changed.

> +        SpinLockAcquire(&btscan->ps_mutex);
> +        pageStatus = btscan->ps_pageStatus;
> +        if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
> +            *status = false;
> +        else if (pageStatus == BTPARALLEL_DONE)
> +            *status = false;
> +        else if (pageStatus != BTPARALLEL_ADVANCING)
> +        {
> +            btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
> +            nextPage = btscan->ps_nextPage;
> +            exit_loop = true;
> +        }
> +        SpinLockRelease(&btscan->ps_mutex);
>
> IMHO, this needs comments.
>

Sure, added a comment.

> + * It updates the value of next page that allows parallel scan to move forward
> + * or backward depending on scan direction.  It wakes up one of the sleeping
> + * workers.
>
> This construction is commonly used in India but sounds awkward to
> other English-speakers, or at least to me.  You can either drop the
> word "it" and just start with the verb "Updates the value of ..." or
> you can replace the first instance of "It" with "This function".
> Although actually, I think this whole comment needs rewriting.  Maybe
> something like "Save information about scan position and wake up next
> worker to continue scan."
>

Changed as per suggestion.

> + * This must be called when there are no pages left to scan. Notify end of
> + * parallel scan to all the other workers associated with this scan.
>
> Suggest: When there are no pages left to scan, this function should be
> called to notify other workers.  Otherwise, they might wait forever
> for the scan to advance to the next page.
>
> +        if (status == false)
>
> if (!status) is usually preferred for bools.  (Multiple instances.)
>
> +#define BTPARALLEL_NOT_INITIALIZED 0x01
> +#define BTPARALLEL_ADVANCING 0x02
> +#define BTPARALLEL_DONE 0x03
> +#define BTPARALLEL_IDLE 0x04
>
> Let's change this to an enum.  We can keep the names of the members
> as-is, just use typedef enum { ... } instead of #defines.
>

Changed as per suggestion.

> +#define OffsetToPointer(base, offset)\
> +((void *)((char *)base + offset))
>
> Blech.  Aside from the bad formatting, this is an awfully generic
> thing to stick into relscan.h.

Agreed and moved to c.h where some similar defines are present.

>  I'm not sure we should have it at all,
> but certainly not in this file.
>

Yeah, but I think there is no harm in keeping it, and maybe we can
start using it in code at other places as well.

> +/*
> + * BTParallelScanDescData contains btree specific shared information required
> + * for parallel scan.
> + */
> +typedef struct BTParallelScanDescData
> +{
> +    BlockNumber ps_nextPage;    /* latest or next page to be scanned */
> +    uint8        ps_pageStatus;    /* indicates whether next page is available
> +                                 * for scan. see nbtree.h for possible states
> +                                 * of parallel scan. */
> +    int            ps_arrayKeyCount;        /* count indicating number of array
> +                                         * scan keys processed by parallel
> +                                         * scan */
> +    slock_t        ps_mutex;        /* protects above variables */
> +    ConditionVariable cv;        /* used to synchronize parallel scan */
> +}    BTParallelScanDescData;
>
> Why are the states declared in a separate header file from the variable
> that uses them?   Let's put them all in the same place.
>

Agreed and changed accordingly.

> Why do all of these fields except for the last one have a ps_ prefix,
> but the last one doesn't?
>

No specific reason, so changed as per suggestion.

> I assume "ps" stands for "parallel scan" but maybe "btps" would be
> better since this is btree-specific.
>

Changed as per suggestion.

> ps_nextPage sometimes contains something other than the next page, so
> maybe we should choose a different name, like ps_currentPage or
> ps_scanPage.
>

Changed as per suggestion.


I have also rebased the optimizer/executor support patch
(parallel_index_opt_exec_support_v4.patch) and added a test case in
it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Fixed.

Thanks for the update.  Some more comments:

It shouldn't be necessary for MultiExecBitmapIndexScan to modify the
IndexScanDesc.  That seems really broken.  If a parallel scan isn't
supported here (and I'm sure that's the case right now) then no such
IndexScanDesc should be getting created.

WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific.  There's no
guarantee that any other AMs the implement parallel index scans will
use that wait event, and they might have others instead.  I would make
it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER.  (Waiting
for the page number needed to continue a parallel btree scan to become
available.)

Why do we need separate AM support for index_parallelrescan()?  Why
can't index_rescan() cover that case?  If the AM-specific method is
getting the IndexScanDesc, it can see for itself whether it is a
parallel scan.

I'd rename PS_State to BTPS_State, to match the other renamings.

If we're going to update all of the AMs to set the new support
functions to NULL, we should also update contrib/bloom.

index_parallelscan_estimate has multiple lines that go over 80
characters for no really good reason.  Separate the initialization of
index_scan from the variable declaration.  Do the same for
amindex_size.  Also, you don't really need to encase the end of the
function in an "else" block when the "if" block is guaranteed to
return.

Several function header comments still use the style where the first
word of the description is "It".  Say "this function" or similar the
first time, instead of "it". Then when you say "it" later, it's clear
that it refers back to where you said "this function".

index_parallelscan_initialize also has a line more than 80 characters
that looks easy to fix by splitting the declaration from the
initialization.

I think it's a bad idea to add a ParallelIndexScanDesc argument to
index_beginscan().  That function is used in lots of places, and
somebody might think that they are allowed to actually pass a non-NULL
value there, which they aren't: they must go through
index_beginscan_parallel.  I think that the additional argument should
only be added to index_beginscan_internal, and
index_beginscan_parallel should remain unchanged.  Either that, or get
rid of index_beginscan_parallel altogether and have everyone use
index_beginscan directly, and put the snapshot-restore logic in that
function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Haribabu Kommi
Date:


On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Changed as per suggestion.
>
> I have also rebased the optimizer/executor support patch
> (parallel_index_opt_exec_support_v4.patch) and added a test case in
> it.

Thanks for the patch. Here are comments found during review.

parallel_index_scan_v4.patch:


+ amtarget = (char *) ((void *) target) + offset;

Can the above calculation be moved after the NULL check?

+ * index_beginscan_parallel - join parallel index scan

The name and the description don't sync properly; any better description?

+ BTPARALLEL_DONE,
+ BTPARALLEL_IDLE
+} PS_State;

The order of the above two enum values can be changed according to their use.

+ /* Check if the scan for current scan keys is finished */
+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ *status = false;

I didn't clearly understand in which scenario arrayKeyCount can be less
than btps_arrayKeyCount.


+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)

The return value of the above function is validated only in the
_bt_first function, but in other cases it is not.  From the function
description, it is possible to return P_NONE for workers too with
status as true.  I feel it is better to handle the P_NONE case
internally, so that callers just check the status.  Am I missing
anything?


+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);

Any better names for the above functions, as these functions will
provide/set the next page that needs to be read?


parallel_index_opt_exec_support_v4.patch:

+#include "access/parallel.h"

Why is it required to be included in nbtree.c? I didn't find
any code changes in the patch.


+ /* reset (parallel) index scan */
+ if (node->iss_ScanDesc)
+ {

Why is this if check required? There is an assert check in later function calls.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel Index Scans

From
Rahila Syed
Date:
>+ /* Check if the scan for current scan keys is finished */
>+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
>+ *status = false;

>I didn't clearly understand, in which scenario the arrayKeyCount is less
>than btps_arrayKeyCount?
Consider following array scan keys

select * from test2 where j=ANY(ARRAY[1000,2000,3000]);

By the time the current worker has finished reading the heap tuples corresponding
to array key 1000 (arrayKeyCount = 0), some other worker might have advanced the scan to
array key 2000 (btps_arrayKeyCount = 1). In this case, when the current worker fetches the next
page to scan, it must advance its scan keys before scanning the next page of the parallel scan.
I hope this helps.

>+BlockNumber
>+_bt_parallel_seize(IndexScanDesc scan, bool *status)

>The return value of the above function is validated only in _bt_first
>function, but in other cases it is not.
In other cases it is validated in _bt_readnextpage() which is called after
_bt_parallel_seize().

>From the function description,
>it is possible to return P_NONE for the workers also with status as
>true. I feel it is better to handle the P_NONE case internally only
>so that callers just check for the status. Am I missing anything?

In case the next block is P_NONE and status is true, the code
calls _bt_parallel_done() to notify the other workers, followed by
BTScanPosInvalidate(). A similar check for block = P_NONE also
happens in the existing code. See the following in _bt_readnextpage():


            if (blkno == P_NONE || !so->currPos.moreRight)
            {
                _bt_parallel_done(scan);
                BTScanPosInvalidate(so->currPos);
                return false;
            }

So, to keep it consistent with the existing code, the check
is kept outside _bt_parallel_seize().

Thank you,
Rahila Syed



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific.  There's no
> guarantee that any other AMs that implement parallel index scans will
> use that wait event, and they might have others instead.  I would make
> it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER.  (Waiting
> for the page number needed to continue a parallel btree scan to become
> available.)
>

WAIT_EVENT_BTREE_PAGENUMBER - NUMBER sounds slightly awkward. How
about just WAIT_EVENT_BTREE_PAGE?  We can keep the description as
you suggested.

> Why do we need separate AM support for index_parallelrescan()?  Why
> can't index_rescan() cover that case?

The reason is that sometimes index_rescan is called when we just have to
update the runtime scankeys of the index scan, and we don't want to reset the
parallel scan for that.  Refer to the ExecReScanIndexScan() changes in patch
parallel_index_opt_exec_support_v4.  Rescan is called from the below place
during an index scan.

ExecIndexScan(IndexScanState *node)
{
    /*
     * If we have runtime keys and they've not already been set up, do it now.
     */
    if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
        ExecReScan((PlanState *) node);
    ...

>  If the AM-specific method is
> getting the IndexScanDesc, it can see for itself whether it is a
> parallel scan.
>

I think if we want to do it that way then we need to pass some additional
information related to runtime scan keys to the index_rescan method and
then probably to the AM-specific rescan method.  That sounds scary.


>
> I think it's a bad idea to add a ParallelIndexScanDesc argument to
> index_beginscan().  That function is used in lots of places, and
> somebody might think that they are allowed to actually pass a non-NULL
> value there, which they aren't: they must go through
> index_beginscan_parallel.  I think that the additional argument should
> only be added to index_beginscan_internal, and
> index_beginscan_parallel should remain unchanged.
>

If we go that way then we need to set a few parameters like heapRelation
and xs_snapshot in index_beginscan_parallel as we are doing in
index_beginscan.  Does going that way sound better to you?

>  Either that, or get
> rid of index_beginscan_parallel altogether and have everyone use
> index_beginscan directly, and put the snapshot-restore logic in that
> function.
>

I think there is value in retaining index_beginscan_parallel as that
is parallel to heap_beginscan_parallel.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
> + * index_beginscan_parallel - join parallel index scan
>
> The name and the description doesn't sync properly, any better description?
>

This can be called by both the worker and the leader of a parallel index
scan.  What problem do you see with it?  heap_beginscan_parallel has a
similar description, so I am not sure changing it here alone makes sense.


>
> +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
> +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
> scan_page);
>
> Any better names for the above functions, as these function will provide/set
> the next page that needs to be read.
>

These functions also set the state of the scan.  IIRC, these names were
agreed on by Robert and Rahila as well (suggested offlist by
Robert).  I am open to changing them if you or others have any better
suggestions.

>
> + /* reset (parallel) index scan */
> + if (node->iss_ScanDesc)
> + {
>
> Why this if check required? There is an assert check in later function
> calls.
>

This is required because we don't initialize the scan descriptor for
parallel-aware nodes during ExecInitIndexScan.  It gets initialized
later, at execution time, when we initialize the DSM.  Now, it is
quite possible that a Gather node can occur on the inner side of a join, in
which case Rescan can be called before execution even starts.  This is
the reason why we have a similar check in ExecReScanSeqScan, which was
added for parallel sequential scans (f0661c4e).  Does that answer
your question?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific.  There's no
>> guarantee that any other AMs that implement parallel index scans will
>> use that wait event, and they might have others instead.  I would make
>> it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER.  (Waiting
>> for the page number needed to continue a parallel btree scan to become
>> available.)
>
> WAIT_EVENT_BTREE_PAGENUMBER - NUMBER sounds slightly awkward. How
> about just WAIT_EVENT_BTREE_PAGE?  We can keep the description as
> you suggested.

Sure.

>> Why do we need separate AM support for index_parallelrescan()?  Why
>> can't index_rescan() cover that case?
>
> The reason is that sometime index_rescan is called when we have to
> just update runtime scankeys in index and we don't want to reset
> parallel scan for that.  Refer ExecReScanIndexScan() changes in patch
> parallel_index_opt_exec_support_v4.  Rescan is called from below place
> during index scan.

Hmm, tricky.  OK, I'll think about that some more.

>> I think it's a bad idea to add a ParallelIndexScanDesc argument to
>> index_beginscan().  That function is used in lots of places, and
>> somebody might think that they are allowed to actually pass a non-NULL
>> value there, which they aren't: they must go through
>> index_beginscan_parallel.  I think that the additional argument should
>> only be added to index_beginscan_internal, and
>> index_beginscan_parallel should remain unchanged.
>
> If we go that way then we need to set few parameters like heapRelation
> and xs_snapshot in index_beginscan_parallel as we are doing in
> index_beginscan. Does going that way sound better to you?

It's pretty minor code duplication; I don't think it's an issue.

> I think there is value in retaining index_beginscan_parallel as that
> is parallel to heap_beginscan_parallel.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Fixed.

With respect to the second patch
(parallel_index_opt_exec_support_v4.patch), I'm hoping it can use the
new function from Dilip's bitmap heap scan patch set.  See commit
716c7d4b242f0a64ad8ac4dc48c6fed6557ba12c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Haribabu Kommi
Date:


On Wed, Jan 18, 2017 at 6:55 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>+ /* Check if the scan for current scan keys is finished */
>+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
>+ *status = false;

>I didn't clearly understand in which scenario arrayKeyCount is less
>than btps_arrayKeyCount?
Consider following array scan keys

select * from test2 where j=ANY(ARRAY[1000,2000,3000]);

By the time the current worker has finished reading the heap tuples corresponding
to array key 1000 (arrayKeyCount = 0), some other worker might have advanced the scan to
array key 2000 (btps_arrayKeyCount = 1). In this case, when the current worker fetches the next
page to scan, it must advance its scan keys before scanning the next page of the parallel scan.
I hope this helps.

Thanks for the details.
One worker increments both arrayKeyCount and btps_arrayKeyCount. As
btps_arrayKeyCount is present in shared memory, the other workers see the update
and hit the above condition.
 
>+BlockNumber
>+_bt_parallel_seize(IndexScanDesc scan, bool *status)

>The return value of the above function is validated only in _bt_first
>function, but in other cases it is not.
In other cases it is validated in _bt_readnextpage() which is called after
_bt_parallel_seize().

>From the function description,
>it is possible to return P_NONE for the workers also with status as
>true. I feel it is better to handle the P_NONE case internally only
>so that callers just check for the status. Am I missing anything?

In case the next block is P_NONE and status is true, the code
calls _bt_parallel_done() to notify the other workers, followed by
BTScanPosInvalidate(). A similar check for block = P_NONE also
happens in the existing code. See the following in _bt_readnextpage():


            if (blkno == P_NONE || !so->currPos.moreRight)
            {
                _bt_parallel_done(scan);
                BTScanPosInvalidate(so->currPos);
                return false;
            }

So, to keep it consistent with the existing code, the check
is kept outside _bt_parallel_seize().

Thanks. Got it.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Fixed.
>
> Thanks for the update.  Some more comments:
>
> It shouldn't be necessary for MultiExecBitmapIndexScan to modify the
> IndexScanDesc.  That seems really broken.  If a parallel scan isn't
> supported here (and I'm sure that's the case right now) then no such
> IndexScanDesc should be getting created.
>

Fixed.

> WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific.  There's no
> guarantee that any other AMs that implement parallel index scans will
> use that wait event, and they might have others instead.  I would make
> it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER.  (Waiting
> for the page number needed to continue a parallel btree scan to become
> available.)
>

Changed as per discussion.

> Why do we need separate AM support for index_parallelrescan()?  Why
> can't index_rescan() cover that case?  If the AM-specific method is
> getting the IndexScanDesc, it can see for itself whether it is a
> parallel scan.
>

Left as it is based on yesterday's discussion.

> I'd rename PS_State to BTPS_State, to match the other renamings.
>
> If we're going to update all of the AMs to set the new support
> functions to NULL, we should also update contrib/bloom.
>
> index_parallelscan_estimate has multiple lines that go over 80
> characters for no really good reason.  Separate the initialization of
> index_scan from the variable declaration.  Do the same for
> amindex_size.  Also, you don't really need to encase the end of the
> function in an "else" block when the "if" block is guaranteed to
> return.
>
> Several function header comments still use the style where the first
> word of the description is "It".  Say "this function" or similar the
> first time, instead of "it". Then when you say "it" later, it's clear
> that it refers back to where you said "this function".
>
> index_parallelscan_initialize also has a line more than 80 characters
> that looks easy to fix by splitting the declaration from the
> initialization.
>

Fixed all the above.

> I think it's a bad idea to add a ParallelIndexScanDesc argument to
> index_beginscan().  That function is used in lots of places, and
> somebody might think that they are allowed to actually pass a non-NULL
> value there, which they aren't: they must go through
> index_beginscan_parallel.  I think that the additional argument should
> only be added to index_beginscan_internal, and
> index_beginscan_parallel should remain unchanged.  Either that, or get
> rid of index_beginscan_parallel altogether and have everyone use
> index_beginscan directly, and put the snapshot-restore logic in that
> function.
>

Changed as per yesterday's discussion.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Changed as per suggestion.
>>
>>
>> I have also rebased the optimizer/executor support patch
>> (parallel_index_opt_exec_support_v4.patch) and added a test case in
>> it.
>
>
> Thanks for the patch. Here are comments found during review.
>
> parallel_index_scan_v4.patch:
>
>
> + amtarget = (char *) ((void *) target) + offset;
>
> The above calculation can be moved after the NULL check?
>
> + * index_beginscan_parallel - join parallel index scan
>
> The name and the description don't match; any better description?
>
> + BTPARALLEL_DONE,
> + BTPARALLEL_IDLE
> +} PS_State;
>
> The order of above two enum values can be changed according to their use.
>

Changed code as per your suggestion.


>
> parallel_index_opt_exec_support_v4.patch:
>
> +#include "access/parallel.h"
>
> Why is this include required in nbtree.c? I didn't find
> any code changes in the patch that need it.
>

Removed.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Jan 19, 2017 at 12:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Fixed.
>
> With respect to the second patch
> (parallel_index_opt_exec_support_v4.patch), I'm hoping it can use the
> new function from Dilip's bitmap heap scan patch set.  See commit
> 716c7d4b242f0a64ad8ac4dc48c6fed6557ba12c.
>

Updated patch has used the function from above commit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Thu, Jan 19, 2017 at 4:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Fixed.

I think that the whole idea of GatherSupportsBackwardScan is wrong.
Gather doesn't support a backwards scan under any circumstances,
whether the underlying node does or not.  You can read the tuples
once, in order, and you can't back up.  That's what supporting a
backward scan means: you can back up and then move forward again.
It's there to support cursor operations.

Also note this comment in ExecSupportsBackwardScan, which seems just
as relevant to parallel index scans as anything else:
    /*
     * Parallel-aware nodes return a subset of the tuples in each worker, and
     * in general we can't expect to have enough bookkeeping state to know
     * which ones we returned in this worker as opposed to some other worker.
     */
    if (node->parallel_aware)
        return false;

If all of that were no issue, the considerations in
TargetListSupportsBackwardScan could be a problem, too.  But I think
there shouldn't be any issue having Gather just continue to return
false.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Why do we need separate AM support for index_parallelrescan()?  Why
>> can't index_rescan() cover that case?
>
> The reason is that sometime index_rescan is called when we have to
> just update runtime scankeys in index and we don't want to reset
> parallel scan for that.

Why not?  I see the code, but the comments don't seem to offer any
justification for that.  And it seems wrong to me.  If the scan keys
change, surely that only makes sense if we restart the scan.  You
can't just blindly continue the same scan if the keys have changed,
can you?

I think the reason that this isn't breaking for you is that it's
difficult or impossible to get a parallel index scan someplace where
the keys would change at runtime.  Normally, the parallel scan is on
the driving relation, and so there are no runtime keys.  We currently
have no way for a parallel scan to occur on the inner side of a nested
loop unless there's an intervening Gather node - and in that case the
keys can't change without relaunching the workers.  It's hard to see
how it could work otherwise.  For example, suppose you had something
like this:

Gather
-> Nested Loop
  -> Parallel Seq Scan on a
  -> Hash Join
    -> Seq Scan on b
    -> Parallel Shared Hash
      -> Parallel Index Scan on c
          Index Cond: c.x = a.x

Well, the problem here is that there's nothing to ensure that various
workers who are cooperating to build the hash table all have the same
value for a.x, nor is there any thing to ensure that they'll all get
done with the shared hash table at the same time.  So this is just
chaos.  I think we have to disallow this kind of plan as nonsensical.
Given that, I'm not sure a reset of the runtime keys can ever really
happen.  Have you investigated this?

I extracted the generic portions of this infrastructure (i.e. not the
btree-specific stuff) and spent some time working on it today.  The
big thing I fixed was the documentation, which you added in a fairly
illogical part of the file.  You had all of the new functions for
supporting parallel scans in a section that explicitly says it's for
mandatory functions, and which precedes the section on regular
non-parallel scans.  I moved the documentation for all three new
methods to the same place, added some explanation of parallel scans in
general, and rewrote the descriptions for the individual functions to
be more clear.  Also, in indexam.c, I adjusted things to use
RELATION_CHECKS in a couple of places, did some work on comments and
coding style, and fixed a place that should have used the new
OffsetToPointer macro but instead hand-rolled the thing with the casts
backwards.  Adding an integer to a value of type "void *" does not
work on all compilers.  The patch I ended up with is attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Parallel Index Scans

From
Haribabu Kommi
Date:


On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
> + * index_beginscan_parallel - join parallel index scan
>
> The name and the description doesn't sync properly, any better description?
>

This can be called by both the worker and the leader of a parallel index
scan.  What problem do you see with it?  heap_beginscan_parallel has a
similar description, so I am not sure changing it here alone makes sense.

Ok.
 

>
> +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
> +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
> scan_page);
>
> Any better names for the above functions? These functions will provide/set
> the next page that needs to be read.
>

These functions also set the state of scan.  IIRC, these names were
being agreed between Robert and Rahila as well (suggested offlist by
Robert).  I am open to change if you or others have any better
suggestions.

I didn't find any better names other than the following,

_bt_get_next_parallel_page
_bt_set_next_parallel_page
 
>
> + /* reset (parallel) index scan */
> + if (node->iss_ScanDesc)
> + {
>
> Why this if check required? There is an assert check in later function
> calls.
>

This is required because we don't initialize the scan descriptor for
parallel-aware nodes during ExecInitIndexScan.  It gets initialized
later, at execution time, when we initialize the DSM.  Now, it is
quite possible that a Gather node can occur on the inner side of a join, in
which case Rescan can be called before execution even starts.  This is
the reason why we have a similar check in ExecReScanSeqScan, which was
added for parallel sequential scans (f0661c4e).  Does that answer
your question?

Thanks for the details. Got it.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 20, 2017 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 4:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Fixed.
>
>
> If all of that were no issue, the considerations in
> TargetListSupportsBackwardScan could be a problem, too.  But I think
> there shouldn't be any issue having Gather just continue to return
> false.
>

You are right.  I had added that code under the assumption that if
the underlying node (in this case the index scan) can support a backward
scan, Gather can also support it.  I forgot/missed that
ExecSupportsBackwardScan is meant to support cursor operations.  Will fix it
in the next version of the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 20, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Why do we need separate AM support for index_parallelrescan()?  Why
>>> can't index_rescan() cover that case?
>>
>> The reason is that sometime index_rescan is called when we have to
>> just update runtime scankeys in index and we don't want to reset
>> parallel scan for that.
>
> Why not?  I see the code, but the comments don't seem to offer any
> justification for that.  And it seems wrong to me.  If the scan keys
> change, surely that only makes sense if we restart the scan.  You
> can't just blindly continue the same scan if the keys have changed,
> can you?
>

Sure, if the scan keys have changed then we can't continue, but this is
the case where the runtime keys are initialized for the first time.

if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)

In the above check, the second part of the check
(!node->iss_RuntimeKeysReady) ensures that it is for the first time.
Now, let me give you an example to explain what can go wrong if we
allow resetting the parallel scan in this case.  Consider a query like
select * from t1 where c1 < parallel_index(10);.  In this, if we allow
resetting the parallel scan descriptor during the first-time initialization of
runtime keys, it can easily corrupt the parallel scan state.  Suppose
the leader has taken the lead and is scanning some page when a worker reaches
ExecReScanIndexScan() to initialize its keys; if the worker resets the
parallel scan, it will corrupt the shared parallel scan state.

If you agree with the above explanation, then I will expand the
comments in next update.


Just in case you want to debug the above query, below is the schema
and necessary steps.

create or replace function parallel_index(a integer) returns integer
as $$
begin
    return a + 1;
end;
$$ language plpgsql STABLE PARALLEL SAFE;


create table t1(c1 int, c2 char(20));
insert into t1 values(generate_series(1,300000),'aaa');
create index idx_t1 on t1 (c1);

set max_parallel_workers_per_gather=1;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_relation_size=0;
set enable_bitmapscan=off;
set enable_seqscan=off;

select * from t1 where c1 < parallel_index(1000);

> I think the reason that this isn't breaking for you is that it's
> difficult or impossible to get a parallel index scan someplace where
> the keys would change at runtime.  Normally, the parallel scan is on
> the driving relation, and so there are no runtime keys.  We currently
> have no way for a parallel scan to occur on the inner side of a nested
> loop unless there's an intervening Gather node - and in that case the
> keys can't change without relaunching the workers.  It's hard to see
> how it could work otherwise.  For example, suppose you had something
> like this:
>
> Gather
> -> Nested Loop
>   -> Parallel Seq Scan on a
>   -> Hash Join
>     -> Seq Scan on b
>     -> Parallel Shared Hash
>       -> Parallel Index Scan on c
>           Index Cond: c.x = a.x
>
> Well, the problem here is that there's nothing to ensure that various
> workers who are cooperating to build the hash table all have the same
> value for a.x, nor is there any thing to ensure that they'll all get
> done with the shared hash table at the same time.  So this is just
> chaos.  I think we have to disallow this kind of plan as nonsensical.
> Given that, I'm not sure a reset of the runtime keys can ever really
> happen.  Have you investigated this?
>

Having parallelism on the right side under Gather will only be possible
after Robert's Parallel Hash patch, so maybe some investigation is
needed when we review that patch.  Let me know if you want me to
investigate anything more after my explanation above.

> I extracted the generic portions of this infrastructure (i.e. not the
> btree-specific stuff) and spent some time working on it today.  The
> big thing I fixed was the documentation, which you added in a fairly
> illogical part of the file.
>

Hmm, it is not illogical.  All the functions are described in the same
order as they are declared in the IndexAmRoutine structure, and I have
followed the same order.  I think both amestimateparallelscan and
aminitparallelscan should be added one paragraph down, which says (The
purpose of an index .. The scan-related functions that an index access
method must or may provide are:).

>  You had all of the new functions for
> supporting parallel scans in a section that explicitly says it's for
> mandatory functions, and which precedes the section on regular
> non-parallel scans.
>

I think that section should say "must or may provide" instead of "must
provide", as the functions amcanreturn and amproperty are optional and
are described in that section.

>  I moved the documentation for all three new
> methods to the same place, added some explanation of parallel scans in
> general, and rewrote the descriptions for the individual functions to
> be more clear.
>

I think the place where you have added these new functions breaks the
existing convention, which is to describe them in the order they are
declared in IndexAmRoutine.  Apart from that, the extracted patch looks
good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Fri, Jan 20, 2017 at 9:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, if scan keys have changed then we can't continue, but this is
> the case where runtime keys are first time initialized.
>
> if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
>
> In the above check, the second part of the check
> (!node->iss_RuntimeKeysReady) ensures that it is for the first time.
> Now, let me give you an example to explain what bad can happen if we
> allow resetting parallel scan in this case.  Consider a query like
> select * from t1 where c1 < parallel_index(10);, in this if we allow
> resetting parallel scan descriptor during first time initialization of
> runtime keys, it can easily corrupt the parallel scan state.  Suppose
> leader has taken the lead and is scanning some page and worker reaches
> to initialize its keys in ExecReScanIndexScan(), if worker resets the
> parallel scan, then it will corrupt the state of the parallel scan
> state.

Hmm, I see.  So the problem if I understand it correctly is that every
participating process needs to update the backend-private state for
the runtime keys and only one of those processes can update the shared
state.  But in the case of a "real" rescan, even the shared state
needs to be reset.  OK, makes sense.

Why does btparallelrescan cater to the case where scan->parallel_scan
== NULL?  I would assume it should never get called in that case.
Also, I think ExecReScanIndexScan needs significantly better comments.
After some thought I see what it's doing - mostly anyway - but I was
quite confused at first.  I still don't completely understand why it
needs this if-test:

+       /* reset (parallel) index scan */
+       if (node->iss_ScanDesc)
+       {

>> I extracted the generic portions of this infrastructure (i.e. not the
>> btree-specific stuff) and spent some time working on it today.  The
>> big thing I fixed was the documentation, which you added in a fairly
>> illogical part of the file.
>
> Hmm, it is not illogical. All the functions are described in the same
> order as they are declared in IndexAmRoutine structure and I have
> followed the same.

I see.  Sorry, I didn't realize that was what you were going for.

> I think both amestimateparallelscan and
> aminitparallelscan should be added one para down which says (The
> purpose of an index .. The scan-related functions that an index access
> method must or may provide are:).

I think it's a good idea to put all three of those functions together
in the listing, similar to what we did in
69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs.  After all they are
closely related in purpose, and it may be easiest to understand if
they are next to each other in the listing.  I suggest that we move
them to the end in IndexAmRoutine similar to the way FdwRoutine was
done; in other words, my idea is to make the structure consistent with
the way that I revised the documentation instead of making the
documentation consistent with the order you picked for the structure
members.  What I like about that is that it gives a good opportunity
to include some general remarks on parallel index scans in a central
place, as I did in that patch.  Also, it makes it easier for people
who care about parallel index scans to find all of the related things
(since they are together) and for people who don't care about them to
ignore it all (for the same reason).  What do you think about that
approach?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Jan 21, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 20, 2017 at 9:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sure, if scan keys have changed then we can't continue, but this is
>> the case where runtime keys are first time initialized.
>>
>> if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
>>
>> In the above check, the second part of the check
>> (!node->iss_RuntimeKeysReady) ensures that it is for the first time.
>> Now, let me give you an example to explain what bad can happen if we
>> allow resetting parallel scan in this case.  Consider a query like
>> select * from t1 where c1 < parallel_index(10);, in this if we allow
>> resetting parallel scan descriptor during first time initialization of
>> runtime keys, it can easily corrupt the parallel scan state.  Suppose
>> leader has taken the lead and is scanning some page and worker reaches
>> to initialize its keys in ExecReScanIndexScan(), if worker resets the
>> parallel scan, then it will corrupt the state of the parallel scan
>> state.
>
> Hmm, I see.  So the problem if I understand it correctly is that every
> participating process needs to update the backend-private state for
> the runtime keys and only one of those processes can update the shared
> state.  But in the case of a "real" rescan, even the shared state
> needs to be reset.  OK, makes sense.
>

Exactly.

> Why does btparallelrescan cater to the case where scan->parallel_scan
> == NULL?  I would assume it should never get called in that case.
>

Okay, will modify the patch accordingly.

> Also, I think ExecReScanIndexScan needs significantly better comments.
> After some thought I see what's it's doing - mostly anyway - but I was
> quite confused at first.  I still don't completely understand why it
> needs this if-test:
>
> +       /* reset (parallel) index scan */
> +       if (node->iss_ScanDesc)
> +       {
>

I have mentioned the reason towards the end of the e-mail [1] (Refer
line, This is required because ..).  Basically, this is required to
make plans like below work sanely.

Nested Loop
  -> Seq Scan on a
  -> Gather
    -> Parallel Index Scan on b
          Index Cond: b.x = 15

I understand that such plans don't make much sense, but we do support
them, and I have seen a somewhat similar plan getting selected in the
TPC-H benchmark.  Let me know if this needs more explanation.

>
> I think it's a good idea to put all three of those functions together
> in the listing, similar to what we did in
> 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs.  After all they are
> closely related in purpose, and it may be easiest to understand if
> they are next to each other in the listing.  I suggest that we move
> them to the end in IndexAmRoutine similar to the way FdwRoutine was
> done; in other words, my idea is to make the structure consistent with
> the way that I revised the documentation instead of making the
> documentation consistent with the order you picked for the structure
> members.  What I like about that is that it gives a good opportunity
> to include some general remarks on parallel index scans in a central
> place, as I did in that patch.  Also, it makes it easier for people
> who care about parallel index scans to find all of the related things
> (since they are together) and for people who don't care about them to
> ignore it all (for the same reason).  What do you think about that
> approach?
>

Sounds sensible.  Updated patch based on that approach is attached.  I
will rebase the remaining work based on this patch and send them
separately.

[1] -
https://www.postgresql.org/message-id/CAA4eK1%2BnBiCxtxcNuzpaiN%2BnrRrRB5YDgoaqb3hyn%3DYUxL-%2BOw%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Jan 21, 2017 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jan 21, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> I think it's a good idea to put all three of those functions together
>> in the listing, similar to what we did in
>> 69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs.  After all they are
>> closely related in purpose, and it may be easiest to understand if
>> they are next to each other in the listing.  I suggest that we move
>> them to the end in IndexAmRoutine similar to the way FdwRoutine was
>> done; in other words, my idea is to make the structure consistent with
>> the way that I revised the documentation instead of making the
>> documentation consistent with the order you picked for the structure
>> members.  What I like about that is that it gives a good opportunity
>> to include some general remarks on parallel index scans in a central
>> place, as I did in that patch.  Also, it makes it easier for people
>> who care about parallel index scans to find all of the related things
>> (since they are together) and for people who don't care about them to
>> ignore it all (for the same reason).  What do you think about that
>> approach?
>>
>
> Sounds sensible.  Updated patch based on that approach is attached.
>

In spite of being careful, I missed reorganizing the functions in
genam.h which I have done in attached patch.

>  I
> will rebase the remaining work based on this patch and send them
> separately.
>

Rebased patches are attached.  I have addressed a few review comments
in these patches.
parallel_index_scan_v6 - Changed the function btparallelrescan so that
it always expects a valid parallel scan descriptor.
parallel_index_opt_exec_support_v6 - Removed the usage of
GatherSupportsBackwardScan.  Expanded the comments in
ExecReScanIndexScan.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>> >
>> > +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool
>> > *status);
>> > +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
>> > scan_page);
>> >
>> > Any better names for the above functions, as these function will
>> > provide/set
>> > the next page that needs to be read.
>> >
>>
>> These functions also set the state of scan.  IIRC, these names were
>> being agreed between Robert and Rahila as well (suggested offlist by
>> Robert).  I am open to change if you or others have any better
>> suggestions.
>
>
> I didn't find any better names other than the following,
>
> _bt_get_next_parallel_page
> _bt_set_next_parallel_page
>

I am not sure using *_next_* here will convey the message because for
backward scans we set the last page.  I am open to changing the names
of the functions if the committer and/or others prefer the names
suggested by you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In spite of being careful, I missed reorganizing the functions in
> genam.h which I have done in attached patch.

Cool.  Committed parallel-generic-index-scan.2.patch.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Haribabu Kommi
Date:


On Mon, Jan 23, 2017 at 5:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>> >
>> > +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool
>> > *status);
>> > +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
>> > scan_page);
>> >
>> > Any better names for the above functions, as these function will
>> > provide/set
>> > the next page that needs to be read.
>> >
>>
>> These functions also set the state of scan.  IIRC, these names were
>> being agreed between Robert and Rahila as well (suggested offlist by
>> Robert).  I am open to change if you or others have any better
>> suggestions.
>
>
> I didn't find any better names other than the following,
>
> _bt_get_next_parallel_page
> _bt_set_next_parallel_page
>

I am not sure using *_next_* here will convey the message because for
backward scans we set the last page.  I am open to changing the names
of functions if committer and or others prefer the names suggested by
you.

OK. I am fine with it.
I don't have any other comments on the patch.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Jan 27, 2017 at 6:53 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Jan 23, 2017 at 5:07 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> > I didn't find any better names other than the following,
>> >
>> > _bt_get_next_parallel_page
>> > _bt_set_next_parallel_page
>> >
>>
>> I am not sure using *_next_* here will convey the message because for
>> backward scans we set the last page.  I am open to changing the names
>> of functions if committer and or others prefer the names suggested by
>> you.
>
>
> OK. I am fine with it.
> I don't have any other comments on the patch.
>

Thanks for the review.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> parallel_index_opt_exec_support_v6 - Removed the usage of
> GatherSupportsBackwardScan.  Expanded the comments in
> ExecReScanIndexScan.

I looked through this and in general it looks reasonable to me.
However, I did notice one thing that I think is wrong.  In the
parallel bitmap heap scan patch, the second argument to
compute_parallel_worker() is the number of pages that the parallel
scan is expected to fetch from the heap.  In this patch, it's the
total number of pages in the index.  The former seems to me to be
better, because the point of having a threshold relation size for
parallelism is that we don't want to use a lot of workers to scan a
small number of pages -- the distribution of work won't be even, and
the potential savings are limited.  If we've got a big index but are
using a very selective qual to pull out only one or a small number of
rows on a single page or a small handful of pages, we shouldn't
generate a parallel path for that.

Now, against that theory, the GUC that controls the behavior of
compute_parallel_worker() is called min_parallel_relation_size, which
might make you think that the decision is supposed to be based on the
whole size of some relation.  But I think that just means we need to
rename the GUC to something like min_parallel_scan_size.  Possibly we
also ought to consider reducing the default value somewhat, because it
seems like both sequential and index scans can benefit even when
scanning less than 8MB.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> parallel_index_opt_exec_support_v6 - Removed the usage of
>> GatherSupportsBackwardScan.  Expanded the comments in
>> ExecReScanIndexScan.
>
> I looked through this and in general it looks reasonable to me.
> However, I did notice one thing that I think is wrong.  In the
> parallel bitmap heap scan patch, the second argument to
> compute_parallel_worker() is the number of pages that the parallel
> scan is expected to fetch from the heap.  In this patch, it's the
> total number of pages in the index.  The former seems to me to be
> better, because the point of having a threshold relation size for
> parallelism is that we don't want to use a lot of workers to scan a
> small number of pages -- the distribution of work won't be even, and
> the potential savings are limited.  If we've got a big index but are
> using a very selective qual to pull out only one or a small number of
> rows on a single page or a small handful of pages, we shouldn't
> generate a parallel path for that.
>

Agreed that it makes sense to consider only the number of pages to
scan for the computation of parallel workers.  I think for an index
scan we should consider both the index and heap pages that need to be
scanned (the costing of an index scan considers both index and heap
pages).  I think considering heap pages matters more when the finally
selected rows are scattered across heap pages or when we need to
apply a filter on rows after fetching them from the heap.  OTOH, we
could consider just the pages in the index, as that is mainly where
the parallelism works.  In the attached patch
(parallel_index_opt_exec_support_v7.patch), I have considered both
index and heap pages; let me know if you think some other way is
better.  I have also prepared a separate independent patch
(compute_index_pages_v1) on HEAD to compute index pages, which can be
used by parallel index scan.  There is no change in the
parallel_index_scan (parallel btree scan) patch, so I am not
attaching its new version.
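For reference, the worker computation itself scales logarithmically
with whatever page count it is given, so the question above is only
about which page count to pass in.  A sketch of the tripling
heuristic (the constants and exact rule here are illustrative, not a
copy of compute_parallel_worker()):

```python
def compute_parallel_worker(pages, min_parallel_size=1024, max_workers=8):
    """Pick a worker count from the number of pages the scan will touch:
    one worker once the threshold is reached, one more each time the
    page count triples.  Constants are illustrative (1024 pages = 8MB
    with the default block size)."""
    if pages < min_parallel_size:
        return 0                      # too small to parallelize
    workers, threshold = 1, min_parallel_size
    while pages >= threshold * 3 and workers < max_workers:
        workers += 1
        threshold *= 3
    return workers

# Driving the computation with index pages plus the heap pages expected
# to be fetched, rather than the whole index size:
index_pages, heap_pages_to_fetch = 300, 3000
assert compute_parallel_worker(index_pages + heap_pages_to_fetch) == 2
# A very selective scan over a big index stays serial:
assert compute_parallel_worker(50) == 0
```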

> Now, against that theory, the GUC that controls the behavior of
> compute_parallel_worker() is called min_parallel_relation_size, which
> might make you think that the decision is supposed to be based on the
> whole size of some relation.  But I think that just means we need to
> rename the GUC to something like min_parallel_scan_size.  Possibly we
> also ought consider reducing the default value somewhat, because it
> seems like both sequential and index scans can benefit even when
> scanning less than 8MB.
>

Agreed, but let's consider it separately.


The order in which the patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1],
parallel_index_opt_exec_support_v7.patch


[1] - https://www.postgresql.org/message-id/CAA4eK1J%3DLSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel Index Scans

From
Rahila Syed
Date:
Hello,

>Agreed, that it makes sense to consider only the number of pages to
>scan for computation of parallel workers.  I think for index scan we
>should consider both index and heap pages that need to be scanned
>(costing of index scan considers both index and heap pages).  I think
>where considering heap pages matters more is when the finally selected
>rows are scattered across heap pages or we need to apply a filter on
>rows after fetching from the heap.  OTOH, we can consider just pages
>in the index as that is where mainly the parallelism works
IMO, considering just index pages will give a better estimate of the
work to be done in parallel, since the amount of work (the number of
pages) divided amongst the workers is independent of the number of
heap pages scanned.

Thank you,
Rahila Syed




On Mon, Jan 30, 2017 at 2:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> parallel_index_opt_exec_support_v6 - Removed the usage of
>> GatherSupportsBackwardScan.  Expanded the comments in
>> ExecReScanIndexScan.
>
> I looked through this and in general it looks reasonable to me.
> However, I did notice one thing that I think is wrong.  In the
> parallel bitmap heap scan patch, the second argument to
> compute_parallel_worker() is the number of pages that the parallel
> scan is expected to fetch from the heap.  In this patch, it's the
> total number of pages in the index.  The former seems to me to be
> better, because the point of having a threshold relation size for
> parallelism is that we don't want to use a lot of workers to scan a
> small number of pages -- the distribution of work won't be even, and
> the potential savings are limited.  If we've got a big index but are
> using a very selective qual to pull out only one or a small number of
> rows on a single page or a small handful of pages, we shouldn't
> generate a parallel path for that.
>

Agreed, that it makes sense to consider only the number of pages to
scan for computation of parallel workers.  I think for index scan we
should consider both index and heap pages that need to be scanned
(the costing of an index scan considers both index and heap pages).  I think
where considering heap pages matters more is when the finally selected
rows are scattered across heap pages or we need to apply a filter on
rows after fetching from the heap.  OTOH, we can consider just pages
in the index as that is where mainly the parallelism works.  In the
attached patch (parallel_index_opt_exec_support_v7.patch), I have
considered both index and heap pages, let me know if you think some
other way is better.  I have also prepared a separate independent
patch (compute_index_pages_v1) on HEAD to compute index pages which
can be used by parallel index scan. There is no change in
parallel_index_scan (parallel btree scan) patch, so I am not attaching
its new version.

> Now, against that theory, the GUC that controls the behavior of
> compute_parallel_worker() is called min_parallel_relation_size, which
> might make you think that the decision is supposed to be based on the
> whole size of some relation.  But I think that just means we need to
> rename the GUC to something like min_parallel_scan_size.  Possibly we
> also ought consider reducing the default value somewhat, because it
> seems like both sequential and index scans can benefit even when
> scanning less than 8MB.
>

Agreed, but let's consider it separately.


The order in which the patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1],
parallel_index_opt_exec_support_v7.patch


[1] - https://www.postgresql.org/message-id/CAA4eK1J%3DLSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In spite of being careful, I missed reorganizing the functions in
>> genam.h which I have done in attached patch.
>
> Cool.  Committed parallel-generic-index-scan.2.patch.  Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

The hunk in indexam.c looks like a leftover that can be omitted.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum.  I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values.  But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan

Initializing -> initialize

+ *  btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two.  Make it
consistent, please.

+     * Ideally, we don't need to acquire spinlock here, but being
+     * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *      scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page.  Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()."  And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page.  We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."
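A minimal single-process model of that protocol (the state names
follow the patch's enum, but the Python here is only an illustrative
sketch of the seize/release/done contract, not the patch's C code):

```python
NOT_INITIALIZED, ADVANCING, IDLE, DONE = range(4)

class BTParallelScan:
    def __init__(self, first_page=0):
        self.status = IDLE
        self.scan_page = first_page     # models btps_scanPage

    def seize(self):
        """Begin advancing the scan; only one participant at a time may
        be between seize and release.  Returns (ok, page).  A real
        implementation would sleep on a condition variable instead of
        raising while another participant holds ADVANCING."""
        if self.status == DONE:
            return False, None
        if self.status == ADVANCING:
            raise RuntimeError("would wait on condition variable here")
        self.status = ADVANCING
        return True, self.scan_page     # page this participant will read

    def release(self, next_page):
        """Complete advancing: publish the new scan_page so some other
        participant can begin advancing the scan."""
        self.scan_page = next_page
        self.status = IDLE

    def done(self):
        """Mark the parallel scan as complete."""
        self.status = DONE

scan = BTParallelScan(first_page=1)
ok, page = scan.seize()
assert ok and page == 1 and scan.status == ADVANCING
scan.release(next_page=2)               # others may now advance the scan
ok, page = scan.seize()
assert ok and page == 2
scan.done()
assert scan.seize() == (False, None)    # no further pages handed out
```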

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE.  It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

_bt_readnextpage isn't very clear about the meaning of its blkno
argument.  It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward?  That
is a little odd.  Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:
           /* nope, keep going */
vs.
            /*
             * For parallel scans, get the last page scanned as it is quite
             * possible that by the time we try to fetch previous page, other
             * worker has also decided to scan that previous page.  We could
             * avoid that by doing _bt_parallel_release once we have read the
             * current page, but it is bad to make other workers wait till we
             * read the page.
             */

Now it seems to me that these cases are symmetric and the issues are
identical.  It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

-            /* check for deleted page */
             page = BufferGetPage(so->currPos.buf);
             TestForOldSnapshot(scan->xs_snapshot, rel, page);
             opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+            /* check for deleted page */

This is an independent change; committed.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs?  What's the highest degree of parallelism that's
been tested?  Did that test include array keys?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Tue, Jan 31, 2017 at 5:53 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
>>Agreed, that it makes sense to consider only the number of pages to
>>scan for computation of parallel workers.  I think for index scan we
>>should consider both index and heap pages that need to be scanned
>>(costing of index scan consider both index and heap pages).   I thin
>>where considering heap pages matter more is when the finally selected
>>rows are scattered across heap pages or we need to apply a filter on
>>rows after fetching from the heap.  OTOH, we can consider just pages
>>in the index as that is where mainly the parallelism works
> IMO, considering just index pages will give a better estimate of work to be
> done
> in parallel. As the amount of work/number of pages divided amongst workers
> is irrespective of
> the number of heap pages scanned.
>

Yeah, I understand that point, and I can see there is a strong
argument for doing it that way, but let's wait and see what others,
including Robert, have to say about this point.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Rahila Syed
Date:
Hello Robert,

>I am a bit mystified about how this manages to work with array keys.
>_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
>unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
>_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
>is already BTPARALLEL_DONE.  It seems like one of those two things has
>to be wrong.

btps_pageStatus is set to BTPARALLEL_DONE only by the first worker that
is performing the scan for the latest array key and has encountered the
end of the scan.  This is ensured by the following check in
_bt_parallel_done():

   if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
       btscan->btps_pageStatus != BTPARALLEL_DONE)

Thus, BTPARALLEL_DONE marks the end of the scan only for the latest
array key.  This ensures that when any worker reaches
_bt_advance_array_keys(), it advances the latest scan (tracked by
btscan->btps_arrayKeyCount) only when the scan for the latest key has
ended, which it detects by checking if (btps_pageStatus ==
BTPARALLEL_DONE).  Otherwise, the worker just advances its local
so->arrayKeyCount.
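The rule can be captured in a small standalone model (a sketch only,
with simplified names; the real checks live in _bt_parallel_done() and
the array-key advance function):

```python
NOT_INITIALIZED, DONE = "not_initialized", "done"

class Shared:
    def __init__(self):
        self.page_status = NOT_INITIALIZED
        self.array_key_count = 0     # models btps_arrayKeyCount

class Worker:
    def __init__(self, shared):
        self.shared = shared
        self.array_key_count = 0     # models so->arrayKeyCount

    def parallel_done(self):
        # Only a worker scanning the *latest* array key may end the scan;
        # a worker still on an older key must not mark it done.
        s = self.shared
        if self.array_key_count >= s.array_key_count and s.page_status != DONE:
            s.page_status = DONE

    def advance_array_keys(self):
        # Advance the shared scan to the next key only once the scan for
        # the latest key has actually finished; otherwise just catch up
        # the worker's local count.
        s = self.shared
        self.array_key_count += 1
        if s.page_status == DONE and self.array_key_count > s.array_key_count:
            s.array_key_count = self.array_key_count
            s.page_status = NOT_INITIALIZED   # restart scan for the new key

shared = Shared()
fast, slow = Worker(shared), Worker(shared)
fast.parallel_done()                  # fast worker finishes key 0
fast.advance_array_keys()             # ...and moves the scan to key 1
assert shared.array_key_count == 1
slow.parallel_done()                  # stale: slow is still on key 0
assert shared.page_status != DONE     # so it cannot end the key-1 scan
```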

I hope this provides some clarification.

Thank you,
Rahila Syed

 



On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In spite of being careful, I missed reorganizing the functions in
>> genam.h which I have done in attached patch.
>
> Cool.  Committed parallel-generic-index-scan.2.patch.  Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

The hunk in indexam.c looks like a leftover that can be omitted.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum.  I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values.  But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan

Initializing -> initialize

+ *  btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two.  Make it
consistent, please.

+     * Ideally, we don't need to acquire spinlock here, but being
+     * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *      scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page.  Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()."  And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page.  We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE.  It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

_bt_readnextpage isn't very clear about the meaning of its blkno
argument.  It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward?  That
is a little odd.  Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

            /* nope, keep going */
vs.
            /*
             * For parallel scans, get the last page scanned as it is quite
             * possible that by the time we try to fetch previous page, other
             * worker has also decided to scan that previous page.  We could
             * avoid that by doing _bt_parallel_release once we have read the
             * current page, but it is bad to make other workers wait till we
             * read the page.
             */

Now it seems to me that these cases are symmetric and the issues are
identical.  It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).
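For readers following along, the seize/release handoff being discussed can be modeled as a toy, single-threaded sketch.  The field and function names loosely follow the patch, but the condition-variable waiting, locking, and buffer handling are omitted; this only illustrates the protocol ordering, not the actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define P_NONE (-1)              /* invented sentinel for "no next page" */

typedef struct
{
    int     btps_scanPage;       /* next page to hand out to a worker */
    bool    advancing;           /* some worker is advancing the scan */
    bool    done;                /* scan is complete */
} ToyParallelScan;

/* Begin advancing the scan to a new page; in the real patch, other
 * workers sleep on a condition variable while 'advancing' is set. */
static bool
toy_seize(ToyParallelScan *ps, int *blkno)
{
    if (ps->done)
        return false;
    assert(!ps->advancing);      /* never contended in this 1-thread model */
    ps->advancing = true;
    *blkno = ps->btps_scanPage;
    return true;
}

/* Complete the advance: publish the successor page so another worker can
 * seize it while we go off and process the tuples we just read. */
static void
toy_release(ToyParallelScan *ps, int next_blkno)
{
    ps->btps_scanPage = next_blkno;
    ps->advancing = false;
}

/* Mark the whole scan complete. */
static void
toy_done(ToyParallelScan *ps)
{
    ps->done = true;
}
```

The point of the ordering is the one the thread keeps coming back to: a worker publishes the next page via toy_release() immediately after reading its own page, before spending time on the tuples, so other workers are not kept waiting; the price is that, between release and the next seize, some other worker may already have advanced the scan.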

-            /* check for deleted page */
             page = BufferGetPage(so->currPos.buf);
             TestForOldSnapshot(scan->xs_snapshot, rel, page);
             opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+            /* check for deleted page */

This is an independent change; committed.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs?  What's the highest degree of parallelism that's
been tested?  Did that test include array keys?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> In spite of being careful, I missed reorganizing the functions in
>>> genam.h which I have done in attached patch.
>>
>> Cool.  Committed parallel-generic-index-scan.2.patch.  Thanks.
>
> Reviewing parallel_index_scan_v6.patch:
>
> I think it might be better to swap the return value and the out
> parameter for _bt_parallel_seize; that is, return a bool, and have
> callers ignore the value of the out parameter (e.g. *pageno).
>

makes sense, so changed accordingly.

> I think _bt_parallel_advance_scan should be renamed something that
> includes the word "keys", like _bt_parallel_advance_array_keys.
>

Changed as per suggestion.

> The hunk in indexam.c looks like a leftover that can be omitted.
>

It is not a leftover hunk.  Earlier, the patch had the same check in
btparallelrescan, but based on your comment upthread [1] (why does
btparallelrescan cater to the case where scan->parallel_scan == NULL?),
it has been moved to indexam.c.


> +/*
> + * Below flags are used to indicate the state of parallel scan.
>
> They aren't really flags any more; they're members of an enum.  I
> think you could just leave this sentence out entirely and start right
> in with descriptions of the individual values.  But maybe all of those
> descriptions should end in a period (instead of one ending in a period
> but not the other three) since they read like complete sentences.
>
> + * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan
>
> Initializing -> initialize
>
> + *  btparallelrescan() -- reset parallel scan
>
> Previous two prototypes have one dash, this one has two.  Make it
> consistent, please.
>
> +     * Ideally, we don't need to acquire spinlock here, but being
> +     * consistent with heap_rescan seems to be a good idea.
>
> How about: In theory, we don't need to acquire the spinlock here,
> because there shouldn't be any other workers running at this point,
> but we do so for consistency.
>
> + * _bt_parallel_seize() -- returns the next block to be scanned for forward
> + *      scans and latest block scanned for backward scans.
>
> I think the description should be more like "Begin the process of
> advancing the scan to a new page.  Other scans must wait until we call
> bt_parallel_release() or bt_parallel_done()."  And likewise
> _bt_parallel_release() should say something like "Complete the process
> of advancing the scan to a new page.  We now have the new value for
> btps_scanPage; some other backend can now begin advancing the scan."
> And _bt_parallel_done should say something like "Mark the parallel
> scan as complete."
>

Changed the code as per your above suggestions.

> I am a bit mystified about how this manages to work with array keys.
> _bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
> unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
> _bt_parallel_advance_scan() won't do anything unless btps_pageStatus
> is already BTPARALLEL_DONE.
>

This is just to ensure that btps_arrayKeyCount is advanced, and
btps_pageStatus is changed to BTPARALLEL_DONE, exactly once per array
element.  It goes something like this: if we have an array with values
[1,2,3], all the workers will complete the scan with key 1, one of them
will mark btps_pageStatus as BTPARALLEL_DONE, and the first one to hit
_bt_parallel_advance_scan will increment btps_arrayKeyCount; the same
then happens for keys 2 and 3.  It is quite possible that by the time
one of the participants advances its local key, the scan for that key
is already finished; we handle that situation in _bt_parallel_seize()
with the check below:

if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
*status = false;

I think Rahila has also mentioned something along these lines; let us
know if we are missing something here.  Do you want more comments in
the code to explain this handling, and if so, where (on top of
function _bt_parallel_advance_scan)?
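As a sanity check of the description above, here is a minimal single-process model of the three operations.  Field and function names mirror the patch, but all locking, waiting, and page bookkeeping are stripped out, so treat this as an illustration of the state transitions only:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    BTPARALLEL_NOT_INITIALIZED,
    BTPARALLEL_DONE
} ToyPageStatus;

typedef struct
{
    ToyPageStatus btps_pageStatus;
    int           btps_arrayKeyCount;  /* advanced once per array element */
} ToyShared;

typedef struct
{
    int arrayKeyCount;                 /* this worker's local key number */
} ToyWorker;

/* Mark the scan done, but only if the caller is still on the current key;
 * a laggard worker on an old key must not mark the new key's scan done. */
static void
toy_parallel_done(ToyWorker *so, ToyShared *btscan)
{
    if (so->arrayKeyCount >= btscan->btps_arrayKeyCount)
        btscan->btps_pageStatus = BTPARALLEL_DONE;
}

/* Advance to the next array key; only the first worker to arrive after
 * BTPARALLEL_DONE actually bumps the shared counter and resets the scan. */
static void
toy_advance_array_keys(ToyWorker *so, ToyShared *btscan)
{
    so->arrayKeyCount++;
    if (btscan->btps_pageStatus == BTPARALLEL_DONE &&
        so->arrayKeyCount > btscan->btps_arrayKeyCount)
    {
        btscan->btps_arrayKeyCount++;
        btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
    }
}

/* The check quoted above: a worker whose local key lags the shared key
 * learns that the scan for its key has already finished. */
static bool
toy_seize(ToyWorker *so, ToyShared *btscan)
{
    if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
        return false;
    return btscan->btps_pageStatus != BTPARALLEL_DONE;
}
```

Running two toy workers through the first key shows btps_arrayKeyCount advancing exactly once even though both workers call toy_advance_array_keys(), which is the invariant the paragraph above describes.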


>  It seems like one of those two things has
> to be wrong.
>
> _bt_readpage's header comment should be updated to note that, in the
> case of a parallel scan, _bt_parallel_seize should have been called
> prior to entering this function, and _bt_parallel_release will be
> called prior to return (or this could alternatively be phrased in
> terms of btps_pageStatus on entry/exit).
>

Changed as per suggestion.

> _bt_readnextpage isn't very clear about the meaning of its blkno
> argument.  It looks like it always has to be valid when scanning
> forward, but only in the parallel case when scanning backward?
>

It can be invalid only for a non-parallel backward scan, and in that
case the appropriate value for so->currPos will be set.  Refer to
_bt_steppage().  I have updated the comments.

> That
> is a little odd.  Another, possibly related thing that is odd is that
> when _bt_steppage() finds no tuples and decides to advance to a new
> page again, there's a very short comment in the forward case and a
> very long comment in the backward case:
>
>             /* nope, keep going */
> vs.
>             /*
>              * For parallel scans, get the last page scanned as it is quite
>              * possible that by the time we try to fetch previous page, other
>              * worker has also decided to scan that previous page.  We could
>              * avoid that by doing _bt_parallel_release once we have read the
>              * current page, but it is bad to make other workers wait till we
>              * read the page.
>              */
>
> Now it seems to me that these cases are symmetric and the issues are
> identical.  It's basically that, while we were judging whether the
> current page has useful contents, some other process could have
> advanced the scan (since _bt_readpage un-seized it).
>

Yeah, but the reason for the difference in comments is that for
non-parallel backward scans there is no code at that place to move to
the previous page; it relies on the next call to _bt_walk_left(),
whereas parallel scans can't simply rely on _bt_walk_left().  I have
slightly modified the comments for the backward scan case; see if that
looks better, and if not, suggest what you think is better.

> -            /* check for deleted page */
>              page = BufferGetPage(so->currPos.buf);
>              TestForOldSnapshot(scan->xs_snapshot, rel, page);
>              opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +            /* check for deleted page */
>
> This is an independent change; committed.
>

Thanks.

> What kind of testing has been done to ensure that this doesn't have
> concurrency bugs?
>

We used large tables for parallel index scans (both forward and
backward).  These tests were done by Tushar, and you can find the
detailed report upthread [2].  Apart from that, the patch has been
tested with TPC-H queries at various scale factors; it is used in
multiple queries, and we have verified the results of those as well.
The TPC-H tests were done by Rafia.

Tushar has done some further extensive testing of this patch.  Tushar,
can you please share your test results?

> What's the highest degree of parallelism that's
> been tested?

7

>  Did that test include array keys?
>

Yes.

Note - The order in which the patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v7.patch,
parallel_index_opt_exec_support_v7.patch.  The first and third patches
are posted upthread [3].

[1] - https://www.postgresql.org/message-id/CA%2BTgmoZv0%2BcLUV7fZRo76_PB9cfu5mBCVmoXKmaqrc7F30nJzw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/1d6353a0-63cb-65d9-a70c-0913899d5b06%40enterprisedb.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KowGSYYVpd2qPpaPPA5R90r%2B%2BQwDFbrRECTE9H_HvpOg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
tushar
Date:
On 02/01/2017 06:50 PM, Amit Kapila wrote:
Used large table parallel index scans (both forward and backward
scans).  These tests have been done by Tushar and you can find
detailed report up thread [2].  Apart from that, the patch has been
tested with TPC-H queries at various scale factors and it is being
used in multiple queries and we have verified the results of same as
well.  TPC-H tests have been done by Rafia.

Tushar has done some further extensive test of this patch.  Tushar,
can you please share your test results?
Yes, we have:
0) Tested on a high-end machine with the following configuration

[edb@ip-10-0-38-61 pg_log]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz

[edb@ip-10-0-38-61 pg_log]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        961G   60K  961G   1% /dev
tmpfs           961G  556K  961G   1% /dev/shm
/dev/xvda1      197G  156G   42G  80% /

[edb@ip-10-0-38-61 pg_log]$ free
             total       used       free     shared    buffers     cached
Mem:    2014742800  170971292 1843771508     142668     166128  162463396
-/+ buffers/cache:    8341768 2006401032
Swap:            0          0          0

1) Executed the testcases with multiple clients (e.g. ran our testcase file against 4 different psql terminals of the same server simultaneously) for concurrency.
   We made an effort to execute the same set of tests (testcase.sql file) via different terminals against the same server.
2) We checked count(*) of the query before and after disabling/enabling max_parallel_workers_per_gather to make sure the end result (o/p) is consistent.
3) We were able to get parallel workers = 14 (the highest degree of parallelism) in our case.

pgbench with scale factor = 10,000 (149 GB of data in the database; 100 million rows inserted) on an Amazon instance (128 cores, 4 NUMA nodes).

We are able to see 14 workers launched out of 14 workers planned for the query below:

postgres=# \di+ pgbench_accounts_pkey
                                    List of relations
 Schema |         Name          | Type  | Owner |      Table       | Size  | Description
--------+-----------------------+-------+-------+------------------+-------+-------------
 public | pgbench_accounts_pkey | index | edb   | pgbench_accounts | 21 GB |
(1 row)

The index size is now 21 GB.

postgres=# explain analyse verbose select * from pgbench_accounts where aid <50000000 and bid <=1 ;
                                                                                  QUERY PLAN                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.57..1745380.10 rows=4691 width=97) (actual time=0.546..2316.118 rows=100000 loops=1)
   Output: aid, bid, abalance, filler
   Workers Planned: 14
   Workers Launched: 14
   ->  Parallel Index Scan using pgbench_accounts_pkey on public.pgbench_accounts  (cost=0.57..1745380.10 rows=335 width=97) (actual time=0.081..2253.234 rows=6667 loops=15)
         Output: aid, bid, abalance, filler
         Index Cond: (pgbench_accounts.aid < 50000000)
         Filter: (pgbench_accounts.bid <= 1)
         Rows Removed by Filter: 3326667
         Worker 0: actual time=0.069..2251.456 rows=7036 loops=1
         Worker 1: actual time=0.070..2256.772 rows=6588 loops=1
         Worker 2: actual time=0.071..2257.164 rows=6954 loops=1
         Worker 3: actual time=0.079..2255.166 rows=6222 loops=1
         Worker 4: actual time=0.063..2254.814 rows=6588 loops=1
         Worker 5: actual time=0.091..2253.872 rows=6588 loops=1
         Worker 6: actual time=0.093..2254.237 rows=6222 loops=1
         Worker 7: actual time=0.068..2254.749 rows=7320 loops=1
         Worker 8: actual time=0.060..2253.953 rows=6588 loops=1
         Worker 9: actual time=0.127..2253.546 rows=8052 loops=1
         Worker 10: actual time=0.091..2252.737 rows=7686 loops=1
         Worker 11: actual time=0.087..2252.056 rows=7320 loops=1
         Worker 12: actual time=0.091..2252.600 rows=7320 loops=1
         Worker 13: actual time=0.057..2252.341 rows=7686 loops=1
 Planning time: 0.165 ms
 Execution time: 2357.132 ms
(25 rows)

Even for array keys, where the index size is only in MB, we are able to see 9 workers launched out of 9 workers planned:

postgres=# set enable_bitmapscan =0;
SET
postgres=# set enable_seqscan =0;
SET
postgres=# \di+ ary_idx
                        List of relations
 Schema |  Name   | Type  | Owner |  Table  | Size  | Description
--------+---------+-------+-------+---------+-------+-------------
 public | ary_idx | index | edb   | ary_tab | 56 MB |
(1 row)

postgres=# explain analyze verbose select count(1) from ary_tab where ARRAY[7,8,9,10]=c2 and c1 = 'four';
                                                                         QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=47083.83..47083.84 rows=1 width=8) (actual time=141.766..141.767 rows=1 loops=1)
   Output: count(1)
   ->  Gather  (cost=47083.80..47083.81 rows=9 width=8) (actual time=141.547..141.753 rows=10 loops=1)
         Output: (PARTIAL count(1))
         Workers Planned: 9
         Workers Launched: 9
         ->  Partial Aggregate  (cost=47083.80..47083.81 rows=1 width=8) (actual time=136.679..136.679 rows=1 loops=10)
               Output: PARTIAL count(1)
               Worker 0: actual time=135.215..135.215 rows=1 loops=1
               Worker 1: actual time=136.158..136.158 rows=1 loops=1
               Worker 2: actual time=136.348..136.349 rows=1 loops=1
               Worker 3: actual time=136.564..136.565 rows=1 loops=1
               Worker 4: actual time=135.759..135.760 rows=1 loops=1
               Worker 5: actual time=136.405..136.405 rows=1 loops=1
               Worker 6: actual time=136.158..136.158 rows=1 loops=1
               Worker 7: actual time=136.319..136.319 rows=1 loops=1
               Worker 8: actual time=136.597..136.597 rows=1 loops=1
               ->  Parallel Index Scan using ary_idx on public.ary_tab  (cost=0.42..47083.79 rows=4 width=0) (actual time=122.557..136.673 rows=5 loops=10)
                     Index Cond: ('{7,8,9,10}'::integer[] = ary_tab.c2)
                     Filter: (ary_tab.c1 = 'four'::text)
                     Rows Removed by Filter: 100000
                     Worker 0: actual time=135.211..135.211 rows=0 loops=1
                     Worker 1: actual time=136.153..136.153 rows=0 loops=1
                     Worker 2: actual time=136.342..136.342 rows=0 loops=1
                     Worker 3: actual time=136.559..136.559 rows=0 loops=1
                     Worker 4: actual time=135.756..135.756 rows=0 loops=1
                     Worker 5: actual time=136.402..136.402 rows=0 loops=1
                     Worker 6: actual time=136.150..136.150 rows=0 loops=1
                     Worker 7: actual time=136.314..136.314 rows=0 loops=1
                     Worker 8: actual time=136.592..136.592 rows=0 loops=1
 Planning time: 0.813 ms
 Execution time: 145.881 ms
(32 rows)

4) The LCOV/SQL report for the same can be found @ https://www.postgresql.org/message-id/1d6353a0-63cb-65d9-a70c-0913899d5b06@enterprisedb.com
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel Index Scans

From
Michael Paquier
Date:
On Wed, Feb 1, 2017 at 10:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> makes sense, so changed accordingly.

I have moved this patch to CF 2017-03.
-- 
Michael



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, I understand that point and I can see there is strong argument
> to do that way, but let's wait and see what others including Robert
> have to say about this point.

It seems to me that you can make an argument for any point of view.
In a parallel sequential scan, the smallest unit of work that can be
given to one worker is one heap page; in a parallel index scan, it's
one index page.  By that logic, as Rahila says, we ought to do this
based on the number of index pages.  On the other hand, it's weird to
use the same GUC to measure index pages at some times and heap pages
at other times, and it could result in failing to engage parallelism
where we really should do so, or using an excessively small number of
workers.  An index scan that hits 25 index pages could hit 1000 heap
pages; if it's OK to use a parallel sequential scan for a table with
1000 heap pages, why is it not OK to use a parallel index scan to scan
1000 heap pages?  I can't think of any reason.

On balance, I'm somewhat inclined to think that we ought to base
everything on heap pages, so that we're always measuring in the same
units.  That's what Dilip's patch for parallel bitmap heap scan does,
and I think it's a reasonable choice.  However, for parallel index
scan, we might want to also cap the number of workers to, say,
index_pages/10, just so we don't pick an index scan that's going to
result in a very lopsided work distribution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Feb 4, 2017 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, I understand that point and I can see there is strong argument
>> to do that way, but let's wait and see what others including Robert
>> have to say about this point.
>
> It seems to me that you can make an argument for any point of view.
> In a parallel sequential scan, the smallest unit of work that can be
> given to one worker is one heap page; in a parallel index scan, it's
> one index page.  By that logic, as Rahila says, we ought to do this
> based on the number of index pages.  On the other hand, it's weird to
> use the same GUC to measure index pages at some times and heap pages
> at other times, and it could result in failing to engage parallelism
> where we really should do so, or using an excessively small number of
> workers.  An index scan that hits 25 index pages could hit 1000 heap
> pages; if it's OK to use a parallel sequential scan for a table with
> 1000 heap pages, why is it not OK to use a parallel index scan to scan
> 1000 heap pages?  I can't think of any reason.
>

I think one difference is that if we want to scan 1000 heap pages with
a parallel index scan, the cost of scanning the index is additional as
compared to a parallel sequential scan.

> On balance, I'm somewhat inclined to think that we ought to base
> everything on heap pages, so that we're always measuring in the same
> units.  That's what Dilip's patch for parallel bitmap heap scan does,
> and I think it's a reasonable choice.  However, for parallel index
> scan, we might want to also cap the number of workers to, say,
> index_pages/10, just so we don't pick an index scan that's going to
> result in a very lopsided work distribution.
>

I guess in the above context you mean heap_pages or index_pages that
are expected to be *fetched* during index scan.

Yet another thought is that for parallel index scan we use
index_pages_fetched, but use either a different GUC
(min_parallel_index_rel_size) with a relatively lower default value
(say equal to min_parallel_relation_size/4 = 2MB) or directly use
min_parallel_relation_size/4 for parallel index scans.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Feb 4, 2017 at 7:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Feb 4, 2017 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> On balance, I'm somewhat inclined to think that we ought to base
>> everything on heap pages, so that we're always measuring in the same
>> units.  That's what Dilip's patch for parallel bitmap heap scan does,
>> and I think it's a reasonable choice.  However, for parallel index
>> scan, we might want to also cap the number of workers to, say,
>> index_pages/10, just so we don't pick an index scan that's going to
>> result in a very lopsided work distribution.
>>
>
> I guess in the above context you mean heap_pages or index_pages that
> are expected to be *fetched* during index scan.
>
> Yet another thought is that for parallel index scan we use
> index_pages_fetched, but use either a different GUC
> (min_parallel_index_rel_size) with a relatively lower default value
> (say equal to min_parallel_relation_size/4 = 2MB) or directly use
> min_parallel_relation_size/4 for parallel index scans.
>

I had some offlist discussion with Robert about the above point, and we
feel that keeping only heap pages for the parallel computation might
not be future proof, as for parallel index-only scans there might not
be any heap pages.  So, it is better to use a separate GUC for parallel
index (only) scans.  We can have two GUCs,
min_parallel_table_scan_size (8MB) and min_parallel_index_scan_size
(512kB), for computing parallel scans.  Parallel sequential scan and
parallel bitmap heap scan can use min_parallel_table_scan_size as a
threshold to compute parallel workers, as we do now.  For parallel
index scans, both min_parallel_table_scan_size and
min_parallel_index_scan_size can be used as thresholds; we can compute
parallel workers based both on the heap pages to be scanned and on the
index pages to be scanned, and then keep the minimum of those.  This
will help us engage parallel index scans when the index pages are
below the threshold but there are many heap pages to be scanned, and
will also keep a maximum cap on the number of workers based on index
scan size.
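The computation described above can be sketched as follows.  The "add one worker each time the size triples" rule and the exact constants are assumptions for illustration; toy_compute_parallel_worker is a stand-in, not the patch's actual compute_parallel_worker:

```c
#include <assert.h>

#define MIN_PARALLEL_TABLE_SCAN_SIZE  1024   /* 8MB in 8kB pages */
#define MIN_PARALLEL_INDEX_SCAN_SIZE  64     /* 512kB in 8kB pages */
#define MAX_PARALLEL_WORKERS_LIMIT    8      /* invented cap for the sketch */

/* One worker at the threshold, plus one more each time the number of
 * pages triples beyond it. */
static int
workers_for(double pages, int threshold)
{
    int     workers = 1;

    while (pages >= threshold * 3.0)
    {
        pages /= 3.0;
        workers++;
    }
    return workers;
}

/* Keep the minimum of the heap-based and index-based computations, so a
 * small index caps the worker count even when many heap pages are hit. */
static int
toy_compute_parallel_worker(double heap_pages, double index_pages)
{
    int     heap_workers;
    int     index_workers;

    if (heap_pages < MIN_PARALLEL_TABLE_SCAN_SIZE ||
        index_pages < MIN_PARALLEL_INDEX_SCAN_SIZE)
        return 0;                /* too small to bother parallelizing */

    heap_workers = workers_for(heap_pages, MIN_PARALLEL_TABLE_SCAN_SIZE);
    index_workers = workers_for(index_pages, MIN_PARALLEL_INDEX_SCAN_SIZE);

    if (index_workers < heap_workers)
        heap_workers = index_workers;
    if (heap_workers > MAX_PARALLEL_WORKERS_LIMIT)
        heap_workers = MAX_PARALLEL_WORKERS_LIMIT;
    return heap_workers;
}
```

Taking the minimum of the two is what prevents the lopsided-distribution case mentioned earlier in the thread: a tiny index keeps the worker count low no matter how many heap pages the scan is expected to touch.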

guc_parallel_index_scan_v1.patch - Changes the name of the existing
min_parallel_relation_size to min_parallel_table_scan_size and adds a
new GUC, min_parallel_index_scan_size, with a default value of 512kB.
This patch also adjusts the computation in compute_parallel_worker
based on the two GUCs.

compute_index_pages_v2.patch - This patch extracts the computation of
index pages to be scanned into a separate function and uses it in the
existing code.  You will notice that I have pulled up the logic of
conversion of clauses to indexquals from create_index_path to
build_index_paths, as that is required to compute the number of index
and heap pages to be scanned by the scan in patch
parallel_index_opt_exec_support_v8.patch.  This doesn't impact any
existing functionality.

parallel_index_scan_v7 - Patch to parallelize btree scans; nothing has
changed from the previous version (just rebased on the latest head).

parallel_index_opt_exec_support_v8.patch - Contains changes to compute
parallel workers using both the heap and index pages that need to be
scanned.

guc_parallel_index_scan_v1.patch and compute_index_pages_v2.patch are
independent patches; both are required by the parallel index scan
patches.

The current set of patches handles all the reported comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
Peter Geoghegan
Date:
On Wed, Feb 8, 2017 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I had some offlist discussion with Robert about the above point and we
> feel that keeping only heap pages for parallel computation might not
> be future proof as for parallel index only scans there might not be
> any heap pages.  So, it is better to use separate GUC for parallel
> index (only) scans.  We can have two guc's
> min_parallel_table_scan_size (8MB) and min_parallel_index_scan_size
> (512kB) for computing parallel scans.  The parallel sequential scan
> and parallel bitmap heap scans can use min_parallel_table_scan_size as
> a threshold to compute parallel workers as we are doing now.  For
> parallel index scans, both min_parallel_table_scan_size and
> min_parallel_index_scan_size can be used for threshold;  We can
> compute parallel workers both based on heap_pages to be scanned and
> index_pages to be scanned and then keep the minimum of those.  This
> will help us to engage parallel index scans when the index pages are
> lower than threshold but there are many heap pages to be scanned and
> will also allow keeping a maximum cap on the number of workers based
> on index scan size.

What about parallel CREATE INDEX? The patch currently uses
min_parallel_relation_size as an input into the optimizer's custom
cost model. I had wondered if that made sense. Note that another such
input is the projected size of the final index. That's the thing that
increases at logarithmic intervals as there is a linear increase in
the number of workers assigned to the operation (so it's not the size
of the underlying table).

-- 
Peter Geoghegan



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Feb 9, 2017 at 12:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Feb 8, 2017 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I had some offlist discussion with Robert about the above point and we
>> feel that keeping only heap pages for parallel computation might not
>> be future proof as for parallel index only scans there might not be
>> any heap pages.  So, it is better to use separate GUC for parallel
>> index (only) scans.  We can have two guc's
>> min_parallel_table_scan_size (8MB) and min_parallel_index_scan_size
>> (512kB) for computing parallel scans.  The parallel sequential scan
>> and parallel bitmap heap scans can use min_parallel_table_scan_size as
>> a threshold to compute parallel workers as we are doing now.  For
>> parallel index scans, both min_parallel_table_scan_size and
>> min_parallel_index_scan_size can be used for threshold;  We can
>> compute parallel workers both based on heap_pages to be scanned and
>> index_pages to be scanned and then keep the minimum of those.  This
>> will help us to engage parallel index scans when the index pages are
>> lower than threshold but there are many heap pages to be scanned and
>> will also allow keeping a maximum cap on the number of workers based
>> on index scan size.
>
> What about parallel CREATE INDEX? The patch currently uses
> min_parallel_relation_size as an input into the optimizer's custom
> cost model. I had wondered if that made sense. Note that another such
> input is the projected size of the final index.
>

If the projected index size is available, then I think CREATE INDEX
can also use a somewhat similar formula where we cap the maximum
number of workers based on the size of the index.  Now, I am not sure
whether the GUC threshold values used for scans are realistic for the
CREATE INDEX operation.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 5:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> What about parallel CREATE INDEX? The patch currently uses
>> min_parallel_relation_size as an input into the optimizer's custom
>> cost model. I had wondered if that made sense. Note that another such
>> input is the projected size of the final index.
>
> If projected index size is available, then I think Create Index can
> also use a somewhat similar formula where we cap the maximum number of
> workers based on the size of the index.  Now, I am not sure if the
> threshold values of guc's kept for the scan are realistic for Create
> Index operation.

I think that would be an abuse of the GUC, because the idea of the
existing GUC - and the new one we're proposing to create here - has
always been about the amount of data being fed into the parallel
operation.  In the case of CREATE INDEX, the resulting index is an
output, not an input.  So if I were Peter and wanted to reuse the
existing GUCs, I'd reuse the one for the table size, because that's
what is being scanned.  No index is going to get scanned.

Of course, it's possible that the sensible amount of parallelism for
CREATE INDEX is higher or lower than for other sequential scans, so
that might not be the right thing to do.  It might need its own knob,
or some other refinement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The hunk in indexam.c looks like a leftover that can be omitted.
>
> It is not a leftover hunk. Earlier, the patch has the same check
> btparallelrescan, but based on your comment up thread [1] (Why does
> btparallelrescan cater to the case where scan->parallel_scan== NULL?),
> this has been moved to indexam.c.

That's not my point.  The point is, if it's not a parallel scan,
index_parallelrescan() should never get called in the first place.  So
therefore it shouldn't need to worry about scan->parallel_scan being
NULL.  It seems possibly reasonable to put something like
Assert(scan->parallel_scan != NULL) in there, but arranging to return
without doing anything in that case seems like it just masks possible
bugs in the calling code.  And really I'm not sure we even need the
Assert.

>> I am a bit mystified about how this manages to work with array keys.
>> _bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
>> unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
>> _bt_parallel_advance_scan() won't do anything unless btps_pageStatus
>> is already BTPARALLEL_DONE.
>
> This is just to ensure that btps_arrayKeyCount is advanced once and
> btps_pageStatus is changed to BTPARALLEL_DONE once per array element.
> So it goes something like, if we have array with values [1,2,3], then
> all the workers will complete the scan with key 1 and one of them will
> mark btps_pageStatus as BTPARALLEL_DONE and then first one to hit
> _bt_parallel_advance_scan will increment the value of
> btps_arrayKeyCount, then same will happen for key 2 and 3.  It is
> quite possible that by the time one of the participant advances it
> local key, the scan for that key is already finished and we handle
> that situation in _bt_parallel_seize() with below check:
>
> if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
> *status = false;
>
> I think Rahila has also mentioned something on above lines, let us
> know if we are missing something here?  Do you want to add more
> comments in the code to explain this handling, if yes, then where (on
> top of function _bt_parallel_advance_scan)?

You know, I just misread this code.  Both of you are right, and I am wrong.

>> That
>> is a little odd.  Another, possibly related thing that is odd is that
>> when _bt_steppage() finds no tuples and decides to advance to a new
>> page again, there's a very short comment in the forward case and a
>> very long comment in the backward case:
>>
>>             /* nope, keep going */
>> vs.
>>             /*
>>              * For parallel scans, get the last page scanned as it is quite
>>              * possible that by the time we try to fetch previous page, other
>>              * worker has also decided to scan that previous page.  We could
>>              * avoid that by doing _bt_parallel_release once we have read the
>>              * current page, but it is bad to make other workers wait till we
>>              * read the page.
>>              */
>>
>> Now it seems to me that these cases are symmetric and the issues are
>> identical.  It's basically that, while we were judging whether the
>> current page has useful contents, some other process could have
>> advanced the scan (since _bt_readpage un-seized it).
>
> Yeah, but the reason of difference in comments is that for
> non-parallel backwards scans there is no code at that place to move to
> previous page and it basically relies on next call to _bt_walk_left()
> whereas for parallel-scans, we can't simply rely on _bt_walk_left().
> I have slightly modified the  comments for backward scan case, see if
> that looks better and if not, then suggest what you think is better.

Why can't we rely on _bt_walk_left?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 1:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> compute_index_pages_v2.patch - This function extracts the computation
> of index pages to be scanned in a separate function and used it in
> existing code.  You will notice that I have pulled up the logic of
> conversion of clauses to indexquals from create_index_path to
> build_index_paths as that is required to compute the number of index
> and heap pages to be scanned by scan in patch
> parallel_index_opt_exec_support_v8.patch.  This doesn't impact any
> existing functionality.

This design presupposes that every AM that might ever want to do
parallel scans is happy with genericcostestimate()'s method of
computing the number of index pages that will be fetched.  However,
that might not be true for every possible AM.  In fact, it's already
not true for BRIN, which always reads the whole index.  Now, since
we're only concerned with btree at the moment, nothing would be
immediately broken by this approach but it seems we're just setting
ourselves up for future pain.

I have what I think might be a better idea: let's make
amcostestimate() responsible for returning a suggested parallel degree
for the path via an additional out parameter.  cost_index() can then
reduce that value if it seems like not enough heap pages will be
fetched to justify the return value, or it can override it completely
if parallel_degree is set for the relation.  Then we don't need to run
this logic twice to compute the same value, and we don't violate the
AM abstraction layer.

BTW, the changes to add_partial_path() aren't needed, because an
IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
top of it, and that won't be the case with this or any other pending
patch.  If we get the ability to have a Parallel Bitmap Heap Scan that
takes a parallel index scan rather than a standard index scan as
input, then we'll need something like this.  But for now it's probably
best to just update the comments and remove the Assert().

I think you can also leave out the documentation changes from these
patches.  I'll do some general rewriting of the parallel query section
once we know exactly what capabilities we'll have in this release; I
think that will work better than trying to update them a bit at a time
for each patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The hunk in indexam.c looks like a leftover that can be omitted.
>>
>> It is not a leftover hunk. Earlier, the patch has the same check
>> btparallelrescan, but based on your comment up thread [1] (Why does
>> btparallelrescan cater to the case where scan->parallel_scan== NULL?),
>> this has been moved to indexam.c.
>
> That's not my point.  The point is, if it's not a parallel scan,
> index_parallelrescan() should never get called in the first place.  So
> therefore it shouldn't need to worry about scan->parallel_scan being
> NULL.  It seems possibly reasonable to put something like
> Assert(scan->parallel_scan != NULL) in there, but arranging to return
> without doing anything in that case seems like it just masks possible
> bugs in the calling code.  And really I'm not sure we even need the
> Assert.
>

This is just a safety check, so probably we can change it to Assert.

>
>>> That
>>> is a little odd.  Another, possibly related thing that is odd is that
>>> when _bt_steppage() finds no tuples and decides to advance to a new
>>> page again, there's a very short comment in the forward case and a
>>> very long comment in the backward case:
>>>
>>>             /* nope, keep going */
>>> vs.
>>>             /*
>>>              * For parallel scans, get the last page scanned as it is quite
>>>              * possible that by the time we try to fetch previous page, other
>>>              * worker has also decided to scan that previous page.  We could
>>>              * avoid that by doing _bt_parallel_release once we have read the
>>>              * current page, but it is bad to make other workers wait till we
>>>              * read the page.
>>>              */
>>>
>>> Now it seems to me that these cases are symmetric and the issues are
>>> identical.  It's basically that, while we were judging whether the
>>> current page has useful contents, some other process could have
>>> advanced the scan (since _bt_readpage un-seized it).
>>
>> Yeah, but the reason of difference in comments is that for
>> non-parallel backwards scans there is no code at that place to move to
>> previous page and it basically relies on next call to _bt_walk_left()
>> whereas for parallel-scans, we can't simply rely on _bt_walk_left().
>> I have slightly modified the  comments for backward scan case, see if
>> that looks better and if not, then suggest what you think is better.
>
> Why can't we rely on _bt_walk_left?
>

The reason is mentioned in the comments, but let me try to explain
with an example.  When you reach that point in the code, it means that
either the current page (assume page number 10) doesn't contain any
matching items or it is a half-dead page, both of which indicate that
we have to move to the previous page.  Now, before checking whether
the current page contains matching items, we signal the parallel
machinery (via _bt_parallel_release) to allow workers to read the
previous page (assume page number 9).  So it is quite possible that,
after deciding that the current page (page 10) doesn't contain any
matching tuples, if we directly move to the previous page (page 9) by
using _bt_walk_left, some other worker will already have read page 9.
In short, if we directly use _bt_walk_left(), then we are prone to
returning some of the values twice, as multiple workers can read the
same page.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Why can't we rely on _bt_walk_left?
>
> The reason is mentioned in comments, but let me try to explain with
> some example.  When you reach that point of code, it means that either
> the current page (assume page number is 10) doesn't contain any
> matching items or it is a half-dead page, both of which indicates that
> we have to move to the previous page.   Now, before checking if the
> current page contains matching items, we signal parallel machinery
> (via _bt_parallel_release) to allow workers to read the previous page
> (assume previous page number is 9).  So it is quite possible that
> after deciding that current page (page number 10) doesn't contain any
> matching tuples if we directly move to the previous page (in this case
> it will be 9) by using _bt_walk_left, some other worker would have
> read page 9.  In short, if we directly use _bt_walk_left(), then we
> are prone to returning some of the values twice as multiple workers
> can read the same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem.  You're supposed to seize the scan, read the current
page, walk left, store the page you find in the scan, and then release
the scan.  The entire point of that stuff is that when somebody's
advancing the scan to the next page, everybody else waits for them to
get done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Sat, Feb 11, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Why can't we rely on _bt_walk_left?
>>
>> The reason is mentioned in comments, but let me try to explain with
>> some example.  When you reach that point of code, it means that either
>> the current page (assume page number is 10) doesn't contain any
>> matching items or it is a half-dead page, both of which indicates that
>> we have to move to the previous page.   Now, before checking if the
>> current page contains matching items, we signal parallel machinery
>> (via _bt_parallel_release) to allow workers to read the previous page
>> (assume previous page number is 9).  So it is quite possible that
>> after deciding that current page (page number 10) doesn't contain any
>> matching tuples if we directly move to the previous page (in this case
>> it will be 9) by using _bt_walk_left, some other worker would have
>> read page 9.  In short, if we directly use _bt_walk_left(), then we
>> are prone to returning some of the values twice as multiple workers
>> can read the same page.
>
> But ... the entire point of the seize-and-release stuff is to avoid
> this problem.  You're suppose to seize the scan, read the current
> page, walk left, store the page you find in the scan, and then release
> the scan.
>

Exactly, and that is what is done in the patch.  Basically, if we find
that the current page is half-dead or doesn't contain any matching
items, then we release the current buffer, seize the scan, read the
current page, walk left, and so on.  I am slightly confused here
because it seems both of us agree on what the right thing to do is,
and according to me that is how it is implemented.  Are you just
confirming whether I have implemented it as discussed, or do you see a
problem with the way it is implemented?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The hunk in indexam.c looks like a leftover that can be omitted.
>>
>> It is not a leftover hunk. Earlier, the patch has the same check
>> btparallelrescan, but based on your comment up thread [1] (Why does
>> btparallelrescan cater to the case where scan->parallel_scan== NULL?),
>> this has been moved to indexam.c.
>
> That's not my point.  The point is, if it's not a parallel scan,
> index_parallelrescan() should never get called in the first place.  So
> therefore it shouldn't need to worry about scan->parallel_scan being
> NULL.  It seems possibly reasonable to put something like
> Assert(scan->parallel_scan != NULL) in there, but arranging to return
> without doing anything in that case seems like it just masks possible
> bugs in the calling code.  And really I'm not sure we even need the
> Assert.
>

I have removed the check from index_parallelrescan() and ensured in
the caller that it is called only when a parallel_scan descriptor is
present.

Comments from another e-mail:
> This design presupposes that every AM that might ever want to do
> parallel scans is happy with genericcostestimate()'s method of
> computing the number of index pages that will be fetched.  However,
> that might not be true for every possible AM.  In fact, it's already
> not true for BRIN, which always reads the whole index.  Now, since
> we're only concerned with btree at the moment, nothing would be
> immediately broken by this approach but it seems we're just setting
> ourselves up for future pain.
>

I think that, to make it future-proof, we could add a generic page
estimation function.  However, I have tried an implementation based on
your suggestion below; see if that looks better to you, otherwise we
can introduce a generic page estimation API.

> I have what I think might be a better idea: let's make
> amcostestimate() responsible for returning a suggested parallel degree
> for the path via an additional out parameter.  cost_index() can then
> reduce that value if it seems like not enough heap pages will be
> fetched to justify the return value, or it can override it completely
> if parallel_degree is set for the relation.
>

I see the value of your idea, but I think it might be better to return
the number of index pages required to be scanned from amcostestimate()
and then use the already-computed value of heap_pages in cost_index()
to identify the number of workers.  This will keep the calculation
simple and avoid overriding the return value.  Now, the returned value
of index pages will only be used when a partial path is being
constructed, but I think that is okay, because we are not doing any
extra calculation to compute the number of index pages fetched.
Another argument could be that the number of index pages used for
parallelism can be different from the number of pages to be scanned,
or whatever is used in the cost computation, but I think that is also
easy to change later, when we create partial paths for indexes other
than btree.  I have implemented the above idea in the attached patch
(parallel_index_opt_exec_support_v9.patch).

>  Then we don't need to run
> this logic twice to compute the same value, and we don't violate the
> AM abstraction layer.
>

We can avoid computing the same value twice; however, with your
suggested approach, we have to do all the additional work even for the
cases where employing parallel workers is not beneficial, so I am not
sure there is a net gain.

> BTW, the changes to add_partial_path() aren't needed, because an
> IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
> top of it, and that won't be the case with this or any other pending
> patch.  If we get the ability to have a Parallel Bitmap Heap Scan that
> takes a parallel index scan rather than a standard index scan as
> input, then we'll need something like this.  But for now it's probably
> best to just update the comments and remove the Assert().
>

Right, changed as per suggestion.

> I think you can also leave out the documentation changes from these
> patches.  I'll do some general rewriting of the parallel query section
> once we know exactly what capabilities we'll have in this release; I
> think that will work better than trying to update them a bit at a time
> for each patch.
>

Okay, removed the documentation part.

Patches to be used: guc_parallel_index_scan_v1.patch [1],
parallel_index_scan_v8.patch, parallel_index_opt_exec_support_v9.patch


[1] - https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Sat, Feb 11, 2017 at 6:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> Why can't we rely on _bt_walk_left?
>>> The reason is mentioned in comments, but let me try to explain with
>>> some example.  When you reach that point of code, it means that either
>>> the current page (assume page number is 10) doesn't contain any
>>> matching items or it is a half-dead page, both of which indicates that
>>> we have to move to the previous page.   Now, before checking if the
>>> current page contains matching items, we signal parallel machinery
>>> (via _bt_parallel_release) to allow workers to read the previous page
>>> (assume previous page number is 9).  So it is quite possible that
>>> after deciding that current page (page number 10) doesn't contain any
>>> matching tuples if we directly move to the previous page (in this case
>>> it will be 9) by using _bt_walk_left, some other worker would have
>>> read page 9.  In short, if we directly use _bt_walk_left(), then we
>>> are prone to returning some of the values twice as multiple workers
>>> can read the same page.
>> But ... the entire point of the seize-and-release stuff is to avoid
>> this problem.  You're suppose to seize the scan, read the current
>> page, walk left, store the page you find in the scan, and then release
>> the scan.
> Exactly and that is what is done in the patch.  Basically, if we found
> that the current page is half-dead or it doesn't contain any matching
> items, then release the current buffer, seize the scan, read the
> current page, walk left and so on.  I am slightly confused here
> because it seems both of us agree on what is the right thing to do and
> according to me that is how it is implemented.  Are you just ensuring
> about whether I have implemented as discussed or do you see a problem
> with the way it is implemented?

Well, before, I thought you said that relying entirely on
_bt_walk_left couldn't work because then two people might end up
running it at the same time, and that would cause problems.  But if
you can only run _bt_walk_left while you've got the scan seized, then
that can't happen.  Evidently I'm missing something here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Mon, Feb 13, 2017 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 11, 2017 at 6:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> Why can't we rely on _bt_walk_left?
>>>> The reason is mentioned in comments, but let me try to explain with
>>>> some example.  When you reach that point of code, it means that either
>>>> the current page (assume page number is 10) doesn't contain any
>>>> matching items or it is a half-dead page, both of which indicates that
>>>> we have to move to the previous page.   Now, before checking if the
>>>> current page contains matching items, we signal parallel machinery
>>>> (via _bt_parallel_release) to allow workers to read the previous page
>>>> (assume previous page number is 9).  So it is quite possible that
>>>> after deciding that current page (page number 10) doesn't contain any
>>>> matching tuples if we directly move to the previous page (in this case
>>>> it will be 9) by using _bt_walk_left, some other worker would have
>>>> read page 9.  In short, if we directly use _bt_walk_left(), then we
>>>> are prone to returning some of the values twice as multiple workers
>>>> can read the same page.
>>> But ... the entire point of the seize-and-release stuff is to avoid
>>> this problem.  You're suppose to seize the scan, read the current
>>> page, walk left, store the page you find in the scan, and then release
>>> the scan.
>> Exactly and that is what is done in the patch.  Basically, if we found
>> that the current page is half-dead or it doesn't contain any matching
>> items, then release the current buffer, seize the scan, read the
>> current page, walk left and so on.  I am slightly confused here
>> because it seems both of us agree on what is the right thing to do and
>> according to me that is how it is implemented.  Are you just ensuring
>> about whether I have implemented as discussed or do you see a problem
>> with the way it is implemented?
>
> Well, before, I thought you said that relying entirely on
> _bt_walk_left couldn't work because then two people might end up
> running it at the same time, and that would cause problems.  But if
> you can only run _bt_walk_left while you've got the scan seized, then
> that can't happen.  Evidently I'm missing something here.
>

I think the comment at that place is not as clear as it should be.  So
how about changing it as below:

Existing comment:
--------------------------
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page.  So we
* can't rely on _bt_walk_left call.
*/

Modified comment:
--------------------------
/*
 * For parallel scans, it is quite possible that by the time we try to fetch
 * the previous page, another worker has also decided to scan that
 * previous page.  So to avoid that we need to get the last page scanned
 * from shared scan descriptor before calling _bt_walk_left.
 */


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Feb 13, 2017 at 9:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the comment at that place is not as clear as it should be.  So
> how about changing it as below:
>
> Existing comment:
> --------------------------
> /*
> * For parallel scans, get the last page scanned as it is quite
> * possible that by the time we try to fetch previous page, other
> * worker has also decided to scan that previous page.  So we
> * can't rely on _bt_walk_left call.
> */
>
> Modified comment:
> --------------------------
> /*
>  * For parallel scans, it is quite possible that by the time we try to fetch
>  * the previous page, another worker has also decided to scan that
>  * previous page.  So to avoid that we need to get the last page scanned
>  * from shared scan descriptor before calling _bt_walk_left.
>  */

That sounds way better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Tue, Feb 14, 2017 at 12:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That sounds way better.

Here's an updated patch.  Please review my changes, which include:

* Various comment updates.
* _bt_parallel_seize now unconditionally sets *pageno to P_NONE at the
beginning, instead of doing it conditionally at the end.  This seems
cleaner to me.
* I removed various BTScanPosInvalidate calls from _bt_first in places
where they followed calls to _bt_parallel_done, because I can't see
how the scan position could be valid at that point; note that
_bt_first asserts that it is invalid on entry.
* I added a _bt_parallel_done() call to _bt_first where it apparently
returned without releasing the scan; search for SK_ROW_MEMBER.  Maybe
there's something I'm missing that makes this unnecessary, but if so
there should probably be a comment here.
* I wasn't happy with the strange calling convention where
_bt_readnextpage usually gets a valid block number but not for
non-parallel backward scans.  I had a stab at fixing that so that the
block number is always valid, but I'm not entirely sure I've got the
logic right.  Please see what you think.
* I repositioned the function prototypes you added to nbtree.h to
separate the public and non-public interfaces.

I can't easily test this because your second patch doesn't apply, so
I'd appreciate it if you could have a stab at checking whether I've
broken anything in this revision.  Also, it would be good if you could
rebase the second patch.

I think this is pretty close to committable at this point.  Whether or
not I broke anything in this revision, I don't think there's too much
left to be done here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Wed, Feb 15, 2017 at 2:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 12:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That sounds way better.
>
> Here's an updated patch.  Please review my changes, which include:
>
> * Various comment updates.

1.
+ * BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
+ * to a new page; some process can start doing that.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).

/implies/indicates, to be consistent with other explanations.

2.
+ * of the scan (depending on thes can direction).  An invalid block number

/thes can/the scan

I have modified the patch to include above two changes.

3.
+ else if (pageStatus == BTPARALLEL_DONE)
+ {
+ /*
+ * We're done with this set of scankeys, but have not yet advanced
+ * to the next set.
+ */
+ status = false;
+ }

Here second part of the comment (but have not yet advanced ..) seems
to be slightly misleading because this state has nothing to do with
the advancement of scan keys.

I have not changed this because I am not sure what you have in mind.


> * _bt_parallel_seize now unconditionally sets *pageno to P_NONE at the
> beginning, instead of doing it conditionally at the end.  This seems
> cleaner to me.
> * I removed various BTScanPosInvalidate calls from _bt_first in places
> where they followed calls to _bt_parallel_done, because I can't see
> how the scan position could be valid at that point; note that
> _bt_first asserts that it is invalid on entry.
> * I added a _bt_parallel_done() call to _bt_first where it apparently
> returned without releasing the scan; search for SK_ROW_MEMBER.  Maybe
> there's something I'm missing that makes this unnecessary, but if so
> there should probably be a comment here.
> * I wasn't happy with the strange calling convention where
> _bt_readnextpage usually gets a valid block number but not for
> non-parallel backward scans.  I had a stab at fixing that so that the
> block number is always valid, but I'm not entirely sure I've got the
> logic right.  Please see what you think.
>

Looks good to me.

> * I repositioned the function prototypes you added to nbtree.h to
> separate the public and non-public interfaces.
>

I have verified all your changes and they look good to me.

> I can't easily test this because your second patch doesn't apply,

I have tried and it works for me on latest code except for one test
output file which could have been excluded.  I wonder whether you are
first applying the GUC related patch [1] before applying the optimizer
support related patch.  In any case, to avoid confusion I am attaching
all the three patches with this e-mail.

> so
> I'd appreciate it if you could have a stab at checking whether I've
> broken anything in this revision.  Also, it would be good if you could
> rebase the second patch.
>

I have rebased the optimizer/executor support related patch.


[1] - https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Here second part of the comment (but have not yet advanced ..) seems
> to be slightly misleading because this state has nothing to do with
> the advancement of scan keys.
>
> I have not changed this because I am not sure what you have in mind.

OK, I rewrote that to be (hopefully) more clear.

> I have verified all your changes and they look good to me.

Cool.  Committed.  I also changed the wait event to be BtreePage in
the docs + pg_stat_activity, and moved it into alphabetical order in
the switch and the enum.

>> I can't easily test this because your second patch doesn't apply,
>
> I have tried and it works for me on latest code except for one test
> output file which could have been excluded.  I wonder whether you are
> first applying the GUC related patch [1] before applying the optimizer
> support related patch.  In anycase, to avoid confusion I am attaching
> all the three patches with this e-mail.

Oh, duh.  I forgot about the prerequisite patch.  Sorry.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> support related patch.  In anycase, to avoid confusion I am attaching
> all the three patches with this e-mail.

Committed guc_parallel_index_scan_v1.patch, with changes to the
documentation and GUC descriptions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:>
>> support related patch.  In anycase, to avoid confusion I am attaching
>> all the three patches with this e-mail.
>
> Committed guc_parallel_index_scan_v1.patch, with changes to the
> documentation and GUC descriptions.

And committed parallel_index_opt_exec_support_v10.patch as well, with
a couple of minor tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:>
>>> support related patch.  In anycase, to avoid confusion I am attaching
>>> all the three patches with this e-mail.
>>
>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>> documentation and GUC descriptions.
>
> And committed parallel_index_opt_exec_support_v10.patch as well, with
> a couple of minor tweaks.
>

Thanks a lot!  I think this is a big step forward for parallelism in
PostgreSQL.  Now, we have another way to drive parallel scans and I
hope many more queries can use parallelism.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Michael Banck
Date:
Hi,

On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:>
> >>> support related patch.  In anycase, to avoid confusion I am attaching
> >>> all the three patches with this e-mail.
> >>
> >> Committed guc_parallel_index_scan_v1.patch, with changes to the
> >> documentation and GUC descriptions.
> >
> > And committed parallel_index_opt_exec_support_v10.patch as well, with
> > a couple of minor tweaks.
> 
> Thanks a lot!  I think this is a big step forward for parallelism in
> PostgreSQL.  Now, we have another way to drive parallel scans and I
> hope many more queries can use parallelism.

Shouldn't chapter 15.3 "Parallel Plans" in the documentation be
updated for this as well, or is this going to be taken care of as a
batch at the end of the development cycle, pending other added
parallelization features?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Mon, Mar 6, 2017 at 4:57 PM, Michael Banck <michael.banck@credativ.de> wrote:
> Hi,
>
> On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
>> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:>
>> >>> support related patch.  In anycase, to avoid confusion I am attaching
>> >>> all the three patches with this e-mail.
>> >>
>> >> Committed guc_parallel_index_scan_v1.patch, with changes to the
>> >> documentation and GUC descriptions.
>> >
>> > And committed parallel_index_opt_exec_support_v10.patch as well, with
>> > a couple of minor tweaks.
>>
>> Thanks a lot!  I think this is a big step forward for parallelism in
>> PostgreSQL.  Now, we have another way to drive parallel scans and I
>> hope many more queries can use parallelism.
>
> Shouldn't chapter 15.3 "Parallel Plans" in the documentation be
> updated for this as well, or is this going to be taken care of as a
> batch at the end of the development cycle, pending other added
> parallelization features?
>

Robert mentioned up thread that it is better to update it once at end
rather than with each feature.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 4:57 PM, Michael Banck <michael.banck@credativ.de> wrote:
>> Hi,
>>
>> On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
>>> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> > On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> >> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:>
>>> >>> support related patch.  In anycase, to avoid confusion I am attaching
>>> >>> all the three patches with this e-mail.
>>> >>
>>> >> Committed guc_parallel_index_scan_v1.patch, with changes to the
>>> >> documentation and GUC descriptions.
>>> >
>>> > And committed parallel_index_opt_exec_support_v10.patch as well, with
>>> > a couple of minor tweaks.
>>>
>>> Thanks a lot!  I think this is a big step forward for parallelism in
>>> PostgreSQL.  Now, we have another way to drive parallel scans and I
>>> hope many more queries can use parallelism.
>>
>> Shouldn't chapter 15.3 "Parallel Plans" in the documentation be
>> updated for this as well, or is this going to be taken care of as a
>> batch at the end of the development cycle, pending other added
>> parallelization features?
>>
>
> Robert mentioned up thread that it is better to update it once at end
> rather than with each feature.

I was going to do it after index and index-only scans and parallel
bitmap heap scan were committed, but I've been holding off on
committing parallel bitmap heap scan waiting for Andres to fix the
simplehash regressions, so maybe I should just go do an update now and
another one later once that goes in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Index Scans

From
Amit Kapila
Date:
On Mon, Mar 6, 2017 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I was going to do it after index and index-only scans and parallel
> bitmap heap scan were committed, but I've been holding off on
> committing parallel bitmap heap scan waiting for Andres to fix the
> simplehash regressions, so maybe I should just go do an update now and
> another one later once that goes in.
>

As you wish, but one point to note is that as of now parallelism for
index scans can be influenced by table level parameter
parallel_workers.  It sounds slightly awkward, but if we want to keep
that way, then maybe we can update the docs to indicate the same.
Another option is to have a separate parameter for index scans.
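To make the current behavior concrete, the table-level knob mentioned above is the existing `parallel_workers` storage parameter, which as of this thread also caps workers for parallel index scans on that table's indexes (the table name here is illustrative):

```sql
-- Limit parallel scans of this table, including Parallel Index Scan
-- paths over its indexes, to at most 4 workers.
ALTER TABLE test SET (parallel_workers = 4);
```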


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Index Scans

From
Gavin Flower
Date:
On 07/03/17 02:46, Amit Kapila wrote:
> On Mon, Mar 6, 2017 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I was going to do it after index and index-only scans and parallel
>> bitmap heap scan were committed, but I've been holding off on
>> committing parallel bitmap heap scan waiting for Andres to fix the
>> simplehash regressions, so maybe I should just go do an update now and
>> another one later once that goes in.
>>
> As you wish, but one point to note is that as of now parallelism for
> index scans can be influenced by table level parameter
> parallel_workers.  It sounds slightly awkward, but if we want to keep
> that way, then maybe we can update the docs to indicate the same.
> Another option is to have a separate parameter for index scans.
>
>
My immediate gut feeling was to have separate parameters.

On thinking about it, I think they serve different use cases.  I don't 
think of workers when I think of Index scans, and I suspect I'd find 
more reasons to keep them separate if I looked deeper.


Cheers,
Gavin