Thread: Parallel bitmap heap scan

Parallel bitmap heap scan

From
Dilip Kumar
Date:
Hi Hackers,

I would like to propose a parallel bitmap heap scan feature. After
running the TPCH benchmark, it was observed that many of the TPCH queries
use a bitmap scan (@TPCH_plan.tar.gz attached below). Keeping this
point in mind, we thought that many queries would benefit from a
parallel bitmap scan.

Robert has also pointed out the same thing in his blog post about parallel query:
http://rhaas.blogspot.in/2016/04/postgresql-96-with-parallel-query-vs.html

Currently the bitmap heap plan looks like this:
------------------------------------------------------
Bitmap Heap Scan
-> Bitmap Index Scan

After this patch:
---------------------
Parallel Bitmap Heap Scan
-> Bitmap Index Scan

As part of this work I have implemented parallel processing in the
BitmapHeapScan node. BitmapIndexScan is still not parallel.

Brief design idea:
-----------------------
#1. Shared TIDBitmap creation and initialization
  The first worker to see the parallel bitmap info state as PBM_INITIAL
becomes the leader and sets the state to PBM_INPROGRESS. All other workers
that see the state as PBM_INPROGRESS wait for the leader to complete the
TIDBitmap.

#2. At this point the TIDBitmap is ready and all workers are awake.

#3. Bitmap processing (iterate and process the pages).
In this phase each worker iterates over the page and chunk arrays and
selects heap pages one by one. If prefetch is enabled then there are two
iterators. Since multiple workers iterate over the same page and chunk
arrays we need a shared iterator, so we grab a spinlock and iterate within
the lock, so that each worker gets a different page to process.

Note: For more detail on the design, please refer to the comment above the
BitmapHeapNext API in the "parallel-bitmap-heap-scan-v1.patch" file.
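
To make step #1 concrete, here is a minimal C sketch of the leader-election
idea (the type, field, and function names below are illustrative, not
necessarily the ones used in the patch):

#include "postgres.h"
#include "storage/spin.h"

/* Possible states of the shared parallel bitmap info, as described above. */
typedef enum
{
    PBM_INITIAL,
    PBM_INPROGRESS,
    PBM_FINISHED
} PBMState;

typedef struct ParallelBitmapInfoSketch
{
    slock_t     state_mutex;    /* protects state */
    PBMState    state;
} ParallelBitmapInfoSketch;

/*
 * Return true if this worker should build the shared TIDBitmap.  Exactly one
 * caller can see PBM_INITIAL and flip it to PBM_INPROGRESS; all others will
 * later wait for the state to become PBM_FINISHED.
 */
static bool
pbms_become_leader(ParallelBitmapInfoSketch *pbminfo)
{
    bool        is_leader = false;

    SpinLockAcquire(&pbminfo->state_mutex);
    if (pbminfo->state == PBM_INITIAL)
    {
        pbminfo->state = PBM_INPROGRESS;
        is_leader = true;
    }
    SpinLockRelease(&pbminfo->state_mutex);

    return is_leader;
}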

Attached patch details:
------------------------------
1. parallel-bitmap-heap-scan-v1.patch: This is the main patch to make
bitmap heap scan node parallel aware.

2. dht-return-dsa-v1.patch:  This patch provides a new API with which we
can scan the full DHT [1] and get the dsa_pointers (relative pointers).
The dsa_pointer values can be shared with other processes. We need
this because, after the TIDBitmap is created, only one worker processes
the whole TIDBitmap and converts it into a page and chunk array. So we need
to store the generic pointers, so that later each worker can convert
them to local pointers before it starts processing.
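
As an illustration of why the relative pointers help, here is a small sketch
of my own (assuming the dsa_area API from the patches listed below;
PagetableEntry is tidbitmap.c's hash-entry type, treated as opaque here):

#include "postgres.h"
#include "utils/dsa.h"

typedef struct PagetableEntry PagetableEntry;

/*
 * A dsa_pointer is just an offset within the shared area, so it is valid in
 * every backend.  Before a worker dereferences an entry stored in the shared
 * page/chunk array, it maps the dsa_pointer into its own address space.
 */
static PagetableEntry *
entry_from_shared(dsa_area *area, dsa_pointer entry_dp)
{
    return (PagetableEntry *) dsa_get_address(area, entry_dp);
}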

My patch depends on the following patches.
------------------------------------------------------
1. conditional_variable
https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com

2. dsa_area
https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com

3. Creating a DSA area to provide work space for parallel execution
https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com

4. Hash table in dynamic shared memory (DHT) [1]
https://www.postgresql.org/message-id/CAEepm%3D0VrMt3s_REDhQv6z1pHL7FETOD7Rt9V2MQ3r-2ss2ccA%40mail.gmail.com

Order in which patches should be applied:
--------------------------------------------------------
1. conditional_variable
2. dsa_area
3. Creating a DSA area to provide work space for parallel execution
4. Hash table in dynamic shared memory.
5. dht-return-dsa-v1.patch
6. parallel-bitmap-heap-scan-v1.patch

Performance Results:
-----------------------------
Summary:
1. After this patch, I observed that 4 queries currently get a significant
improvement (Q4, Q6, Q14, Q15).
   - Q4 switches from a parallel seq scan to a parallel bitmap heap scan.
   - The other queries are converted from a regular bitmap heap scan to a
parallel bitmap heap scan.
2. The benefit is more visible at lower worker counts (up to 4); beyond that,
some of the queries select ParallelSeqScan over ParallelBitmapScan. I think
this is expected, because so far we have only made the BitmapHeap node
parallel, whereas ParallelSeqScan is completely parallel, so at higher worker
counts ParallelSeqScan is the better choice.
3. Detailed results are attached in @TPCH_PBMS.pdf.
4. Explain analyze output is attached in @TPCH_plan.tar.gz (for all
changed queries at 2 workers).

Example of a changed TPCH query plan (TPCH Q6):
----------------------------------------------------------------
On Head:
-------------

                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1558475.95..1558475.96 rows=1 width=32) (actual time=40921.437..40921.438 rows=1 loops=1)
   ->  Aggregate  (cost=1558475.95..1558475.96 rows=1 width=32) (actual time=40921.435..40921.435 rows=1 loops=1)
         ->  Bitmap Heap Scan on lineitem  (cost=291783.32..1552956.39 rows=1103911 width=12) (actual time=7032.075..38997.369 rows=1140434 loops=1)
               Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
               Rows Removed by Index Recheck: 25284232
               Heap Blocks: exact=134904 lossy=530579
               ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..291507.35 rows=1103911 width=0) (actual time=6951.408..6951.408 rows=1140434 loops=1)
                     Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
 Planning time: 1.126 ms
 Execution time: 40922.569 ms
(10 rows)

After Patch:
----------------

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1541767.60..1541767.61 rows=1 width=32) (actual time=21895.008..21895.009 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1541767.60..1541767.61 rows=1 width=32) (actual time=21895.006..21895.006 rows=1 loops=1)
         ->  Gather  (cost=1541767.38..1541767.59 rows=2 width=32) (actual time=21894.341..21894.970 rows=3 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial Aggregate  (cost=1540767.38..1540767.39 rows=1 width=32) (actual time=21890.990..21890.990 rows=1 loops=3)
                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=291783.32..1538467.56 rows=459963 width=12) (actual time=8517.126..21215.469 rows=380145 loops=3)
                           Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
                           Rows Removed by Index Recheck: 8427921
                           Heap Blocks: exact=47761 lossy=187096
                           ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..291507.35 rows=1103911 width=0) (actual time=8307.291..8307.291 rows=1140434 loops=1)
                                 Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
 Planning time: 1.173 ms
 Execution time: 21915.931 ms
(14 rows)


* Thanks to Robert Haas and Amit Kapila, for helping in design review
(off list) and many valuable inputs.
* Thanks to Thomas Munro for DSA and DHT work on which my patch is based on.
* Thanks to Rafia sabih for helping with performance test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
There is a major change in the tidbitmap.c file after the efficient hash
table commit [1], so my patch needs to be rebased.

Only parallel-bitmap-heap-scan needs to be rebased; all the other patches
can be applied on head as is.
The rebased version (v2) of parallel-bitmap-heap-scan is attached.

[1] commit: http://git.postgresql.org/pg/commitdiff/75ae538bc3168bf44475240d4e0487ee2f3bb376


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Oct 17, 2016 at 1:23 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There is major chance in tidbitmap.c file after efficient hash table
> commit [1] and my patch need to be rebased.
>
> Only parallel-bitmap-heap-scan need to be rebased, all other patch can
> be applied on head as is.
> Rebased version (v2) of parallel-bitmap-heap-scan is attached.

But what's the impact on performance?  Presumably parallel bitmap heap
scan was already slower than the non-parallel version, and that commit
presumably widens the gap.  Seems like something to worry about...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap heap scan

From
Andres Freund
Date:
Hi,

On 2016-10-07 11:46:40 +0530, Dilip Kumar wrote:
> Brief design idea:
> -----------------------
> #1. Shared TIDBitmap creation and initialization
>   First worker to see the state as parallel bitmap info as PBM_INITIAL
> become leader and set the state to PBM_INPROGRESS All other   workers
> see the state as PBM_INPROGRESS will wait for leader to complete the
> TIDBitmap.
> 
> #2 At this level TIDBitmap is ready and all workers are awake.
> 
> #3. Bitmap processing (Iterate and process the pages).
> In this phase each worker will iterate over page and chunk array and
> select heap pages one by one. If prefetch is enable then there will be
> two iterator. Since multiple worker are iterating over same page and
> chunk array we need to have a shared iterator, so we grab a spin lock
> and iterate within a lock, so that each worker get and different page
> to process.

I don't quite understand why the bitmap has to be parallel at all. As
far as I understand your approach as described here, the only thing that
needs to be shared are the iteration arrays.  Since they never need to
be resized and such, it seems to make a lot more sense to just add an
API to share those, instead of the whole underlying hash.

Greetings,

Andres Freund



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Oct 18, 2016 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> But what's the impact on performance?  Presumably parallel bitmap heap
> scan was already slower than the non-parallel version, and that commit
> presumably widens the gap.  Seems like something to worry about...

I have checked the performance on my local machine and there is no
impact on the gap.
If you look at the explain analyze output of all the queries that
benefited from parallel bitmap heap scan, the BitmapIndex node takes very
little time compared to the BitmapHeap node.


Actual execution time (ms) on head (before the efficient hash table patch):

Query      BitmapHeap node      BitmapIndex node
Q6         38997                6951
Q14        14516                569
Q15        28530                1442

Out of the 4 queries, Q4 is converted from a parallel seq scan to a parallel
bitmap scan, so there is no impact.
For Q14 and Q15 the time spent in the BitmapIndex node is < 5% of the time
spent in the BitmapHeap node. For Q6 it's 20%, but I did not see much impact
on this on my local machine. However, I will take complete performance
readings and post the data from my actual performance machine.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund <andres@anarazel.de> wrote:
> I don't quite understand why the bitmap has to be parallel at all. As
> far as I understand your approach as described here, the only thing that
> needs to be shared are the iteration arrays.  Since they never need to
> be resized and such, it seems to make a lot more sense to just add an
> API to share those, instead of the whole underlying hash.

You are right that we only share the iteration arrays. But the point is
that each entry of an iteration array is just a pointer to a hash entry.
So either we need to build the hash in shared memory (my current approach)
or we need to copy each hash element to a shared location (I think this
is going to be expensive).
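
To make the trade-off concrete, here is a rough sketch of what a shared
iteration array would have to look like (my own illustration, not the
patch's actual structure): its entries cannot be ordinary backend-local
pointers, they have to be relative dsa_pointer values, which is why the
hash itself is built in shared memory in my current approach.

#include "postgres.h"
#include "utils/dsa.h"

typedef struct SharedIterationArraySketch
{
    int         nentries;       /* number of page/chunk entries */
    dsa_pointer entries[FLEXIBLE_ARRAY_MEMBER];     /* relative pointers to
                                                     * hash entries */
} SharedIterationArraySketch;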

Let me know if I am missing something..

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Andres Freund
Date:
On 2016-10-19 09:43:10 +0530, Dilip Kumar wrote:
> On Tue, Oct 18, 2016 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > But what's the impact on performance?  Presumably parallel bitmap heap
> > scan was already slower than the non-parallel version, and that commit
> > presumably widens the gap.  Seems like something to worry about...
> 
> I have checked the performance in my local machine and there is no
> impact on the gap.

Try measuring with something more heavy on bitmap scan time
itself. E.g.
SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-12-31'::date);
or similar.  The tpch queries don't actually spend that much time in the
bitmapscan itself - the parallelization of the rest of the query is what
matters...

Greetings,

Andres Freund



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> Try measuring with something more heavy on bitmap scan time
> itself. E.g.
> SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-12-31'::date);
> or similar.  The tpch queries don't actually spend that much time in the
> bitmapscan itself - the parallization of the rest of the query is what
> matters...

Yeah, I agree.

I have tested with this query; with the exact filter condition it was
choosing a parallel sequential scan, so I have modified the filter a bit
and tested.

Tested with all default configuration on my local machine. I think I
will generate more such test cases and do detailed testing on my
performance machine.


Explain Analyze results:
---------------------------------
On Head:
------------
postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-03-31'::date);

                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=848805.90..848805.91 rows=1 width=32) (actual time=12440.165..12440.166 rows=1 loops=1)
   ->  Bitmap Heap Scan on lineitem  (cost=143372.40..834833.25 rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841 loops=1)
         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
         Rows Removed by Index Recheck: 20678739
         Heap Blocks: exact=51196 lossy=528664
         ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1093.376..1093.376 rows=5678841 loops=1)
               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
 Planning time: 0.185 ms
 Execution time: 12440.819 ms
(9 rows)

After Patch:
---------------
postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-03-31'::date);

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=792751.16..792751.17 rows=1 width=32) (actual time=6660.157..6660.157 rows=1 loops=1)
   ->  Gather  (cost=792750.94..792751.15 rows=2 width=32) (actual time=6659.378..6660.117 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=791750.94..791750.95 rows=1 width=32) (actual time=6655.941..6655.941 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on lineitem  (cost=143372.40..785929.00 rows=2328774 width=8) (actual time=1980.797..6204.232 rows=1892947 loops=3)
                     Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
                     Rows Removed by Index Recheck: 6930269
                     Heap Blocks: exact=17090 lossy=176443
                     ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1933.454..1933.454 rows=5678841 loops=1)
                           Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
 Planning time: 0.207 ms
 Execution time: 6669.195 ms
(13 rows)


Summary:
-> With the patch, overall execution is 2 times faster compared to head.
-> Bitmap creation with the patch is a bit slower compared to head, and
that's because of DHT vs. the efficient hash table.

I found one defect in the v2 patch that I introduced during the last rebase.
That is fixed in v3.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From
Amit Kapila
Date:
On Wed, Oct 19, 2016 at 9:23 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
>> Try measuring with something more heavy on bitmap scan time
>> itself. E.g.
>> SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-12-31'::date);
>> or similar.  The tpch queries don't actually spend that much time in the
>> bitmapscan itself - the parallization of the rest of the query is what
>> matters...
>
> Yeah, I agree.
>
> I have tested with this query, with exact filter condition it was
> taking parallel sequence scan, so I have modified the filter a bit and
> tested.
>
> Tested with all default configuration in my local machine. I think I
> will generate more such test cases and do detail testing in my
> performance machine.
>
>
> Explain Analyze results:
> ---------------------------------
> On Head:
> ------------
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>  QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=848805.90..848805.91 rows=1 width=32) (actual
> time=12440.165..12440.166 rows=1 loops=1)
>    ->  Bitmap Heap Scan on lineitem  (cost=143372.40..834833.25
> rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841
> loops=1)
>          Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>          Rows Removed by Index Recheck: 20678739
>          Heap Blocks: exact=51196 lossy=528664
>          ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1093.376..1093.376 rows=5678841 loops=1)
>                Index Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.185 ms
>  Execution time: 12440.819 ms
> (9 rows)
>
> After Patch:
> ---------------
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>        QUERY PLAN
>
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
> ---------
>  Finalize Aggregate  (cost=792751.16..792751.17 rows=1 width=32)
> (actual time=6660.157..6660.157 rows=1 loops=1)
>    ->  Gather  (cost=792750.94..792751.15 rows=2 width=32) (actual
> time=6659.378..6660.117 rows=3 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          ->  Partial Aggregate  (cost=791750.94..791750.95 rows=1
> width=32) (actual time=6655.941..6655.941 rows=1 loops=3)
>                ->  Parallel Bitmap Heap Scan on lineitem
> (cost=143372.40..785929.00 rows=2328774 width=8) (actual
> time=1980.797..6204.232 rows=1892947 loops=
> 3)
>                      Recheck Cond: ((l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <= '1996-03-31'::date))
>                      Rows Removed by Index Recheck: 6930269
>                      Heap Blocks: exact=17090 lossy=176443
>                      ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1933.454..1933.454 rows=5678841
> loops=1)
>                            Index Cond: ((l_shipdate >=
> '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.207 ms
>  Execution time: 6669.195 ms
> (13 rows)
>
>
> Summary:
> -> With patch overall execution is 2 time faster compared to head.
> -> Bitmap creation with patch is bit slower compared to head and thats
> because of DHT vs efficient hash table.
>

I think here the impact of the slowness due to the Bitmap Index Scan is not
very visible, as the time it takes compared to the overall time is small.
However, I think there is an advantage to using DHT, as that will allow
us to build the hash table with multiple workers using a parallel index
scan in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Amit Khandekar
Date:
On 19 October 2016 at 09:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund <andres@anarazel.de> wrote:
>> I don't quite understand why the bitmap has to be parallel at all. As
>> far as I understand your approach as described here, the only thing that
>> needs to be shared are the iteration arrays.  Since they never need to
>> be resized and such, it seems to make a lot more sense to just add an
>> API to share those, instead of the whole underlying hash.
>
> You are right that we only share iteration arrays. But only point is
> that each entry of iteration array is just a pointer to hash entry.
> So either we need to build hash in shared memory (my current approach)
> or we need to copy each hash element at shared location (I think this
> is going to be expensive).

While the discussion regarding the implementation for creating the shared
tidbitmap is going on, I am meanwhile starting with a review of the bitmap
heap scan part, i.e. nodeBitmapHeapscan.c, since this looks mostly
independent of the tidbitmap implementation.

> Brief design idea:
> -----------------------
> #1. Shared TIDBitmap creation and initialization
>   First worker to see the state as parallel bitmap info as PBM_INITIAL
> become leader and set the state to PBM_INPROGRESS All other   workers
> see the state as PBM_INPROGRESS will wait for leader to complete the
> TIDBitmap.
>
> #2 At this level TIDBitmap is ready and all workers are awake.

As far as correctness is concerned, the logic where the first worker
becomes leader while others synchronously wait, looks good. Workers
get allocated right from the beginning even though they would stay
idle for some percentage of time (5-20%?), but I guess there is
nothing we can do about it with the current parallel query
infrastructure.

In pbms_is_leader() , I didn't clearly understand the significance of
the for-loop. If it is a worker, it can call
ConditionVariablePrepareToSleep() followed by
ConditionVariableSleep(). Once it comes out of
ConditionVariableSleep(), isn't it guaranteed that the leader has
finished the bitmap ? If yes, then it looks like it is not necessary
to again iterate and go back through the pbminfo->state checking.
Also, with this, variable queuedSelf also might not be needed. But I
might be missing something here. Not sure what happens if worker calls
ConditionVariable[Prepare]Sleep() but leader has already called
ConditionVariableBroadcast(). Does the for loop have something to do
with this ? But this can happen even with the current for-loop, it
seems.


> #3. Bitmap processing (Iterate and process the pages).
> In this phase each worker will iterate over page and chunk array and
> select heap pages one by one. If prefetch is enable then there will be
> two iterator. Since multiple worker are iterating over same page and
> chunk array we need to have a shared iterator, so we grab a spin lock
> and iterate within a lock, so that each worker get and different page
> to process.

tbm_iterate() call under SpinLock :
For parallel tbm iteration, tbm_iterate() is called while SpinLock is
held. Generally we try to keep code inside Spinlock call limited to a
few lines, and that too without occurrence of a function call.
Although tbm_iterate() code itself looks safe under a spinlock, I was
checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
closer to each other. One thought is :  in tbm_iterate(), acquire the
SpinLock before the while loop that iterates over lossy chunks. Then,
if both chunk and per-page data remain, release spinlock just before
returning (the first return stmt). And then just before scanning
bitmap of an exact page, i.e. just after "if (iterator->spageptr <
tbm->npages)", save the page handle, increment iterator->spageptr,
release Spinlock, and then use the saved page handle to iterate over
the page bitmap.

prefetch_pages() call under Spinlock :
Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
would get called while under the Spinlock. These can even ereport().
One option is to use mutex lock for this purpose. But I think that
would slow things down. Moreover, the complete set of prefetch pages
would be scanned by a single worker, and others might wait for this
one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
Spinlock only while incrementing pbminfo->prefetch_pages. The rest
part viz : iterating over the prefetch pages, and doing the
PrefetchBuffer() need not be synchronised using this
pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
its own iterator spinlock. Only thing is, workers may not do the
actual PrefetchBuffer() sequentially. One of them might shoot ahead
and prefetch 3-4 pages while the other is lagging with the
sequentially lesser page number; but I believe this is fine, as long
as they all prefetch all the required blocks.
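
A rough sketch of what I have in mind (all structure, field, and function
names here are placeholders, not the patch's actual API): only the
prefetch_pages counter is advanced under prefetch_mutex, and
PrefetchBuffer() itself runs outside the spinlock.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/spin.h"

/* Placeholder shared state, standing in for the patch's ParallelBitmapInfo. */
typedef struct PrefetchStateSketch
{
    slock_t     prefetch_mutex;
    int         prefetch_pages;     /* pages claimed for prefetching so far */
    int         prefetch_target;    /* how far ahead we want to prefetch */
} PrefetchStateSketch;

/* Placeholder for the separately-locked shared iterator's "next block" call. */
extern BlockNumber shared_prefetch_next_block(void *shared_iterator);

static void
prefetch_pages_sketch(PrefetchStateSketch *ps, Relation rel,
                      void *shared_iterator)
{
    for (;;)
    {
        BlockNumber blkno;
        bool        claimed = false;

        /* Hold the spinlock only long enough to claim one page. */
        SpinLockAcquire(&ps->prefetch_mutex);
        if (ps->prefetch_pages < ps->prefetch_target)
        {
            ps->prefetch_pages++;
            claimed = true;
        }
        SpinLockRelease(&ps->prefetch_mutex);

        if (!claimed)
            break;

        /* The actual prefetch happens without holding prefetch_mutex. */
        blkno = shared_prefetch_next_block(shared_iterator);
        if (blkno == InvalidBlockNumber)
            break;
        PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
    }
}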



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Thanks for the review..

> In pbms_is_leader() , I didn't clearly understand the significance of
> the for-loop. If it is a worker, it can call
> ConditionVariablePrepareToSleep() followed by
> ConditionVariableSleep(). Once it comes out of
> ConditionVariableSleep(), isn't it guaranteed that the leader has
> finished the bitmap ? If yes, then it looks like it is not necessary
> to again iterate and go back through the pbminfo->state checking.
> Also, with this, variable queuedSelf also might not be needed. But I
> might be missing something here.

I have taken this logic from the example posted on the conditional variable thread [1]:

for (;;)
{
    ConditionVariablePrepareToSleep(cv);
    if (condition for which we are waiting is satisfied)
        break;
    ConditionVariableSleep();
}
ConditionVariableCancelSleep();

[1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

So it appears to me that even if we come out of ConditionVariableSleep(),
we need to verify our condition and only then break.

> Not sure what happens if worker calls
> ConditionVariable[Prepare]Sleep() but leader has already called
> ConditionVariableBroadcast(). Does the for loop have something to do
> with this ? But this can happen even with the current for-loop, it
> seems.
Even if the leader has already called ConditionVariableBroadcast, after
ConditionVariablePrepareToSleep we will check the condition again
before calling ConditionVariableSleep. And the condition check is under
SpinLockAcquire(&pbminfo->state_mutex).

However, I think there is one problem in my code (I think you might be
pointing at the same thing): after ConditionVariablePrepareToSleep, if
pbminfo->state is already PBM_FINISHED, I am not resetting needWait to
false, and that can lead to a problem.
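
For reference, the wait loop with that fix folded in would look roughly like
this (sketch only; it assumes the condition-variable API as eventually
committed, where ConditionVariableSleep takes the CV plus a wait-event, and
pbminfo's fields are named as in my description above):

for (;;)
{
    bool        needWait;

    ConditionVariablePrepareToSleep(&pbminfo->cv);

    /* Re-check the state under the spinlock; the leader may already be done. */
    SpinLockAcquire(&pbminfo->state_mutex);
    needWait = (pbminfo->state != PBM_FINISHED);
    SpinLockRelease(&pbminfo->state_mutex);

    if (!needWait)
        break;

    ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
}
ConditionVariableCancelSleep();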

>
>
>> #3. Bitmap processing (Iterate and process the pages).
>> In this phase each worker will iterate over page and chunk array and
>> select heap pages one by one. If prefetch is enable then there will be
>> two iterator. Since multiple worker are iterating over same page and
>> chunk array we need to have a shared iterator, so we grab a spin lock
>> and iterate within a lock, so that each worker get and different page
>> to process.
>
> tbm_iterate() call under SpinLock :
> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
> held. Generally we try to keep code inside Spinlock call limited to a
> few lines, and that too without occurrence of a function call.
> Although tbm_iterate() code itself looks safe under a spinlock, I was
> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
> closer to each other. One thought is :  in tbm_iterate(), acquire the
> SpinLock before the while loop that iterates over lossy chunks. Then,
> if both chunk and per-page data remain, release spinlock just before
> returning (the first return stmt). And then just before scanning
> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
> tbm->npages)", save the page handle, increment iterator->spageptr,
> release Spinlock, and then use the saved page handle to iterate over
> the page bitmap.

The main reason to keep the spinlock out of this function is to avoid
changes inside this function; also, this function takes a local iterator as
input, which doesn't have a spinlock reference in it. But that can be
changed; we can pass the shared iterator to it.

I will think about this logic and try to update it in the next version.

>
> prefetch_pages() call under Spinlock :
> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
> would get called while under the Spinlock. These can even ereport().
> One option is to use mutex lock for this purpose. But I think that
> would slow things down. Moreover, the complete set of prefetch pages
> would be scanned by a single worker, and others might wait for this
> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
> part viz : iterating over the prefetch pages, and doing the
> PrefetchBuffer() need not be synchronised using this
> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
> its own iterator spinlock. Only thing is, workers may not do the
> actual PrefetchBuffer() sequentially. One of them might shoot ahead
> and prefetch 3-4 pages while the other is lagging with the
> sequentially lesser page number; but I believe this is fine, as long
> as they all prefetch all the required blocks.

I agree with your point, will try to fix this as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Oct 19, 2016 at 11:53 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I found one defect in v2 patch, that I induced during last rebasing.
> That is fixed in v3.

So, I had a brief look at this tonight.  This is not a full review,
but just some things I noticed:

+ *        Update snpashot info in heap scan descriptor.

Typo.  Also, why should we have a function for this at all?  And if we
do have a function for this, why should it have "bm" in the name when
it's stored in heapam.c?

+ *    [PARALLEL BITMAP HEAP SCAN ALGORITHM]
+ *
+ *    #1. Shared TIDBitmap creation and initialization
+ *        a) First worker to see the state as parallel bitmap info as
+ *        PBM_INITIAL become leader and set the state to PBM_INPROGRESS
+ *        All other workers see the state as PBM_INPROGRESS will wait for
+ *        leader to complete the TIDBitmap.
+ *
+ *        Leader Worker Processing:
+ *        (Leader is responsible for creating shared TIDBitmap and create
+ *        shared page and chunk array from TIDBitmap.)
+ *            1) Create TIDBitmap using DHT.
+ *            2) Begin Iterate: convert hash table into shared page and chunk
+ *            array.
+ *            3) Restore local TIDBitmap variable information into
+ *            ParallelBitmapInfo so that other worker can see those.
+ *            4) set state to PBM_FINISHED.
+ *            5) Wake up other workers.
+ *
+ *        Other Worker Processing:
+ *            1) Wait until leader create shared TIDBitmap and shared page
+ *            and chunk array.
+ *            2) Attach to shared page table, copy TIDBitmap from
+ *            ParallelBitmapInfo to local TIDBitmap, we copy this to local
+ *            TIDBitmap so that next level processing can read information
+ *            same as in non parallel case and we can avoid extra changes
+ *            in code.
+ *
+ *    # At this level TIDBitmap is ready and all workers are awake #
+ *
+ *    #2. Bitmap processing (Iterate and process the pages).
+ *        . In this phase each worker will iterate over page and
chunk array and
+ *        select heap pages one by one. If prefetch is enable then there will
+ *        be two iterator.
+ *        . Since multiple worker are iterating over same page and chunk array
+ *        we need to have a shared iterator, so we grab a spin lock and iterate
+ *        within a lock.

The formatting of this comment is completely haphazard.  "Leader
worker" is not a term that has any meaning.  A given backend involved
in a parallel query is either a leader or a worker, not both.

+    /* reset parallel bitmap scan info, if present */
+    if (node->parallel_bitmap)
+    {
+        ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
+
+        pbminfo->state = PBM_INITIAL;
+        pbminfo->tbmiterator.schunkbit = 0;
+        pbminfo->tbmiterator.spageptr = 0;
+        pbminfo->tbmiterator.schunkptr = 0;
+        pbminfo->prefetch_iterator.schunkbit = 0;
+        pbminfo->prefetch_iterator.spageptr = 0;
+        pbminfo->prefetch_iterator.schunkptr = 0;
+        pbminfo->prefetch_pages = 0;
+        pbminfo->prefetch_target = -1;
+    }

This is obviously not going to work in the face of concurrent
activity.  When we did Parallel Seq Scan, we didn't make any changes
to the rescan code at all.  I think we are assuming that there's no
way to cause a rescan of the driving table of a parallel query; if
that's wrong, we'll need some fix, but this isn't it.  I would just
leave this out.

+static bool
+pbms_is_leader(ParallelBitmapInfo *pbminfo)

I think you should see if you can use Thomas Munro's barrier stuff for
this instead.

+    SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
+
+    shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
+
+    node->parallel_bitmap = pbminfo;
+    snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
+
+    heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);

This doesn't make any sense.  You serialize the snapshot from the
estate, then restore it, then shove it into the scan descriptor. But
presumably that's already the snapshot the scan descriptor is using.
The workers need to do this, perhaps, but not the leader!

+    dht_parameters params = {0};

Not PostgreSQL style.

+#define TBM_IS_SHARED(tbm) (tbm)->shared

Seems pointless.

+    bool            shared;        /* need to build shared tbm if set*/

Style.

+    params.tranche_id = LWLockNewTrancheId();

You absolutely, positively cannot burn through tranche IDs like this.

+    if (tbm->shared_pagetable)
+        dht_detach(tbm->shared_pagetable);

Hmm, would we leak references if we errored out?

@@ -47,7 +47,6 @@ typedef enum

 static List *translate_sub_tlist(List *tlist, int relid);

-
 /*****************************************************************************
  *        MISC. PATH UTILITIES
  *****************************************************************************/

Useless whitespace change.

@@ -23,7 +23,6 @@
 #include "utils/relcache.h"
 #include "utils/snapshot.h"

-
 /* "options" flag bits for heap_insert */
 #define HEAP_INSERT_SKIP_WAL    0x0001
 #define HEAP_INSERT_SKIP_FSM    0x0002

Useless whitespace change.

     WAIT_EVENT_MQ_RECEIVE,
     WAIT_EVENT_MQ_SEND,
     WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP

Missing a documentation update.

In general, the amount of change in nodeBitmapHeapScan.c seems larger
than I would have expected.  My copy of that file has 655 lines; this
patch adds 544 additional lines.  I think/hope that some of that can
be simplified away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Nov 23, 2016 at 7:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, I had a brief look at this tonight.  This is not a full review,
> but just some things I noticed:

Thanks for the review..

>
> + *        Update snpashot info in heap scan descriptor.
>
> Typo.  Also, why should we have a function for this at all?  And if we
> do have a function for this, why should it have "bm" in the name when
> it's stored in heapam.c?
We are updating the snapshot in the HeapScanDesc; that's the reason I used a
function and kept it in the heapam.c file, like the other function
heap_beginscan_bm.
But I think we can change its name; "bm" is not required in it, as this
function doesn't do anything specific to bm. I will change this.

>
> + *    [PARALLEL BITMAP HEAP SCAN ALGORITHM]
> + *
> + *    #1. Shared TIDBitmap creation and initialization
> + *        a) First worker to see the state as parallel bitmap info as
> + *        PBM_INITIAL become leader and set the state to PBM_INPROGRESS
> + *        All other workers see the state as PBM_INPROGRESS will wait for
> + *        leader to complete the TIDBitmap.
> + *
> + *        Leader Worker Processing:
> + *        (Leader is responsible for creating shared TIDBitmap and create
> + *        shared page and chunk array from TIDBitmap.)
> + *            1) Create TIDBitmap using DHT.
> + *            2) Begin Iterate: convert hash table into shared page and chunk
> + *            array.
> + *            3) Restore local TIDBitmap variable information into
> + *            ParallelBitmapInfo so that other worker can see those.
> + *            4) set state to PBM_FINISHED.
> + *            5) Wake up other workers.
> + *
> + *        Other Worker Processing:
> + *            1) Wait until leader create shared TIDBitmap and shared page
> + *            and chunk array.
> + *            2) Attach to shared page table, copy TIDBitmap from
> + *            ParallelBitmapInfo to local TIDBitmap, we copy this to local
> + *            TIDBitmap so that next level processing can read information
> + *            same as in non parallel case and we can avoid extra changes
> + *            in code.
> + *
> + *    # At this level TIDBitmap is ready and all workers are awake #
> + *
> + *    #2. Bitmap processing (Iterate and process the pages).
> + *        . In this phase each worker will iterate over page and
> chunk array and
> + *        select heap pages one by one. If prefetch is enable then there will
> + *        be two iterator.
> + *        . Since multiple worker are iterating over same page and chunk array
> + *        we need to have a shared iterator, so we grab a spin lock and iterate
> + *        within a lock.
>
> The formatting of this comment is completely haphazard.

I will fix this..

"Leader
> worker" is not a term that has any meaning.  A given backend involved
> in a parallel query is either a leader or a worker, not both.

I agree this is confusing, but we can't directly call it a leader
because IMHO by a leader we mean the one who actually spawns all the workers
and gathers the results. But here by "leader worker" I meant the worker
which takes responsibility for building the tidbitmap, which can be any
worker. So I named it "leader worker". Let me think about what we can call
it.

>
> +    /* reset parallel bitmap scan info, if present */
> +    if (node->parallel_bitmap)
> +    {
> +        ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
> +
> +        pbminfo->state = PBM_INITIAL;
> +        pbminfo->tbmiterator.schunkbit = 0;
> +        pbminfo->tbmiterator.spageptr = 0;
> +        pbminfo->tbmiterator.schunkptr = 0;
> +        pbminfo->prefetch_iterator.schunkbit = 0;
> +        pbminfo->prefetch_iterator.spageptr = 0;
> +        pbminfo->prefetch_iterator.schunkptr = 0;
> +        pbminfo->prefetch_pages = 0;
> +        pbminfo->prefetch_target = -1;
> +    }
>
> This is obviously not going to work in the face of concurrent
> activity.  When we did Parallel Seq Scan, we didn't make any changes
> to the rescan code at all.  I think we are assuming that there's no
> way to cause a rescan of the driving table of a parallel query; if
> that's wrong, we'll need some fix, but this isn't it.  I would just
> leave this out.

In the heap_rescan function we have a similar change:

if (scan->rs_parallel != NULL)
{
    parallel_scan = scan->rs_parallel;
    SpinLockAcquire(&parallel_scan->phs_mutex);
    parallel_scan->phs_cblock = parallel_scan->phs_startblock;
    SpinLockRelease(&parallel_scan->phs_mutex);
}

And this is not for the driving table; it's required for the case when
Gather is the inner node of a JOIN.
Consider the example below. I know it's a bad plan but we can produce this plan :)

Outer Node                Inner Node
SeqScan(t1)   NLJ   (Gather -> Parallel SeqScan (t2))

The reason for doing the same here is that during ExecReScanGather we don't
recreate a new DSM; instead we just reinitialize the existing DSM
(ExecParallelReinitialize).

>
> +static bool
> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>
> I think you should see if you can use Thomas Munro's barrier stuff for
> this instead.

Okay, I will think over it.

>
> +    SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
> +
> +    shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
> +
> +    node->parallel_bitmap = pbminfo;
> +    snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
> +
> +    heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
>
> This doesn't make any sense.  You serialize the snapshot from the
> estate, then restore it, then shove it into the scan descriptor. But
> presumably that's already the snapshot the scan descriptor is using.
> The workers need to do this, perhaps, but not the leader!

This is wrong, will fix this.

>
> +    dht_parameters params = {0};
>
> Not PostgreSQL style.

I will fix..
>
> +#define TBM_IS_SHARED(tbm) (tbm)->shared
>
> Seems pointless.
Ok..
>
> +    bool            shared;        /* need to build shared tbm if set*/
>
> Style.

Ok.
>
> +    params.tranche_id = LWLockNewTrancheId();
>
> You absolutely, positively cannot burn through tranche IDs like this.
>
> +    if (tbm->shared_pagetable)
> +        dht_detach(tbm->shared_pagetable);
>
> Hmm, would we leak references if we errored out?

I will check on this part. Anyway, in my next version I am working on
making my patch independent of DHT, so this part will be removed.

>
> @@ -47,7 +47,6 @@ typedef enum
>
>  static List *translate_sub_tlist(List *tlist, int relid);
>
> -
>  /*****************************************************************************
>   *        MISC. PATH UTILITIES
>   *****************************************************************************/
>
> Useless whitespace change.
>
> @@ -23,7 +23,6 @@
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
>
> -
>  /* "options" flag bits for heap_insert */
>  #define HEAP_INSERT_SKIP_WAL    0x0001
>  #define HEAP_INSERT_SKIP_FSM    0x0002
>
> Useless whitespace change.
>
>      WAIT_EVENT_MQ_RECEIVE,
>      WAIT_EVENT_MQ_SEND,
>      WAIT_EVENT_PARALLEL_FINISH,
> +    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
>      WAIT_EVENT_SAFE_SNAPSHOT,
>      WAIT_EVENT_SYNC_REP
>
> Missing a documentation update.

I will fix these in the next version.
>
> In general, the amount of change in nodeBitmapHeapScan.c seems larger
> than I would have expected.  My copy of that file has 655 lines; this
> patch adds 544 additional lines.  I think/hope that some of that can
> be simplified away.

I will work on this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Nov 23, 2016 at 12:31 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I tried to address these comments in my new version. All comments are
fixed except the ones below.

>> + *
>> + *    #2. Bitmap processing (Iterate and process the pages).
>> + *        . In this phase each worker will iterate over page and
>> chunk array and
>> + *        select heap pages one by one. If prefetch is enable then there will
>> + *        be two iterator.
>> + *        . Since multiple worker are iterating over same page and chunk array
>> + *        we need to have a shared iterator, so we grab a spin lock and iterate
>> + *        within a lock.
>>
>> The formatting of this comment is completely haphazard.
>
> I will fix this..
I have changed the formatting and also moved the algorithm description
inside the function body.
I am not sure whether this meets your expectation or we should change it further.

>
>> +    /* reset parallel bitmap scan info, if present */
>> +    if (node->parallel_bitmap)
>> +    {
>> +        ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>> +
>> +        pbminfo->state = PBM_INITIAL;
>> +        pbminfo->tbmiterator.schunkbit = 0;
>> +        pbminfo->tbmiterator.spageptr = 0;
>> +        pbminfo->tbmiterator.schunkptr = 0;
>> +        pbminfo->prefetch_iterator.schunkbit = 0;
>> +        pbminfo->prefetch_iterator.spageptr = 0;
>> +        pbminfo->prefetch_iterator.schunkptr = 0;
>> +        pbminfo->prefetch_pages = 0;
>> +        pbminfo->prefetch_target = -1;
>> +    }
>>
>> This is obviously not going to work in the face of concurrent
>> activity.  When we did Parallel Seq Scan, we didn't make any changes
>> to the rescan code at all.  I think we are assuming that there's no
>> way to cause a rescan of the driving table of a parallel query; if
>> that's wrong, we'll need some fix, but this isn't it.  I would just
>> leave this out.
>
> In heap_rescan function we have similar change..
>
> if (scan->rs_parallel != NULL)
> {
>   parallel_scan = scan->rs_parallel;
>   SpinLockAcquire(&parallel_scan->phs_mutex);
>   parallel_scan->phs_cblock = parallel_scan->phs_startblock;
>   SpinLockRelease(&parallel_scan->phs_mutex);
> }
>
> And this is not for driving table, it's required for the case when
> gather is as inner node for JOIN.
> consider below example. I know it's a bad plan but we can produce this plan :)
>
> Outer Node                Inner Node
> SeqScan(t1)   NLJ   (Gather -> Parallel SeqScan (t2)).
>
> Reason for doing same is that, during ExecReScanGather we don't
> recreate new DSM instead of that we just reinitialise the DSM
> (ExecParallelReinitialize).

This is not fixed; the reason is already explained above.

>> +static bool
>> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>>
>> I think you should see if you can use Thomas Munro's barrier stuff for
>> this instead.
>
> Okay, I will think over it.

IMHO, a barrier is used when multiple workers are doing some work together
in phase 1 and, before moving to the next phase, all of them have to
complete phase 1; so we put a barrier so that all workers cross it before
starting the next phase.

But here the case is different: only one worker needs to finish phase 1,
and as soon as it's over all can start phase 2. We don't have a requirement
that all workers should cross a certain barrier. In this case, even though
some workers have not started yet, other workers can do their work.

>>
>> In general, the amount of change in nodeBitmapHeapScan.c seems larger
>> than I would have expected.  My copy of that file has 655 lines; this
>> patch adds 544 additional lines.  I think/hope that some of that can
>> be simplified away.
>
> I will work on this.

I have removed some functions which were not really required, and merged
that code into the main function. That reduced it by almost 100 lines.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Thanks for the review..

I have worked on these comments..
>
>> In pbms_is_leader() , I didn't clearly understand the significance of
>> the for-loop. If it is a worker, it can call
>> ConditionVariablePrepareToSleep() followed by
>> ConditionVariableSleep(). Once it comes out of
>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>> finished the bitmap ? If yes, then it looks like it is not necessary
>> to again iterate and go back through the pbminfo->state checking.
>> Also, with this, variable queuedSelf also might not be needed. But I
>> might be missing something here.
>
> I have taken this logic from example posted on conditional variable thread[1]
>
> for (;;)
> {
>     ConditionVariablePrepareToSleep(cv);
>     if (condition for which we are waiting is satisfied)
>         break;
>     ConditionVariableSleep();
> }
> ConditionVariableCancelSleep();
>
> [1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

>
> So it appears to me even if we come out of ConditionVariableSleep();
> we need to verify our condition and then only break.
>
> Not sure what happens if worker calls
>> ConditionVariable[Prepare]Sleep() but leader has already called
>> ConditionVariableBroadcast(). Does the for loop have something to do
>> with this ? But this can happen even with the current for-loop, it
>> seems.
> If leader has already called ConditionVariableBroadcast, but after
> ConditionVariablePrepareToSleep we will check the condition again
> before calling ConditionVariableSleep. And condition check is under
> SpinLockAcquire(&pbminfo->state_mutex);
>
> However I think there is one problem in my code (I think you might be
> pointing same), that after ConditionVariablePrepareToSleep, if
> pbminfo->state is already PBM_FINISHED, I am not resetting needWait to
> false, and which can lead to the problem.

I have fixed the defect that I mentioned above.

>
>>
>>
>>> #3. Bitmap processing (Iterate and process the pages).
>>> In this phase each worker will iterate over page and chunk array and
>>> select heap pages one by one. If prefetch is enable then there will be
>>> two iterator. Since multiple worker are iterating over same page and
>>> chunk array we need to have a shared iterator, so we grab a spin lock
>>> and iterate within a lock, so that each worker get and different page
>>> to process.
>>
>> tbm_iterate() call under SpinLock :
>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
>> held. Generally we try to keep code inside Spinlock call limited to a
>> few lines, and that too without occurrence of a function call.
>> Although tbm_iterate() code itself looks safe under a spinlock, I was
>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
>> closer to each other. One thought is :  in tbm_iterate(), acquire the
>> SpinLock before the while loop that iterates over lossy chunks. Then,
>> if both chunk and per-page data remain, release spinlock just before
>> returning (the first return stmt). And then just before scanning
>> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>> tbm->npages)", save the page handle, increment iterator->spageptr,
>> release Spinlock, and then use the saved page handle to iterate over
>> the page bitmap.
>
> Main reason to keep Spin lock out of this function to avoid changes
> inside this function, and also this function takes local iterator as
> input which don't have spin lock reference to it. But that can be
> changed, we can pass shared iterator to it.
>
> I will think about this logic and try to update in next version.
This issue is still not addressed.
The logic inside tbm_iterate uses the same variables, like spageptr, in
multiple places. IMHO this complete logic needs to be done under one
spin lock; see the sketch below.
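
Something along these lines is what I have in mind (only a sketch; the
ParallelIterator type, its mutex field, and the function shape are
illustrative rather than the exact patch code):

static TBMIterateResult *
pbms_parallel_iterate(TBMIterator *iterator, ParallelIterator *piterator)
{
    /* Sketch; type and field names are illustrative. */
    TBMIterateResult *output;

    SpinLockAcquire(&piterator->mutex);

    /* Adopt the current shared position. */
    iterator->spageptr = piterator->spageptr;
    iterator->schunkptr = piterator->schunkptr;
    iterator->schunkbit = piterator->schunkbit;

    /* Advance by one page (or lossy chunk member) while we hold the lock. */
    output = tbm_iterate(iterator);

    /* Publish the advanced position for the next worker. */
    piterator->spageptr = iterator->spageptr;
    piterator->schunkptr = iterator->schunkptr;
    piterator->schunkbit = iterator->schunkbit;

    SpinLockRelease(&piterator->mutex);

    return output;
}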

>
>>
>> prefetch_pages() call under Spinlock :
>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>> would get called while under the Spinlock. These can even ereport().
>> One option is to use mutex lock for this purpose. But I think that
>> would slow things down. Moreover, the complete set of prefetch pages
>> would be scanned by a single worker, and others might wait for this
>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
>> part viz : iterating over the prefetch pages, and doing the
>> PrefetchBuffer() need not be synchronised using this
>> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
>> its own iterator spinlock. Only thing is, workers may not do the
>> actual PrefetchBuffer() sequentially. One of them might shoot ahead
>> and prefetch 3-4 pages while the other is lagging with the
>> sequentially lesser page number; but I believe this is fine, as long
>> as they all prefetch all the required blocks.
>
> I agree with your point, will try to fix this as well.

I have fixed this part.
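
The approach now is roughly as below (a simplified sketch, not the exact
patch code; it reuses the shared-iterator helper sketched earlier, and
variable names are illustrative). Only the shared prefetch_pages counter
is bumped under prefetch_mutex; the actual PrefetchBuffer() call happens
outside it:

while (true)
{
    bool        do_prefetch = false;
    TBMIterateResult *tbmpre;

    SpinLockAcquire(&pbminfo->prefetch_mutex);
    if (pbminfo->prefetch_pages < pbminfo->prefetch_target)
    {
        pbminfo->prefetch_pages++;
        do_prefetch = true;
    }
    SpinLockRelease(&pbminfo->prefetch_mutex);

    if (!do_prefetch)
        break;

    /* Advance the shared prefetch iterator (itself spinlock-protected). */
    tbmpre = pbms_parallel_iterate(prefetch_iterator,
                                   &pbminfo->prefetch_iterator);
    if (tbmpre == NULL)
        break;

    PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
}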


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Robert Haas
Date:
On Sat, Nov 26, 2016 at 7:40 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> IMHO, barrier is used when multiple worker are doing some work
> together in phase1, and before moving to next phase all have to
> complete phase1, so we put barrier, so that before starting next phase
> all cross the barrier.
>
> But here case is different, only one worker need to finish the phase1,
> and as soon as it's over all can start phase2. But we don't have
> requirement that all worker should cross certain barrier. In this case
> even though some worker did not start, other worker can do their work.

I think the Barrier stuff has a process for choosing one worker to
conduct a particular phase.  So it seems like if the Barrier API is
well-designed, you should be able to use it to decide who will conduct
the index scan, and then when that's done everyone can proceed to
scanning the heap.  If that can't work for some reason, Thomas should
probably adjust his API so it does.  He's presenting that as a
generally-useful primitive...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sun, Nov 27, 2016 at 3:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the Barrier stuff has a process for choosing one worker to
> conduct a particular phase.  So it seems like if the Barrier API is
> well-designed, you should be able to use it to decide who will conduct
> the index scan, and then when that's done everyone can proceed to
> scanning the heap.  If that can't work for some reason, Thomas should
> probably adjust his API so it does.  He's presenting that as a
> generally-useful primitive...

If I understand the barrier API correctly, it has two parts:
1. BarrierInit  2. BarrierWait.

1. In BarrierInit we define how many workers (let's say nworkers) must
reach the barrier before anyone is allowed past BarrierWait.

2. BarrierWait actually makes the calling process wait until
BarrierWait has been called nworkers times.

So I am not very clear: if we call BarrierInit with nworkers=1, then the
first question is when we should call BarrierWait, because as soon as
we call BarrierWait the count reaches 1 and everyone is allowed to
proceed. So obviously it should be called once the bitmap is ready.

The second question is, if it's called only after the bitmap is ready,
then how are the other processes supposed to wait until the bitmap is
ready? If they wait using BarrierWait, that again makes the count reach
1 and everyone is allowed to proceed, which doesn't seem correct.

Correct me if I am missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Thomas Munro
Date:
On Sun, Nov 27, 2016 at 3:34 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sun, Nov 27, 2016 at 3:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the Barrier stuff has a process for choosing one worker to
>> conduct a particular phase.  So it seems like if the Barrier API is
>> well-designed, you should be able to use it to decide who will conduct
>> the index scan, and then when that's done everyone can proceed to
>> scanning the heap.  If that can't work for some reason, Thomas should
>> probably adjust his API so it does.  He's presenting that as a
>> generally-useful primitive...
>
> If I understand the barrier API correctly, It has two Part.
> 1. BarrierInit  2. BarrierWait.
>
> 1. In BarrierInit we defined that, how many worker(lets say nworkers)
> should cross the barrier, before we are allowed to cross the
> BarriedWait.
>
> 2. BarrierWait, will actually make calling process wait until
> BarrierWait is not called for nworkers times.
>
> So I am not very clear, If we call BarrierInit with nworkers=1, then
> first question is when should we call BarrierWait, because as soon as
> we call BarrierWait count will reach 1, and now everyone is allowed to
> proceed. so obviously It should be called once the Bitmap is Ready.
>
> Second question is, if it's called only after Bitmap is ready, then
> what about other process, how they are supposed to wait until bitmap
> is not ready. If they wait using BarrierWait, it again make the count
> 1 and everyone is allowed to proceed. Which doesn't seems correct.
>
> Correct me if I am missing something ?

I'm not sure if it's the right tool for this job or not and haven't
studied this patch yet.  I will.  But here is one way to use barrier.c
for something like this, based on the description above.  It's
slightly more complicated than you said because you don't know whether
the leader is going to participate or how many of the planned workers
will actually be able to start up, so there would be no way to provide
that 'participants' argument to BarrierInit and any given participant
might already have missed some of the 'BarrierWait' calls by the time
it starts running, so merely calling BarrierWait the right number of
times isn't enough to stay in sync.  So instead you do this:
   #define PBS_PHASE_INIT 0
   #define PBS_PHASE_BUILDING 1
   #define PBS_PHASE_SCANNING 2

Initialise the barrier with BarrierInit(&something->barrier, 0), which
says that you don't know how many participants there will be.

Somewhere in each participant you need to do this exactly once:
   BarrierAttach(&something->barrier);

I think you need to track whether you've called that yet and do so on
demand in your ExecBitmapHeap function.  You can't just do it in
ExecBitmapHeapInitializeWorker because the leader needs to do it too,
but *only* if it runs the plan.  Then you need something like this:
   switch (BarrierPhase(&something->barrier))
   {
   case PBS_PHASE_INIT:
       if (BarrierWait(&something->barrier, WAIT_EVENT_PBS_PHASE_INIT))
       {
           /* Serial phase that will run in only one chosen participant. */
           build_the_bitmap();
       }
       /* Fall through. */

   case PBS_PHASE_BUILDING:
       BarrierWait(&something->barrier, WAIT_EVENT_PBS_PHASE_BUILDING);
       /* Fall through. */

   case PBS_PHASE_SCANNING:
       scan_the_bitmap_and_emit_one_tuple();
   }

When a new participant arrives here, if it finds that we're still in
the INIT phase, then it enters an election to see if it can build the
bitmap; one lucky participant wins and does that, while any other
participants twiddle their thumbs at the next BarrierWait call.  If  a
new participant finds that we're already in the BUILDING phase when it
arrives, then it has missed that election and just has to wait for the
building to be completed.  Once they all agree that building has
finished, we move onto scanning.  If a new arrival finds that we're in
SCANNING phase, then it happily scans and emits tuples.  Does that
make sense?

Not sure exactly how to coordinate rescans yet, but probably with
BarrierWaitSet(&something->barrier, PBS_PHASE_INIT).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Amit Kapila
Date:
On Mon, Nov 28, 2016 at 8:11 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> When a new participant arrives here, if it finds that we're still in
> the INIT phase, then it enters an election to see if it can build the
> bitmap; one lucky participant wins and does that, while any other
> participants twiddle their thumbs at the next BarrierWait call.  If  a
> new participant finds that we're already in the BUILDING phase when it
> arrives, then it has missed that election and just has to wait for the
> building to be completed.  Once they all agree that building has
> finished, we move onto scanning.  If a new arrival finds that we're in
> SCANNING phase, then it happily scans and emits tuples.  Does that
> make sense?
>
> Not sure exactly how to coordinate rescans yet, but probably with
> BarrierWaitSet(&something->barrier, PBS_PHASE_INIT).
>

Do you think that using barriers will simplify the patch as compared
to using condition variables? Because only in that case will it make
sense to use barriers.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Thomas Munro
Date:
On Mon, Nov 28, 2016 at 3:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do you think that using barrier's will simplify the patch as compared
> to using condition variables because in that case, it will make sense
> to use barriers?

It would work, but I suppose you might call it overkill.  If they were
cooperating to build the bitmap in parallel then a barrier might look
more tempting, because then they'd all be waiting for each other to
agree that they've all finished doing that and are ready to scan.
When they're all just waiting for one guy to flip a single bit, then
it's debatable whether a barrier is any simpler than a condition
variable + a spinlock + a bit!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Nov 28, 2016 at 9:20 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> It would work, but I suppose you might call it overkill.  If they were
> cooperating to build the bitmap in parallel then a barrier might look
> more tempting, because then they'd all be waiting for each other to
> agree that they've all finished doing that and are ready to scan.
> When they're all just waiting for one guy to flip a single bit, then
> it's debatable whether a barrier is any simpler than a condition
> variable + a spinlock + a bit!

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Fri, Nov 25, 2016 at 6:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed the design to directly make it based on DSA instead of using DHT.
> In new version we no longer use DHT. Instead, of that I have made some
> change in simplehash[1], so that it can allow external allocator. In
> tidbitmap.c, I have register the allocator to simplehash and those
> allocator functions will allocate memory directly from DSA.
>
> simplehash is always using one single memory (during expand it copy
> from old memory to new memory). Which makes remaining processing very
> simple for us.
>
> In tbm_begin_iterate, we need not to scan complete hash table instead
> of that we can just process dsa memory and convert into page array and
> chunk array.
>
> I have tested the performance in my local machine and I observed that
> it's slightly better than older
> DHT based version (complete performance number will be published soon).
>
> Dependency on other patches:
> 1. dsa_area (dsa-v7.patch)
> https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com
>
> 2. Creating a DSA area to provide work space for parallel execution
> (dsa-area-for-executor-v3.patch)
> https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com
>
> patch details
> 1. hash_support_alloc_free_v1.patch [1].
> 2. parallel-bitmap-heap-scan-v3.patch

I just realised that I sent my latest patch only to Andres, instead of
replying to all.
Forwarding the same mail to Hackers.

Performance results with the new patch:
TPCH scale factor 10, work_mem 20MB, 4-socket POWER machine

Query    Head    Patch   Improvement
Q4       4811     3290   1.5x
Q6      13136     6198   2.1x
Q14      8119     5057   1.6x
Q15     25652    20143   1.2x

Explain analyze results are attached with the mail.

* I have also applied Andres's patch from the link below while taking
these numbers (both for head and for the patch).
https://www.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From
Haribabu Kommi
Date:


On Wed, Nov 30, 2016 at 4:38 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Nov 25, 2016 at 6:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed the design to directly make it based on DSA instead of using DHT.
> In new version we no longer use DHT. Instead, of that I have made some
> change in simplehash[1], so that it can allow external allocator. In
> tidbitmap.c, I have register the allocator to simplehash and those
> allocator functions will allocate memory directly from DSA.
>
> simplehash is always using one single memory (during expand it copy
> from old memory to new memory). Which makes remaining processing very
> simple for us.
>
> In tbm_begin_iterate, we need not to scan complete hash table instead
> of that we can just process dsa memory and convert into page array and
> chunk array.
>
> I have tested the performance in my local machine and I observed that
> it's slightly better than older
> DHT based version (complete performance number will be published soon).
>
> Dependency on other patches:
> 1. dsa_area (dsa-v7.patch)
> https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com
>
> 2. Creating a DSA area to provide work space for parallel execution
> (dsa-area-for-executor-v3.patch)
> https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com
>
> patch details
> 1. hash_support_alloc_free_v1.patch [1].
> 2. parallel-bitmap-heap-scan-v3.patch

I just realised that, my latest patch I just sent to Andres, instead
of replying to all.
Forwarding the same mail to Hackers.

Performance reading with new patch..
TPCH-scale factor 10. work_mem 20MB, Power 4 socket machine

Query Head     Patch Improvement
Q4       4811     3290     1.5x
Q6     13136     6198     2.1x
Q14     8119     5057     1.6x
Q15   25652   20143     1.2x

Explained analyzed results are attached with the mail..

* I have also applied Andres patch from below link, for taking this
performance (both for head and for patch).
https://www.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de


Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From
Amit Kapila
Date:
On Wed, Nov 30, 2016 at 11:08 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> patch details
>> 1. hash_support_alloc_free_v1.patch [1].
>> 2. parallel-bitmap-heap-scan-v3.patch

Few assorted comments:

1.
+ else if (needWait)
+ {
+ /* Add ourself to wait queue */
+ ConditionVariablePrepareToSleep(&pbminfo->cv);
+ queuedSelf = true;
+ }

With the committed version of condition variables, you can avoid
calling ConditionVariablePrepareToSleep().  Refer latest parallel
index scan patch [1].

2.
+         <entry><literal>ParallelBitmapScan</></entry>
+         <entry>Waiting for leader to complete the TidBitmap.</entry>
+        </row>
+        <row>

Isn't it better to write it as below:

Waiting for the leader backend to form the TidBitmap.

3.
+ * Update snpashot info in heap scan descriptor.

/snpashot/snapshot

4.#include "utils/tqual.h"
-
+#include "miscadmin.h"
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
-
+static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
 TBMIterateResult *tbmres;
-
+ ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
static int tbm_comparator(const void *left, const void *right);
-
+void * tbm_alloc_shared(Size size, void *arg);

It seems line deletes at above places are spurious.  Please check for
similar occurrences at other places in patch.

5.
+ bool is_shared; /* need to build shared tbm if set*/

space is required towards the end of the comment (set */).

6.
+ /*
+ * If we have shared TBM means we are running in parallel mode.
+ * So directly convert dsa array to page and chunk array.
+ */

I think the above comment can be simplified as "For shared TBM,
directly convert dsa array to page and chunk array"

7.
+ dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));

extra space is missing at multiple places in above line. It should be
written as below:

dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));

Similar stuff needs to be taken care at other places in the patch as
well.  I think it will be better if you run pgindent on your patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few assorted comments:

Thanks for the review

>
> 1.
> + else if (needWait)
> + {
> + /* Add ourself to wait queue */
> + ConditionVariablePrepareToSleep(&pbminfo->cv);
> + queuedSelf = true;
> + }
>
> With the committed version of condition variables, you can avoid
> calling ConditionVariablePrepareToSleep().  Refer latest parallel
> index scan patch [1].

Oh, I see, Fixed.
>
> 2.
> +         <entry><literal>ParallelBitmapScan</></entry>
> +         <entry>Waiting for leader to complete the TidBitmap.</entry>
> +        </row>
> +        <row>
>
> Isn't it better to write it as below:
>
> Waiting for the leader backend to form the TidBitmap.
Done this way..
>
> 3.
> + * Update snpashot info in heap scan descriptor.
>
> /snpashot/snapshot
Fixed
>
> 4.
>  #include "utils/tqual.h"
> -
> +#include "miscadmin.h"
>
>  static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
> -
> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
>
>   TBMIterateResult *tbmres;
> -
> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>
>  static int tbm_comparator(const void *left, const void *right);
> -
> +void * tbm_alloc_shared(Size size, void *arg);
>
> It seems line deletes at above places are spurious.  Please check for
> similar occurrences at other places in patch.
>
Fixed
> 5.
> + bool is_shared; /* need to build shared tbm if set*/
>
> space is required towards the end of the comment (set */).
Fixed
>
> 6.
> + /*
> + * If we have shared TBM means we are running in parallel mode.
> + * So directly convert dsa array to page and chunk array.
> + */
>
> I think the above comment can be simplified as "For shared TBM,
> directly convert dsa array to page and chunk array"

Done this way..
>
> 7.
> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));
>
> extra space is missing at multiple places in above line. It should be
> written as below:
>
> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));
Fixed..

>
> Similar stuff needs to be taken care at other places in the patch as
> well.  I think it will be better if you run pgindent on your patch.

I have run pgindent, and taken all the changes that were applicable to
the added code.

There is also some cleanup for "hash-support-alloc-free", so I am
attaching a new patch for that as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
Rebased on the current head.

On Tue, Dec 13, 2016 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Few assorted comments:
>
> Thanks for the review
>
>>
>> 1.
>> + else if (needWait)
>> + {
>> + /* Add ourself to wait queue */
>> + ConditionVariablePrepareToSleep(&pbminfo->cv);
>> + queuedSelf = true;
>> + }
>>
>> With the committed version of condition variables, you can avoid
>> calling ConditionVariablePrepareToSleep().  Refer latest parallel
>> index scan patch [1].
>
> Oh, I see, Fixed.
>>
>> 2.
>> +         <entry><literal>ParallelBitmapScan</></entry>
>> +         <entry>Waiting for leader to complete the TidBitmap.</entry>
>> +        </row>
>> +        <row>
>>
>> Isn't it better to write it as below:
>>
>> Waiting for the leader backend to form the TidBitmap.
> Done this way..
>>
>> 3.
>> + * Update snpashot info in heap scan descriptor.
>>
>> /snpashot/snapshot
> Fixed
>>
>> 4.
>>  #include "utils/tqual.h"
>> -
>> +#include "miscadmin.h"
>>
>>  static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
>> -
>> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
>>
>>   TBMIterateResult *tbmres;
>> -
>> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>>
>>  static int tbm_comparator(const void *left, const void *right);
>> -
>> +void * tbm_alloc_shared(Size size, void *arg);
>>
>> It seems line deletes at above places are spurious.  Please check for
>> similar occurrences at other places in patch.
>>
> Fixed
>> 5.
>> + bool is_shared; /* need to build shared tbm if set*/
>>
>> space is required towards the end of the comment (set */).
> Fixed
>>
>> 6.
>> + /*
>> + * If we have shared TBM means we are running in parallel mode.
>> + * So directly convert dsa array to page and chunk array.
>> + */
>>
>> I think the above comment can be simplified as "For shared TBM,
>> directly convert dsa array to page and chunk array"
>
> Done this way..
>>
>> 7.
>> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));
>>
>> extra space is missing at multiple places in above line. It should be
>> written as below:
>>
>> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));
> Fixed..
>
>>
>> Similar stuff needs to be taken care at other places in the patch as
>> well.  I think it will be better if you run pgindent on your patch.
>
> I have run the pgindent, and taken all the changes whichever was
> applicable for added code.
>
> There are some cleanup for "hash-support-alloc-free" also, so
> attaching a new patch for this as well.
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Andres Freund
Date:
Hi,

On 2016-12-21 10:36:13 +0530, Dilip Kumar wrote:
> diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
> index 12aedbc..b9a05e6 100644
> --- a/src/include/lib/simplehash.h
> +++ b/src/include/lib/simplehash.h
> @@ -66,6 +66,7 @@
>  #define SH_STATUS_EMPTY SH_MAKE_NAME(EMPTY)
>  #define SH_STATUS_IN_USE SH_MAKE_NAME(IN_USE)
>  #define SH_ITERATOR SH_MAKE_NAME(iterator)
> +#define SH_ALLOCATOR SH_MAKE_NAME(alloc)
>  
>  /* function declarations */
>  #define SH_CREATE SH_MAKE_NAME(create)
> @@ -90,6 +91,18 @@
>  /* generate forward declarations necessary to use the hash table */
>  #ifdef SH_DECLARE
>  
> +typedef struct SH_ALLOCATOR
> +{
> +    /* Allocation function */
> +    void     *(*HashAlloc) (Size size, void *args);
> +
> +    /* Free function */
> +    void      (*HashFree) (void *pointer, void *args);
> +
> +    /* Arguments to be passed to alloc and free functions */
> +    void     *args;
> +} SH_ALLOCATOR;
> +
>  /* type definitions */
>  typedef struct SH_TYPE
>  {
> @@ -112,6 +125,9 @@ typedef struct SH_TYPE
>      /* hash buckets */
>      SH_ELEMENT_TYPE *data;
>  
> +    /* hash allocator */
> +    SH_ALLOCATOR    *alloc;
> +
>      /* memory context to use for allocations */
>      MemoryContext ctx;
>  
> @@ -133,7 +149,8 @@ typedef struct SH_ITERATOR
>  } SH_ITERATOR;

>  /* externally visible function prototypes */
> -SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements);
> +SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements,
> +                            SH_ALLOCATOR *alloc);
>  SH_SCOPE void SH_DESTROY(SH_TYPE *tb);
>  SH_SCOPE void SH_GROW(SH_TYPE *tb, uint32 newsize);
>  SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE *tb, SH_KEY_TYPE key, bool *found);
> @@ -281,7 +298,7 @@ SH_ENTRY_HASH(SH_TYPE *tb, SH_ELEMENT_TYPE * entry)
>   * allocating required memory in the passed-in context.
>   */
>  SH_SCOPE SH_TYPE *
> -SH_CREATE(MemoryContext ctx, uint32 nelements)
> +SH_CREATE(MemoryContext ctx, uint32 nelements, SH_ALLOCATOR *alloc)
>  {
>      SH_TYPE    *tb;
>      uint64        size;
> @@ -294,9 +311,19 @@ SH_CREATE(MemoryContext ctx, uint32 nelements)
>  
>      SH_COMPUTE_PARAMETERS(tb, size);
>  
> -    tb->data = MemoryContextAllocExtended(tb->ctx,
> -                                          sizeof(SH_ELEMENT_TYPE) * tb->size,
> -                                          MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
> +    tb->alloc = alloc;
> +
> +    /*
> +     * If allocation handle is passed then use allocation function from the
> +     * handle otherwise use standard allocator.
> +     */
> +    if (tb->alloc != NULL)
> +        tb->data = tb->alloc->HashAlloc(sizeof(SH_ELEMENT_TYPE) * tb->size,
> +                                        tb->alloc->args);
> +    else
> +        tb->data = MemoryContextAllocExtended(tb->ctx,
> +                                        sizeof(SH_ELEMENT_TYPE) * tb->size,
> +                                        MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);

If we go there, it seems better to also wrap the memory context based approach
in the allocator.  This also needs docs, including a warning that just
using an allocator in shared memory does *NOT* allow the hashtable to be
used in shared memory in the general case.

Greetings,

Andres Freund



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Dec 21, 2016 at 3:23 PM, Andres Freund <andres@anarazel.de> wrote:

Sorry for the delayed response.

> If we go there, it seems better to also wrap the memory context based approach
> in the allocator.

One option is that we can keep a default allocator in simplehash, and if
the caller doesn't supply any functions we can use the default
functions, as shown below.

+/* default allocator function */
+static void *
+SH_DEFAULT_ALLOC(Size size, void *args)
+{
+ MemoryContext context = (MemoryContext) args;
+
+ return MemoryContextAllocExtended(context, size,
+  MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
+}
+
+/* default allocator function */
+static void
+SH_DEFAULT_FREE(void *pointer, void *args)
+{
+ pfree(pointer);
+}
+

+SH_CREATE(MemoryContext ctx, uint32 nelements, SH_ALLOCATOR *alloc)
 {
- SH_TYPE    *tb;
- uint64 size;
+ SH_TYPE     *tb;
+ uint64 size;

  tb = MemoryContextAllocZero(ctx, sizeof(SH_TYPE));
  tb->ctx = ctx;
@@ -294,9 +336,18 @@ SH_CREATE(MemoryContext ctx, uint32 nelements)
 SH_COMPUTE_PARAMETERS(tb, size);

- tb->data = MemoryContextAllocExtended(tb->ctx,
-  sizeof(SH_ELEMENT_TYPE) * tb->size,
-  MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
+ if (!alloc)
+ {
+ tb->alloc = palloc(sizeof(SH_ALLOCATOR));
+ tb->alloc->HashAlloc = SH_DEFAULT_ALLOC;
+ tb->alloc->HashFree = SH_DEFAULT_FREE;
+ tb->alloc->args = tb->ctx;
+ }
+ else
+ tb->alloc = alloc;
+
+ tb->data = tb->alloc->HashAlloc(sizeof(SH_ELEMENT_TYPE) * tb->size,
+ tb->alloc->args);

The other option is that we can always make the caller provide an
allocator. But that way, every new user of simplehash needs to take
care of supplying an allocator.

What is your opinion?


>This also needs docs, including a warning that just
> using an allocator in shared memory does *NOT* allow the hash table to be
> used in shared memory in the general case.

Make sense.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Dec 26, 2016 at 3:14 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Other the another option is, that we can always make caller to provide
> an allocator. But this way every new user for simple hash need to take
> care of having allocator.
>
> What is your opinion?

Attached is the new version of the patch which implements it the way I
described.

>
>
>>This also needs docs, including a warning that just
>> using an allocator in shared memory does *NOT* allow the hash table to be
>> used in shared memory in the general case.
>
> Make sense.
Added the Warning.

I have also fixed a bug in parallel bitmap heap scan
(path.parallel_workers was not initialised before calling
cost_bitmap_heap_scan in some cases, so it was using an
uninitialized value). Patch attached.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Amit Khandekar
Date:
Sorry for the delay in my next response. I still haven't exhaustively
gone through all changes, but meanwhile, below are some more points.

On 26 November 2016 at 18:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> Thanks for the review..
>
> I have worked on these comments..
>>
>>> In pbms_is_leader() , I didn't clearly understand the significance of
>>> the for-loop. If it is a worker, it can call
>>> ConditionVariablePrepareToSleep() followed by
>>> ConditionVariableSleep(). Once it comes out of
>>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>>> finished the bitmap ? If yes, then it looks like it is not necessary
>>> to again iterate and go back through the pbminfo->state checking.
>>> Also, with this, variable queuedSelf also might not be needed. But I
>>> might be missing something here.
>>
>> I have taken this logic from example posted on conditional variable thread[1]
>>
>> for (;;)
>> {
>>     ConditionVariablePrepareToSleep(cv);
>>     if (condition for which we are waiting is satisfied)
>>         break;
>>     ConditionVariableSleep();
>> }
>> ConditionVariableCancelSleep();
>>
>> [1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

With the latest patches, this looks fine to me.

>>> tbm_iterate() call under SpinLock :
>>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
>>> held. Generally we try to keep code inside Spinlock call limited to a
>>> few lines, and that too without occurrence of a function call.
>>> Although tbm_iterate() code itself looks safe under a spinlock, I was
>>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
>>> closer to each other. One thought is :  in tbm_iterate(), acquire the
>>> SpinLock before the while loop that iterates over lossy chunks. Then,
>>> if both chunk and per-page data remain, release spinlock just before
>>> returning (the first return stmt). And then just before scanning
>>> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>>> tbm->npages)", save the page handle, increment iterator->spageptr,
>>> release Spinlock, and then use the saved page handle to iterate over
>>> the page bitmap.
>>
>> Main reason to keep Spin lock out of this function to avoid changes
>> inside this function, and also this function takes local iterator as
>> input which don't have spin lock reference to it. But that can be
>> changed, we can pass shared iterator to it.
>>
>> I will think about this logic and try to update in next version.
> Still this issue is not addressed.
> Logic inside tbm_iterate is using same variable, like spageptr,
> multiple places. IMHO this complete logic needs to be done under one
> spin lock.

I think we both agree on the part that the mutex handle can be passed
to tbm_iterate() to do this. I am yet to give more thought on how
clumsy it will be if we add SpinLockRelease() calls at intermediate
places in the function rather than at the end.


>
>>
>>>
>>> prefetch_pages() call under Spinlock :
>>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
>>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>>> would get called while under the Spinlock. These can even ereport().
>>> One option is to use mutex lock for this purpose. But I think that
>>> would slow things down. Moreover, the complete set of prefetch pages
>>> would be scanned by a single worker, and others might wait for this
>>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
>>> part viz : iterating over the prefetch pages, and doing the
>>> PrefetchBuffer() need not be synchronised using this
>>> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
>>> its own iterator spinlock. Only thing is, workers may not do the
>>> actual PrefetchBuffer() sequentially. One of them might shoot ahead
>>> and prefetch 3-4 pages while the other is lagging with the
>>> sequentially lesser page number; but I believe this is fine, as long
>>> as they all prefetch all the required blocks.
>>
>> I agree with your point, will try to fix this as well.
>
> I have fixed this part.

This looks good now.


Further points below ....



===== nodeBitmapHeapscan.c ======


In BitmapHeapNext(), the following code is relevant only for tbm
returned from MultiExecProcNode().
if (!tbm || !IsA(tbm, TIDBitmap))
elog(ERROR, "unrecognized result from subplan");

So it should be moved just below MultiExecProcNode(), so that it does
not get called when it is created from tbm_create().

--------------

The BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
local variable is_leader in BitmapHeapNext() will serve the purpose,
because is_leader is used only in BitmapHeapNext().

--------------

In BitmapHeapNext(), just before tbm_restore_shared_members() is
called, we create tbm using tbm_create(). I think tbm_create() does
not make sense for shared tbm. Whatever fields are required, will be
restored in tbm_restore_shared_members(). The other fields which do
not make sense in a restored tbm are not required to be initialized
using tbm_create(). So I think tbm_restore_shared_members() itself can
call makeNode(TIDBitmap). (Also it is not required to initialize
tbm->allocator; see note below in tidbitmap.c section). So
tbm_restore_shared_members() itself can call tbm_set_parallel().
Looking at all this, it looks better to have the function name changed
to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
tbm_restore_shared_members(). The function header anyways (rightly)
says  : Attach worker to shared TID bitmap.

-------------

From what I understand, the leader worker does not have to create its
iterators before waking up the other workers, as long as it makes sure
it copies tbm fields into shared memory before waking workers. But in
the patch, tbm_begin_iterate() is called *before* the
ConditionVariableBroadcast() is called. So I feel, we can shift the
code inside the "if (node->is_leader)" to a place inside the "if
(pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
after MultiExecProcNode() call. (And we won't even need is_leader
local variable as well). This way now the other workers will start
working sooner.
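
In other words, the leader-only part could look roughly like this (a
sketch of the suggested ordering only; helper names follow this thread
and may differ from the actual patch):

/* Leader-only path, sketched; the non-leader attach path is unchanged. */
if (pbminfo == NULL || pbms_is_leader(pbminfo))
{
    tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
    if (tbm == NULL || !IsA(tbm, TIDBitmap))
        elog(ERROR, "unrecognized result from subplan");

    if (pbminfo != NULL)
    {
        /* Publish tbm fields to shared memory, then wake the others. */
        tbm_update_shared_members(tbm, pbminfo);

        SpinLockAcquire(&pbminfo->state_mutex);
        pbminfo->state = PBM_FINISHED;
        SpinLockRelease(&pbminfo->state_mutex);

        ConditionVariableBroadcast(&pbminfo->cv);
    }
}

/* Only after that does each worker build its own iterators. */
tbmiterator = tbm_begin_iterate(tbm);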



====== tidbitmap.c =======


The new #include's are not in alphabetical order.

--------------

ParallelTIDBitmap.inited is unused, and I believe, not required.

--------------

For the leader worker, the new TIDBitmap fields added for parallel bitmap
*are* valid while the TIDBitmap is being built. So the below comment should
be shifted accordingly :
/* these are valid when iterating is true: */
Better still, the shared-tbm-related fields can be kept at the end,
with a comment added that these are for the shared tbm.

--------------

It seems, the comment below the last ParallelTIDBitmap field is not relevant :
/* table to dsa_pointer's array. */

--------------

tbm->allocator field does not seem to be required. A new allocator can
be just palloc'ed in tbm_create_pagetable(), and passed to
pagetable_create(). SH_CREATE() stores this allocator in the
SH_TYPE->alloc field, and fetches the same whenever it needs it for
calling any of the allocator functions. So we can remove the
tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
from tbm_create() to tbm_create_pagetable().
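
That is, something roughly like this (a sketch of the suggestion only;
it assumes the patch's extra allocator argument to pagetable_create(),
and exact names may differ):

static void
tbm_create_pagetable(TIDBitmap *tbm)
{
    /* Sketch of the suggestion; exact names may differ from the patch. */
    pagetable_alloc *allocator = NULL;

    if (tbm->is_shared)
    {
        /* Allocator lives as long as the hash table; simplehash keeps it. */
        allocator = palloc(sizeof(pagetable_alloc));
        allocator->HashAlloc = tbm_alloc_shared;
        allocator->HashFree = tbm_free_shared;
        allocator->args = tbm;
    }

    tbm->pagetable = pagetable_create(tbm->mcxt, 128, allocator);

    /* ... rest of tbm_create_pagetable() stays as it is ... */
}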

--------------

In tbm_free() :
I think we should call pagetable_destroy() even when tbm is shared, so
that the hash implementation gets a chance to free the hash table
structure. I understand that the hash table structure itself is not
big, so it will only be a small memory leak, but it's better to not
assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
is false. Basically, tbm bitmap implementation should deduce from the
bitmap state whether it should free the shared data, rather than
preventing a call to SH_DESTROY().

-----------

In tbm_begin_iterate(), for shared tbm, internal structures from
simplehash.h are assumed to be known. For e.g., the hash entries will
always be present in one single array, and also the entry status is
evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
mind that these things are supposed to be exposed?

I understand that the hash table handle is local to the leader worker,
and so it is not accessible to other workers. And so, we cannot use
pagetable_iterate() to scan the hash table. So, how about copying the
SH_TYPE structure and making it accessible to other workers ? If we
have something like SH_ATTACH() or SH_COPY(), this will copy the
relevant fields that are sufficient to restore the SH_TYPE structure,
and other workers can start using this hash table after assigning dsa
array back to tb->data. Something like HASH_ATTACH used in dynahash.c.
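
For illustration, such a helper could be as small as this (purely
hypothetical; simplehash.h has no SH_COPY today, and the field list here
is only what iteration appears to need):

/* Hypothetical SH_COPY(): copy the fields needed to iterate the table. */
SH_SCOPE void
SH_COPY(SH_TYPE *src_tb, SH_TYPE *dst_tb)
{
    dst_tb->size = src_tb->size;
    dst_tb->members = src_tb->members;
    dst_tb->sizemask = src_tb->sizemask;
    dst_tb->grow_threshold = src_tb->grow_threshold;

    /* dst_tb->data must be pointed at the shared entry array by the caller. */
}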

--------------

In the tbm_update_shared_members() header comments :
 * Restore leaders private tbm state to shared location. This must
 * be called before waking up the other worker.

Above can be changed to:
 * Store leader's private tbm state to shared location. This must
 * be called before waking up other workers.
 

--------------

To be consistent with other function header comments in this file, a
blank line is required between these two lines :
 * tbm_estimate_parallel_tidbitmap
 * Estimate size of shared TIDBitmap related info.

-------------

tbm->is_shared is set in both tbm_set_parallel() and
tbm_restore_shared_members(). I think it is needed only in
tbm_set_parallel().

--------------

tbm_alloc_shared() function header comments need some typo correction :
 * It allocated memory from DSA and also stores dsa_pointer in the memory
allocated => allocates

--------------

We prepend the dsa_pointer value into the shared memory data allocated
for the pagetable entries; and we save the address of the first  page
table entry in tbm->data. But the same dsa_pointer is also stored in
tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
not look necessary to prepend dsa_pointer before the page table
entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
tbm->dsa_data).

--------------

Below are my suggested changes in the algorithm comments :

@@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)
        /* --------------------------------------------------------------------
         *      Parallel Bitmap Heap Scan Algorithm
         *
-        *      First worker to see the state as parallel bitmap info as PBM_INITIAL
-        *      become leader and set the state to PBM_INPROGRESS All other workers
-        *      see the state as PBM_INPROGRESS will wait for leader to complete the
-        *      TIDBitmap.
+        *      The first worker to see the state of ParallelBitmapInfo as PBM_INITIAL
+        *      becomes the leader and sets the state to PBM_INPROGRESS. All other
+        *      workers see the state as PBM_INPROGRESS, and will wait for leader to
+        *      finish building the TIDBitmap.
         *
         *      Leader Processing:
         *        Create TIDBitmap using DSA memory.
         *        Restore local TIDBitmap variable information into
-        *        ParallelBitmapInfo so that other worker can see those.
-        *        set state to PBM_FINISHED.
+        *           ParallelBitmapInfo so that other worker can see those.
+        *        Set state to PBM_FINISHED.
         *        Wake up other workers.
         *
         *      Other Worker Processing:
-        *        Wait until leader create shared TIDBitmap.
-        *        copy TIDBitmap from info from ParallelBitmapInfo to local TIDBitmap.
+        *        Wait until leader creates shared TIDBitmap.
+        *        Copy TIDBitmap from ParallelBitmapInfo to local TIDBitmap.
         *
         *      Iterate and process the pages.
         *        a) In this phase each worker will iterate over page and chunk array
-        *               and select heap pages one by one. If prefetch is enable then
-        *               there will be two iterator.
-        *        b) Since multiple worker are iterating over same page and chunk
+        *               and select heap pages one by one. If prefetch is enabled then
+        *               there will be two iterators.
+        *        b) Since multiple workers are iterating over same page and chunk
         *               array we need to have a shared iterator, so we grab a spin
         *               lock and iterate within a lock.



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Fri, Jan 6, 2017 at 10:47 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> This looks good now.

Thanks..
>
>
> Further points below ....

Thanks for the review.


> ===== nodeBItmapHeapscan.c ======
>
>
> In BitmapHeapNext(), the following code is relevant only for tbm
> returned from MultiExecProcNode().
> if (!tbm || !IsA(tbm, TIDBitmap))
> elog(ERROR, "unrecognized result from subplan");

Fixed

> --------------
>
> BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
> local variable is_leader in BitmapHeapNext() will solve the purpose.
> This is because is_leader is used only in BitmapHeapNext().

Fixed
>
> --------------
>
> In BitmapHeapNext(), just before tbm_restore_shared_members() is
> called, we create tbm using tbm_create(). I think tbm_create() does
> not make sense for shared tbm. Whatever fields are required, will be
> restored in tbm_restore_shared_members(). The other fields which do
> not make sense in a restored tbm are not required to be initialized
> using tbm_create(). So I think tbm_restore_shared_members() itself can
> call makeNode(TIDBitmap). (Also it is not required to initialize
> tbm->allocator; see note below in tidbitmap.c section). So
> tbm_restore_shared_members() itself can call tbm_set_parallel().
> Looking at all this, it looks better to have the function name changed
> to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
> tbm_restore_shared_members(). The function header anyways (rightly)
> says  : Attach worker to shared TID bitmap.

Fixed

>
> -------------
>
> From what I understand, the leader worker does not have to create its
> iterators before waking up the other workers, as long as it makes sure
> it copies tbm fields into shared memory before waking workers. But in
> the patch, tbm_begin_iterate() is called *before* the
> ConditionVariableBroadcast() is called. So I feel, we can shift the
> code inside the "if (node->is_leader)" to a place inside the "if
> (pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
> after MultiExecProcNode() call. (And we won't even need is_leader
> local variable as well). This way now the other workers will start
> working sooner.

Correct, Fixed.

>
>
>
> ====== tidbitmap.c =======
>
>
> The new #include's are not in alphabetical order.

Done..
>
> --------------
>
> ParallelTIDBitmap.inited is unused, and I believe, not required.

Fixed
>
> --------------
>
> For leader worker, the new TIDBitmap fields added for parallel bitmap
> *are* valid while the tid is being built. So the below comment should
> be shifted accordingly :
> /* these are valid when iterating is true: */
> Better still, the shared tbm-related fields can be kept in the end,
> and a comment should be added that these are for shared tbm.
>

Done
> --------------
>
> It seems, the comment below the last ParallelTIDBitmap field is not relevant :
> /* table to dsa_pointer's array. */

Fixed..
>
> --------------
>
> tbm->allocator field does not seem to be required. A new allocator can
> be just palloc'ed in tbm_create_pagetable(), and passed to
> pagetable_create(). SH_CREATE() stores this allocator in the
> SH_TYPE->alloc field, and fetches the same whenever it needs it for
> calling any of the allocator functions. So we can remove the
> tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
> from tbm_create() to tbm_create_pagetable().

Done
>
> --------------
>
> In tbm_free() :
> I think we should call pagetable_destroy() even when tbm is shared, so
> that the hash implementation gets a chance to free the hash table
> structure. I understand that the hash table structure itself is not
> big, so it will only be a small memory leak, but it's better to not
> assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
> tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
> is false. Basically, tbm bitmap implementation should deduce from the
> bitmap state whether it should free the shared data, rather than
> preventing a call to SH_DESTROY().

Fixed
>
> -----------
>
> In tbm_begin_iterate(), for shared tbm, internal structures from
> simplehash.h are assumed to be known. For e.g., the hash entries will
> always be present in one single array, and also the entry status is
> evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
> mind that these things are suppose to be exposed ?
>
> I understand that the hash table handle is local to the leader worker,
> and so it is not accessible to other workers. And so, we cannot use
> pagetable_iterate() to scan the hash table. So, how about copying the
> SH_TYPE structure and making it accessible to other workers ? If we
> have something like SH_ATTACH() or SH_COPY(), this will copy the
> relevant fields that are sufficient to restore the SH_TYPE structure,
> and other workers can start using this hash table after assigning dsa
> array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

This looks cleaner, and also avoids processing the hash table's data
directly. I have changed it as per the suggestion.

>
> --------------
>
> In the tbm_update_shared_members() header comments :
>  * Restore leaders private tbm state to shared location. This must
>  * be called before waking up the other worker.
>
> Above can be changed to:
>  * Store leader's private tbm state to shared location. This must
>  * be called before waking up other workers.

Done

>
> --------------
>
> To be consistent with other function header comments in this file, a
> blank line is required between these two lines :
>
>  * tbm_estimate_parallel_tidbitmap
>  * Estimate size of shared TIDBitmap related info.
>
Done
> -------------
>
> tbm->is_shared is set in both tbm_set_parallel() and
> tbm_restore_shared_members(). I think it is needed only in
> tbm_set_parallel().

Done
>
> --------------
>
> tbm_alloc_shared() Function header comments need some typo correction :
>  * It allocated memory from DSA and also stores dsa_pointer in the memory
> allocated => allocates
>
> --------------
>
> We prepend the dsa_pointer value into the shared memory data allocated
> for the pagetable entries; and we save the address of the first  page
> table entry in tbm->data. But the same dsa_pointer is also stored in
> tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
> not look necessary to prepend dsa_pointer before the page table
> entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
> tbm->dsa_data).

tbm->dsa_data holds the most recently allocated dsa_pointer. The hash
table first allocates the new memory and then frees the old pointer. So
whenever it allocates new memory we store the latest dsa_pointer in
tbm->dsa_data, but when the hash table later calls free it passes only
the backend-local pointer, and we would have no reference to the
dsa_pointer of that old memory; that's the reason we store the
dsa_pointer as a header.
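
To make that concrete, the allocator callbacks work roughly like this
(a sketch, not the exact patch code; it assumes the tbm->area and
tbm->dsa_data fields discussed above):

void *
tbm_alloc_shared(Size size, void *arg)
{
    /* Sketch only; assumes tbm->area and tbm->dsa_data as in the patch. */
    TIDBitmap  *tbm = (TIDBitmap *) arg;
    dsa_pointer dp;
    char       *ptr;

    /* Reserve room for a dsa_pointer header ahead of the entry array. */
    dp = dsa_allocate(tbm->area, size + sizeof(dsa_pointer));
    ptr = (char *) dsa_get_address(tbm->area, dp);

    *((dsa_pointer *) ptr) = dp;    /* remember how to free/share it */
    tbm->dsa_data = dp;             /* latest allocation, for the workers */

    /* simplehash expects zero-initialized entry space. */
    memset(ptr + sizeof(dsa_pointer), 0, size);

    return ptr + sizeof(dsa_pointer);
}

void
tbm_free_shared(void *pointer, void *arg)
{
    TIDBitmap  *tbm = (TIDBitmap *) arg;
    char       *ptr = ((char *) pointer) - sizeof(dsa_pointer);

    dsa_free(tbm->area, *((dsa_pointer *) ptr));
}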

>
> --------------
>
> Below are my suggested changes in the algorithm comments :
>
> @@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)

Changed.

Apart from these changes, I have created a separate patch
(0001-opt-parallelcost-refactoring-v7.patch) for the optimizer code
refactoring required for parallel bitmap heap scan (earlier this was
part of my main patch).

This patch is used by 0003-parallel-bitmap-heap-scan-v7.patch attached
to this mail, and parallel-index-scan[2] can also be rebased on it if it
gets committed.

[2]
https://www.postgresql.org/message-id/CAA4eK1KthrAvNjmB2cWuUHz%2Bp3ZTTtbD7o2KUw49PC8HK4r1Wg%40mail.gmail.com

I will send the performance data with the new patch in a separate mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/09/2017 01:05 PM, Dilip Kumar wrote:
> This patch can be used by 0003-parallel-bitmap-heap-scan-v7.patch
> attached in the mail and also parallel-index-scan[2] can be rebased on
> this, if this get committed,
After applying your patches against fresh PG v10 sources, I am not
able to perform initdb:

centos@tusharcentos7 bin]$ ./initdb -D data
The files belonging to this database system will be owned by user "centos".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation fault
"/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null
child process exited with exit code 139
initdb: removing data directory "data"
[centos@tusharcentos7 bin]$

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Jan 9, 2017 at 3:07 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> creating directory data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... posix
> creating configuration files ... ok
> running bootstrap script ... ok
> performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation
> fault "/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O
> -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null
> child process exited with exit code 139
> initdb: removing data directory "data"

I have taken the latest code, applied all three patches, and compiled.
Initdb works fine for me.

Can you please verify whether you have any extra patch applied along with
mine, and whether you did a clean build?

I have asked one of my colleagues to verify this as well, and the result
is the same: no crash.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/09/2017 04:36 PM, Dilip Kumar wrote:
> I have taken the latest code, applied all 3 patches and compiled.
> Initdb is working fine for me.
>
> Can you please verify, do you have any extra patch along with my patch?
> Did you properly clean the code?
Thanks Dilip. The issue is reproducible if we use the '--enable-cassert'
switch in configure; we are able to reproduce it only with
--enable-cassert.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Jan 9, 2017 at 5:01 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> Thanks Dilip. issue is reproducible  if  we  uses '--enable-cassert' switch
> in the configure. We are able to reproduce it only with --enable-cassert' .

Thanks, Tushar. I have fixed it; the defect was in 0002. I have also
observed another issue related to the code refactoring: some code present
in 0001 was supposed to be in 0003.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/09/2017 07:22 PM, Dilip Kumar wrote:
> Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
> observed another issue related to code refactoring, Actually, there
> was some code present in 0001 which supposed to be in 0003.
Thanks, I have checked at my end and it is fixed now.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/10/2017 11:29 AM, tushar wrote:
> On 01/09/2017 07:22 PM, Dilip Kumar wrote:
>> Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
>> observed another issue related to code refactoring, Actually, there
>> was some code present in 0001 which supposed to be in 0003.
> Thanks, I have checked at my end and it is fixed now.
>
We found a regression: earlier, the test case was working fine (against
the older patches of Parallel bitmap heap scan), but now we are getting a
server crash against the v8 patches.

Testcase - (one of the table of TPC-H )

postgres=#explain analyze verbose
SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <='1996-03-31'::date);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
 
connection to server was lost

Here is the stack trace - ( Two core dump file generated)

[centos@centos-cpula bin]$  gdb -q -c data/core.25434 
/home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres
Reading symbols from 
/home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres...done.
[New Thread 25434]
Missing separate debuginfo for
Try: yum --enablerepo='*-debug*' install 
/usr/lib/debug/.build-id/7f/719af91ee951b4fcb6647e7868f95f766a616b
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/librt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: bgworker: parallel worker for PID 
25433              '.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000006f2fa6 in pagetable_destroy (tb=0x2079bf0) at 
../../../src/include/lib/simplehash.h:361
361            tb->alloc->HashFree(tb->data, tb->alloc->args);
Missing separate debuginfos, use: debuginfo-install 
glibc-2.12-1.192.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 
krb5-libs-1.10.3-57.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 
libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-48.el6_8.1.x86_64 
zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000006f2fa6 in pagetable_destroy (tb=0x2079bf0) at 
../../../src/include/lib/simplehash.h:361
#1  0x00000000006f3b52 in tbm_free (tbm=0x2077fe0) at tidbitmap.c:296
#2  0x00000000006ab29b in ExecEndBitmapHeapScan (node=0x207e760) at 
nodeBitmapHeapscan.c:717
#3  0x0000000000691701 in ExecEndNode (node=0x207e760) at execProcnode.c:689
#4  0x00000000006a8f86 in ExecEndAgg (node=0x207e878) at nodeAgg.c:3563
#5  0x0000000000691800 in ExecEndNode (node=0x207e878) at execProcnode.c:755
#6  0x000000000068cb0f in ExecEndPlan (planstate=0x207e878, 
estate=0x207df58) at execMain.c:1469
#7  0x000000000068afb7 in standard_ExecutorEnd (queryDesc=0x20592e8) at 
execMain.c:470
#8  0x000000000068aef6 in ExecutorEnd (queryDesc=0x20592e8) at 
execMain.c:441
#9  0x0000000000690ba5 in ParallelQueryMain (seg=0x1fe1e58, 
toc=0x7fcbc3677000) at execParallel.c:819
#10 0x0000000000518620 in ParallelWorkerMain (main_arg=300704707) at 
parallel.c:1109
#11 0x0000000000793db2 in StartBackgroundWorker () at bgworker.c:747
#12 0x00000000007a6265 in do_start_bgworker (rw=0x20039f0) at 
postmaster.c:5560
#13 0x00000000007a659f in maybe_start_bgworker () at postmaster.c:5744
#14 0x00000000007a5706 in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:4998
#15 <signal handler called>
#16 0x0000003746ee1503 in __select_nocancel () from /lib64/libc.so.6
#17 0x00000000007a10ca in ServerLoop () at postmaster.c:1676
#18 0x00000000007a08bf in PostmasterMain (argc=3, argv=0x1fdfd20) at 
postmaster.c:1320
#19 0x00000000006e7816 in main (argc=3, argv=0x1fdfd20) at main.c:228
(gdb)


[centos@centos-cpula bin]$  gdb -q -c data/core.25435 
/home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres
Reading symbols from 
/home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres...done.
[New Thread 25435]
Missing separate debuginfo for
Try: yum --enablerepo='*-debug*' install 
/usr/lib/debug/.build-id/7f/719af91ee951b4fcb6647e7868f95f766a616b
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols 
found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/librt.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: bgworker: parallel worker for PID 
25433              '.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000006f2fa6 in pagetable_destroy (tb=0x20775a0) at 
../../../src/include/lib/simplehash.h:361
361            tb->alloc->HashFree(tb->data, tb->alloc->args);
Missing separate debuginfos, use: debuginfo-install 
glibc-2.12-1.192.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 
krb5-libs-1.10.3-57.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 
libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-48.el6_8.1.x86_64 
zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000006f2fa6 in pagetable_destroy (tb=0x20775a0) at 
../../../src/include/lib/simplehash.h:361
#1  0x00000000006f3b52 in tbm_free (tbm=0x2075990) at tidbitmap.c:296
#2  0x00000000006ab29b in ExecEndBitmapHeapScan (node=0x207c110) at 
nodeBitmapHeapscan.c:717
#3  0x0000000000691701 in ExecEndNode (node=0x207c110) at execProcnode.c:689
#4  0x00000000006a8f86 in ExecEndAgg (node=0x207c228) at nodeAgg.c:3563
#5  0x0000000000691800 in ExecEndNode (node=0x207c228) at execProcnode.c:755
#6  0x000000000068cb0f in ExecEndPlan (planstate=0x207c228, 
estate=0x207b908) at execMain.c:1469
#7  0x000000000068afb7 in standard_ExecutorEnd (queryDesc=0x20594b8) at 
execMain.c:470
#8  0x000000000068aef6 in ExecutorEnd (queryDesc=0x20594b8) at 
execMain.c:441
#9  0x0000000000690ba5 in ParallelQueryMain (seg=0x1fe1e58, 
toc=0x7fcbc3677000) at execParallel.c:819
#10 0x0000000000518620 in ParallelWorkerMain (main_arg=300704707) at 
parallel.c:1109
#11 0x0000000000793db2 in StartBackgroundWorker () at bgworker.c:747
#12 0x00000000007a6265 in do_start_bgworker (rw=0x20064a0) at 
postmaster.c:5560
#13 0x00000000007a659f in maybe_start_bgworker () at postmaster.c:5744
#14 0x00000000007a5706 in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:4998
#15 <signal handler called>
#16 0x0000003746ee1503 in __select_nocancel () from /lib64/libc.so.6
#17 0x00000000007a10ca in ServerLoop () at postmaster.c:1676
#18 0x00000000007a08bf in PostmasterMain (argc=3, argv=0x1fdfd20) at 
postmaster.c:1320
#19 0x00000000006e7816 in main (argc=3, argv=0x1fdfd20) at main.c:228
(gdb)

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Jan 10, 2017 at 2:19 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> We found a regression , earlier the testcase was working fine (against the
> older patches of Parallel bitmap heap scan)  but now getting a server crash
> against v8 patches.
>
> Testcase - (one of the table of TPC-H )
>
> postgres=#explain analyze verbose
> SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <='1996-03-31'::date);

While fixing some of the review comments in v7 and v8, I allocated new
memory for the pagetable and missed initializing it.
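
(Just to illustrate the kind of fix, assuming it amounts to requesting
zeroed memory for that allocation; the actual hunk in v9 may differ:)

/* ask for zeroed memory instead of leaving the new pagetable uninitialized */
pagetable = MemoryContextAllocExtended(CurrentMemoryContext,
                                       sizeof(pagetable_hash),
                                       MCXT_ALLOC_ZERO);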

Thanks for reporting; I have fixed it in v9. After the fix the query runs
fine for me. Please try the attached patch and confirm from your side.

postgres=# explain analyze verbose
SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <='1996-03-31'::date);

        QUERY PLAN


--------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------
 Finalize Aggregate  (cost=798002.46..798002.47 rows=1 width=32)
(actual time=15501.245..15501.245 rows=1 loops=1)
   Output: sum(l_extendedprice)
   ->  Gather  (cost=798002.24..798002.45 rows=2 width=32) (actual
time=15494.358..15498.919 rows=3 loops=1)
         Output: (PARTIAL sum(l_extendedprice))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=797002.24..797002.25 rows=1
width=32) (actual time=15492.937..15492.937 rows=1 loops=3)
               Output: PARTIAL sum(l_extendedprice)
               Worker 0: actual time=15491.218..15491.219 rows=1 loops=1
               Worker 1: actual time=15493.514..15493.514 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on public.lineitem
(cost=147461.75..791014.31 rows=2395170 width=8) (actual
time=8553.301..15061.333 rows=1892947 loops=3)
                     Output: l_extendedprice
                     Recheck Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
                     Rows Removed by Index Recheck: 6451177
                     Heap Blocks: exact=27963 lossy=164938
                     Worker 0: actual time=8548.957..15054.511
rows=1887239 loops=1
                     Worker 1: actual time=8554.817..15050.317
rows=1902477 loops=1
                     ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..146024.65 rows=5748409 width=0) (actual
time=8533.701..8533.701 rows=5678841
loops=1)
                           Index Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
 Planning time: 2.742 ms
 Execution time: 15509.696 ms
(21 rows)



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/10/2017 05:16 PM, Dilip Kumar wrote:
>   Please try attached patch and confirm from your
> side.
Thanks,issue seems to be fixed now.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

From
Rafia Sabih
Date:
On Wed, Jan 11, 2017 at 5:33 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
>
> On 01/10/2017 05:16 PM, Dilip Kumar wrote:
>>
>>   Please try attached patch and confirm from your
>> side.
>
> Thanks,issue seems to be fixed now.
>
>
> --
> regards,tushar
> EnterpriseDB  https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company
>
>
>


Hello Dilip,
I was trying to test the performance of all the parallel-query-related
patches on TPC-H queries, and I found that when parallel hash [1] and your
patch are both applied, Q10 and Q14 hang; however, without either of
these patches the queries run fine. Following are the stack traces:

At the master:

#0  0x00003fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x00000000104ea184 in WaitEventSetWaitBlock (set=0x10039b2d360,
cur_timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1) at
latch.c:998
#2  0x00000000104e9fc8 in WaitEventSetWait (set=0x10039b2d360,
timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1,
wait_event_info=134217730) at latch.c:950
#3  0x00000000104e9484 in WaitLatchOrSocket (latch=0x3fff85128ab4,
wakeEvents=1, sock=-1, timeout=-1, wait_event_info=134217730) at
latch.c:350
#4  0x00000000104e931c in WaitLatch (latch=0x3fff85128ab4,
wakeEvents=1, timeout=0, wait_event_info=134217730) at latch.c:304
#5  0x000000001032a378 in gather_readnext (gatherstate=0x10039aeeb98)
at nodeGather.c:393
#6  0x000000001032a044 in gather_getnext (gatherstate=0x10039aeeb98)
at nodeGather.c:293
#7  0x0000000010329e94 in ExecGather (node=0x10039aeeb98) at nodeGather.c:234
#8  0x00000000103086f0 in ExecProcNode (node=0x10039aeeb98) at
execProcnode.c:521
#9  0x000000001031f550 in fetch_input_tuple (aggstate=0x10039aed220)
at nodeAgg.c:587
#10 0x00000000103229a4 in agg_retrieve_direct (aggstate=0x10039aed220)
at nodeAgg.c:2211

At one of the worker process,

#0  0x00000000103879f8 in pagetable_insert (tb=0x10039abcac0,
key=6491, found=0x3fffd4a54b88 "") at
../../../src/include/lib/simplehash.h:571
#1  0x0000000010389964 in tbm_get_pageentry (tbm=0x10039ab5778,
pageno=6491) at tidbitmap.c:876
#2  0x0000000010388608 in tbm_add_tuples (tbm=0x10039ab5778,
tids=0x10039a94c30, ntids=1, recheck=0 '\000') at tidbitmap.c:340
#3  0x00000000100f5e54 in btgetbitmap (scan=0x10039ab3c80,
tbm=0x10039ab5778) at nbtree.c:439
#4  0x00000000100eab7c in index_getbitmap (scan=0x10039ab3c80,
bitmap=0x10039ab5778) at indexam.c:687
#5  0x0000000010328acc in MultiExecBitmapIndexScan
(node=0x10039a98630) at nodeBitmapIndexscan.c:98
#6  0x00000000103088d8 in MultiExecProcNode (node=0x10039a98630) at
execProcnode.c:591
#7  0x0000000010326c70 in BitmapHeapNext (node=0x10039a98770) at
nodeBitmapHeapscan.c:164
#8  0x0000000010316440 in ExecScanFetch (node=0x10039a98770,
accessMtd=0x10326b70 <BitmapHeapNext>, recheckMtd=0x10327bb4
<BitmapHeapRecheck>) at execScan.c:95
#9  0x0000000010316590 in ExecScan (node=0x10039a98770,
accessMtd=0x10326b70 <BitmapHeapNext>, recheckMtd=0x10327bb4
<BitmapHeapRecheck>) at execScan.c:180
#10 0x0000000010327c84 in ExecBitmapHeapScan (node=0x10039a98770) at
nodeBitmapHeapscan.c:623

At other workers,

#0  0x00003fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x00000000104ea184 in WaitEventSetWaitBlock (set=0x10039a24670,
cur_timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1) at
latch.c:998
#2  0x00000000104e9fc8 in WaitEventSetWait (set=0x10039a24670,
timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1,
wait_event_info=134217737) at latch.c:950
#3  0x000000001051a3dc in ConditionVariableSleep (cv=0x3fff7c2d01ac,
wait_event_info=134217737) at condition_variable.c:132
#4  0x00000000103283e0 in pbms_is_leader (pbminfo=0x3fff7c2d0178) at
nodeBitmapHeapscan.c:900
#5  0x0000000010326c38 in BitmapHeapNext (node=0x10039a95f50) at
nodeBitmapHeapscan.c:153
#6  0x0000000010316440 in ExecScanFetch (node=0x10039a95f50,
accessMtd=0x10326b70 <BitmapHeapNext>, recheckMtd=0x10327bb4
<BitmapHeapRecheck>) at execScan.c:95
#7  0x0000000010316590 in ExecScan (node=0x10039a95f50,
accessMtd=0x10326b70 <BitmapHeapNext>, recheckMtd=0x10327bb4
<BitmapHeapRecheck>) at execScan.c:180
#8  0x0000000010327c84 in ExecBitmapHeapScan (node=0x10039a95f50) at
nodeBitmapHeapscan.c:623
#9  0x0000000010308588 in ExecProcNode (node=0x10039a95f50) at
execProcnode.c:443
#10 0x00000000103310d8 in ExecHashJoinOuterGetTuple
(outerNode=0x10039a95f50, hjstate=0x10039a96808,
hashvalue=0x3fffd4a5639c) at nodeHashjoin.c:936

I was using TPC-H with scale factor 20; please let me know if there is
anything more you require in this regard.

[1] https://www.postgresql.org/message-id/CAEepm%3D1vGcv6LBrxZeqPb_rPxfraidWAF_8_4z2ZMQ%2B7DOjj9w%40mail.gmail.com
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Thu, Jan 12, 2017 at 5:22 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> Hello Dilip,
> I was trying to test the performance of all the parallel query related
> patches on TPC-H queries, I found that when parallel hash [1 ]and your
> patch are applied then Q10 and Q14 were hanged, however, without any
> one of these patches, these queries are running fine. Following is the
> stack trace,

Thanks for reporting, I will look into this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
tushar
Date:
On 01/11/2017 05:33 PM, tushar wrote:
> On 01/10/2017 05:16 PM, Dilip Kumar wrote:
>> Please try attached patch and confirm from your side.
> Thanks, issue seems to be fixed now.
Prabhat and I have done some testing for this feature and written a few test cases. PFA the SQL script (along with the expected .out files).

In addition, we have generated an LCOV (code coverage) report and compared the files changed by the "Parallel Bitmap Heap Scan" patch.
You can see the numbers for "with_patch" vs. "with_patch+TestCases" (a .pdf file is attached).
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Thu, Jan 12, 2017 at 5:47 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Hello Dilip,
>> I was trying to test the performance of all the parallel query related
>> patches on TPC-H queries, I found that when parallel hash [1 ]and your
>> patch are applied then Q10 and Q14 were hanged, however, without any
>> one of these patches, these queries are running fine. Following is the
>> stack trace,
>
> Thanks for reporting, I will look into this.
I set up TPC-H at scale factor 5 and could reproduce the hang you
reported with Q10. I have fixed the issue in v10 (only 0003 is changed;
the other patches are unchanged). However, after fixing this issue it
crashes in the parallel shared hash patch, and the call stack is the same
as what you reported on the parallel shared hash thread [1].

[1] https://www.postgresql.org/message-id/CAOGQiiOJji9GbLwfT0vBuixdkgAsxzZjTpPxsO6BiTLD--SzYg%40mail.gmail.com

Please verify with the new patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Please verify with the new patch.

Patches 0001 and 0003 required rebasing on the latest head. 0002 is
still the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Jan 18, 2017 at 12:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> Please verify with the new patch.
>
> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
> still the same.

I've committed the first half of 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
>> still the same.
>
> I've committed the first half of 0001.
Thanks.  0001 and 0003 required rebasing after this commit.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Haribabu Kommi
Date:


On Mon, Jan 23, 2017 at 3:42 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
>> still the same.
>
> I've committed the first half of 0001.
Thanks.  0001 and 0003 required rebasing after this commit.

I reviewed 0002-hash-support-alloc-free-v12.patch; some minor comments below.

- SH_TYPE    *tb;
- uint64 size;
+ SH_TYPE     *tb;
+ uint64 size;

The above change may not be required.

+ if (tb->alloc)
+ {
+ tb->alloc->HashFree(tb->data, tb->alloc->args);
+ pfree(tb->alloc);
+ }

The above code frees the tb->alloc memory. If the user provides the
alloc structure to the SH_CREATE function, the same pointer is stored
in the tb structure, and freeing that memory in the free function may
cause a problem.

So either explicitly mention that the input must be palloc'ed data, or
by default allocate memory and copy the input data into the allocated
memory.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Jan 23, 2017 at 1:52 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> I reviewed 0002-hash-support-alloc-free-v12.patch, some minor comments.
>
> - SH_TYPE    *tb;
> - uint64 size;
> + SH_TYPE     *tb;
> + uint64 size;
>
> The above change may not be required.
>
> + if (tb->alloc)
> + {
> + tb->alloc->HashFree(tb->data, tb->alloc->args);
> + pfree(tb->alloc);
> + }
>
> The above code tries to free the tb->alloc memory. In case if the user
> has provide the alloc structure to SH_CREATE function and the same
> pointer is stored in the tb structure. And in free function freeing that
> memory may cause problem.
>
> So either explicitly mentioning that the input must a palloc'ed data or
> by default allocate memory and copy the input data into allocated
> memory.

I have changed it as per the comments. 0002 and 0003 are changed; 0001
is still the same.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> still the same.

Two days back my colleague Rafia reported an issue (offlist) that the
parallel bitmap node is not scaling as well as other nodes, e.g.
parallel sequential scan and parallel index scan.

After some perf analysis, I found that there was one unconditional
spinlock in the parallel bitmap patch which we were taking to check
the prefetch target. Basically, we always took the lock and checked
whether prefetch_target had reached prefetch_maximum, so even after it
reached prefetch_maximum we would acquire the lock and check after
every tuple. I have changed that logic: now I check the condition
first, and only if we need to increase the target do I take the lock
and recheck the condition.
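
A minimal sketch of the reworked check (field names such as
prefetch_mutex are illustrative, not necessarily the patch's exact names):

/* cheap unlocked test first; only lock when an increase looks needed */
if (pbminfo->prefetch_target < node->prefetch_maximum)
{
    SpinLockAcquire(&pbminfo->prefetch_mutex);
    if (pbminfo->prefetch_target < node->prefetch_maximum)
        pbminfo->prefetch_target++;
    SpinLockRelease(&pbminfo->prefetch_mutex);
}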

There is just a one-line change in 0003 compared to the older version;
all other patches are the same.

Some performance data to show that the new parallel bitmap patch
performs much better than the previous version.
TPC-H scale 20, work_mem 64MB, shared_buffers 8GB, 4 workers,
8-socket Intel machine:

            v13 (time in ms)    v14 (time in ms)
Q6          37065.841           18202.903
Q14         15229.569           11341.121


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Fri, Jan 27, 2017 at 1:32 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There is just one line change in 0003 compared to older version, all
> other patches are the same.

I spent some time looking at 0001 (and how those changes are used in
0003) and I thought it looked good, so I committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Haribabu Kommi
Date:


On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> still the same.

2 days back my colleague Rafia, reported one issue (offlist) that
parallel bitmap node is not scaling as good as other nodes e.g
parallel sequence scan and parallel index scan.

After some perf analysis, I found that there was one unconditional
spin lock in parallel bitmap patch which we were taking for checking
the prefetch target. Basically, we were always taking the lock and
checking the prefetch_target is reached to prefetch_maximum. So even
after it will reach to prefetch_maximum we will acquire the lock and
check after every tuple. I have changed that logic, now I will check
the condition first if we need to increase the target then will take
the lock and recheck the condition.

There is just one line change in 0003 compared to older version, all
other patches are the same.

Some performance data to show that new parallel bitmap patch performs
way better than the previous version.
TPCH scale 20, work_mem 64MB, shared buffers 8GB (4 workers)
machine intel 8 socket machine

            v13(time in ms)           v14 (time in ms)
Q6        37065.841                18202.903

Q14     15229.569                 11341.121

Thanks for the update. I have some comments


0002-hash-support-alloc-free-v14.patch:


+ if (tb->alloc)
+ {

The memory for tb->alloc is always allocated; is the if check still required?



0003-parallel-bitmap-heap-scan-v14.patch:


+ * and set parallel flag in lower level bitmap index scan. Later
+ * bitmap index node will use this flag to indicate tidbitmap that
+ * it needs to create an shared page table.
+ */

It would be better to mention where this flag is used, so that it will be clearer.


+ * Mark tidbitmap as shared and also store DSA area in it.
+ * marking tidbitmap as shared is indication that this tidbitmap
+ * should be created in shared memory. DSA area will be used for

The shared flag is set in the tidbitmap structure itself, but the
comment says that the tidbitmap should be created in shared memory.
I think it is the page table that needs to be created in shared memory.
Providing some more information here would be helpful.


- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,
+ pbminfo ? &pbminfo->tbmiterator : NULL);

Instead of passing both the normal and parallel iterators to the function
and checking again for NULL inside, how about just adding the check in
the caller itself? Or, if you prefer the current method, explain the
details of the inputs in the function header in more detail.


+ /* Increase prefetch target if it's not yet at the max. */
+ if (node->prefetch_target < node->prefetch_maximum)

I didn't evaluate all scenarios, but the above code may have a problem:
in parallel mode the prefetch_target is fetched from the node and not
from the pbminfo, but later it is read from the pbminfo and updated there.
Maybe this code needs a rewrite.


+ TBMIterator *iterator = node->prefetch_iterator;

Why another variable? Why can't we use prefetch_iterator itself?
Currently node->tbmiterator and node->prefetch_iterator are initialized
irrespective of whether the scan is parallel or not, but at the point of
use there is a check to use the parallel one in the parallel case. If that
is the case, why can't we avoid the initialization itself?


+ else
+ needWait = false;

By default needWait is false; just set it to true only in the
case of PBM_INPROGRESS.

+ * so that during free we can directly get the dsa_pointe and free it.

dsa_pointe -> dsa_pointer


+typedef struct
+{
+ TIDBitmap  *tbm; /* TIDBitmap we're iterating over */
+ int spageptr; /* next spages index */
+ int schunkptr; /* next schunks index */
+ int schunkbit; /* next bit to check in current schunk */
+ TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+} TBMIterator;

I don't see the need to move this structure. Can you point me to where
it is used?

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 11:17 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com>
>> wrote:
>> > I have changed as per the comments. 0002 and 0003 are changed, 0001 is
>> > still the same.
>>
>> There is just one line change in 0003 compared to older version, all
>> other patches are the same.
>
> Thanks for the update. I have some comments

This review is too fresh to be addressed, so I have moved this patch
to the next CF.
-- 
Michael



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Jan 31, 2017 at 7:47 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Thanks for the update. I have some comments
>
Thanks for the review.

>
> 0002-hash-support-alloc-free-v14.patch:
>
>
> + if (tb->alloc)
> + {
>
> The memory for tb->alloc is allocated always, is the if check still
> required?

In the parallel case, only the first worker calls SH_CREATE; the other
workers only palloc the pagetable and copy the reference from the main
worker, so they will not have an allocator.
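
A rough sketch of what that means in code (names like leader_tb are
illustrative; this is not the patch text):

/* non-leader workers never call SH_CREATE, so they get no allocator */
pagetable_hash *tb = palloc(sizeof(pagetable_hash));

memcpy(tb, leader_tb, sizeof(pagetable_hash)); /* copy size/data references */
tb->alloc = NULL;                              /* hence the if (tb->alloc) check */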


> 0003-parallel-bitmap-heap-scan-v14.patch:
>
>
> + * and set parallel flag in lower level bitmap index scan. Later
> + * bitmap index node will use this flag to indicate tidbitmap that
> + * it needs to create an shared page table.
> + */
>
> I feel better to mention, where this flag is used, so that it will be more
> clear.

Done
>
>
> + * Mark tidbitmap as shared and also store DSA area in it.
> + * marking tidbitmap as shared is indication that this tidbitmap
> + * should be created in shared memory. DSA area will be used for
>
> The flag of shared is set in tidbitmap structure itself, but the
> comment is mentioned that tidbitmpa should be created in shared memory.
> I think that is the page table that needs to be created in shared memory.
> Providing some more information here will be helpful.
>
Done

>
> - node->tbmres = tbmres = tbm_iterate(tbmiterator);
> + node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,
> + pbminfo ? &pbminfo->tbmiterator : NULL);
>
> Instead Passing both normal and paralle iterators to the functions
> and checking inside again for NULL, How about just adding check
> in the caller itself? Or if you prefer the current method, then try to
> explain the details of input in the function header and more description.
>
Done as you said.

>
> + /* Increase prefetch target if it's not yet at the max. */
> + if (node->prefetch_target < node->prefetch_maximum)
>
> I didn't evaluate all scenarios, but the above code may have a problem,
> In case of parallel mode the the prefetch_target is fetched from node
> and not from the pbminfo. Later it gets from the pminfo and update that.
> May be this code needs to rewrite.
>
Fixed it.

>
> + TBMIterator *iterator = node->prefetch_iterator;
>
> Why another variable? Why can't we use the prefetch_iterator itself?
> Currently node->tbmiterator and node->prefetch_iterator are initialized
> irrespective of whether is it a parallel one or not. But while using, there
> is a check to use the parallel one in case of parallel. if it is the case,
> why can't we avoid the initialization itself?

Fixed
>
>
> + else
> + needWait = false;
>
> By default needWait is false. Just set that to true only for the
> case of PBM_INPROGRESS
>

Actually, inside the while loop, suppose we first set needWait = true
because the state is PBM_INPROGRESS and go to ConditionVariableSleep.
After coming out of the sleep, we need to check whether PBM_FINISHED is
set or whether we need to sleep again, and in that case we need to reset
needWait to false. This could also be done by returning directly when the
state is PBM_FINISHED, but I want to keep the logic below common.
+ SpinLockRelease(&pbminfo->state_mutex);
+
+ /* If we are leader or leader has already created a TIDBITMAP */
+ if (leader || !needWait)
+ break;
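
For reference, a sketch of the whole loop as described above (the state
names follow the patch; the condition-variable field and wait event are
illustrative):

for (;;)
{
    bool    leader = false;
    bool    needWait = false;

    SpinLockAcquire(&pbminfo->state_mutex);

    if (pbminfo->state == PBM_INITIAL)
    {
        pbminfo->state = PBM_INPROGRESS;  /* we become the leader */
        leader = true;
    }
    else if (pbminfo->state == PBM_INPROGRESS)
        needWait = true;                  /* leader still building the bitmap */
    /* PBM_FINISHED: fall through with needWait = false */

    SpinLockRelease(&pbminfo->state_mutex);

    /* If we are leader or leader has already created a TIDBITMAP */
    if (leader || !needWait)
        break;

    /* otherwise sleep until the leader signals completion */
    ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
}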

> + * so that during free we can directly get the dsa_pointe and free it.
>
> dsa_pointe -> dsa_pointer
Done

>
>
> +typedef struct
> +{
> + TIDBitmap  *tbm; /* TIDBitmap we're iterating over */
> + int spageptr; /* next spages index */
> + int schunkptr; /* next schunks index */
> + int schunkbit; /* next bit to check in current schunk */
> + TBMIterateResult output; /* MUST BE LAST (because variable-size) */
> +} TBMIterator;
>
> I didn't find the need of moving this structure. Can you point me where it
> is used.

Because pbms_parallel_iterate needs to access this structure, I moved
it to tidbitmap.h.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> 0002-hash-support-alloc-free-v14.patch:
>>
>>
>> + if (tb->alloc)
>> + {
>>
>> The memory for tb->alloc is allocated always, is the if check still
>> required?
>
> In parallel case, only first worker will call SH_CREATE, other worker
> will only do palloc for pagetable and copy the reference from main
> worker, so they will not have allocator.

When Andres wrote in
https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
that you should try to share only the iteration arrays, I believe that
he meant the results of tbm_begin_iterate() -- that is,
iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
perhaps putting words into his mouth, but what he said was that we
could avoid sharing "the whole underlying hash".  But the patch set
you've implemented here doesn't behave that way.  Instead, you've got
the array containing the hash table elements shared (which is good)
plus there's a sort of dummy hash table in every worker which copies
some but not all of the members of the original hash table, leading to
the otherwise-unnecessary if-test that Haribabu is complaining about.
So the hash table is kinda-shared-kinda-not-shared, which I don't
*think* is really what Andres had in mind.

In short, I think SH_COPY (implemented here as pagetable_copy) needs
to go away.  The work of tbm_begin_iterate() should be done before we
begin the shared scan and the results of that work should be shared.
What happens right now (it appears) is that every backend does that
work based on the same hash table and we just assume they all get the
same answer.  And we really do assume that, because
pbms_parallel_iterate() assumes it can shuttle private state back and
forth between iterator in different backends and nothing will break;
but those backends aren't actually using the same iteration arrays.
They are using different iteration arrays that should have the same
contents because they were all derived from the same semi-shared hash
table.  That's pretty fragile, and a waste of CPU cycles if the hash
table is large (every backend does its own sort).

On a related note, I think it's unacceptable to make the internal
details of TBMIterator public.  You've moved it from tidbitmap.c to
tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
the TBMIterator, but that's not OK.  Those details need to remain
private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
need some better solution.  The knowledge of how a shared iterator
should iterate needs to be private to tidbitmap.c, not off in the
executor someplace.  And I think the entries need to actually be
updated in shared memory directly, not copied back and forth between a
backend-private iterator and a shared iterator.

Also, pbms_parallel_iterate() can't hold a spinlock around a call to
tbm_iterate().  Note the coding rules for spinlocks mentioned in
spin.h and src/backend/storage/lmgr/README.  I think the right thing
to do here is to use an LWLock in a new tranche (add an entry to
BuiltinTrancheIds).
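
Something along these lines, perhaps (the tranche and field names here
are only illustrative):

/* in lwlock.h, add a member to BuiltinTrancheIds, e.g. */
    LWTRANCHE_PARALLEL_BITMAP_SCAN,

/* when the shared state is set up */
LWLockInitialize(&pbminfo->lock, LWTRANCHE_PARALLEL_BITMAP_SCAN);

/* and in the shared iteration path, instead of a spinlock */
LWLockAcquire(&pbminfo->lock, LW_EXCLUSIVE);
tbmres = tbm_iterate(iterator);
LWLockRelease(&pbminfo->lock);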

In 0002, you have this:

+    tb->alloc = palloc(sizeof(SH_ALLOCATOR));

This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
Otherwise the allocator object is in a different context from
everything else in the hash table.  But TBH, I don't really see why we
want this to be a separate object.  Why not just put
HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
pointer chasing and doesn't really seem to cost anything (except that
SH_CREATE will grow a slightly longer argument sequence).

I think maybe we should rename the functions to element_allocate,
element_free, and element_allocator_ctx, or something like that.  The
current names aren't capitalized consistently with other things in
this module, and putting the word "element" in there would make it
more clear what the purpose of this thing is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

Thanks for the review.

> When Andres wrote in
> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)

Here is my analysis of why we selected this approach instead of just
sharing the iterator:
----------------------------------------------------------------------------------------------------
The problem is that each entry of the iteration array is just a pointer
to a hash entry, so if we shared only the iterator, the iterator elements
would point into the local memory of other workers. I thought of one more
option: during tbm_begin_iterate, instead of keeping a pointer to the hash
entry, we could make a copy of it in the DSA area, but that would hold
two copies of each hash table element for a short duration, which is not
good. I think what we are doing currently is better than that.
> The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.
> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.

Actually, tbm_begin_iterate processes each element of the hash table
and converts it into the chunk and page arrays; currently this is done
by each worker, and it's pretty cheap. Suppose I try to do this only in
the first worker; then the implementation would look something like this:

1. First, we need to create the tbm->spages array and the tbm->schunks
array, and both should be arrays of dsa_pointers.
2. Then each worker will process these arrays and convert them to arrays
of their local pointers.
3. But with the current solution, where all hash elements are stored in
one large DSA chunk, how are we going to divide them into multiple
dsa_pointers?

I will work on the remaining comments.

> And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.
>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Here is the latest version of the patch set, which addresses all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away.  The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed SH_COPY, and now the leader performs
tbm_begin_shared_iterate before waking up the workers. Basically, the
leader creates the page and chunk arrays, and these are arrays of
"relptr" (suggested offlist by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.  And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c.
Fixed

> pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this; there is now a new function called tbm_shared_iterate
which iterates directly using the shared iterator, so there is no longer
any need to copy members back and forth between a local and a shared
iterator.
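
As a usage sketch (the argument list here is only indicative of the
idea, not the exact v15 signature):

/* each participant pulls pages straight from the shared iterator state */
while ((tbmres = tbm_shared_iterate(shared_iterator)) != NULL)
{
    /* fetch the heap page described by tbmres, recheck quals, etc. */
}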

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think maybe we should rename the functions to element_allocate,
>> element_free, and element_allocator_ctx, or something like that.  The
>> current names aren't capitalized consistently with other things in
>> this module, and putting the word "element" in there would make it
>> more clear what the purpose of this thing is.
>
> Fixed as per the suggestion

Eh, not really.  You changed the memory context to be called
element_allocator_ctx, rather than changing the args passed to the
element allocator to have that name, which is what I had in mind.

I did some assorted renaming and other cosmetic improvements and committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Jeff Janes
Date:
On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think maybe we should rename the functions to element_allocate,
>> element_free, and element_allocator_ctx, or something like that.  The
>> current names aren't capitalized consistently with other things in
>> this module, and putting the word "element" in there would make it
>> more clear what the purpose of this thing is.
>
> Fixed as per the suggestion

Eh, not really.  You changed the memory context to be called
element_allocator_ctx, rather than changing the args passed to the
element allocator to have that name, which is what I had in mind.

I did some assorted renaming and other cosmetic improvements and committed 0002.

I'm getting compiler errors:

In file included from execGrouping.c:47:
../../../src/include/lib/simplehash.h:91: error: redefinition of typedef 'simplehash_allocate'
../../../src/include/lib/simplehash.h:91: note: previous declaration of 'simplehash_allocate' was here
../../../src/include/lib/simplehash.h:92: error: redefinition of typedef 'simplehash_free'
../../../src/include/lib/simplehash.h:92: note: previous declaration of 'simplehash_free' was here


gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)
 
Cheers,

Jeff

Re: [HACKERS] Parallel bitmap heap scan

From
Andres Freund
Date:
On 2017-02-07 16:03:43 -0500, Robert Haas wrote:
> On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> I think maybe we should rename the functions to element_allocate,
> >> element_free, and element_allocator_ctx, or something like that.  The
> >> current names aren't capitalized consistently with other things in
> >> this module, and putting the word "element" in there would make it
> >> more clear what the purpose of this thing is.
> >
> > Fixed as per the suggestion
> 
> Eh, not really.  You changed the memory context to be called
> element_allocator_ctx, rather than changing the args passed to the
> element allocator to have that name, which is what I had in mind.
> 
> I did some assorted renaming and other cosmetic improvements and committed 0002.

FWIW, I think it'd have been better to not add the new callbacks as
parameters to *_create(), but rather have them be "templatized" like the
rest of simplehash.  That'd require that callback to check the context,
to know whether it should use shared memory or not, but that seems fine
to me.  Right now this pushes the head of simplehash above a
cacheline...

I'm also doubtful about the naming of the default callbacks:
+/* default memory allocator function */
+static void *
+SH_DEFAULT_ALLOC(Size size, void *args)
+{
+   MemoryContext context = (MemoryContext) args;
+
+   return MemoryContextAllocExtended(context, size,
+                                     MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
+}

SH_DEFAULT_ALLOC sounds like it's a name that's one of the prefixed
names (like SH_CREATE actually becomes pagetable_create and such) - but
afaics it's not.  Which afaics means that this'll generate symbol
conflicts if one translation unit uses multiple simplehash.h style
hashes.  Afaics these should either be prefixed, or static inline
functions.

- Andres



Re: [HACKERS] Parallel bitmap heap scan

From
Andres Freund
Date:
On 2017-02-07 13:13:43 -0800, Jeff Janes wrote:
> On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >> I think maybe we should rename the functions to element_allocate,
> > >> element_free, and element_allocator_ctx, or something like that.  The
> > >> current names aren't capitalized consistently with other things in
> > >> this module, and putting the word "element" in there would make it
> > >> more clear what the purpose of this thing is.
> > >
> > > Fixed as per the suggestion
> >
> > Eh, not really.  You changed the memory context to be called
> > element_allocator_ctx, rather than changing the args passed to the
> > element allocator to have that name, which is what I had in mind.
> >
> > I did some assorted renaming and other cosmetic improvements and committed
> > 0002.
> >
>
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Oh yea, that too - you can't just redefine a typedef like that according
to C89 :(



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 7, 2017 at 4:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Thanks, I'll stick an #ifdef guard around that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think it'd have been better to not add the new callbacks as
> parameters to *_create(), but rather have them be "templatized" like the
> rest of simplehash.  That'd require that callback to check the context,
> to know whether it should use shared memory or not, but that seems fine
> to me.  Right now this pushes the head of simplehash above a
> cacheline...

Something like the attached?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Andres Freund
Date:
On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I think it'd have been better to not add the new callbacks as
> > parameters to *_create(), but rather have them be "templatized" like the
> > rest of simplehash.  That'd require that callback to check the context,
> > to know whether it should use shared memory or not, but that seems fine
> > to me.  Right now this pushes the head of simplehash above a
> > cacheline...
> 
> Something like the attached?

Yes.

> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
> +

That should probably be documented in the file header.


Thanks!

Andres



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 7, 2017 at 4:45 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
>> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
>> > FWIW, I think it'd have been better to not add the new callbacks as
>> > parameters to *_create(), but rather have them be "templatized" like the
>> > rest of simplehash.  That'd require that callback to check the context,
>> > to know whether it should use shared memory or not, but that seems fine
>> > to me.  Right now this pushes the head of simplehash above a
>> > cacheline...
>>
>> Something like the attached?
>
> Yes.
>
>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>> +
>
> That should probably be documented in the file header.

Right.  OK, did that and a few other cleanups, and committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Here are the latest version of the patch, which address all the
> comments given by Robert.

I think it would be useful to break the remaining patch into two
parts, one that enhances the tidbitmap.c API and another that uses
that to implement Parallel Bitmap Heap Scan.  In this review I'm going
to focus on the changes to tidbitmap.c and tidbitmap.h.

+extern void tbm_restore_shared_members(TIDBitmap *tbm,
+                           ParallelTBMInfo * stbm);
+extern void tbm_update_shared_members(TIDBitmap *tbm,
+                          ParallelTBMInfo * parallel_tbm);
+void        tbm_set_parallel(TIDBitmap *tbm, void *area);
+extern Size tbm_estimate_parallel_tbminfo(Size size);
+TIDBitmap  *tbm_attach(ParallelTBMInfo * parallel_tbm, void *area);
+TBMIterator *tbm_begin_shared_iterate(TIDBitmap *tbm,
+                         ParallelTBMInfo * parallel_tbm, bool prefetch);
+TBMIterateResult *tbm_shared_iterate(TBMIterator *iterator);
+void        tbm_init_shared_iterator(ParallelTBMInfo * tbminfo);

There are several cosmetic problems here.  You may have noticed that
all function prototypes in PostgreSQL header files are explicitly
declared extern; yours should be, too.  Also, there is extra
whitespace before some of the variable names here, like
"ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
that is due to pgindent, the solution is to add the typedef names to
your local typedef list.  Also, tbm_restore_shared_members doesn't
actually exist.

More broadly, this seems like an extremely complicated API.  Even
ignoring the function that doesn't exist, that's still 7 different
functions just for shared iteration, which seems like a lot.  I
suggest the following API:

1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
NULL, we allocate a backend-private memory for the hash entries as
normal.  If it's true, we use the dsa_area to store the hash entries,
using the infrastructure added by your 0002 and revised in
c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
BitmapIndexScan and BitmapOr to decide whether to pass NULL or
es_query_dsa.)

2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
*tbm) which allocates shared iteration arrays using the DSA passed to
tbm_create() and returns a dsa_pointer to the result.  Arrange this so
that if it's called more than once, each call returns a separate
iterator, so that you can call it once to get the main iterator and a
second time for the prefetch iterator, but have both of those point to
the same underlying iteration arrays.

3. Add a new function TBMSharedIterator
*tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
once per backend and gets passed the dsa_pointer from the previous
step.

4. Add a new function TBMIterateResult
*tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.
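
Putting those together, the entry points would look roughly like this
(just a sketch of the prototypes, keeping tbm_create's existing maxbytes
argument; exact spellings to taste):

extern TIDBitmap *tbm_create(long maxbytes, dsa_area *dsa);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
                                                    dsa_pointer iterator);
extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);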

As compared with your proposal, this involves only 3 functions instead
of 7 (plus one new argument to another function), and I think it's a
lot clearer what those functions are actually doing.  You don't need
tbm_estimate_parallel_tbminfo() any more because the data being passed
from one backend to another is precisely a dsa_pointer, and the bitmap
scan can just leave space for that in its own estimator function.  You
don't need tbm_update_shared_members() separately from
tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
because tbm_prepare_shared_iterate() can do all that stuff.  You don't
need tbm_set_parallel() because the additional argument to
tbm_create() takes care of that need.

The way you've got ParallelTBMInfo set up right now doesn't respect
the separation of concerns between nodeBitmapHeapscan.c and
tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
intends on creating two iterators, one of which is for prefetching.
The design above hopefully addresses that: each call to
tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
of shared iterator state, but tidbitmap.c does not need to know how
many times that will be called.

Apart from the above, this patch will need a rebase over today's
commits, and please make sure all functions you add have high-quality
header comments.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for your input. I have a few queries about these comments.

> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.

IIUC, tbm_prepare_shared_iterate will be called only by the leader, for
the tbmiterator as well as for the prefetch_iterator, and
tbm_attach_shared_iterate will be called by each backend, for both
iterators.

IMHO, tbm_attach_shared_iterate should return a TBMIterator with a
reference to a TBMSharedIterator inside it. The reason is that we cannot
keep the TBMIterateResult inside the TBMSharedIterator; otherwise the
results would also go into shared memory, but we want each worker to have
local result memory so that other workers don't disturb it.

Another option would be to change tbm_shared_iterate as shown below:

TBMIterateResult * tbm_shared_iterate(TBMSharedIterator *,
TBMIterateResult *result);

If the result passed to this API is NULL, then it will allocate the memory
for the result, and that way we will have local result memory; if it's not
NULL, we will use that memory to store our results. BitmapHeapNode already
has a reference to the TBMIterateResult, so we should not have any problem
passing this reference to tbm_shared_iterate. I think this one looks
better than what I explained above.

Please suggest.

>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

The new SH_CREATE(MemoryContext ctx, uint32 nelements) doesn't have any
option for supplying arguments to it, and our callback functions need
access to the TBM.

Is the expectation that a user of SH_CREATE who doesn't want to pass a
"MemoryContext" can pass its own argument in place of ctx?  Something
like this?

if (!tbm->dsa)
    tbm->pagetable = pagetable_create(tbm->mcxt, 128);
else
    tbm->pagetable = pagetable_create((MemoryContext) tbm, 128);

And then in the allocation function we can access this context and cast
it back to the tbm, as shown below:

static void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->ctx;
    ...
}

So is it expected to be done as I explained above, or are we missing an
arg parameter in SH_CREATE (and in the SH_TYPE structure), or is there
some other way you have in mind?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

I think we need to have the prototypes for the default allocator outside
of #ifndef SH_USE_NONDEFAULT_ALLOCATOR. A file such as tidbitmap.c that
wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the allocator
function definitions, but it cannot declare those functions itself,
because they take SH_TYPE as input and that type is only defined once
simplehash.h has been included.

So basically we cannot declare the allocator prototypes before including
simplehash.h. And if we don't declare them, we get an "implicit
declaration" warning, because simplehash itself uses those functions.

The solution is that simplehash.h should always declare them, and provide
the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined. The
attached patch does that.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 5:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>>> +
>>>
>>> That should probably be documented in the file header.
>>
>> Right.  OK, did that and a few other cleanups, and committed.
>
> I think we need to have prototype for the default allocator outside of
> #ifndef SH_USE_NONDEFAULT_ALLOCATOR. Because the file e.g. tidbitmap.c
> who wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the
> allocator function definition but it can not have the declaration of
> those function as that function take SH_TYPE as input and that will be
> only defined once we include the simplehash.h.
>
> So basically we can not declare prototype before including
> simplehash.h for allocator. And, if we don't declare we will get
> "implicit declaration warning" because simplehash itself is using
> those functions.
>
> The solution is simplehash.h, should always declare it, and provide
> the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined.
> Attached patch does that.

Makes sense, will commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 4:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> The new SH_CREATE(MemoryContext ctx, uint32 nelements) don't have any
> option to supply arguments to it. Our callback functions need access
> to TBM.
>
> Is it expected that if the user of SH_CREATE who doesn't want to pass
> a "MemoryContext" then we can pass arguments instead of ctx?

You can store whatever you want in SH_TYPE's private_data member.
SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
have access to that.  Hmm, but there's no way to get that set in
SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
private_data argument to SH_CREATE.  execGrouping.c could use that
instead of frobbing private_data directly:

-    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
-    hashtable->hashtab->private_data = hashtable;
+    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);
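
Then tidbitmap.c wouldn't need to abuse the ctx argument at all; roughly
(just a sketch, reusing the names from your mail):

    /* pass the TIDBitmap itself as private_data at creation time */
    tbm->pagetable = pagetable_create(tbm->mcxt, 128, tbm);

and inside pagetable_allocate() / pagetable_free():

    TIDBitmap  *tbm = (TIDBitmap *) pagetable->private_data;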

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 1:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> IIUC, tbm_prepare_shared_iterate will be called only by the leader,
> for tbmiterator as well as for the prefetch_iterator. And,
> tbm_attach_shared_iterate will be called by each backend and for both
> the iterators.

That's what I had in mind.

> IMHO, tbm_attach_shared_iterate should return TBMIterator with
> reference to TBMSharedIterator inside it. The reason behind same is
> that we can not keep TBMIterateResult inside TBMSharedIterator
> otherwise, results will also go in shared memory but we want to have
> local result memory for each worker so that other worker doesn't
> disturb it.

No, I don't agree.  I think TBMSharedIterator should be an unshared
structure created by tbm_attach_shared_iterate, which can internally
contain backend-private state like a TBMIterateResult, and which can
also contain a pointer to the shared-memory structure previously
created by tbm_prepare_shared_iterate.  That thing needs to be called
something other than a TBMSharedIterator, like TBMSharedIterationState
or something.
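
In other words, roughly (sketch only, using the names above; the result
buffer is variable-size, so it has to stay last):

typedef struct TBMSharedIterator
{
    TBMSharedIterationState *state;     /* shared state, lives in DSA */
    TBMIterateResult output;            /* backend-private result buffer */
} TBMSharedIterator;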

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You can store whatever you want in SH_TYPE's private_data member.
> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
> have access to that.  Hmm, but there's no way to get that set in
> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
> private_data argument to SH_CREATE.  execGrouping.c could use that
> instead of frobbing private_data directly:
>
> -    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
> -    hashtable->hashtab->private_data = hashtable;
> +    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);

Okay, I will go ahead as you suggested. Patch attached.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 8:58 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You can store whatever you want in SH_TYPE's private_data member.
>> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
>> have access to that.  Hmm, but there's no way to get that set in
>> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
>> private_data argument to SH_CREATE.  execGrouping.c could use that
>> instead of frobbing private_data directly:
>>
>> -    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
>> -    hashtable->hashtab->private_data = hashtable;
>> +    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);
>
> Okay, will go ahead as you suggested. Patch attached for the same.

Looks good to me.  If nobody has further ideas here, I'll push this
and your previous patch tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Looks good to me.  If nobody has further ideas here, I'll push this
> and your previous patch tomorrow.

Done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the detailed review and your valuable feedback.

> I think it would be useful to break the remaining patch into two
> parts, one that enhances the tidbitmap.c API and another that uses
> that to implement Parallel Bitmap Heap Scan.

I have done that.

> There are several cosmetic problems here.  You may have noticed that
> all function prototypes in PostgreSQL header files are explicitly
> declared extern; yours should be, too.  Also, there is extra
> whitespace before some of the variable names here, like
> "ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
> that is due to pgindent, the solution is to add the typedef names to
> your local typedef list.  Also, tbm_restore_shared_members doesn't
> actually exist.

Fixed

> 1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
> NULL, we allocate a backend-private memory for the hash entries as
> normal.  If it's true, we use the dsa_area to store the hash entries,
> using the infrastructure added by your 0002 and revised in
> c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
> BitmapIndexScan and BitmapOr to decide whether to pass NULL or
> es_query_dsa.)
Done

>
> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.
>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.

I have implemented these three APIs as you explained, with one change: in
tbm_attach_shared_iterate I need to pass the TBM as well, because each
worker creates its own copy of the TBM and needs to get the TBM-related
information from the shared location. I do access some of the
information, like chunk and npages, directly from the shared location,
but other information such as the base pointer to the dsa element,
RelptrPagetableEntry etc. has to be local to each worker (after
conversion from the dsa pointer to a local pointer).

>
> As compared with your proposal, this involves only 3 functions instead
> of 7 (plus one new argument to another function), and I think it's a
> lot clearer what those functions are actually doing.  You don't need
> tbm_estimate_parallel_tbminfo() any more because the data being passed
> from one backend to another is precisely a dsa_pointer, and the bitmap
> scan can just leave space for that in its own estimator function.  You
> don't need tbm_update_shared_members() separately from
> tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
> because tbm_prepare_shared_iterate() can do all that stuff.  You don't
> need tbm_set_parallel() because the additional argument to
> tbm_create() takes care of that need.

Right

>
> The way you've got ParallelTBMInfo set up right now doesn't respect
> the separation of concerns between nodeBitmapHeapscan.c and
> tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
> intends on creating two iterators, one of which is for prefetching.
> The design above hopefully addresses that: each call to
> tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
> of shared iterator state, but tidbitmap.c does not need to know how
> many times that will be called.

Done
>
> Apart from the above, this patch will need a rebase over today's
> commits,

Done

> and please make sure all functions you add have high-quality
> header comments.

I tried my best; please check the latest patch (0001).

Apart from those, there are some more changes.

1. I have moved the dsa_pointer and dsa_area declarations from dsa.h to
postgres.h; an independent patch is attached for that. We need to declare
function prototypes that take dsa_pointer in tidbitmap.h, but tidbitmap.h
is also used from FRONTEND code, so it cannot include dsa.h (which pulls
in atomics.h).

2. I noticed there was a defect in my code related to handling
TBM_ONE_PAGE. In the initial version there was no problem, but it got
introduced somewhere in the intermediate versions. I have fixed it.

There were two options for doing that:

1. Don't switch to TBM_ONE_PAGE in parallel mode (ideally this cannot
happen; only a bad estimate can get us there) and go directly to TBM_HASH.
2. Keep space in the shared information for sharing a PagetableEntry.

I have implemented the 2nd (in the initial versions I had implemented the
1st).

Note: The patch is also rebased on top of guc_parallel_index_scan_v1.patch
from the Parallel Index Scan thread [1].

[1]
https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Sat, Feb 11, 2017 at 1:41 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I tried my best, please check the latest patch (0001).

I don't think it's acceptable (or necessary) to move the DSA
definitions into postgres.h.  Why do you think you need to do that,
vs. just including dsa.h in a few more places?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think it's acceptable (or necessary) to move the DSA
> definitions into postgres.h.  Why do you think you need to do that,
> vs. just including dsa.h in a few more places?

I need to access dsa_pointer in tidbitmap.h, which is included from
FRONTEND code as well. Now, the problem is that dsa.h includes
"port/atomics.h", but atomics.h cannot be included if FRONTEND is
defined.

#ifndef ATOMICS_H
#define ATOMICS_H
#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Is there any other solution to this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Haribabu Kommi
Date:


On Tue, Feb 14, 2017 at 12:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think it's acceptable (or necessary) to move the DSA
>> definitions into postgres.h.  Why do you think you need to do that,
>> vs. just including dsa.h in a few more places?
>
> I need to access dsa_pointer in tidbitmap.h, which is included from
> FRONTEND as well. Now, problem is that dsa.h is including #include
> "port/atomics.h", but atomic.h can not be included if FRONTEND is
> defined.
>
> #ifndef ATOMICS_H
> #define ATOMICS_H
> #ifdef FRONTEND
> #error "atomics.h may not be included from frontend code"
> #endif
>
> Is there any other solution to this ?

How about creating another header file with the parallel changes
and including it only in the places that need it?

Following are my observations, while going through the patch.

+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+#else
+typedef uint64 dsa_pointer;
+#endif

I feel the declaration of the above typedef can be moved into the
section above, if we are going with the current move into the postgres.h
file.

+/*
+ * tbm_alloc_shared
+ *
+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.
+ */
+static inline void *
+pagetable_allocate(pagetable_hash *pagetable, Size size)

The function name and the comment don't match?


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Feb 13, 2017 at 8:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think it's acceptable (or necessary) to move the DSA
>> definitions into postgres.h.  Why do you think you need to do that,
>> vs. just including dsa.h in a few more places?
>
> I need to access dsa_pointer in tidbitmap.h, which is included from
> FRONTEND as well. Now, problem is that dsa.h is including #include
> "port/atomics.h", but atomic.h can not be included if FRONTEND is
> defined.
>
> #ifndef ATOMICS_H
> #define ATOMICS_H
> #ifdef FRONTEND
> #error "atomics.h may not be included from frontend code"
> #endif
>
> Is there any other solution to this ?

Well, any problem like this generally has a bunch of solutions, so
I'll say yes.  I spent a good chunk of today studying the issue and
started a new thread devoted specifically to it:

https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Feb 14, 2017 at 6:51 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> +#if SIZEOF_DSA_POINTER == 4
> +typedef uint32 dsa_pointer;
> +#else
> +typedef uint64 dsa_pointer;
> +#endif
>
> I feel the declaration of the above typdef can be moved into the
> section above if we going with the current move into postgres.h
> file.

Robert has already posted a patch [1] for this, so nothing more needs to
be done here. Therefore, I have included dsa.h in tidbitmap.h.

[1] https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com
>
> +/*
> + * tbm_alloc_shared
> + *
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
> + */
> +static inline void *
> +pagetable_allocate(pagetable_hash *pagetable, Size size)
>
> Function name and comments mismatch?

Fixed.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 14, 2017 at 12:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Fixed.

Thanks, the external interface to this looks much cleaner now.
Scrutinizing the internals:

What is the point of having a TBMSharedIterator contain a TIDBitmap
pointer?  All the values in that TIDBitmap are populated from the
TBMSharedInfo, but it seems to me that the fields that are copied over
unchanged (like status and nchunks) could just be used directly from
the TBMSharedInfo, and the fields that are computed (like relpages and
relchunks) could be stored directly in the TBMSharedIterator.
tbm_shared_iterate() is already separate code from tbm_iterate(), so
it can be changed to refer to whichever data structure contains the
data it needs.

Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
like you could merge those two into a single structure.

I think we can simplify things here a bit by having shared iterators
not support single-page mode.  In the backend-private case,
tbm_begin_iterate() really doesn't need to do anything with the
pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
got to anyway copy the pagetable into shared memory.  So it seems to
me that it would be simpler just to transform it into a standard
iteration array while we're at it, instead of copying it into entry1.
In other words, I suggest removing both "entry1" and "status" from
TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
compensate.

I think "dsa_data" is poorly named; it should be something like
"dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
think you should move the "base", "relpages", and "relchunks"
into TBMSharedIterator and give them better names, like "ptbase",
"ptpages" and "ptchunks".  "base" isn't clear that we're talking about
the pagetable's base as opposed to anything else, and "relpages" and
"relchunks" are named based on the fact that the pointers are relative
rather than named based on what data they point at, which doesn't seem
like the right choice.

I suggest putting the parallel functions next to each other in the
file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
tbm_end_shared_iterate().

+    if (tbm->dsa == NULL)
+        return pfree(pointer);

Don't try to return a value of type void.  The correct spelling of
this is { pfree(pointer); return; }.  Formatted appropriately across 4
lines, of course.
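
That is:

    if (tbm->dsa == NULL)
    {
        pfree(pointer);
        return;
    }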

+    /*
+     * If TBM is in iterating phase that means pagetable is already created
+     * and we have come here during tbm_free. By this time we are already
+     * detached from the DSA because the GatherNode would have been shutdown.
+     */
+    if (tbm->iterating)
+        return;

This seems like something of a kludge, although not a real easy one to
fix.  The problem here is that tidbitmap.c ideally shouldn't have to
know that it's being used by the executor or that there's a Gather
node involved, and certainly not the order of operations around
shutdown.  It should really be the problem of 0002 to handle this kind
of problem, without 0001 having to know anything about it.  It strikes
me that it would probably be a good idea to have Gather clean up its
children before it gets cleaned up itself.  So in ExecShutdownNode, we
could do something like this:

diff --git a/src/backend/executor/execProcnode.c
b/src/backend/executor/execProcnode.c
index 0dd95c6..5ccc2e8 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
     if (node == NULL)
         return false;
 
+    planstate_tree_walker(node, ExecShutdownNode, NULL);
+
     switch (nodeTag(node))
     {
         case T_GatherState:
@@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
             break;
     }
 
-    return planstate_tree_walker(node, ExecShutdownNode, NULL);
+    return false;
 }

Also, in ExecEndGather, something like this:

diff --git a/src/backend/executor/nodeGather.c
b/src/backend/executor/nodeGather.c
index a1a3561..32c97d3 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -229,10 +229,10 @@ ExecGather(GatherState *node)
 void
 ExecEndGather(GatherState *node)
 {
+    ExecEndNode(outerPlanState(node));      /* let children clean up first */
     ExecShutdownGather(node);
     ExecFreeExprContext(&node->ps);
     ExecClearTuple(node->ps.ps_ResultTupleSlot);
-    ExecEndNode(outerPlanState(node));
 }
 
 /*

With those changes and an ExecShutdownBitmapHeapScan() called from
ExecShutdownNode(), it should be possible (I think) for us to always
have the bitmap heap scan node shut down before the Gather node shuts
down, which I think would let you avoid having a special case for this
inside the TBM code.

+    char       *ptr;
+    dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
+    tbm->dsa_data = dsaptr;
+    ptr = dsa_get_address(tbm->dsa, dsaptr);
+    memset(ptr, 0, size + sizeof(dsa_pointer));
+    *((dsa_pointer *) ptr) = dsaptr;

Hmm.  Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
function.  This pattern seems likely to come up a lot, and I don't
think we should require every caller to deal with it.
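
Something like this, say (a sketch; essentially just dsa_allocate plus the
memset, wrapped up):

dsa_pointer
dsa_allocate0(dsa_area *area, Size size)
{
    dsa_pointer dp = dsa_allocate(area, size);

    memset(dsa_get_address(area, dp), 0, size);
    return dp;
}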

I don't see why you think you need to add sizeof(dsa_pointer) to the
allocation and store an extra copy of the dsa_pointer in the
additional space.   You are already storing it in tbm->dsa_data and
that seems good enough.  pagetable_free() needs the value, but it has
a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
instead of fishing the pointer out of the extra space.  Also, this
kind of thing is in general not very wise because it tends to cause
alignment faults on some picky machine.  dsa_allocate() will return a
MAXALIGN'd pointer but ptr + sizeof(dsa_pointer) will only be
MAXALIGN'd if (sizeof(dsa_pointer) % MAXIMUM_ALIGNOF) == 0, which I
suspect might not be true everywhere and which is a shaky assumption
for the future even if it happens to be true today.  For example,
consider a 32-bit platform where doubles need to be 8-byte aligned.

It seems shaky to me that tbm->iterating can be set when we've created
either a shared or a backend-private iterator.  For example,
tbm_iterate() asserts that tbm->iterating is set as a way of checking
that spages and schunks will be set.  But that's not guaranteed any
more with these changes, because we might've built the shared version
of the iteration arrays rather than the backend-private version.
Maybe you ought to change it from a bool to an enum:
TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED.  And
then update all of the asserts and tests to check for the specific
state they care about.
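
Something like this, perhaps (sketch; pick whatever names seem best):

typedef enum
{
    TBM_NOT_ITERATING,          /* no iteration arrays built yet */
    TBM_ITERATING_PRIVATE,      /* backend-private iteration arrays built */
    TBM_ITERATING_SHARED        /* shared (DSA) iteration arrays built */
} TBMIteratingState;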

+        while (schunkbit < PAGES_PER_CHUNK)
+        {
+            int            wordnum = WORDNUM(schunkbit);
+            int            bitnum = BITNUM(schunkbit);
+
+            if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
+                break;
+            schunkbit++;
+        }

How about refactoring this chunk of code into a static inline function
and having both tbm_iterate() and tbm_shared_iterate() call it?
Probably something like static inline void
tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).
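Sketch, with the body lifted straight from the existing loop (using the
PagetableEntry type from tidbitmap.c):

static inline void
tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
{
    int         schunkbit = *schunkbitp;

    while (schunkbit < PAGES_PER_CHUNK)
    {
        int         wordnum = WORDNUM(schunkbit);
        int         bitnum = BITNUM(schunkbit);

        if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
            break;
        schunkbit++;
    }

    *schunkbitp = schunkbit;
}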

+        /* scan bitmap to extract individual offset numbers */
+        ntuples = 0;
+        for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
+        {
+            bitmapword    w = page->words[wordnum];
+
+            if (w != 0)
+            {
+                int            off = wordnum * BITS_PER_BITMAPWORD + 1;
+
+                while (w != 0)
+                {
+                    if (w & 1)
+                        output->offsets[ntuples++] = (OffsetNumber) off;
+                    off++;
+                    w >>= 1;
+                }
+            }
+        }

Similarly, this looks like it could be refactored into a shared static
inline function as well, instead of duplicating it.
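
For instance (sketch; name it however you like, returning the tuple count
so both callers can set ntuples from it):

static inline int
tbm_extract_page_tuples(PagetableEntry *page, TBMIterateResult *output)
{
    int         wordnum;
    int         ntuples = 0;

    for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
    {
        bitmapword  w = page->words[wordnum];

        if (w != 0)
        {
            int         off = wordnum * BITS_PER_BITMAPWORD + 1;

            while (w != 0)
            {
                if (w & 1)
                    output->offsets[ntuples++] = (OffsetNumber) off;
                off++;
                w >>= 1;
            }
        }
    }

    return ntuples;
}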


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks, the external interface to this looks much cleaner now.
> Scrutinizing the internals:

Thanks for the comments; I am working on them. I have a few doubts about
some of the comments.

> I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

This is easily doable and it will look much cleaner. While doing it, I am
facing one problem related to relptr_store and relptr_access. When I call
"relptr_store(base, rp, val)" with base and val being the same address,
it stores the offset as 0, which is fine; but when we use relptr_access
with a 0 offset, it returns NULL instead of giving back the actual
address. I would expect it to return base directly when the offset is 0.
(I am facing this problem because in this case (TBM_ONE_PAGE) there is
only one page, and its address is the same as base.)

One might ask why we did not face this issue while converting the
pagetable into the page array, since at least one entry will have the
same address as the base. Coincidentally we did not hit it there, because
in that case we are currently storing a dsa_pointer at the head of the
element array, so the first pagetable element is not the same as the base.

There are multiple possible solutions, but none of them looks clean to me.
sol1: If relptr_access returns NULL, directly use the base address as our
PagetableEntry.
sol2: Instead of using base as the start of the element array, use a
shifted base, e.g. base = base - some offset.
sol3: Change relptr_access so that it does not return NULL in this case.
But this changes the current behaviour of the interface, and the side
effects would need to be analyzed.

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space.   You are already storing it in tbm->dsa_data and
> that seems good enough.  pagetable_free() needs the value, but it has
> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
> instead of fishing the pointer out of the extra space.

If you look at the code of SH_GROW, it first allocates the bigger chunk,
copies the data from the smaller chunk into the bigger one, and then
frees the smaller chunk.

So by the time it calls pagetable_free, it has already called
pagetable_allocate for the new, bigger chunk of memory; tbm->dsa_data now
points to the latest memory, but pagetable_free wants to free the older
one.

One solution could be to keep two dsa_pointers in the TBM, one pointing
to the old memory and another to the new. (With that we run into the same
relptr_access problem again, because the first entry pointer will be the
same as the base pointer.)

Please share your thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 7:17 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> This is easily doable and it will look much cleaner. While doing this
> I am facing one problem related to
> relptr_store and relptr_access. The problem is that when I call
> "relptr_store  (base, rp, val)" with both base and val as a same
> address then it will store offset as 0 which is fine, but when we use
> relptr_access with 0 offset it is returning NULL instead of giving the
> actual address. I expect it should return directly base when offset is
> 0. (I am facing this problem because in this case (TBM_ONE_PAGE),
> there will be only one page and address of that will be same as
> base.).
>
> There can be multiple solutions to this but none of them looks cleaner to me.
> sol1: If relptr_access return NULL then directly use the base address
> as our PagetableEntry.
> sol2: Instead of using base as start of the element array, just try to
> use some modified base as e.g base=base - some number.
> sol3: change relptr_access to not return NULL in this case. But, this
> will change the current behaviour of this interface and need to
> analyze the side effects.

Hmm, yeah, that's a problem.  How about not using relative pointers
here, and instead just using array indexes?

>> I don't see why you think you need to add sizeof(dsa_pointer) to the
>> allocation and store an extra copy of the dsa_pointer in the
>> additional space.   You are already storing it in tbm->dsa_data and
>> that seems good enough.  pagetable_free() needs the value, but it has
>> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
>> instead of fishing the pointer out of the extra space.
>
> If you see the code of SH_GROW, first it needs to allocate the bigger
> chunk copy data from smaller chunk to the bigger chunk and then free
> the smaller chunk.

Oh, I see.

> So by the time it call the pagetable_free, it would have already
> called the pagetable_allocate for the newer bigger chunk of memory so
> now, tbm->dsa_data points to the latest memory, but pagetable_free
> wants to free older one.
>
> One solution to this could be that I keep two dsa_pointer in TBM, one
> point to old memory and another to new. (After this here we will get
> the same problem of relptr_access because now we will have first entry
> pointer is same as base pointer)

Yes, two dsa_pointers seems fine.  Maybe that's not totally beautiful,
but it seems better than what you had before.
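
Concretely, I'm imagining something along these lines (a rough sketch;
the field names are made up, and it assumes the dsa_allocate0() from my
earlier mail):

static inline void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->private_data;

    if (tbm->dsa == NULL)
        return MemoryContextAllocExtended(pagetable->ctx, size,
                                          MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);

    /* remember the old allocation so that pagetable_free can release it */
    tbm->dsapagetableold = tbm->dsapagetable;
    tbm->dsapagetable = dsa_allocate0(tbm->dsa, size);
    return dsa_get_address(tbm->dsa, tbm->dsapagetable);
}

static inline void
pagetable_free(pagetable_hash *pagetable, void *pointer)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->private_data;

    if (tbm->dsa == NULL)
    {
        pfree(pointer);
        return;
    }

    /* during a resize we free the previous array, otherwise the current one */
    if (DsaPointerIsValid(tbm->dsapagetableold))
    {
        dsa_free(tbm->dsa, tbm->dsapagetableold);
        tbm->dsapagetableold = InvalidDsaPointer;
    }
    else
    {
        dsa_free(tbm->dsa, tbm->dsapagetable);
        tbm->dsapagetable = InvalidDsaPointer;
    }
}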

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer?  All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode.  In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory.  So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
> think you should should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks".  "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> +    if (tbm->dsa == NULL)
> +        return pfree(pointer);
>
> Don't try to return a value of type void.  The correct spelling of
> this is { pfree(pointer); return; }.  Formatted appropriately across 4
> lines, of course.

Fixed

>
> +    /*
> +     * If TBM is in iterating phase that means pagetable is already created
> +     * and we have come here during tbm_free. By this time we are already
> +     * detached from the DSA because the GatherNode would have been shutdown.
> +     */
> +    if (tbm->iterating)
> +        return;
>
> This seems like something of a kludge, although not a real easy one to
> fix.  The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown.  It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it.  It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself.  So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
>      if (node == NULL)
>          return false;
>
> +    planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
>      switch (nodeTag(node))
>      {
>          case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
>              break;
>      }
>
> -    return planstate_tree_walker(node, ExecShutdownNode, NULL);
> +    return false;
>  }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c
> b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
>  void
>  ExecEndGather(GatherState *node)
>  {
> +    ExecEndNode(outerPlanState(node));      /* let children clean up first */
>      ExecShutdownGather(node);
>      ExecFreeExprContext(&node->ps);
>      ExecClearTuple(node->ps.ps_ResultTupleSlot);
> -    ExecEndNode(outerPlanState(node));
>  }
>
>  /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done
(gather_shutdown_children_first.patch does what you mentioned above, and
0002 implements the BitmapHeapScanShutdown function).

>
> +    char       *ptr;
> +    dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
> +    tbm->dsa_data = dsaptr;
> +    ptr = dsa_get_address(tbm->dsa, dsaptr);
> +    memset(ptr, 0, size + sizeof(dsa_pointer));
> +    *((dsa_pointer *) ptr) = dsaptr;
>
> Hmm.  Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
> function.  This pattern seems likely to come up a lot, and I don't
> think we should require every caller to deal with it.
Done

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space.   You are already storing it in tbm->dsa_data and

Done with the help of two pointers as discussed in another mail [1].

[1] https://www.postgresql.org/message-id/CA%2BTgmoaWsCJ5L2du9i1RQaegaNLgOYF2KgRFu%2B7sUAeQc_xBFg%40mail.gmail.com

>
> It seems shaky to me that tbm->iterating can be set when we've created
> either a shared or a backend-private iterator.  For example,
> tbm_iterate() asserts that tbm->iterating is set as a way of checking
> that spages and schunks will be set.  But that's not guaranteed any
> more with these changes, because we might've built the shared version
> of the iteration arrays rather than the backend-private version.
> Maybe you ought to change it from a bool to an enum:
> TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED.  And
> then update all of the asserts and tests to check for the specific
> state they care about.

Done
>
> +        while (schunkbit < PAGES_PER_CHUNK)
> +        {
> +            int            wordnum = WORDNUM(schunkbit);
> +            int            bitnum = BITNUM(schunkbit);
> +
> +            if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
> +                break;
> +            schunkbit++;
> +        }
>
> How about refactoring this chunk of code into a static inline function
> and having both tbm_iterate() and tbm_shared_iterate() call it?
> Probably something like static inline void
> tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).

Good idea, done this way
>
> +        /* scan bitmap to extract individual offset numbers */
> +        ntuples = 0;
> +        for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
> +        {
> +            bitmapword    w = page->words[wordnum];
> +
> +            if (w != 0)
> +            {
> +                int            off = wordnum * BITS_PER_BITMAPWORD + 1;
> +
> +                while (w != 0)
> +                {
> +                    if (w & 1)
> +                        output->offsets[ntuples++] = (OffsetNumber) off;
> +                    off++;
> +                    w >>= 1;
> +                }
> +            }
> +        }
>
> Similarly, this looks like it could be refactored into a shared static
> inline function as well, instead of duplicating it.

Done.

Attached patches:

interface_dsa_allocate0.patch: provides the new interface dsa_allocate0,
which is used by 0001.
gather_shutdown_childeren_first.patch: makes Gather process its child
nodes before shutting down itself; 0002 uses that to shut down the Bitmap
node before the Gather.

0001: tidbitmap changes to provide the interfaces for a shared iterator.
0002: the actual parallel bitmap heap scan, using the interfaces from 0001.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001

Committed.  I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate().  Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread.  I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+    dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
+    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    dsa_pointer ptpages;        /* dsa_pointer to the page array */
+    dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+    int            spageptr;        /* next spages index */
+    int            schunkptr;        /* next schunks index */
+    int            schunkbit;        /* next bit to check in current schunk */
+    LWLock        lock;            /* lock to protect above members */
+    dsa_pointer pagetable;        /* dsa pointers to head of pagetable data */
+    dsa_pointer spages;            /* dsa pointer to page array */
+    dsa_pointer schunks;        /* dsa pointer to chunk array */
+    int            nentries;        /* number of entries in pagetable */
+    int            maxentries;        /* limit on same to meet maxbytes */
+    int            npages;            /* number of exact entries in
pagetable */
+    int            nchunks;        /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order.
Let's order it like this: nentries, maxentries, npages, nchunks,
pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
chunkptr, schunkbit.

+struct TBMSharedIterator
+{
+    PagetableEntry *ptbase;        /* pointer to the pagetable element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    int           *ptpages;        /* sorted exact page index list */
+    int           *ptchunks;        /* sorted lossy page index list */
+    TBMSharedIteratorState *state;        /* shared members */
+    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state?  It doesn't seem like we even use it for anything.  It's needed
to create each backend's TBMSharedIterator, but not afterwards, I
think.

In terms of ordering, I'd move "state" to the top of the structure,
just like "tbm" comes first in a TBMIterator.

+ * total memory consumption.  If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version.  Also, you need the word
"the" before the first one.

+    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator.  I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the
DSA passed to tbm_create, so that multiple workers can attach to it
and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that".  It's
basically redundant with what you already said.

+    dsa_pointer iterator;
+    TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the
state associated with the former.  They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

+    /*
+     * If we have a hashtable, create and fill the sorted page lists, unless
+     * we already did that for a previous iterator.  Note that the lists are
+     * attached to the bitmap not the iterator, so they can be used by more
+     * than one iterator.  However, we keep dsa_pointer to these in the shared
+     * iterator so that other workers can access them directly.
+     */

This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
really correct here now because (1) we're no longer trying to fake up
a TIDBitmap proper in every backend and (2) this code runs even if we
don't have a hashtable.  I think the comment should be something like
"If we're not already iterating, create and fill the sorted page
lists.  (If we are, the sorted page lists are already stored in the
TIDBitmap, and we can just reuse them.)"

+         * Create page list and chunk list using relptr so that we can share
+         * this information across multiple workers.

No relptrs any more.

+            tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));

Extra parentheses.

+                                         tbm->nchunks * (sizeof(int)));

Extra parentheses.

+         * If TBM status is TBM_HASH then iterate over the pagetable and

"If the TBM status is"...

+             * directly store it's index i.e. 0 in page array

s/it's/its/

Don't you also need some code here to handle the TBM_EMPTY case?

+    /*
+     * Store the TBM member in the shared state so that we can share them
+     * across multiple workers.
+     */
+    iterator_state->maxentries = tbm->maxentries;
+    iterator_state->nchunks = tbm->nchunks;
+    iterator_state->nentries = tbm->nentries;
+    iterator_state->npages = tbm->npages;
+    iterator_state->pagetable = tbm->dsapagetable;
+    iterator_state->spages = tbm->ptpages;
+    iterator_state->schunks = tbm->ptchunks;
+
+    /* Initialize the shared iterator state */
+    iterator_state->schunkbit = 0;
+    iterator_state->schunkptr = 0;
+    iterator_state->spageptr = 0;
+
+    /* Initialize the iterator lock */
+    LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);

Set all of the structure members in the same order that you declare them.

+ * tbm_extract_page_tuple - extract the tuple offsets from the page

s/the page/a page/

+ * Process the page bits to extract the tuple offsets and store them into
+ * TBMIterateResult.

This duplicates the preceding, a bit.  Maybe just "The extracted
offsets are stored into the TBMIterateResult".

+/*
+ *    tbm_advance_schunkbit
+ *
+ *    Advance the chunkbit to get the next page entry from the chunk
+ */

The formatting of this header comment is randomly different from the
preceding and following header comments.

I would change the argument to schunkbitp, declare a local variable
named schunkbit, and do int schunkbit = *schunkbitp at the top and
*schunkbitp = schunkbit at the bottom.
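
Something like this, perhaps (just a sketch of the resulting shape,
using the existing WORDNUM/BITNUM macros from tidbitmap.c):

static inline void
tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
{
    int         schunkbit = *schunkbitp;

    while (schunkbit < PAGES_PER_CHUNK)
    {
        int         wordnum = WORDNUM(schunkbit);
        int         bitnum = BITNUM(schunkbit);

        /* stop at the first bit that is still set in this chunk */
        if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
            break;
        schunkbit++;
    }

    *schunkbitp = schunkbit;
}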

+ *    As above, but this will iterate using shared iterator which is shared
+ *    across multiple workers.  We need to acquire the iterator LWLock, before
+ *    accessing the shared members.

"using shared iterator which is shared" -> "using an iterator which is shared"
"multiple workers" -> "multiple processes"

+        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];

Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.

+        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
+        PagetableEntry *page = ptbase + ptpages[state->spageptr];
+        PagetableEntry *page = ptbase + ptpages[state->spageptr];

Similarly.

+ * tbm_end_shared_iterate - finish an iteration over a TIDBitmap

Maybe s/an iteration/a shared iteration/

+ * As above, but it frees the memory of TBMSharedIterator.

Instead of this, I'd write "This doesn't free any of the shared state
associated with the iterator, just our backend-private state."

+    PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
+    PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);

Again, try to use array dereference notation rather than pointer
arithmetic where you can.  Maybe:

PagetableEntry *base = (PagetableEntry *) arg;
PagetableEntry *lpage = &base[*(int *) left];
etc.

+tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
+{
+    TBMSharedIterator *shared_iterator;
+    TBMSharedIteratorState *shared_state;

Again, variable naming issues.  Let's go with dsa_pointer dp,
TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
try to go through and always use the same variable name for the same
kind of object.  It's confusing when "iterator" sometimes means a
TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
sometimes a dsa_pointer.  Make it all consistent (and hopefully
logical).

+    shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
+                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));

Wrong sizeof.  (This isn't just a style issue - it's a bug!)

+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.

Inconsistent capitalization of "DSA" again.

Maybe change the whole comment to "Allocate memory for hashtable
elements, using DSA if available."

+    if (tbm->dsapagetable)

Test against InvalidDsaPointer.  Or maybe just get rid of the "if" and
do it unconditionally; if dsapagetable is InvalidDsaPointer then
dsapagetableold will be too, so no harm done.

+    LWTRANCHE_PARALLEL_TBM_ITERATOR,

Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
lwlock.c, and patch the docs with the new tranche name.

Despite this fairly extensive list of mostly-trivial cosmetic
problems, I think this is in fairly good shape now.  I think the basic
design is right, so once we get these loose ends tied up this should
be good to go in.

> 0002: actual parallel bitmap heap scan by using interfaces of 0001.

This needs a lot more extensive review than I quite have the time and
energy for right now, but here are a few comments.

+         <entry><literal>ParallelBitmapScan</></entry>

ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
you're waiting for the bitmap to be populated.

+         <entry>Waiting for leader backend to complete the TidBitmap.</entry>

Likewise, complete -> populate?

+    /* If any limit was set to zero, the user doesn't want a parallel scan. */

If we got zero?  There aren't multiple limits mentioned anywhere
nearby, so this reads oddly.

+        /*
+         * It may be possible to amortize some of the I/O cost, but probably
+         * not very much, because most operating systems already do aggressive
+         * prefetching.  For now, we assume that the disk run cost can't be
+         * amortized at all.
+         */

This looks like it was cut-and-pasted from the sequential scan case,
but it doesn't really apply here.  This presupposes we're relying on
OS prefetching, but in the case of a bitmap heap scan we do our own
prefetching.  Maybe change to "Since we do prefetching in both the
parallel and non-parallel cases, we don't expect to get substantially
more I/O parallelism by spreading the scan across multiple backends.
So, assume the disk run cost can't be amortized at all."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed
>
> +    dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
> +    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    dsa_pointer ptpages;        /* dsa_pointer to the page array */
> +    dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way
>
> +typedef struct TBMSharedIteratorState
> +{
> +    int            spageptr;        /* next spages index */
> +    int            schunkptr;        /* next schunks index */
> +    int            schunkbit;        /* next bit to check in current schunk */
> +    LWLock        lock;            /* lock to protect above members */
> +    dsa_pointer pagetable;        /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;            /* dsa pointer to page array */
> +    dsa_pointer schunks;        /* dsa pointer to chunk array */
> +    int            nentries;        /* number of entries in pagetable */
> +    int            maxentries;        /* limit on same to meet maxbytes */
> +    int            npages;            /* number of exact entries in
> pagetable */
> +    int            nchunks;        /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> chunkptr, schunkbit.

Done
>
> +struct TBMSharedIterator
> +{
> +    PagetableEntry *ptbase;        /* pointer to the pagetable element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    int           *ptpages;        /* sorted exact page index list */
> +    int           *ptchunks;        /* sorted lossy page index list */
> +    TBMSharedIteratorState *state;        /* shared members */
> +    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state?  It doesn't seem like we even use it for anything.  It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

>
> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.
>
> + * total memory consumption.  If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version.  Also, you need the word
> "the" before the first one.

Fixed all such instances.

>
> +    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.
Fixed
>
> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator.  I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.
>
> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
>
> + * This will convert the pagetable hash into page and chunk array of the index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that".  It's
> basically redundant with what you already said.

Done
>
> +    dsa_pointer iterator;
> +    TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former.  They're both the same object.
> Maybe just "dp" and "istate".

Done
>
> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

Done
>
> +    /*
> +     * If we have a hashtable, create and fill the sorted page lists, unless
> +     * we already did that for a previous iterator.  Note that the lists are
> +     * attached to the bitmap not the iterator, so they can be used by more
> +     * than one iterator.  However, we keep dsa_pointer to these in the shared
> +     * iterator so that other workers can access them directly.
> +     */
>
> This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
> really correct here now because (1) we're no longer trying to fake up
> a TIDBitmap proper in every backend and (2) this code runs even if we
> don't have a hashtable.  I think the comment should be something like
> "If we're not already iterating, create and fill the sorted page
> lists.  (If we are, the sorted page lists are already stored in the
> TIDBitmap, and we can just reuse them.)"

Done
>
> +         * Create page list and chunk list using relptr so that we can share
> +         * this information across multiple workers.
>
> No relptrs any more.

Done
>
> +            tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));
>
> Extra parentheses.
>
> +                                         tbm->nchunks * (sizeof(int)));
>
> Extra parentheses.
Fixed
>
> +         * If TBM status is TBM_HASH then iterate over the pagetable and
>
> "If the TBM status is"...
>
> +             * directly store it's index i.e. 0 in page array
>
> s/it's/its/
Done
>
> Don't you also need some code here to handle the TBM_EMPTY case?

IMHO, we don't need to do anything for TBM_EMPTY because npages and
nchunks will be zero, so the iterator will handle such cases (the same
as is done for the non-parallel case).

>
> +    /*
> +     * Store the TBM member in the shared state so that we can share them
> +     * across multiple workers.
> +     */
> +    iterator_state->maxentries = tbm->maxentries;
> +    iterator_state->nchunks = tbm->nchunks;
> +    iterator_state->nentries = tbm->nentries;
> +    iterator_state->npages = tbm->npages;
> +    iterator_state->pagetable = tbm->dsapagetable;
> +    iterator_state->spages = tbm->ptpages;
> +    iterator_state->schunks = tbm->ptchunks;
> +
> +    /* Initialize the shared iterator state */
> +    iterator_state->schunkbit = 0;
> +    iterator_state->schunkptr = 0;
> +    iterator_state->spageptr = 0;
> +
> +    /* Initialize the iterator lock */
> +    LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);
>
> Set all of the structure members in the same order that you declare them.

Done.
>
> + * tbm_extract_page_tuple - extract the tuple offsets from the page
>
> s/the page/a page/
>
> + * Process the page bits to extract the tuple offsets and store them into
> + * TBMIterateResult.
>
> This duplicates the preceding, a bit.  Maybe just "The extracted
> offsets are stored into the TBMIterateResult".

Done
>
> +/*
> + *    tbm_advance_schunkbit
> + *
> + *    Advance the chunkbit to get the next page entry from the chunk
> + */
>
> The formatting of this header comment is randomly different from the
> preceding and following header comments.

It seems different functions in the file use different formatting, but
I have made it match the nearby functions.
>
> I would change the argument to schunkbitp, declare a local variable
> named schunkbit, and do int schunkbit = *schunkbitp at the top and
> *schunkbitp = schunkbit at the bottom.
>

Fixed

> + *    As above, but this will iterate using shared iterator which is shared
> + *    across multiple workers.  We need to acquire the iterator LWLock, before
> + *    accessing the shared members.
>
> "using shared iterator which is shared" -> "using an iterator which is shared"
> "multiple workers" -> "multiple processes"

Done
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
>
> Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
>
> Similarly.

Done
>
> + * tbm_end_shared_iterate - finish an iteration over a TIDBitmap
>
> Maybe s/an iteration/a shared iteration/
>
> + * As above, but it frees the memory of TBMSharedIterator.
>
> Instead of this, I'd write "This doesn't free any of the shared state
> associated with the iterator, just our backend-private state."

Done
>
> +    PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
> +    PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);
>
> Again, try to use array dereference notation rather than pointer
> arithmetic where you can.  Maybe:
>
> PageTableEntry *base = (PageTableEntry *) arg;
> PageTableEntry *lpage = &arg[*(int *) left];
> etc.
Done

>
> +tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
> +{
> +    TBMSharedIterator *shared_iterator;
> +    TBMSharedIteratorState *shared_state;
>
> Again, variable naming issues.  Let's go with dsa_pointer dp,
> TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
> try to go through and always use the same variable name for the same
> kind of object.  It's confusing when "iterator" sometimes means a
> TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
> sometimes a dsa_pointer.  Make it all consistent (and hopefully
> logical).
Done
>
> +    shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
> +                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
>
> Wrong sizeof.  (This isn't just a style issue - it's a bug!)

Right, it's a bug. Fixed.
>
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
>
> Inconsistent capitalization of "DSA" again.
>
> Maybe change the whole comment to "Allocate memory for hashtable
> elements, using DSA if available."

Fixed
>
> +    if (tbm->dsapagetable)
>
> Test against InvalidDsaPointer.  Or maybe just get rid of the "if" and
> do it unconditionally; if dsapagetable is InvalidDsaPointer then
> dsapagetableold will be too, so no harm done.

Right
>
> +    LWTRANCHE_PARALLEL_TBM_ITERATOR,
>
> Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
> lwlock.c, and patch the docs with the new tranche name.

Fixed
>
> Despite this fairly extensive list of mostly-trivial cosmetic
> problems, I think this is in fairly good shape now.  I think the basic
> design is right, so once we get these loose ends tied up this should
> be good to go in.
>
>> 0002: actual parallel bitmap heap scan by using interfaces of 0001.
>
> This needs a lot more extensive review than I quite have the time and
> energy for right now, but here are a few comments.
>
> +         <entry><literal>ParallelBitmapScan</></entry>
>
> ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
> you're waiting for the bitmap to be populated.
Done
>
> +         <entry>Waiting for leader backend to complete the TidBitmap.</entry>
>
> Likewise, complete -> populate?
Done
>
> +    /* If any limit was set to zero, the user doesn't want a parallel scan. */
>
> If we got zero?  There aren't multiple limits mentioned anywhere
> nearby, so this reads oddly.

Removed
>
> +        /*
> +         * It may be possible to amortize some of the I/O cost, but probably
> +         * not very much, because most operating systems already do aggressive
> +         * prefetching.  For now, we assume that the disk run cost can't be
> +         * amortized at all.
> +         */
>
> This looks like it was cut-and-pasted from the sequential scan case,
> but it doesn't really apply here.  This presupposes we're relying on
> OS prefetching, but in the case of a bitmap heap scan we do our own
> prefetching.  Maybe change to "Since we do prefetching in both the
> parallel and non-parallel cases, we don't expect to get substantially
> more I/O parallelism by spreading the scan across multiple backends.
> So, assume the disk run cost can't be amortized at all."

Removed this comment.


Apart from these, there are some more changes.
In 0001:
- Improved the comments for TBMSharedIteratorState and TBMSharedIterator.
- Added error handling for the pointer returned from dsa_allocate.
- Moved function declarations in tidbitmap.h (functionally related ones
are kept together).

In 0002:
- Improved comments.
- Code refactoring in BitmapHeapNext.
- Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
it's of no use.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> in 0002:
> - Improved comments.
> - Code refactoring in BitmapHeapNext.
> - Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
> it's of no use.

I have observed one problem with 0002, and I thought of sharing it
before fixing it because we might have encountered the same problem in
some other patches (e.g. parallel shared hash), and there might already
be a way to handle it.

The problem is that in ExecEndBitmapHeapScan we call tbm_free, which
frees the local pagetable memory and the shared memory via a callback
routine.  If some worker other than the actual backend executing the
Gather node becomes the leader (the worker responsible for creating the
shared TBM), then it will finish its work and free the shared memory by
calling ExecEndBitmapHeapScan while the other workers may still be
operating on that shared memory.

So far I have never observed this issue in our tests, maybe because
either the backend becomes the leader, or by the time the (non-backend)
leader frees the TBM the others have also finished their work.

Solution:
- One solution to this problem could be that the leader does not
complete the scan until all the other workers have finished their work.
- Alternatively, we don't make anyone wait, but arrange that only the
last worker to finish the bitmap scan frees the memory.  One problem
with this solution is that the last worker can be a non-leader, which
will have access to the shared memory, but then how do we free the
leader's local pagetable memory?

Any suggestions on this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Amit Kapila
Date:
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> in 0002:
>> - Improved comments.
>> - Code refactoring in BitmapHeapNext.
>> - Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
>> it's of no use.
>
> I have observed one problem with 0002 and I though of sharing that
> before fixing the same because we might have encountered the same
> problem in some other patches i.e parallel shared hash and there might
> be already a way to handle that.
>
> The problem is that In ExecEndBitmapHeapScan we do tbm_free. Wherein,
> it frees local pagetable memory and the shared memory using callback
> routine and if other than the Backend (actual backend for the client
> which is executing gather node) node any other worker become the
> leader (worker which is responsible for creating the shared TBM) then
> it will finish it work and free the shared memory by calling
> ExecEndBitmapHeapScan, and it's possible that other worker are still
> operating on the shared memory.
> So far I never observed this issue in our test may be because either
> Backend become the leader or by the time leader (non backend) free the
> TBM others also finishes there work.
>
> Solution:
> - One solution to this problem can be that leader should not complete
> the scan unless all other worker have finished the work.

For Parallel Seq Scan, we do wait for parallel workers to finish
before freeing the shared memory.  Why is the case different here?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have observed one problem with 0002 and I though of sharing that
> before fixing the same because we might have encountered the same
> problem in some other patches i.e parallel shared hash and there might
> be already a way to handle that.
>
> The problem is that In ExecEndBitmapHeapScan we do tbm_free. Wherein,
> it frees local pagetable memory and the shared memory using callback
> routine and if other than the Backend (actual backend for the client
> which is executing gather node) node any other worker become the
> leader (worker which is responsible for creating the shared TBM) then
> it will finish it work and free the shared memory by calling
> ExecEndBitmapHeapScan, and it's possible that other worker are still
> operating on the shared memory.
>
> So far I never observed this issue in our test may be because either
> Backend become the leader or by the time leader (non backend) free the
> TBM others also finishes there work.
>
> Solution:
> - One solution to this problem can be that leader should not complete
> the scan unless all other worker have finished the work.
> - We can also think of that we don't make anyone wait but we make a
> arrangement that last worker to finish the bitmapscan only free the
> memory, but one problem with this solution is that last worker can be
> non-leader also, which will have access to shared memory but how to
> free the pagetable local memory (it's local to the leader).
>
> Any suggestion on this ?

It's probably OK if tbm_free() doesn't free the memory allocated from
DSA, and we just let cleanup at end of query do it.  However, that
could cause some trouble if the Parallel Bitmap Heap Scan gets
executed over and over and keeps allocating more and more memory from
DSA.  I think the way to fix that would be to maintain a reference
count that starts at 1 when the iterator arrays are created and gets
incremented every time a TBMSharedIteratorState is created.  It gets
decremented when the TIDBitmap that has the iterator arrays is
destroyed, and each time a TBMSharedIteratorState is destroyed.
When it reaches 0, the process that reduces the reference count to 0
calls dsa_free on the DSA pointers for pagetable, spages, and schunks.
(Also, if a TIDBitmap is freed before iteration begins, it frees the
DSA pointer for the pagetable only; the others won't have values yet.)
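
To sketch the idea (field and function names here are illustrative, not
from the patch, and the locking needed around the refcount is omitted):

typedef struct TBMSharedArrayState
{
    int         refcount;       /* 1 for the creating TIDBitmap, +1 per
                                 * attached TBMSharedIteratorState */
    dsa_pointer pagetable;      /* dsa_pointer to the element array */
    dsa_pointer spages;         /* dsa_pointer to the page array */
    dsa_pointer schunks;        /* dsa_pointer to the chunk array */
} TBMSharedArrayState;

static void
tbm_release_shared(dsa_area *dsa, TBMSharedArrayState *astate)
{
    if (--astate->refcount > 0)
        return;

    /* last reference gone: free everything allocated from the DSA */
    if (DsaPointerIsValid(astate->pagetable))
        dsa_free(dsa, astate->pagetable);
    if (DsaPointerIsValid(astate->spages))
        dsa_free(dsa, astate->spages);
    if (DsaPointerIsValid(astate->schunks))
        dsa_free(dsa, astate->schunks);
}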

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sun, Feb 19, 2017 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It's probably OK if tbm_free() doesn't free the memory allocated from
> DSA, and we just let cleanup at end of query do it.  However, that
> could cause some trouble if the Parallel Bitmap Heap Scan gets
> executed over and over and keeps allocating more and more memory from
> DSA.

Is it possible that Parallel Bitmap Heap Scan will be executed
multiple times without shutting down the workers?

I can imagine it getting executed over and over if the plan is
something like the one below.

NestLoopJoin
    -> SeqScan
    -> Gather
        -> Parallel Bitmap Heap Scan

But in such a case, every time the inner node of the NLJ will be
rescanned, i.e. the Gather will be rescanned, which in turn shuts down
the workers.

So basically, what I want to propose is that only during
ExecReScanBitmapHeapScan do we free all the DSA pointers, because at
that time we can be sure that all the workers have completed their
task and it is safe to free.  (And we don't free any DSA memory in
ExecEndBitmapHeapScan.)
> I think the way to fix that would be to maintain a reference
> count that starts at 1 when the iterator arrays are created and gets
> incremented every time a TBMSharedIteratorState is created.  It gets
> decremented when the TIDBitmap is destroyed that has iterator arrays
> is destroyed, and each time a TBMSharedIteratorState is destroyed.
> When it reaches 0, the process that reduces the reference count to 0
> calls dsa_free on the DSA pointers for pagetable, spages, and schunks.
> (Also, if a TIDBitmap is freed before iteration begins, it frees the
> DSA pointer for the pagetable only; the others won't have values yet.)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I can imagine it can get executed over and over if plan is something like below.
>
> NestLoopJoin
>     -> SeqScan
>     -> Gather
>         -> Parallel Bitmap Heap Scan
>
> But in such case every time the Inner node of the NLJ will be
> rescanned i.e. Gather will be rescanned which in turn shutdown
> workers.

Yeah, but it looks like ExecReScanGather gets rid of the workers, but
reuses the existing DSM.  I'm not quite sure what happens to the DSA.
It looks like it probably just hangs around from the previous
iteration, which means that any allocations will also hang around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Thomas Munro
Date:
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I can imagine it can get executed over and over if plan is something like below.
>>
>> NestLoopJoin
>>     -> SeqScan
>>     -> Gather
>>         -> Parallel Bitmap Heap Scan
>>
>> But in such case every time the Inner node of the NLJ will be
>> rescanned i.e. Gather will be rescanned which in turn shutdown
>> workers.
>
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM.  I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, it hangs around.  Being able to reuse state in a rescan is a
feature: you might be able to reuse a hash table or a bitmap.

In the Parallel Shared Hash patch, the last participant to detach from
the shared hash table barrier gets a different return code and runs
some cleanup code.  The alternative was to make the leader wait for
the workers to finish accessing the hash table so that it could do
that.  I had it that way in an early version, but my goal is to
minimise synchronisation points so now it's just 'last to leave turns
out the lights' with no waiting.
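
In its simplest form, the 'last to leave turns out the lights' idea is
just an atomic participant count; a rough sketch (the field names are
invented, and the real patch does this through the barrier machinery
instead):

/* on detach: whoever drops the participant count to zero cleans up */
if (pg_atomic_sub_fetch_u32(&shared->nattached, 1) == 0)
{
    /* we are the last participant still attached: free the shared state */
    dsa_free(area, shared->hashtable_dp);
}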

One practical problem that came up was the need for executor nodes to
get a chance to do that kind of cleanup before the DSM segment is
detached.  In my patch series I introduced a new node API
ExecNodeDetach to allow for that.  Andres objected that the need for
that is evidence that the existing protocol is broken and should be
fixed instead.  I'm looking into that.

On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So basically, what I want to propose is that Only during
> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
> that time we can be sure that all the workers have completed there
> task and we are safe to free. (And we don't free any DSA memory at
> ExecEndBitmapHeapScan).

I think this works.

I also grappled a bit with the question of whether it's actually worth
trying to free DSA memory when you're finished with it, eating
precious CPU cycles at the end of a join, or just letting the
executor's DSA area get nuked at end of parallel execution.  As you
say, there is a special case for rescans to avoid leaks.  I described
this as a potential approach in a TODO note in my v5 patch, but
currently my code just does the clean-up every time on the grounds
that it's simple and hasn't shown up as a performance problem yet.

Some hand-wavy thoughts on this topic in the context of hash joins:

The argument for cleaning up sooner rather than later would be that it
could reduce the total peak memory usage of large execution plans.  Is
that a reasonable goal and can we achieve it?  I suspect the answer is
yes in theory but no in practice, and we don't even try to achieve it
in non-parallel queries as far as I know.

My understanding is that PostgreSQL's planner can generate left-deep,
bushy and right-deep hash join plans:

N nested left-deep hash joins:  Each hash join is on the outer side of
its parent, supplying tuples to probe the parent hash table.  Their
probe phases overlap, so all N hash tables must exist and be fully
loaded at the same time.

N nested right-deep hash joins:  Each hash join is on the inner side
of its parent, supplying tuples to build the hash table of its parent.
Theoretically you only need two full hash tables in memory at peak,
because you'll finish probing each one while building its parent's hash
table and then not need the child's hash table again (unless you need
to rescan).

N nested bushy hash joins:  Somewhere in between.

But we don't actually take advantage of that opportunity to reduce
peak memory today.  We always have N live hash tables and don't free
them until standard_ExecutorEnd runs ExecProcEnd on the top of the
plan.  Perhaps we could teach hash tables to free themselves ASAP at
the end of their probe phase unless they know a rescan is possible.
But that just opens a whole can of worms:  if we care about total peak
memory usage, should it become a planner goal that might favour
right-deep hash joins?  I guess similar questions must arise for
bitmap heap scan and anything else holding memory that it doesn't
technically need anymore, and parallel query doesn't really change
anything about the situation, except maybe that Gather nodes provide a
point of scoping somewhere in between 'eager destruction' and 'hog all
the space until end of plan' which makes things a bit better.  I don't
know anywhere near enough about query planners to say whether such
ideas about planning are reasonable, and am quite aware that it's
difficult terrain, and I have other fish to fry, so I'm going to put
down the can opener and back away.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> One practical problem that came up was the need for executor nodes to
> get a chance to do that kind of cleanup before the DSM segment is
> detached.  In my patch series I introduced a new node API
> ExecNodeDetach to allow for that.  Andres objected that the need for
> that is evidence that the existing protocol is broken and should be
> fixed instead.  I'm looking into that.

The thing that you really have to worry about for this kind of proposal
is "what if the query errors out and we never get to ExecEndNode"?
It's particularly nasty if you're talking about parallel queries where
maybe only one or some of the processes involved detect an error.
        regards, tom lane



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM.  I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, right, they hang around.  But during a rescan
(ExecReScanBitmapHeapScan) we can free all these DSA pointers, can't
we?  That means before reallocating the DSA pointers we would already
have gotten rid of the old ones.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> So basically, what I want to propose is that Only during
>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>> that time we can be sure that all the workers have completed there
>> task and we are safe to free. (And we don't free any DSA memory at
>> ExecEndBitmapHeapScan).
>
> I think this works.

OK.

> Some hand-wavy thoughts on this topic in the context of hash joins:
>
> The argument for cleaning up sooner rather than later would be that it
> could reduce the total peak memory usage of large execution plans.  Is
> that a reasonable goal and can we achieve it?  I suspect the answer is
> yes in theory but no in practice, and we don't even try to achieve it
> in non-parallel queries as far as I know.

We're pretty stupid about causing nodes to stop eating up resources as
early as we could; for example, when a Limit is filled, we don't make
any attempt to have scans underneath it release pins or memory or
anything else.  But we don't usually let the same node consume memory
multiple times. ExecReScanBitmapHeapScan frees all of the memory used
for the previous bitmap in the non-parallel case, so it should
probably do that in the parallel case also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> One practical problem that came up was the need for executor nodes to
>> get a chance to do that kind of cleanup before the DSM segment is
>> detached.  In my patch series I introduced a new node API
>> ExecNodeDetach to allow for that.  Andres objected that the need for
>> that is evidence that the existing protocol is broken and should be
>> fixed instead.  I'm looking into that.
>
> The thing that you really have to worry about for this kind of proposal
> is "what if the query errors out and we never get to ExecEndNode"?
> It's particularly nasty if you're talking about parallel queries where
> maybe only one or some of the processes involved detect an error.

I think that's not actually a problem, because we've already got code
to make sure that all DSM resources associated with the query get
blown away in that case.  Of course, that code might have bugs, but if
it does, I think it's better to try to fix those bugs than to insert
some belt-and-suspenders mechanism for reclaiming every possible chunk
of memory in retail fashion, just like we blow up es_query_cxt rather
than trying to pfree allocations individually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The thing that you really have to worry about for this kind of proposal
>> is "what if the query errors out and we never get to ExecEndNode"?
>> It's particularly nasty if you're talking about parallel queries where
>> maybe only one or some of the processes involved detect an error.

> I think that's not actually a problem, because we've already got code
> to make sure that all DSM resources associated with the query get
> blown away in that case.  Of course, that code might have bugs, but if
> it does, I think it's better to try to fix those bugs than to insert
> some belt-and-suspenders mechanism for reclaiming every possible chunk
> of memory in retail fashion, just like we blow up es_query_cxt rather
> than trying to pfree allocations individually.

Actually, I think we're saying the same thing: rely on the general DSM
cleanup mechanism, don't insert extra stuff that you expect will get
done by executor shutdown.
        regards, tom lane



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Feb 20, 2017 at 11:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> So basically, what I want to propose is that Only during
>>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>>> that time we can be sure that all the workers have completed there
>>> task and we are safe to free. (And we don't free any DSA memory at
>>> ExecEndBitmapHeapScan).
>>
>> I think this works.
>
> OK.

In my latest version of the patch, I have fixed it as described above,
i.e. cleanup happens only in ExecReScanBitmapHeapScan.

To get this, there are some changes in both patches.

0001:
1. I have added a new interface, "tbm_free_shared_area(dsa_area *dsa,
dsa_pointer dp)", which is responsible for freeing all the shared
members (pagetable, ptpages and ptchunks).  Actually, we cannot do this
in tbm_free itself because only the leader will have a local tbm with
references to all these pointers, and our parallel bitmap leader may
not necessarily be the actual backend.

If we wanted to do this through the tbm, we would need to create a
local tbm in each worker and copy the shared member references into it
using tbm_attach_shared_iterate, as we were doing in the earlier
version.

2. Now tbm_free does not free any of the shared members that can be
accessed by other workers, so tbm_free is safe to call from
ExecEndBitmapHeapScan without any safety check or ref count.

0002:
1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
not freeing any shared members in ExecEndBitmapHeapScan.

2. In ExecReScanBitmapHeapScan we call tbm_free_shared_area to free the
shared members of the TBM.
3. After that, we free the TBMSharedIteratorState that we allocated
using tbm_prepare_shared_iterate.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 21, 2017 at 10:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001:
> 1. I have got a new interface, "tbm_free_shared_area(dsa_area *dsa,
> dsa_pointer dp)"  which will be responsible for freeing all the shared
> members (pagetable, ptpage and ptchunk).  Actually, we can not do this
> in tbm_free itself because the only leader will have a local tbm with
> reference to all these pointers and our parallel bitmap leader may not
> necessarily be the actual backend.

I'm not entirely sure about the calling convention for
tbm_free_shared_area() but the rest seems OK.

> 2. Now tbm_free is not freeing any of the shared members which can be
> accessed by other worker so tbm_free is safe to call from
> ExecEndBitmapHeapScan without any safety check or ref count.

That also seems fine. We ended up with something very similar in the
Parallel Index Scan patch.

> 0002:
> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
> not freeing any shared member in ExecEndBitmapHeapScan.
> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
> free the shared members of the TBM.
> 3. After that, we will free TBMSharedIteratorState what we allocated
> using tbm_prepare_shared_iterate.

Check.  But I think tbm_free_shared_area() should also free the object
itself, instead of making the caller do that separately.

+        if (DsaPointerIsValid(node->pstate->tbmiterator))
+        {
+            /* First we free the shared TBM members using the shared state */
+            tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+            dsa_free(dsa, node->pstate->tbmiterator);
+        }
+
+        if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+            dsa_free(dsa, node->pstate->prefetch_iterator);

The fact that these cases aren't symmetric suggests that your
abstraction is leaky.  I'm guessing that you can't call
tbm_free_shared_area because the two iterators share one copy of the
underlying iteration arrays, and the TBM code isn't smart enough to
avoid freeing them twice.  You're going to have to come up with a
better solution to that problem; nodeBitmapHeapScan.c shouldn't know
about the way the underlying storage details are managed.  (Maybe you
need to reference-count the iterator arrays?)

+    if (node->inited)
+        goto start_iterate;

My first programming teacher told me not to use goto.  I've
occasionally violated that rule, but I need a better reason than you
have here.  It looks very easy to avoid.

+            pbms_set_parallel(outerPlanState(node));

I think this should be a flag in the plan, and the planner should set
it correctly, instead of having it be a flag in the executor that the
executor sets.  Also, the flag itself should really be called
something that involves the word "shared" rather than "parallel",
because the bitmap will not be created in parallel, but it will be
shared.

Have you checked whether this patch causes any regression in the
non-parallel case?  It adds a bunch of "if" statements, so it might.
Hopefully not, but it needs to be carefully tested.

@@ -48,10 +48,11 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"

-static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);

Unnecessary.

+static bool pbms_is_leader(ParallelBitmapState *pbminfo);
+static void pbms_set_parallel(PlanState *node);

I don't think this "pbms" terminology is very good.  It's dissimilar
to the way other functions in this file are named.  Maybe
BitmapShouldInitializeSharedState().

I think that some of the bits that this function makes conditional on
pstate should be factored out into inline functions.  Like:

-            if (node->prefetch_target >= node->prefetch_maximum)
-                 /* don't increase any further */ ;
-            else if (node->prefetch_target >= node->prefetch_maximum / 2)
-                node->prefetch_target = node->prefetch_maximum;
-            else if (node->prefetch_target > 0)
-                node->prefetch_target *= 2;
-            else
-                node->prefetch_target++;
+            if (!pstate)
+            {
+                if (node->prefetch_target >= node->prefetch_maximum)
+                     /* don't increase any further */ ;
+                else if (node->prefetch_target >= node->prefetch_maximum / 2)
+                    node->prefetch_target = node->prefetch_maximum;
+                else if (node->prefetch_target > 0)
+                    node->prefetch_target *= 2;
+                else
+                    node->prefetch_target++;
+            }
+            else if (pstate->prefetch_target < node->prefetch_maximum)
+            {
+                SpinLockAcquire(&pstate->prefetch_mutex);
+                if (pstate->prefetch_target >= node->prefetch_maximum)
+                     /* don't increase any further */ ;
+                else if (pstate->prefetch_target >=
+                         node->prefetch_maximum / 2)
+                    pstate->prefetch_target = node->prefetch_maximum;
+                else if (pstate->prefetch_target > 0)
+                    pstate->prefetch_target *= 2;
+                else
+                    pstate->prefetch_target++;
+                SpinLockRelease(&pstate->prefetch_mutex);
+            }

I suggest creating an inline function like BitmapAdjustPrefetch() for
this logic, and letting the code call that.  The function can look
something like this: if (pstate == NULL) { non-parallel stuff; return;
} parallel stuff follows...

And similarly for the other cases where you've made the logic
conditional.  This will make it more clear what's happening
post-patch, I think, and will also help keep the level of indentation
from getting out-of-control in certain places.  In fact, maybe you
should submit a preliminary refactoring patch that moves these chunks
of logic into functions and then the main patch can apply over top of
that.
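
For instance, the prefetch-target adjustment above might end up looking
something like this (a sketch derived from the hunk, not the actual
patch; I'm assuming the shared state is reachable as node->pstate):

static inline void
BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
{
    ParallelBitmapState *pstate = node->pstate;

    if (pstate == NULL)
    {
        /* serial case: grow the backend-local prefetch target */
        if (node->prefetch_target >= node->prefetch_maximum)
             /* don't increase any further */ ;
        else if (node->prefetch_target >= node->prefetch_maximum / 2)
            node->prefetch_target = node->prefetch_maximum;
        else if (node->prefetch_target > 0)
            node->prefetch_target *= 2;
        else
            node->prefetch_target++;
        return;
    }

    /* parallel case: the target lives in shared memory, adjust under the mutex */
    if (pstate->prefetch_target < node->prefetch_maximum)
    {
        SpinLockAcquire(&pstate->prefetch_mutex);
        if (pstate->prefetch_target >= node->prefetch_maximum)
             /* don't increase any further */ ;
        else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
            pstate->prefetch_target = node->prefetch_maximum;
        else if (pstate->prefetch_target > 0)
            pstate->prefetch_target *= 2;
        else
            pstate->prefetch_target++;
        SpinLockRelease(&pstate->prefetch_mutex);
    }
}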

+    bool                 inited;

Suggest: initialized

- * ----------------
+ *        pscan_len           size of the shared memory for parallel bitmap
+ *        inited               is node is ready to iterate
+ *        stbmiterator       shared iterator
+ *        sprefetch_iterator shared iterator for prefetching
+ *        pstate               shared state for parallel bitmap scan
+ *----------------

No need to change number of dashes.

+ *     PBMState information : Current status of the TIDBitmap creation during
+ *                            parallel bitmap heap scan.

If you look for existing places where comments are formatted like
this, I bet you won't find many.  Copy the surrounding style more.

+    dsa_pointer    tbmiterator;
+    dsa_pointer    prefetch_iterator;
+    slock_t        prefetch_mutex;
+    int            prefetch_pages;
+    int            prefetch_target;
+    bool        prefetching;
+    slock_t        state_mutex;
+    PBMState    state;
+    ConditionVariable cv;
+    char        phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];

I think it is probably not a good idea to have two separate mutexes
here.  They'll be in the same cache line, so it won't be much faster
than having one mutex, and the state mutex won't be acquired very
often so you won't really gain anything anyway.  I think you can just
merge the mutexes into one called 'mutex'.

+                    /* Allow only one process to prefetch */

If this is a good idea, there should be a comment explaining why.

pbms_is_leader() looks horrifically inefficient.  Every worker will
reacquire the spinlock for every tuple.  You should only have to enter
this spinlock-acquiring loop for the first tuple.  After that, either
you became the leader, did the initialization, and set the state to
PBM_FINISHED, or you waited until the pre-existing leader did the
same.  You should have a backend-local flag that keeps you from
touching the spinlock for every tuple.  I wouldn't be surprised if
fixing this nets a noticeable performance increase for this patch at
high worker counts.

+    TBMSharedIterator    *stbmiterator;
+    TBMSharedIterator    *sprefetch_iterator;

Maybe shared_iterator and shared_prefetch_iterator.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review; I will work on these.  There are some comments
on which I need suggestions.

> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice.  You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed.  (Maybe you
> need to reference-count the iterator arrays?)

Yeah, I also think the current way doesn't look very clean.  Currently,
these arrays are just arrays of integers; maybe we can use the first
slot of the array for the reference count, or convert it to a structure
which has space for a reference count and an integer array.  What do
you suggest?

>
> +    if (node->inited)
> +        goto start_iterate;
>
> My first programming teacher told me not to use goto.  I've
> occasionally violated that rule, but I need a better reason than you
> have here.  It looks very easy to avoid.

Yes, this can be avoided; I was just trying to get rid of multi-level
if nesting and ended up with the "goto".

>
> +            pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets.  Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.

Earlier, I thought that it would not be a good idea to set that flag in
the BitmapIndexScan path because the same path can be used either under
a ParallelBitmapHeapPath or under a normal BitmapHeapPath.  But now,
after putting some thought into it, I realised that we can do that in
create_plan.  Therefore, I will change this.

>
> Have you checked whether this patch causes any regression in the
> non-parallel case?  It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

During the initial patch development I tested this aspect as well, but
never published any results for it.  I will do that along with my next
patch and post the results.
>
> pbms_is_leader() looks horrifically inefficient.  Every worker will
> reacquire the spinlock for every tuple.  You should only have to enter
> this spinlock-acquiring loop for the first tuple.  After that, either
> you became the leader, did the initialization, and set the state to
> PBM_FINISHED, or you waited until the pre-existing leader did the
> same.  You should have a backend-local flag that keeps you from
> touching the spinlock for every tuple.  I wouldn't be surprised if
> fixing this nets a noticeable performance increase for this patch at
> high worker counts.

I think there is some confusion; if you notice, the code below will
avoid calling pbms_is_leader for every tuple.  It will be called only
the first time.  After that, node->inited will be true and it will jump
directly to start_iterate for subsequent calls.  Am I missing
something?

> +    if (node->inited)
> +        goto start_iterate;


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Feb 27, 2017 at 10:30 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> tbm_free_shared_area because the two iterators share one copy of the
>> underlying iteration arrays, and the TBM code isn't smart enough to
>> avoid freeing them twice.  You're going to have to come up with a
>> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
>> about the way the underlying storage details are managed.  (Maybe you
>> need to reference-count the iterator arrays?)
>
> Yeah, I also think current way doesn't look so clean, currently, these
> arrays are just integers array, may be we can use a first slot of the
> array for reference-count? or convert to the structure which has space
> for reference-count and an integers array. What do you suggest?

Maybe something like typedef struct { int refcnt; SomeType
somename[FLEXIBLE_ARRAY_MEMBER]; } SomeOtherType; would be a good
approach.
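
Fleshed out a bit, under the same caveat that the names are only
illustrative, that could look like:

/* reference-counted array of pagetable indexes, allocated from DSA */
typedef struct PTIterationArray
{
    int         refcount;       /* number of iterators/TBMs still using it */
    int         index[FLEXIBLE_ARRAY_MEMBER];  /* sorted entry indexes */
} PTIterationArray;

/*
 * Allocation then sizes the flexible array explicitly, e.g.
 *
 *     dp = dsa_allocate(tbm->dsa,
 *                       offsetof(PTIterationArray, index) +
 *                       tbm->npages * sizeof(int));
 *
 * and whichever backend drops refcount to zero calls dsa_free(dsa, dp).
 */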

>> +    if (node->inited)
>> +        goto start_iterate;
>>
>> My first programming teacher told me not to use goto.  I've
>> occasionally violated that rule, but I need a better reason than you
>> have here.  It looks very easy to avoid.
>
> Yes, this can be avoided, I was just trying to get rid of multi-level
> if nesting and end up with the "goto".

That's what I figured.

>> +            pbms_set_parallel(outerPlanState(node));
>>
>> I think this should be a flag in the plan, and the planner should set
>> it correctly, instead of having it be a flag in the executor that the
>> executor sets.  Also, the flag itself should really be called
>> something that involves the word "shared" rather than "parallel",
>> because the bitmap will not be created in parallel, but it will be
>> shared.
>
> Earlier, I thought that it will be not a good idea to set that flag in
> BitmapIndexScan path because the same path can be used either under
> ParallelBitmapHeapPath or under normal BitmapHeapPath.  But, now after
> putting some thought, I realised that we can do that in create_plan.
> Therefore, I will change this.

Cool.

>> pbms_is_leader() looks horrifically inefficient.  Every worker will
>> reacquire the spinlock for every tuple.  You should only have to enter
>> this spinlock-acquiring loop for the first tuple.  After that, either
>> you became the leader, did the initialization, and set the state to
>> PBM_FINISHED, or you waited until the pre-existing leader did the
>> same.  You should have a backend-local flag that keeps you from
>> touching the spinlock for every tuple.  I wouldn't be surprised if
>> fixing this nets a noticeable performance increase for this patch at
>> high worker counts.
>
> I think there is some confusion, if you notice, below code will avoid
> calling pbms_is_leader for every tuple.
> It will be called only first time. And, after that node->inited will
> be true and it will directly jump to start_iterate for subsequent
> calls.  Am I missing something?
>
>> +    if (node->inited)
>> +        goto start_iterate;

Oh, OK.  I guess I was just confused, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not entirely sure about the calling convention for
> tbm_free_shared_area() but the rest seems OK.

Changed
>
>> 2. Now tbm_free is not freeing any of the shared members which can be
>> accessed by other worker so tbm_free is safe to call from
>> ExecEndBitmapHeapScan without any safety check or ref count.
>
> That also seems fine. We ended up with something very similar in the
> Parallel Index Scan patch.
>
>> 0002:
>> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
>> not freeing any shared member in ExecEndBitmapHeapScan.
>> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
>> free the shared members of the TBM.
>> 3. After that, we will free TBMSharedIteratorState what we allocated
>> using tbm_prepare_shared_iterate.
>
> Check.  But I think tbm_free_shared_area() should also free the object
> itself, instead of making the caller do that separately.

Right, done that way.
>
> +        if (DsaPointerIsValid(node->pstate->tbmiterator))
> +        {
> +            /* First we free the shared TBM members using the shared state */
> +            tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +            dsa_free(dsa, node->pstate->tbmiterator);
> +        }
> +
> +        if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +            dsa_free(dsa, node->pstate->prefetch_iterator);
>
> The fact that these cases aren't symmetric suggests that your
> abstraction is leaky.  I'm guessing that you can't call
> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice.  You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed.  (Maybe you
> need to reference-count the iterator arrays?)

Converted the iterator arrays to structures and maintained a refcount.
I had to do the same thing for the pagetable as well because it is also
shared across iterators.

>
> +    if (node->inited)
> +        goto start_iterate;
>
> My first programming teacher told me not to use goto.  I've
> occasionally violated that rule, but I need a better reason than you
> have here.  It looks very easy to avoid.

Changed
>
> +            pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets.  Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.
Done
>
> Have you checked whether this patch causes any regression in the
> non-parallel case?  It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

With the new patch I tested on my local machine, performed multiple
executions, and it doesn't show any regression.  The attached file
perf_result contains the explain analyze output for one of the queries
(head, patch with 0 workers, and patch with 2 workers).  I will run the
test with all the TPC-H queries that use a bitmap plan on a bigger
machine and will post the results next week.
>
> @@ -48,10 +48,11 @@
>  #include "utils/snapmgr.h"
>  #include "utils/tqual.h"
>
> -
>  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
>
> Unnecessary.
Fixed
>
> +static bool pbms_is_leader(ParallelBitmapState *pbminfo);
> +static void pbms_set_parallel(PlanState *node);
>
> I don't think this "pbms" terminology is very good.  It's dissimilar
> to the way other functions in this file are named.  Maybe
> BitmapShouldInitializeSharedState().
Changed
>
> I think that some of the bits that this function makes conditional on
> pstate should be factored out into inline functions.  Like:
>
> -            if (node->prefetch_target >= node->prefetch_maximum)
> -                 /* don't increase any further */ ;
> -            else if (node->prefetch_target >= node->prefetch_maximum / 2)
> -                node->prefetch_target = node->prefetch_maximum;
> -            else if (node->prefetch_target > 0)
> -                node->prefetch_target *= 2;
> -            else
> -                node->prefetch_target++;
> +            if (!pstate)
> +            {
> +                if (node->prefetch_target >= node->prefetch_maximum)
> +                     /* don't increase any further */ ;
> +                else if (node->prefetch_target >= node->prefetch_maximum / 2)
> +                    node->prefetch_target = node->prefetch_maximum;
> +                else if (node->prefetch_target > 0)
> +                    node->prefetch_target *= 2;
> +                else
> +                    node->prefetch_target++;
> +            }
> +            else if (pstate->prefetch_target < node->prefetch_maximum)
> +            {
> +                SpinLockAcquire(&pstate->prefetch_mutex);
> +                if (pstate->prefetch_target >= node->prefetch_maximum)
> +                     /* don't increase any further */ ;
> +                else if (pstate->prefetch_target >=
> +                         node->prefetch_maximum / 2)
> +                    pstate->prefetch_target = node->prefetch_maximum;
> +                else if (pstate->prefetch_target > 0)
> +                    pstate->prefetch_target *= 2;
> +                else
> +                    pstate->prefetch_target++;
> +                SpinLockRelease(&pstate->prefetch_mutex);
> +            }
>
> I suggest creating an inline function like BitmapAdjustPrefetch() for
> this logic, and letting the code call that.  The function can look
> something like this: if (pstate == NULL) { non-parallel stuff; return;
> } parallel stuff follows...
>
> And similarly for the other cases where you've made the logic
> conditional.  This will make it more clear what's happening
> post-patch, I think, and will also help keep the level of indentation
> from getting out-of-control in certain places.  In fact, maybe you
> should submit a preliminary refactoring patch that moves these chunks
> of logic into functions and then the main patch can apply over top of
> that.

Done.
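
For the prefetch-target case, the refactored helper looks roughly like
this (a simplified sketch from memory; the struct and field names may
not exactly match the patch):

static inline void
BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
{
    /* pstate type/name is an assumption for this sketch */
    ParallelBitmapHeapState *pstate = node->pstate;

    if (pstate == NULL)
    {
        /* Non-parallel case: private counters only. */
        if (node->prefetch_target >= node->prefetch_maximum)
             /* don't increase any further */ ;
        else if (node->prefetch_target >= node->prefetch_maximum / 2)
            node->prefetch_target = node->prefetch_maximum;
        else if (node->prefetch_target > 0)
            node->prefetch_target *= 2;
        else
            node->prefetch_target++;
        return;
    }

    /* Parallel case: shared counters, adjusted under the mutex. */
    if (pstate->prefetch_target < node->prefetch_maximum)
    {
        SpinLockAcquire(&pstate->mutex);
        if (pstate->prefetch_target >= node->prefetch_maximum)
             /* don't increase any further */ ;
        else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
            pstate->prefetch_target = node->prefetch_maximum;
        else if (pstate->prefetch_target > 0)
            pstate->prefetch_target *= 2;
        else
            pstate->prefetch_target++;
        SpinLockRelease(&pstate->mutex);
    }
}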

>
> +    bool                 inited;
>
> Suggest: initialized

Done
>
> - * ----------------
> + *        pscan_len           size of the shared memory for parallel bitmap
> + *        inited               is node is ready to iterate
> + *        stbmiterator       shared iterator
> + *        sprefetch_iterator shared iterator for prefetching
> + *        pstate               shared state for parallel bitmap scan
> + *----------------
>
> No need to change number of dashes.
Fixed
>
> + *     PBMState information : Current status of the TIDBitmap creation during
> + *                            parallel bitmap heap scan.
>
> If you look for existing places where comments are formatted like
> this, I bet you won't find many.  Copy the surrounding style more.

Done, following the surrounding structures; also changed the name to
SharedBitmapState, which I think is a better name for the purpose.

>
> +    dsa_pointer    tbmiterator;
> +    dsa_pointer    prefetch_iterator;
> +    slock_t        prefetch_mutex;
> +    int            prefetch_pages;
> +    int            prefetch_target;
> +    bool        prefetching;
> +    slock_t        state_mutex;
> +    PBMState    state;
> +    ConditionVariable cv;
> +    char        phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
>
> I think it is probably not a good idea to have two separate mutexes
> here.  They'll be in the same cache line, so it won't be much faster
> than having one mutex, and the state mutex won't be acquired very
> often so you won't really gain anything anyway.  I think you can just
> merge the mutexes into one called 'mutex'.

Done.
>
> +                    /* Allow only one process to prefetch */
>
> If this is a good idea, there should be a comment explaining why.

Done
>
> +    TBMSharedIterator    *stbmiterator;
> +    TBMSharedIterator    *sprefetch_iterator;
>
> Maybe shared_iterator and shared_prefetch_iterator.
Done


0001- same as previous with some changes for freeing the shared memory stuff.
0002- nodeBitmapHeapScan refactoring, this applies independently
0003- actual parallel bitmap stuff applies on top of 0001 and 0002


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 28, 2017 at 9:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.
> 0002- nodeBitmapHeapScan refactoring, this applies independently
> 0003- actual parallel bitmap stuff applies on top of 0001 and 0002

0002 wasn't quite careful enough about the placement of #ifdef
USE_PREFETCH, but otherwise looks OK.  Committed after changing that
and getting rid of the local variable prefetch_iterator, which seemed
to be adding rather than removing complexity after this refactoring.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 0002 wasn't quite careful enough about the placement of #ifdef
> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
> and getting rid of the local variable prefetch_iterator, which seemed
> to be adding rather than removing complexity after this refactoring.

0003 is rebased after this commit.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 12:35 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 0002 wasn't quite careful enough about the placement of #ifdef
>> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
>> and getting rid of the local variable prefetch_iterator, which seemed
>> to be adding rather than removing complexity after this refactoring.
>
> 0003 is rebased after this commit.

You've still got this:

+               if (DsaPointerIsValid(node->pstate->tbmiterator))
+                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+
+               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+                       dsa_free(dsa, node->pstate->prefetch_iterator);

I'm trying to get to a point where both calls use
tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.
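
That is, the cleanup would ideally end up symmetric, something along
these lines (just a sketch, assuming tbm_free_shared_area learns to
handle the shared iteration arrays safely for both iterators):

/* Both iterators released through the same abstraction. */
if (DsaPointerIsValid(node->pstate->tbmiterator))
    tbm_free_shared_area(dsa, node->pstate->tbmiterator);

if (DsaPointerIsValid(node->pstate->prefetch_iterator))
    tbm_free_shared_area(dsa, node->pstate->prefetch_iterator);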

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +                       dsa_free(dsa, node->pstate->prefetch_iterator);

As per the latest code, both should be calling tbm_free_shared_area
because it is capable of handling both cases.  My silly mistake; I will
fix it.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Feb 28, 2017 at 10:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.

+    if (--ptbase->refcount == 0)
+        dsa_free(dsa, istate->pagetable);
+
+    if (istate->spages)
+    {
+        ptpages = dsa_get_address(dsa, istate->spages);
+        if (--ptpages->refcount == 0)
+            dsa_free(dsa, istate->spages);
+    }
+    if (istate->schunks)
+    {
+        ptchunks = dsa_get_address(dsa, istate->schunks);
+        if (--ptchunks->refcount == 0)
+            dsa_free(dsa, istate->schunks);
+    }

This doesn't involve any locking, which I think will happen to work
with the current usage pattern but doesn't seem very robust in
general.  I think you either need the refcounts to be protected by a
spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
dsa_free(...) }

Otherwise, there's no guarantee it will get freed exactly once.
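
For example (a sketch, assuming the refcount fields are switched to
pg_atomic_uint32):

if (pg_atomic_sub_fetch_u32(&ptbase->refcount, 1) == 0)
    dsa_free(dsa, istate->pagetable);

if (istate->spages)
{
    ptpages = dsa_get_address(dsa, istate->spages);
    if (pg_atomic_sub_fetch_u32(&ptpages->refcount, 1) == 0)
        dsa_free(dsa, istate->spages);
}

The corresponding increment when an iterator attaches would use
pg_atomic_add_fetch_u32() in the same way.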

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 11:09 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
>> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
>> +
>> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
>> +                       dsa_free(dsa, node->pstate->prefetch_iterator);
>
> As per latest code, both should be calling to tbm_free_shared_area
> because tbm_free_shared_area is capable of handling that. My silly
> mistake. I will fix it.

Thanks.  I don't think I believe this rationale:

+                               /*
+                                * If one of the process has already identified that we need
+                                * to do prefetch then let it perform the prefetch and allow
+                                * others to proceed with the work in hand.  Another option
+                                * could be that we allow all of them to participate in
+                                * prefetching.  But, most of this work done under mutex or
+                                * LWLock so ultimately we may end up in prefetching
+                                * sequentially.
+                                */

I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
And that's the slow part.  The tiny amount of time we spend updating
the prefetch information under the mutex should be insignificant
compared to the cost of actually reading the buffer.  Unless I'm
missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
> And that's the slow part.  The tiny amount of time we spend updating
> the prefetch information under the mutex should be insignificant
> compared to the cost of actually reading the buffer.  Unless I'm
> missing something.

Okay, but IIUC, PrefetchBuffer is an async call to load the buffer if
it's not already in shared buffers.  So if, instead of one process
making multiple async calls to PrefetchBuffer, we spread those calls
across multiple processes, will it be any faster?  Or are you thinking
that at least we can make the BufTableLookup call parallel, because
that is not an async call?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 11:27 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
>> And that's the slow part.  The tiny amount of time we spend updating
>> the prefetch information under the mutex should be insignificant
>> compared to the cost of actually reading the buffer.  Unless I'm
>> missing something.
>
> Okay, but IIUC, the PrefetchBuffer is an async call to load the buffer
> if it's not already in shared buffer? so If instead of one process is
> making multiple async calls to PrefetchBuffer, if we make it by
> multiple processes will it be any faster?  Or you are thinking that at
> least we can make BufTableLookup call parallel because that is not an
> async call.

It's not about speed.  It's about not forgetting to prefetch.  Suppose
that worker 1 becomes the prefetch worker but then doesn't return to
the Bitmap Heap Scan node for a long time because it's busy in some
other part of the plan tree.  Now you just stop prefetching; that's
bad.  You want prefetching to continue regardless of which workers are
busy doing what; as long as SOME worker is executing the parallel
bitmap heap scan, prefetching should continue as needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Mar 7, 2017 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It's not about speed.  It's about not forgetting to prefetch.  Suppose
> that worker 1 becomes the prefetch worker but then doesn't return to
> the Bitmap Heap Scan node for a long time because it's busy in some
> other part of the plan tree.  Now you just stop prefetching; that's
> bad.  You want prefetching to continue regardless of which workers are
> busy doing what; as long as SOME worker is executing the parallel
> bitmap heap scan, prefetching should continue as needed.

Right, I missed this part. I will fix this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> It's not about speed.  It's about not forgetting to prefetch.  Suppose
>> that worker 1 becomes the prefetch worker but then doesn't return to
>> the Bitmap Heap Scan node for a long time because it's busy in some
>> other part of the plan tree.  Now you just stop prefetching; that's
>> bad.  You want prefetching to continue regardless of which workers are
>> busy doing what; as long as SOME worker is executing the parallel
>> bitmap heap scan, prefetching should continue as needed.
>
> Right, I missed this part. I will fix this.
I have fixed this part.  After doing that I realised that if multiple
processes are prefetching, then in boundary cases (e.g. suppose
prefetch_target is 3 and prefetch_pages is at 2) there may be some
extra prefetching, but those prefetched blocks will eventually be used.
Another solution to this problem is to increase prefetch_pages in
advance and then call tbm_shared_iterate; this would avoid the extra
prefetching.  But I am not sure what would be best here.


On Tue, Mar 7, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> +    if (--ptbase->refcount == 0)
> +        dsa_free(dsa, istate->pagetable);
> +
> +    if (istate->spages)
> +    {
> +        ptpages = dsa_get_address(dsa, istate->spages);
> +        if (--ptpages->refcount == 0)
> +            dsa_free(dsa, istate->spages);
> +    }
> +    if (istate->schunks)
> +    {
> +        ptchunks = dsa_get_address(dsa, istate->schunks);
> +        if (--ptchunks->refcount == 0)
> +            dsa_free(dsa, istate->schunks);
> +    }
>
> This doesn't involve any locking, which I think will happen to work
> with the current usage pattern but doesn't seem very robust in
> general.  I think you either need the refcounts to be protected by a
> spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
> something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
> dsa_free(...) }
>
> Otherwise, there's no guarantee it will get freed exactly once.

Fixed


On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You've still got this:
>
> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +                       dsa_free(dsa, node->pstate->prefetch_iterator);
>
> I'm trying to get to a point where both calls use
> tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

Fixed

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 1:49 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> It's not about speed.  It's about not forgetting to prefetch.  Suppose
>>> that worker 1 becomes the prefetch worker but then doesn't return to
>>> the Bitmap Heap Scan node for a long time because it's busy in some
>>> other part of the plan tree.  Now you just stop prefetching; that's
>>> bad.  You want prefetching to continue regardless of which workers are
>>> busy doing what; as long as SOME worker is executing the parallel
>>> bitmap heap scan, prefetching should continue as needed.
>>
>> Right, I missed this part. I will fix this.
> I have fixed this part, after doing that I realised if multiple
> processes are prefetching then it may be possible that in boundary
> cases (e.g. suppose prefetch_target is 3 and prefetch_pages is at 2)
> there may be some extra prefetch but finally those prefetched blocks
> will be used.  Another, solution to this problem is that we can
> increase the prefetch_pages in advance then call tbm_shared_iterate,
> this will avoid extra prefetch.  But I am not sure what will be best
> here.

I don't think I understand exactly why this system might be prone to a
little bit of extra prefetching - can you explain further? I don't
think it will hurt anything as long as we are talking about a small
number of extra prefetches here and there.

> Fixed

Committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Mar 8, 2017 at 6:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think I understand exactly why this system might be prone to a
> little bit of extra prefetching - can you explain further?
Let me explain with an example.  Suppose there are 2 workers
prefetching jointly and prefetch_target is 3.  When both workers
prefetch 1 page each, the prefetch_pages count will be 2, so both
workers will go for the next prefetch because prefetch_pages is still
less than prefetch_target; both will prefetch again, and in total they
will prefetch 4 pages.

> I don't think it will hurt anything as long as we are talking about a small
> number of extra prefetches here and there.

I completely agree with this point; I mentioned it in the mail so that
it doesn't go unnoticed.

And whatever extra prefetching we have done will be used anyway, unless
we stop execution because of some LIMIT clause.

>
>> Fixed
>
> Committed 0001.

Thanks


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
Reviewing 0003:

How about adding a regression test?

bitmap_subplan_mark_shared could use castNode(), which seems like it
would be better style.  Maybe some other places, too.
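
For example, something like this (a sketch only, assuming the isshared
flag your patch adds to the bitmap plan nodes):

if (IsA(plan, BitmapAnd))
    bitmap_subplan_mark_shared(linitial(castNode(BitmapAnd, plan)->bitmapplans));
else if (IsA(plan, BitmapOr))
    castNode(BitmapOr, plan)->isshared = true;
else if (IsA(plan, BitmapIndexScan))
    castNode(BitmapIndexScan, plan)->isshared = true;
else
    elog(ERROR, "unrecognized node type: %d", nodeTag(plan));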

+         <entry><literal>ParallelBitmapPopulate</></entry>
+         <entry>Waiting for the leader to populate the TidBitmap.</entry>
+        </row>
+        <row>

If you build the documentation, you'll find that this doesn't come out
right; you need to add 1 to the value of the nearest preceding
"morerows".  (I fixed a similar issue with 0001 while committing.)

+        /*---------------
+         * Check the current state
+         * If state is
+         * BM_INITIAL     : We become the leader and set it to BM_INPROGRESS
+         * BM_INPROGRESS : We need to wait till leader creates bitmap
+         * BM_FINISHED   : bitmap is ready so no need to wait
+         *---------------

The formatting of this comment is slightly off - the comment for
BM_INITIAL isn't aligned the same as the others.  But I would just
delete the whole comment, since more or less it recapitulates the
function header comment anyway.

I wonder if BitmapShouldInitializeSharedState couldn't be written a
little more compactly overall, like this:

{
    SharedBitmapState   state;

    while (1)
    {
        SpinLockAcquire(&pstate->mutex);
        state = pstate->state;
        if (pstate->state == BM_INITIAL)
            pstate->state = BM_INPROGRESS;
        SpinLockRelease(&pstate->mutex);

        /* If we are leader or leader has already created a TIDBITMAP */
        if (state != BM_INPROGRESS)
            break;

        /* Sleep until leader finishes creating the bitmap */
        ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
    }

    ConditionVariableCancelSleep();

    return (state == BM_INITIAL);
}

+                /*
+                 * By this time we have already populated the TBM and
+                 * initialized the shared iterators so set the state to
+                 * BM_FINISHED and wake up others.
+                 */
+                SpinLockAcquire(&pstate->mutex);
+                pstate->state = BM_FINISHED;
+                SpinLockRelease(&pstate->mutex);
+                ConditionVariableBroadcast(&pstate->cv);

I think it would be good to have a function for this, like
BitmapDoneInitializingSharedState(), and just call that function here.
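
i.e. roughly this (a sketch; ParallelBitmapHeapState stands in for
whatever the shared-state struct ends up being called):

static inline void
BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
{
    /* Mark the shared bitmap as complete and wake the waiting workers. */
    SpinLockAcquire(&pstate->mutex);
    pstate->state = BM_FINISHED;
    SpinLockRelease(&pstate->mutex);
    ConditionVariableBroadcast(&pstate->cv);
}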

+            SpinLockAcquire(&pstate->mutex);
+
+            /*
+             * Recheck under the mutex, If some other process has already done
+             * the enough prefetch then we need not to do anything.
+             */
+            if (pstate->prefetch_pages >= pstate->prefetch_target)
+                SpinLockRelease(&pstate->mutex);
+                return;
+            SpinLockRelease(&pstate->mutex);

I think it would be clearer to write this as:

SpinLockAcquire(&pstate->mutex);
do_prefetch = (pstate->prefetch_pages >= pstate->prefetch_target);
SpinLockRelease(&pstate->mutex);
if (!do_prefetch)
    return;

Then it's more obvious what's going on with the spinlock.  But
actually what I would do is also roll in the increment to prefetch
pages in there, so that you don't have to reacquire the spinlock after
calling PrefetchBuffer:

bool do_prefetch = false;
SpinLockAcquire(&pstate->mutex);
if (pstate->prefetch_pages < pstate->prefetch_target)
{
    pstate->prefetch_pages++;
    do_prefetch = true;
}
SpinLockRelease(&pstate->mutex);

That seems like it will reduce the amount of excess prefetching
considerably, and also simplify the code and cut the spinlock
acquisitions by 50%.

Overall I think this is in pretty good shape.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Wed, Mar 8, 2017 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> How about adding a regression test?

Added
>
> bitmap_subplan_mark_shared could use castNode(), which seems like it
> would be better style.  Maybe some other places, too.
>
> +         <entry><literal>ParallelBitmapPopulate</></entry>
> +         <entry>Waiting for the leader to populate the TidBitmap.</entry>
> +        </row>
> +        <row>
>
> If you build the documentation, you'll find that this doesn't come out
> right; you need to add 1 to the value of the nearest preceding
> "morerows".  (I fixed a similar issue with 0001 while committing.)

Fixed
>
> +        /*---------------
> +         * Check the current state
> +         * If state is
> +         * BM_INITIAL     : We become the leader and set it to BM_INPROGRESS
> +         * BM_INPROGRESS : We need to wait till leader creates bitmap
> +         * BM_FINISHED   : bitmap is ready so no need to wait
> +         *---------------
>
> The formatting of this comment is slightly off - the comment for
> BM_INITIAL isn't aligned the same as the others.  But I would just
> delete the whole comment, since more or less it recapitulates the
> function header comment anyway.

Removed.
>
> I wonder if BitmapShouldInitializeSharedState couldn't be written a
> little more compactly overall, like this:
>
> {
>     SharedBitmapState   state;
>
>     while (1)
>     {
>         SpinLockAcquire(&pstate->mutex);
>         state = pstate->state;
>         if (pstate->state == BM_INITIAL)
>             pstate->state = BM_INPROGRESS;
>         SpinLockRelease(&pstate->mutex);
>
>         /* If we are leader or leader has already created a TIDBITMAP */
>         if (state != BM_INPROGRESS)
>             break;
>
>         /* Sleep until leader finishes creating the bitmap */
>         ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
>     }
>
>     ConditionVariableCancelSleep();
>
>     return (state == BM_INITIAL);
> }

This looks good; done this way.

>
> +                /*
> +                 * By this time we have already populated the TBM and
> +                 * initialized the shared iterators so set the state to
> +                 * BM_FINISHED and wake up others.
> +                 */
> +                SpinLockAcquire(&pstate->mutex);
> +                pstate->state = BM_FINISHED;
> +                SpinLockRelease(&pstate->mutex);
> +                ConditionVariableBroadcast(&pstate->cv);
>
> I think it would be good to have a function for this, like
> BitmapDoneInitializingSharedState(), and just call that function here.

Done
>
> +            SpinLockAcquire(&pstate->mutex);
> +
> +            /*
> +             * Recheck under the mutex, If some other process has already done
> +             * the enough prefetch then we need not to do anything.
> +             */
> +            if (pstate->prefetch_pages >= pstate->prefetch_target)
> +                SpinLockRelease(&pstate->mutex);
> +                return;
> +            SpinLockRelease(&pstate->mutex);
>
> I think it would be clearer to write this as:
>
> SpinLockAcquire(&pstate->mutex);
> do_prefetch = (pstate->prefetch_pages >= pstate->prefetch_target);
> SpinLockRelease(&pstate->mutex);
> if (!do_prefetch)
>    return;
>
> Then it's more obvious what's going on with the spinlock.  But
> actually what I would do is also roll in the increment to prefetch
> pages in there, so that you don't have to reacquire the spinlock after
> calling PrefetchBuffer:
>
> bool do_prefetch = false;
> SpinLockAcquire(&pstate->mutex);
> if (pstate->prefetch_pages < pstate->prefetch_target)
> {
>     pstate->prefetch_pages++;
>     do_prefetch = true;
> }
> SpinLockRelease(&pstate->mutex);
>
> That seems like it will reduce the amount of excess prefetching
> considerably, and also simplify the code and cut the spinlock
> acquisitions by 50%.
>
Right, done that way
> Overall I think this is in pretty good shape.

Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Right, done that way

This didn't compile because you bobbled some code in
src/backend/nodes, but it was a trivial mistake so I fixed it.

Committed with that fix and a bunch of minor cosmetic changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Jeff Janes
Date:

On Wed, Mar 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Right, done that way
>
> This didn't compile because you bobbled some code in
> src/backend/nodes, but it was a trivial mistake so I fixed it.
>
> Committed with that fix and a bunch of minor cosmetic changes.


I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC):

nodeBitmapHeapscan.c: In function 'BitmapHeapNext':
nodeBitmapHeapscan.c:79: warning: 'tbmiterator' may be used uninitialized in this function
nodeBitmapHeapscan.c:80: warning: 'shared_tbmiterator' may be used uninitialized in this function

Cheers,

Jeff

Re: [HACKERS] Parallel bitmap heap scan

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-17) (GCC):

Me too.  Fix pushed.
        regards, tom lane



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>> 4.4.7-17) (GCC):
>
> Me too.  Fix pushed.

Thanks.  Sorry for the hassle; my compiler isn't as picky about this
as I would like, and apparently Dilip's isn't either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>>> 4.4.7-17) (GCC):

>> Me too.  Fix pushed.

> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
> as I would like, and apparently Dilip's isn't either.

Might be interesting to see whether -O level affects it.  In principle,
whether you get the warning should depend on how much the compiler has
analyzed the logic flow ...
        regards, tom lane



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
>> as I would like, and apparently Dilip's isn't either.
>
> Might be interesting to see whether -O level affects it.  In principle,
> whether you get the warning should depend on how much the compiler has
> analyzed the logic flow ...

What I'm using is:

Configured with:
--prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

While I haven't experimented with this too extensively, my general
impression is that this thing is extremely tolerant of uninitialized
variables.  I just tried compiling nodeBitmapHeapscan.c with -Wall
-Werror and each of -O0, -O1, -O2, and -O3, and none of those produced
any warnings.  I've been reluctant to go to the hassle of installing a
different compiler...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel bitmap heap scan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> What I'm using is:

> Configured with:
> --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 7.0.2 (clang-700.1.81)
> Target: x86_64-apple-darwin14.5.0
> Thread model: posix

Hm.  I noticed that longfin didn't spit up on it either, despite having
-Werror turned on.  That's a slightly newer version, but still Apple's
clang:

$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
        regards, tom lane



Re: [HACKERS] Parallel bitmap heap scan

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> What I'm using is:
>
>> Configured with:
>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 7.0.2 (clang-700.1.81)
>> Target: x86_64-apple-darwin14.5.0
>> Thread model: posix
>
> Hm.  I noticed that longfin didn't spit up on it either, despite having
> -Werror turned on.  That's a slightly newer version, but still Apple's
> clang:
>
> $ gcc -v
> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 8.0.0 (clang-800.0.42.1)
> Target: x86_64-apple-darwin16.4.0
> Thread model: posix
> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Yeah.  I think on my previous MacBook Pro, you could do this without
generating a warning:

int x;
printf("%d\n", x);

The compiler on this one detects that case, but that seems to be about
as far as it goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap heap scan

From
Rafia Sabih
Date:
On Wed, Mar 8, 2017 at 11:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> What I'm using is:
>>
>>> Configured with:
>>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>>> Apple LLVM version 7.0.2 (clang-700.1.81)
>>> Target: x86_64-apple-darwin14.5.0
>>> Thread model: posix
>>
>> Hm.  I noticed that longfin didn't spit up on it either, despite having
>> -Werror turned on.  That's a slightly newer version, but still Apple's
>> clang:
>>
>> $ gcc -v
>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 8.0.0 (clang-800.0.42.1)
>> Target: x86_64-apple-darwin16.4.0
>> Thread model: posix
>> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> Yeah.  I think on my previous MacBook Pro, you could do this without
> generating a warning:
>
> int x;
> printf("%d\n", x);
>
> The compiler on this one detects that case, but that seems to be about
> as far as it goes.

Recently, while testing TPC-H at scale factor 300, I stumbled onto an
error for Q6.  The test environment is as follows:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
random_page_cost = seq_page_cost = 0.1

The error message is: ERROR:  invalid DSA memory alloc request size 1610612740
In case it is required, the stack trace is as follows:

#0  errfinish (dummy=0) at elog.c:415
#1  0x0000000010738820 in elog_finish (elevel=20, fmt=0x109115d0
"invalid DSA memory alloc request size %zu") at elog.c:1378
#2  0x0000000010778824 in dsa_allocate_extended (area=0x1001b53b138,
size=1610612740, flags=4) at dsa.c:676
#3  0x00000000103a3e60 in pagetable_allocate (pagetable=0x1001b52a590,
size=1610612736) at tidbitmap.c:1521
#4  0x00000000103a0488 in pagetable_grow (tb=0x1001b52a590,
newsize=33554432) at ../../../src/include/lib/simplehash.h:379
#5  0x00000000103a07b0 in pagetable_insert (tb=0x1001b52a590,
key=34962730, found=0x3fffc705ba48 "") at
../../../src/include/lib/simplehash.h:504
#6  0x00000000103a3354 in tbm_get_pageentry (tbm=0x1001b526fa0,
pageno=34962730) at tidbitmap.c:1235
#7  0x00000000103a1614 in tbm_add_tuples (tbm=0x1001b526fa0,
tids=0x1001b51d22a, ntids=1, recheck=0 '\000') at tidbitmap.c:425
#8  0x00000000100fdf24 in btgetbitmap (scan=0x1001b51c4a0,
tbm=0x1001b526fa0) at nbtree.c:460
#9  0x00000000100f2c48 in index_getbitmap (scan=0x1001b51c4a0,
bitmap=0x1001b526fa0) at indexam.c:726
#10 0x000000001033c7d8 in MultiExecBitmapIndexScan
(node=0x1001b51c180) at nodeBitmapIndexscan.c:91
#11 0x000000001031c0b4 in MultiExecProcNode (node=0x1001b51c180) at
execProcnode.c:611
#12 0x000000001033a8a8 in BitmapHeapNext (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:143
#13 0x000000001032a094 in ExecScanFetch (node=0x1001b5187a8,
accessMtd=0x1033a6c8 <BitmapHeapNext>, recheckMtd=0x1033bab8
<BitmapHeapRecheck>) at execScan.c:95
#14 0x000000001032a194 in ExecScan (node=0x1001b5187a8,
accessMtd=0x1033a6c8 <BitmapHeapNext>, recheckMtd=0x1033bab8
<BitmapHeapRecheck>) at execScan.c:162
#15 0x000000001033bb88 in ExecBitmapHeapScan (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:667

Please feel free to ask if any more information is required for this.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> Recently, on testing TPC-H 300 scale factor I stumbled on to a error
> for Q6, the test environment is as follows,
> work_mem = 1GB,
> shared_buffers = 10 GB,
> Effective_cache_size = 10GB
> random_page_cost = seq+page_cost =0.1
>
> The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
> In case required, stack trace is as follows,

Thanks for reporting.  In pagetable_allocate, the DSA_ALLOC_HUGE flag
was missing while allocating memory from the DSA.  I have also handled
a NULL pointer return from dsa_get_address.

Please verify with the attached patch.
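
The relevant change in pagetable_allocate is essentially to pass the
huge-allocation flag when asking the DSA for the page table, roughly
like this (a sketch; PTEntryArray and the tbm field names here are
assumptions, not necessarily the exact patch):

/* Allow page tables larger than 1GB when they live in DSA memory. */
tbm->dsapagetable = dsa_allocate_extended(tbm->dsa,
                                          sizeof(PTEntryArray) + size,
                                          DSA_ALLOC_HUGE | DSA_ALLOC_ZERO);
ptbase = dsa_get_address(tbm->dsa, tbm->dsapagetable);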


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 5:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> Recently, on testing TPC-H 300 scale factor I stumbled on to a error
>> for Q6, the test environment is as follows,
>> work_mem = 1GB,
>> shared_buffers = 10 GB,
>> Effective_cache_size = 10GB
>> random_page_cost = seq+page_cost =0.1
>>
>> The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
>> In case required, stack trace is as follows,
>
> Thanks for reporting.  In pagetable_allocate, DSA_ALLOC_HUGE flag were
> missing while allocating the memory from the DSA.  I have also handled
> the NULL pointer return from dsa_get_address.
>
> Please verify with the attached patch.

Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
the ptbase == NULL check shouldn't be needed, because we're not
passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
be called from SH_CREATE, which isn't prepared for a NULL return
anyway.

Am I all wet?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap heap scan

From
Dilip Kumar
Date:
On Mon, Mar 27, 2017 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
> the ptbase == NULL check shouldn't be needed, because we're not
> passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
> be called from SH_CREATE, which isn't prepared for a NULL return
> anyway.
>
> Am I all wet?

Yes, you are right that we are not passing DSA_ALLOC_NO_OOM, so
dsa_allocate_extended will raise an error if the allocation fails.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com