Thread: Parallel bitmap index scan

Parallel bitmap index scan

From
Dilip Kumar
Date:
I would like to propose a patch that enables parallelism for the
bitmap index scan path.

Background:
Currently, we support only a parallel bitmap heap scan path.  There,
the underlying bitmap index scan is performed by a single process, the
leader.  The leader builds the bitmap in shared memory, and once the
bitmap is ready it creates a shared iterator; after that, all the
workers process the shared iterator and scan the heap in parallel.
While analyzing TPC-H plans we observed that some queries spend
significant time preparing the bitmap.  The idea of this patch is
therefore to use a parallel index scan to prepare the underlying
bitmap in parallel.

Design:
If the underlying index AM supports a parallel scan (currently only
btree supports it), we create a parallel bitmap heap scan path on top
of a parallel bitmap index scan path.  The idea is that each worker
performs its part of the parallel index scan and builds its own part
of the bitmap.  We use a barrier so that we cannot start preparing the
shared iterator until every worker is done with its bitmap.  The first
worker that finishes its bitmap keeps a copy of its TBM and page table
in shared memory, and each subsequent worker merges its TBM into that
shared TBM.  Once all the TBMs are merged we have one common shared
TBM, and only after that stage can the workers continue.  The
remaining part is the same as today: one worker scans the shared TBM
and prepares the shared iterator, and once it is ready all the workers
jointly scan the heap in parallel using the shared iterator.

BitmapHeapNext
{
    ...
    BarrierAttach();
    tbm = MultiExecProcNode();
    tbm_merge(tbm);              /* merge into the common TBM using tbm_union */
    BarrierArriveAndWait();      /* wait until every worker has merged */

    if (BitmapShouldInitializeSharedState(pstate))   /* only one worker gets through */
    {
        tbm_prepare_shared_iterate();
        BitmapDoneInitializingSharedState();         /* wake up the others */
    }
    tbm_attach_shared_iterate(); /* all workers attach to the shared iterator */
    ...
}
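
To make the merge step above concrete, here is a minimal sketch of how
tbm_merge could work; the lock and the ParallelBitmapHeapState fields
shown here are assumptions for illustration only, not the actual patch:

  /*
   * Hypothetical sketch: merge this worker's local TBM into the shared
   * TBM.  The first worker to arrive publishes its TBM as the host;
   * later workers fold theirs in with tbm_union().  A real
   * implementation would keep the host TBM and its page table in
   * shared memory (DSA).
   */
  static void
  tbm_merge(TIDBitmap *local_tbm, ParallelBitmapHeapState *pstate)
  {
      LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);   /* assumed lock */
      if (pstate->host_tbm == NULL)
          pstate->host_tbm = local_tbm;                   /* first worker */
      else
          tbm_union(pstate->host_tbm, local_tbm);         /* later workers */
      LWLockRelease(&pstate->merge_lock);
  }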

Performance:  At scale factor 10, Q6 spends significant time in the
bitmap index scan, so I measured performance with that query.  With 3
workers the bitmap index scan node is about 3x faster, and the overall
plan is ~40% faster.

TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB

HEAD:
 Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
         ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                           Heap Blocks: exact=205250
                           ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
 Planning Time: 0.659 ms
 Execution Time: 5289.787 ms
(13 rows)


PATCH:

 Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.572..3333.572 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.569..3333.569 rows=1 loops=1)
         ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                           Heap Blocks: exact=184293
                           ->  Parallel Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=1008.361..1008.361 rows=285808 loops=4)
                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
 Planning Time: 0.690 ms
 Execution Time: 3365.420 ms

Note:
- Currently, I have only parallelized the bitmap index path when the
bitmap index scan is directly under the bitmap heap scan.  If there is
a BitmapAnd or BitmapOr path, the underlying bitmap index scans are
not parallelized.  I think BitmapAnd and BitmapOr should use a
completely different design, something similar to what we do in
Parallel Append, so I don't think they need to be covered by this
patch.

- A POC patch is attached to discuss the idea.  The patch still needs
cleanup and testing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap index scan

From
Dilip Kumar
Date:
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I would like to propose a patch that enables parallelism for the
> bitmap index scan path.
>
> [...]
>

There was one compilation warning, which is fixed in this version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap index scan

From
Thomas Munro
Date:
On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I would like to propose a patch for enabling the parallelism for the
> > bitmap index scan path.

                Workers Planned: 4
                ->  Parallel Bitmap Heap Scan on tenk1
                      Recheck Cond: (hundred > 1)
-                     ->  Bitmap Index Scan on tenk1_hundred
+                     ->  Parallel Bitmap Index Scan on tenk1_hundred
                            Index Cond: (hundred > 1)

+1, this is a very good feature to have.

+                                       /* Merge bitmap to a common shared bitmap */
+                                       SpinLockAcquire(&pstate->mutex);
+                                       tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
+                                       SpinLockRelease(&pstate->mutex);

This doesn't look like a job for a spinlock.
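
For illustration, a minimal sketch of the same excerpt with the
spinlock swapped for an LWLock (the lock field name is an assumption,
not something in the posted patch):

  /* Merge bitmap into the common shared bitmap; serialise with an LWLock. */
  LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);
  tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
  LWLockRelease(&pstate->merge_lock);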

You have a barrier so that you can wait until all workers have
finished merging their partial bitmap into the complete bitmap, which
makes perfect sense.  You also use that spinlock (probably should be
LWLock) to serialise the bitmap merging work...  Hmm, I suppose there
would be an alternative design which also uses the barrier to
serialise the merge, and has the merging done entirely by one process,
like this:

  bool chosen_to_merge = false;

  /* Attach to the barrier, and see what phase we're up to. */
  switch (BarrierAttach())
  {
  case PBH_BUILDING:
    ... build my partial bitmap in shmem ...
    chosen_to_merge = BarrierArriveAndWait();
    /* Fall through */

  case PBH_MERGING:
    if (chosen_to_merge)
      ... perform merge of all partial results into final shmem bitmap ...
    BarrierArriveAndWait();
    /* Fall through */

  case PBH_SCANNING:
    /* We attached too late to help build the bitmap.  */
    BarrierDetach();
    break;
  }

Just an idea, not sure if it's a good one.  I find it a little easier
to reason about the behaviour of late-attaching workers when the
phases are explicitly named and handled with code like that (it's not
immediately clear to me whether your coding handles late attachers
correctly, which seems to be one of the main problems to think about
with "dynamic party" parallelism...).



Re: Parallel bitmap index scan

From
Dilip Kumar
Date:


On Mon, 27 Jul 2020 at 3:48 AM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> This doesn't look like a job for a spinlock.

Yes, I agree with that.

> [...]
>
> Just an idea, not sure if it's a good one.  I find it a little easier
> to reason about the behaviour of late-attaching workers when the
> phases are explicitly named and handled with code like that (it's not
> immediately clear to me whether your coding handles late attachers
> correctly, which seems to be one of the main problems to think about
> with "dynamic party" parallelism...).

Yeah, this seems like a better idea.  I am handling the late-attacher
case, but the idea of using the barrier phase looks quite clean.  I
will change it that way.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel bitmap index scan

From
Dilip Kumar
Date:
On Mon, Jul 27, 2020 at 3:48 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> [...]
>
> Just an idea, not sure if it's a good one.  I find it a little easier
> to reason about the behaviour of late-attaching workers when the
> phases are explicitly named and handled with code like that (it's not
> immediately clear to me whether your coding handles late attachers
> correctly, which seems to be one of the main problems to think about
> with "dynamic party" parallelism...).

Actually, for the merging itself I could not use the strategy you
suggested, because in this case every worker prepares its own TBM and
merges it into the shared TBM.  We don't need to elect a leader for
that: all the workers need to merge their TBMs into the shared
location, just one at a time, and we also don't need to wait for all
the workers to finish preparing their TBMs before they start merging.
However, once the merging is done we do need to wait for all the
workers to finish, and after that only one worker is allowed to
prepare the shared iterator.  For that part I have used your idea of
the barrier phase.
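
A rough sketch of that flow, with the merge serialised by a lock and
the barrier used only for the two wait points (the lock and barrier
field names are assumptions for illustration, not the patch itself):

  BitmapHeapNext
  {
      ...
      /* Build this worker's partial TBM from the parallel index scan. */
      tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));

      /* Merge into the shared TBM; workers take turns under a lock. */
      LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);
      tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
      LWLockRelease(&pstate->merge_lock);

      /* Wait until every worker has merged; exactly one is elected. */
      if (BarrierArriveAndWait(&pstate->barrier, 0))    /* wait event omitted */
          tbm_prepare_shared_iterate();                 /* elected worker only */

      /* Wait for the shared iterator, then all workers scan the heap. */
      BarrierArriveAndWait(&pstate->barrier, 0);
      tbm_attach_shared_iterate();
      ...
  }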

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap index scan

From
Bharath Rupireddy
Date:
On Sun, Jul 26, 2020 at 6:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I would like to propose a patch that enables parallelism for the
> bitmap index scan path.
>
> [...]
>

Though I have not looked at the patch or the code for the existing
parallel bitmap heap scan, one point keeps bugging me.  I may be
utterly wrong or the question may be silly, but I would like to ask it
here:

From the above design, each parallel worker creates a partial bitmap
for the index data it looked at.  Why should the workers merge these
bitmaps into a single bitmap in shared memory?  Why can't each
parallel worker do a bitmap heap scan using the partial bitmap it
built during its bitmap index scan and emit the qualifying tuples so
that the Gather node can collect them?  There may not even be lock
contention, since the bitmap heap scan takes read locks on the heap
pages/tuples.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap index scan

From
Dilip Kumar
Date:


On Mon, 17 Aug 2020 at 7:42 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> [...]
>
> From the above design, each parallel worker creates a partial bitmap
> for the index data it looked at.  Why should the workers merge these
> bitmaps into a single bitmap in shared memory?  Why can't each
> parallel worker do a bitmap heap scan using the partial bitmap it
> built during its bitmap index scan and emit the qualifying tuples so
> that the Gather node can collect them?  There may not even be lock
> contention, since the bitmap heap scan takes read locks on the heap
> pages/tuples.

The main reason is that there could be lossy pages in the bitmaps, and
if that is the case the workers would emit duplicate data.  Maybe if
there were no lossy entries in any of the bitmaps we could do as you
describe, but even then it is very likely that different bitmaps would
have many heap pages in common, because each bitmap is prepared from
an index scan.  In such cases we would be doing duplicate I/O.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel bitmap index scan

From
Dilip Kumar
Date:
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I would like to propose a patch that enables parallelism for the
> bitmap index scan path.
>
> [...]

I have rebased this patch on the current head.  Apart from that, I
have also measured performance with a higher scale factor this time.
At the higher scale factor I can see that performance with the patch
is dropping.  Basically, the underlying bitmap index scan node gets
faster with parallelism, but the overall performance goes down due to
the TBM merging in the parallel bitmap heap node.  Currently, there is
a lot of scope for improving tbm_merge:

- Currently, whichever worker produces its TBM first becomes the host
TBM, and all the other workers merge their TBMs into that.  Ideally,
the largest TBM should become the host TBM (see the sketch after this
list).

- While merging we directly use tbm_union, which has to reinsert every
entry into the host TBM's hashtable.  Instead of merging like this, we
could create just a shared iterator (and somehow remove the
duplicates) without really merging the hashtables.  I haven't thought
this design through completely, but it seems doable: the TBM iterator
array would then keep items from multiple TBM hashtables.
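
A minimal sketch of the first idea above, always keeping the larger
bitmap as the host; tbm_num_entries() and the pstate fields are
hypothetical names introduced only for illustration:

  /* Hypothetical sketch: merge so that the larger TBM stays the host. */
  static void
  tbm_merge(TIDBitmap *local_tbm, ParallelBitmapHeapState *pstate)
  {
      LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);   /* assumed lock */
      if (pstate->host_tbm == NULL)
          pstate->host_tbm = local_tbm;       /* first worker to arrive */
      else if (tbm_num_entries(local_tbm) > tbm_num_entries(pstate->host_tbm))
      {
          /* Incoming TBM is larger: fold the old host into it and swap. */
          tbm_union(local_tbm, pstate->host_tbm);
          pstate->host_tbm = local_tbm;
      }
      else
          tbm_union(pstate->host_tbm, local_tbm);
      LWLockRelease(&pstate->merge_lock);
  }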

max_parallel_workers_per_gather=4
work_mem=20GB
shared_buffers=20GB


HEAD
TPCH QUERY    (Parallel BitmapHeap + BitmapIndex)    BitmapIndex
4             19764                                  535
5             12035                                  1545
6             119815                                 7943
14            44154                                  1007

PATCH
TPCH QUERY    (Parallel BitmapHeap + Parallel BitmapIndex)    Parallel BitmapIndex
4             19765                                           287
5             13799                                           848
6             116204                                          3255
14            44078                                           416

Looking at these results, in most of the queries the time spent in the
bitmap index scan is reduced by half or more, but the total time spent
in the bitmap heap scan either is not reduced significantly or has
increased.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap index scan

From
Tomas Vondra
Date:
Hi,

I took a look at this today, doing a bit of stress-testing, and I can
get it to crash because of segfaults in pagetable_create (not sure if
the issue is there, it might be just a symptom of an issue elsewhere).

Attached is a shell script I use to run the stress test - it uses the
'test' database, generates tables of different sizes, and then runs
queries with various parameter combinations. It takes a while to
trigger the crash, so it might depend on timing or something like that.

I've also attached two examples of backtraces. I've also seen an
infinite loop in pagetable_create, but the crashes are much more common.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Parallel bitmap index scan

From
Tomas Vondra
Date:
On 11/11/20 8:52 PM, Tomas Vondra wrote:
> Hi,
> 
> I took a look at this today, doing a bit of stress-testing, and I can
> get it to crash because of segfaults in pagetable_create (not sure if
> the issue is there, it might be just a symptom of an issue elsewhere).
> 
> Attached is a shell script I use to run the stress test - it uses the
> 'test' database, generates tables of different sizes, and then runs
> queries with various parameter combinations. It takes a while to
> trigger the crash, so it might depend on timing or something like that.
>
> I've also attached two examples of backtraces. I've also seen an
> infinite loop in pagetable_create, but the crashes are much more common.
> 

Hi Dilip,

Do you plan to work on this for PG14? I haven't noticed any response in
this thread dealing with the crashes I reported a while ago. Also, it
doesn't seem to be added to any of the commitfests.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel bitmap index scan

From
Dilip Kumar
Date:
On Wed, 23 Dec 2020 at 4:15 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> [...]
>
> Hi Dilip,
>
> Do you plan to work on this for PG14? I haven't noticed any response in
> this thread dealing with the crashes I reported a while ago. Also, it
> doesn't seem to be added to any of the commitfests.


Hi Tomas,

Thanks for testing this.  Actually, we have noticed a big performance
drop in many cases due to the tbm_merge, so off-list we are discussing
different approaches and testing their performance.

In the current approach, all the workers first prepare their own
bitmap hash tables and then merge them into the common bitmap hash
under a lock.  Based on an off-list discussion with Robert, the next
approach I am trying is to insert directly into the shared bitmap hash
while scanning the index itself.  So instead of preparing a separate
bitmap, all the workers will insert directly into the shared bitmap
hash; see the sketch below.  I agree that inserting each page into the
shared bitmap hash requires acquiring the lock, which might generate a
lot of lock contention, but we want to try the POC and check the
performance.  In fact, I have already implemented the POC and the
results aren't great, but I am still experimenting to see whether the
lock can be made more granular than it is now.  I will share my
findings soon along with the POC patch.
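
For reference, a minimal sketch of that idea, with the shared-state
fields and the lock assumed purely for illustration (this is not the
POC patch itself):

  /*
   * Hypothetical sketch: instead of building a local TBM, each worker
   * adds the TIDs it finds during its part of the parallel index scan
   * straight into a TBM living in shared memory, serialised by a lock.
   */
  static void
  add_tids_to_shared_tbm(ParallelBitmapHeapState *pstate,
                         ItemPointer tids, int ntids, bool recheck)
  {
      LWLockAcquire(&pstate->tbm_lock, LW_EXCLUSIVE);   /* assumed lock */
      tbm_add_tuples(pstate->shared_tbm, tids, ntids, recheck);
      LWLockRelease(&pstate->tbm_lock);
  }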

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com