Parallel bitmap index scan - Mailing list pgsql-hackers

From Dilip Kumar
Subject Parallel bitmap index scan
Date
Msg-id CAFiTN-t4NtRzafw94x+Ub_gUqQsv=u9nK=OaJmS612M_bZv6+Q@mail.gmail.com
I would like to propose a patch that enables parallelism for the
bitmap index scan path.

Background:
Currently, we support only a parallel bitmap heap scan path.  Therein,
the underlying bitmap index scan is done by a single process, the
leader.  The leader creates the bitmap in shared memory, and once the
bitmap is ready it creates a shared iterator; after that, all the
workers process the shared iterator and scan the heap in parallel.
While analyzing TPCH plans, we have observed that some queries spend
significant time preparing the bitmap.  So the idea of this patch is
to use a parallel index scan to prepare the underlying bitmap in
parallel.

Design:
If the underlying index AM supports a parallel path (currently only
btree supports it), then we will create a parallel bitmap heap scan
path on top of a parallel bitmap index scan path.  The idea is that
each worker does a parallel index scan and generates its part of the
bitmap.  We place a barrier so that preparation of the shared iterator
cannot start until every worker is ready with its bitmap.  The first
worker that is ready with its bitmap keeps a copy of its TBM and the
page table in shared memory, and all the subsequent workers merge
their TBMs into the shared TBM.  Once all the TBMs are merged we have
one common shared TBM, and after that stage the workers can continue.
The remaining part is the same: one worker scans the shared TBM and
prepares the shared iterator, and once it is ready all the workers
jointly scan the heap in parallel using the shared iterator.

BitmapHeapNext
{
    ...
    BarrierAttach();
    tbm = MultiExecProcNode();
    tbm_merge(tbm);             /* merge into the common TBM using tbm_union */
    BarrierArriveAndWait();     /* wait until every worker has merged */

    if (BitmapShouldInitializeSharedState(pstate))  /* only one worker comes
                                                     * out of this */
    {
        tbm_prepare_shared_iterate();
        BitmapDoneInitializingSharedState();        /* wake up the others */
    }
    tbm_attach_shared_iterate();    /* all workers attach to shared iterator */
    ...
}

Performance:  With scale factor 10, Q6 spends significant time in the
bitmap index scan, so I took the numbers with that query.  With 3
workers, the bitmap index scan node is ~3x faster, and the overall
plan got ~40% faster.

TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB

HEAD:
 Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
         ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                           Heap Blocks: exact=205250
                           ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
 Planning Time: 0.659 ms
 Execution Time: 5289.787 ms
(13 rows)


PATCH:

 Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.572..3333.572 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.569..3333.569 rows=1 loops=1)
         ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                           Heap Blocks: exact=184293
                           ->  Parallel Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=1008.361..1008.361 rows=285808 loops=4)
                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
 Planning Time: 0.690 ms
 Execution Time: 3365.420 ms

Note:
- Currently, I have only parallelized the bitmap index path when there
is a bitmap index scan directly under the bitmap heap scan.  If we
have a BitmapAnd or BitmapOr path, the underlying bitmap index scans
are not parallelized.  I think for BitmapAnd and BitmapOr we should
use a completely different design, something similar to what we do in
Parallel Append, so I don't think BitmapAnd and BitmapOr need to be
covered by this patch.

- A POC patch is attached to discuss the idea.  It still needs cleanup
and testing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

