Re: Parallel bitmap index scan - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Parallel bitmap index scan
Date
Msg-id CAFiTN-t_23FjA9Oc8AjMXFoQTNWSopmxMYisWAgo=o1r-ZHiWA@mail.gmail.com
Whole thread Raw
In response to Parallel bitmap index scan  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Parallel bitmap index scan
List pgsql-hackers
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I would like to propose a patch for enabling the parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path.  Therein,
> the underlying bitmap index scan is done by a single worker called the
> leader.  The leader creates a bitmap in shared memory and once the
> bitmap is ready it creates a shared iterator and after that, all the
> workers process the shared iterator and scan the heap in parallel.
> While analyzing the TPCH plan we have observed that some of the
> queries are spending significant time in preparing the bitmap.  So the
> idea of this patch is to use the parallel index scan for preparing the
> underlying bitmap in parallel.
>
> Design:
> If underlying index AM supports the parallel path (currently only
> BTREE support it), then we will create a parallel bitmap heap scan
> path on top of the parallel bitmap index scan path.  So the idea of
> this patch is that each worker will do the parallel index scan and
> generate their part of the bitmap.  And, we will create a barrier so
> that we can not start preparing the shared iterator until all the
> worker is ready with their bitmap.  The first worker, which is ready
> with the bitmap will keep a copy of its TBM and the page table in the
> shared memory.  And, all the subsequent workers will merge their TBM
> with the shared TBM.  Once all the TBM are merged we will get one
> common shared TBM and after that stage, the worker can continue.  The
> remaining part is the same,  basically, again one worker will scan the
> shared TBM and prepare the shared iterator and once it is ready all
> the workers will jointly scan the heap in parallel using shared
> iterator.
>
> BitmapHeapNext
> {
> ...
> BarrierAttach();
> tbm = MultiExecProcNode();
> tbm_merge(tbm); --Merge with common tbm using tbm_union
> BarrierArriveAndWait();
>
> if (BitmapShouldInitializeSharedState(pstate)). --> only one worker
> come out of this
> {
>    tbm_prepare_shared_iterate();
>    BitmapDoneInitializingSharedState(). -->wakeup others
> }
> tbm_attach_shared_iterate(). --> all worker attach to shared iterator
> ...
> }
>
> Performance:  With scale factor 10, I could see that Q6 is spending
> significant time in a bitmap index scan so I have taken the
> performance with that query and I can see that the bitmap index scan
> node is 3x faster by using 3 workers whereas overall plan got ~40%
> faster.
>
> TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB
>
> HEAD:
>   Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual
> time=5260.121..5260.122 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1
> width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>          ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32)
> (actual time=5257.251..5289.595 rows=4 loops=1)
>                Workers Planned: 3
>                Workers Launched: 3
>                ->  Partial Aggregate  (cost=1558776.69..1558776.70
> rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                      ->  Parallel Bitmap Heap Scan on lineitem
> (cost=300603.01..1556898.89 rows=375560 width=12) (actual
> time=3475.944..50
> 37.484 rows=285808 loops=4)
>                            Recheck Cond: ((l_shipdate >=
> '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp
> without tim
> e zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND
> (l_quantity < '24'::numeric))
>                            Heap Blocks: exact=205250
>                            ->  Bitmap Index Scan on
> idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0)
> (actual time=3169.85
> 5..3169.855 rows=1143234 loops=1)
>                                  Index Cond: ((l_shipdate >=
> '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp
> without
>  time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND
> (l_quantity < '24'::numeric))
>  Planning Time: 0.659 ms
>  Execution Time: 5289.787 ms
> (13 rows)
>
>
> PATCH:
>
>   Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual
> time=3333.572..3333.572 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1
> width=32) (actual time=3333.569..3333.569 rows=1 loops=1)
>          ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32)
> (actual time=3328.619..3365.227 rows=4 loops=1)
>                Workers Planned: 3
>                Workers Launched: 3
>                ->  Partial Aggregate  (cost=1558579.52..1558579.53
> rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
>                      ->  Parallel Bitmap Heap Scan on lineitem
> (cost=300405.84..1556701.72 rows=375560 width=12) (actual
> time=1585.726..30
> 97.628 rows=285808 loops=4)
>                            Recheck Cond: ((l_shipdate >=
> '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp
> without tim
> e zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND
> (l_quantity < '24'::numeric))
>                            Heap Blocks: exact=184293
>                            ->  Parallel Bitmap Index Scan on
> idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0)
> (actual tim
> e=1008.361..1008.361 rows=285808 loops=4)
>                                  Index Cond: ((l_shipdate >=
> '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp
> without
>  time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND
> (l_quantity < '24'::numeric))
>  Planning Time: 0.690 ms
>  Execution Time: 3365.420 ms
>
> Note:
> - Currently, I have only parallelized then bitmap index path when we
> have a bitmap index scan directly under bitmap heap.  But, if we have
> BitmapAnd or BitmapOr path then I did not parallelize the underlying
> bitmap index scan.  I think for BitmapAnd and BitmapOr we should use a
> completely different design, something similar to what we are doing in
> parallel append so I don't think BitmapAnd and BitmapOr we need to
> cover under this patch.
>
> - POC patch is attached to discuss the idea.  The patch still needs
> cleanup and testing.

 I have rebased this patch on the current head.  Apart from this, I
have also measure performance with the higher scalare factor this
time.  At a higher scale factor I can see the performance with the
patch is dropping.  Basically, the underlying bitmap index scan node
is getting faster with parallelism but the overall performance is
going down due to the TBM merging in the parallel bitmap heap node.
Currently, there is a lot of scope for improving tbm_merge.
- Currently, whichever worker produces the TBM first becomes the host
TBM and all the other workers merge their TBM to that.  Ideally, the
largest TBM should become the host TBM.
- While merging we are directly using tbm_union and that need to
reinsert the complete entry in the host TBM's hashtable,  I think
instead of merging like this we can create just a shared iterator (and
somehow remove the duplicates) but don't really need to merge the
hashtable.  I haven't thought about this design completely but seems
doable,  basically by doing this the TBM iterator array will keep the
items from multiple tbm_hashtables.

max_parallel_workers_per_gather=4
work_mem=20GB
shared_buffes=20GB


HEAD
TPCH QUERY           (Parallel BitmapHeap+ BitmapIndex)
    BitmapIndex
4                                19764
                                   535
5                                12035
                                 1545
6                                119815
                                7943
14                              44154
                                1007

PATCH
TPCH QUERY          (Parallel BitmapHeap+Parallel BitmapIndex).
Parallel BitmapIndex
4                                 19765
                                   287
5                                 13799
                                   848
6                                 116204
                                3255
14                               44078
                                  416

So if we see the performance results, in most of the queries the time
spent in the bitmap index is reduced by half or more but still, the
total time spent in the bitmap heap scan is either not reduced
significantly or it is increased.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: "Namrata Bhave"
Date:
Subject: Binaries on s390x arch
Next
From: Nikolay Shaplov
Date:
Subject: Is deduplicate_items ever used in nbtree?