Re: Parallel bitmap index scan - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: Parallel bitmap index scan
Msg-id: CAFiTN-t_23FjA9Oc8AjMXFoQTNWSopmxMYisWAgo=o1r-ZHiWA@mail.gmail.com
In response to: Parallel bitmap index scan (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Parallel bitmap index scan
List: pgsql-hackers
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I would like to propose a patch for enabling parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path. Therein,
> the underlying bitmap index scan is done by a single worker called the
> leader. The leader creates a bitmap in shared memory, and once the
> bitmap is ready it creates a shared iterator; after that, all the
> workers process the shared iterator and scan the heap in parallel.
> While analyzing the TPCH plans we observed that some of the queries
> spend significant time in preparing the bitmap. So the idea of this
> patch is to use a parallel index scan for preparing the underlying
> bitmap in parallel.
>
> Design:
> If the underlying index AM supports the parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan
> path on top of the parallel bitmap index scan path. The idea of this
> patch is that each worker will do a parallel index scan and generate
> its part of the bitmap. We will create a barrier so that we cannot
> start preparing the shared iterator until all the workers are ready
> with their bitmaps. The first worker that is ready with its bitmap
> will keep a copy of its TBM and the page table in shared memory, and
> all the subsequent workers will merge their TBMs with the shared TBM.
> Once all the TBMs are merged we will have one common shared TBM, and
> after that stage the workers can continue. The remaining part is the
> same: one worker will scan the shared TBM and prepare the shared
> iterator, and once it is ready all the workers will jointly scan the
> heap in parallel using the shared iterator.
>
> BitmapHeapNext
> {
>   ...
>   BarrierAttach();
>   tbm = MultiExecProcNode();
>   tbm_merge(tbm);        -- merge with the common TBM using tbm_union
>   BarrierArriveAndWait();
>
>   if (BitmapShouldInitializeSharedState(pstate))  --> only one worker comes out of this
>   {
>     tbm_prepare_shared_iterate();
>     BitmapDoneInitializingSharedState();          --> wake up the others
>   }
>   tbm_attach_shared_iterate();  --> all workers attach to the shared iterator
>   ...
> }
>
> Performance: With scale factor 10, I could see that Q6 spends
> significant time in the bitmap index scan, so I measured performance
> with that query. The bitmap index scan node is 3x faster when using 3
> workers, and the overall plan got ~40% faster.
>
> TPCH: S.F. 10, work_mem=512MB, shared_buffers=1GB
>
> HEAD:
> Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
>   ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>         ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
>               Workers Planned: 3
>               Workers Launched: 3
>               ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
>                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                           Heap Blocks: exact=205250
>                           ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
>                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
> Planning Time: 0.659 ms
> Execution Time: 5289.787 ms
> (13 rows)
>
> PATCH:
> Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.572..3333.572 rows=1 loops=1)
>   ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=3333.569..3333.569 rows=1 loops=1)
>         ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
>               Workers Planned: 3
>               Workers Launched: 3
>               ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
>                     ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
>                           Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                           Heap Blocks: exact=184293
>                           ->  Parallel Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=1008.361..1008.361 rows=285808 loops=4)
>                                 Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
> Planning Time: 0.690 ms
> Execution Time: 3365.420 ms
>
> Note:
> - Currently, I have only parallelized the bitmap index path when there
> is a bitmap index scan directly under the bitmap heap scan. If we have
> a BitmapAnd or BitmapOr path, the underlying bitmap index scans are
> not parallelized. I think for BitmapAnd and BitmapOr we should use a
> completely different design, something similar to what we do in
> parallel append, so I don't think BitmapAnd and BitmapOr need to be
> covered under this patch.
>
> - A POC patch is attached to discuss the idea. The patch still needs
> cleanup and testing.

I have rebased this patch on the current head. Apart from this, I have
also measured performance with a higher scale factor this time. At a
higher scale factor I can see that the performance with the patch is
dropping.
Basically, the underlying bitmap index scan node is getting faster with
parallelism, but the overall performance is going down due to the TBM
merging in the parallel bitmap heap node. Currently, there is a lot of
scope for improving tbm_merge:

- Currently, whichever worker produces its TBM first becomes the host
TBM, and all the other workers merge their TBMs into it. Ideally, the
largest TBM should become the host TBM.

- While merging we directly use tbm_union, which has to reinsert every
entry into the host TBM's hashtable. Instead of merging like this, we
could create just a shared iterator (and somehow remove the
duplicates) without really merging the hashtables. I haven't thought
this design through completely, but it seems doable: basically, the
TBM iterator array would keep items from multiple TBM hashtables.

max_parallel_workers_per_gather=4, work_mem=20GB, shared_buffers=20GB

HEAD:
QUERY   (Parallel BitmapHeap + BitmapIndex)   BitmapIndex
4       19764                                 535
5       12035                                 1545
6       119815                                7943
14      44154                                 1007

PATCH:
QUERY   (Parallel BitmapHeap + Parallel BitmapIndex)   Parallel BitmapIndex
4       19765                                          287
5       13799                                          848
6       116204                                         3255
14      44078                                          416

So if we look at the performance results, in most of the queries the
time spent in the bitmap index scan is reduced by half or more, but
still the total time spent in the bitmap heap scan is either not
reduced significantly or has increased.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com