Re: [HACKERS] Merge join for GiST - Mailing list pgsql-hackers
From | Andrew Borodin |
---|---|
Subject | Re: [HACKERS] Merge join for GiST |
Date | |
Msg-id | CAJEAwVGQ-nk+e-GnP72cEnE=cBfhd+5AMY7BTma8S04FAR9WgQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Merge join for GiST (Jeff Davis <pgsql@j-davis.com>) |
List | pgsql-hackers |
Hi, hackers! I've adapted crossmatch join from pgSphere to cube for performance tests. I've placed spatial join code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c and node code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c If you have an idea of improving the performance of this code, please do not hesitate to express them. One of the performance bottlenecks is code nearby heap_getattr() in fetch_next_pair(). ==Performance Tests== I've tested performance on queries which aggregate result of the spatial join. See cube_test.sql attached. On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan Nested Loop + Index Scan plan: HashAggregate (cost=36841568.00..36841570.00 rows=200 width=40) (actual time=206565.869..206738.307 rows=298731 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..25591568.00 rows=900000000 width=40) (actual time=0.357..200838.416 rows=8464147 loops=1) -> Seq Scan on regions r (cost=0.00..6410.00 rows=300000 width=40) (actual time=0.015..324.436 rows=300000 loops=1) -> Index Scan using idx on datatable a (cost=0.41..55.28 rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=300000) Index Cond: (r.c @> c) Planning time: 17.175 ms Execution time: 206806.926 ms Time: 206852,635 ms (03:26,853) Spatial Join plan: HashAggregate (cost=56250001.00..56250003.00 rows=200 width=40) (actual time=67373.644..67553.118 rows=298731 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=4500000000 width=40) (actual time=0.151..61718.804 rows=8464147 loops=1) Outer index: idx Inner index: idx1 Planning time: 0.182 ms Execution time: 67630.742 ms Time: 67631,557 ms (01:07,632) But on more realistic 7D data with queries emulating OLAP system performance of Spatial Join is 2 times worse than Nested Loop Join + GiST Scan. Which comes as a complete surprise to me. I do not see any algorithmic reason for Spatial Join to be slower. Thus I strongly suspect that my implementation is not efficient, but as for now I have no ideas how to improve it. Here are plans for 7D Nested Loop + Index Scan HashAggregate (cost=3425143.00..3425743.00 rows=60000 width=72) (actual time=122794.715..122822.893 rows=60000 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..2075143.00 rows=60000000 width=72) (actual time=0.311..100478.710 rows=39817008 loops=1) -> Seq Scan on r1 r (cost=0.00..2419.00 rows=60000 width=128) (actual time=0.043..60.579 rows=60000 loops=1) -> Index Scan using ix_a1_cube on a1 a (cost=0.41..24.55 rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=60000) Index Cond: (c <@ r.c) Planning time: 0.349 ms Execution time: 122831.353 ms (8 rows) Spatial Join HashAggregate (cost=6750001.00..6750601.00 rows=60000 width=72) (actual time=241832.855..241889.360 rows=60000 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=300000000 width=72) (actual time=0.140..216187.111 rows=39817008 loops=1) Outer index: ix_r1_cube Inner index: ix_a1_cube Planning time: 0.533 ms Execution time: 241907.569 ms (7 rows) Time: 241910,440 ms (04:01,910) Any ideas will be highly appreciated. Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: