Re: [HACKERS] Merge join for GiST - Mailing list pgsql-hackers

From Andrew Borodin
Subject Re: [HACKERS] Merge join for GiST
Date
Msg-id CAJEAwVGQ-nk+e-GnP72cEnE=cBfhd+5AMY7BTma8S04FAR9WgQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Merge join for GiST  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
Hi, hackers!

I've adapted crossmatch join from pgSphere to cube for performance tests.
I've placed spatial join code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c
and node code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c

If you have an idea of improving the performance of this code, please
do not hesitate to express them.
One of the performance bottlenecks is code nearby heap_getattr() in
fetch_next_pair().

==Performance Tests==
I've tested performance on queries which aggregate result of the
spatial join. See cube_test.sql attached.

On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan

Nested Loop + Index Scan plan:
 HashAggregate  (cost=36841568.00..36841570.00 rows=200 width=40)
(actual time=206565.869..206738.307 rows=298731 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..25591568.00 rows=900000000 width=40)
(actual time=0.357..200838.416 rows=8464147 loops=1)
         ->  Seq Scan on regions r  (cost=0.00..6410.00 rows=300000
width=40) (actual time=0.015..324.436 rows=300000 loops=1)
         ->  Index Scan using idx on datatable a  (cost=0.41..55.28
rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=300000)
               Index Cond: (r.c @> c)
 Planning time: 17.175 ms
 Execution time: 206806.926 ms

Time: 206852,635 ms (03:26,853)

Spatial Join plan:
HashAggregate  (cost=56250001.00..56250003.00 rows=200 width=40)
(actual time=67373.644..67553.118 rows=298731 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=4500000000
width=40) (actual time=0.151..61718.804 rows=8464147 loops=1)
         Outer index: idx
         Inner index: idx1
 Planning time: 0.182 ms
 Execution time: 67630.742 ms

Time: 67631,557 ms (01:07,632)

But on more realistic 7D data with queries emulating OLAP system
performance of Spatial Join is 2 times worse than Nested Loop Join +
GiST Scan. Which comes as a complete surprise to me.
I do not see any algorithmic reason for Spatial Join to be slower.
Thus I strongly suspect that my implementation is not efficient, but
as for now I have no ideas how to improve it.

Here are plans for 7D
Nested Loop + Index Scan
 HashAggregate  (cost=3425143.00..3425743.00 rows=60000 width=72)
(actual time=122794.715..122822.893 rows=60000 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..2075143.00 rows=60000000 width=72)
(actual time=0.311..100478.710 rows=39817008 loops=1)
         ->  Seq Scan on r1 r  (cost=0.00..2419.00 rows=60000
width=128) (actual time=0.043..60.579 rows=60000 loops=1)
         ->  Index Scan using ix_a1_cube on a1 a  (cost=0.41..24.55
rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=60000)
               Index Cond: (c <@ r.c)
 Planning time: 0.349 ms
 Execution time: 122831.353 ms
(8 rows)

Spatial Join
 HashAggregate  (cost=6750001.00..6750601.00 rows=60000 width=72)
(actual time=241832.855..241889.360 rows=60000 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=300000000
width=72) (actual time=0.140..216187.111 rows=39817008 loops=1)
         Outer index: ix_r1_cube
         Inner index: ix_a1_cube
 Planning time: 0.533 ms
 Execution time: 241907.569 ms
(7 rows)

Time: 241910,440 ms (04:01,910)

Any ideas will be highly appreciated.

Best regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: [HACKERS] Declarative partitioning - another take
Next
From: Konstantin Knizhnik
Date:
Subject: [HACKERS] Bug in prepared statement cache invalidation?