I've taken a look at this, and have put together a minimal test case reproducing
what I believe is the same problem; see the attached test.sql file.
The test case is a bit extreme in that it sets random_page_cost very low, but
the same thing can happen with the default random_page_cost given a
sufficiently high number of loops.
Running the attached test case, I get the following:
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..711.84 rows=20000 width=13) (actual time=0.240..6521.129 rows=20000 loops=1)
   Buffers: shared hit=445110
   ->  Seq Scan on t2  (cost=0.00..307.00 rows=20000 width=9) (actual time=0.007..2.297 rows=20000 loops=1)
         Buffers: shared hit=107
   ->  Bitmap Heap Scan on t1  (cost=0.01..0.02 rows=1 width=4) (actual time=0.325..0.325 rows=1 loops=20000)
         Recheck Cond: (id = t2.t1_id)
         Rows Removed by Index Recheck: 0
         Heap Blocks: exact=20013
         Buffers: shared hit=445003
         ->  Bitmap Index Scan on t1_gin_index  (cost=0.00..0.01 rows=1 width=0) (actual time=0.324..0.324 rows=1 loops=20000)
               Index Cond: (id = t2.t1_id)
               Buffers: shared hit=424990
 Planning:
   Buffers: shared hit=79
 Planning Time: 0.325 ms
 Execution Time: 6522.759 ms
If I drop the gin index, the btree index is used:
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..1249.56 rows=20000 width=13) (actual time=0.018..16.570 rows=20000 loops=1)
   Buffers: shared hit=4607
   ->  Seq Scan on t2  (cost=0.00..307.00 rows=20000 width=9) (actual time=0.007..1.717 rows=20000 loops=1)
         Buffers: shared hit=107
   ->  Memoize  (cost=0.29..0.31 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=20000)
         Cache Key: t2.t1_id
         Cache Mode: logical
         Hits: 18500  Misses: 1500  Evictions: 0  Overflows: 0  Memory Usage: 154kB
         Buffers: shared hit=4500
         ->  Index Only Scan using t1_pkey on t1  (cost=0.28..0.30 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1500)
               Index Cond: (id = t2.t1_id)
               Heap Fetches: 1500
               Buffers: shared hit=4500
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.198 ms
 Execution Time: 17.683 ms
Looking into it, it seems we are not charging a CPU "descent" cost for the
entry tree of the gin index, as we do for the btree index. In general this is
not a problem, since I/O costs are far greater than CPU costs. But when the
index scan is on the inner side of a nestloop, we account for the cache effect
and amortize the I/O cost over the number of outer loops, which reduces its
relative importance significantly. In that case, the index scan on the gin
index appears much cheaper than it really is, because its constant CPU cost is
not taken into account.
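
To make the effect concrete, here is a standalone sketch (not actual
PostgreSQL source; the log2-based charge mirrors the one btcostestimate
applies, the division by the loop count is only a crude stand-in for the
cache-effect amortization, and all the sizes are made up):

/*
 * Sketch: why the missing CPU descent charge matters once I/O is
 * amortized over many nestloop iterations.  Numbers are invented;
 * only the shape of the formulas follows the planner.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cpu_operator_cost = 0.0025; /* PostgreSQL default */
    double random_page_cost = 0.005;   /* set very low, as in the test case */
    double index_tuples = 100000.0;    /* hypothetical index size */
    double loop_count = 20000.0;       /* outer rows driving the nestloop */

    /* btcostestimate charges roughly one comparison per btree level,
     * i.e. about log2(N) cpu_operator_cost units per descent */
    double descent_cost = ceil(log2(index_tuples)) * cpu_operator_cost;

    /* crude stand-in for the cache-effect adjustment: with enough loops,
     * the per-loop I/O charge for the index pages tends toward zero */
    double io_cost_per_loop = random_page_cost / loop_count;

    printf("amortized I/O per loop:      %.8f\n", io_cost_per_loop);
    printf("btree CPU descent per loop:  %.8f\n", descent_cost);
    /* gin charges no equivalent of descent_cost, so its inner scan
     * looks essentially free in this regime */
    return 0;
}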
I'm not sure exactly how we should account for the cost of the descent, but I
believe it should not be free. Perhaps we could devise a similar strategy
using the estimated numEntryPages?
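
For illustration, here is a purely hypothetical sketch of the kind of charge
I have in mind (again standalone, not a patch against gincostestimate; the
statistics names refer to what ginGetStats() exposes, and the per-page
multiplier of 50 mirrors the one btcostestimate uses for its own page
descent charge):

/*
 * Hypothetical descent charge for the gin entry tree, by analogy with
 * the btree one.  Statistics values are invented; in gincostestimate
 * they would come from ginGetStats().
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cpu_operator_cost = 0.0025;
    double num_entries = 100000.0;   /* GinStatsData.nEntries */
    double num_entry_pages = 500.0;  /* GinStatsData.nEntryPages */
    double search_entries = 1.0;     /* entries looked up by this scan */

    /* option 1: charge log2 of the number of distinct entries, the way
     * btree charges log2 of its tuple count */
    double by_entries = ceil(log2(num_entries)) * cpu_operator_cost;

    /* option 2: estimate the entry-tree height from numEntryPages (a
     * crude guess, since the real fanout is much larger than 2) and
     * charge a per-page CPU cost for the descent */
    double tree_height = ceil(log2(num_entry_pages));
    double by_pages = (tree_height + 1) * 50.0 * cpu_operator_cost;

    printf("descent charge, entry-based: %.6f\n", search_entries * by_entries);
    printf("descent charge, page-based:  %.6f\n", search_entries * by_pages);
    return 0;
}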
--
Ronan Dunklau