BRIN index which is much faster never chosen by planner - Mailing list pgsql-hackers

From Jeremy Finzel
Subject BRIN index which is much faster never chosen by planner
Date
Msg-id CAMa1XUj8WLszW8L2piuPZ88hh+CWitFu-qtf-6TR5rPw5xvhMA@mail.gmail.com
Good Afternoon,

I posted about this in another thread, but since that topic was not precisely about planner issues, I wanted to raise it here.

I am running Postgres 11.5.  I have a table that is insert-only and has 312 million rows; it is also continuously pruned to retain only the past year of data.  The size of the table is 223 GB with indexes, 140 GB without.  One of the fields is rec_insert_time timestamptz.  Here are all potentially relevant table stats:

schemaname          | foo
relname             | log_table
n_tup_ins           | 86850506
n_tup_upd           | 0
n_tup_del           | 68916115
n_tup_hot_upd       | 0
n_live_tup          | 312810691
n_dead_tup          | 9405132
n_mod_since_analyze | 11452
last_vacuum         | 2019-09-20 09:41:43.78383-05
last_autovacuum     | 2019-10-04 13:56:16.810355-05
last_analyze        | 2019-10-10 09:34:26.807695-05
last_autoanalyze    |
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 13
autoanalyze_count   | 0
total_relation_size | 223 GB
relation_size       | 139 GB
table_size          | 140 GB
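(For reference, stats like the above can be pulled from pg_stat_user_tables plus the size functions; this is just a sketch of such a diagnostic query, assuming the table names used in this post:)

```sql
-- Hypothetical diagnostic query; schema/table names match the ones above.
SELECT schemaname, relname,
       n_tup_ins, n_tup_del, n_live_tup, n_dead_tup, n_mod_since_analyze,
       last_vacuum, last_autovacuum, last_analyze,
       pg_size_pretty(pg_total_relation_size('foo.log_table')) AS total_relation_size,
       pg_size_pretty(pg_relation_size('foo.log_table'))       AS relation_size,
       pg_size_pretty(pg_table_size('foo.log_table'))          AS table_size
FROM pg_stat_user_tables
WHERE schemaname = 'foo' AND relname = 'log_table';
```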

I have a simple query looking at the past 10 days based on rec_insert_time, and it will not choose the BRIN index under any of several configurations.  Here are all the relevant indexes (I intentionally do not have a btree on rec_insert_time because I believe BRIN *should* be a better fit here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product, rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH (pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH (pages_per_range='1000')
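(A sketch of how the BRIN variants above would be created; the pages_per_range values match the index definitions shown:)

```sql
-- pages_per_range defaults to 128; smaller values give finer-grained
-- ranges (larger index), larger values give a smaller index at the cost
-- of more lossy heap rechecks.
CREATE INDEX rec_insert_time_brin_1000
    ON foo.log_table USING brin (rec_insert_time)
    WITH (pages_per_range = 1000);
```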

And here is the SQL:
SELECT
  category,
  source,
  MIN(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS first_source_time,
  MAX(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS last_source_time
FROM (
  SELECT DISTINCT ON (brand_id, last_change, log_id)
    category, source(field1) AS source, rec_insert_time
  FROM log_table l
  INNER JOIN public.small_join_table filter ON filter.category = l.category
  WHERE field1 IS NOT NULL
    AND l.category = 'music'
    AND l.rec_insert_time >= now() - interval '10 days'
  ORDER BY brand_id, last_change, log_id, rec_insert_time DESC
) unique_cases;

This query will choose a seq scan on log_table every time in spite of these BRIN indexes on rec_insert_time.
@Michael Lewis had suggested I check the statistics target for this column.  I raised it to 5000 for this column, and it is still choosing a seq scan.
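(For what it's worth, the per-column change was along these lines; 5000 is the value mentioned above:)

```sql
-- Raise the per-column statistics target, then refresh the stats
-- so the planner sees them.
ALTER TABLE foo.log_table
    ALTER COLUMN rec_insert_time SET STATISTICS 5000;
ANALYZE foo.log_table (rec_insert_time);
```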

Here is default chosen plan (takes 2 minutes 12 seconds):
 WindowAgg  (cost=24437881.80..24437897.70 rows=707 width=120) (actual time=132173.173..132173.222 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
   ->  Sort  (cost=24437881.80..24437883.56 rows=707 width=104) (actual time=132173.146..132173.149 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
         ->  Subquery Scan on unique_cases  (cost=24437834.20..24437848.34 rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
               ->  Unique  (cost=24437834.20..24437841.27 rows=707 width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                     ->  Sort  (cost=24437834.20..24437835.96 rows=707 width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                           ->  Nested Loop  (cost=0.00..24437800.73 rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 346704
                                 Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                                 ->  Seq Scan on foo.log_table l  (cost=0.00..24420483.80 rows=707 width=99) (actual time=4095.763..132112.686 rows=466 loops=1)
                                       Output: <hidden>
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name) AND (l.rec_insert_time >= (now() - '10 days'::interval)))
                                       Rows Removed by Filter: 312830265
                                       Buffers: shared hit=391675 read=17772636 dirtied=4679 written=7
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8) (actual time=0.001..0.036 rows=745 loops=466)
                                       Output: filter.category
                                       Buffers: shared hit=1 read=6
                                       ->  Seq Scan on public.small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.054..0.189 rows=745 loops=1)
                                             Output: filter.category
                                             Buffers: shared hit=1 read=6
 Planning Time: 0.552 ms
 Execution Time: 132173.657 ms
(38 rows)


Here is the plan I get when I turn off seqscan and indexscan!!! (21 seconds):
SET enable_seqscan TO false;
SET enable_indexscan TO false;

 WindowAgg  (cost=24363224.32..24363240.85 rows=735 width=120) (actual time=21337.992..21338.040 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
   ->  Sort  (cost=24363224.32..24363226.15 rows=735 width=104) (actual time=21337.965..21337.968 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
         ->  Subquery Scan on unique_cases  (cost=24363174.62..24363189.32 rows=735 width=104) (actual time=21337.777..21337.889 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
               ->  Unique  (cost=24363174.62..24363181.97 rows=735 width=124) (actual time=21337.772..21337.874 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                     ->  Sort  (cost=24363174.62..24363176.46 rows=735 width=124) (actual time=21337.767..21337.791 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                           ->  Nested Loop  (cost=2393.31..24363139.63 rows=735 width=124) (actual time=824.212..21336.263 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                                 ->  Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
                                       Output: <hidden>
                                       Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                       Rows Removed by Index Recheck: 8187584
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
                                       Rows Removed by Filter: 19857107
                                       Heap Blocks: lossy=1509000
                                       Buffers: shared hit=73 read=1509030 dirtied=121 written=1631
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2391.52 rows=157328135 width=0) (actual time=72.391..72.391 rows=15090000 loops=1)
                                             Index Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                             Buffers: shared hit=29 read=74
                                 ->  Bitmap Heap Scan on public.small_join_table filter  (cost=1.60..3.12 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=466)
                                       Output: filter.category, filter.type, filter.location
                                       Recheck Cond: (filter.category = (l.category)::text)
                                       Heap Blocks: exact=466
                                       Buffers: shared hit=1398
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=466)
                                             Index Cond: (filter.category = (l.category)::text)
                                             Buffers: shared hit=932
 Planning Time: 1.869 ms
 Execution Time: 21338.244 ms
(44 rows)

Notice it chooses the smallest BRIN index with 1000 pages per range, and this is far faster than the seq scan.
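(One thing that bears on how the planner costs a BRIN scan is the physical correlation of the column, which it reads from pg_stats; a sketch of a check along those lines, assuming the names above:)

```sql
-- BRIN works best when the column is well correlated with heap order;
-- correlation near 1.0 means each block range covers a tight time window.
SELECT tablename, attname, correlation, n_distinct
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename  = 'log_table'
  AND attname    = 'rec_insert_time';
```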

I do believe the estimate is actually way off.  Even a plain EXPLAIN of the latter plan has the bitmap index scan estimating roughly 10x more rows than the actual count seen above:
 WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78 rows=706 width=104)
               ->  Unique  (cost=24354641.66..24354648.72 rows=706 width=124)
                     ->  Sort  (cost=24354641.66..24354643.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=2385.42..24354608.25 rows=706 width=124)
                                 ->  Bitmap Heap Scan on log_table l  (cost=2383.82..24352408.26 rows=706 width=99)
                                       Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
                                             Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                 ->  Bitmap Heap Scan on small_join_table filter  (cost=1.60..3.12 rows=1 width=8)
                                       Recheck Cond: (category = (l.category)::text)
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
                                             Index Cond: (category = (l.category)::text)
(17 rows)


Here is EXPLAIN only of the default chosen plan:
 WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78 rows=706 width=104)
               ->  Unique  (cost=24437809.66..24437816.72 rows=706 width=124)
                     ->  Sort  (cost=24437809.66..24437811.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=0.00..24437776.25 rows=706 width=124)
                                 Join Filter: ((l.category)::text = filter.category)
                                 ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=706 width=99)
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8)
                                       ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
(13 rows)



Any insight into this is much appreciated.  This is just one example of many similar issues I have been finding with the planner passing over BRIN indexes on insert-only workloads, where BRIN should scale predictably.

Thanks!
Jeremy
