Thread: BRIN index which is much faster never chosen by planner

BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
Good Afternoon,

I posted about this on another thread here, but the topic was not precisely planner issues, so I wanted to post it here.

I am running Postgres 11.5.  I have a table that is insert-only and has 312 million rows.  It is also pruned continuously to retain only the past year of data.  The size of the table is 223 GB with indexes, 140 GB without.  One of the fields is rec_insert_time timestamptz.  Here are all potentially relevant table stats:

schemaname          | foo
relname             | log_table
n_tup_ins           | 86850506
n_tup_upd           | 0
n_tup_del           | 68916115
n_tup_hot_upd       | 0
n_live_tup          | 312810691
n_dead_tup          | 9405132
n_mod_since_analyze | 11452
last_vacuum         | 2019-09-20 09:41:43.78383-05
last_autovacuum     | 2019-10-04 13:56:16.810355-05
last_analyze        | 2019-10-10 09:34:26.807695-05
last_autoanalyze    |
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 13
autoanalyze_count   | 0
total_relation_size | 223 GB
relation_size       | 139 GB
table_size          | 140 GB

I have a simple query looking at the past 10 days based on rec_insert_time, and it will not choose the BRIN index even with several configurations.  Here are all my relevant indexes (I intentionally do not have a btree on rec_insert_time because I believe BRIN *should* be a better fit here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product, rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH (pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH (pages_per_range='1000')

And here is the SQL:
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases
;

This query will choose a seq scan on log_table every time in spite of these BRIN indexes on rec_insert_time.
@Michael Lewis had suggested I check default_statistics_target for this column.  I raised the statistics target to 5000 for the column, and it still chooses a seq scan.
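
For reference, I did that along these lines, followed by a fresh ANALYZE:

ALTER TABLE log_table ALTER COLUMN rec_insert_time SET STATISTICS 5000;
ANALYZE log_table;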

Here is default chosen plan (takes 2 minutes 12 seconds):
 WindowAgg  (cost=24437881.80..24437897.70 rows=707 width=120) (actual time=132173.173..132173.222 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
   ->  Sort  (cost=24437881.80..24437883.56 rows=707 width=104) (actual time=132173.146..132173.149 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
         ->  Subquery Scan on unique_cases  (cost=24437834.20..24437848.34 rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
               ->  Unique  (cost=24437834.20..24437841.27 rows=707 width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                     ->  Sort  (cost=24437834.20..24437835.96 rows=707 width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                           ->  Nested Loop  (cost=0.00..24437800.73 rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 346704
                                 Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                                 ->  Seq Scan on foo.log_table l  (cost=0.00..24420483.80 rows=707 width=99) (actual time=4095.763..132112.686 rows=466 loops=1)
                                       Output: <hidden>
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name) AND (l.rec_insert_time >= (now() - '10 days'::interval)))
                                       Rows Removed by Filter: 312830265
                                       Buffers: shared hit=391675 read=17772636 dirtied=4679 written=7
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8) (actual time=0.001..0.036 rows=745 loops=466)
                                       Output: filter.category
                                       Buffers: shared hit=1 read=6
                                       ->  Seq Scan on public.small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.054..0.189 rows=745 loops=1)
                                             Output: filter.category
                                             Buffers: shared hit=1 read=6
 Planning Time: 0.552 ms
 Execution Time: 132173.657 ms
(38 rows)


Here is the plan I get when I turn off seqscan and indexscan!!! (21 seconds):
SET enable_seqscan TO false;
SET enable_indexscan TO false;

 WindowAgg  (cost=24363224.32..24363240.85 rows=735 width=120) (actual time=21337.992..21338.040 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
   ->  Sort  (cost=24363224.32..24363226.15 rows=735 width=104) (actual time=21337.965..21337.968 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
         ->  Subquery Scan on unique_cases  (cost=24363174.62..24363189.32 rows=735 width=104) (actual time=21337.777..21337.889 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
               ->  Unique  (cost=24363174.62..24363181.97 rows=735 width=124) (actual time=21337.772..21337.874 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                     ->  Sort  (cost=24363174.62..24363176.46 rows=735 width=124) (actual time=21337.767..21337.791 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                           ->  Nested Loop  (cost=2393.31..24363139.63 rows=735 width=124) (actual time=824.212..21336.263 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                                 ->  Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
                                       Output: <hidden>
                                       Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                       Rows Removed by Index Recheck: 8187584
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
                                       Rows Removed by Filter: 19857107
                                       Heap Blocks: lossy=1509000
                                       Buffers: shared hit=73 read=1509030 dirtied=121 written=1631
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2391.52 rows=157328135 width=0) (actual time=72.391..72.391 rows=15090000 loops=1)
                                             Index Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                             Buffers: shared hit=29 read=74
                                 ->  Bitmap Heap Scan on public.small_join_table filter  (cost=1.60..3.12 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=466)
                                       Output: filter.category, filter.type, filter.location
                                       Recheck Cond: (filter.category = (l.category)::text)
                                       Heap Blocks: exact=466
                                       Buffers: shared hit=1398
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=466)
                                             Index Cond: (filter.category = (l.category)::text)
                                             Buffers: shared hit=932
 Planning Time: 1.869 ms
 Execution Time: 21338.244 ms
(44 rows)

Notice it chooses the smallest BRIN index with 1000 pages per range, and this is far faster than the seq scan.

I do believe the estimate is actually way off.  Just a plain EXPLAIN of the latter estimates 10x more rows than actual:
 WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78 rows=706 width=104)
               ->  Unique  (cost=24354641.66..24354648.72 rows=706 width=124)
                     ->  Sort  (cost=24354641.66..24354643.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=2385.42..24354608.25 rows=706 width=124)
                                 ->  Bitmap Heap Scan on log_table l  (cost=2383.82..24352408.26 rows=706 width=99)
                                       Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
                                             Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                 ->  Bitmap Heap Scan on small_join_table filter  (cost=1.60..3.12 rows=1 width=8)
                                       Recheck Cond: (category = (l.category)::text)
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
                                             Index Cond: (category = (l.category)::text)
(17 rows)


Here is EXPLAIN only of the default chosen plan:
 WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78 rows=706 width=104)
               ->  Unique  (cost=24437809.66..24437816.72 rows=706 width=124)
                     ->  Sort  (cost=24437809.66..24437811.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=0.00..24437776.25 rows=706 width=124)
                                 Join Filter: ((l.category)::text = filter.category)
                                 ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=706 width=99)
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8)
                                       ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
(13 rows)



Any insight into this is much appreciated.  This is just one example of many similar issues I have been finding when trying to get BRIN indexes to scale predictably with insert-only workloads.

Thanks!
Jeremy

Re: BRIN index which is much faster never chosen by planner

From: Michael Lewis
Since the optimizer is choosing a seq scan over index scan when it seems like it has good row estimates in both cases, to me that may mean costs of scanning the index are expected to be high. Is this workload on SSD? Has the random_page_cost config been decreased from default 4 (compared with cost of 1 unit for sequential scan)?

Your buffer hits aren't great. What is shared_buffers set to? How much ram on this cluster?
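
A quick way to check the relevant settings (these are all plain GUCs):

SHOW random_page_cost;
SHOW seq_page_cost;
SHOW shared_buffers;
SHOW effective_cache_size;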

With this table being insert only, one assumes correlation is very high on the data in this column as shown in pg_stats, but have you confirmed?
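
Something like this would show it, assuming the table/column names from your post:

SELECT correlation, n_distinct
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename = 'log_table'
  AND attname = 'rec_insert_time';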

To me, distinct ON is often a bad code smell and probably can be re-written to be much more efficient with GROUP BY, lateral & order by, or some other tool. Same with the window function. It is a powerful tool, but sometimes not the right one.

Is "source" a function that is called on field1? What is it doing/how is it defined?

Re: BRIN index which is much faster never chosen by planner

From: Tomas Vondra
On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:
>
> ...
>
>Notice it chooses the smallest BRIN index with 1000 pages per range, and
>this is far faster than the seq scan.
>
>I do believe the estimate is actually way off.  Just a plain EXPLAIN of the
>latter estimates 10x more rows than actual:
> WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
>   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78
>rows=706 width=104)
>               ->  Unique  (cost=24354641.66..24354648.72 rows=706
>width=124)
>                     ->  Sort  (cost=24354641.66..24354643.42 rows=706
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=2385.42..24354608.25
>rows=706 width=124)
>                                 ->  Bitmap Heap Scan on log_table l
> (cost=2383.82..24352408.26 rows=706 width=99)
>                                       Recheck Cond: (rec_insert_time >=
>(now() - '10 days'::interval))
>                                       Filter: ((field1 IS NOT NULL) AND
>(category = 'music'::name))
>                                       ->  Bitmap Index Scan on
>rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
>                                             Index Cond: (rec_insert_time
>>= (now() - '10 days'::interval))
>                                 ->  Bitmap Heap Scan on small_join_table
>filter  (cost=1.60..3.12 rows=1 width=8)
>                                       Recheck Cond: (category =
>(l.category)::text)
>                                       ->  Bitmap Index Scan on
>small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
>                                             Index Cond: (category =
>(l.category)::text)
>(17 rows)
>
>
>Here is EXPLAIN only of the default chosen plan:
> WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
>   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78
>rows=706 width=104)
>               ->  Unique  (cost=24437809.66..24437816.72 rows=706
>width=124)
>                     ->  Sort  (cost=24437809.66..24437811.42 rows=706
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=0.00..24437776.25
>rows=706 width=124)
>                                 Join Filter: ((l.category)::text =
>filter.category)
>                                 ->  Seq Scan on log_table l
> (cost=0.00..24420483.80 rows=706 width=99)
>                                       Filter: ((field1 IS NOT NULL) AND
>(category = 'music'::name) AND (rec_insert_time >= (now() - '10
>days'::interval)))
>                                 ->  Materialize  (cost=0.00..33.98
>rows=1399 width=8)
>                                       ->  Seq Scan on small_join_table
>filter  (cost=0.00..26.99 rows=1399 width=8)
>(13 rows)
>
>
>
>Any insight into this is much appreciated.  This is just one example of
>many similar issues I have been finding with BRIN indexes scaling
>predictably with insert-only workloads.
>

It's quite an interesting planning issue. The cost estimates are:

 ->  Seq Scan on foo.log_table l
 (cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

 ->  Bitmap Heap Scan on foo.log_table l
 (cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*
but picks the seq scan anyway. This is likely because we don't really
compare the exact costs, but do a fuzzy comparison - the plan has to be
at least 1% cheaper to dominate the existing plan. This allows us to
save some work when replacing the paths.
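
To put a number on it, comparing the two total costs quoted above:

  SELECT 24420483.80 / 24360848.29;  -- ~1.0024, i.e. the bitmap path is only ~0.24% cheaper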

In this case the difference is only about 0.2%, so we keep the seqscan
path. The real question is why the planner came to this cost, when it
got pretty good row estimates etc.

Looking at the cost_bitmap_heap_scan() I think the costing issue comes
mostly from this bit:

  /*
   * For small numbers of pages we should charge spc_random_page_cost
   * apiece, while if nearly all the table's pages are being read, it's more
   * appropriate to charge spc_seq_page_cost apiece.  The effect is
   * nonlinear, too. For lack of a better idea, interpolate like this to
   * determine the cost per page.
   */
  if (pages_fetched >= 2.0)
      cost_per_page = spc_random_page_cost -
          (spc_random_page_cost - spc_seq_page_cost)
          * sqrt(pages_fetched / T);
  else
      cost_per_page = spc_random_page_cost;

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number). This is
produced by compute_bitmap_pages() which also computes pages_fetched,
and I guess that's going to be pretty close to all pages, because with
T = 18350080 (which is 140GB) and using

  pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);

we get 29731418 (which is more than 18350080, so it gets clamped).

So it kinda seems like the optimizer believes it'll have to scan
the whole table anyway. In reality, of course, the number of tuples
returned by the index is 10x lower, so the formula above would give us
only about 10975262 pages (so ~1/2 the table). The actual number of
pages is however even lower - only about 1509030, i.e. ~8% of the table.
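
For the record, the arithmetic (T = 18350080, first with the estimated tuple count, then with a count 10x lower):

  SELECT (2.0 * 18350080 * 157328135) / (2.0 * 18350080 + 157328135) AS pages_estimated,     -- roughly 30M, more than T, so clamped
         (2.0 * 18350080 * 15732813)  / (2.0 * 18350080 + 15732813)  AS pages_if_10x_lower;  -- roughly 11M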

So this seems like a combination of multiple issues. Firstly, the bitmap
index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
might be worth increasing stats target on that column, or something like
that. Not sure, but it's about the only "fixable" thing here, I think.

The other issue is that the estimation of pages fetched using bitmap
heap scan is rather crude - but that's simply hard, and I don't think we
can fundamentally improve this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: BRIN index which is much faster never chosen by planner

From: David Rowley
On Fri, 11 Oct 2019 at 12:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The index scan is estimated to return 157328135 rows, i.e. about 50% of
> the table (apparently it's ~10x more than the actual number).

Don't pay too much attention to the actual row counts from bitmap
index scans of brin indexes. The value is entirely made up, per:

/*
* XXX We have an approximation of the number of *pages* that our scan
* returns, but we don't have a precise idea of the number of heap tuples
* involved.
*/
return totalpages * 10;

in bringetbitmap().

You can see this in the forced plan above: the bitmap index scan reports rows=15090000, which is exactly 10x the 1509000 lossy heap blocks.

(Ideally EXPLAIN would be written in such a way that it didn't even
show the actual rows for node types that don't return rows. However,
I'm sure that would break many explain parsing tools)

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

The planner will just estimate the selectivity of now() - interval '10
days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table.  Using 'now' will allow the
planner to lookup actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN index being used.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: BRIN index which is much faster never chosen by planner

From: Michael Lewis


On Thu, Oct 10, 2019 at 6:22 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

The planner will just estimate the selectivity of now() - interval '10
days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table.  Using 'now' will allow the
planner to lookup actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN index being used.

This surprised me a bit, and would have significant implications. I tested a few different tables in our system and get the same row count estimate with either WHERE condition. Perhaps I am missing a critical piece of what you said.

explain
select * from charges where posted_on > now() - interval '10 days';

explain
select * from charges where posted_on > 'now'::timestamptz  - interval '10 days';

Re: BRIN index which is much faster never chosen by planner

From: David Rowley
On Fri, 11 Oct 2019 at 17:48, Michael Lewis <mlewis@entrata.com> wrote:
>
> On Thu, Oct 10, 2019 at 6:22 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>> The planner will just estimate the selectivity of now() - interval '10
>> days'  by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
>> thinks it'll get 1/3rd of the table.  Using 'now' will allow the
>> planner to lookup actual statistics on that column which will likely
>> give a much better estimate, which by the looks of it, likely will
>> result in one of those BRIN index being used.
>
>
> This surprised me a bit, and would have significant implications. I tested a few different tables in our system and get the same row count estimate with either WHERE condition. Perhaps I am missing a critical piece of what you said.
>
> explain
> select * from charges where posted_on > now() - interval '10 days';
>
> explain
> select * from charges where posted_on > 'now'::timestamptz  - interval '10 days';

You're right. On looking more closely at the code, it uses
estimate_expression_value(), which performs additional constant
folding of expressions for selectivity purposes only. It does end up
calling the now() function and evaluating the now() - interval '10
days' expression into a Const.

The header comment for that function reads:

* estimate_expression_value
 *
 * This function attempts to estimate the value of an expression for
 * planning purposes.  It is in essence a more aggressive version of
 * eval_const_expressions(): we will perform constant reductions that are
 * not necessarily 100% safe, but are reasonable for estimation purposes.

So I take back what I said about using 'now'::timestamptz instead of now().

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
On Thu, Oct 10, 2019 at 7:22 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

You are on to something here, I think, with the now() function, even if the above suggestion is not exactly right, as you said further down.  I am finding that a hard-coded timestamp gives the right query plan.  I also tested the same with an even bigger window (the last 16 days) and it still chooses the BRIN index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-#  category, source, MIN(rec_insert_time) OVER (partition by source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category = l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=24436329.10..24436343.56 rows=643 width=120)
   ->  Sort  (cost=24436329.10..24436330.70 rows=643 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24436286.24..24436299.10 rows=643 width=104)
               ->  Unique  (cost=24436286.24..24436292.67 rows=643 width=124)
                     ->  Sort  (cost=24436286.24..24436287.85 rows=643 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=0.00..24436256.25 rows=643 width=124)
                                 Join Filter: ((l.category)::text = filter.category)
                                 ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
                                 ->  Materialize  (cost=0.00..24420487.02 rows=643 width=99)
                                       ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=643 width=99)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
           ?column?
-------------------------------
 2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19664576.17..19664590.63 rows=643 width=120)
   ->  Sort  (cost=19664576.17..19664577.77 rows=643 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=19664533.31..19664546.17 rows=643 width=104)
               ->  Unique  (cost=19664533.31..19664539.74 rows=643 width=124)
                     ->  Sort  (cost=19664533.31..19664534.92 rows=643 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=3181.19..19664503.32 rows=643 width=124)
                                 ->  Gather  (cost=3180.91..19662574.92 rows=643 width=99)
                                       Workers Planned: 3
                                       ->  Parallel Bitmap Heap Scan on log_table l  (cost=2180.91..19661510.62 rows=207 width=99)
                                             Recheck Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                             ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2180.75 rows=142602171 width=0)
                                                   Index Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)


Let me know if this rings any bells!  I will respond to other comments with other replies.

Thanks,
Jeremy
 

Re: BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
Dear Michael,

On Thu, Oct 10, 2019 at 5:20 PM Michael Lewis <mlewis@entrata.com> wrote:
Since the optimizer is choosing a seq scan over index scan when it seems like it has good row estimates in both cases, to me that may mean costs of scanning the index are expected to be high. Is this workload on SSD? Has the random_page_cost config been decreased from default 4 (compared with cost of 1 unit for sequential scan)?

It's 1.5
 
Your buffer hits aren't great. What is shared_buffers set to? How much ram on this cluster?

shared_buffers is 4GB.  The machine has 500GB of RAM, but the server hosts several clusters.
 

With this table being insert only, one assumes correlation is very high on the data in this column as shown in pg_stats, but have you confirmed?

Yes, but the issue isn't with the BRIN index performing badly or being fragmented.  It's that it performs great (7x faster than the seq scan) but Postgres doesn't pick it.  I have seen this same issue in other attempts I have made to use BRIN.
 
To me, distinct ON is often a bad code smell and probably can be re-written to be much more efficient with GROUP BY, lateral & order by, or some other tool. Same with the window function. It is a powerful tool, but sometimes not the right one.

I don't really agree, but it's beside the point because the issue is not in the aggregation; it happens before the aggregation.  Indeed, if I run my query as a simple SELECT (as I tried), it's the exact same planning issue.  (In my experience, DISTINCT ON is the fastest for the given example.  Same with window functions, which avoid inefficient self-joins.)
 
Is "source" a function that is called on field1? What is it doing/how is it defined?

I can't see how that matters either, but the "source" function is a mask for a built-in pg function that is trivial.  This whole query is masked so as not to expose our actual prod query, but I hope it's still understandable enough :).

My question is not how to make this query faster in general.  It's that I want to use BRIN indexes very much, but I'm not sure I can trust that they will scale with the right query plan like I know a btree will.

Thanks!
Jeremy

Re: BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
On Thu, Oct 10, 2019 at 6:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

So this seems like a combination of multiple issues. Firstly, the bitmap
index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
might be worth increasing stats target on that column, or something like
that. Not sure, but it's about the only "fixable" thing here, I think.

In the OP I had mentioned that I already increased it to 5000, and it made no difference.  Ah fine.... let's go ahead and try 10000... still no change:

foo_prod=# ALTER TABLE log_table ALTER COLUMN rec_insert_time SET STATISTICS 10000;
ALTER TABLE
foo_prod=# ANALYZE log_table;
ANALYZE
foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=24451299.20..24451313.21 rows=623 width=120)
   ->  Sort  (cost=24451299.20..24451300.75 rows=623 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24451257.82..24451270.28 rows=623 width=104)
               ->  Unique  (cost=24451257.82..24451264.05 rows=623 width=124)
                     ->  Sort  (cost=24451257.82..24451259.38 rows=623 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=0.00..24451228.90 rows=623 width=124)
                                 Join Filter: ((l.category)::text = filter.category)
                                 ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
                                 ->  Materialize  (cost=0.00..24435949.31 rows=623 width=99)
                                       ->  Seq Scan on log_table l  (cost=0.00..24435946.20 rows=623 width=99)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
(13 rows)

Thanks,
Jeremy

Re: BRIN index which is much faster never chosen by planner

From: Tomas Vondra
On Fri, Oct 11, 2019 at 09:08:05AM -0500, Jeremy Finzel wrote:
>On Thu, Oct 10, 2019 at 7:22 PM David Rowley <david.rowley@2ndquadrant.com>
>wrote:
>
>> The planner might be able to get a better estimate on the number of
>> matching rows if the now() - interval '10 days' expression was
>> replaced with 'now'::timestamptz - interval '10 days'. However, care
>> would need to be taken to ensure the plan is never prepared since
>> 'now' is evaluated during parse. The same care must be taken when
>> creating views, functions, stored procedures and the like.
>>
>
>You are on to something here I think with the now() function, even if above
>suggestion is not exactly right as you said further down.  I am finding a
>hard-coded timestamp gives the right query plan.  I also tested same with
>even bigger window (last 16 days) and it yet still chooses the brin index.
>
>foo_prod=# EXPLAIN
>foo_prod-# SELECT
>foo_prod-#  category, source, MIN(rec_insert_time) OVER (partition by
>source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
>OVER (partition by source order by rec_insert_time) AS last_source_time
>foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
>foo_prod(# category, source(field1) AS source, rec_insert_time
>foo_prod(# FROM log_table l
>foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
>l.category
>foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
>foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
>foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
>unique_cases;
>
>   QUERY PLAN

>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> WindowAgg  (cost=24436329.10..24436343.56 rows=643 width=120)
>   ->  Sort  (cost=24436329.10..24436330.70 rows=643 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24436286.24..24436299.10
>rows=643 width=104)
>               ->  Unique  (cost=24436286.24..24436292.67 rows=643
>width=124)
>                     ->  Sort  (cost=24436286.24..24436287.85 rows=643
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=0.00..24436256.25
>rows=643 width=124)
>                                 Join Filter: ((l.category)::text =
>filter.category)
>                                 ->  Seq Scan on small_join_table filter
> (cost=0.00..26.99 rows=1399 width=8)
>                                 ->  Materialize  (cost=0.00..24420487.02
>rows=643 width=99)
>                                       ->  Seq Scan on log_table l
> (cost=0.00..24420483.80 rows=643 width=99)
>                                             Filter: ((field1 IS NOT NULL)
>AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
>days'::interval)))
>(13 rows)
>
>foo_prod=# SELECT now() - interval '10 days';
>           ?column?
>-------------------------------
> 2019-10-01 08:20:38.115471-05
>(1 row)
>
>foo_prod=# EXPLAIN
>SELECT
> category, source, MIN(rec_insert_time) OVER (partition by source order by
>rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
>by source order by rec_insert_time) AS last_source_time
>FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
>category, source(field1) AS source, rec_insert_time
>FROM log_table l
>INNER JOIN public.small_join_table filter ON filter.category = l.category
>WHERE field1 IS NOT NULL AND l.category = 'music'
>AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
>ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;
>
>  QUERY PLAN

>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> WindowAgg  (cost=19664576.17..19664590.63 rows=643 width=120)
>   ->  Sort  (cost=19664576.17..19664577.77 rows=643 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=19664533.31..19664546.17
>rows=643 width=104)
>               ->  Unique  (cost=19664533.31..19664539.74 rows=643
>width=124)
>                     ->  Sort  (cost=19664533.31..19664534.92 rows=643
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=3181.19..19664503.32
>rows=643 width=124)
>                                 ->  Gather  (cost=3180.91..19662574.92
>rows=643 width=99)
>                                       Workers Planned: 3
>                                       ->  Parallel Bitmap Heap Scan on
>log_table l  (cost=2180.91..19661510.62 rows=207 width=99)
>                                             Recheck Cond: (rec_insert_time
>>= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
>                                             Filter: ((field1 IS NOT NULL)
>AND (category = 'music'::name))
>                                             ->  Bitmap Index Scan on
>rec_insert_time_brin_1000  (cost=0.00..2180.75 rows=142602171 width=0)
>                                                   Index Cond:
>(rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time
>zone)
>
>
>Let me know if this rings any bells!  I will respond to other comments with
>other replies.
>

My guess is that it's (at least partially) due to the cpu_operator_cost
associated with the now() call. When that is replaced with a literal, this cost
disappears and so the total query cost decreases.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
The other issue is that the estimation of pages fetched using bitmap
heap scan is rather crude - but that's simply hard, and I don't think we
can fundamentally improve this.

I wanted to follow up on this specific issue.  Isn't this the heart of the matter and a fundamental problem?  If I want to rely on BRIN indexes as in a straightforward case as explained in OP, but I don't know if the planner will be nearly reliable enough, how can I depend on them in production?  Is this not considered a planner bug or should this kind of case be documented as problematic for BRIN?  As another way to look at it: is there a configuration parameter that could be added specific to BRIN or bitmapscan to provide help to cases like this?

On freshly analyzed tables, I tried my original query again and found that even with now() - 3 days it does not choose the BRIN index.  In fact it chose another btree on the table like (id1, id2, rec_insert_time).  With warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a BRIN scan it takes only 4 seconds.

I could understand it more if the execution times were close, but the BRIN plan is an order of magnitude faster than the plan Postgres is choosing.  I appreciate the feedback on this very much, as I am quite eager to use BRIN indexes!!!

Thanks,
Jeremy 

Re: BRIN index which is much faster never chosen by planner

From: David Rowley
On Tue, 15 Oct 2019 at 08:43, Jeremy Finzel <finzelj@gmail.com> wrote:
> I wanted to follow up on this specific issue.  Isn't this the heart of the matter and a fundamental problem?  If I want to rely on BRIN indexes as in a straightforward case as explained in OP, but I don't know if the planner will be nearly reliable enough, how can I depend on them in production?  Is this not considered a planner bug or should this kind of case be documented as problematic for BRIN?  As another way to look at it: is there a configuration parameter that could be added specific to BRIN or bitmapscan to provide help to cases like this?
>
> On freshly analyzed tables, I tried my original query again and found that even with now() - 3 days it does not choose the BRIN index.  In fact, it chose another btree on the table like (id1, id2, rec_insert_time).  With warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a BRIN scan it takes only 4 seconds.

Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Previous
to 7e534adcd, BRIN indexes were costed based on the selectivity
estimate. There was no accounting for the fact that the pages
for those records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to attempt to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often.  Of
course, the correlation is based on the entire table; if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones.  As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).
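
Something along these lines (a rough sketch only; the new table name is made up and this ignores constraints, triggers and so on):

CREATE TABLE foo.log_table_sorted AS
    SELECT * FROM foo.log_table ORDER BY rec_insert_time;
CREATE INDEX ON foo.log_table_sorted USING brin (rec_insert_time) WITH (pages_per_range = 1000);
ANALYZE foo.log_table_sorted;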

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

You can see from the code below that the final selectivity is strongly
influenced by the correlation value (REF: brincostestimate):

qualSelectivity = clauselist_selectivity(root, indexQuals,
                                         baserel->relid,
                                         JOIN_INNER, NULL);

/* work out the actual number of ranges in the index */
indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange),
                  1.0);

/*
 * Now calculate the minimum possible ranges we could match with if all of
 * the rows were in the perfect order in the table's heap.
 */
minimalRanges = ceil(indexRanges * qualSelectivity);

/*
 * Now estimate the number of ranges that we'll touch by using the
 * indexCorrelation from the stats. Careful not to divide by zero (note
 * we're using the absolute value of the correlation).
 */
if (*indexCorrelation < 1.0e-10)
    estimatedRanges = indexRanges;
else
    estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);

/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;

CLAMP_PROBABILITY(selec);


My overall view on this is that the BRIN index is not that great here,
since using it doesn't actually eliminate that many rows.

From above we see:

 Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
       Output: <hidden>
       Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
       Rows Removed by Index Recheck: 8187584
       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
       Rows Removed by Filter: 19857107
       Heap Blocks: lossy=1509000

So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those and filter out 8.1 million rows
on the recheck then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages.  It seems
your query would perform much better if you had a btree index such as
(category, rec_insert_time) WHERE field1 IS NOT NULL.
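
In DDL terms, something like this (the index name is just a placeholder):

CREATE INDEX log_table_category_rec_insert_time_idx
    ON foo.log_table (category, rec_insert_time)
    WHERE field1 IS NOT NULL;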

Of course, you've mentioned that you are finding the query executes
more quickly when the plan uses the BRIN index, but I think you're
going to find BRIN unreliable for anything other than INSERT-only
tables in which records are always inserted with an ever-increasing or
ever-decreasing value in the BRIN-indexed column.  If you start
performing UPDATEs then that's going to create holes that new records
will fill, causing the correlation to drop and the BRIN index scan
cost to go up.

On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see someone working on it. I'm
sure something better can be done. It's just not that easy to do with
the scant correlation data we have on the column.

As for whether this is a bug or something that's missing from the docs,
the docs do mention:

"BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table."

https://www.postgresql.org/docs/current/brin-intro.html

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
Thank you for the thorough and thoughtful reply!  Please see below.

On Mon, Oct 14, 2019 at 3:48 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Previous
to 7e534adcd, BRIN indexes were costed based on the selectivity
estimate. There was no accounting for the fact that the pages
for those records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to attempt to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often.  Of
course, the correlation is based on the entire table; if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones.  As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

So what I said originally (and light bulbs now going off) is that the table is insert-only, but it is **pruned (with deletes) to the past year of data**.  I think this is the key to what I've missed.  Once vacuum runs, we have a bunch of old physical space being re-used by new inserts, thus botching that good correlation between physical and logical order.  So it appears the physical order of the data is indeed well-correlated in recent history, but not so when you go back a bit further.  Here are pg_stats:

-[ RECORD 1 ]----------+---------------------------
schemaname             | foo
tablename              | log_table
attname                | rec_insert_time
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | 1.89204e+06
correlation            | 0.193951
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
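
Plugging that correlation into the brincostestimate() code David quoted is quite telling (a rough back-of-the-envelope: ~18350080 heap pages per Tomas's earlier mail, pages_per_range = 1000, and assuming the planner put the time qual alone at roughly 10% of the rows, working backwards from the rows=157328135 estimate):

SELECT ceil(18350080 / 1000.0) AS index_ranges,                    -- ~18351 ranges
       ceil(ceil(18350080 / 1000.0) * 0.10) AS minimal_ranges,     -- ~1836 ranges if perfectly ordered
       least(ceil(ceil(18350080 / 1000.0) * 0.10) / 0.193951,
             ceil(18350080 / 1000.0)) AS estimated_ranges;         -- ~9466, i.e. selec of ~0.5

Dividing by that ~0.19 correlation inflates the expected ranges to about half the index, which is in the same ballpark as the rows=157328135 estimate on the bitmap index scan, and once that many tuples is turned into pages the estimate clamps to essentially the whole heap.  That would explain why the BRIN plan is costed nearly the same as the seq scan.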

I have not tried creating a fresh table, but if I modify my OP query to instead take a 10-day window from 100 days ago, the BRIN index actually performs really badly... essentially identically to the seq scan:

Here is with a seq scan:
                                                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26822167.70..26822170.60 rows=129 width=120) (actual time=200730.856..200730.910 rows=62 loops=1)
   ->  Sort  (cost=26822167.70..26822168.02 rows=129 width=104) (actual time=200730.834..200730.837 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26822160.60..26822163.18 rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
               ->  Unique  (cost=26822160.60..26822161.89 rows=129 width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
                     ->  Sort  (cost=26822160.60..26822160.92 rows=129 width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=0.00..26822156.08 rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 552210
                                 ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399 loops=1)
                                 ->  Materialize  (cost=0.00..26818970.84 rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
                                       ->  Seq Scan on log_table l  (cost=0.00..26818970.20 rows=129 width=99) (actual time=191581.193..200649.013 rows=395 loops=1)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                             Rows Removed by Filter: 315097963
 Planning Time: 1.541 ms
 Execution Time: 200731.273 ms
(19 rows)

Here is with the forced brin index scan:

                                                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26674201.49..26674204.40 rows=129 width=120) (actual time=200303.118..200303.177 rows=62 loops=1)
   ->  Sort  (cost=26674201.49..26674201.82 rows=129 width=104) (actual time=200303.093..200303.096 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26674194.39..26674196.97 rows=129 width=104) (actual time=200302.918..200303.012 rows=62 loops=1)
               ->  Unique  (cost=26674194.39..26674195.68 rows=129 width=124) (actual time=200302.914..200302.998 rows=62 loops=1)
                     ->  Sort  (cost=26674194.39..26674194.71 rows=129 width=124) (actual time=200302.911..200302.929 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=1245.13..26674189.87 rows=129 width=124) (actual time=138087.290..200301.925 rows=395 loops=1)
                                 ->  Bitmap Heap Scan on log_table l  (cost=1244.85..26673801.66 rows=129 width=99) (actual time=138087.083..200298.259 rows=395 loops=1)
                                       Recheck Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                       Rows Removed by Index Recheck: 214939302
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       Rows Removed by Filter: 15326889
                                       Heap Blocks: lossy=13566000
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..1244.81 rows=78608872 width=0) (actual time=678.203..678.203 rows=135660000 loops=1)
                                             Index Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                 ->  Index Only Scan using small_join_table_pkey on small_join_table filter  (cost=0.28..3.01 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=395)
                                       Index Cond: (category = (l.category)::text)
                                       Heap Fetches: 395
 Planning Time: 2.031 ms
 Execution Time: 200304.411 ms
(23 rows)
 
So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get them, filtering out 8.1 million rows
on the recheck and then 19.8 million on the filter.  You've mentioned
that the table's heap is 139 GB, which is about 18 million pages.  It
seems your query would perform much better if you had a btree index
such as (category, rec_insert_time) WHERE field1 IS NOT NULL.
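
Spelled out, that suggestion might look something like this (the index name and the use of CONCURRENTLY are just illustrative, a sketch rather than a tested definition):

CREATE INDEX CONCURRENTLY log_table_category_rec_insert_time_idx
    ON log_table USING btree (category, rec_insert_time)
    WHERE field1 IS NOT NULL;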

I agree a btree would do the trick for performance, but I was trying to avoid the near-100GB overhead of such an index.  For the given example, the somewhat improved performance of BRIN may be acceptable to me.  However, as seen above, it appears that a btree may be my only option for reliable performance with this workload.
 
Of course, you've mentioned that you are finding that when the plan
uses the BRIN index it executes more quickly, but I think you're going
to find BRIN unreliable for anything other than INSERT-only tables in
which records are always inserted with an ever-increasing or
ever-decreasing value in the BRIN-indexed column.  If you start
performing UPDATEs, that's going to create holes that new records
will fill, causing the correlation to drop and the estimated cost of a
BRIN index scan to go up.
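
A rough way to watch that happen is to check the planner's view of the physical ordering in pg_stats after an ANALYZE, for example:

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'log_table'
  AND attname = 'rec_insert_time';

A value near 1 means rows are laid out almost in rec_insert_time order; as freed space gets reused by new inserts, that number drifts downwards and the min/max summary of each BRIN range covers a wider span of values.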

Or deletes, as in my case.
 
On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see someone working on it. I'm
sure something better can be done. It's just not that easy to do with
the scant correlation data we have on the column.

As for whether this is a bug or something that's missing from the
documentation: the docs do mention:

"BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table."

Yes, I was aware of this, and perhaps nothing indeed needs to change with docs here given my case.  But perhaps it would be worth exploring if there could be more detailed stats on physical vs logical correlation, such as when ANALYZE takes its samples, noting physical locations as well as logical values, and allowing the correlation to take account of that more detailed analysis.  Of course, sounds like a huge amount of work with uncertain benefits.

In my case, it could be said that if I am always querying the last few days of data, a BRIN index here is perfect, and a BTREE is way overkill.  That is a real use case to consider.  But more generally, I would drop the BRIN index if I had any other query patterns beyond the few days of data.

However, this may be a fallacy.  It might be that a few days from now, the last 10 days of data will actually be really fragmented, depending only on when VACUUM runs.

As the docs state, I do believe that the only use case that will work really well for BRIN is either a truly insert-only table which is never pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and implement data retention policies :), or a table which is routinely CLUSTERed!
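
For what it's worth, CLUSTER cannot use a BRIN index, so routinely re-clustering on rec_insert_time would presumably mean keeping (or periodically building) a btree anyway.  A rough sketch, with an illustrative index name:

-- hypothetical btree used only to drive CLUSTER
CREATE INDEX log_table_rec_insert_time_btree ON log_table (rec_insert_time);
CLUSTER log_table USING log_table_rec_insert_time_btree;
ANALYZE log_table;  -- refresh the correlation statistic afterwards

CLUSTER also takes an ACCESS EXCLUSIVE lock and rewrites the table, so it is not something to run casually on a 140GB relation.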

Thanks again for the detailed feedback.

Thanks,
Jeremy 

Re: BRIN index which is much faster never chosen by planner

From
Michael Lewis
Date:
Thanks for closing the loop on the data correlation question. I've been playing with BRIN indexes on a log table of sorts and this thread helped clear up some of the behavior I have been seeing.

I am curious, would a partial btree index fit your needs? Perhaps the maintenance overhead is too significant or this is too off-the-wall, but a daily job to create a new index and drop the old one concurrently could give you the performance you need while still saving the extra disk space of a full btree on the timestamp.

CREATE INDEX CONCURRENTLY log_table_rec_insert_time_partial_10_04 ON log_table USING btree ( rec_insert_time ) WHERE rec_insert_time > '2019-10-04'::DATE;
DROP INDEX CONCURRENTLY IF EXISTS log_table_rec_insert_time_partial_10_03;

I would consider including the category column as well, but I suspect that would increase the size of the index significantly. Of course, this depends on the query planner evaluating "l.rec_insert_time >= now() - interval '10 days'" and determining that the partial index satisfies that predicate.
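
If category were included, the rotated index might look something like this (names and dates illustrative, as above):

CREATE INDEX CONCURRENTLY log_table_cat_rec_insert_time_partial_10_04 ON log_table USING btree ( category, rec_insert_time ) WHERE rec_insert_time > '2019-10-04'::DATE;
DROP INDEX CONCURRENTLY IF EXISTS log_table_cat_rec_insert_time_partial_10_03;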

Re: BRIN index which is much faster never chosen by planner

From
Justin Pryzby
Date:
This reminds me of an issue I reported several years ago where Btree index
scans were chosen over a seq scan of a large, INSERT-only table due to very high
correlation, but performed poorly.  I concluded that use of the high "large
scale" correlation on a large 50+GB table caused the planner to fail to account
for a larger number of pages being read nonsequentially (the opposite of your
issue).  I think that's because we were INSERTing data which was at least
approximately sorted on record END time, while the index was on record START
time.  For a large table with a week's data, the correlation of "start time"
was still very high (0.99995).  But scanning the index ends up reading pages
nonsequentially, and also visiting some pages multiple times.

I eked out a patch which made "correlation" a per-index statistic rather than
a per-column one.  That means the planner could distinguish between a
freshly-built btree index and a fragmented one.  (At the time, there was a
hypothesis that our issue was partially due to repeated values of the index
columns.)  It didn't occur to me at the time, but that would also allow
creating numerous, partial BRIN indices, each of which would have separate
correlation computed over just their "restricted range", which *might* also
handle your case, depending on how packed your data is.

https://www.postgresql.org/message-id/flat/20170707234119.GN17566%40telsasoft.com#fdcbebc342b8fb9ad0ff293913f54d11
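
For illustration, one of those partial BRIN indexes might be declared like this (the name, date bounds, and pages_per_range setting are invented, and the separate per-index correlation only exists under the proposed patch, not in stock Postgres):

CREATE INDEX log_table_rec_insert_time_brin_2019_10 ON log_table USING brin (rec_insert_time)
    WITH (pages_per_range = 64)
    WHERE rec_insert_time >= '2019-10-01' AND rec_insert_time < '2019-11-01';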

On Tue, Oct 15, 2019 at 11:05:13AM -0500, Jeremy Finzel wrote:
> I do believe that the only use case that will work really well for BRIN is
> either a truly insert-only table which is never pruned ...  or a table which
> is routinely CLUSTERed!

Or a partitioned table, which for large data sets I highly recommend instead
of DELETE.
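
A minimal sketch of what that could look like with declarative range partitioning (column list trimmed and all names illustrative):

CREATE TABLE log_table_part (
    log_id          bigint,
    category        name,
    rec_insert_time timestamptz NOT NULL
    -- ... remaining columns ...
) PARTITION BY RANGE (rec_insert_time);

CREATE TABLE log_table_part_2019_10 PARTITION OF log_table_part
    FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');

-- retention becomes a metadata-only operation instead of a bulk DELETE
DROP TABLE log_table_part_2019_09;

Each partition also gets its own statistics, so the rec_insert_time correlation is computed per partition rather than once for the whole table.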

Justin



Re: BRIN index which is much faster never chosen by planner

From
David Rowley
Date:
On Wed, 16 Oct 2019 at 05:05, Jeremy Finzel <finzelj@gmail.com> wrote:
> But perhaps it would be worth exploring if there could be more detailed stats on physical vs logical correlation,
> such as when ANALYZE takes its samples, noting physical locations as well as logical values, and allowing the
> correlation to take account of that more detailed analysis.  Of course, sounds like a huge amount of work with
> uncertain benefits.

Yes. I think improving the statistics could be beneficial. It does
appear that the single correlation value for the entire column is not
fine-grained enough for your use case.

> As the docs state, I do believe that the only use case that will work really well for BRIN is either a truly
> insert-only table which is never pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and implement
> data retention policies :), or a table which is routinely CLUSTERed!

Have you considered partitioning the table so that the retention
policy could be implemented by dropping a partition rather than
performing a bulk DELETE?


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: BRIN index which is much faster never chosen by planner

From
David Rowley
Date:
On Wed, 16 Oct 2019 at 11:40, Justin Pryzby <pryzby@telsasoft.com> wrote:
> It didn't occur to me at the time, but that would also allow
> creating numerous, partial BRIN indices, each of which would have separate
> correlation computed over just their "restricted range", which *might* also
> handle your case, depending on how packed your data is.

Perhaps I've misunderstood you, but the correlation that's used is per
column, not per index. The only way to have it calculate multiple
correlations would be to partition the table. There'd be a correlation
for the column on each partition that way.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services