Re: BRIN index which is much faster never chosen by planner - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: BRIN index which is much faster never chosen by planner
Msg-id 20191010231309.gnwra4sv4igs7xhu@development
In response to BRIN index which is much faster never chosen by planner  (Jeremy Finzel <finzelj@gmail.com>)
List pgsql-hackers
On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:
>
> ...
>
>Notice it chooses the smallest BRIN index with 1000 pages per range, and
>this is far faster than the seq scan.
>
>I do believe the estimate is actually way off.  Just a plain EXPLAIN of the
>latter estimates 10x more rows than actual:
> WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
>   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78
>rows=706 width=104)
>               ->  Unique  (cost=24354641.66..24354648.72 rows=706
>width=124)
>                     ->  Sort  (cost=24354641.66..24354643.42 rows=706
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=2385.42..24354608.25
>rows=706 width=124)
>                                 ->  Bitmap Heap Scan on log_table l
> (cost=2383.82..24352408.26 rows=706 width=99)
>                                       Recheck Cond: (rec_insert_time >=
>(now() - '10 days'::interval))
>                                       Filter: ((field1 IS NOT NULL) AND
>(category = 'music'::name))
>                                       ->  Bitmap Index Scan on
>rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
>                                             Index Cond: (rec_insert_time
>>= (now() - '10 days'::interval))
>                                 ->  Bitmap Heap Scan on small_join_table
>filter  (cost=1.60..3.12 rows=1 width=8)
>                                       Recheck Cond: (category =
>(l.category)::text)
>                                       ->  Bitmap Index Scan on
>small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
>                                             Index Cond: (category =
>(l.category)::text)
>(17 rows)
>
>
>Here is EXPLAIN only of the default chosen plan:
> WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
>   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78
>rows=706 width=104)
>               ->  Unique  (cost=24437809.66..24437816.72 rows=706
>width=124)
>                     ->  Sort  (cost=24437809.66..24437811.42 rows=706
>width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id,
>l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=0.00..24437776.25
>rows=706 width=124)
>                                 Join Filter: ((l.category)::text =
>filter.category)
>                                 ->  Seq Scan on log_table l
> (cost=0.00..24420483.80 rows=706 width=99)
>                                       Filter: ((field1 IS NOT NULL) AND
>(category = 'music'::name) AND (rec_insert_time >= (now() - '10
>days'::interval)))
>                                 ->  Materialize  (cost=0.00..33.98
>rows=1399 width=8)
>                                       ->  Seq Scan on small_join_table
>filter  (cost=0.00..26.99 rows=1399 width=8)
>(13 rows)
>
>
>
>Any insight into this is much appreciated.  This is just one example of
>many similar issues I have been finding with BRIN indexes scaling
>predictably with insert-only workloads.
>

This is quite an interesting planning issue. The cost estimates are:

 ->  Seq Scan on foo.log_table l
 (cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

 ->  Bitmap Heap Scan on foo.log_table l
 (cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*
but picks the seq scan anyway. This is likely because we don't compare
the exact costs, but do a fuzzy comparison - a plan has to be at least
1% cheaper to dominate the existing plan. This allows us to save some
work when replacing the paths.
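To illustrate with the numbers from the two plans (just a sketch - the 1%
corresponds to STD_FUZZ_FACTOR = 1.01 in pathnode.c, and the helper below
is an illustration, not the actual add_path() logic):

```python
# Sketch of the fuzzy total-cost comparison used when adding paths.
# STD_FUZZ_FACTOR is 1.01 in PostgreSQL's pathnode.c: a new path must
# be more than 1% cheaper before it dominates an existing path on cost.
STD_FUZZ_FACTOR = 1.01

def fuzzily_cheaper(new_cost, old_cost, fuzz=STD_FUZZ_FACTOR):
    """True if new_cost beats old_cost by more than the fuzz factor."""
    return new_cost * fuzz < old_cost

seq_scan_cost = 24420483.80      # total cost of the seq scan path
bitmap_scan_cost = 24360848.29   # total cost of the bitmap heap scan path

# The bitmap path is nominally cheaper...
assert bitmap_scan_cost < seq_scan_cost
# ...but only by ~0.2%, so fuzzily the two paths compare as equal
# and the already-kept seq scan path survives.
print(fuzzily_cheaper(bitmap_scan_cost, seq_scan_cost))  # False
```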

In this case the difference is only about 0.2%, so we keep the seqscan
path. The real question is why the planner came to this cost, when it
got pretty good row estimates etc.

Looking at the cost_bitmap_heap_scan() I think the costing issue comes
mostly from this bit:

  /*
   * For small numbers of pages we should charge spc_random_page_cost
   * apiece, while if nearly all the table's pages are being read, it's more
   * appropriate to charge spc_seq_page_cost apiece.  The effect is
   * nonlinear, too. For lack of a better idea, interpolate like this to
   * determine the cost per page.
   */
  if (pages_fetched >= 2.0)
      cost_per_page = spc_random_page_cost -
          (spc_random_page_cost - spc_seq_page_cost)
          * sqrt(pages_fetched / T);
  else
      cost_per_page = spc_random_page_cost;
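In Python terms, and assuming the default page costs (random_page_cost =
4.0, seq_page_cost = 1.0 - a tablespace can override these), the
interpolation behaves like this:

```python
import math

# Sketch of the per-page cost interpolation in cost_bitmap_heap_scan().
# Assumes default page costs; tablespaces can override them.
random_page_cost = 4.0
seq_page_cost = 1.0

def bitmap_cost_per_page(pages_fetched, T):
    """T is the total number of pages in the table."""
    if pages_fetched >= 2.0:
        return random_page_cost - \
            (random_page_cost - seq_page_cost) * math.sqrt(pages_fetched / T)
    return random_page_cost

T = 18350080  # pages in the ~140GB table from this thread

# Fetching (nearly) all pages: cost per page degrades to seq_page_cost.
print(bitmap_cost_per_page(T, T))        # 1.0
# Fetching ~8% of the table: still well above seq_page_cost.
print(bitmap_cost_per_page(1509030, T))  # ~3.14
```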

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number). This is
produced by compute_bitmap_pages() which also computes pages_fetched,
and I guess that's going to be pretty close to all pages, because with
T = 18350080 (which is 140GB) and using

  pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);

we get 29731418 (which is more than 18350080, so it gets clamped).

So it seems the optimizer believes it'll have to scan the whole table
anyway. In reality, of course, the number of tuples returned by the
index is 10x lower, so the formula above would give us only about
10975262 pages (so ~1/2 the table). The actual number of pages is
however even lower - only about 1509030, i.e. ~8% of the table.
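Plugging the numbers in directly (a quick sketch; the tuple counts below
are the rounded figures from above, so the results differ slightly from
the exact values):

```python
def pages_fetched(tuples_fetched, T):
    """The pages_fetched estimate from compute_bitmap_pages(),
    clamped to the table size T (in pages)."""
    pages = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched)
    return min(pages, T)

T = 18350080  # table size in pages (~140GB)

# With the (over)estimated ~157M tuples the raw formula exceeds T,
# so the estimate clamps to "read every page of the table".
print(pages_fetched(157328135, T))  # 18350080 (clamped to T)

# With roughly a tenth of that (close to the actual tuple count) the
# formula still predicts about half the table, while only ~1.5M pages
# (~8%) are really touched.
print(pages_fetched(15732813, T))   # ~11M pages
```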

So this seems like a combination of multiple issues. Firstly, the bitmap
index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
might be worth increasing stats target on that column, or something like
that. Not sure, but it's about the only "fixable" thing here, I think.

The other issue is that the estimation of pages fetched using bitmap
heap scan is rather crude - but that's simply hard, and I don't think we
can fundamentally improve this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


