Re: BRIN index which is much faster never chosen by planner - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: BRIN index which is much faster never chosen by planner
Msg-id: 20191010231309.gnwra4sv4igs7xhu@development
In response to: BRIN index which is much faster never chosen by planner (Jeremy Finzel <finzelj@gmail.com>)
Responses: Re: BRIN index which is much faster never chosen by planner
List: pgsql-hackers
On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:
>
> ...
>
>Notice it chooses the smallest BRIN index with 1000 pages per range, and
>this is far faster than the seq scan.
>
>I do believe the estimate is actually way off. Just a plain EXPLAIN of the
>latter estimates 10x more rows than actual:
> WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
>   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78 rows=706 width=104)
>               ->  Unique  (cost=24354641.66..24354648.72 rows=706 width=124)
>                     ->  Sort  (cost=24354641.66..24354643.42 rows=706 width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=2385.42..24354608.25 rows=706 width=124)
>                                 ->  Bitmap Heap Scan on log_table l  (cost=2383.82..24352408.26 rows=706 width=99)
>                                       Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
>                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
>                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
>                                             Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
>                                 ->  Bitmap Heap Scan on small_join_table filter  (cost=1.60..3.12 rows=1 width=8)
>                                       Recheck Cond: (category = (l.category)::text)
>                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
>                                             Index Cond: (category = (l.category)::text)
>(17 rows)
>
>
>Here is EXPLAIN only of the default chosen plan:
> WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
>   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
>         Sort Key: unique_cases.source, unique_cases.rec_insert_time
>         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78 rows=706 width=104)
>               ->  Unique  (cost=24437809.66..24437816.72 rows=706 width=124)
>                     ->  Sort  (cost=24437809.66..24437811.42 rows=706 width=124)
>                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
>                           ->  Nested Loop  (cost=0.00..24437776.25 rows=706 width=124)
>                                 Join Filter: ((l.category)::text = filter.category)
>                                 ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=706 width=99)
>                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
>                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8)
>                                       ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
>(13 rows)
>
>
>Any insight into this is much appreciated. This is just one example of
>many similar issues I have been finding with BRIN indexes scaling
>predictably with insert-only workloads.
>

It's quite an interesting planning issue. The cost estimates are:

    ->  Seq Scan on foo.log_table l  (cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

    ->  Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*, but picks the seq scan anyway. This is likely because we don't really compare the exact costs, but do a fuzzy comparison - a plan has to be at least 1% cheaper to dominate the existing plan. This allows us to save some work when replacing paths. In this case the difference is only about 0.2%, so we keep the seqscan path.

The real question is why the planner arrived at this cost when it got pretty good row estimates etc.
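As a quick illustration of that fuzzy comparison, here is a minimal sketch reduced to total cost only. It is not the actual add_path() logic (which also weighs startup cost, pathkeys, parameterization and row counts); the two cost figures and the 1% fuzz factor are simply the ones discussed above:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Simplified sketch of the "at least 1% cheaper" rule described above,
     * reduced to total cost only.  Not the real PostgreSQL path comparison.
     */
    static bool
    displaces_on_total_cost(double new_cost, double old_cost, double fuzz_factor)
    {
        /* the new path must beat the old one by more than the fuzz factor */
        return new_cost * fuzz_factor < old_cost;
    }

    int
    main(void)
    {
        double seqscan_cost = 24420483.80;  /* seq scan cost from the plans above */
        double bitmap_cost = 24360848.29;   /* bitmap heap scan cost from above */
        double fuzz_factor = 1.01;          /* "at least 1% cheaper" */

        printf("bitmap scan is cheaper by %.2f%%\n",
               100.0 * (seqscan_cost - bitmap_cost) / seqscan_cost);
        printf("displaces the seqscan path: %s\n",
               displaces_on_total_cost(bitmap_cost, seqscan_cost, fuzz_factor)
               ? "yes" : "no");
        return 0;
    }

With these numbers the bitmap path comes out roughly 0.24% cheaper, well under the 1% threshold, so the seqscan path survives.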
Looking at cost_bitmap_heap_scan(), I think the costing issue comes mostly from this bit:

    /*
     * For small numbers of pages we should charge spc_random_page_cost
     * apiece, while if nearly all the table's pages are being read, it's more
     * appropriate to charge spc_seq_page_cost apiece.  The effect is
     * nonlinear, too. For lack of a better idea, interpolate like this to
     * determine the cost per page.
     */
    if (pages_fetched >= 2.0)
        cost_per_page = spc_random_page_cost -
            (spc_random_page_cost - spc_seq_page_cost)
            * sqrt(pages_fetched / T);
    else
        cost_per_page = spc_random_page_cost;

The index scan is estimated to return 157328135 rows, i.e. about 50% of the table (apparently that's ~10x more than the actual number). This is produced by compute_bitmap_pages(), which also computes pages_fetched, and I guess that's going to be pretty close to all pages, because with T = 18350080 (which is 140GB) and using

    pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);

we get 29731418 (which is more than 18350080, so it gets clamped). So the optimizer essentially believes it will have to scan the whole table anyway.

In reality, of course, the number of tuples returned by the index is 10x lower, so the formula above would give us only about 10975262 pages (~1/2 the table). The actual number of pages is, however, even lower - only about 1509030, i.e. ~8% of the table.

So this seems like a combination of multiple issues. Firstly, the row estimate for the bitmap index scan on rec_insert_time_brin_1000 seems somewhat poor. It might be worth increasing the statistics target on that column, or something like that. Not sure, but it's about the only "fixable" thing here, I think.

The other issue is that the estimate of pages fetched by the bitmap heap scan is rather crude - but that's simply a hard problem, and I don't think we can fundamentally improve it.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
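As an aside, the pages_fetched arithmetic above is easy to reproduce with a small standalone sketch of just that formula and the clamp to the table size. The table size and the index scan estimate are taken from the quoted plan; the "actual" row count is an assumption of roughly one tenth of the estimate, as described above:

    #include <math.h>
    #include <stdio.h>

    /*
     * Standalone sketch of the pages_fetched formula quoted above, plus the
     * clamp to the table size T.  Only the single formula, not the full
     * compute_bitmap_pages() logic.
     */
    static double
    pages_fetched_estimate(double tuples_fetched, double T)
    {
        return (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
    }

    int
    main(void)
    {
        double T = 18350080;    /* pages in log_table (~140GB) */

        /* row estimate of the bitmap index scan in the quoted plan */
        double est = pages_fetched_estimate(156577455, T);

        /* assumed actual row count, roughly 1/10 of that estimate */
        double act = pages_fetched_estimate(15657745, T);

        printf("estimated rows -> %.0f pages, clamped to %.0f (whole table)\n",
               est, fmin(est, T));
        printf("actual rows    -> %.0f pages (vs ~1509030 really read)\n", act);

        return 0;
    }

The first case exceeds T and gets clamped, i.e. the planner expects to touch every page; the second still predicts roughly 11M pages, an order of magnitude more than the ~1.5M pages actually read.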