Re: BRIN index which is much faster never chosen by planner - Mailing list pgsql-hackers

From Jeremy Finzel
Subject Re: BRIN index which is much faster never chosen by planner
Msg-id CAMa1XUi-LicCQs8BquTiaV_Mrjk57Efroew=aKuZ9WnpQ26Rvw@mail.gmail.com
In response to Re: BRIN index which is much faster never chosen by planner  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
Thank you for the thorough and thoughtful reply!  Please see below.

On Mon, Oct 14, 2019 at 3:48 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Prior
to 7e534adcd, BRIN indexes were costed based on the selectivity
estimate alone; the costing did not account for the fact that the pages
for those records might be spread out over the entire table.
Post 7e534adcd, we use the correlation estimate, together with the
selectivity estimate, to estimate how many pages (more specifically
"ranges") we're likely to hit. This commit was intended to fix the
issue we had with BRIN indexes being selected far too often.  Of
course, the correlation is based on the entire table; if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones.  As a test, you could try creating a new table, copying the
records over to it in rec_insert_time order, and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

So what I said originally (and light bulbs are now going off) is that the table is insert-only, but it is **pruned (deletes) to the past year of data**.  I think this is the key piece I've missed.  Once VACUUM runs, we have a bunch of old physical space being re-used by new inserts, thus botching that good correlation between physical and logical order.  So it appears the physical order of the data is indeed well-correlated in recent history, but not when you go back a bit further.  Here is the pg_stats row:

-[ RECORD 1 ]----------+---------------------------
schemaname             | foo
tablename              | log_table
attname                | rec_insert_time
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | 1.89204e+06
correlation            | 0.193951
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
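For reference, the ordered-copy test David suggests would look something like the following (untested on my end; the pages_per_range value is only a guess from the index name rec_insert_time_brin_1000 seen in the plans below):

```sql
-- Untested sketch: rewrite the table in rec_insert_time order and
-- check whether the planner then chooses the BRIN index.
CREATE TABLE foo.log_table_ordered AS
  SELECT * FROM foo.log_table ORDER BY rec_insert_time;
CREATE INDEX log_table_ordered_brin
  ON foo.log_table_ordered USING brin (rec_insert_time)
  WITH (pages_per_range = 1000);  -- guessed from rec_insert_time_brin_1000
ANALYZE foo.log_table_ordered;
-- The correlation for the ordered copy should now be close to 1:
SELECT correlation FROM pg_stats
 WHERE schemaname = 'foo'
   AND tablename  = 'log_table_ordered'
   AND attname    = 'rec_insert_time';
```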

I have not tried creating a fresh table, but if I modify my OP query to instead take a 10-day window from 100 days ago, the BRIN index actually performs really badly... essentially identically to the seq scan:

Here is with a seq scan:
                                                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26822167.70..26822170.60 rows=129 width=120) (actual time=200730.856..200730.910 rows=62 loops=1)
   ->  Sort  (cost=26822167.70..26822168.02 rows=129 width=104) (actual time=200730.834..200730.837 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26822160.60..26822163.18 rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
               ->  Unique  (cost=26822160.60..26822161.89 rows=129 width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
                     ->  Sort  (cost=26822160.60..26822160.92 rows=129 width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=0.00..26822156.08 rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 552210
                                 ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399 loops=1)
                                 ->  Materialize  (cost=0.00..26818970.84 rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
                                       ->  Seq Scan on log_table l  (cost=0.00..26818970.20 rows=129 width=99) (actual time=191581.193..200649.013 rows=395 loops=1)
                                             Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                             Rows Removed by Filter: 315097963
 Planning Time: 1.541 ms
 Execution Time: 200731.273 ms
(19 rows)

Here is with the forced brin index scan:

                                                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=26674201.49..26674204.40 rows=129 width=120) (actual time=200303.118..200303.177 rows=62 loops=1)
   ->  Sort  (cost=26674201.49..26674201.82 rows=129 width=104) (actual time=200303.093..200303.096 rows=62 loops=1)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 33kB
         ->  Subquery Scan on unique_cases  (cost=26674194.39..26674196.97 rows=129 width=104) (actual time=200302.918..200303.012 rows=62 loops=1)
               ->  Unique  (cost=26674194.39..26674195.68 rows=129 width=124) (actual time=200302.914..200302.998 rows=62 loops=1)
                     ->  Sort  (cost=26674194.39..26674194.71 rows=129 width=124) (actual time=200302.911..200302.929 rows=395 loops=1)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 80kB
                           ->  Nested Loop  (cost=1245.13..26674189.87 rows=129 width=124) (actual time=138087.290..200301.925 rows=395 loops=1)
                                 ->  Bitmap Heap Scan on log_table l  (cost=1244.85..26673801.66 rows=129 width=99) (actual time=138087.083..200298.259 rows=395 loops=1)
                                       Recheck Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                       Rows Removed by Index Recheck: 214939302
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       Rows Removed by Filter: 15326889
                                       Heap Blocks: lossy=13566000
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..1244.81 rows=78608872 width=0) (actual time=678.203..678.203 rows=135660000 loops=1)
                                             Index Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
                                 ->  Index Only Scan using small_join_table_pkey on small_join_table filter  (cost=0.28..3.01 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=395)
                                       Index Cond: (category = (l.category)::text)
                                       Heap Fetches: 395
 Planning Time: 2.031 ms
 Execution Time: 200304.411 ms
(23 rows)
 
So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those and filter out 8.1 million rows
on the recheck then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages.  It seems
your query would perform much better if you had a partial btree index
such as (category, rec_insert_time) WHERE field1 IS NOT NULL.

I agree a btree would do the trick for performance, but I was trying to avoid the near-100 GB overhead of such an index.  For the given example, the somewhat improved performance of BRIN may be acceptable to me.  However, as seen above, it appears that given this workload, a btree may be my only option for reliable performance.
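For the record, the partial index David suggests would be along these lines (the index name is mine); the WHERE clause keeps the btree smaller by excluding rows with a NULL field1:

```sql
-- Hypothetical name for the partial btree David suggests:
CREATE INDEX log_table_cat_rit_btree
    ON foo.log_table (category, rec_insert_time)
 WHERE field1 IS NOT NULL;

-- Compare its on-disk footprint against the existing BRIN index:
SELECT pg_size_pretty(pg_relation_size('foo.log_table_cat_rit_btree'))   AS btree_size,
       pg_size_pretty(pg_relation_size('foo.rec_insert_time_brin_1000')) AS brin_size;
```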
 
Of course, you've mentioned that you are finding when the plan uses
the BRIN index that it executes more quickly, but I think you're going
to find BRIN unreliable for anything other than INSERT-only
tables in which the records are always inserted with an ever-increasing
or ever-decreasing value in the BRIN-indexed column.  If you start
performing UPDATEs, then that's going to create holes that new records
will fill, causing the correlation to drop and the cost of the
BRIN index scan to go up.

Or deletes, as in my case.
 
On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see someone working on it. I'm
sure something better can be done. It's just not that easy to do with
the scant correlation data we have on the column.

As for whether this is a bug or something that's missing from the
documentation: the documentation does mention:

"BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table."

Yes, I was aware of this, and perhaps nothing indeed needs to change in the docs given my case.  But perhaps it would be worth exploring whether there could be more detailed stats on physical vs. logical correlation; for example, when ANALYZE takes its samples, it could note physical locations as well as logical values, allowing the correlation to account for that more detailed analysis.  Of course, it sounds like a huge amount of work with uncertain benefits.  In my case, it could be said that if I am always querying the last few days of data, a BRIN index here is perfect, and a btree is way overkill.  That is a real use case to consider.  But more generally, I would drop the BRIN index if I had any query patterns beyond the last few days of data.

However, this may be a fallacy.  It might be that a few days from now, the last 10 days of data will actually be really fragmented, depending only on when VACUUM runs.

As the docs state, I do believe that the only use cases that will work really well for BRIN are either a truly insert-only table which is never pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and implement data retention policies :), or a table which is routinely CLUSTERed!
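A periodic re-cluster would look roughly like this sketch (index name is mine; note that CLUSTER cannot use a BRIN index directly, so a throwaway btree has to be built just for the rewrite, and CLUSTER takes an ACCESS EXCLUSIVE lock on the table for its duration):

```sql
-- Sketch of a periodic re-cluster to restore physical/logical correlation.
CREATE INDEX log_table_rit_btree ON foo.log_table (rec_insert_time);
CLUSTER foo.log_table USING log_table_rit_btree;
DROP INDEX foo.log_table_rit_btree;
ANALYZE foo.log_table;  -- refresh the correlation estimate for the planner
```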

Thanks again for the detailed feedback.

Thanks,
Jeremy 
