Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-08-07
Msg-id: 20200807162701.v7lyzun32lqku3md@development
In response to: Re: WIP: BRIN multi-range indexes (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: WIP: BRIN multi-range indexes
List: pgsql-hackers
On Tue, Aug 04, 2020 at 05:17:43PM +0200, Tomas Vondra wrote:
>On Tue, Aug 04, 2020 at 05:36:51PM +0300, Alexander Korotkov wrote:
>>Hi, Tomas!
>>
>>Sorry for the late reply.
>>
>>On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>I think there's a number of weak points in this approach.
>>>
>>>Firstly, it assumes the summaries can be represented as arrays of
>>>built-in types, which I'm not really sure about. It clearly is not
>>>true for the bloom opclasses, for example. But even for minmax
>>>opclasses it's going to be tricky, because the ranges may be on
>>>different data types, so presumably we'd need a somewhat nested data
>>>structure.
>>>
>>>Moreover, a multi-minmax summary contains either points or intervals,
>>>which requires additional fields/flags to indicate that. That further
>>>complicates things ...
>>>
>>>Maybe we could decompose that into separate arrays or something, but
>>>honestly it seems somewhat premature - there are far more important
>>>aspects to discuss, I think (e.g. how the ranges are built/merged in
>>>multi-minmax, or whether the bloom opclasses are useful at all).
>>
>>I see. But there is at least a second option: introduce a new data
>>type with just an output function, similar to the way
>>gist/tsvector_ops uses the gtsvector key type. I think it would be
>>more transparent than using just bytea. Also, this is an approach we
>>already use in core.
>>
>
>So you're proposing to have new data types "brin_minmax_multi_summary"
>and "brin_bloom_summary" (or some other names), with output functions
>printing something nicer? I suppose that could work, and we could even
>add pageinspect functions returning the value as raw bytea.
>
>Good idea!
>

Attached is an updated version of the patch series, implementing this.
Adding the extra data types was fairly simple, because both the bloom
and minmax-multi indexes already used the "struct as varlena" approach,
so all that was needed was a bunch of in/out functions and catalog
records. I've left the changes in separate patches for clarity;
ultimately they'll get merged into the other parts.
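As an illustration of what the new types buy us: with the output
functions in place, the per-range summaries should print in readable
form through the existing pageinspect interface, instead of showing up
as opaque bytea. A sketch only - "some_bloom_idx" is a made-up index
name, and block 2 is typically the first regular (data) page of a BRIN
index, after the metapage and revmap:

  create extension if not exists pageinspect;

  -- the "value" column holds the per-range summary; with the new
  -- pg_brin_bloom_summary type it should render via its output function
  select itemoffset, blknum, value
    from brin_page_items(get_raw_page('some_bloom_idx', 2),
                         'some_bloom_idx');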
This reminded me that the current costing may not quite work, because it
depends on how well the index is correlated with the table. That may be
OK for minmax-multi in most cases, but for bloom it makes almost no
sense - correlation does not really matter for bloom filters; what
matters is the number of values in each range.

Consider this example:

create table t (a int);

insert into t select x from (
  select (i/10) as x from generate_series(1,10000000) s(i)
  order by random()
) foo;

create index on t using brin (
  a int4_bloom_ops(n_distinct_per_range=6000,
                   false_positive_rate=0.05)
) with (pages_per_range = 16);

vacuum analyze t;

test=# explain analyze select * from t where a = 10000;
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..169247.71 rows=10 width=4) (actual time=38.088..513.654 rows=10 loops=1)
   Filter: (a = 10000)
   Rows Removed by Filter: 9999990
 Planning Time: 0.060 ms
 Execution Time: 513.719 ms
(5 rows)

test=# set enable_seqscan = off;
SET

test=# explain analyze select * from t where a = 10000;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=5553.07..174800.78 rows=10 width=4) (actual time=7.790..27.585 rows=10 loops=1)
   Recheck Cond: (a = 10000)
   Rows Removed by Index Recheck: 224182
   Heap Blocks: lossy=992
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5553.06 rows=9999977 width=0) (actual time=7.006..7.007 rows=9920 loops=1)
         Index Cond: (a = 10000)
 Planning Time: 0.052 ms
 Execution Time: 27.658 ms
(8 rows)

Clearly, the main problem is brincostestimate relying on correlation to
tweak the selectivity estimates, leading to an estimate of almost the
whole table when in practice we only scan a tiny fraction of it.

Part 0008 is an experimental patch that tweaks this logic to ignore
correlation for the bloom and minmax-multi opclasses, producing this
plan:

test=# explain analyze select * from t where a = 10000;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=5542.01..16562.95 rows=10 width=4) (actual time=12.013..34.705 rows=10 loops=1)
   Recheck Cond: (a = 10000)
   Rows Removed by Index Recheck: 224182
   Heap Blocks: lossy=992
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5542.00 rows=3615 width=0) (actual time=11.108..11.109 rows=9920 loops=1)
         Index Cond: (a = 10000)
 Planning Time: 0.386 ms
 Execution Time: 34.778 ms
(8 rows)

which is way closer to reality, of course. I'm not entirely sure it
behaves correctly for multi-column BRIN indexes, but I think as a PoC
it's sufficient.

For bloom, I think we can be a bit smarter - we could use the false
positive rate as the "minimum expected selectivity" or something like
that. After all, the false positive rate essentially answers the
question "given a random value, what's the chance that a bloom filter
matches it?" So for a table with N ranges, we'd expect about (N * fpr)
of them to match; see the back-of-the-envelope calculation below.

Of course, the problem is that this only works for "full" bloom filters.
Ranges with fewer distinct values will have a much lower probability of
matching, and ranges with unexpectedly many distinct values a much
higher one. But I think we can ignore that and assume the index was
created with good parameters, so the bloom filters won't degrade and
the target fpr is probably a defensive value.
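To put rough numbers on that, using the example above (back-of-the-
envelope only; the ~44248-page size of t can be derived from the
seqscan cost, and pages_per_range = 16):

  select ceil(44248 / 16.0)        as n_ranges,    -- 2766
         ceil(44248 / 16.0) * 0.05 as fpr_matches; -- ~138

So even with fully "saturated" filters, a random constant should match
only about 138 of the ~2766 ranges, i.e. roughly 5% of the table -
nowhere near the almost-whole-table estimate. (The actual scan above
touched 992 lossy heap blocks, i.e. 992/16 = 62 ranges, comfortably
below that bound.)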
For minmax-multi, we probably should not ignore correlation entirely.
It does handle imperfect correlation much more gracefully than plain
minmax, but it still depends on reasonably ordered data. A possible
improvement would be to compute the average "coverage" of the ranges:
given the length of the column domain, D = MAX(column) - MIN(column),
compute what fraction of it a given range covers by summing the lengths
of the intervals in its summary and dividing by D, and then average
that over all BRIN ranges. This would allow us to estimate how many
ranges a random value from the column domain matches, I think; a worked
example follows below.

But this requires extending the data ANALYZE collects for indexes - I
don't think any BRIN-specific stats are collected at the moment.
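To make the arithmetic concrete (made-up numbers, since none of these
stats are collected yet): suppose min(a) = 0 and max(a) = 1000000, so
D = 1000000, and one range's summary consists of the intervals [0,100]
and [5000,5100]:

  select ((100 - 0) + (5100 - 5000)) / 1000000.0 as coverage; -- 0.0002

Averaging the coverage over all N ranges, a uniformly random constant
would then be expected to match about (N * avg_coverage) ranges.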
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
- 0001-Pass-all-keys-to-BRIN-consistent-function-a-20200807.patch
- 0002-Move-IS-NOT-NULL-checks-to-bringetbitmap-20200807.patch
- 0003-Move-processing-of-NULLs-from-BRIN-support--20200807.patch
- 0004-BRIN-bloom-indexes-20200807.patch
- 0005-add-special-pg_brin_bloom_summary-data-type-20200807.patch
- 0006-BRIN-multi-range-minmax-indexes-20200807.patch
- 0007-add-special-pg_brin_minmax_multi_summary-da-20200807.patch
- 0008-tweak-costing-for-bloom-minmax-multi-indexe-20200807.patch