Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Date:
Msg-id: d288a5f2-3225-64c4-2be0-62379a138a34@enterprisedb.com
In response to: Re: WIP: BRIN multi-range indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: WIP: BRIN multi-range indexes (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers

On 1/20/21 1:07 AM, Tomas Vondra wrote:
> On 1/19/21 9:44 PM, John Naylor wrote:
>> On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra 
>> <tomas.vondra@enterprisedb.com> wrote:
>>  > I suspect it's due to minmax having to decide which "ranges" to merge,
>>  > which requires repeated sorting, etc. I certainly don't dare to claim
>>  > the current algorithm is perfect. I wouldn't have expected such big
>>  > difference, though - so definitely worth investigating.
>>
>> It seems that monotonically increasing (or decreasing) values in a 
>> table are a worst case scenario for multi-minmax indexes, or 
>> basically, unique values within a range. I'm guessing it's because it 
>> requires many passes to fit all the values into a limited number of 
>> ranges. I tried using smaller pages_per_range numbers, 32 and 8, and 
>> that didn't help.
>>
>> Now, with a different data distribution, using only 10 values that 
>> repeat over and over, the results are much more sympathetic to 
>> multi-minmax:
>>
>> insert into iot (num, create_dt)
>> select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' seconds')::interval
>> from generate_series(1,5*365*24*60*60) x;
>>
>> create index cd_single on iot using brin(create_dt);
>> 27.2s
>>
>> create index cd_multi on iot using brin(create_dt timestamptz_minmax_multi_ops);
>> 30.4s
>>
>> create index cd_bt on iot using btree(create_dt);
>> 61.8s
>>
>> Circling back to the monotonic case, I tried running a simple perf 
>> record on a backend creating a multi-minmax index on a timestamptz 
>> column and these were the highest non-kernel calls:
>> +   21.98%    21.91%  postgres         postgres            [.] FunctionCall2Coll
>> +    9.31%     9.29%  postgres         postgres            [.] compare_combine_ranges
>> +    8.60%     8.58%  postgres         postgres            [.] qsort_arg
>> +    5.68%     5.66%  postgres         postgres            [.] brin_minmax_multi_add_value
>> +    5.63%     5.60%  postgres         postgres            [.] timestamp_lt
>> +    4.73%     4.71%  postgres         postgres            [.] reduce_combine_ranges
>> +    3.80%     0.00%  postgres         [unknown]           [.] 0x0320016800040000
>> +    3.51%     3.50%  postgres         postgres            [.] timestamp_eq
>>
>> There's no one place that's pathological enough to explain the 4x 
>> slowness over traditional BRIN and nearly 3x slowness over btree when 
>> using a large number of unique values per range, so making progress 
>> here would have to involve a more holistic approach.
>>
> 
> Yeah. This very much seems like the primary problem is in how we build 
> the ranges incrementally - with monotonic sequences, we end up having to 
> merge the ranges over and over again. I don't know what the structure of 
> the table was, but I guess it was kinda narrow (very few columns), which 
> exacerbates the problem further, because the number of rows per range 
> will be way higher than in real-world tables.
> 
> I do think the solution to this might be to allow more values during 
> batch index creation, and only "compress" to the requested number at the 
> very end (when serializing to on-disk format).
> 
> There are a couple additional comments about possibly replacing a 
> sequential scan with a binary search, which could help a bit too.
> 

OK, I took a look at this and came up with two optimizations that improve 
the pathological cases. I've kept them as separate patches on top of the 
previous patch version, to allow easier review of the changes.


0007 - This reworks how the ranges are reduced by merging the closest 
ranges. Instead of doing that iteratively in a fairly expensive loop, the 
new reduce_combine_ranges() uses a much simpler approach.

There are a couple more optimizations (skipping expensive code when it's 
not needed, etc.) which should help a bit too.
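
To illustrate the idea, here's a standalone sketch (plain doubles and 
malloc instead of Datum/palloc, made-up names, not the actual patch code) 
of one way to merge the closest ranges without the iterative loop: pick 
the gaps to close once, by sorting them, and then rebuild the array in a 
single pass.

#include <stdbool.h>
#include <stdlib.h>

typedef struct Range { double min; double max; } Range;
typedef struct Gap { double dist; int index; } Gap;

/* order gaps by distance, smallest first */
static int
gap_cmp(const void *a, const void *b)
{
    double da = ((const Gap *) a)->dist;
    double db = ((const Gap *) b)->dist;

    return (da > db) - (da < db);
}

/*
 * Reduce a sorted, non-overlapping array of ranges to at most max_ranges
 * entries by merging neighbors separated by the smallest gaps. Returns
 * the new number of ranges.
 */
static int
reduce_ranges(Range *ranges, int nranges, int max_ranges)
{
    int     nmerge = nranges - max_ranges;
    int     nout = 0;
    Gap    *gaps;
    bool   *merge_with_next;

    if (nmerge <= 0)
        return nranges;

    gaps = malloc(sizeof(Gap) * (nranges - 1));
    merge_with_next = calloc(nranges, sizeof(bool));

    /* distance between each range and its successor */
    for (int i = 0; i < nranges - 1; i++)
    {
        gaps[i].dist = ranges[i + 1].min - ranges[i].max;
        gaps[i].index = i;
    }

    /* mark the nmerge smallest gaps for merging */
    qsort(gaps, nranges - 1, sizeof(Gap), gap_cmp);
    for (int i = 0; i < nmerge; i++)
        merge_with_next[gaps[i].index] = true;

    /* single pass: extend the current output range or start a new one */
    for (int i = 1; i < nranges; i++)
    {
        if (merge_with_next[i - 1])
            ranges[nout].max = ranges[i].max;
        else
            ranges[++nout] = ranges[i];
    }

    free(gaps);
    free(merge_with_next);

    return nout + 1;
}

The point is simply that the expensive part runs once per reduction, 
rather than inside a repeated loop.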


0008 - This is a WIP version of the batch mode. Originally, when building 
an index we'd "fill" the small buffer, then combine some of the ranges to 
free ~25% of the space for new values, and we'd do this over and over. 
This involves some expensive steps (sorting etc.), and for some 
pathological cases (like monotonic sequences) it performed particularly 
poorly. The new code simply collects all values for the range, and then 
does the expensive stuff only once.

Note: These parts are fairly new, with minimal testing so far.
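
To make the difference concrete, here is a rough sketch of the two modes 
(again with made-up names and plain doubles, not the patch code; 
reduce_values() stands in for the expensive sort/merge step):

#include <stdlib.h>

typedef struct ValueBuffer
{
    double *values;     /* values collected for the current page range */
    int     nvalues;
    int     maxvalues;  /* allocated size of the buffer */
    int     target;     /* number of values allowed in the on-disk summary */
} ValueBuffer;

/* the expensive step: sort, deduplicate, merge down to "target" values */
extern void reduce_values(ValueBuffer *buf, int target);

/* original approach: reduce whenever the small buffer fills up */
static void
add_value_incremental(ValueBuffer *buf, double value)
{
    if (buf->nvalues == buf->maxvalues)
        reduce_values(buf, buf->target);    /* happens over and over */
    buf->values[buf->nvalues++] = value;
}

/* batch mode: just append (note: no duplicate check), growing as needed */
static void
add_value_batch(ValueBuffer *buf, double value)
{
    if (buf->nvalues == buf->maxvalues)
    {
        buf->maxvalues *= 2;
        buf->values = realloc(buf->values, sizeof(double) * buf->maxvalues);
    }
    buf->values[buf->nvalues++] = value;
}

/* batch mode: the expensive stuff runs only once, before serialization */
static void
finalize_range(ValueBuffer *buf)
{
    reduce_values(buf, buf->target);
}

With a monotonic sequence almost every value extends the summary, so the 
incremental path hits the reduce step constantly - which matches the 
profile above (compare_combine_ranges, qsort_arg).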

When measured on a table with 10M rows, using a number of data sets with 
different patterns, the index build times look like this:

     dataset              btree  minmax  unpatched  patched    diff
     --------------------------------------------------------------
     monotonic-100-asc    3023     1002       1281     1722    1.34
     monotonic-100-desc   3245     1042       1363     1674    1.23
     monotonic-10000-asc  2597     1028       2469     2272    0.92
     monotonic-10000-desc 2591     1021       2157     2246    1.04
     monotonic-asc        1863      968       4884     1106    0.23
     monotonic-desc       2546     1017       3520     2337    0.66
     random-100           3648     1133       1594     1797    1.13
     random-10000         3507     1124       1651     2657    1.61

The btree and minmax columns are the existing index types. "unpatched" 
means the minmax-multi index from the previous patch version, "patched" is 
with 0007 and 0008 applied, and "diff" shows patched/unpatched. The 
benchmarking script is attached.

The pathological case (monotonic-asc) is now 4x faster, roughly equal to 
regular minmax index build. The opposite (monotonic-desc) is about 33% 
faster, roughly in line with btree.

There are a couple cases where it's actually a bit slower - those are the 
cases with very few distinct values per range. I believe this happens 
because in batch mode the code does not check whether the summary already 
contains the value; it just adds it to the buffer, so the final reduction 
step ends up being more expensive than the incremental approach.

I believe there's some "compromise" between those two extremes, i.e. we 
should use a buffer that is neither too small nor too large, but something 
in between, so that the reduction happens once in a while but not too 
often (as with the original aggressive approach).
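
For example (reusing the hypothetical ValueBuffer sketch from above, with 
a made-up constant that would need benchmarking), the trigger could look 
something like this, assuming the buffer is allocated large enough:

#define REDUCE_FACTOR 10    /* made-up value */

/* middle ground: let the buffer grow to a multiple of the target size,
 * so the reduction still runs, but only once in a while */
static void
add_value_hybrid(ValueBuffer *buf, double value)
{
    if (buf->nvalues >= buf->target * REDUCE_FACTOR)
        reduce_values(buf, buf->target);    /* expensive, but infrequent */
    buf->values[buf->nvalues++] = value;
}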

FWIW, none of this is likely to be an issue in practice, because (a) 
tables usually don't have such strictly monotonic patterns, (b) people 
should stick to plain minmax for cases that do, and (c) regular tables 
tend to have much wider rows, so there are fewer values per range (and the 
other stuff is likely more expensive than building the BRIN index).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

