From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Msg-id: 2e903689-c518-f77d-551a-5e3e279d0838@enterprisedb.com
In response to: Re: WIP: BRIN multi-range indexes (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On 1/23/21 12:27 AM, John Naylor wrote:
> On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>  > [wip optimizations]
> 
>  > The pathological case (monotonic-asc) is now 4x faster, roughly equal to
>  > regular minmax index build. The opposite (monotonic-desc) is about 33%
>  > faster, roughly in line with btree.
> 
> Those numbers look good. I get similar results, shown below. I've read 
> 0007-8 briefly but not in depth.
> 
>  > There are a couple cases where it's actually a bit slower - those are
>  > the cases with very few distinct values per range. I believe this
>  > happens because in the batch mode the code does not check if the
>  > summary already contains the value - it simply adds it to the buffer,
>  > and the final reduction step ends up being more expensive.
> 
> I think if it's worst case a bit faster than btree and best case a bit 
> slower than traditional minmax, that's acceptable.
> 
>  > I believe there's some "compromise" between those two extremes, i.e. we
>  > should use a buffer that is neither too small nor too large, but
>  > something in between, so that the reduction happens once in a while but
>  > not too often (as with the original aggressive approach).
> 
> This sounds good also.
> 

Yeah, I agree.

I think the reason why some of the cases got a bit slower is that with 
the original approach (ranges being reduced fairly frequently, not just 
once at the end) we quickly built something that represented the whole 
range, so adding a new value was often a no-op - the add_value callback 
found the range already "includes" the new value, etc.

With the batch mode that's no longer true - we accumulate everything, so 
we have to sort it etc., which I guess may be fairly expensive thanks to 
calling the comparator functions. I wonder if this could be optimized a 
bit, e.g. by first "deduplicating" the values using memcmp() or so.
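
To illustrate what I have in mind - a minimal C sketch, not code from 
the patch: it assumes fixed-length values stored contiguously in a plain 
array, and buffer_add_value is a made-up name. The idea is a cheap 
bytewise check against the most recently buffered value, so runs of 
duplicates never reach the sort:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Skip buffering a value that is bytewise equal to the last buffered one.
 * For inputs with few distinct values per range this filters most
 * duplicates without invoking the (potentially expensive) datatype
 * comparator.
 */
static bool
buffer_add_value(char *buffer, size_t *nvalues, size_t value_len,
                 const char *value)
{
    if (*nvalues > 0 &&
        memcmp(buffer + (*nvalues - 1) * value_len, value, value_len) == 0)
        return false;            /* duplicate of the previous value, skip */

    memcpy(buffer + *nvalues * value_len, value, value_len);
    (*nvalues)++;
    return true;
}

int
main(void)
{
    int    values[] = {7, 7, 7, 9};
    char   buffer[4 * sizeof(int)];
    size_t nvalues = 0;

    for (size_t i = 0; i < 4; i++)
        buffer_add_value(buffer, &nvalues, sizeof(int),
                         (const char *) &values[i]);

    printf("buffered %zu of 4 values\n", nvalues);    /* buffered 2 of 4 */
    return 0;
}

Of course this only catches consecutive duplicates; the remaining ones 
would still be removed during the final sort.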

But ultimately, I think the right solution will be to limit the buffer 
size to something like 10x the target, and roll with that. Typically, 
increasing the buffer size from e.g. 100B to 1000B brings a much clearer 
improvement than increasing it from 1000B to 10000B. I'd bet this 
follows the same pattern.
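
As a sketch of that policy (all names and numbers illustrative, not from 
the patch) - reduce whenever the buffer grows past a fixed multiple of 
the target size, so the expensive sort-and-merge runs once in a while 
rather than on every insert or only once at the very end:

#include <stdio.h>

#define LIMIT_RATIO 10        /* the "10x the target" guess from above */

static int nreductions = 0;

/* stand-in for the real sort-and-merge reduction of the buffer */
static void
maybe_reduce(int *nbuffered, int target)
{
    if (*nbuffered >= LIMIT_RATIO * target)
    {
        *nbuffered = target;
        nreductions++;
    }
}

int
main(void)
{
    int nbuffered = 0;

    /* simulate adding 100000 values with a target summary size of 32 */
    for (int i = 0; i < 100000; i++)
    {
        nbuffered++;
        maybe_reduce(&nbuffered, 32);
    }

    /* a few hundred reductions, instead of one per insert */
    printf("%d reductions\n", nreductions);
    return 0;
}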

>  > FWIW, none of this is likely to be an issue in practice, because (a)
>  > tables usually don't have such strictly monotonic patterns, (b) people
>  > should stick to plain minmax for cases that do.
> 
> Still, it would be great if multi-minmax can be a drop-in replacement. I 
> know there was a sticking point of a distance function not being 
> available on all types, but I wonder if that can be remedied or worked 
> around somehow.
> 

Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in 
replacement for minmax (essentially, using these opclasses as the 
default ones, with the option to switch back to plain minmax). I'm not 
convinced we should do that, though. Imagine you have minmax indexes in 
your existing DB, it's working perfectly fine, and then we come and just 
silently change that during dump/restore. Is there some past example 
where we did something similar and it turned out to be OK?

As for the distance functions, I'm pretty sure there are data types 
without a "natural" distance - most strings, for example. We could 
probably invent something, but the question is how much we can rely on 
it working well enough in practice.

Of course, is minmax even the right index type for such data types? 
Strings are usually "labels" and not queried using range queries, 
although sometimes people encode stuff as strings (but then it's very 
unlikely we'd define the distance function well). So maybe for those 
types a hash / bloom opclass would be a better fit anyway.


But I do have an idea - maybe we can do without distances in those 
cases. Essentially, the primary issue of minmax indexes is outliers, so 
what if we simply sort the values, keep one range in the middle, and a 
number of single points on each tail?

Imagine we have N values and want to represent them by K values. We 
simply sort the N values, keep (K-2)/2 values on each tail as outliers, 
and use the remaining 2 values as a range for the values in between.

Example:

input: [1, 2, 100, 110, 111, ..., 120, ..., 130, 201, 202]

Given K = 6, we would keep 2 values on each tail, and a range for the rest:

   [1, 2, (100, 130), 201, 202]
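
A minimal C sketch of that idea on plain integers (the function names 
are made up for illustration, this is not code from the patch):

#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/*
 * Summarize n values into k: sort, keep (k-2)/2 exact values on each
 * tail as outliers, and collapse everything in between into a single
 * [min, max] range.
 */
static void
summarize(int *values, int n, int k)
{
    int tail = (k - 2) / 2;    /* exact points kept on each side */
    int i;

    qsort(values, n, sizeof(int), cmp_int);

    for (i = 0; i < tail; i++)
        printf("%d ", values[i]);

    printf("(%d, %d) ", values[tail], values[n - tail - 1]);

    for (i = n - tail; i < n; i++)
        printf("%d ", values[i]);

    printf("\n");
}

int
main(void)
{
    int input[] = {1, 2, 100, 110, 111, 120, 130, 201, 202};

    /* prints: 1 2 (100, 130) 201 202 */
    summarize(input, sizeof(input) / sizeof(input[0]), 6);
    return 0;
}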

Of course, this does not optimize for the same thing as when we have a 
distance function - in that case we try to minimize how much of the 
input domain the ranges "cover", something like

      sum(length(r) for r in ranges) / (max(ranges) - min(ranges))

which for the example above is 30 / 201 ~= 0.15 (the single points have 
length 0). But maybe it's good enough when there's no distance 
function ...


> And (c) regular tables
>  > tend to have much wider rows, so there are fewer values per range (so
>  > that other stuff is likely more expensive than building BRIN).
> 
> True. I'm still puzzled that it didn't help to use 8 pages per range, 
> but it's moot now.
> 

I'd bet that even with just 8 pages, there were quite a few values in 
the range - possibly hundreds per page. I haven't checked whether the 
patches help with smaller ranges, so maybe we should try that.

> 
> Here are some numbers (median of 3) with a similar scenario as before, 
> repeated here with some added details. I didn't bother with what you 
> call "unpatched":
> 
>                 btree   minmax   multi
> monotonic-asc  44.4    26.5     27.8
> mono-del-ins   38.7    24.6     30.4
> mono-10-asc    61.8    25.6     33.5
> 

Thanks. Those numbers seem reasonable.


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


