Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: Using BRIN indexes for sorted output
Date
Msg-id 1607e037-8c5b-1626-9f98-496f067b9eb5@enterprisedb.com
In response to Re: PATCH: Using BRIN indexes for sorted output  (Zhihong Yu <zyu@yugabyte.com>)
Responses Re: PATCH: Using BRIN indexes for sorted output  (Zhihong Yu <zyu@yugabyte.com>)
Re: PATCH: Using BRIN indexes for sorted output  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers

On 10/16/22 16:01, Zhihong Yu wrote:
> 
> 
> On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     On 10/15/22 14:33, Tomas Vondra wrote:
>     > Hi,
>     >
>     > ...
>     >
>     > There's a bunch of issues with this initial version of the patch,
>     > usually described in XXX comments in the relevant places.
>     >
>     > ...
> 
>     I forgot to mention one important issue in my list yesterday, and that's
>     memory consumption. The way the patch is coded now, the new BRIN support
>     function (brin_minmax_ranges) produces information about *all* ranges in
>     one go, which may be an issue. The worst case is a 32TB table, with 1-page
>     BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
>     structs, so this would require ~128GB of RAM. With the default 128-page
>     ranges, it'd still be ~1GB, which is quite a lot.
> 
>     We could have a discussion about what's the reasonable size of BRIN
>     ranges on such large tables (e.g. building a bitmap on 4 billion ranges
>     is going to be "not cheap" so this is likely pretty rare). But we should
>     not introduce new nodes that ignore work_mem, so we need a way to deal
>     with such cases somehow.
> 
>     The easiest solution likely is to check this while planning - we can
>     check the table size, calculate the number of BRIN ranges, and check
>     that the range info fits into work_mem, and just not create the path
>     when it gets too large. That's what we did for HashAgg, although that
>     decision was unreliable because estimating GROUP BY cardinality is hard.
> 
>     The wrinkle here is that counting just the range info (BrinRange struct)
>     does not include the values for by-reference types. We could use average
>     width - that's just an estimate, though.
> 
>     A more comprehensive solution seems to be to allow requesting chunks of
>     the BRIN ranges. So that we'd get "slices" of ranges and we'd process
>     those. So for example if you have 1000 ranges, and you can only handle
>     100 at a time, we'd do 10 loops, each requesting 100 ranges.
> 
>     This has another problem - we do care about "overlaps", and we can't
>     really know if the overlapping ranges will be in the same "slice"
>     easily. The chunks would be sorted (for example) by maxval. But there
>     can be a range with much higher maxval (thus in some future slice), but
>     very low minval (thus intersecting with ranges in the current slice).
> 
>     Imagine ranges with these minval/maxval values, sorted by maxval:
> 
>     [101,200]
>     [201,300]
>     [301,400]
>     [150,500]
> 
>     and let's say we can only process 2-range slices. So we'll get the first
>     two, but both of them intersect with the very last range.
> 
>     We could always include all the intersecting ranges into the slice, but
>     what if there are too many very "wide" ranges?
> 
>     So I think this will need to switch to an iterative communication with
>     the BRIN index - instead of asking "give me info about all the ranges",
>     we'll need a way to
> 
>       - request the next range (sorted by maxval)
>       - request the intersecting ranges one by one (sorted by minval)
> 
>     Of course, the BRIN side will have some of the same challenges with
>     tracking the info without breaking the work_mem limit, but I suppose it
>     can store the info into a tuplestore/tuplesort, and use that instead of
>     plain in-memory array. Alternatively, it could just return those, and
>     BrinSort would use that. OTOH it seems cleaner to have some sort of API,
>     especially if we want to support e.g. minmax-multi opclasses, that have
>     a more complicated concept of "intersection".
> 
> 
>     regards
> 
>     -- 
>     Tomas Vondra
>     EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com>
>     The Enterprise PostgreSQL Company
> 
> Hi,
> In your example involving [150,500], can this range be broken down into
> 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> That way, there is no intersection among the ranges.
> 

Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?
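
To make that concrete, this is roughly the information a single range
summary gives us (an illustration with plain C types, not the actual
BrinRange struct from the patch):

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * Illustration only: a minmax-style range summary. The value interval
     * [min_value, max_value] merely summarizes whatever happens to be
     * stored in the fixed block range [blkno_start, blkno_end]. Splitting
     * the value interval tells us nothing about how to split the block
     * range - we'd have to read the blocks to find out which values live
     * where, which defeats the purpose.
     */
    typedef struct ToyBrinRange
    {
        uint32_t blkno_start;    /* first heap block of the page range */
        uint32_t blkno_end;      /* last heap block of the page range */
        int64_t  min_value;      /* smallest value seen in those blocks */
        int64_t  max_value;      /* largest value seen in those blocks */
        bool     has_nulls;      /* the range may contain NULLs */
        bool     not_summarized; /* the range has no summary yet */
    } ToyBrinRange;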

Also, splitting a range into more ranges likely makes the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.

It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from the BRIN index? A lot of
that may be unnecessary if the range ends up being loaded as a
"non-intersecting" range.

Try to formulate the whole algorithm. Maybe I'm missing something.

The current algorithm is something like this (a toy C sketch of the main
loop follows the steps):

1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
   (if no more ranges, go to (9))
5. load all tuples from "spill" tuplestore, compare to maxval
6. load all tuples from not-summarized ranges (first range only)
   (into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
   (into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
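
To make the loop concrete, here's a toy, self-contained C simulation of
steps 2, 4, 5, 7 and 8, reusing the ranges from the example earlier in
the thread. Integers stand in for heap tuples, the "spill" array stands
in for the tuplestore, and none of the names match the actual patch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    #define NRANGES 4
    #define MAXTUP  64

    typedef struct
    {
        int  minval;
        int  maxval;
        int  ntuples;
        int  values[8];     /* toy "heap tuples" belonging to this range */
        bool loaded;
    } ToyRange;

    static int
    cmp_int(const void *a, const void *b)
    {
        return *(const int *) a - *(const int *) b;
    }

    static int
    cmp_by_maxval(const void *a, const void *b)
    {
        return ((const ToyRange *) a)->maxval - ((const ToyRange *) b)->maxval;
    }

    int
    main(void)
    {
        /* the ranges from the example above (NULLs and not-summarized
         * ranges are left out, so steps 3, 6 and 9 are omitted here) */
        ToyRange ranges[NRANGES] = {
            {101, 200, 3, {150, 101, 200}, false},
            {201, 300, 3, {250, 201, 300}, false},
            {301, 400, 3, {350, 301, 400}, false},
            {150, 500, 3, {500, 150, 320}, false},
        };
        int     spill[MAXTUP];
        int     nspill = 0;

        /* step 2: sort the ranges by maxval */
        qsort(ranges, NRANGES, sizeof(ToyRange), cmp_by_maxval);

        /* step 4: process the ranges in maxval order */
        for (int i = 0; i < NRANGES; i++)
        {
            int     maxval = ranges[i].maxval;
            int     batch[MAXTUP];
            int     nbatch = 0;
            int     nkeep = 0;

            /* step 5: previously spilled tuples that sort below the
             * current maxval can be emitted now, the rest stays spilled */
            for (int j = 0; j < nspill; j++)
            {
                if (spill[j] <= maxval)
                    batch[nbatch++] = spill[j];
                else
                    spill[nkeep++] = spill[j];
            }
            nspill = nkeep;

            /* step 7: load the current range and every not-yet-loaded
             * range intersecting it (minval <= current maxval); tuples
             * above maxval are spilled for a later iteration */
            for (int j = 0; j < NRANGES; j++)
            {
                if (ranges[j].loaded || ranges[j].minval > maxval)
                    continue;
                ranges[j].loaded = true;
                for (int k = 0; k < ranges[j].ntuples; k++)
                {
                    if (ranges[j].values[k] <= maxval)
                        batch[nbatch++] = ranges[j].values[k];
                    else
                        spill[nspill++] = ranges[j].values[k];
                }
            }

            /* step 8: sort and "output" the batch; every unloaded range
             * has minval > maxval, so nothing can still sort below these */
            qsort(batch, nbatch, sizeof(int), cmp_int);
            for (int j = 0; j < nbatch; j++)
                printf("%d ", batch[j]);
        }

        /* prints: 101 150 150 200 201 250 300 301 320 350 400 500 */
        printf("\n");
        return 0;
    }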

For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.

> bq. can store the info into a tuplestore/tuplesort
> 
> Wouldn't this involve disk accesses which may reduce the effectiveness
> of BRIN sort ?

Yes, it might. But the question is whether the result is still faster
than alternative plans (e.g. seqscan+sort), and those are likely to do
even more I/O.

Moreover, for "regular" cases this shouldn't be a significant issue,
because the range info will fit into work_mem and so there'll be no I/O. But
it'll handle those extreme cases gracefully.
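
For illustration, this is how the stock tuplestore API caps memory use -
it keeps everything in memory up to the given limit and only spills to a
temp file beyond that. The brinsort_* names are made up for the sketch;
tuplestore_begin_heap/tuplestore_putvalues are the existing PostgreSQL
calls:

    #include "postgres.h"
    #include "miscadmin.h"           /* declares work_mem (in kB) */
    #include "utils/tuplestore.h"

    /*
     * Sketch only, not patch code: stash per-range info into a tuplestore
     * capped at work_mem. Nothing is written to disk unless the cap is
     * exceeded, so the "regular" cases stay entirely in memory.
     */
    static Tuplestorestate *range_store;

    static void
    brinsort_init_range_store(void)
    {
        /* forward-scan only, not cross-transaction, limit of work_mem kB */
        range_store = tuplestore_begin_heap(false, false, work_mem);
    }

    static void
    brinsort_store_range(TupleDesc tupdesc, Datum *values, bool *isnull)
    {
        /* forms a tuple from values/isnull and appends it to the store */
        tuplestore_putvalues(range_store, tupdesc, values, isnull);
    }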


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


