PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

Hi,

There have been a couple discussions about using BRIN indexes for
sorting - in fact this was mentioned even in the "Improving Indexing
Performance" unconference session this year (don't remember by whom).
But I haven't seen any patches, so here's one.

The idea is that we can use information about ranges to split the table
into smaller parts that can be sorted in smaller chunks. For example if
you have a tiny 2MB table with two ranges, with values in [0,100] and
[101,200] intervals, then it's clear we can sort the first range, output
tuples, and then sort/output the second range.
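
To make this concrete, here is a tiny standalone sketch (plain C, just
an illustration of the idea, not code from the patch): with two ranges
whose value intervals do not overlap, sorting and emitting one range at
a time produces globally sorted output without ever sorting the whole
table at once.

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

int main(void)
{
    /* two BRIN-style ranges with disjoint [min,max] value intervals */
    int range1[] = {42, 7, 99, 13};         /* values in [0,100] */
    int range2[] = {150, 101, 200, 177};    /* values in [101,200] */
    int *ranges[] = {range1, range2};

    /* sort and emit one range at a time - the output is globally
     * sorted because the value intervals do not overlap */
    for (int r = 0; r < 2; r++)
    {
        qsort(ranges[r], 4, sizeof(int), cmp_int);
        for (int i = 0; i < 4; i++)
            printf("%d\n", ranges[r][i]);
    }
    return 0;
}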

The attached patch builds "BRIN Sort" paths/plans, closely resembling
index scans, only for BRIN indexes. And this special type of index scan
does what was mentioned above - incrementally sorts the data. It's a bit
more complicated because of overlapping ranges, ASC/DESC, NULL etc.

This is disabled by default, using a GUC enable_brinsort (you may need
to tweak other GUCs to disable parallel plans etc.).

A trivial example, demonstrating the benefits:

  create table t (a int) with (fillfactor = 10);
  insert into t select i from generate_series(1,10000000) s(i);
  create index t_a_idx on t using brin (a);


First, a simple LIMIT query:

explain (analyze, costs off) select * from t order by a limit 10;

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1879.768..1879.770 rows=10 loops=1)
   ->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t
             (actual time=0.007..1353.110 rows=10000000 loops=1)
 Planning Time: 0.083 ms
 Execution Time: 1879.786 ms
(7 rows)

And the same query with enable_brinsort = on:

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1.217..1.219 rows=10 loops=1)
   ->  BRIN Sort using t_a_idx on t
       (actual time=1.216..1.217 rows=10 loops=1)
         Sort Key: a
 Planning Time: 0.084 ms
 Execution Time: 1.234 ms
(5 rows)

That's a pretty nice improvement - of course, this is thanks to having
a perfectly sequential table, and the difference can be made almost
arbitrary by making the table smaller/larger. Similarly, if the table
gets less sequential (making the ranges overlap), the BRIN plan will be
more expensive. Feel free to experiment with other data sets.

However, it's not just LIMIT queries that can improve - consider a sort
of the whole table:

test=# explain (analyze, costs off) select * from t order by a;

                               QUERY PLAN
-------------------------------------------------------------------------
 Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 117528kB
   ->  Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 3766.825 ms
(6 rows)

With enable_brinsort = on:

test=# explain (analyze, costs off) select * from t order by a;
                               QUERY PLAN
-------------------------------------------------------------------------
 BRIN Sort using t_a_idx on t
   (actual time=1.210..2670.875 rows=10000000 loops=1)
   Sort Key: a
 Planning Time: 0.073 ms
 Execution Time: 2939.324 ms
(4 rows)

Right - not a huge difference, but still a nice 25% speedup, mostly due
to not having to spill data to disk and sorting smaller amounts of data.

There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.

1) The paths are created in build_index_paths() because that's what
creates index scans (which the new path resembles). But that is expected
to produce IndexPath, not BrinSortPath, so it's not quite correct.
Should be somewhere "higher" I guess.

2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
FIRST/LAST do not really matter for them. The patch just generates
paths for all 4 combinations (or tries to). Maybe there's a better way.

3) I'm not quite sure the separation of responsibilities between
opfamily and opclass is optimal. I added a new amproc, but maybe this
should be split differently. At the moment only minmax indexes have
this, but adding this to minmax-multi should be trivial.

4) The state changes in nodeBrinSort are a bit confusing. It works, but
may need cleanup and refactoring. Ideas welcome.

5) The costing is essentially just plain cost_index. I have some ideas
about BRIN costing in general, which I'll post in a separate thread (as
it's not specific to this patch).

6) At the moment this only picks one of the index keys, specified in the
ORDER BY clause. I think we can generalize this to multiple keys, but
thinking about multi-key ranges was a bit too much for me. The good
thing is this nicely combines with IncrementalSort.

7) Only plain index keys are supported for the ORDER BY keys, no
expressions. Should not be hard to fix, though.

8) Parallel version is not supported, but I think it shouldn't be
possible. Just make the leader build the range info, and then let the
workers to acquire/sort ranges and merge them by Gather Merge.

9) I was also thinking about leveraging other indexes to quickly
eliminate ranges that need to be sorted. The node does evaluate the filter,
of course, but only after reading the tuple from the range. But imagine
we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
that case we might skip many ranges entirely. Essentially like a bitmap
index scan does, except that building the bitmap incrementally with BRIN
is trivial - you can quickly check if a particular range matches or not.
With other indexes (e.g. btree) you essentially need to evaluate the
filter completely, and only then can you look at the bitmap. That seems
rather against the idea of this patch, which is about low startup cost.
Of course, the condition might be very selective, but then you probably
can just fetch the matching tuples and do a Sort.
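
For illustration, here's the kind of check that makes this cheap for
BRIN minmax - a standalone sketch with made-up names, not the patch's
actual API. A range can be skipped when its [minval,maxval] summary
cannot possibly satisfy the filter:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical summary of one BRIN minmax range */
typedef struct
{
    int minval;
    int maxval;
} RangeSummary;

/* Can a range contain rows matching "col BETWEEN lo AND hi"? If not,
 * the whole range can be skipped without reading any of its tuples. */
static bool range_may_match(const RangeSummary *r, int lo, int hi)
{
    return !(r->maxval < lo || r->minval > hi);
}

int main(void)
{
    RangeSummary ranges[] = {{0, 100}, {101, 200}, {201, 300}};

    /* filter: col BETWEEN 150 AND 180 */
    for (int i = 0; i < 3; i++)
        printf("range %d: %s\n", i,
               range_may_match(&ranges[i], 150, 180) ? "read" : "skip");
    return 0;
}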


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> ...
>
> 8) Parallel version is not supported, but I think it shouldn't be
> possible. Just make the leader build the range info, and then let the
> workers to acquire/sort ranges and merge them by Gather Merge.
>
> ...

Hi,
I am still going over the patch.

Minor: for #8, I guess you meant `it should be possible` .

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/15/22 15:46, Zhihong Yu wrote:
>...
>     8) Parallel version is not supported, but I think it shouldn't be
>     possible. Just make the leader build the range info, and then let the
>     workers to acquire/sort ranges and merge them by Gather Merge.
> ...
> Hi,
> I am still going over the patch.
> 
> Minor: for #8, I guess you meant `it should be possible` .
> 

Yes, I meant to say it should be possible. Sorry for the confusion.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Yes, I meant to say it should be possible. Sorry for the confusion.

Hi,

For brin_minmax_ranges, looking at the assignment to gottuple and reading gottuple, it seems the variable gottuple can be omitted - we can check tup directly.

+   /* Maybe mark the range as processed. */
+   range->processed |= mark_processed;

`Maybe` can be dropped.

For brinsort_load_tuples(), do we need to check for interrupts inside the loop ?
Similar question for subsequent methods involving loops, such as brinsort_load_unsummarized_ranges.

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/15/22 14:33, Tomas Vondra wrote:
> Hi,
> 
> ...
> 
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
> 
> ...

I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is 32TB table, with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges, it's still be ~1GB, which is quite a lot.
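
A quick back-of-the-envelope check of those numbers (a standalone
sketch; the ~32B struct size is from above, the 8kB block size is the
PostgreSQL default):

#include <stdio.h>

int main(void)
{
    const double table_bytes = 32.0 * 1024 * 1024 * 1024 * 1024; /* 32TB */
    const double block_bytes = 8192;        /* default BLCKSZ */
    const double struct_bytes = 32;         /* ~sizeof(BrinRange) */
    const int pages_per_range[] = {1, 128}; /* worst case vs. default */

    for (int i = 0; i < 2; i++)
    {
        double nranges = table_bytes / block_bytes / pages_per_range[i];

        printf("pages_per_range = %3d: %.0f ranges, ~%.0f GB of range info\n",
               pages_per_range[i], nranges,
               nranges * struct_bytes / (1024.0 * 1024 * 1024));
    }
    return 0;
}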

We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.

The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.

The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.

A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.

This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).

Imagine ranges with these minval/maxval values, sorted by maxval:

[101,200]
[201,300]
[301,400]
[150,500]

and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.

We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?

So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to

  - request the next range (sorted by maxval)
  - request the intersecting ranges one by one (sorted by minval)

Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
a plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> ...
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> ...

Hi,
In your example involving [150,500], can this range be broken down into 4 ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses which may reduce the effectiveness of BRIN sort ?

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 03:36, Zhihong Yu wrote:
> ...
> Hi,
> 
> For brin_minmax_ranges, looking at the assignment to gottuple and
> reading gottuple, it seems variable gottuple can be omitted - we can
> check tup directly.
> 
> +   /* Maybe mark the range as processed. */
> +   range->processed |= mark_processed;
> 
> `Maybe` can be dropped.
> 

No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.

> For brinsort_load_tuples(), do we need to check for interrupts inside
> the loop ?
> Similar question for subsequent methods involving loops, such
> as brinsort_load_unsummarized_ranges.
> 

We could/should, although most of the loops should be very short.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:01, Zhihong Yu wrote:
> ...
> Hi,
> In your example involving [150,500], can this range be broken down into
> 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> That way, there is no intersection among the ranges.
> 

Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?

Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.

It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.

Try to formulate the whole algorithm. Maybe I'm missing something.

The current algorithm is something like this:

1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
   (if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
6. load all tuples from no-summarized ranges (first range only)
   (into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
   (into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done

For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
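
To make the steps concrete, here's a compact simulation of steps (4),
(5), (7) and (8) - a sketch only, with integers standing in for tuples,
plain arrays standing in for the tuplesort/tuplestore, no NULLs and no
unsummarized ranges; the intersecting-range lookup is a simple linear
scan here, while the patch keeps a second, minval-sorted array:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int minval;
    int maxval;
    int vals[4];
    int nvals;
    int loaded;
} Range;

static int cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

static int cmp_maxval(const void *a, const void *b)
{
    return ((const Range *) a)->maxval - ((const Range *) b)->maxval;
}

int main(void)
{
    Range ranges[3] = {
        {0, 1000, {900, 10, 1200, 0}, 4, 0},    /* overlaps both others */
        {500, 1500, {1400, 600, 777, 0}, 3, 0},
        {1001, 2000, {2000, 1001, 0, 0}, 2, 0},
    };
    int spill[16];              /* stands in for the "spill" tuplestore */
    int sortbuf[16];            /* stands in for the tuplesort */
    int nspill = 0;

    qsort(ranges, 3, sizeof(Range), cmp_maxval);

    for (int i = 0; i < 3; i++) /* step 4: next range by maxval */
    {
        int maxval = ranges[i].maxval;
        int nsort = 0;
        int oldspill = nspill;

        /* step 5: re-check previously spilled tuples against new maxval */
        nspill = 0;
        for (int k = 0; k < oldspill; k++)
        {
            if (spill[k] <= maxval)
                sortbuf[nsort++] = spill[k];
            else
                spill[nspill++] = spill[k];
        }

        /* step 7: load this range plus all intersecting, not-yet-loaded
         * ranges (minval below maxval), routing each tuple by maxval */
        for (int j = 0; j < 3; j++)
        {
            if (ranges[j].loaded || ranges[j].minval > maxval)
                continue;
            ranges[j].loaded = 1;
            for (int k = 0; k < ranges[j].nvals; k++)
            {
                if (ranges[j].vals[k] <= maxval)
                    sortbuf[nsort++] = ranges[j].vals[k];
                else
                    spill[nspill++] = ranges[j].vals[k];
            }
        }

        /* step 8: sort and output everything up to maxval */
        qsort(sortbuf, nsort, sizeof(int), cmp_int);
        for (int k = 0; k < nsort; k++)
            printf("%d\n", sortbuf[k]);
    }
    return 0;
}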

> bq. can store the info into a tuplestore/tuplesort
> 
> Wouldn't this involve disk accesses which may reduce the effectiveness
> of BRIN sort ?

Yes, it might. But the question is whether the result is still faster
than alternative plans (e.g. seqscan+sort), and those are likely to do
even more I/O.

Moreover, for "regular" cases this shouldn't be a significant issue,
because the stuff will fit into work_mem and so there'll be no I/O. But
it'll handle those extreme cases gracefully.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tom Lane

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption.

TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.

The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables.  It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.  It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.

            regards, tom lane



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


> ...
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval
> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>    (if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval
> 6. load all tuples from no-summarized ranges (first range only)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

Hi,
Thanks for the quick reply.

I don't have good answer w.r.t. splitting the page range [0,127] now. Let me think more about it.

The 10-step flow (subject to changes down the road) should either be
given in the description of the patch or written as a comment inside the
code. This would help people grasp the concept much faster.

BTW splill seems to be a typo - I assume you meant spill.

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:41, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> I forgot to mention one important issue in my list yesterday, and that's
>> memory consumption.
> 
> TBH, this is all looking like vastly more complexity than benefit.
> It's going to be impossible to produce a reliable cost estimate
> given all the uncertainty, and I fear that will end in picking
> BRIN-based sorting when it's not actually a good choice.
> 

Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.

As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.

I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?

I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:

1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
   ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
   sample rows vs. ranges)

I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for this patch).
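
As an illustration of item 4, a tiny standalone sketch that computes
the average overlap count for a set of ranges - O(N^2) for clarity;
actual ANALYZE-time code would want something smarter:

#include <stdio.h>

typedef struct
{
    int minval;
    int maxval;
} Range;

int main(void)
{
    /* toy ranges; the real stats would come from the BRIN index */
    Range r[] = {{0, 100}, {50, 150}, {100, 200}, {400, 500}};
    int n = 4;
    int overlaps = 0;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && r[i].minval <= r[j].maxval
                       && r[j].minval <= r[i].maxval)
                overlaps++;

    printf("average overlaps per range: %.2f\n", (double) overlaps / n);
    return 0;
}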

> The examples you showed initially are cherry-picked to demonstrate
> the best possible case, which I doubt has much to do with typical
> real-world tables.  It would be good to see what happens with
> not-perfectly-sequential data before even deciding this is worth
> spending more effort on.

Yes, the example was a trivial "happy case" example. Obviously, the
performance degrades as the data becomes more random (with wider
ranges), forcing the BRIN Sort to read / sort more tuples.

But let's see an example with less correlated data, say, like this:

create table t (a int) with (fillfactor = 10);

insert into t select i + 10000 * random()
                from generate_series(1,10000000) s(i);

With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:

1) select * from t order by a;

  seqscan+sort: 4437 ms
  brinsort: 4233 ms

2) select * from t order by a limit 10;

  seqscan+sort: 1859 ms
  brinsort: 4 ms

If you increase the random factor from 10000 to 100000 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.

Only at 5000000, where we pretty much read 1/2 the table because the
ranges intersect, do we get the same timing as the seqscan (for the LIMIT
query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.

Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).

> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.
> 

I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:42, Zhihong Yu wrote:
> ...
> 
> I don't have good answer w.r.t. splitting the page range [0,127] now.
> Let me think more about it.
> 

Sure, no problem.

> The 10 step flow (subject to changes down the road) should be either
> given in the description of the patch or, written as comment inside the
> code.
> This would help people grasp the concept much faster.

True. I'll add it to the next version of the patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

On Sun, 16 Oct 2022 at 16:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.

I think the comparison against a full table scan is appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.

A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.

I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".

If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

First of all, it's really great to see that this is being worked on.

On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Try to formulate the whole algorithm. Maybe I'm missing something.
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval

Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.

> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>    (if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval

Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.

> 6. load all tuples from no-summarized ranges (first range only)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):

1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
   - Remove Rs' from Rs
   - store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
  - Remove Ts' from Ts
  - Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
  - Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
     This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
   - remove Rs' from Rs
   - assign Rs' min value to max_sorted
   - store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
   - remove Rs' from Rs
   - store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7

This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.

The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.

Kind regards,

Matthias van de Meent

PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that lead to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.

[0] https://www.postgresql.org/message-id/CAEze2Wi9%3DBay_%3DrTf8Z6WPgZ5V0tDOayszQJJO%3DR_9aaHvr%2BTg%40mail.gmail.com



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 22:17, Matthias van de Meent wrote:
> First of all, it's really great to see that this is being worked on.
> 
> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>
>> The current algorithm is something like this:
>>
>> 1. request info about ranges from the BRIN opclass
>> 2. sort them by maxval and minval
> 
> Why sort on maxval and minval? That seems wasteful for effectively all
> sorts, where range sort on minval should suffice: If you find a range
> that starts at 100 in a list of ranges sorted at minval, you've
> processed all values <100. You can't make a similar comparison when
> that range is sorted on maxvals.
> 

Because that allows us to identify overlapping ranges quickly.

Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval) that nevertheless overlaps with
the range we're currently processing?

Consider these ranges sorted by maxval

  range #1  [0,100]
  range #2  [101,200]
  range #3  [150,250]
  ...
  range #1000000 [190,1000000000]

Processing range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 overlaps it, so we need to load its tuples too. But
there may be other ranges (at an arbitrary distance) also overlapping.

So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.

The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care about new matches since
the previous range, so we can remember the index into the array and
start from it. And we can stop once the minval exceeds the maxval of the
range from the first step, because we'll only sort tuples up to that
point.
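
A standalone sketch of that two-array trick (illustration only, using
the ranges from the example above, with integers instead of Datums):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int minval;
    int maxval;
} Range;

static int cmp_maxval(const void *a, const void *b)
{
    return ((const Range *) a)->maxval - ((const Range *) b)->maxval;
}

static int cmp_minval(const void *a, const void *b)
{
    return ((const Range *) a)->minval - ((const Range *) b)->minval;
}

int main(void)
{
    Range by_max[] = {{0, 100}, {101, 200}, {150, 250}, {190, 1000}};
    Range by_min[] = {{0, 100}, {101, 200}, {150, 250}, {190, 1000}};
    int pos = 0;    /* remembered position in the minval-sorted array */

    qsort(by_max, 4, sizeof(Range), cmp_maxval);
    qsort(by_min, 4, sizeof(Range), cmp_minval);

    for (int i = 0; i < 4; i++)
    {
        printf("current range [%d,%d], newly matched:",
               by_max[i].minval, by_max[i].maxval);

        /* only inspect minvals we haven't consumed yet, and stop once
         * the minval exceeds the current maxval - each range is thus
         * matched exactly once (including the current range itself) */
        while (pos < 4 && by_min[pos].minval <= by_max[i].maxval)
        {
            printf(" [%d,%d]", by_min[pos].minval, by_min[pos].maxval);
            pos++;
        }
        printf("\n");
    }
    return 0;
}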

>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>> 4. read the next range (by maxval) into tuplesort
>>    (if no more ranges, go to (9))
>> 5. load all tuples from "splill" tuplestore, compare to maxval
> 
> Instead of this, shouldn't an update to tuplesort that allows for
> restarting the sort be better than this? Moving tuples that we've
> accepted into BRINsort state but not yet returned around seems like a
> waste of cycles, and I can't think of a reason why it can't work.
> 

I don't understand what you mean by "update to tuplesort". Can you
elaborate?

The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval

   [0,1000]
   [500,1500]
   [1001,2000]
   ...

We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).

So why sort these tuples at all? Imagine an imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplesort, we're now sorting 11x the amount of data.

Or maybe I just don't understand what you mean.


>> 6. load all tuples from no-summarized ranges (first range only)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 7. load all intersecting ranges (with minval < current maxval)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 8. sort the tuplesort, output all tuples, then back to (4)
>> 9. NULLS LAST: read all ranges that might have NULLs => output
>> 10. done
>>
>> For "DESC" ordering the process is almost the same, except that we swap
>> minval/maxval in most places.
> 
> When I was thinking about this feature at the PgCon unconference, I
> was thinking about it more along the lines of the following system
> (for ORDER BY col ASC NULLS FIRST):
> 
> 1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
> [has_nulls, min ASC]
> 2. scan info about ranges from BRIN, store them in Rs.
> 3. Finalize the sorting of Rs.
> 4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
> column ordering.
> 5. load all tuples from no-summarized ranges into Ts'
> 6. while Rs has a block range Rs' with has_nulls:
>    - Remove Rs' from Rs
>    - store the tuples of Rs' range in Ts.
> We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
> 7. Finalize the Ts sorted set.
> 8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
>   - Remove Ts' from Ts
>   - Yield Ts'
> Now, all tuples up to and including max_sorted are yielded.
> 9. If there are no more ranges in Rs:
>   - Yield all remaining tuples from Ts, then return.
> 10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
>      This is different from Tomas' implementation, as he loads the
> tuples into a new tuplestore.
> 11. get the next item from Rs: Rs'
>    - remove Rs' from Rs
>    - assign Rs' min value to max_sorted
>    - store the tuples of Rs' range in Ts

I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.

> 12. while the next item Rs' from Rs has a min value of max_sorted:
>    - remove Rs' from Rs
>    - store the tuples of Rs' range in Ts
> 13. The 'new' value from the next item from Rs is stored in
> max_sorted. If no such item exists, max_sorted is assigned a sentinel
> value (+INF)
> 14. Go to Step 7
> 
> This set of operations requires a restarting tuplesort for Ts, but I
> don't think that would result in many API changes for tuplesort. It
> reduces the overhead of large overlapping ranges, as it doesn't need
> to copy all tuples that have been read from disk but have not yet been
> returned.
> 
> The maximum cost of this tuplesort would be the cost of sorting a
> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> extra compare per tuple and range that are needed to determine whether
> the range or tuple should be extracted from the tuplesort. The minimum
> cost would be the cost of sorting all BRIN ranges, plus sorting all
> tuples in one of the index's ranges.
> 

I'm not a tuplesort expert, but my assumption is that it's better to
sort smaller amounts of rows - which is why the patch sorts only the
rows it knows it can actually output.

> Kind regards,
> 
> Matthias van de Meent
> 
> PS. Are you still planning on giving the HOT optimization for BRIN a
> second try? I'm fairly confident that my patch at [0] would fix the
> issue that lead to the revert of that feature, but it introduced ABI
> changes after the feature freeze and thus it didn't get in. The patch
> might need some polishing, but I think it shouldn't take too much
> extra effort to get into PG16.
> 

Thanks for reminding me, I'll take a look before the next CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 10/16/22 22:17, Matthias van de Meent wrote:
> > On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> Try to formulate the whole algorithm. Maybe I'm missing something.
> >>
> >> The current algorithm is something like this:
> >>
> >> 1. request info about ranges from the BRIN opclass
> >> 2. sort them by maxval and minval
> >
> > Why sort on maxval and minval? That seems wasteful for effectively all
> > sorts, where range sort on minval should suffice: If you find a range
> > that starts at 100 in a list of ranges sorted at minval, you've
> > processed all values <100. You can't make a similar comparison when
> > that range is sorted on maxvals.
>
> Because that allows to identify overlapping ranges quickly.
>
> Imagine you have the ranges sorted by maxval, which allows you to add
> tuples in small increments. But how do you know there's not a range
> (possibly with arbitrarily high maxval), that however overlaps with the
> range we're currently processing?

Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.

> >> 3. NULLS FIRST: read all ranges that might have NULLs => output
> >> 4. read the next range (by maxval) into tuplesort
> >>    (if no more ranges, go to (9))
> >> 5. load all tuples from "splill" tuplestore, compare to maxval
> >
> > Instead of this, shouldn't an update to tuplesort that allows for
> > restarting the sort be better than this? Moving tuples that we've
> > accepted into BRINsort state but not yet returned around seems like a
> > waste of cycles, and I can't think of a reason why it can't work.
> >
>
> I don't understand what you mean by "update to tuplesort". Can you
> elaborate?

Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.

For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
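
To sketch the workflow I have in mind in Python (purely an
illustration - the class and method names here are invented, this is
not the tuplesort API):

  import heapq

  class RestartableSort:
      # A toy stand-in for a tuplesort that can be "un-finalized":
      # more tuples can be loaded after some have been extracted. A
      # binary heap keeps the already-loaded data ordered, so adding
      # more data is cheap.
      def __init__(self):
          self._heap = []

      def load(self, value):
          heapq.heappush(self._heap, value)

      def extract_below(self, bound):
          # Emit values strictly below 'bound'; the rest stay loaded.
          while self._heap and self._heap[0] < bound:
              yield heapq.heappop(self._heap)

  s = RestartableSort()
  for v in (0, 999, 1000):           # tuples from range [0,1000]
      s.load(v)
  print(list(s.extract_below(50)))   # [0]
  s.load(50)                         # "restarted": load more tuples
  print(list(s.extract_below(998)))  # [50]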

>  The point of spilling them into a tuplestore is to make the sort cheaper
> by not sorting tuples that can't possibly be produced, because the value
> exceeds the current maxval. Consider ranges sorted by maxval
> [...]
>
> Or maybe I just don't understand what you mean.

If we sort the ranges by minval like this:

1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]

Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the number of tuples in the tuplesort to the minimum needed to output
any specific value.
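
As a self-contained Python sketch of this scheme (using a heap in
place of the tuplesort; the ranges and values are made up for
illustration):

  import heapq

  # Ranges as (minval, maxval, values), already sorted by minval.
  ranges = [
      (0, 1000, [3, 700, 999]),
      (0, 999, [1, 500]),
      (50, 998, [50, 60]),
      (100, 997, [100, 200]),
      (100, 996, [150, 996]),
      (150, 995, [151, 400]),
  ]

  heap = []
  for i, (minval, maxval, values) in enumerate(ranges):
      for v in values:
          heapq.heappush(heap, v)
      # Everything below the next range's minval is safe to emit.
      next_minval = ranges[i + 1][0] if i + 1 < len(ranges) else None
      while heap and (next_minval is None or heap[0] < next_minval):
          print(heapq.heappop(heap))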

If the ranges are sorted and loaded by maxval, like your algorithm expects:

1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]

We need to load all ranges into the sort before it can start
emitting any tuples, as all ranges overlap with the first range.

> > [algo]
>
> I don't think this works, because we may get a range (Rs') with very
> high maxval (thus read very late from Rs), but with very low minval.
> AFAICS max_sorted must never go back, and this breaks it.

max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.

There is a small issue in my algorithm where I use <= for yielding
values where it should be <, and the initialization of max_value to
NULL is then incorrect, but apart from that I don't think there are any
issues with the base algorithm.

> > The maximum cost of this tuplesort would be the cost of sorting a
> > seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> > extra compare per tuple and range that are needed to determine whether
> > the range or tuple should be extracted from the tuplesort. The minimum
> > cost would be the cost of sorting all BRIN ranges, plus sorting all
> > tuples in one of the index's ranges.
> >
>
> I'm not a tuplesort expert, but my assumption is that it's better to
> sort smaller numbers of rows - which is why the patch sorts only the
> rows it knows it can actually output.

I see that the two main differences between our designs are in
answering these questions:

- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?

For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.

For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.

Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 10/17/22 16:00, Matthias van de Meent wrote:
> On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 10/16/22 22:17, Matthias van de Meent wrote:
>>> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>>>
>>>> The current algorithm is something like this:
>>>>
>>>> 1. request info about ranges from the BRIN opclass
>>>> 2. sort them by maxval and minval
>>>
>>> Why sort on maxval and minval? That seems wasteful for effectively all
>>> sorts, where range sort on minval should suffice: If you find a range
>>> that starts at 100 in a list of ranges sorted on minval, you've
>>> processed all values <100. You can't make a similar comparison when
>>> that range is sorted on maxvals.
>>
>> Because that allows us to identify overlapping ranges quickly.
>>
>> Imagine you have the ranges sorted by maxval, which allows you to add
>> tuples in small increments. But how do you know there's not a range
>> (possibly with an arbitrarily high maxval) that overlaps with the
>> range we're currently processing?
> 
> Why do we need to identify overlapping ranges specifically? If you
> sort by minval, it becomes obvious that any subsequent range cannot
> contain values < the minval of the next range in the list, allowing
> you to emit any values less than the next, unprocessed, minmax range's
> minval.
> 

D'oh! I think you're right, it should be possible to do this with only
a sort by minval. And it might actually be a better way to do it.

I think I chose the "maxval" ordering because looking at the current
range and using its maxval as the threshold seemed natural. But it
leads to a bunch of complexity with the intersecting ranges, and I
never reconsidered this choice. Silly me.

>>>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>>>> 4. read the next range (by maxval) into tuplesort
>>>>    (if no more ranges, go to (9))
>>>> 5. load all tuples from "spill" tuplestore, compare to maxval
>>>
>>> Instead of this, wouldn't an update to tuplesort that allows for
>>> restarting the sort be better? Moving around tuples that we've
>>> accepted into the BRINsort state but not yet returned seems like a
>>> waste of cycles, and I can't think of a reason why it can't work.
>>>
>>
>> I don't understand what you mean by "update to tuplesort". Can you
>> elaborate?
> 
> Tuplesort currently only allows the following workflow: you load
> tuples, then call finalize, then extract tuples. There is currently no
> way to add tuples once you've started extracting them.
> 
> For my design to work efficiently or without hacking into the
> internals of tuplesort, we'd need a way to restart or 'un-finalize'
> the tuplesort so that it returns to the 'load tuples' phase. Because
> all data of the previous iteration is already sorted, adding more data
> shouldn't be too expensive.
> 

Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).

FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.

I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although maybe that could be derived from the tuplesort size of the
last sort.

Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with a separate tuplestore. There's
enough complexity in the patch already, I think. The only thing we'll
need with the minval ordering is the ability to "peek ahead" to the next
minval (which is going to be the threshold used to route values either
to the tuplesort or the tuplestore).
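
Roughly like the following Python sketch (names invented for
illustration - 'batch' stands in for the per-step tuplesort and
'spill' for the tuplestore):

  # Ranges sorted by minval: (minval, values in the range). Each step
  # peeks ahead at the next minval, which becomes the threshold
  # routing rows either into the sort or into the spill tuplestore.
  ranges = [(0, [5, 250]), (10, [15, 40]), (200, [210, 220])]

  spill = []                      # rows above the threshold
  out = []
  for i, (minval, values) in enumerate(ranges):
      threshold = ranges[i + 1][0] if i + 1 < len(ranges) else None
      batch = []                  # rows that are safe to output
      pending, spill = spill + values, []
      for row in pending:
          if threshold is None or row < threshold:
              batch.append(row)
          else:
              spill.append(row)   # re-spill for a later step
      batch.sort()                # only sort rows we can actually emit
      out.extend(batch)
  print(out)                      # [5, 15, 40, 210, 220, 250]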

>>  The point of spilling them into a tuplestore is to make the sort cheaper
>> by not sorting tuples that can't possibly be produced, because the value
>> exceeds the current maxval. Consider ranges sorted by maxval
>> [...]
>>
>> Or maybe I just don't understand what you mean.
> 
> If we sort the ranges by minval like this:
> 
> 1. [0,1000]
> 2. [0,999]
> 3. [50,998]
> 4. [100,997]
> 5. [100,996]
> 6. [150,995]
> 
> Then we can load and sort the values for range 1 and 2, and emit all
> values up to (not including) 50 - the minval of the next,
> not-yet-loaded range in the ordered list of ranges. Then add the
> values from range 3 to the set of tuples we have yet to output; sort;
> and then emit values up to 100 (range 4's minval), etc. This reduces
> the number of tuples in the tuplesort to the minimum needed to output
> any specific value.
> 
> If the ranges are sorted and loaded by maxval, like your algorithm expects:
> 
> 1. [150,995]
> 2. [100,996]
> 3. [100,997]
> 4. [50,998]
> 5. [0,999]
> 6. [0,1000]
> 
> We need to load all ranges into the sort before it can start
> emitting any tuples, as all ranges overlap with the first range.
> 

Right, thanks - I get this now.

>>> [algo]
>>
>> I don't think this works, because we may get a range (Rs') with very
>> high maxval (thus read very late from Rs), but with very low minval.
>> AFAICS max_sorted must never go back, and this breaks it.
> 
> max_sorted cannot go back, because it is the min value of the next
> range in the list of ranges sorted by min value; see also above.
> 
> There is a small issue in my algorithm where I use <= for yielding
> values where it should be <, and the initialization of max_value to
> NULL is then incorrect, but apart from that I don't think there are any
> issues with the base algorithm.
> 
>>> The maximum cost of this tuplesort would be the cost of sorting a
>>> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
>>> extra compare per tuple and range that are needed to determine whether
>>> the range or tuple should be extracted from the tuplesort. The minimum
>>> cost would be the cost of sorting all BRIN ranges, plus sorting all
>>> tuples in one of the index's ranges.
>>>
>>
>> I'm not a tuplesort expert, but my assumption is that it's better to
>> sort smaller numbers of rows - which is why the patch sorts only the
>> rows it knows it can actually output.
> 
> I see that the two main differences between our designs are in
> answering these questions:
> 
> - How do we select table ranges for processing?
> - How do we handle tuples that we know we can't output yet?
> 
> For the first, I think the differences are explained above. The main
> drawback of your selection algorithm seems to be that your algorithm's
> worst-case is "all ranges overlap", whereas my algorithm's worst case
> is "all ranges start at the same value", which is only a subset of
> your worst case.
> 

Right, those are very good points.

> For the second, the difference is whether we choose to sort the tuples
> that are out-of-bounds, but are already in the working set due to
> being returned from a range overlapping with the current bound.
> My algorithm tries to reduce the overhead of increasing the sort
> boundaries by also sorting the out-of-bound data, allowing for
> O(n-less-than-newbound) overhead of extending the bounds (total
> complexity for whole sort O(n-out-of-bound)), and O(n log n)
> processing of all tuples during insertion.
> Your algorithm - if I understand it correctly - seems to optimize for
> faster results within the current bound by not sorting the
> out-of-bounds data with O(1) processing when out-of-bounds, at the
> cost of needing O(n-out-of-bound-tuples) operations when the maxval /
> max_sorted boundary is increased, with a complexity of O(n*m) for an
> average of n out-of-bound tuples and m bound updates.
> 

Right. I wonder if these are actually complementary approaches, and we
could/should pick between them depending on how many rows we expect to
consume.

My focus was LIMIT queries, so I favored the approach with the lowest
startup cost. I haven't quite planned for this to work so well even in
full-sort cases. That kinda surprised me (I wonder if the very large
tuplesorts - compared to work_mem - would hurt this, though).

> Lastly, there is the small difference in how the ranges are extracted
> from BRIN: I prefer and mention an iterative approach where the tuples
> are extracted from the index and loaded into a tuplesort in some
> iterative fashion (which spills to disk and does not need all tuples
> to reside in memory), whereas your current approach was mentioned as
> (paraphrasing) 'allocate all this data in one chunk and hope that
> there is enough memory available'. I think this is not so much a
> disagreement in best approach, but mostly a case of what could be made
> to work; so in later updates I hope we'll see improvements here.
> 

Right. I think I mentioned this in my post [1], where I also envisioned
some sort of iterative approach. And I think you're right that the
approach with ordering by minval is naturally more suitable, because it
just consumes a single sequence of ranges.


regards

[1]
https://www.postgresql.org/message-id/1a7c2ff5-a855-64e9-0272-1f9947f8a558%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

here's an updated/reworked version of the patch, on top of the "BRIN
statistics" patch as 0001 (because some of the stuff is useful, but we
can ignore this part in this thread).

Warning: I realized the new node is somewhat broken when it comes to
projection and matching the indexed column, most likely because the
targetlists are wired/processed incorrectly or something like that. So
when experimenting with this, just index the first column of the table
and don't do anything requiring a projection. I'll get this fixed, but
I've been focusing on the other stuff. I'm not particularly familiar
with this tlist/project stuff, so any help is welcome.


The main change in this version is the adoption of multiple ideas
suggested by Matthias in his earlier responses.

Firstly, this changes how the index opclass passes information to the
executor node. Instead of using a plain array, we now use a tuplesort.
This addresses the memory consumption issues with large number of
ranges, and it also simplifies the sorting etc. which is now handled by
the tuplesort. The support procedure simply fills a tuplesort and then
hands it over to the caller (more or less).


Secondly, instead of ordering the ranges by maxval, this orders them by
minval (as suggested by Matthias), which greatly simplifies the code
because we don't need to detect overlapping ranges etc.

More precisely, the ranges are sorted to get this ordering:

- not yet summarized ranges
- ranges sorted by (minval, blkno)
- all-nulls ranges

This particular ordering is beneficial for the algorithm, which does two
passes over the ranges. For the NULLS LAST case (i.e. the default), we
do this:

- produce tuples with non-NULL values, ordered by the value
- produce tuples with NULL values (arbitrary order)

Each of these phases does a separate pass over the ranges (I'll get to
that in a minute), and the ordering is tailored to this.

Note: For DESC we'd sort by maxval, and for NULLS FIRST the phases would
happen in the opposite order, but those are details. Let's assume ASC
ordering with NULLS LAST, unless stated otherwise.

The idea here is that the not-yet-summarized ranges always need to be
processed, both when producing NULLs and non-NULL values, which happens
as two separate passes over the ranges.

The all-null ranges don't need to be processed during the non-NULL pass,
and we can terminate this pass early once we hit the first null-only
range. So placing them last helps with this.

The regular ranges are ordered by minval, as dictated by the algorithm
(which is now described in the nodeBrinSort.c comment), but we also sort
them by blkno to make the access a bit more sequential. This only
matters for ranges with the same minval, which is probably rare, but the
extra sort key is cheap, so why not.
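
As a tiny illustration of that ordering, the sort key in Python (the
field names here are invented, not taken from the patch):

  ranges = [
      {"blkno": 0,  "summarized": True,  "all_nulls": False, "minval": 100},
      {"blkno": 8,  "summarized": False, "all_nulls": False, "minval": None},
      {"blkno": 16, "summarized": True,  "all_nulls": True,  "minval": None},
      {"blkno": 24, "summarized": True,  "all_nulls": False, "minval": 100},
      {"blkno": 32, "summarized": True,  "all_nulls": False, "minval": 50},
  ]

  def range_sort_key(r):
      # Not-yet-summarized ranges first, then regular ranges ordered
      # by (minval, blkno), all-nulls ranges last.
      if not r["summarized"]:
          group = 0
      elif r["all_nulls"]:
          group = 2
      else:
          group = 1
      return (group, r["minval"] if group == 1 else 0, r["blkno"])

  for r in sorted(ranges, key=range_sort_key):
      print(r["blkno"], r["minval"])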

I mentioned we do two separate passes - one for non-NULL values, one for
NULL values. That may be somewhat inefficient, because in extreme cases
we might end up scanning the whole table twice (imagine BRIN ranges
where each range has both regular values and NULLs). It might be
possible to do all of this in a single pass, at least in some cases -
for example while scanning ranges, we might stash NULL rows into a
tuplestore, so that the second pass is not needed. That assumes there
are not too many such rows (otherwise we might need to write and then
read many rows, outweighing the cost of just doing two passes). This
should be possible to estimate/cost fairly well, I think, and the
comment in nodeBrinSort actually presents some ideas about this.
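
For the NULLS LAST case, that single-pass idea might look like this
Python sketch (a hedged illustration only; 'stash' plays the role of
the extra tuplestore):

  def nulls_last_single_pass(rows):
      # While producing non-NULL values (assumed to arrive in order
      # from the machinery above), stash NULL rows aside so no second
      # pass over the table is needed; emit the stash at the end.
      stash = []
      for row in rows:
          if row is None:
              stash.append(row)
          else:
              yield row
      yield from stash

  print(list(nulls_last_single_pass([1, None, 2, None, 3])))
  # [1, 2, 3, None, None]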

And we can't do that for the NULLS FIRST case, because if we stash the
non-NULL rows somewhere, we won't be able to do the "incremental" sort,
i.e. we might just do regular Sort right away. So I've kept this simple
approach with two passes for now.


This still uses the approach with spilling tuples to a tuplestore, and
only sorting rows that we know are safe to output. I still think this is
a good approach, for the reasons I explained before, but changing this
is not hard so we can experiment.

There's however a related question - how quickly should we increment the
minval value, serving as a watermark? One option is to go to the next
distinct minval value - but that may result in an excessive number of
tiny sorts, because the number of ranges and rows between the old and
new minval values tends to be small. Another negative consequence is
that this may cause a lot of spilling (and re-spilling), because we only
consume a tiny number of rows from the tuplestore after incrementing the
watermark.

Or we can do larger steps, skipping some of the minval values, so that
more rows qualify into the sort. Of course, too large a step means we'll
exceed work_mem and switch to an on-disk sort, which we probably don't
want. Also, this may be the wrong thing to do for LIMIT queries that
only need a couple of rows, where a tiny sort is fine (because we won't
do too many of them).

Patch 0004 introduces a new GUC called brinsort_watermark_step, which
can be used to experiment with this. By default it's set to '1', which
means we simply progress to the next minval value.
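
To make the step semantics concrete, a minimal Python sketch (not
actual patch code):

  minvals = [0, 10, 20, 30, 40, 50]  # distinct minvals, ascending
  step = 2                           # brinsort_watermark_step analogue

  # step=1 makes every distinct minval a watermark; larger steps skip
  # some minvals so more rows qualify into each sort.
  watermarks = minvals[step::step] + [None]  # None = no upper bound
  print(watermarks)                          # [20, 40, None]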

Then 0005 tries to customize this based on statistics - we estimate the
number of rows each minval increment is expected to "add", and then pick
a step value such that we don't overflow work_mem. This happens in
create_brinsort_plan, and the comment explains the main weakness - the
way the number of rows is estimated is somewhat naive, as it just
divides reltuples by the number of ranges. But I have a couple of ideas
about what statistics we might collect, explained in 0001 in the comment
at brin_minmax_stats.
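
A sketch of that estimate, with work_mem expressed as a row budget for
simplicity (all names here are invented):

  def estimate_watermark_step(reltuples, nranges, work_mem_rows):
      # Naive estimate: assume rows are spread uniformly over the
      # ranges, so each minval increment adds about reltuples/nranges
      # rows; pick the largest step expected to fit into the budget.
      rows_per_step = max(1, reltuples // nranges)
      return max(1, work_mem_rows // rows_per_step)

  print(estimate_watermark_step(10_000_000, 5_000, 50_000))  # -> 25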

But there's another option - we can tune the step based on past sorts.
If we see the sorts spilling to disk, maybe try smaller steps. Patch
0006 implements a very simple variant of this. There are a couple of
ideas about how it might be improved, mentioned in the comment at
brinsort_adjust_watermark_step.
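
In the spirit of that feedback loop, a minimal sketch (not the actual
heuristic from the patch):

  def adjust_watermark_step(step, last_sort_spilled):
      # Back off when the previous sort spilled to disk, otherwise
      # grow the step slowly to amortize the per-sort overhead.
      if last_sort_spilled:
          return max(1, step // 2)
      return step + 1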

There's also patch 0003, which extends the EXPLAIN output with counters
tracking the number of sorts, counts of on-disk/in-memory sorts, space
used, number of rows sorted/spilled, and so on. This is useful when
analyzing e.g. the effect of higher/lower watermark steps, discussed in
the preceding paragraphs.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
> Of course, if there are e.g. BTREE indexes this is going to be slower,
> but people are unlikely to have both index types on the same column.

On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
> I don't think it's all that unfair. How likely is it to have both a BRIN
> and btree index on the same column? And even if you do have such indexes

Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b).  I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).

> +    /* There must not be any TID scan in progress yet. */
> +    Assert(node->ss.ss_currentScanDesc == NULL);
> +
> +    /* Initialize the TID range scan, for the provided block range. */
> +    if (node->ss.ss_currentScanDesc == NULL)
> +    {

Why is this conditional on the condition that was just Assert()ed?

>  
> +void
> +cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
> +           bool partial_path)

It'd be nice to refactor the existing code to avoid this part being so
duplicative.

> +     * In some situations (particularly with OR'd index conditions) we may
> +     * have scan_clauses that are not equal to, but are logically implied by,
> +     * the index quals; so we also try a predicate_implied_by() check to see

Isn't that somewhat expensive?

If that's known, then it'd be good to say that in the documentation.

> +    {
> +        {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
> +            gettext_noop("Enables the planner's use of BRIN sort plans."),
> +            NULL,
> +            GUC_EXPLAIN
> +        },
> +        &enable_brinsort,
> +        false,

I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit that new code paths.

Also, note that indxpath.c had the var initialized to true.

> +            attno = (i + 1);
> +       nranges = (nblocks / pagesPerRange);
> +                               node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;

I'm curious why you have parentheses in these places?

> +#ifndef NODEBrinSort_H
> +#define NODEBrinSort_H

NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.

This needed a bunch of work to pass any of the regression tests - even
with the feature set to off.

 . meson.build needs the same change as the corresponding ./Makefile.
 . guc missing from postgresql.conf.sample
 . brin_validate.c is missing support for the opr function.
   I gather you're planning on changing this part (?) but this allows to
   pass tests for now.
 . mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
   https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
 . Uninitialized catalog attribute.
 . Some typos in your other patches: "heuristics heuristics".  ste.
   lest (least).

-- 
Justin

Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 10/24/22 06:32, Justin Pryzby wrote:
> On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
>> Of course, if there are e.g. BTREE indexes this is going to be slower,
>> but people are unlikely to have both index types on the same column.
> 
> On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
>> I don't think it's all that unfair. How likely is it to have both a BRIN
>> and btree index on the same column? And even if you do have such indexes
> 
> Note that we (at my work) use unique, btree indexes on multiple columns
> for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
> plus a separate set of indexes on all tables, used for searching:
> BRIN(a) and BTREE(b).  I'd hope that the costing is accurate enough to
> prefer the btree index for searching the most-recent table, if that's
> what's faster (for example, if columns b and c are specified).
> 

Well, the costing is very crude at the moment - it's pretty much just a
copy of the existing BRIN costing. And the cost is likely going to
increase, because brinsort needs to do a regular BRIN bitmap scan (more
or less) and then also a sort (which is an extra cost, of course). So if
it works now, I don't see why brinsort would break it. Moreover, if you
don't have ORDER BY in the query, I don't see why we would create a
brinsort at all.

But if you could test this once the costing gets improved, that'd be
very valuable.

>> +    /* There must not be any TID scan in progress yet. */
>> +    Assert(node->ss.ss_currentScanDesc == NULL);
>> +
>> +    /* Initialize the TID range scan, for the provided block range. */
>> +    if (node->ss.ss_currentScanDesc == NULL)
>> +    {
> 
> Why is this conditional on the condition that was just Assert()ed?
> 

Yeah, that's a mistake, due to how the code evolved.

>>  
>> +void
>> +cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
>> +           bool partial_path)
> 
> It'd be nice to refactor the existing code to avoid this part being so
> duplicative.
> 
>> +     * In some situations (particularly with OR'd index conditions) we may
>> +     * have scan_clauses that are not equal to, but are logically implied by,
>> +     * the index quals; so we also try a predicate_implied_by() check to see
> 
> Isn't that somewhat expensive?
> 
> If that's known, then it'd be good to say that in the documentation.
> 

Some of this is probably a residue from create_indexscan_path and may
not be needed for this new node.

>> +    {
>> +        {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
>> +            gettext_noop("Enables the planner's use of BRIN sort plans."),
>> +            NULL,
>> +            GUC_EXPLAIN
>> +        },
>> +        &enable_brinsort,
>> +        false,
> 
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit the new code paths.
> 
> Also, note that indxpath.c had the var initialized to true.
> 

Good point.

>> +            attno = (i + 1);
>> +       nranges = (nblocks / pagesPerRange);
>> +                               node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
> 
> I'm curious why you have parentheses in these places?
> 

Not sure, it seemed more readable when writing the code I guess.

>> +#ifndef NODEBrinSort_H
>> +#define NODEBrinSort_H
> 
> NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
> But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
> 

Yeah, stupid search/replace on the indexscan code, which was used as a
starting point.

> This needed a bunch of work to pass any of the regression tests -
> even with the feature set to off.
> 
>  . meson.build needs the same change as the corresponding ./Makefile.
>  . guc missing from postgresql.conf.sample
>  . brin_validate.c is missing support for the opr function.
>    I gather you're planning on changing this part (?) but this allows to
>    pass tests for now.
>  . mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
>    https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
>  . Uninitialized catalog attribute.
>  . Some typos in your other patches: "heuristics heuristics".  ste.
>    lest (least).
> 

Thanks, I'll get this fixed. I've posted the patch as a PoC to showcase
it and gather some feedback; I should have mentioned it's incomplete in
these ways.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Greg Stark
Date:
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to, last I looked -- not sure if it went out
with the tapesort ...

For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.

For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.
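
Something like this toy Python sketch using heapq (the ranges are made
up for illustration):

  import heapq

  # Ranges sorted by minvalue: (minvalue, tuples in the range).
  ranges = [(0, [7, 900]), (100, [100, 150]), (400, [400, 410])]

  heap, out = [], []
  for minvalue, tuples in ranges:
      # Pop everything that sorts before the new range's minvalue...
      while heap and heap[0] < minvalue:
          out.append(heapq.heappop(heap))
      # ...then insert the new range's tuples.
      for t in tuples:
          heapq.heappush(heap, t)
  while heap:
      out.append(heapq.heappop(heap))
  print(out)  # [7, 100, 150, 400, 410, 900]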

I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 11/16/22 22:52, Greg Stark wrote:
> Fwiw tuplesort does do something like what you want for the top-k
> case. At least it used to last I looked -- not sure if it went out
> with the tapesort ...
>
> For top-k it inserts new tuples into the heap data structure and then
> pops the top element out of the heap. That keeps a fixed number of
> elements in the heap. It's always inserting and removing at the same
> time. I don't think it would be very hard to add a tuplesort interface
> to access that behaviour.
> 


Bounded sorts are still there, implemented using a heap (which is what
you're talking about, I think). I actually looked at it some time ago,
and it didn't look like a particularly good match for the general case
(without an explicit LIMIT). Bounded sorts require specifying the number
of tuples up front, and then discard the remaining tuples. But you don't
know how many tuples you'll actually find before the next minval - you
have to keep them all.

Maybe we could feed the tuples into a (sorted) heap incrementally, and
consume tuples until the next minval value. I'm not against exploring
that idea, but it certainly requires more work than just slapping some
interface on the existing code.

> For something like BRIN you would sort the ranges by minvalue then
> insert all the tuples for each range. Before inserting tuples for a
> new range you would first pop out all the tuples that are < the
> minvalue for the new range.
> 

Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.

> I'm not sure how you handle degenerate BRIN indexes that behave
> terribly. Like, if many BRIN ranges covered the entire key range.
> Perhaps there would be a clever way to spill the overflow and switch
> to quicksort for the spilled tuples without wasting lots of work
> already done and without being too inefficient.

In two ways:

1) Don't have such a BRIN index - if it has many degraded ranges, it's
bound to perform poorly even for WHERE conditions. We've lived with this
until now, and I don't think this makes the issue any worse.

2) Improving statistics for BRIN indexes - until now the BRIN costing
has been very crude; we have almost no information about how wide the
ranges are, how much they overlap, etc. The 0001 part (discussed in a
thread [1]) aims to provide much better statistics. Yes, the costing
still doesn't use that information very much.


regards


[1] https://commitfest.postgresql.org/40/3952/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Andres Freund
Date:
Hi,

On 2022-11-17 00:52:35 +0100, Tomas Vondra wrote:
> Well, yeah. That's pretty much exactly what the last version of this
> patch (from October 23) does.

That version unfortunately doesn't build successfully:
https://cirrus-ci.com/task/5108789846736896

[03:02:48.641] Duplicate OIDs detected:
[03:02:48.641] 9979
[03:02:48.641] 9980
[03:02:48.641] found 2 duplicate OID(s) in catalog data

Greetings,

Andres Freund



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

Here's an updated version of this patch series. There's still plenty of
stuff to improve, but it fixes a number of issues I mentioned earlier.

The two most important changes are:


1) handling of projections

Until now queries with projection might have failed, due to not using
the right slot, bogus var references and so on. The code was somewhat
confused because the new node is somewhere in between a scan node and a
sort (or more precisely, it combines both).

I believe this version handles all of this correctly - the code that
initializes the slots/projection info etc. needs serious cleanup, but
should be correct.


2) handling of expressions

The other improvement is handling of expressions - if you have a BRIN
index on an expression, this should now work too. This also includes
correct handling of collations (which the previous patches ignored).

Similarly to the projections, I believe the code is correct but needs
cleanup. In particular, I haven't paid close attention to memory
management, so there might be memory leaks when evaluating expressions.


The last two parts of the patch series (0009 and 0010) are about
testing. 0009 adds a regular regression test with various combinations
(projections, expressions, single- vs. multi-column indexes, ...).

0010 introduces a python script that randomly generates data sets,
indexes and queries. I use it to both test random combinations and to
evaluate performance. I don't expect it to be committed etc. - it's
included only to keep it versioned with the rest of the patch.


I did some basic benchmarking using the 0010 part, to evaluate how this
works for various cases. The script varies a number of parameters:

- number of rows
- table fill factor
- randomness (how much the ranges overlap)
- pages per range
- limit / offset for queries
- ...

The script forces both a "seqscan" and "brinsort" plan, and collects
timing info.

The results are encouraging, I think. Attached are two charts, plotting
speedup vs. fraction of tuples the query has to sort.

  speedup = (seqscan timing / brinsort timing)

  fraction = (limit + offset) / (table rows)

A query with "limit 1 offset 0" has fraction ~0.0, a query that scans
everything (perhaps because it has no LIMIT/OFFSET) has ~1.0.

For speedup, 1.0 means "no change" while values above 1.0 mean the
query gets faster. Both plots have a log-scale y-axis.

brinsort-all-data.gif shows results for all queries. There's significant
speedup for small values of fraction (i.e. queries with limit, requiring
few rows). This is expected, as this is pretty much the primary use case
for the patch.

The other thing is that the benefits quickly diminish - for fractions
close to 0.0 the potential benefits are huge, but once you cross ~10% of
the table, it's within 10x, for ~25% less than 5x etc.

OTOH there are also a fair number of queries that got slower - those are
the data points below 1.0. I've looked into many of them, and there are
a couple reasons why that can happen:

1) random data set - When the ranges are very wide, BRIN Sort has to
read most of the data, and it ends up sorting almost as many rows as the
sequential scan. But it's more expensive, especially when combined with
the following points.

    Note: I don't think this is an issue in practice, because BRIN
    indexes would suck quite badly on such data, so no one is going to
    create such indexes in the first place.

2) tiny ranges - By default ranges are 1MB, but it's possible to make
them much smaller. But BRIN Sort has to read/sort all ranges, and that
gets more expensive with the number of ranges.

    Note: I'm not sure there's a way around this, although Matthias
    had some interesting ideas about how to keep the ranges sorted.
    But ultimately, I think this is fine, as long as it's costed
    correctly. For fractions close to 0.0 this is still going to be
    a huge win.

3) non-adaptive (and low) watermark_step - The number of sorts makes a
huge difference - in an extreme case we could add the ranges one by one,
with a sort after each. For small limit/offset that works, but for more
rows it's quite pointless.

    Note: The adaptive step (adjusted during execution) works great, and
    the script sets explicit values mostly to trigger more corner cases.
    Also, I wonder if we should force higher values as we progress
    through the table - we still don't want to exceed work_mem, but the
    larger the fraction we scan, the more we should prefer larger
    "batches".


The second "filter" chart (brinsort-filtered-data.gif) shows results
filtered to only show runs with:

   - pages_per_range >= 32
   - randomness <= 5% (i.e. each range covers about 5% of domain)
   - adaptive step (= -1)

And IMO this looks much better - there are almost no slower queries,
except for a bunch of queries that scan all the data.


So, what are the next steps for this patch:

1) cleanup of the existing code (mentioned above)

2) improvement of the costing - This is probably the critical part,
because we need a costing that allows us to identify the queries that
are likely to be faster/slower. I believe this is doable - either now or
using the new opclass-specific stats proposed in a separate patch (and
kept in part 0001 for completeness).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote:
> Rebased version of the patches, fixing only minor conflicts.

Per cfbot, the patch fails on 32-bit builds.
+ERROR:  count mismatch: 0 != 1000

And causes warnings in mingw cross-compile.

On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote:
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit the new code
> paths.
> 
> Also, note that indxpath.c had the var initialized to true.

In your patch, the amstats guc is still being set to false during
startup by the guc machinery.  And the tests crash everywhere if it's
set to on:

TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File:
"../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519

>  . Some typos in your other patches: "heuristics heuristics".  ste.
>    lest (least).

These are still present.

-- 
Justin



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/16/23 17:10, Justin Pryzby wrote:
> On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote:
>> Rebased version of the patches, fixing only minor conflicts.
> 
> Per cfbot, the patch fails on 32 bit builds.
> +ERROR:  count mismatch: 0 != 1000
> 
> And causes warnings in mingw cross-compile.
> 

There was a silly mistake in trying to store block numbers as bigint
when sorting the ranges, instead of uint32. That happens to work on
64-bit systems, but on 32-bit systems it produces bogus block numbers.

The attached should fix that - it passes on 32-bit arm, even with
valgrind and all that.

> On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote:
>> I think new GUCs should be enabled during patch development.
>> Maybe in a separate 0002 patch "for CI only not for commit".
>> That way "make check" at least has a chance to hit the new code
>> paths.
>>
>> Also, note that indxpath.c had the var initialized to true.
> 
> In your patch, the amstats guc is still being set to false during
> startup by the guc machinery.  And the tests crash everywhere if it's
> set to on:
> 
> TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File:
> "../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519
>

Right, that was a silly thinko in building the stats, and I found a
couple more issues nearby. Should be fixed in the attached version.

>>  . Some typos in your other patches: "heuristics heuristics".  ste.
>>    lest (least).
> 
> These are still present.

Thanks for reminding me, those should be fixed too now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
cfbot complained there's one more place triggering a compiler warning on
32-bit systems, so here's a version fixing that.

I've also added a copy of the regression tests but using the indexam
stats added in 0001. This is just a copy of the already existing
regression tests, just with enable_indexam_stats=true - this should
catch some of the issues that went mostly undetected in the earlier
patch versions.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
Are (any of) these patches targetting v16 ?

typos:
ar we - we are?
morestly - mostly
interstect - intersect

> + * XXX We don't sort the bins, so just do binary sort. For large number of values
> + * this might be an issue, for small number of values a linear search is fine.

"binary sort" is wrong?

> + * only half of there ranges, thus 1/2. This can be extended to randomly

half of *these* ranges ?

> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas.vondra@postgresql.org>
> Date: Mon, 17 Oct 2022 18:39:28 +0200
> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics

I think the idea can also apply to btree - currently, correlation is
considered to be a property of a column, but not an index.  But that
fails to distinguish between a freshly built index, and an index with
out of order heap references, which can cause an index scan to be a lot
more expensive.

I implemented per-index correlation stats way back when:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com

See also:
https://www.postgresql.org/message-id/14438.1512499811@sss.pgh.pa.us

With my old test case:

Index scan is 3x slower than bitmap scan, but index scan is costed as
being cheaper:

postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) (actual time=0.107..222.737 rows=128914 loops=1)

postgres=# SET enable_indexscan =no;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual time=16.830..69.860 rows=128914 loops=1)

If it's clustered, then the index scan is almost twice as fast, and the
costs are more consistent with the associated time.  The planner assumes
that the indexes are freshly built...

postgres=# CLUSTER t USING t_i_idx ;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) (actual time=0.084..117.549 rows=128914 loops=1)

-- 
Justin



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/18/23 19:51, Justin Pryzby wrote:
> Are (any of) these patches targetting v16 ?
> 

Probably not. Maybe if there's more feedback / scrutiny, but I'm not
sure one commitfest is enough to polish the patch (especially
considering I haven't done much on the costing yet).

> typos:
> ar we - we are?
> morestly - mostly
> interstect - intersect
> 
>> + * XXX We don't sort the bins, so just do binary sort. For large number of values
>> + * this might be an issue, for small number of values a linear search is fine.
> 
> "binary sort" is wrong?
> 
>> + * only half of there ranges, thus 1/2. This can be extended to randomly
> 
> half of *these* ranges ?
> 

Thanks, I'll fix those.

>> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas.vondra@postgresql.org>
>> Date: Mon, 17 Oct 2022 18:39:28 +0200
>> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics
> 
> I think the idea can also apply to btree - currently, correlation is
> considered to be a property of a column, but not an index.  But that
> fails to distinguish between a freshly built index, and an index with
> out of order heap references, which can cause an index scan to be a lot
> more expensive.
> 
> I implemented per-index correlation stats way back when:
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com
> 
> See also:
> https://www.postgresql.org/message-id/14438.1512499811@sss.pgh.pa.us
> 
> With my old test case:
> 
> Index scan is 3x slower than bitmap scan, but index scan is costed as
> being cheaper:
> 
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) (actual time=0.107..222.737 rows=128914 loops=1)
> 
> postgres=# SET enable_indexscan =no;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual time=16.830..69.860 rows=128914 loops=1)
> 
> If it's clustered, then the index scan is almost twice as fast, and the
> costs are more consistent with the associated time.  The planner assumes
> that the indexes are freshly built...
> 
> postgres=# CLUSTER t USING t_i_idx ;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) (actual time=0.084..117.549 rows=128914 loops=1)
> 

Yeah, the concept of indexam statistics certainly applies to other index
types, and for btree we might collect information about correlation etc.
I haven't looked at the 2017 patch, but it seems reasonable.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> cfbot complained there's one more place triggering a compiler warning on
> 32-bit systems, so here's a version fixing that.
>
> I've also added a copy of the regression tests but using the indexam
> stats added in 0001. This is just a copy of the already existing
> regression tests, just with enable_indexam_stats=true - this should
> catch some of the issues that went mostly undetected in the earlier
> patch versions.


Comments on 0001, mostly comments and patch design:

> +range_minval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_maxval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_values_cmp(const void *a, const void *b, void *arg)

Can the arguments of these functions be modified into the types they
are expected to receive? If not, could you add a comment on why that's
not possible?
> I don't think it's good practice to "just" use void* for arguments to
cast them to more concrete types in the next lines without an
immediate explanation.

> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).

Could you please expand on this? We do have GIST support for ranges, too.

> + * brin_minmax_stats
> + *        Calculate custom statistics for a BRIN minmax index.
> + *
> + * At the moment this calculates:
> + *
> + *  - number of summarized/not-summarized and all/has nulls ranges

I think statistics gathering of an index should be done at the AM
level, not attribute level. The docs currently suggest that the user
builds one BRIN index with 16 columns instead of 16 BRIN indexes with
one column each, which would make the statistics gathering use 16x
more IO if the scanned data cannot be reused.

It is possible to build BRIN indexes on more than one column with more
than one opclass family like `USING brin (id int8_minmax_ops, id
int8_bloom_ops)`. This would mean various duplicate statistics fields,
no?
It seems to me that it's more useful to do the null- and n_summarized
on the index level instead of duplicating that inside the opclass.

> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)

I am not familiar with the frequency of max-sized relations, but this
would go into an infinite loop for pagesPerRange values >1 for
max-sized relations due to BlockNumber wraparound. I think there
should be some additional overflow checks here.

> +/*
> + * get_attstaindexam
> + *
> + *      Given the table and attribute number of a column, get the index AM
> + *      statistics.  Return NULL if no data available.
> + *

Shouldn't this be "given the index and attribute number" instead of
"the table and attribute number"?
I think we need to be compatible with indexes on expression here, so
that we don't fail to create (or use) statistics for an index `USING
brin ( (int8range(my_min_column, my_max_column, '[]'))
range_inclusion_ops)` when we implement stats for range_inclusion_ops.

> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering

Does this function still exist?

.

I'm planning on reviewing the other patches, and noticed that a lot of
the patches are marked WIP. Could you share a status on those, because
currently that status is unknown: Are these patches you don't plan on
including, or are these patches only (or mostly) included for
debugging?


Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/23/23 15:19, Matthias van de Meent wrote:
> On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> cfbot complained there's one more place triggering a compiler warning on
>> 32-bit systems, so here's a version fixing that.
>>
>> I've also added a copy of the regression tests but using the indexam
>> stats added in 0001. This is just a copy of the already existing
>> regression tests, just with enable_indexam_stats=true - this should
>> catch some of the issues that went mostly undetected in the earlier
>> patch versions.
> 
> 
> Comments on 0001, mostly comments and patch design:
> 
>> +range_minval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_maxval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_values_cmp(const void *a, const void *b, void *arg)
> 
> Can the arguments of these functions be modified into the types they
> are expected to receive? If not, could you add a comment on why that's
> not possible?
> > I don't think it's good practice to "just" use void* for arguments to
> cast them to more concrete types in the next lines without an
> immediate explanation.
> 

The reason is that that's what qsort() expects. If you change that to
actual data types, you'll get compile-time warnings. I agree this may
need better comments, though.

>> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> 
> Could you please expand on this? We do have GIST support for ranges, too.
> 

Expand in what way? This is meant to be AM-specific, so if GiST wants to
collect some additional stats, it's free to do so - perhaps some of the
ideas from the stats collected for BRIN would be applicable, but it's
also bound to the index structure.

>> + * brin_minmax_stats
>> + *        Calculate custom statistics for a BRIN minmax index.
>> + *
>> + * At the moment this calculates:
>> + *
>> + *  - number of summarized/not-summarized and all/has nulls ranges
> 
> I think statistics gathering of an index should be done at the AM
> level, not attribute level. The docs currently suggest that the user
> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> one column each, which would make the statistics gathering use 16x
> more IO if the scanned data cannot be reused.
> 

Why? The row sample is collected only once and used for building all the
index AM stats - it doesn't really matter if we analyze 16 single-column
indexes or 1 index with 16 columns. Yes, we'll need to scan more
indexes, but the with 16 columns the summaries will be larger so the
total amount of I/O will be almost the same I think.

Or maybe I don't understand what I/O you're talking about?

> It is possible to build BRIN indexes on more than one column with more
> than one opclass family like `USING brin (id int8_minmax_ops, id
> int8_bloom_ops)`. This would mean various duplicate statistics fields,
> no?
> It seems to me that it's more useful to do the null- and n_summarized
> on the index level instead of duplicating that inside the opclass.

I don't think it's worth it. The amount of data this would save is tiny,
and it'd only apply to cases where the index includes the same attribute
multiple times, and that's pretty rare I think. I don't think it's worth
the extra complexity.

> 
>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.
> 

Good point, but that's a pre-existing issue. We do this same loop in a
number of places.

>> +/*
>> + * get_attstaindexam
>> + *
>> + *      Given the table and attribute number of a column, get the index AM
>> + *      statistics.  Return NULL if no data available.
>> + *
> 
> Shouldn't this be "given the index and attribute number" instead of
> "the table and attribute number"?
> I think we need to be compatible with indexes on expression here, so
> that we don't fail to create (or use) statistics for an index `USING
> brin ( (int8range(my_min_column, my_max_column, '[]'))
> range_inclusion_ops)` when we implement stats for range_inclusion_ops.
> 
>> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
> 
> Does this function still exist?
> 

Yes, but only in 0003 - it's a "brute-force" algorithm used to
cross-check the result of the optimized algorithm in 0001. You're right
it should not be referenced in the comment.

> 
> I'm planning on reviewing the other patches, and noticed that a lot of
> the patches are marked WIP. Could you share a status on those, because
> currently that status is unknown: Are these patches you don't plan on
> including, or are these patches only (or mostly) included for
> debugging?
> 

I think the WIP label is a bit misleading, I used it mostly to mark
patches that are not meant to be committed on their own. A quick overview:

0002-wip-introduce-debug_brin_stats-20230218-2.patch
0003-wip-introduce-debug_brin_cross_check-20230218-2.patch

  Not meant for commit, used mostly during development/debugging, by
  adding debug logging and/or cross-check to validate the results.

  I think it's fine to ignore those during review.


0005-wip-brinsort-explain-stats-20230218-2.patch

  This needs more work. It does what it's meant to do (show info about
  the brinsort node), but I'm not very happy with the formatting etc.

  Review and suggestions welcome.


0006-wip-multiple-watermark-steps-20230218-2.patch
0007-wip-adjust-watermark-step-20230218-2.patch
0008-wip-adaptive-watermark-step-20230218-2.patch

  Ultimately this should be merged into 0004 (which does the actual
  brinsort), I think 0006 is the way to go, but I kept all the patches
  to show the incremental evolution (and allow comparisons).

  0006 adds a GUC to specify how many ranges to add in each round,
  0007 adjusts that based on statistics during planning, and 0008 does
  that adaptively during execution.

0009-wip-add-brinsort-regression-tests-20230218-2.patch
0010-wip-add-brinsort-amstats-regression-tests-20230218-2.patch

  These need to move to the earlier parts. The tests are rather
  expensive, so we'll need to reduce that somehow.

0011-wip-test-generator-script-20230218-2.patch

  Not intended for commit, I only included it to keep it as part of the
  patch series.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/23/23 15:19, Matthias van de Meent wrote:
>> Comments on 0001, mostly comments and patch design:

One more comment:

>>> + range->min_value = bval->bv_values[0];
>>> + range->max_value = bval->bv_values[1];

This produces dangling pointers for minmax indexes on by-ref types
such as text and bytea, due to the memory context of the decoding
tuple being reset every iteration. :-(

> >> +range_values_cmp(const void *a, const void *b, void *arg)
> >
> > Can the arguments of these functions be modified into the types they
> > are expected to receive? If not, could you add a comment on why that's
> > not possible?
>
> The reason is that that's what qsort() expects. If you change that to
> actual data types, you'll get compile-time warnings. I agree this may
> need better comments, though.

Thanks in advance.
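
(For context, the pattern in question looks roughly like this - a
sketch, where the struct name and the SortSupport argument are my
assumptions, not the patch's actual definitions:)

  #include "utils/sortsupport.h"

  typedef struct BrinRange
  {
      Datum       min_value;
      Datum       max_value;
  } BrinRange;

  /* qsort_arg() requires exactly this signature, hence the void pointers */
  static int
  range_values_cmp(const void *a, const void *b, void *arg)
  {
      const BrinRange *ra = (const BrinRange *) a;
      const BrinRange *rb = (const BrinRange *) b;
      SortSupport ssup = (SortSupport) arg;

      return ApplySortComparator(ra->min_value, false,
                                 rb->min_value, false, ssup);
  }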

> >> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> >
> > Could you please expand on this? We do have GIST support for ranges, too.
> >
>
> Expand in what way? This is meant to be AM-specific, so if GiST wants to
> collect some additional stats, it's free to do so - perhaps some of the
> ideas from the stats collected for BRIN would be applicable, but it's
> also bound to the index structure.

I don't quite understand the flow of the comment, as I don't clearly
see what the "BRIN for ranges" tries to refer to. In my view, that
makes it a bad example which needs further explanation or rewriting,
aka "expanding on".

> >> + * brin_minmax_stats
> >> + *        Calculate custom statistics for a BRIN minmax index.
> >> + *
> >> + * At the moment this calculates:
> >> + *
> >> + *  - number of summarized/not-summarized and all/has nulls ranges
> >
> > I think statistics gathering of an index should be done at the AM
> > level, not attribute level. The docs currently suggest that the user
> > builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> > one column each, which would make the statistics gathering use 16x
> > more IO if the scanned data cannot be reused.
> >
>
> Why? The row sample is collected only once and used for building all the
> index AM stats - it doesn't really matter if we analyze 16 single-column
> indexes or 1 index with 16 columns. Yes, we'll need to scan more
> indexes, but with 16 columns the summaries will be larger, so the
> total amount of I/O will be almost the same I think.
>
> Or maybe I don't understand what I/O you're talking about?

With the proposed patch, we do O(ncols_statsenabled) scans over the
BRIN index. Each scan reads all ncol columns of all block ranges from
disk, so in effect the data scan does on the order of
O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
all columns have statistics enabled.

> > It is possible to build BRIN indexes on more than one column with more
> > than one opclass family like `USING brin (id int8_minmax_ops, id
> > int8_bloom_ops)`. This would mean various duplicate statistics fields,
> > no?
> > It seems to me that it's more useful to do the null- and n_summarized
> > on the index level instead of duplicating that inside the opclass.
>
> I don't think it's worth it. The amount of data this would save is tiny,
> and it'd only apply to cases where the index includes the same attribute
> multiple times, and that's pretty rare I think. I don't think it's worth
> the extra complexity.

Not necessarily, it was just an example of where we'd save IO.
Note that the current gathering method already retrieves all tuple
attribute data, so from a basic processing perspective we'd save some
time decoding as well.

> >
> > I'm planning on reviewing the other patches, and noticed that a lot of
> > the patches are marked WIP. Could you share a status on those, because
> > currently that status is unknown: Are these patches you don't plan on
> > including, or are these patches only (or mostly) included for
> > debugging?
> >
>
> I think the WIP label is a bit misleading, I used it mostly to mark
> patches that are not meant to be committed on their own. A quick overview:
>
> [...]

Thanks for the explanation, that's quite helpful. I'll see to further
reviewing 0004 and 0005 when I have additional time.

Kind regards,

Matthias van de Meent.



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
> 
> One more comment:
> 
>>>> + range->min_value = bval->bv_values[0];
>>>> + range->max_value = bval->bv_values[1];
> 
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
> 

Yeah, that sounds like a bug. Also a sign the tests should have some
by-ref data types (presumably there are none, as that would certainly
trip some asserts etc.).

>>>> +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why that's
>>> not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
> 
> Thanks in advance.
> 
>>>> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GIST support for ranges, too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants to
>> collect some additional stats, it's free to do so - perhaps some of the
>> ideas from the stats collected for BRIN would be applicable, but it's
>> also bound to the index structure.
> 
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
> 

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say only BRIN, but I was focused on the minmax use case,
so I mentioned the ranges.

>>>> + * brin_minmax_stats
>>>> + *        Calculate custom statistics for a BRIN minmax index.
>>>> + *
>>>> + * At the moment this calculates:
>>>> + *
>>>> + *  - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
>>> one column each, which would make the statistics gathering use 16x
>>> more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all the
>> index AM stats - it doesn't really matter if we analyze 16 single-column
>> indexes or 1 index with 16 columns. Yes, we'll need to scan more
>> indexes, but with 16 columns the summaries will be larger, so the
>> total amount of I/O will be almost the same I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
> 
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
> 

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like

  O(ncols_statsenabled * nranges)

assuming nranges is the number of page ranges. But even that's likely a
significant overestimate because most of the tuples will be served from
shared buffers.

Considering how tiny BRIN indexes are, this is likely orders of
magnitude less I/O than we expend on sampling rows from the table. I
mean, with the default statistics target we read ~30000 pages (~240MB)
or more just to sample the rows. Randomly, while the BRIN index is
likely scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.


>>> It is possible to build BRIN indexes on more than one column with more
>>> than one opclass family like `USING brin (id int8_minmax_ops, id
>>> int8_bloom_ops)`. This would mean various duplicate statistics fields,
>>> no?
>>> It seems to me that it's more useful to do the null- and n_summarized
>>> on the index level instead of duplicating that inside the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is tiny,
>> and it'd only apply to cases where the index includes the same attribute
>> multiple times, and that's pretty rare I think. I don't think it's worth
>> the extra complexity.
> 
> Not necessarily, it was just an example of where we'd save IO.
> Note that the current gathering method already retrieves all tuple
> attribute data, so from a basic processing perspective we'd save some
> time decoding as well.
> 

[shrug] I still think it's a negligible fraction of the time.

>>>
>>> I'm planning on reviewing the other patches, and noticed that a lot of
>>> the patches are marked WIP. Could you share a status on those, because
>>> currently that status is unknown: Are these patches you don't plan on
>>> including, or are these patches only (or mostly) included for
>>> debugging?
>>>
>>
>> I think the WIP label is a bit misleading, I used it mostly to mark
>> patches that are not meant to be committed on their own. A quick overview:
>>
>> [...]
> 
> Thanks for the explanation, that's quite helpful. I'll see to further
> reviewing 0004 and 0005 when I have additional time.
> 

Cool, thank you!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-23, Matthias van de Meent wrote:

> > + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.

They are definitely not very common -- BlockNumber wraps around at 32 TB
IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
a way to write these loops in a more robust manner.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 09:39, Alvaro Herrera wrote:
> On 2023-Feb-23, Matthias van de Meent wrote:
> 
>>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
>>
>> I am not familiar with the frequency of max-sized relations, but this
>> would go into an infinite loop for pagesPerRange values >1 for
>> max-sized relations due to BlockNumber wraparound. I think there
>> should be some additional overflow checks here.
> 
> They are definitely not very common -- BlockNumber wraps around at 32 TB
> IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
> a way to write these loops in a more robust manner.
> 

I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
eliminate the overflow.
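
For example (a sketch, simply widening the loop variable):

  uint64      heapBlk;

  for (heapBlk = 0; heapBlk < (uint64) nblocks; heapBlk += pagesPerRange)
  {
    ...
  }

Since nblocks is a uint32 BlockNumber, the uint64 arithmetic can never
wrap around.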

Alternatively, we could do something like

  prevHeapBlk = 0;
  for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
       heapBlk += pagesPerRange)
  {
    ...
    prevHeapBlk = heapBlk;
  }


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-24, Tomas Vondra wrote:

> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
> eliminate the overflow.

Yeah, that might be easy to set up.  We then don't have to worry about
it until BlockNumber is enlarged to 64 bits ... but by that time surely
we can just grow it again to a 128 bit loop variable.

> Alternatively, we could do something like
> 
>   prevHeapBlk = 0;
>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>        heapBlk += pagesPerRange)
>   {
>     ...
>     prevHeapBlk = heapBlk;
>   }

I think a formulation of this kind has the benefit that it works after
BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
again (assuming it is correct).

... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
this will neglect the last range in the table.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 16:14, Alvaro Herrera wrote:
> On 2023-Feb-24, Tomas Vondra wrote:
> 
>> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
>> eliminate the overflow.
> 
> Yeah, that might be easy to set up.  We then don't have to worry about
> it until BlockNumber is enlarged to 64 bits ... but by that time surely
> we can just grow it again to a 128 bit loop variable.
> 
>> Alternatively, we could do something like
>>
>>   prevHeapBlk = 0;
>>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>>        heapBlk += pagesPerRange)
>>   {
>>     ...
>>     prevHeapBlk = heapBlk;
>>   }
> 
> I think a formulation of this kind has the benefit that it works after
> BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> again (assuming it is correct).
> 

Did anyone even propose doing that? I suspect this is unlikely to be the
only place that might be broken by that.

> ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> this will neglect the last range in the table.
> 

Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
are 10 pages per range. That's 25 "full" ranges, and the last range
being just 5 pages. So we get into

   prevHeapBlk = 240
   heapBlk = 250

and we read the last 5 pages. And then we update

   prevHeapBlk = 250
   heapBlk = (250 + 10) % 255 = 5

and we don't do that loop. Or did I get this wrong, somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 24 Feb 2023 at 17:04, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/24/23 16:14, Alvaro Herrera wrote:
> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> >
>
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
>
>    prevHeapBlk = 240
>    heapBlk = 250
>
> and we read the last 5 pages. And then we update
>
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
>
> and we don't do that loop. Or did I get this wrong, somehow?

The result is off-by-one due to (u)int8 overflows being mod-256, but
apart from that your result is accurate.

The condition only stops the loop when we wrap around or when we go
past the last block, but no earlier than that.


Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/23/23 17:44, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 2/23/23 15:19, Matthias van de Meent wrote:
> >>> Comments on 0001, mostly comments and patch design:
> >
> > One more comment:
> >
> >>>> + range->min_value = bval->bv_values[0];
> >>>> + range->max_value = bval->bv_values[1];
> >
> > This produces dangling pointers for minmax indexes on by-ref types
> > such as text and bytea, due to the memory context of the decoding
> > tuple being reset every iteration. :-(
> >
>
> Yeah, that sounds like a bug. Also a sign the tests should have some
> by-ref data types (presumably there are none, as that would certainly
> trip some asserts etc.).

I'm not sure we currently trip asserts, as the data we store in the
memory context is very limited, making it unlikely we actually release
the memory region back to the OS.
I did get assertion failures by adding the attached patch, but I don't
think that's something we can do in the final release.

> > With the proposed patch, we do O(ncols_statsenabled) scans over the
> > BRIN index. Each scan reads all ncol columns of all block ranges from
> > disk, so in effect the data scan does on the order of
> > O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> > all columns have statistics enabled.
> >
>
> I don't think that's the number of I/O operations we'll do, because we
> always read the whole BRIN tuple at once. So I believe it should rather
> be something like
>
>   O(ncols_statsenabled * nranges)
>
> assuming nranges is the number of page ranges. But even that's likely a
> significant overestimate because most of the tuples will be served from
> shared buffers.

We store some data per index attribute, which makes the IO required
for a single indexed range proportional to the number of index
attributes.
If we then scan the index a number of times proportional to the number
of attributes, the cumulative IO load of statistics gathering for that
index is quadratic on the number of attributes, not linear, right?

> Considering how tiny BRIN indexes are, this is likely orders of
> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
> or more just to sample the rows. Randomly, while the BRIN index is
> likely scanned mostly sequentially.

Mostly agreed; except I think it's not too difficult to imagine a BRIN
index on that scale - take, for example, the bloom filters, which
easily take up several hundred bytes.

With the default configuration of 128 pages_per_range,
n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
bloom filter size is 4.36 KiB - each indexed item on its own page. It
is still only 1% of the original table's size, but there are enough
tables that are larger than 24GB that this could be a significant
issue.
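
(For the record, that 4.36 KiB follows from the usual bloom sizing
formula m = -n * ln(p) / (ln 2)^2, assuming the default of ~291 heap
tuples per 8kB page:

  n = 0.1 * 128 * 291            ~= 3725 values per range
  m = 3725 * 9.585 bits/value    ~= 35700 bits ~= 4.36 KiB
)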

Note that most of my concerns are related to our current
documentation's statement that there are no demerits to multi-column
BRIN indexes:

"""
11.3. Multicolumn Indexes

[...] The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter.
"""

Wide BRIN indexes add IO overhead for single-attribute scans when
compared to single-attribute indexes, so having N single-attribute
index scans to build the statistics on an N-attribute index
is not great.

> Maybe there are cases where this would be an issue, but I haven't seen
> one when working on this patch (and I did a lot of experiments). I'd
> like to see one before we start optimizing it ...

I'm not only worried about optimizing it, I'm also worried that we're
putting this abstraction at the wrong level in a way that is difficult
to modify.

> This also reminds me that the issues I actually saw (e.g. memory
> consumption) would be made worse by processing all columns at once,
> because then you need to keep more columns in memory.

Yes, I think that can be a valid concern, but don't we already do similar
things in the current table statistics gathering?

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 19:03, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>>>> Comments on 0001, mostly comments and patch design:
>>>
>>> One more comment:
>>>
>>>>>> + range->min_value = bval->bv_values[0];
>>>>>> + range->max_value = bval->bv_values[1];
>>>
>>> This produces dangling pointers for minmax indexes on by-ref types
>>> such as text and bytea, due to the memory context of the decoding
>>> tuple being reset every iteration. :-(
>>>
>>
>> Yeah, that sounds like a bug. Also a sign the tests should have some
>> by-ref data types (presumably there are none, as that would certainly
>> trip some asserts etc.).
> 
> I'm not sure we currently trip asserts, as the data we store in the
> memory context is very limited, making it unlikely we actually release
> the memory region back to the OS.
> I did get assertion failures by adding the attached patch, but I don't
> think that's something we can do in the final release.
> 

But we should randomize the memory if we ever do pfree(), and it's
strange valgrind didn't complain when I ran tests with it. But yeah,
I'll take a look and see if we can add some tests covering this.

>>> With the proposed patch, we do O(ncols_statsenabled) scans over the
>>> BRIN index. Each scan reads all ncol columns of all block ranges from
>>> disk, so in effect the data scan does on the order of
>>> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
>>> all columns have statistics enabled.
>>>
>>
>> I don't think that's the number of I/O operations we'll do, because we
>> always read the whole BRIN tuple at once. So I believe it should rather
>> be something like
>>
>>   O(ncols_statsenabled * nranges)
>>
>> assuming nranges is the number of page ranges. But even that's likely a
>> significant overestimate because most of the tuples will be served from
>> shared buffers.
> 
> We store some data per index attribute, which makes the IO required
> for a single indexed range proportional to the number of index
> attributes.
> If we then scan the index a number of times proportional to the number
> of attributes, the cumulative IO load of statistics gathering for that
> index is quadratic on the number of attributes, not linear, right?
> 

Ah, OK. I was focusing on number of I/O operations while you're talking
about amount of I/O performed. You're right the amount of I/O is
quadratic, but I think what's more interesting is the comparison of the
alternative ANALYZE approaches. The current simple approach does a
multiple of that I/O amount, linear in the number of attributes.

Which is not great, ofc.


>> Considering how tiny BRIN indexes are, this is likely orders of
>> magnitude less I/O than we expend on sampling rows from the table. I
>> mean, with the default statistics target we read ~30000 pages (~240MB)
>> or more just to sample the rows. Randomly, while the BRIN index is
>> likely scanned mostly sequentially.
> 
> Mostly agreed; except I think it's not too difficult to imagine a BRIN
> index on that scale - take, for example, the bloom filters, which
> easily take up several hundred bytes.
> 
> With the default configuration of 128 pages_per_range,
> n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> bloom filter size is 4.36 KiB - each indexed item on its own page. It
> is still only 1% of the original table's size, but there are enough
> tables that are larger than 24GB that this could be a significant
> issue.
> 

Right, it's certainly true BRIN indexes may be made fairly large (either
by using something like bloom or by making the ranges much smaller). But
those are exactly the indexes where building statistics for all columns
at once is going to cause issues with memory usage etc.

Note: Obviously, that depends on how much data per range we need to keep
in memory. For bloom I doubt we'd actually want to keep all the filters,
we'd probably calculate just some statistics (e.g. number of bits set),
so maybe the memory consumption is not that bad.

In fact, for such indexes the memory consumption may already be an issue
even when analyzing the index column by column. My feeling is we'll need
to do something about that, e.g. by reading only a sample of the ranges,
or something like that. That might help with the I/O too, I guess.

> Note that most of my concerns are related to our current
> documentation's statement that there are no demerits to multi-column
> BRIN indexes:
> 
> """
> 11.3. Multicolumn Indexes
> 
> [...] The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter.
> """
> 

True, we may need to clarify this in the documentation.

> Wide BRIN indexes add IO overhead for single-attribute scans when
> compared to single-attribute indexes, so having N single-attribute
> index scans to build the statistics on an N-attribute index
> is not great.
> 

Perhaps, but it's not like the alternative is perfect either
(complexity, memory consumption, ...). IMHO it's a reasonable tradeoff.

>> Maybe there are cases where this would be an issue, but I haven't seen
>> one when working on this patch (and I did a lot of experiments). I'd
>> like to see one before we start optimizing it ...
> 
> I'm not only worried about optimizing it, I'm also worried that we're
> putting this abstraction at the wrong level in a way that is difficult
> to modify.
> 

Yeah, that's certainly a valid concern. I admit I only did the minimum
amount of work on this part, as I was focused on the sorting part.

>> This also reminds me that the issues I actually saw (e.g. memory
>> consumption) would be made worse by processing all columns at once,
>> because then you need to keep more columns in memory.
> 
> Yes, I think that can be a valid concern, but don't we already do similar
> things in the current table statistics gathering?
> 

Not really, I think. We sample a bunch of rows once, but then we build
statistics on this sample for each attribute / expression independently.
We could of course read the whole index into memory and then run the
analysis, but I think tuples tend to be much smaller (thanks to TOAST
etc.) and we only really scan a limited amount of them (30k).

But if we're concerned about the I/O, the BRIN is likely fairly large,
so maybe reading it into memory at once is not a great idea.

It'd be more similar if we sampled the BRIN ranges - I think
we'll have to do something like that anyway.
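
The simplest form of that might be to process only every k-th range (a
sketch, with made-up variable names):

  /* analyze only a uniform subset of the ranges */
  int         step = Max(1, nranges / targranges);
  int         rangeno;

  for (rangeno = 0; rangeno < nranges; rangeno += step)
  {
      /* read and analyze the summary of the range starting at
       * block (rangeno * pagesPerRange) */
  }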


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/24/23 20:14, Tomas Vondra wrote:
> 
> 
> On 2/24/23 19:03, Matthias van de Meent wrote:
>> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>>>>> Comments on 0001, mostly comments and patch design:
>>>>
>>>> One more comment:
>>>>
>>>>>>> + range->min_value = bval->bv_values[0];
>>>>>>> + range->max_value = bval->bv_values[1];
>>>>
>>>> This produces dangling pointers for minmax indexes on by-ref types
>>>> such as text and bytea, due to the memory context of the decoding
>>>> tuple being reset every iteration. :-(
>>>>
>>>
>>> Yeah, that sounds like a bug. Also a sign the tests should have some
>>> by-ref data types (presumably there are none, as that would certainly
>>> trip some asserts etc.).
>>
>> I'm not sure we currently trip asserts, as the data we store in the
>> memory context is very limited, making it unlikely we actually release
>> the memory region back to the OS.
>> I did get assertion failures by adding the attached patch, but I don't
>> think that's something we can do in the final release.
>>
> 
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it. But yeah,
> I'll take a look and see if we can add some tests covering this.
> 

There was no patch attached to trigger the assertions, but the attached
patches should fix the bug, I think. They pretty much just do datumCopy()
after reading the BRIN tuple from disk.
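
Roughly like this (a sketch - "attr" standing for the attribute's
pg_attribute entry, not the patch's actual variable):

  /*
   * Copy the datums out of the per-tuple memory context, so the values
   * survive the context reset on the next iteration.
   */
  range->min_value = datumCopy(bval->bv_values[0],
                               attr->attbyval, attr->attlen);
  range->max_value = datumCopy(bval->bv_values[1],
                               attr->attbyval, attr->attlen);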

It's interesting that I've been unable to hit the usual asserts checking
freed memory, even with text columns. I guess the BRIN tuple memory
happens to be aligned in a way that just works.

It did, however, trigger an assert failure in minval_end, because it
didn't use the proper comparator.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 24 Feb 2023, 20:14 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:
>
> On 2/24/23 19:03, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> >> Yeah, that sounds like a bug. Also a sign the tests should have some
> >> by-ref data types (presumably there are none, as that would certainly
> >> trip some asserts etc.).
> >
> > I'm not sure we currently trip asserts, as the data we store in the
> > memory context is very limited, making it unlikely we actually release
> > the memory region back to the OS.
> > I did get assertion failures by adding the attached patch, but I don't
> > think that's something we can do in the final release.
> >
>
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it.

Well, we don't clean up the decoding tuple immediately after our last
iteration, so the memory context (and the last tuple's attributes) are
still valid memory addresses. And, assuming that min/max values for
all brin ranges all have the same max-aligned length, the attribute
pointers are likely to point to the same offset within the decoding
tuple's memory context's memory segment, which would mean the dangling
pointers still point to valid memory - just not memory with the
contents they expected to be pointing to.

> >> Considering how tiny BRIN indexes are, this is likely orders of
> >> magnitude less I/O than we expend on sampling rows from the table. I
> >> mean, with the default statistics target we read ~30000 pages (~240MB)
> >> or more just to sample the rows. Randomly, while the BRIN index is
> >> likely scanned mostly sequentially.
> >
> > Mostly agreed; except I think it's not too difficult to imagine a BRIN
> > index on that scale - take, for example, the bloom filters, which
> > easily take up several hundred bytes.
> >
> > With the default configuration of 128 pages_per_range,
> > n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> > bloom filter size is 4.36 KiB - each indexed item on its own page. It
> > is still only 1% of the original table's size, but there are enough
> > tables that are larger than 24GB that this could be a significant
> > issue.
> >
>
> Right, it's certainly true BRIN indexes may be made fairly large (either
> by using something like bloom or by making the ranges much smaller). But
> those are exactly the indexes where building statistics for all columns
> at once is going to cause issues with memory usage etc.
>
> Note: Obviously, that depends on how much data per range we need to keep
> in memory. For bloom I doubt we'd actually want to keep all the filters,
> we'd probably calculate just some statistics (e.g. number of bits set),
> so maybe the memory consumption is not that bad.

Yes, I was thinking something along the same lines for bloom as well.
Something like 'average number of bits set' (or a histogram of the
number of bits set), and/or, for each bit, a count (or %) of how many
times it is set, etc.
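
Computing such a fill fraction per filter would be cheap (a sketch,
assuming we have the filter's bitmap and its length in bytes at hand):

  /* fraction of bits set in one bloom filter's bitmap */
  double      fill = (double) pg_popcount((const char *) bitmap, nbytes)
                     / (nbytes * 8);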

> >> Maybe there are cases where this would be an issue, but I haven't seen
> >> one when working on this patch (and I did a lot of experiments). I'd
> >> like to see one before we start optimizing it ...
> >
> > I'm not only worried about optimizing it, I'm also worried that we're
> > putting this abstraction at the wrong level in a way that is difficult
> > to modify.
> >
>
> Yeah, that's certainly a valid concern. I admit I only did the minimum
> amount of work on this part, as I was focused on the sorting part.
>
> >> This also reminds me that the issues I actually saw (e.g. memory
> >> consumption) would be made worse by processing all columns at once,
> >> because then you need to keep more columns in memory.
> >
> > Yes, I think that can be a valid concern, but don't we already do similar
> > things in the current table statistics gathering?
> >
>
> Not really, I think. We sample a bunch of rows once, but then we build
> statistics on this sample for each attribute / expression independently.
> We could of course read the whole index into memory and then run the
> analysis, but I think tuples tend to be much smaller (thanks to TOAST
> etc.) and we only really scan a limited amount of them (30k).

Just to note, with default settings, sampling 30k index entries from
BRIN would cover ~29 GiB of a table. This isn't a lot, but it also
isn't exactly a small table. I think that it would be difficult to get
accurate avg_overlap statistics for some shapes of BRIN data...

> But if we're concerned about the I/O, the BRIN is likely fairly large,
> so maybe reading it into memory at once is not a great idea.

Agreed, we can't always expect that the interesting parts of the BRIN
index always fit in the available memory.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
Hi,

On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> I'll see to further reviewing 0004 and 0005 when I have additional time.

Some initial comments on 0004:

> +/*
> + * brin_minmax_ranges
> + *        Load the BRIN ranges and sort them.
> + */
> +Datum
> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> +{

Like in 0001, this seems to focus on only single columns. Can't we put
the scan and sort infrastructure in brin, and put the weight- and
compare-operators in the opclasses? I.e.
brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
brin_minmax_compare(order, order) -> int? I'm thinking of something
similar to GIST's distance operators, which would make implementing
ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
brin infrastructure.
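
In rough C terms, I'm thinking of something like this (hypothetical
names and signatures, not actual patch code):

  /* opclass support procs: extract sort keys from one BRIN summary */
  Datum brin_minmax_minorder(PG_FUNCTION_ARGS); /* (brintuple) -> min order */
  Datum brin_minmax_maxorder(PG_FUNCTION_ARGS); /* (brintuple) -> max order */

  /* compare two such order values */
  Datum brin_minmax_compare(PG_FUNCTION_ARGS);  /* (order, order) -> int */

with brin.c doing the scan itself, calling these procedures for each
range, and feeding the results into the sort.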

Note: One big reason I don't really like the current
brin_minmax_ranges (and the analyze code in 0001) is because it breaks
the operatorclass-vs-index abstraction layer. Operator classes don't
(shouldn't) know or care about which attribute number they have, nor
what the index does with the data.
Scanning the index is not something that I expect the operator class
to do, I expect that the index code organizes the scan, and forwards
the data to the relevant operator classes.
Scanning the index N times for N attributes can be excused if there
are good reasons, but I'd rather have that decision made in the
index's core code rather than at the design level.

> +/*
> + * XXX Does it make sense (is it possible) to have a sort by more than one
> + * column, using a BRIN index?
> + */

Yes, even if only one prefix column is included in the BRIN index
(e.g. `company` in `ORDER BY company, date`), the tuplesort with table
tuples can add additional sorting without first reading the whole
table, potentially (likely) reducing the total resource usage of the
query. That utilizes the same idea as incremental sorts, but with the
caveat that the input sort order is approximately likely but not at
all guaranteed. So, even if the range sort is on a single index
column, we can still do the table's tuplesort on all ORDER BY
attributes, as long as a prefix of ORDER BY columns are included in
the BRIN index.

> +            /*
> +             * XXX We can be a bit smarter for LIMIT queries - once we
> +             * know we have more rows in the tuplesort than we need to
> +             * output, we can stop spilling - those rows are not going
> +             * to be needed. We can discard the tuplesort (no need to
> +             * respill) and stop spilling.
> +             */

Shouldn't that be "discard the tuplestore"?

> +#define BRIN_PROCNUM_RANGES         12    /* optional */

It would be useful to add documentation for this in this patch.


Kind regards,

Matthias van de Meent.



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-24, Tomas Vondra wrote:

> On 2/24/23 16:14, Alvaro Herrera wrote:

> > I think a formulation of this kind has the benefit that it works after
> > BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> > again (assuming it is correct).
> 
> Did anyone even propose doing that? I suspect this is unlikely to be the
> only place that might be broken by that.

True about other places also needing fixes, and no, I haven't seen
anyone propose it; but while 32 TB does seem very far away to us now, it
might not be *that* far away.  So I think doing it the other way is better.

> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> 
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
> 
>    prevHeapBlk = 240
>    heapBlk = 250
> 
> and we read the last 5 pages. And then we update
> 
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
> 
> and we don't do that loop. Or did I get this wrong, somehow?

I stand corrected.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

I finally had time to look at this patch again. There's a bit of bitrot,
so here's a rebased version (no other changes).

[more comments inline]

On 2/27/23 16:40, Matthias van de Meent wrote:
> Hi,
> 
> On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>>
>> I'll see to further reviewing 0004 and 0005 when I have additional time.
> 
> Some initial comments on 0004:
> 
>> +/*
>> + * brin_minmax_ranges
>> + *        Load the BRIN ranges and sort them.
>> + */
>> +Datum
>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>> +{
> 
> Like in 0001, this seems to focus on only single columns. Can't we put
> the scan and sort infrastructure in brin, and put the weight- and
> compare-operators in the opclasses? I.e.
> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> brin_minmax_compare(order, order) -> int? I'm thinking of something
> similar to GIST's distance operators, which would make implementing
> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> brin infrastructure.
> 
> Note: One big reason I don't really like the current
> brin_minmax_ranges (and the analyze code in 0001) is because it breaks
> the operatorclass-vs-index abstraction layer. Operator classes don't
> (shouldn't) know or care about which attribute number they have, nor
> what the index does with the data.
> Scanning the index is not something that I expect the operator class
> to do, I expect that the index code organizes the scan, and forwards
> the data to the relevant operator classes.
> Scanning the index N times for N attributes can be excused if there
> are good reasons, but I'd rather have that decision made in the
> index's core code rather than at the design level.
> 

I think you're right. It'd be more appropriate to have most of the core
scanning logic in brin.c, and then delegate only some minor decisions to
the opclass. Like, comparisons, extraction of min/max from ranges etc.

However, it's not quite clear to me what you mean by the weight- and
compare-operators? That is, what are

 - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
 - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
 - brin_minmax_compare(order, order) -> int

supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

In principle we just need a procedure that tells us min/max for a given
page range - I guess that's what the minorder/maxorder functions do? But
why would we need the compare one? We're comparing by the known data
type, so we can just delegate the comparison to that, no?

Also, the existence of these opclass procedures should be enough to
identify the opclasses which can support this.

>> +/*
>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>> + * column, using a BRIN index?
>> + */
> 
> Yes, even if only one prefix column is included in the BRIN index
> (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> tuples can add additional sorting without first reading the whole
> table, potentially (likely) reducing the total resource usage of the
> query. That utilizes the same idea as incremental sorts, but with the
> caveat that the input sort order is approximately likely but not at
> all guaranteed. So, even if the range sort is on a single index
> column, we can still do the table's tuplesort on all ORDER BY
> attributes, as long as a prefix of ORDER BY columns are included in
> the BRIN index.
> 

That's not quite what I meant, though. When I mentioned sorting by more
than one column, I meant using a multi-column BRIN index on those
columns. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t ...
   CREATE INDEX ON t USING brin (a,b);

   SELECT * FROM t ORDER BY a, b;

Now, what I think you described is using BRIN to sort by "a", and then
do incremental sort for "b". What I had in mind is whether it's possible
to use BRIN to sort by "b" too.

I was suspecting it might be made to work, but now that I think about it
again it probably can't - BRIN pretty much sorts the columns separately,
it's not like having an ordering by (a,b) - first by "a", then "b".

>> +            /*
>> +             * XXX We can be a bit smarter for LIMIT queries - once we
>> +             * know we have more rows in the tuplesort than we need to
>> +             * output, we can stop spilling - those rows are not going
>> +             * to be needed. We can discard the tuplesort (no need to
>> +             * respill) and stop spilling.
>> +             */
> 
> Shouldn't that be "discard the tuplestore"?
> 

Yeah, definitely.

>> +#define BRIN_PROCNUM_RANGES         12    /* optional */
> 
> It would be useful to add documentation for this in this patch.
> 

Right, this should be documented in doc/src/sgml/brin.sgml.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I finally had time to look at this patch again. There's a bit of bitrot,
> so here's a rebased version (no other changes).

Thanks!

> On 2/27/23 16:40, Matthias van de Meent wrote:
> > Some initial comments on 0004:
> >
> >> +/*
> >> + * brin_minmax_ranges
> >> + *        Load the BRIN ranges and sort them.
> >> + */
> >> +Datum
> >> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> >> +{
> >
> > Like in 0001, this seems to focus on only single columns. Can't we put
> > the scan and sort infrastructure in brin, and put the weight- and
> > compare-operators in the opclasses? I.e.
> > brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> > brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> > brin_minmax_compare(order, order) -> int? I'm thinking of something
> > similar to GIST's distance operators, which would make implementing
> > ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> > brin infrastructure.
> >
>
> However, it's not quite clear to me what you mean by the weight- and
> compare-operators? That is, what are
>
>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>  - brin_minmax_compare(order, order) -> int
>
> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

_minorder/_maxorder is for extracting the minimum/maximum relative
order of each range, used for ASC/DESC sorting of operator results
(e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
though I did forget to describe the second operator argument for the
distance function. We might also want to use only one such "order
extraction function" with DESC/ASC indicated by an argument.

> In principle we just need a procedure that tells us min/max for a given
> page range - I guess that's what the minorder/maxorder functions do? But
> why would we need the compare one? We're comparing by the known data
> type, so we can just delegate the comparison to that, no?

Is there a comparison function for any custom orderable type that we
can just use? GIST distance ordering uses floats, and I don't quite
like that from a user perspective, as it makes ordering operations
imprecise. I'd rather allow (but discourage) any type with its own
compare function.

> Also, the existence of these opclass procedures should be enough to
> identify the opclasses which can support this.

Agreed.

> >> +/*
> >> + * XXX Does it make sense (is it possible) to have a sort by more than one
> >> + * column, using a BRIN index?
> >> + */
> >
> > Yes, even if only one prefix column is included in the BRIN index
> > (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> > tuples can add additional sorting without first reading the whole
> > table, potentially (likely) reducing the total resource usage of the
> > query. That utilizes the same idea as incremental sorts, but with the
> > caveat that the input sort order is approximately likely but not at
> > all guaranteed. So, even if the range sort is on a single index
> > column, we can still do the table's tuplesort on all ORDER BY
> > attributes, as long as a prefix of ORDER BY columns are included in
> > the BRIN index.
> >
>
> That's not quite what I meant, though. When I mentioned sorting by more
> than one column, I meant using a multi-column BRIN index on those
> columns. Something like this:
>
>    CREATE TABLE t (a int, b int);
>    INSERT INTO t ...
>    CREATE INDEX ON t USING brin (a,b);
>
>    SELECT * FROM t ORDER BY a, b;
>
> Now, what I think you described is using BRIN to sort by "a", and then
> do incremental sort for "b". What I had in mind is whether it's possible
> to use BRIN to sort by "b" too.
>
> I was suspecting it might be made to work, but now that I think about it
> again it probably can't - BRIN pretty much sorts the columns separately,
> it's not like having an ordering by (a,b) - first by "a", then "b".

I imagine it would indeed be limited to an extremely small subset of
cases, and probably not worth the effort to implement in an initial
version.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/10/23 12:22, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I finally had time to look at this patch again. There's a bit of bitrot,
>> so here's a rebased version (no other changes).
> 
> Thanks!
> 
>> On 2/27/23 16:40, Matthias van de Meent wrote:
>>> Some initial comments on 0004:
>>>
>>>> +/*
>>>> + * brin_minmax_ranges
>>>> + *        Load the BRIN ranges and sort them.
>>>> + */
>>>> +Datum
>>>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>>>> +{
>>>
>>> Like in 0001, this seems to focus on only single columns. Can't we put
>>> the scan and sort infrastructure in brin, and put the weight- and
>>> compare-operators in the opclasses? I.e.
>>> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
>>> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
>>> brin_minmax_compare(order, order) -> int? I'm thinking of something
>>> similar to GIST's distance operators, which would make implementing
>>> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
>>> brin infrastructure.
>>>
>>
>> However, it's not quite clear to me what you mean by the weight- and
>> compare-operators? That is, what are
>>
>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>  - brin_minmax_compare(order, order) -> int
>>
>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
> 
> _minorder/_maxorder is for extracting the minimum/maximum relative
> order of each range, used for ASC/DESC sorting of operator results
> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
> though I did forget to describe the second operator argument for the
> distance function. We might also want to use only one such "order
> extraction function" with DESC/ASC indicated by an argument.
> 

I'm still not sure I understand what "minimum/maximum relative order"
is. Isn't it the same as returning min/max value that can appear in the
range? Although, that wouldn't work for points/boxes.

>> In principle we just need a procedure that tells us min/max for a given
>> page range - I guess that's what the minorder/maxorder functions do? But
>> why would we need the compare one? We're comparing by the known data
>> type, so we can just delegate the comparison to that, no?
> 
> Is there a comparison function for any custom orderable type that we
> can just use? GIST distance ordering uses floats, and I don't quite
> like that from a user perspective, as it makes ordering operations
> imprecise. I'd rather allow (but discourage) any type with its own
> compare function.
> 

I haven't really thought about geometric types, just about minmax and
minmax-multi. It's not clear to me what the benefit for these types would be.
I mean, we can probably sort points lexicographically, but is anyone
doing that in queries? It seems useless for order by distance.

>> Also, the existence of these opclass procedures should be enough to
>> identify the opclasses which can support this.
> 
> Agreed.
> 
>>>> +/*
>>>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>>>> + * column, using a BRIN index?
>>>> + */
>>>
>>> Yes, even if only one prefix column is included in the BRIN index
>>> (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
>>> tuples can add additional sorting without first reading the whole
>>> table, potentially (likely) reducing the total resource usage of the
>>> query. That utilizes the same idea as incremental sorts, but with the
>>> caveat that the input sort order is approximately likely but not at
>>> all guaranteed. So, even if the range sort is on a single index
>>> column, we can still do the table's tuplesort on all ORDER BY
>>> attributes, as long as a prefix of ORDER BY columns are included in
>>> the BRIN index.
>>>
>>
>> That's not quite what I meant, though. When I mentioned sorting by more
>> than one column, I meant using a multi-column BRIN index on those
>> columns. Something like this:
>>
>>    CREATE TABLE t (a int, b int);
>>    INSERT INTO t ...
>>    CREATE INDEX ON t USING brin (a,b);
>>
>>    SELECT * FROM t ORDER BY a, b;
>>
>> Now, what I think you described is using BRIN to sort by "a", and then
>> do incremental sort for "b". What I had in mind is whether it's possible
>> to use BRIN to sort by "b" too.
>>
>> I was suspecting it might be made to work, but now that I think about it
>> again it probably can't - BRIN pretty much sorts the columns separately,
>> it's not like having an ordering by (a,b) - first by "a", then "b".
> 
> I imagine it would indeed be limited to an extremely small subset of
> cases, and probably not worth the effort to implement in an initial
> version.
> 

OK, let's stick to order by a single column.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 12:22, Matthias van de Meent wrote:
>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>> However, it's not quite clear to me what you mean by the weight- and
>>> compare-operators? That is, what are
>>>
>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>  - brin_minmax_compare(order, order) -> int
>>>
>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>
>> _minorder/_maxorder is for extracting the minimum/maximum relative
>> order of each range, used for ASC/DESC sorting of operator results
>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>> though I did forget to describe the second operator argument for the
>> distance function. We might also want to use only one such "order
>> extraction function" with DESC/ASC indicated by an argument.
>>
>
> I'm still not sure I understand what "minimum/maximum relative order"
> is. Isn't it the same as returning min/max value that can appear in the
> range? Although, that wouldn't work for points/boxes.

Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
only need to extract the normal min/max values for ASC/DESC sorts,
which are readily available in the summary. But for multi-dimensional
and distance searches (nearest neighbour) we need to calculate the
distance between the indexed value(s) and the origin value to compare
the summary against, and the order would thus be asc/desc on distance
- a distance which may not be precisely represented by float types -
thus 'relative order' with its own order operation.

>>> In principle we just need a procedure that tells us min/max for a given
>>> page range - I guess that's what the minorder/maxorder functions do? But
>>> why would we need the compare one? We're comparing by the known data
>>> type, so we can just delegate the comparison to that, no?
>>
>> Is there a comparison function for any custom orderable type that we
>> can just use? GIST distance ordering uses floats, and I don't quite
>> like that from a user perspective, as it makes ordering operations
>> imprecise. I'd rather allow (but discourage) any type with its own
>> compare function.
>>
>
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types would be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

Yes, that's why you would sort them by distance, where the distance is
generated by the opclass as min/max distance between the summary and
the distance's origin, and then inserted into the tuplesort.

(previously)
>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>> so here's a rebased version (no other changes).

It seems like you forgot to attach the rebased patch, so unless you're
actively working on updating the patchset right now, could you send
the rebase to make CFBot happy?


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 7/10/23 14:38, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 12:22, Matthias van de Meent wrote:
>>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>> However, it's not quite clear to me what you mean by the weight- and
>>>> compare-operators? That is, what are
>>>>
>>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>>  - brin_minmax_compare(order, order) -> int
>>>>
>>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>>
>>> _minorder/_maxorder is for extracting the minimum/maximum relative
>>> order of each range, used for ASC/DESC sorting of operator results
>>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>>> though I did forget to describe the second operator argument for the
>>> distance function. We might also want to use only one such "order
>>> extraction function" with DESC/ASC indicated by an argument.
>>>
>>
>> I'm still not sure I understand what "minimum/maximum relative order"
>> is. Isn't it the same as returning min/max value that can appear in the
>> range? Although, that wouldn't work for points/boxes.
> 
> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
> only need to extract the normal min/max values for ASC/DESC sorts,
> which are readily available in the summary. But for multi-dimensional
> and distance searches (nearest neighbour) we need to calculate the
> distance between the indexed value(s) and the origin value to compare
> the summary against, and the order would thus be asc/desc on distance
> - a distance which may not be precisely represented by float types -
> thus 'relative order' with its own order operation.
> 

Can you give some examples of such data / queries, and how they would
leverage the BRIN sort stuff?

For distance searches, I imagine this as data indexed by BRIN inclusion
opclass, which creates a bounding box. We could return closest/furthest
point on the bounding box (from the point used in the query). Which
seems a bit like an R-tree ...

But I have no idea what this would do for multi-dimensional searches, or
what would those searches do? How would you sort such data other than
lexicographically? Which I think is covered by the current BRIN Sort,
because the data is either stored as multiple columns, in which case we
use the BRIN on the first column. Or it's indexed using BRIN minmax as a
tuple of values, but then it's sorted lexicographically.


>>>> In principle we just need a procedure that tells us min/max for a given
>>>> page range - I guess that's what the minorder/maxorder functions do? But
>>>> why would we need the compare one? We're comparing by the known data
>>>> type, so we can just delegate the comparison to that, no?
>>>
>>> Is there a comparison function for any custom orderable type that we
>>> can just use? GIST distance ordering uses floats, and I don't quite
>>> like that from a user perspective, as it makes ordering operations
>>> imprecise. I'd rather allow (but discourage) any type with its own
>>> compare function.
>>>
>>
>> I haven't really thought about geometric types, just about minmax and
>> minmax-multi. It's not clear to me what the benefit for these types would be.
>> I mean, we can probably sort points lexicographically, but is anyone
>> doing that in queries? It seems useless for order by distance.
> 
> Yes, that's why you would sort them by distance, where the distance is
> generated by the opclass as min/max distance between the summary and
> the distance's origin, and then inserted into the tuplesort.
> 

OK, so the query says "order by distance from point X" and we calculate
the min/max distance of values in a given page range.

> (previously)
>>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>>> so here's a rebased version (no other changes).
> 
> It seems like you forgot to attach the rebased patch, so unless you're
> actively working on updating the patchset right now, could you send
> the rebase to make CFBot happy?
> 

Yeah, I forgot about the attachment. So here it is.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 14:38, Matthias van de Meent wrote:
>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>> only need to extract the normal min/max values for ASC/DESC sorts,
>> which are readily available in the summary. But for multi-dimensional
>> and distance searches (nearest neighbour) we need to calculate the
>> distance between the indexed value(s) and the origin value to compare
>> the summary against, and the order would thus be asc/desc on distance
>> - a distance which may not be precisely represented by float types -
>> thus 'relative order' with its own order operation.
>>
>
> Can you give some examples of such data / queries, and how they would
> leverage the BRIN sort stuff?

Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
the opclass would then decide that `<->(box, point) ASC` means it has
to return the closest distance from the point to the summary, for some
measure of 'distance' (in this case L2, <#> for other types, etc.). For DESC,
that would return the distance from `'(1,2)'::point` to the furthest
edge of the summary away from that point. Etc.
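
To make that concrete, here is a sketch of the kind of schema and
queries this support could enable (table and index names hypothetical,
and the current patchset does not implement distance ordering yet),
assuming the box_inclusion_ops BRIN opclass:

  create table pts (b box);
  create index pts_b_idx on pts using brin (b box_inclusion_ops);

  -- ASC: rank each range by the closest possible distance to the point
  select * from pts order by b <-> point '(1,2)' asc limit 10;

  -- DESC: rank each range by the distance to its furthest edge/corner
  select * from pts order by b <-> point '(1,2)' desc limit 10;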

> For distance searches, I imagine this as data indexed by BRIN inclusion
> opclass, which creates a bounding box. We could return closest/furthest
> point on the bounding box (from the point used in the query). Which
> seems a bit like an R-tree ...

Kind of; it would allow us to utilize such orderings without the
expensive 1 tuple = 1 index entry and without scanning the full table
before getting results. No tree involved, just a sequential scan on
the index to allow some sketch-based pre-sort on the data. Again, this
would work similarly to how GiST's internal pages work: each downlink in
GiST contains a summary of the entries on the downlinked page, and
distance searches use a priority queue where the priority is the
distance computed by the opclass-provided distance operator - lower distance
means higher priority. For BRIN, we'd have to build a priority queue
for the whole table at once, but presorting table sections is part of
the design of BRIN sort, right?

> But I have no idea what this would do for multi-dimensional searches, or
> what would those searches do? How would you sort such data other than
> lexicographically? Which I think is covered by the current BRIN Sort,
> because the data is either stored as multiple columns, in which case we
> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
> tuple of values, but then it's sorted lexicographically.

Yes, just any BRIN summary that allows distance operators and the like
should be enough. MINMAX is easy to understand, and box inclusion is
IMO also fairly easy to understand.

>>> I haven't really thought about geometric types, just about minmax and
>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>> I mean, we can probably sort points lexicographically, but is anyone
>>> doing that in queries? It seems useless for order by distance.
>>
>> Yes, that's why you would sort them by distance, where the distance is
>> generated by the opclass as min/max distance between the summary and
>> the distance's origin, and then inserted into the tuplesort.
>>
>
> OK, so the query says "order by distance from point X" and we calculate
> the min/max distance of values in a given page range.

Yes, and because it's BRIN that's an approximation, which should
generally be fine.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/10/23 18:18, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>>> only need to extract the normal min/max values for ASC/DESC sorts,
>>> which are readily available in the summary. But for multi-dimensional
>>> and distance searches (nearest neighbour) we need to calculate the
>>> distance between the indexed value(s) and the origin value to compare
>>> the summary against, and the order would thus be asc/desc on distance
>>> - a distance which may not be precisely represented by float types -
>>> thus 'relative order' with its own order operation.
>>>
>>
>> Can you give some examples of such data / queries, and how they would
>> leverage the BRIN sort stuff?
> 
> Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
> the opclass would then decide that `<->(box, point) ASC` means it has
> to return the closest distance from the point to the summary, for some
> measure of 'distance' (in this case L2, <#> for other types, etc.). For DESC,
> that would return the distance from `'(1,2)'::point` to the furthest
> edge of the summary away from that point. Etc.
> 

Thanks.

>> For distance searches, I imagine this as data indexed by BRIN inclusion
>> opclass, which creates a bounding box. We could return closest/furthest
>> point on the bounding box (from the point used in the query). Which
>> seems a bit like an R-tree ...
> 
> Kind of; it would allow us to utilize such orderings without the
> expensive 1 tuple = 1 index entry and without scanning the full table
> before getting results. No tree involved, just a sequential scan on
> the index to allow some sketch-based pre-sort on the data. Again, this
> would work similarly to how GiST's internal pages work: each downlink in
> GiST contains a summary of the entries on the downlinked page, and
> distance searches use a priority queue where the priority is the
> distance computed by the opclass-provided distance operator - lower distance
> means higher priority.

Yes, that's roughly how I understood this too - a tradeoff that won't
give the same performance as GiST, but the index is much smaller and
cheaper to maintain.

> For BRIN, we'd have to build a priority queue
> for the whole table at once, but presorting table sections is part of
> the design of BRIN sort, right?

Yes, that's kinda the whole point of BRIN sort.

> 
>> But I have no idea what this would do for multi-dimensional searches, or
>> what would those searches do? How would you sort such data other than
>> lexicographically? Which I think is covered by the current BRIN Sort,
>> because the data is either stored as multiple columns, in which case we
>> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
>> tuple of values, but then it's sorted lexicographically.
> 
> Yes, just any BRIN summary that allows distance operators and the like
> should be enough. MINMAX is easy to understand, and box inclusion is
> IMO also fairly easy to understand.
> 

True. If minmax is interpreted as inclusion with simple 1D points, it
kinda does the same thing. (Of course, minmax works with data types that
don't have distances, but there's a similarity.)

>>>> I haven't really thought about geometric types, just about minmax and
>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>> doing that in queries? It seems useless for order by distance.
>>>
>>> Yes, that's why you would sort them by distance, where the distance is
>>> generated by the opclass as min/max distance between the summary and
>>> the distance's origin, and then inserted into the tuplesort.
>>>
>>
>> OK, so the query says "order by distance from point X" and we calculate
>> the min/max distance of values in a given page range.
> 
> Yes, and because it's BRIN that's an approximation, which should
> generally be fine.
> 

Approximation in what sense? My understanding was we'd get a range of
distances that we know covers all rows in that range. So the results
should be accurate, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 18:18, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>>>> I haven't really thought about geometric types, just about minmax and
>>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>>> doing that in queries? It seems useless for order by distance.
>>>>
>>>> Yes, that's why you would sort them by distance, where the distance is
>>>> generated by the opclass as min/max distance between the summary and
>>>> the distance's origin, and then inserted into the tuplesort.
>>>>
>>>
>>> OK, so the query says "order by distance from point X" and we calculate
>>> the min/max distance of values in a given page range.
>>
>> Yes, and because it's BRIN that's an approximation, which should
>> generally be fine.
>>
>
> Approximation in what sense? My understanding was we'd get a range of
> distances that we know covers all rows in that range. So the results
> should be accurate, no?

The distance is going to be accurate only to the degree that the
summary can produce accurate distances for the datapoints it
represents. That can be quite imprecise due to the nature of the
contained datapoints: a summary of the points (-1, -1) and (1, 1) will
have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
and (-1, 0.5) would have a much more accurate distance of 1. The point
I was making is that the summary can only approximate the distance,
and that approximation is fine w.r.t. the BRIN sort algorithm.
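
Those numbers are easy to sanity-check with the usual per-axis clamping
formula for the minimum distance to an axis-aligned box (plain SQL,
just as an illustration, not patch code):

  -- summary of (-1,-1) and (1,1): the origin lies inside, so distance 0
  select sqrt(greatest(-1 - 0, 0, 0 - 1)^2 + greatest(-1 - 0, 0, 0 - 1)^2);

  -- summary of (-1,0) and (-1,0.5): closest point is (-1,0), distance 1
  select sqrt(greatest(-1 - 0, 0, 0 - (-1))^2 + greatest(0 - 0, 0, 0 - 0.5)^2);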

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:


On 7/11/23 13:20, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 18:18, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>>>>> I haven't really thought about geometric types, just about minmax and
>>>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>>>> doing that in queries? It seems useless for order by distance.
>>>>>
>>>>> Yes, that's why you would sort them by distance, where the distance is
>>>>> generated by the opclass as min/max distance between the summary and
>>>>> the distance's origin, and then inserted into the tuplesort.
>>>>>
>>>>
>>>> OK, so the query says "order by distance from point X" and we calculate
>>>> the min/max distance of values in a given page range.
>>>
>>> Yes, and because it's BRIN that's an approximation, which should
>>> generally be fine.
>>>
>>
>> Approximation in what sense? My understanding was we'd get a range of
>> distances that we know covers all rows in that range. So the results
>> should be accurate, no?
> 
> The distance is going to be accurate only to the degree that the
> summary can produce accurate distances for the datapoints it
> represents. That can be quite imprecise due to the nature of the
> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
> and (-1, 0.5) would have a much more accurate distance of 1.

Ummm, I'm probably missing something, or maybe my mental model of this
is just wrong, but why would the distance for the second summary be more
accurate? Or what does "more accurate" mean?

Is that about the range of distances for the summary? For the first
range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
that the points may have distance in the range [0, sqrt(2)], while for
the second summary it's [1, sqrt(1.25)].
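
Double-checking those upper bounds in SQL, given that the maximum is
attained at the corner farthest from the origin:

  select sqrt(1^2 + 1^2);    -- first summary:  sqrt(2)    ~ 1.4142
  select sqrt(1^2 + 0.5^2);  -- second summary: sqrt(1.25) ~ 1.1180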

> The point I was making is that the summary can only approximate the
> distance, and that approximation is fine w.r.t. the BRIN sort
> algorithm.
> 

I think as long as the approximation (whatever it means) does not cause
differences in results (compared to not using an index), it's OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
>
> On 7/11/23 13:20, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> Approximation in what sense? My understanding was we'd get a range of
>>> distances that we know covers all rows in that range. So the results
>>> should be accurate, no?
>>
>> The distance is going to be accurate only to the degree that the
>> summary can produce accurate distances for the datapoints it
>> represents. That can be quite imprecise due to the nature of the
>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
>> and (-1, 0.5) would have a much more accurate distance of 1.
>
> Ummm, I'm probably missing something, or maybe my mental model of this
> is just wrong, but why would the distance for the second summary be more
> accurate? Or what does "more accurate" mean?
>
> Is that about the range of distances for the summary? For the first
> range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
> that the points may have distance in the range [0, sqrt(2)], while for
> the second summary it's [1, sqrt(1.25)].

Yes; I was trying to refer to the difference between what results you
get from the summary vs what results you get from the actual
datapoints: In this case, for finding points which are closest to the
origin, the first bounding box has a less accurate estimate than the
second.

> > The point I was making is that the summary can only approximate the
> > distance, and that approximation is fine w.r.t. the BRIN sort
> > algorithm.
> >
>
> I think as long as the approximation (whatever it means) does not cause
> differences in results (compared to not using an index), it's OK.

Agreed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/14/23 16:42, Matthias van de Meent wrote:
> On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>>
>> On 7/11/23 13:20, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Approximation in what sense? My understanding was we'd get a range of
>>>> distances that we know covers all rows in that range. So the results
>>>> should be accurate, no?
>>>
>>> The distance is going to be accurate only to the degree that the
>>> summary can produce accurate distances for the datapoints it
>>> represents. That can be quite imprecise due to the nature of the
>>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>>> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
>>> and (-1, 0.5) would have a much more accurate distance of 1.
>>
>> Ummm, I'm probably missing something, or maybe my mental model of this
>> is just wrong, but why would the distance for the second summary be more
>> accurate? Or what does "more accurate" mean?
>>
>> Is that about the range of distances for the summary? For the first
>> range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
>> that the points may have distance in the range [0, sqrt(2)], while for
>> the second summary it's [1, sqrt(1.25)].
> 
> Yes; I was trying to refer to the difference between what results you
> get from the summary vs what results you get from the actual
> datapoints: In this case, for finding points which are closest to the
> origin, the first bounding box has a less accurate estimate than the
> second.
> 

OK. I think regular minmax indexes have a similar issue with
non-distance ordering, because we don't know if the min/max values are
still in the page range (or deleted/updated).

>>> The point I was making is that the summary can only approximate the
>>> distance, and that approximation is fine w.r.t. the BRIN sort
>>> algorithm.
>>>
>>
>> I think as long as the approximation (whatever it means) does not cause
>> differences in results (compared to not using an index), it's OK.
> 

I haven't written any code yet, but I think if we don't try to find the
exact min/max distances for the summary (e.g. by calculating the closest
point exactly) but rather use "estimates" that are guaranteed to bound
the actual min/max, that's good enough for the sorting.

For the max, this probably is not an issue, as we can just calculate the
distance for the corners and use the maximum of that. At least with
reasonable euclidean distance ... in 2D I'm imagining the bounding box
summary as a rectangle, with the "max distance" being the minimum radius
of a circle (centered on the query point) containing it (the rectangle).

For min we're looking for the largest radius that does not intersect the
box, which seems harder to calculate, I think.
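
Although, for an axis-aligned box the min may be simpler than it looks:
clamp the point to the box per axis and measure the residual. A
throwaway SQL sketch (function name hypothetical, purely illustrative;
assumes x1 <= x2 and y1 <= y2):

  create function dist_range(px float8, py float8,
                             x1 float8, y1 float8, x2 float8, y2 float8,
                             out dmin float8, out dmax float8)
  language sql immutable as $$
    select
      -- min: per-axis distance to the interval, 0 if the point is inside
      sqrt(greatest(x1 - px, 0, px - x2)^2 + greatest(y1 - py, 0, py - y2)^2),
      -- max: farthest corner, i.e. per-axis maximum of the edge distances
      sqrt(greatest(abs(px - x1), abs(px - x2))^2
         + greatest(abs(py - y1), abs(py - y2))^2);
  $$;

  select * from dist_range(0, 0, -1, -1, 1, 1);    -- dmin 0, dmax sqrt(2)
  select * from dist_range(0, 0, -1, 0, -1, 0.5);  -- dmin 1, dmax sqrt(1.25)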

However, now that I'm thinking about it - don't (SP-)GiST indexes
already do pretty much exactly this?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Sergey Dudoladov
Date:
Hello,

> Parallel version is not supported, but I think it should be possible.

@Tomas, are you working on this? If not, I would like to give it a try.

> static void
> AssertCheckRanges(BrinSortState *node)
> {
> #ifdef USE_ASSERT_CHECKING
>
> #endif
> }

I guess it should not be empty at this stage of development.

Attached a small modification of the patch with a draft of the docs.

Regards,
Sergey Dudoladov

Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 8/2/23 17:25, Sergey Dudoladov wrote:
> Hello,
> 
>> Parallel version is not supported, but I think it should be possible.
> 
> @Tomas, are you working on this? If not, I would like to give it a try.
> 

Feel free to try. Just keep it in a separate part/patch, to make it
easier to combine the work later.

>> static void
>> AssertCheckRanges(BrinSortState *node)
>> {
>> #ifdef USE_ASSERT_CHECKING
>>
>> #endif
>> }
> 
> I guess it should not be empty at this stage of development.
> 
> Attached a small modification of the patch with a draft of the docs.
> 

Thanks. FWIW it's generally better to always post the whole patch
series, otherwise the cfbot gets confused as it's unable to combine
stuff from different messages.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
vignesh C
Date:
On Wed, 2 Aug 2023 at 21:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 8/2/23 17:25, Sergey Dudoladov wrote:
> > Hello,
> >
> >> Parallel version is not supported, but I think it should be possible.
> >
> > @Tomas, are you working on this? If not, I would like to give it a try.
> >
>
> Feel free to try. Just keep it in a separate part/patch, to make it
> easier to combine the work later.
>
> >> static void
> >> AssertCheckRanges(BrinSortState *node)
> >> {
> >> #ifdef USE_ASSERT_CHECKING
> >>
> >> #endif
> >> }
> >
> > I guess it should not be empty at this stage of development.
> >
> > Attached a small modification of the patch with a draft of the docs.
> >
>
> Thanks. FWIW it's generally better to always post the whole patch
> series, otherwise the cfbot gets confused as it's unable to combine
> stuff from different messages.

Are we planning to take this patch forward? It has been nearly 5
months since the last discussion on this. If interest has gone down
and there are no plans to handle this, I'm thinking of returning this
entry in the current commitfest; it can be reopened when there is more
interest.

Regards,
Vignesh



Re: PATCH: Using BRIN indexes for sorted output

From
vignesh C
Date:
On Sun, 21 Jan 2024 at 07:32, vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 2 Aug 2023 at 21:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >
> >
> >
> > On 8/2/23 17:25, Sergey Dudoladov wrote:
> > > Hello,
> > >
> > >> Parallel version is not supported, but I think it should be possible.
> > >
> > > @Tomas, are you working on this? If not, I would like to give it a try.
> > >
> >
> > Feel free to try. Just keep it in a separate part/patch, to make it
> > easier to combine the work later.
> >
> > >> static void
> > >> AssertCheckRanges(BrinSortState *node)
> > >> {
> > >> #ifdef USE_ASSERT_CHECKING
> > >>
> > >> #endif
> > >> }
> > >
> > > I guess it should not be empty at this stage of development.
> > >
> > > Attached a small modification of the patch with a draft of the docs.
> > >
> >
> > Thanks. FWIW it's generally better to always post the whole patch
> > series, otherwise the cfbot gets confused as it's unable to combine
> > stuff from different messages.
>
> Are we planning to take this patch forward? It has been nearly 5
> months since the last discussion on this. If interest has gone down
> and there are no plans to handle this, I'm thinking of returning this
> entry in the current commitfest; it can be reopened when there is more
> interest.

Since neither the author nor anyone else has shown interest in taking
it forward, and the patch has had no activity for more than 5 months, I
have changed the status to RWF. Feel free to add a new CF entry when
someone plans to resume work more actively, starting off with a rebased
version.

Regards,
Vignesh