PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

Hi,

There have been a couple discussions about using BRIN indexes for
sorting - in fact this was mentioned even in the "Improving Indexing
Performance" unconference session this year (don't remember by whom).
But I haven't seen any patches, so here's one.

The idea is that we can use information about ranges to split the table
into smaller parts that can be sorted in smaller chunks. For example if
you have a tiny 2MB table with two ranges, with values in [0,100] and
[101,200] intervals, then it's clear we can sort the first range, output
tuples, and then sort/output the second range.
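
To make this concrete, here is a tiny standalone sketch (plain C, just
an illustration of the idea, not code from the patch): with two ranges
whose value intervals do not overlap, sorting and emitting one range at
a time produces globally sorted output without ever sorting the whole
table at once.

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

int main(void)
{
    /* two BRIN-style ranges with disjoint [min,max] value intervals */
    int range1[] = {42, 7, 99, 13};         /* values in [0,100] */
    int range2[] = {150, 101, 200, 177};    /* values in [101,200] */
    int *ranges[] = {range1, range2};

    /* sort and emit one range at a time - the output is globally
     * sorted because the value intervals do not overlap */
    for (int r = 0; r < 2; r++)
    {
        qsort(ranges[r], 4, sizeof(int), cmp_int);
        for (int i = 0; i < 4; i++)
            printf("%d\n", ranges[r][i]);
    }
    return 0;
}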

The attached patch builds "BRIN Sort" paths/plans, closely resembling
index scans, only for BRIN indexes. And this special type of index scan
does what was mentioned above - incrementally sorts the data. It's a bit
more complicated because of overlapping ranges, ASC/DESC, NULL etc.

This is disabled by default, using a GUC enable_brinsort (you may need
to tweak other GUCs to disable parallel plans etc.).

A trivial example, demonstrating the benefits:

  create table t (a int) with (fillfactor = 10);
  insert into t select i from generate_series(1,10000000) s(i);
  create index t_a_idx on t using brin (a);


First, a simple LIMIT query:

explain (analyze, costs off) select * from t order by a limit 10;

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1879.768..1879.770 rows=10 loops=1)
   ->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t
             (actual time=0.007..1353.110 rows=10000000 loops=1)
 Planning Time: 0.083 ms
 Execution Time: 1879.786 ms
(7 rows)

And the same query with enable_brinsort = on:

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1.217..1.219 rows=10 loops=1)
   ->  BRIN Sort using t_a_idx on t
       (actual time=1.216..1.217 rows=10 loops=1)
         Sort Key: a
 Planning Time: 0.084 ms
 Execution Time: 1.234 ms
(5 rows)

That's a pretty nice improvement - of course, this is thanks to having
a perfectly sequential table, and the difference can be made almost
arbitrary by making the table smaller/larger. Similarly, if the table
gets less sequential (making the ranges overlap), the BRIN plan will be
more expensive. Feel free to experiment with other data sets.

However, it's not just LIMIT queries that can improve - consider a sort
of the whole table:

test=# explain (analyze, costs off) select * from t order by a;

                               QUERY PLAN
-------------------------------------------------------------------------
 Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 117528kB
   ->  Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 3766.825 ms
(6 rows)

With enable_brinsort = on:

test=# explain (analyze, costs off) select * from t order by a;
                               QUERY PLAN
-------------------------------------------------------------------------
 BRIN Sort using t_a_idx on t
   (actual time=1.210..2670.875 rows=10000000 loops=1)
   Sort Key: a
 Planning Time: 0.073 ms
 Execution Time: 2939.324 ms
(4 rows)

Right - not a huge difference, but still a nice 25% speedup, mostly due
to not having to spill data to disk and sorting smaller amounts of data.

There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.

1) The paths are created in build_index_paths() because that's what
creates index scans (which the new path resembles). But that is expected
to produce IndexPath, not BrinSortPath, so it's not quite correct.
Should be somewhere "higher" I guess.

2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
FIRST/LAST do not really matter for them. The patch just generates
paths for all 4 combinations (or tries to). Maybe there's a better way.

3) I'm not quite sure the separation of responsibilities between
opfamily and opclass is optimal. I added a new amproc, but maybe this
should be split differently. At the moment only minmax indexes have
this, but adding this to minmax-multi should be trivial.

4) The state changes in nodeBrinSort are a bit confusing. It works, but
may need cleanup and refactoring. Ideas welcome.

5) The costing is essentially just plain cost_index. I have some ideas
about BRIN costing in general, which I'll post in a separate thread (as
it's not specific to this patch).

6) At the moment this only picks one of the index keys, specified in the
ORDER BY clause. I think we can generalize this to multiple keys, but
thinking about multi-key ranges was a bit too much for me. The good
thing is this nicely combines with IncrementalSort.

7) Only plain index keys are supported for the ORDER BY keys, no
expressions. Should not be hard to fix, though.

8) Parallel version is not supported, but I think it shouldn't be
possible. Just make the leader build the range info, and then let the
workers to acquire/sort ranges and merge them by Gather Merge.

9) I was also thinking about leveraging other indexes to quickly
eliminate ranges that need to be sorted. The node does evaluate the filter,
of course, but only after reading the tuple from the range. But imagine
we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
that case we might skip many ranges entirely. Essentially like a bitmap
index scan does, except that building the bitmap incrementally with BRIN
is trivial - you can quickly check if a particular range matches or not.
With other indexes (e.g. btree) you essentially need to evaluate the
filter completely, and only then can you look at the bitmap. That seems
rather against the idea of this patch, which is about low startup cost.
Of course, the condition might be very selective, but then you probably
can just fetch the matching tuples and do a Sort.
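
For illustration, here's the kind of check that makes this cheap for
BRIN minmax - a standalone sketch with made-up names, not the patch's
actual API. A range can be skipped when its [minval,maxval] summary
cannot possibly satisfy the filter:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical summary of one BRIN minmax range */
typedef struct
{
    int minval;
    int maxval;
} RangeSummary;

/* Can a range contain rows matching "col BETWEEN lo AND hi"? If not,
 * the whole range can be skipped without reading any of its tuples. */
static bool range_may_match(const RangeSummary *r, int lo, int hi)
{
    return !(r->maxval < lo || r->minval > hi);
}

int main(void)
{
    RangeSummary ranges[] = {{0, 100}, {101, 200}, {201, 300}};

    /* filter: col BETWEEN 150 AND 180 */
    for (int i = 0; i < 3; i++)
        printf("range %d: %s\n", i,
               range_may_match(&ranges[i], 150, 180) ? "read" : "skip");
    return 0;
}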


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> ...
>
> 8) Parallel version is not supported, but I think it shouldn't be
> possible. Just make the leader build the range info, and then let the
> workers to acquire/sort ranges and merge them by Gather Merge.
>
> ...

Hi,
I am still going over the patch.

Minor: for #8, I guess you meant `it should be possible` .

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/15/22 15:46, Zhihong Yu wrote:
>...
>     8) Parallel version is not supported, but I think it shouldn't be
>     possible. Just make the leader build the range info, and then let the
>     workers to acquire/sort ranges and merge them by Gather Merge.
> ...
> Hi,
> I am still going over the patch.
> 
> Minor: for #8, I guess you meant `it should be possible` .
> 

Yes, I meant to say it should be possible. Sorry for the confusion.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Yes, I meant to say it should be possible. Sorry for the confusion.

Hi,

For brin_minmax_ranges, looking at the assignment to gottuple and reading gottuple, it seems the variable gottuple can be omitted - we can check tup directly.

+   /* Maybe mark the range as processed. */
+   range->processed |= mark_processed;

`Maybe` can be dropped.

For brinsort_load_tuples(), do we need to check for interrupts inside the loop ?
Similar question for subsequent methods involving loops, such as brinsort_load_unsummarized_ranges.

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/15/22 14:33, Tomas Vondra wrote:
> Hi,
> 
> ...
> 
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
> 
> ...

I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is 32TB table, with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges, it's still be ~1GB, which is quite a lot.
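
A quick back-of-the-envelope check of those numbers (a standalone
sketch; the ~32B struct size is from above, the 8kB block size is the
PostgreSQL default):

#include <stdio.h>

int main(void)
{
    const double table_bytes = 32.0 * 1024 * 1024 * 1024 * 1024; /* 32TB */
    const double block_bytes = 8192;        /* default BLCKSZ */
    const double struct_bytes = 32;         /* ~sizeof(BrinRange) */
    const int pages_per_range[] = {1, 128}; /* worst case vs. default */

    for (int i = 0; i < 2; i++)
    {
        double nranges = table_bytes / block_bytes / pages_per_range[i];

        printf("pages_per_range = %3d: %.0f ranges, ~%.0f GB of range info\n",
               pages_per_range[i], nranges,
               nranges * struct_bytes / (1024.0 * 1024 * 1024));
    }
    return 0;
}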

We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.

The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.

The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.

A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.

This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).

Imagine ranges with these minval/maxval values, sorted by maxval:

[101,200]
[201,300]
[301,400]
[150,500]

and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.

We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?

So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to

  - request the next range (sorted by maxval)
  - request the intersecting ranges one by one (sorted by minval)

Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
a plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> ...
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> ...

Hi,
In your example involving [150,500], can this range be broken down into 4 ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses which may reduce the effectiveness of BRIN sort ?

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 03:36, Zhihong Yu wrote:
> ...
> Hi,
> 
> For brin_minmax_ranges, looking at the assignment to gottuple and
> reading gottuple, it seems variable gottuple can be omitted - we can
> check tup directly.
> 
> +   /* Maybe mark the range as processed. */
> +   range->processed |= mark_processed;
> 
> `Maybe` can be dropped.
> 

No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.

> For brinsort_load_tuples(), do we need to check for interrupts inside
> the loop ?
> Similar question for subsequent methods involving loops, such
> as brinsort_load_unsummarized_ranges.
> 

We could/should, although most of the loops should be very short.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:01, Zhihong Yu wrote:
> ...
> Hi,
> In your example involving [150,500], can this range be broken down into
> 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> That way, there is no intersection among the ranges.
> 

Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?

Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.

It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.

Try to formulate the whole algorithm. Maybe I'm missing something.

The current algorithm is something like this:

1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
   (if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
6. load all tuples from no-summarized ranges (first range only)
   (into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
   (into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done

For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
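
To make the steps concrete, here's a compact simulation of steps (4),
(5), (7) and (8) - a sketch only, with integers standing in for tuples,
plain arrays standing in for the tuplesort/tuplestore, no NULLs and no
unsummarized ranges; the intersecting-range lookup is a simple linear
scan here, while the patch keeps a second, minval-sorted array:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int minval;
    int maxval;
    int vals[4];
    int nvals;
    int loaded;
} Range;

static int cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

static int cmp_maxval(const void *a, const void *b)
{
    return ((const Range *) a)->maxval - ((const Range *) b)->maxval;
}

int main(void)
{
    Range ranges[3] = {
        {0, 1000, {900, 10, 1200, 0}, 4, 0},    /* overlaps both others */
        {500, 1500, {1400, 600, 777, 0}, 3, 0},
        {1001, 2000, {2000, 1001, 0, 0}, 2, 0},
    };
    int spill[16];              /* stands in for the "spill" tuplestore */
    int sortbuf[16];            /* stands in for the tuplesort */
    int nspill = 0;

    qsort(ranges, 3, sizeof(Range), cmp_maxval);

    for (int i = 0; i < 3; i++) /* step 4: next range by maxval */
    {
        int maxval = ranges[i].maxval;
        int nsort = 0;
        int oldspill = nspill;

        /* step 5: re-check previously spilled tuples against new maxval */
        nspill = 0;
        for (int k = 0; k < oldspill; k++)
        {
            if (spill[k] <= maxval)
                sortbuf[nsort++] = spill[k];
            else
                spill[nspill++] = spill[k];
        }

        /* step 7: load this range plus all intersecting, not-yet-loaded
         * ranges (minval below maxval), routing each tuple by maxval */
        for (int j = 0; j < 3; j++)
        {
            if (ranges[j].loaded || ranges[j].minval > maxval)
                continue;
            ranges[j].loaded = 1;
            for (int k = 0; k < ranges[j].nvals; k++)
            {
                if (ranges[j].vals[k] <= maxval)
                    sortbuf[nsort++] = ranges[j].vals[k];
                else
                    spill[nspill++] = ranges[j].vals[k];
            }
        }

        /* step 8: sort and output everything up to maxval */
        qsort(sortbuf, nsort, sizeof(int), cmp_int);
        for (int k = 0; k < nsort; k++)
            printf("%d\n", sortbuf[k]);
    }
    return 0;
}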

> bq. can store the info into a tuplestore/tuplesort
> 
> Wouldn't this involve disk accesses which may reduce the effectiveness
> of BRIN sort ?

Yes, it might. But the question is whether the result is still faster
than alternative plans (e.g. seqscan+sort), and those are likely to do
even more I/O.

Moreover, for "regular" cases this shouldn't be a significant issue,
because the stuff will fit into work_mem and so there'll be no I/O. But
it'll handle those extreme cases gracefully.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tom Lane

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption.

TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.

The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables.  It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.  It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.

            regards, tom lane



Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu

On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


> ...
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval
> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>    (if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval
> 6. load all tuples from no-summarized ranges (first range only)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

Hi,
Thanks for the quick reply.

I don't have good answer w.r.t. splitting the page range [0,127] now. Let me think more about it.

The 10-step flow (subject to changes down the road) should either be
given in the description of the patch or written as a comment inside the
code. This would help people grasp the concept much faster.

BTW splill seems to be a typo - I assume you meant spill.

Cheers

Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:41, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> I forgot to mention one important issue in my list yesterday, and that's
>> memory consumption.
> 
> TBH, this is all looking like vastly more complexity than benefit.
> It's going to be impossible to produce a reliable cost estimate
> given all the uncertainty, and I fear that will end in picking
> BRIN-based sorting when it's not actually a good choice.
> 

Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.

As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.

I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?

I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:

1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
   ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
   sample rows vs. ranges)

I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for this patch).
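
As an illustration of item 4, a tiny standalone sketch that computes
the average overlap count for a set of ranges - O(N^2) for clarity;
actual ANALYZE-time code would want something smarter:

#include <stdio.h>

typedef struct
{
    int minval;
    int maxval;
} Range;

int main(void)
{
    /* toy ranges; the real stats would come from the BRIN index */
    Range r[] = {{0, 100}, {50, 150}, {100, 200}, {400, 500}};
    int n = 4;
    int overlaps = 0;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && r[i].minval <= r[j].maxval
                       && r[j].minval <= r[i].maxval)
                overlaps++;

    printf("average overlaps per range: %.2f\n", (double) overlaps / n);
    return 0;
}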

> The examples you showed initially are cherry-picked to demonstrate
> the best possible case, which I doubt has much to do with typical
> real-world tables.  It would be good to see what happens with
> not-perfectly-sequential data before even deciding this is worth
> spending more effort on.

Yes, the example was a trivial "happy case" example. Obviously, the
performance degrades as the data becomes more random (with wider
ranges), forcing the BRIN Sort to read / sort more tuples.

But let's see an example with less correlated data, say, like this:

create table t (a int) with (fillfactor = 10);

insert into t select i + 10000 * random()
                from generate_series(1,10000000) s(i);

With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:

1) select * from t order by a;

  seqscan+sort: 4437 ms
  brinsort: 4233 ms

2) select * from t order by a limit 10;

  seqscan+sort: 1859 ms
  brinsort: 4 ms

If you increase the random factor from 10000 to 100000 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.

Only at 5000000, where we pretty much read 1/2 the table because the
ranges intersect, do we get the same timing as the seqscan (for the LIMIT
query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.

Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).

> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.
> 

I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 16:42, Zhihong Yu wrote:
> ...
> 
> I don't have good answer w.r.t. splitting the page range [0,127] now.
> Let me think more about it.
> 

Sure, no problem.

> The 10 step flow (subject to changes down the road) should be either
> given in the description of the patch or, written as comment inside the
> code.
> This would help people grasp the concept much faster.

True. I'll add it to the next version of the patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

On Sun, 16 Oct 2022 at 16:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.

I think the comparison against a full table scan is appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.

A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.

I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".

If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

First of all, it's really great to see that this is being worked on.

On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Try to formulate the whole algorithm. Maybe I'm missing something.
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval

Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.

> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>    (if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval

Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.

> 6. load all tuples from no-summarized ranges (first range only)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>    (into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):

1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
   - Remove Rs' from Rs
   - store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
  - Remove Ts' from Ts
  - Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
  - Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
     This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
   - remove Rs' from Rs
   - assign Rs' min value to max_sorted
   - store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
   - remove Rs' from Rs
   - store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7

This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.

The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.

Kind regards,

Matthias van de Meent

PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that lead to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.

[0] https://www.postgresql.org/message-id/CAEze2Wi9%3DBay_%3DrTf8Z6WPgZ5V0tDOayszQJJO%3DR_9aaHvr%2BTg%40mail.gmail.com



Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra

On 10/16/22 22:17, Matthias van de Meent wrote:
> First of all, it's really great to see that this is being worked on.
> 
> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>
>> The current algorithm is something like this:
>>
>> 1. request info about ranges from the BRIN opclass
>> 2. sort them by maxval and minval
> 
> Why sort on maxval and minval? That seems wasteful for effectively all
> sorts, where range sort on minval should suffice: If you find a range
> that starts at 100 in a list of ranges sorted at minval, you've
> processed all values <100. You can't make a similar comparison when
> that range is sorted on maxvals.
> 

Because that allows us to identify overlapping ranges quickly.

Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval) that nevertheless overlaps with
the range we're currently processing?

Consider these ranges sorted by maxval

  range #1  [0,100]
  range #2  [101,200]
  range #3  [150,250]
  ...
  range #1000000 [190,1000000000]

Processing range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 overlaps it, so we need to load its tuples too. But
there may be other ranges (at an arbitrary distance) also overlapping.

So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.

The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care about new matches since
the previous range, so we can remember the index into the array and
start from it. And we can stop once the minval exceeds the maxval of the
range from the first step, because we'll only sort tuples up to that
point.
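
A standalone sketch of that two-array trick (illustration only, using
the ranges from the example above, with integers instead of Datums):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int minval;
    int maxval;
} Range;

static int cmp_maxval(const void *a, const void *b)
{
    return ((const Range *) a)->maxval - ((const Range *) b)->maxval;
}

static int cmp_minval(const void *a, const void *b)
{
    return ((const Range *) a)->minval - ((const Range *) b)->minval;
}

int main(void)
{
    Range by_max[] = {{0, 100}, {101, 200}, {150, 250}, {190, 1000}};
    Range by_min[] = {{0, 100}, {101, 200}, {150, 250}, {190, 1000}};
    int pos = 0;    /* remembered position in the minval-sorted array */

    qsort(by_max, 4, sizeof(Range), cmp_maxval);
    qsort(by_min, 4, sizeof(Range), cmp_minval);

    for (int i = 0; i < 4; i++)
    {
        printf("current range [%d,%d], newly matched:",
               by_max[i].minval, by_max[i].maxval);

        /* only inspect minvals we haven't consumed yet, and stop once
         * the minval exceeds the current maxval - each range is thus
         * matched exactly once (including the current range itself) */
        while (pos < 4 && by_min[pos].minval <= by_max[i].maxval)
        {
            printf(" [%d,%d]", by_min[pos].minval, by_min[pos].maxval);
            pos++;
        }
        printf("\n");
    }
    return 0;
}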

>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>> 4. read the next range (by maxval) into tuplesort
>>    (if no more ranges, go to (9))
>> 5. load all tuples from "splill" tuplestore, compare to maxval
> 
> Instead of this, shouldn't an update to tuplesort that allows for
> restarting the sort be better than this? Moving tuples that we've
> accepted into BRINsort state but not yet returned around seems like a
> waste of cycles, and I can't think of a reason why it can't work.
> 

I don't understand what you mean by "update to tuplesort". Can you
elaborate?

The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval

   [0,1000]
   [500,1500]
   [1001,2000]
   ...

We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).

So why sort these tuples at all? Imagine an imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplesort, we're now sorting 11x the amount of data.

Or maybe I just don't understand what you mean.


>> 6. load all tuples from no-summarized ranges (first range only)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 7. load all intersecting ranges (with minval < current maxval)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 8. sort the tuplesort, output all tuples, then back to (4)
>> 9. NULLS LAST: read all ranges that might have NULLs => output
>> 10. done
>>
>> For "DESC" ordering the process is almost the same, except that we swap
>> minval/maxval in most places.
> 
> When I was thinking about this feature at the PgCon unconference, I
> was thinking about it more along the lines of the following system
> (for ORDER BY col ASC NULLS FIRST):
> 
> 1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
> [has_nulls, min ASC]
> 2. scan info about ranges from BRIN, store them in Rs.
> 3. Finalize the sorting of Rs.
> 4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
> column ordering.
> 5. load all tuples from no-summarized ranges into Ts'
> 6. while Rs has a block range Rs' with has_nulls:
>    - Remove Rs' from Rs
>    - store the tuples of Rs' range in Ts.
> We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
> 7. Finalize the Ts sorted set.
> 8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
>   - Remove Ts' from Ts
>   - Yield Ts'
> Now, all tuples up to and including max_sorted are yielded.
> 9. If there are no more ranges in Rs:
>   - Yield all remaining tuples from Ts, then return.
> 10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
>      This is different from Tomas' implementation, as he loads the
> tuples into a new tuplestore.
> 11. get the next item from Rs: Rs'
>    - remove Rs' from Rs
>    - assign Rs' min value to max_sorted
>    - store the tuples of Rs' range in Ts

I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.

> 12. while the next item Rs' from Rs has a min value of max_sorted:
>    - remove Rs' from Rs
>    - store the tuples of Rs' range in Ts
> 13. The 'new' value from the next item from Rs is stored in
> max_sorted. If no such item exists, max_sorted is assigned a sentinel
> value (+INF)
> 14. Go to Step 7
> 
> This set of operations requires a restarting tuplesort for Ts, but I
> don't think that would result in many API changes for tuplesort. It
> reduces the overhead of large overlapping ranges, as it doesn't need
> to copy all tuples that have been read from disk but have not yet been
> returned.
> 
> The maximum cost of this tuplesort would be the cost of sorting a
> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> extra compare per tuple and range that are needed to determine whether
> the range or tuple should be extracted from the tuplesort. The minimum
> cost would be the cost of sorting all BRIN ranges, plus sorting all
> tuples in one of the index's ranges.
> 

I'm not a tuplesort expert, but my assumption is that it's better to
sort smaller amounts of rows - which is why the patch sorts only the
rows it knows it can actually output.

> Kind regards,
> 
> Matthias van de Meent
> 
> PS. Are you still planning on giving the HOT optimization for BRIN a
> second try? I'm fairly confident that my patch at [0] would fix the
> issue that lead to the revert of that feature, but it introduced ABI
> changes after the feature freeze and thus it didn't get in. The patch
> might need some polishing, but I think it shouldn't take too much
> extra effort to get into PG16.
> 

Thanks for reminding me, I'll take a look before the next CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent

On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 10/16/22 22:17, Matthias van de Meent wrote:
> > On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> Try to formulate the whole algorithm. Maybe I'm missing something.
> >>
> >> The current algorithm is something like this:
> >>
> >> 1. request info about ranges from the BRIN opclass
> >> 2. sort them by maxval and minval
> >
> > Why sort on maxval and minval? That seems wasteful for effectively all
> > sorts, where range sort on minval should suffice: If you find a range
> > that starts at 100 in a list of ranges sorted at minval, you've
> > processed all values <100. You can't make a similar comparison when
> > that range is sorted on maxvals.
>
> Because that allows to identify overlapping ranges quickly.
>
> Imagine you have the ranges sorted by maxval, which allows you to add
> tuples in small increments. But how do you know there's not a range
> (possibly with arbitrarily high maxval), that however overlaps with the
> range we're currently processing?

Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.

> >> 3. NULLS FIRST: read all ranges that might have NULLs => output
> >> 4. read the next range (by maxval) into tuplesort
> >>    (if no more ranges, go to (9))
> >> 5. load all tuples from "splill" tuplestore, compare to maxval
> >
> > Instead of this, shouldn't an update to tuplesort that allows for
> > restarting the sort be better than this? Moving tuples that we've
> > accepted into BRINsort state but not yet returned around seems like a
> > waste of cycles, and I can't think of a reason why it can't work.
> >
>
> I don't understand what you mean by "update to tuplesort". Can you
> elaborate?

Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.

For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
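
To sketch the workflow I have in mind in Python (purely an
illustration - the class and method names here are invented, this is
not the tuplesort API):

  import heapq

  class RestartableSort:
      # A toy stand-in for a tuplesort that can be "un-finalized":
      # more tuples can be loaded after some have been extracted. A
      # binary heap keeps the already-loaded data ordered, so adding
      # more data is cheap.
      def __init__(self):
          self._heap = []

      def load(self, value):
          heapq.heappush(self._heap, value)

      def extract_below(self, bound):
          # Emit values strictly below 'bound'; the rest stay loaded.
          while self._heap and self._heap[0] < bound:
              yield heapq.heappop(self._heap)

  s = RestartableSort()
  for v in (0, 999, 1000):           # tuples from range [0,1000]
      s.load(v)
  print(list(s.extract_below(50)))   # [0]
  s.load(50)                         # "restarted": load more tuples
  print(list(s.extract_below(998)))  # [50]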

>  The point of spilling them into a tuplestore is to make the sort cheaper
> by not sorting tuples that can't possibly be produced, because the value
> exceeds the current maxval. Consider ranges sorted by maxval
> [...]
>
> Or maybe I just don't understand what you mean.

If we sort the ranges by minval like this:

1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]

Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the number of tuples in the tuplesort to the minimum needed to output
any specific value.
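
As a self-contained Python sketch of this scheme (using a heap in
place of the tuplesort; the ranges and values are made up for
illustration):

  import heapq

  # Ranges as (minval, maxval, values), already sorted by minval.
  ranges = [
      (0, 1000, [3, 700, 999]),
      (0, 999, [1, 500]),
      (50, 998, [50, 60]),
      (100, 997, [100, 200]),
      (100, 996, [150, 996]),
      (150, 995, [151, 400]),
  ]

  heap = []
  for i, (minval, maxval, values) in enumerate(ranges):
      for v in values:
          heapq.heappush(heap, v)
      # Everything below the next range's minval is safe to emit.
      next_minval = ranges[i + 1][0] if i + 1 < len(ranges) else None
      while heap and (next_minval is None or heap[0] < next_minval):
          print(heapq.heappop(heap))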

If the ranges are sorted and loaded by maxval, like your algorithm expects:

1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]

We need to load all ranges into the sort before it can start
emitting any tuples, as all ranges overlap with the first range.

> > [algo]
>
> I don't think this works, because we may get a range (Rs') with very
> high maxval (thus read very late from Rs), but with very low minval.
> AFAICS max_sorted must never go back, and this breaks it.

max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.

There is a small issue in my algorithm where I use <= for yielding
values where it should be <, and the initialization of max_value to
NULL is then incorrect, but apart from that I don't think there are any
issues with the base algorithm.

> > The maximum cost of this tuplesort would be the cost of sorting a
> > seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> > extra compare per tuple and range that are needed to determine whether
> > the range or tuple should be extracted from the tuplesort. The minimum
> > cost would be the cost of sorting all BRIN ranges, plus sorting all
> > tuples in one of the index's ranges.
> >
>
> I'm not a tuplesort expert, but my assumption is that it's better to
> sort smaller numbers of rows - which is why the patch sorts only the
> rows it knows it can actually output.

I see that the two main differences between our designs are in
answering these questions:

- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?

For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.

For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.

Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 10/17/22 16:00, Matthias van de Meent wrote:
> On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 10/16/22 22:17, Matthias van de Meent wrote:
>>> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>>>
>>>> The current algorithm is something like this:
>>>>
>>>> 1. request info about ranges from the BRIN opclass
>>>> 2. sort them by maxval and minval
>>>
>>> Why sort on maxval and minval? That seems wasteful for effectively all
>>> sorts, where range sort on minval should suffice: If you find a range
>>> that starts at 100 in a list of ranges sorted on minval, you've
>>> processed all values <100. You can't make a similar comparison when
>>> that range is sorted on maxvals.
>>
>> Because that allows us to identify overlapping ranges quickly.
>>
>> Imagine you have the ranges sorted by maxval, which allows you to add
>> tuples in small increments. But how do you know there's not a range
>> (possibly with an arbitrarily high maxval) that overlaps with the
>> range we're currently processing?
> 
> Why do we need to identify overlapping ranges specifically? If you
> sort by minval, it becomes obvious that any subsequent range cannot
> contain values < the minval of the next range in the list, allowing
> you to emit any values less than the next, unprocessed, minmax range's
> minval.
> 

D'oh! I think you're right, it should be possible to do this with only
a sort by minval. And it might actually be a better way to do it.

I think I chose the "maxval" ordering because looking at the current
range and using its maxval as the threshold seemed natural. But it
leads to a bunch of complexity with the intersecting ranges, and I
never reconsidered this choice. Silly me.

>>>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>>>> 4. read the next range (by maxval) into tuplesort
>>>>    (if no more ranges, go to (9))
>>>> 5. load all tuples from "spill" tuplestore, compare to maxval
>>>
>>> Instead of this, wouldn't an update to tuplesort that allows for
>>> restarting the sort be better? Moving around tuples that we've
>>> accepted into the BRINsort state but not yet returned seems like a
>>> waste of cycles, and I can't think of a reason why it can't work.
>>>
>>
>> I don't understand what you mean by "update to tuplesort". Can you
>> elaborate?
> 
> Tuplesort currently only allows the following workflow: you load
> tuples, then call finalize, then extract tuples. There is currently no
> way to add tuples once you've started extracting them.
> 
> For my design to work efficiently or without hacking into the
> internals of tuplesort, we'd need a way to restart or 'un-finalize'
> the tuplesort so that it returns to the 'load tuples' phase. Because
> all data of the previous iteration is already sorted, adding more data
> shouldn't be too expensive.
> 

Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).

FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.

I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although maybe that could be derived from the tuplesort size of the
last sort.

Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with a separate tuplestore. There's
enough complexity in the patch already, I think. The only thing we'll
need with the minval ordering is the ability to "peek ahead" to the next
minval (which is going to be the threshold used to route values either
to the tuplesort or the tuplestore).
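
Roughly like the following Python sketch (names invented for
illustration - 'batch' stands in for the per-step tuplesort and
'spill' for the tuplestore):

  # Ranges sorted by minval: (minval, values in the range). Each step
  # peeks ahead at the next minval, which becomes the threshold
  # routing rows either into the sort or into the spill tuplestore.
  ranges = [(0, [5, 250]), (10, [15, 40]), (200, [210, 220])]

  spill = []                      # rows above the threshold
  out = []
  for i, (minval, values) in enumerate(ranges):
      threshold = ranges[i + 1][0] if i + 1 < len(ranges) else None
      batch = []                  # rows that are safe to output
      pending, spill = spill + values, []
      for row in pending:
          if threshold is None or row < threshold:
              batch.append(row)
          else:
              spill.append(row)   # re-spill for a later step
      batch.sort()                # only sort rows we can actually emit
      out.extend(batch)
  print(out)                      # [5, 15, 40, 210, 220, 250]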

>>  The point of spilling them into a tuplestore is to make the sort cheaper
>> by not sorting tuples that can't possibly be produced, because the value
>> exceeds the current maxval. Consider ranges sorted by maxval
>> [...]
>>
>> Or maybe I just don't understand what you mean.
> 
> If we sort the ranges by minval like this:
> 
> 1. [0,1000]
> 2. [0,999]
> 3. [50,998]
> 4. [100,997]
> 5. [100,996]
> 6. [150,995]
> 
> Then we can load and sort the values for range 1 and 2, and emit all
> values up to (not including) 50 - the minval of the next,
> not-yet-loaded range in the ordered list of ranges. Then add the
> values from range 3 to the set of tuples we have yet to output; sort;
> and then emit values up to 100 (range 4's minval), etc. This reduces
> the number of tuples in the tuplesort to the minimum needed to output
> any specific value.
> 
> If the ranges are sorted and loaded by maxval, like your algorithm expects:
> 
> 1. [150,995]
> 2. [100,996]
> 3. [100,997]
> 4. [50,998]
> 5. [0,999]
> 6. [0,1000]
> 
> We need to load all ranges into the sort before it can start
> emitting any tuples, as all ranges overlap with the first range.
> 

Right, thanks - I get this now.

>>> [algo]
>>
>> I don't think this works, because we may get a range (Rs') with very
>> high maxval (thus read very late from Rs), but with very low minval.
>> AFAICS max_sorted must never go back, and this breaks it.
> 
> max_sorted cannot go back, because it is the min value of the next
> range in the list of ranges sorted by min value; see also above.
> 
> There is a small issue in my algorithm where I use <= for yielding
> values where it should be <, and the initialization of max_value to
> NULL is then incorrect, but apart from that I don't think there are any
> issues with the base algorithm.
> 
>>> The maximum cost of this tuplesort would be the cost of sorting a
>>> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
>>> extra compare per tuple and range that are needed to determine whether
>>> the range or tuple should be extracted from the tuplesort. The minimum
>>> cost would be the cost of sorting all BRIN ranges, plus sorting all
>>> tuples in one of the index's ranges.
>>>
>>
>> I'm not a tuplesort expert, but my assumption is that it's better to
>> sort smaller numbers of rows - which is why the patch sorts only the
>> rows it knows it can actually output.
> 
> I see that the two main differences between our designs are in
> answering these questions:
> 
> - How do we select table ranges for processing?
> - How do we handle tuples that we know we can't output yet?
> 
> For the first, I think the differences are explained above. The main
> drawback of your selection algorithm seems to be that your algorithm's
> worst-case is "all ranges overlap", whereas my algorithm's worst case
> is "all ranges start at the same value", which is only a subset of
> your worst case.
> 

Right, those are very good points.

> For the second, the difference is whether we choose to sort the tuples
> that are out-of-bounds, but are already in the working set due to
> being returned from a range overlapping with the current bound.
> My algorithm tries to reduce the overhead of increasing the sort
> boundaries by also sorting the out-of-bound data, allowing for
> O(n-less-than-newbound) overhead of extending the bounds (total
> complexity for whole sort O(n-out-of-bound)), and O(n log n)
> processing of all tuples during insertion.
> Your algorithm - if I understand it correctly - seems to optimize for
> faster results within the current bound by not sorting the
> out-of-bounds data with O(1) processing when out-of-bounds, at the
> cost of needing O(n-out-of-bound-tuples) operations when the maxval /
> max_sorted boundary is increased, with a complexity of O(n*m) for an
> average of n out-of-bound tuples and m bound updates.
> 

Right. I wonder if these are actually complementary approaches, and we
could/should pick between them depending on how many rows we expect to
consume.

My focus was LIMIT queries, so I favored the approach with the lowest
startup cost. I haven't quite planned for this to work so well even in
full-sort cases. That kinda surprised me (I wonder if the very large
tuplesorts - compared to work_mem - would hurt this, though).

> Lastly, there is the small difference in how the ranges are extracted
> from BRIN: I prefer and mention an iterative approach where the tuples
> are extracted from the index and loaded into a tuplesort in some
> iterative fashion (which spills to disk and does not need all tuples
> to reside in memory), whereas your current approach was mentioned as
> (paraphrasing) 'allocate all this data in one chunk and hope that
> there is enough memory available'. I think this is not so much a
> disagreement in best approach, but mostly a case of what could be made
> to work; so in later updates I hope we'll see improvements here.
> 

Right. I think I mentioned this in my post [1], where I also envisioned
some sort of iterative approach. And I think you're right that the
approach with ordering by minval is naturally more suitable, because it
just consumes a single sequence of ranges.


regards

[1]
https://www.postgresql.org/message-id/1a7c2ff5-a855-64e9-0272-1f9947f8a558%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

here's an updated/reworked version of the patch, on top of the "BRIN
statistics" patch as 0001 (because some of the stuff is useful, but we
can ignore this part in this thread).

Warning: I realized the new node is somewhat broken when it comes to
projection and matching the indexed column, most likely because the
targetlists are wired/processed incorrectly or something like that. So
when experimenting with this, just index the first column of the table
and don't do anything requiring a projection. I'll get this fixed, but
I've been focusing on the other stuff. I'm not particularly familiar
with this tlist/project stuff, so any help is welcome.


The main change in this version is the adoption of multiple ideas
suggested by Matthias in his earlier responses.

Firstly, this changes how the index opclass passes information to the
executor node. Instead of using a plain array, we now use a tuplesort.
This addresses the memory consumption issues with large number of
ranges, and it also simplifies the sorting etc. which is now handled by
the tuplesort. The support procedure simply fills a tuplesort and then
hands it over to the caller (more or less).


Secondly, instead of ordering the ranges by maxval, this orders them by
minval (as suggested by Matthias), which greatly simplifies the code
because we don't need to detect overlapping ranges etc.

More precisely, the ranges are sorted to get this ordering:

- not yet summarized ranges
- ranges sorted by (minval, blkno)
- all-nulls ranges

This particular ordering is beneficial for the algorithm, which does two
passes over the ranges. For the NULLS LAST case (i.e. the default), we
do this:

- produce tuples with non-NULL values, ordered by the value
- produce tuples with NULL values (arbitrary order)

Each of these phases does a separate pass over the ranges (I'll get to
that in a minute), and the ordering is tailored to this.

Note: For DESC we'd sort by maxval, and for NULLS FIRST the phases would
happen in the opposite order, but those are details. Let's assume ASC
ordering with NULLS LAST, unless stated otherwise.

The idea here is that the not-yet-summarized ranges always need to be
processed, both when producing NULLs and non-NULL values, which happens
as two separate passes over the ranges.

The all-null ranges don't need to be processed during the non-NULL pass,
and we can terminate this pass early once we hit the first null-only
range. So placing them last helps with this.

The regular ranges are ordered by minval, as dictated by the algorithm
(which is now described in the nodeBrinSort.c comment), but we also sort
them by blkno to make the access a bit more sequential. This only
matters for ranges with the same minval, which is probably rare, but the
extra sort key is cheap, so why not.
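
As a tiny illustration of that ordering, the sort key in Python (the
field names here are invented, not taken from the patch):

  ranges = [
      {"blkno": 0,  "summarized": True,  "all_nulls": False, "minval": 100},
      {"blkno": 8,  "summarized": False, "all_nulls": False, "minval": None},
      {"blkno": 16, "summarized": True,  "all_nulls": True,  "minval": None},
      {"blkno": 24, "summarized": True,  "all_nulls": False, "minval": 100},
      {"blkno": 32, "summarized": True,  "all_nulls": False, "minval": 50},
  ]

  def range_sort_key(r):
      # Not-yet-summarized ranges first, then regular ranges ordered
      # by (minval, blkno), all-nulls ranges last.
      if not r["summarized"]:
          group = 0
      elif r["all_nulls"]:
          group = 2
      else:
          group = 1
      return (group, r["minval"] if group == 1 else 0, r["blkno"])

  for r in sorted(ranges, key=range_sort_key):
      print(r["blkno"], r["minval"])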

I mentioned we do two separate passes - one for non-NULL values, one for
NULL values. That may be somewhat inefficient, because in extreme cases
we might end up scanning the whole table twice (imagine BRIN ranges
where each range has both regular values and NULLs). It might be
possible to do all of this in a single pass, at least in some cases -
for example while scanning ranges, we might stash NULL rows into a
tuplestore, so that the second pass is not needed. That assumes there
are not too many such rows (otherwise we might need to write and then
read many rows, outweighing the cost of just doing two passes). This
should be possible to estimate/cost fairly well, I think, and the
comment in nodeBrinSort actually presents some ideas about this.
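
For the NULLS LAST case, that single-pass idea might look like this
Python sketch (a hedged illustration only; 'stash' plays the role of
the extra tuplestore):

  def nulls_last_single_pass(rows):
      # While producing non-NULL values (assumed to arrive in order
      # from the machinery above), stash NULL rows aside so no second
      # pass over the table is needed; emit the stash at the end.
      stash = []
      for row in rows:
          if row is None:
              stash.append(row)
          else:
              yield row
      yield from stash

  print(list(nulls_last_single_pass([1, None, 2, None, 3])))
  # [1, 2, 3, None, None]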

And we can't do that for the NULLS FIRST case, because if we stash the
non-NULL rows somewhere, we won't be able to do the "incremental" sort,
i.e. we might just do regular Sort right away. So I've kept this simple
approach with two passes for now.


This still uses the approach with spilling tuples to a tuplestore, and
only sorting rows that we know are safe to output. I still think this is
a good approach, for the reasons I explained before, but changing this
is not hard so we can experiment.

There's however a related question - how quickly should we increment the
minval value, serving as a watermark? One option is to go to the next
distinct minval value - but that may result in an excessive number of
tiny sorts, because the number of ranges and rows between the old and
new minval values tends to be small. Another negative consequence is
that this may cause a lot of spilling (and re-spilling), because we only
consume a tiny number of rows from the tuplestore after incrementing the
watermark.

Or we can do larger steps, skipping some of the minval values, so that
more rows qualify into the sort. Of course, too large a step means we'll
exceed work_mem and switch to an on-disk sort, which we probably don't
want. Also, this may be the wrong thing to do for LIMIT queries that
only need a couple of rows, where a tiny sort is fine (because we won't
do too many of them).

Patch 0004 introduces a new GUC called brinsort_watermark_step, which
can be used to experiment with this. By default it's set to '1', which
means we simply progress to the next minval value.
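
To make the step semantics concrete, a minimal Python sketch (not
actual patch code):

  minvals = [0, 10, 20, 30, 40, 50]  # distinct minvals, ascending
  step = 2                           # brinsort_watermark_step analogue

  # step=1 makes every distinct minval a watermark; larger steps skip
  # some minvals so more rows qualify into each sort.
  watermarks = minvals[step::step] + [None]  # None = no upper bound
  print(watermarks)                          # [20, 40, None]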

Then 0005 tries to customize this based on statistics - we estimate the
number of rows each minval increment is expected to "add", and then pick
a step value such that we don't overflow work_mem. This happens in
create_brinsort_plan, and the comment explains the main weakness - the
way the number of rows is estimated is somewhat naive, as it just
divides reltuples by the number of ranges. But I have a couple of ideas
about what statistics we might collect, explained in 0001 in the comment
at brin_minmax_stats.
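
A sketch of that estimate, with work_mem expressed as a row budget for
simplicity (all names here are invented):

  def estimate_watermark_step(reltuples, nranges, work_mem_rows):
      # Naive estimate: assume rows are spread uniformly over the
      # ranges, so each minval increment adds about reltuples/nranges
      # rows; pick the largest step expected to fit into the budget.
      rows_per_step = max(1, reltuples // nranges)
      return max(1, work_mem_rows // rows_per_step)

  print(estimate_watermark_step(10_000_000, 5_000, 50_000))  # -> 25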

But there's another option - we can tune the step based on past sorts.
If we see the sorts spilling to disk, maybe try smaller steps. Patch
0006 implements a very simple variant of this. There are a couple of
ideas about how it might be improved, mentioned in the comment at
brinsort_adjust_watermark_step.
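
In the spirit of that feedback loop, a minimal sketch (not the actual
heuristic from the patch):

  def adjust_watermark_step(step, last_sort_spilled):
      # Back off when the previous sort spilled to disk, otherwise
      # grow the step slowly to amortize the per-sort overhead.
      if last_sort_spilled:
          return max(1, step // 2)
      return step + 1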

There's also patch 0003, which extends the EXPLAIN output with counters
tracking the number of sorts, counts of on-disk/in-memory sorts, space
used, number of rows sorted/spilled, and so on. This is useful when
analyzing e.g. the effect of higher/lower watermark steps, discussed in
the preceding paragraphs.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
> Of course, if there are e.g. BTREE indexes this is going to be slower,
> but people are unlikely to have both index types on the same column.

On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
> I don't think it's all that unfair. How likely is it to have both a BRIN
> and btree index on the same column? And even if you do have such indexes

Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b).  I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).

> +    /* There must not be any TID scan in progress yet. */
> +    Assert(node->ss.ss_currentScanDesc == NULL);
> +
> +    /* Initialize the TID range scan, for the provided block range. */
> +    if (node->ss.ss_currentScanDesc == NULL)
> +    {

Why is this conditional on the condition that was just Assert()ed?

>  
> +void
> +cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
> +           bool partial_path)

It'd be nice to refactor the existing code to avoid this part being so
duplicative.

> +     * In some situations (particularly with OR'd index conditions) we may
> +     * have scan_clauses that are not equal to, but are logically implied by,
> +     * the index quals; so we also try a predicate_implied_by() check to see

Isn't that somewhat expensive?

If that's known, then it'd be good to say that in the documentation.

> +    {
> +        {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
> +            gettext_noop("Enables the planner's use of BRIN sort plans."),
> +            NULL,
> +            GUC_EXPLAIN
> +        },
> +        &enable_brinsort,
> +        false,

I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit that new code paths.

Also, note that indxpath.c had the var initialized to true.

> +            attno = (i + 1);
> +       nranges = (nblocks / pagesPerRange);
> +                               node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;

I'm curious why you have parentheses in these places?

> +#ifndef NODEBrinSort_H
> +#define NODEBrinSort_H

NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.

This needed a bunch of work to pass any of the regression tests - even
with the feature set to off.

 . meson.build needs the same change as the corresponding ./Makefile.
 . guc missing from postgresql.conf.sample
 . brin_validate.c is missing support for the opr function.
   I gather you're planning on changing this part (?) but this allows to
   pass tests for now.
 . mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
   https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
 . Uninitialized catalog attribute.
 . Some typos in your other patches: "heuristics heuristics".  ste.
   lest (least).

-- 
Justin

Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 10/24/22 06:32, Justin Pryzby wrote:
> On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
>> Of course, if there are e.g. BTREE indexes this is going to be slower,
>> but people are unlikely to have both index types on the same column.
> 
> On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
>> I don't think it's all that unfair. How likely is it to have both a BRIN
>> and btree index on the same column? And even if you do have such indexes
> 
> Note that we (at my work) use unique, btree indexes on multiple columns
> for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
> plus a separate set of indexes on all tables, used for searching:
> BRIN(a) and BTREE(b).  I'd hope that the costing is accurate enough to
> prefer the btree index for searching the most-recent table, if that's
> what's faster (for example, if columns b and c are specified).
> 

Well, the costing is very crude at the moment - it's pretty much just a
copy of the existing BRIN costing. And the cost is likely going to
increase, because brinsort needs to do a regular BRIN bitmap scan (more
or less) and then also a sort (which is an extra cost, of course). So if
it works now, I don't see why brinsort would break it. Moreover, if you
don't have ORDER BY in the query, I don't see why we would create a
brinsort at all.

But if you could test this once the costing gets improved, that'd be
very valuable.

>> +    /* There must not be any TID scan in progress yet. */
>> +    Assert(node->ss.ss_currentScanDesc == NULL);
>> +
>> +    /* Initialize the TID range scan, for the provided block range. */
>> +    if (node->ss.ss_currentScanDesc == NULL)
>> +    {
> 
> Why is this conditional on the condition that was just Assert()ed?
> 

Yeah, that's a mistake, due to how the code evolved.

>>  
>> +void
>> +cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
>> +           bool partial_path)
> 
> It'd be nice to refactor the existing code to avoid this part being so
> duplicative.
> 
>> +     * In some situations (particularly with OR'd index conditions) we may
>> +     * have scan_clauses that are not equal to, but are logically implied by,
>> +     * the index quals; so we also try a predicate_implied_by() check to see
> 
> Isn't that somewhat expensive?
> 
> If that's known, then it'd be good to say that in the documentation.
> 

Some of this is probably a residue from create_indexscan_path and may
not be needed for this new node.

>> +    {
>> +        {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
>> +            gettext_noop("Enables the planner's use of BRIN sort plans."),
>> +            NULL,
>> +            GUC_EXPLAIN
>> +        },
>> +        &enable_brinsort,
>> +        false,
> 
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit the new code paths.
> 
> Also, note that indxpath.c had the var initialized to true.
> 

Good point.

>> +            attno = (i + 1);
>> +       nranges = (nblocks / pagesPerRange);
>> +                               node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
> 
> I'm curious why you have parentheses in these places?
> 

Not sure, it seemed more readable when writing the code I guess.

>> +#ifndef NODEBrinSort_H
>> +#define NODEBrinSort_H
> 
> NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
> But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
> 

Yeah, stupid search/replace on the indexscan code, which was used as a
starting point.

> This needed a bunch of work to pass any of the regression tests -
> even with the feature set to off.
> 
>  . meson.build needs the same change as the corresponding ./Makefile.
>  . guc missing from postgresql.conf.sample
>  . brin_validate.c is missing support for the opr function.
>    I gather you're planning on changing this part (?) but this allows to
>    pass tests for now.
>  . mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
>    https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
>  . Uninitialized catalog attribute.
>  . Some typos in your other patches: "heuristics heuristics".  ste.
>    lest (least).
> 

Thanks, I'll get this fixed. I've posted the patch as a PoC to showcase
it and gather some feedback; I should have mentioned it's incomplete in
these ways.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Greg Stark
Date:
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to, last I looked -- not sure if it went out
with the tapesort ...

For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.

For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.
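
Something like this toy Python sketch using heapq (the ranges are made
up for illustration):

  import heapq

  # Ranges sorted by minvalue: (minvalue, tuples in the range).
  ranges = [(0, [7, 900]), (100, [100, 150]), (400, [400, 410])]

  heap, out = [], []
  for minvalue, tuples in ranges:
      # Pop everything that sorts before the new range's minvalue...
      while heap and heap[0] < minvalue:
          out.append(heapq.heappop(heap))
      # ...then insert the new range's tuples.
      for t in tuples:
          heapq.heappush(heap, t)
  while heap:
      out.append(heapq.heappop(heap))
  print(out)  # [7, 100, 150, 400, 410, 900]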

I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 11/16/22 22:52, Greg Stark wrote:
> Fwiw tuplesort does do something like what you want for the top-k
> case. At least it used to last I looked -- not sure if it went out
> with the tapesort ...
>
> For top-k it inserts new tuples into the heap data structure and then
> pops the top element out of the heap. That keeps a fixed number of
> elements in the heap. It's always inserting and removing at the same
> time. I don't think it would be very hard to add a tuplesort interface
> to access that behaviour.
> 


Bounded sorts are still there, implemented using a heap (which is what
you're talking about, I think). I actually looked at it some time ago,
and it didn't look like a particularly good match for the general case
(without an explicit LIMIT). Bounded sorts require specifying the number
of tuples up front, and then discard the remaining tuples. But you don't
know how many tuples you'll actually find before the next minval - you
have to keep them all.

Maybe we could feed the tuples into a (sorted) heap incrementally, and
consume tuples until the next minval value. I'm not against exploring
that idea, but it certainly requires more work than just slapping some
interface on the existing code.

> For something like BRIN you would sort the ranges by minvalue then
> insert all the tuples for each range. Before inserting tuples for a
> new range you would first pop out all the tuples that are < the
> minvalue for the new range.
> 

Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.

> I'm not sure how you handle degenerate BRIN indexes that behave
> terribly. Like, if many BRIN ranges covered the entire key range.
> Perhaps there would be a clever way to spill the overflow and switch
> to quicksort for the spilled tuples without wasting lots of work
> already done and without being too inefficient.

In two ways:

1) Don't have such a BRIN index - if it has many degraded ranges, it's
bound to perform poorly even for WHERE conditions. We've lived with this
until now, and I don't think this makes the issue any worse.

2) Improving statistics for BRIN indexes - until now the BRIN costing
has been very crude; we have almost no information about how wide the
ranges are, how much they overlap, etc. The 0001 part (discussed in a
thread [1]) aims to provide much better statistics. Yes, the costing
still doesn't use that information very much.


regards


[1] https://commitfest.postgresql.org/40/3952/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Andres Freund
Date:
Hi,

On 2022-11-17 00:52:35 +0100, Tomas Vondra wrote:
> Well, yeah. That's pretty much exactly what the last version of this
> patch (from October 23) does.

That version unfortunately doesn't build successfully:
https://cirrus-ci.com/task/5108789846736896

[03:02:48.641] Duplicate OIDs detected:
[03:02:48.641] 9979
[03:02:48.641] 9980
[03:02:48.641] found 2 duplicate OID(s) in catalog data

Greetings,

Andres Freund



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

Here's an updated version of this patch series. There's still plenty of
stuff to improve, but it fixes a number of issues I mentioned earlier.

The two most important changes are:


1) handling of projections

Until now queries with projection might have failed, due to not using
the right slot, bogus var references and so on. The code was somewhat
confused because the new node is somewhere in between a scan node and a
sort (or more precisely, it combines both).

I believe this version handles all of this correctly - the code that
initializes the slots/projection info etc. needs serious cleanup, but
should be correct.


2) handling of expressions

The other improvement is handling of expressions - if you have a BRIN
index on an expression, this should now work too. This also includes
correct handling of collations (which the previous patches ignored).

Similarly to the projections, I believe the code is correct but needs
cleanup. In particular, I haven't paid close attention to memory
management, so there might be memory leaks when evaluating expressions.


The last two parts of the patch series (0009 and 0010) are about
testing. 0009 adds a regular regression test with various combinations
(projections, expressions, single- vs. multi-column indexes, ...).

0010 introduces a python script that randomly generates data sets,
indexes and queries. I use it to both test random combinations and to
evaluate performance. I don't expect it to be committed etc. - it's
included only to keep it versioned with the rest of the patch.


I did some basic benchmarking using the 0010 part, to evaluate how this
works for various cases. The script varies a number of parameters:

- number of rows
- table fill factor
- randomness (how much the ranges overlap)
- pages per range
- limit / offset for queries
- ...

The script forces both a "seqscan" and "brinsort" plan, and collects
timing info.

The results are encouraging, I think. Attached are two charts, plotting
speedup vs. fraction of tuples the query has to sort.

  speedup = (seqscan timing / brinsort timing)

  fraction = (limit + offset) / (table rows)

A query with "limit 1 offset 0" has fraction ~0.0, a query that scans
everything (perhaps because it has no LIMIT/OFFSET) has ~1.0.

For speedup, 1.0 means "no change" while values above 1.0 mean the
query gets faster. Both plots have a log-scale y-axis.

brinsort-all-data.gif shows results for all queries. There's significant
speedup for small values of fraction (i.e. queries with limit, requiring
few rows). This is expected, as this is pretty much the primary use case
for the patch.

The other thing is that the benefits quickly diminish - for fractions
close to 0.0 the potential benefits are huge, but once you cross ~10% of
the table, it's within 10x, for ~25% less than 5x etc.

OTOH there are also a fair number of queries that got slower - those are
the data points below 1.0. I've looked into many of them, and there are
a couple reasons why that can happen:

1) random data set - When the ranges are very wide, BRIN Sort has to
read most of the data, and it ends up sorting almost as many rows as the
sequential scan. But it's more expensive, especially when combined with
the following points.

    Note: I don't think this is an issue in practice, because BRIN
    indexes would suck quite badly on such data, so no one is going to
    create such indexes in the first place.

2) tiny ranges - By default ranges are 1MB, but it's possible to make
them much smaller. But BRIN Sort has to read/sort all ranges, and that
gets more expensive with the number of ranges.

    Note: I'm not sure there's a way around this, although Matthias
    had some interesting ideas about how to keep the ranges sorted.
    But ultimately, I think this is fine, as long as it's costed
    correctly. For fractions close to 0.0 this is still going to be
    a huge win.

3) non-adaptive (and low) watermark_step - The number of sorts makes a
huge difference - in an extreme case we could add the ranges one by one,
with a sort after each. For small limit/offset that works, but for more
rows it's quite pointless.

    Note: The adaptive step (adjusted during execution) works great, and
    the script sets explicit values mostly to trigger more corner cases.
    Also, I wonder if we should force higher values as we progress
    through the table - we still don't want to exceed work_mem, but the
    larger the fraction we scan, the more we should prefer larger
    "batches".


The second "filter" chart (brinsort-filtered-data.gif) shows results
filtered to only show runs with:

   - pages_per_range >= 32
   - randomness <= 5% (i.e. each range covers about 5% of domain)
   - adaptive step (= -1)

And IMO this looks much better - there are almost no slower queries,
except for a bunch of queries that scan all the data.


So, what are the next steps for this patch:

1) cleanup of the existing code (mentioned above)

2) improvement of the costing - This is probably the critical part,
because we need a costing that allows us to identify the queries that
are likely to be faster/slower. I believe this is doable - either now or
using the new opclass-specific stats proposed in a separate patch (and
kept in part 0001 for completeness).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote:
> Rebased version of the patches, fixing only minor conflicts.

Per cfbot, the patch fails on 32-bit builds.
+ERROR:  count mismatch: 0 != 1000

And causes warnings in mingw cross-compile.

On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote:
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit the new code
> paths.
> 
> Also, note that indxpath.c had the var initialized to true.

In your patch, the amstats guc is still being set to false during
startup by the guc machinery.  And the tests crash everywhere if it's
set to on:

TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File:
"../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519

>  . Some typos in your other patches: "heuristics heuristics".  ste.
>    lest (least).

These are still present.

-- 
Justin



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/16/23 17:10, Justin Pryzby wrote:
> On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote:
>> Rebased version of the patches, fixing only minor conflicts.
> 
> Per cfbot, the patch fails on 32 bit builds.
> +ERROR:  count mismatch: 0 != 1000
> 
> And causes warnings in mingw cross-compile.
> 

There was a silly mistake in trying to store block numbers as bigint
when sorting the ranges, instead of uint32. That happens to work on
64-bit systems, but on 32-bit systems it produces bogus block numbers.

The attached should fix that - it passes on 32-bit arm, even with
valgrind and all that.

> On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote:
>> I think new GUCs should be enabled during patch development.
>> Maybe in a separate 0002 patch "for CI only not for commit".
>> That way "make check" at least has a chance to hit the new code
>> paths.
>>
>> Also, note that indxpath.c had the var initialized to true.
> 
> In your patch, the amstats guc is still being set to false during
> startup by the guc machinery.  And the tests crash everywhere if it's
> set to on:
> 
> TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File:
> "../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519
>

Right, that was a silly thinko in building the stats, and I found a
couple more issues nearby. Should be fixed in the attached version.

>>  . Some typos in your other patches: "heuristics heuristics".  ste.
>>    lest (least).
> 
> These are still present.

Thanks for reminding me, those should be fixed too now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
cfbot complained there's one more place triggering a compiler warning on
32-bit systems, so here's a version fixing that.

I've also added a copy of the regression tests but using the indexam
stats added in 0001. This is just a copy of the already existing
regression tests, just with enable_indexam_stats=true - this should
catch some of the issues that went mostly undetected in the earlier
patch versions.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Justin Pryzby
Date:
Are (any of) these patches targetting v16 ?

typos:
ar we - we are?
morestly - mostly
interstect - intersect

> + * XXX We don't sort the bins, so just do binary sort. For large number of values
> + * this might be an issue, for small number of values a linear search is fine.

"binary sort" is wrong?

> + * only half of there ranges, thus 1/2. This can be extended to randomly

half of *these* ranges ?

> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas.vondra@postgresql.org>
> Date: Mon, 17 Oct 2022 18:39:28 +0200
> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics

I think the idea can also apply to btree - currently, correlation is
considered to be a property of a column, but not an index.  But that
fails to distinguish between a freshly built index, and an index with
out of order heap references, which can cause an index scan to be a lot
more expensive.

I implemented per-index correlation stats way back when:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com

See also:
https://www.postgresql.org/message-id/14438.1512499811@sss.pgh.pa.us

With my old test case:

Index scan is 3x slower than bitmap scan, but index scan is costed as
being cheaper:

postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) (actual time=0.107..222.737 rows=128914 loops=1)

postgres=# SET enable_indexscan =no;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual time=16.830..69.860 rows=128914 loops=1)

If it's clustered, then the index scan is almost twice as fast, and the
costs are more consistent with the associated time.  The planner assumes
that the indexes are freshly built...

postgres=# CLUSTER t USING t_i_idx ;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) (actual time=0.084..117.549 rows=128914 loops=1)

-- 
Justin



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/18/23 19:51, Justin Pryzby wrote:
> Are (any of) these patches targetting v16 ?
> 

Probably not. Maybe if there's more feedback / scrutiny, but I'm not
sure one commitfest is enough to polish the patch (especially
considering I haven't done much on the costing yet).

> typos:
> ar we - we are?
> morestly - mostly
> interstect - intersect
> 
>> + * XXX We don't sort the bins, so just do binary sort. For large number of values
>> + * this might be an issue, for small number of values a linear search is fine.
> 
> "binary sort" is wrong?
> 
>> + * only half of there ranges, thus 1/2. This can be extended to randomly
> 
> half of *these* ranges ?
> 

Thanks, I'll fix those.

>> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas.vondra@postgresql.org>
>> Date: Mon, 17 Oct 2022 18:39:28 +0200
>> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics
> 
> I think the idea can also apply to btree - currently, correlation is
> considered to be a property of a column, but not an index.  But that
> fails to distinguish between a freshly built index, and an index with
> out of order heap references, which can cause an index scan to be a lot
> more expensive.
> 
> I implemented per-index correlation stats way back when:
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com
> 
> See also:
> https://www.postgresql.org/message-id/14438.1512499811@sss.pgh.pa.us
> 
> With my old test case:
> 
> Index scan is 3x slower than bitmap scan, but index scan is costed as
> being cheaper:
> 
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) (actual time=0.107..222.737 rows=128914 loops=1)
> 
> postgres=# SET enable_indexscan =no;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual time=16.830..69.860 rows=128914 loops=1)
> 
> If it's clustered, then the index scan is almost twice as fast, and the
> costs are more consistent with the associated time.  The planner assumes
> that the indexes are freshly built...
> 
> postgres=# CLUSTER t USING t_i_idx ;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) (actual time=0.084..117.549 rows=128914 loops=1)
> 

Yeah, the concept of indexam statistics certainly applies to other index
types, and for btree we might collect information about correlation etc.
I haven't looked at the 2017 patch, but it seems reasonable.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> cfbot complained there's one more place triggering a compiler warning on
> 32-bit systems, so here's a version fixing that.
>
> I've also added a copy of the regression tests but using the indexam
> stats added in 0001. This is just a copy of the already existing
> regression tests, just with enable_indexam_stats=true - this should
> catch some of the issues that went mostly undetected in the earlier
> patch versions.


Comments on 0001, mostly comments and patch design:

> +range_minval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_maxval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_values_cmp(const void *a, const void *b, void *arg)

Can the arguments of these functions be modified into the types they
are expected to receive? If not, could you add a comment on why that's
not possible?
> I don't think it's good practice to "just" use void* for arguments to
cast them to more concrete types in the next lines without an
immediate explanation.

> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).

Could you please expand on this? We do have GIST support for ranges, too.

> + * brin_minmax_stats
> + *        Calculate custom statistics for a BRIN minmax index.
> + *
> + * At the moment this calculates:
> + *
> + *  - number of summarized/not-summarized and all/has nulls ranges

I think statistics gathering of an index should be done at the AM
level, not attribute level. The docs currently suggest that the user
builds one BRIN index with 16 columns instead of 16 BRIN indexes with
one column each, which would make the statistics gathering use 16x
more IO if the scanned data cannot be reused.

It is possible to build BRIN indexes on more than one column with more
than one opclass family like `USING brin (id int8_minmax_ops, id
int8_bloom_ops)`. This would mean various duplicate statistics fields,
no?
It seems to me that it's more useful to do the null- and n_summarized
on the index level instead of duplicating that inside the opclass.

> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)

I am not familiar with the frequency of max-sized relations, but this
would go into an infinite loop for pagesPerRange values >1 for
max-sized relations due to BlockNumber wraparound. I think there
should be some additional overflow checks here.

> +/*
> + * get_attstaindexam
> + *
> + *      Given the table and attribute number of a column, get the index AM
> + *      statistics.  Return NULL if no data available.
> + *

Shouldn't this be "given the index and attribute number" instead of
"the table and attribute number"?
I think we need to be compatible with indexes on expression here, so
that we don't fail to create (or use) statistics for an index `USING
brin ( (int8range(my_min_column, my_max_column, '[]'))
range_inclusion_ops)` when we implement stats for range_inclusion_ops.

> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering

Does this function still exist?

.

I'm planning on reviewing the other patches, and noticed that a lot of
the patches are marked WIP. Could you share a status on those, because
currently that status is unknown: Are these patches you don't plan on
including, or are these patches only (or mostly) included for
debugging?


Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/23/23 15:19, Matthias van de Meent wrote:
> On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> cfbot complained there's one more place triggering a compiler warning on
>> 32-bit systems, so here's a version fixing that.
>>
>> I've also added a copy of the regression tests but using the indexam
>> stats added in 0001. This is just a copy of the already existing
>> regression tests, just with enable_indexam_stats=true - this should
>> catch some of the issues that went mostly undetected in the earlier
>> patch versions.
> 
> 
> Comments on 0001, mostly comments and patch design:
> 
>> +range_minval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_maxval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_values_cmp(const void *a, const void *b, void *arg)
> 
> Can the arguments of these functions be modified into the types they
> are expected to receive? If not, could you add a comment on why that's
> not possible?
> > I don't think it's good practice to "just" use void* for arguments to
> cast them to more concrete types in the next lines without an
> immediate explanation.
> 

The reason is that that's what qsort() expects. If you change that to
actual data types, you'll get compile-time warnings. I agree this may
need better comments, though.

>> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> 
> Could you please expand on this? We do have GIST support for ranges, too.
> 

Expand in what way? This is meant to be AM-specific, so if GiST wants to
collect some additional stats, it's free to do so - perhaps some of the
ideas from the stats collected for BRIN would be applicable, but it's
also bound to the index structure.

>> + * brin_minmax_stats
>> + *        Calculate custom statistics for a BRIN minmax index.
>> + *
>> + * At the moment this calculates:
>> + *
>> + *  - number of summarized/not-summarized and all/has nulls ranges
> 
> I think statistics gathering of an index should be done at the AM
> level, not attribute level. The docs currently suggest that the user
> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> one column each, which would make the statistics gathering use 16x
> more IO if the scanned data cannot be reused.
> 

Why? The row sample is collected only once and used for building all the
index AM stats - it doesn't really matter if we analyze 16 single-column
indexes or 1 index with 16 columns. Yes, we'll need to scan more
indexes, but the with 16 columns the summaries will be larger so the
total amount of I/O will be almost the same I think.

Or maybe I don't understand what I/O you're talking about?

> It is possible to build BRIN indexes on more than one column with more
> than one opclass family like `USING brin (id int8_minmax_ops, id
> int8_bloom_ops)`. This would mean various duplicate statistics fields,
> no?
> It seems to me that it's more useful to do the null- and n_summarized
> on the index level instead of duplicating that inside the opclass.

I don't think it's worth it. The amount of data this would save is tiny,
and it'd only apply to cases where the index includes the same attribute
multiple times, and that's pretty rare I think. I don't think it's worth
the extra complexity.

> 
>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.
> 

Good point, but that's a pre-existing issue. We do this same loop in a
number of places.

>> +/*
>> + * get_attstaindexam
>> + *
>> + *      Given the table and attribute number of a column, get the index AM
>> + *      statistics.  Return NULL if no data available.
>> + *
> 
> Shouldn't this be "given the index and attribute number" instead of
> "the table and attribute number"?
> I think we need to be compatible with indexes on expression here, so
> that we don't fail to create (or use) statistics for an index `USING
> brin ( (int8range(my_min_column, my_max_column, '[]'))
> range_inclusion_ops)` when we implement stats for range_inclusion_ops.
> 
>> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
> 
> Does this function still exist?
> 

Yes, but only in 0003 - it's a "brute-force" algorithm used to
cross-check the result of the optimized algorithm in 0001. You're right
it should not be referenced in the comment.

> 
> I'm planning on reviewing the other patches, and noticed that a lot of
> the patches are marked WIP. Could you share a status on those, because
> currently that status is unknown: Are these patches you don't plan on
> including, or are these patches only (or mostly) included for
> debugging?
> 

I think the WIP label is a bit misleading, I used it mostly to mark
patches that are not meant to be committed on their own. A quick overview:

0002-wip-introduce-debug_brin_stats-20230218-2.patch
0003-wip-introduce-debug_brin_cross_check-20230218-2.patch

  Not meant for commit, used mostly during development/debugging, by
  adding debug logging and/or cross-check to validate the results.

  I think it's fine to ignore those during review.


0005-wip-brinsort-explain-stats-20230218-2.patch

  This needs more work. It does what it's meant to do (show info about
  the brinsort node), but I'm not very happy with the formatting etc.

  Review and suggestions welcome.


0006-wip-multiple-watermark-steps-20230218-2.patch
0007-wip-adjust-watermark-step-20230218-2.patch
0008-wip-adaptive-watermark-step-20230218-2.patch

  Ultimately this should be merged into 0004 (which does the actual
  brinsort), I think 0006 is the way to go, but I kept all the patches
  to show the incremental evolution (and allow comparisons).

  0006 adds a GUC to specify how many ranges to add in each round,
  0007 adjusts that based on statistics during planning, and 0008 does
  that adaptively during execution.

0009-wip-add-brinsort-regression-tests-20230218-2.patch
0010-wip-add-brinsort-amstats-regression-tests-20230218-2.patch

  These need to move to the earlier parts. The tests are rather
  expensive, so we'll need to reduce that somehow.

0011-wip-test-generator-script-20230218-2.patch

  Not intended for commit, I only included it to keep it as part of the
  patch series.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/23/23 15:19, Matthias van de Meent wrote:
>> Comments on 0001, mostly comments and patch design:

One more comment:

>>> + range->min_value = bval->bv_values[0];
>>> + range->max_value = bval->bv_values[1];

This produces dangling pointers for minmax indexes on by-ref types
such as text and bytea, due to the memory context of the decoding
tuple being reset every iteration. :-(

> >> +range_values_cmp(const void *a, const void *b, void *arg)
> >
> > Can the arguments of these functions be modified into the types they
> > are expected to receive? If not, could you add a comment on why that's
> > not possible?
>
> The reason is that that's what qsort() expects. If you change that to
> actual data types, you'll get compile-time warnings. I agree this may
> need better comments, though.

Thanks in advance.
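
(For context, the pattern in question looks roughly like this - a
sketch, where the struct name and the SortSupport argument are my
assumptions, not the patch's actual definitions:)

  #include "utils/sortsupport.h"

  typedef struct BrinRange
  {
      Datum       min_value;
      Datum       max_value;
  } BrinRange;

  /* qsort_arg() requires exactly this signature, hence the void pointers */
  static int
  range_values_cmp(const void *a, const void *b, void *arg)
  {
      const BrinRange *ra = (const BrinRange *) a;
      const BrinRange *rb = (const BrinRange *) b;
      SortSupport ssup = (SortSupport) arg;

      return ApplySortComparator(ra->min_value, false,
                                 rb->min_value, false, ssup);
  }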

> >> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> >
> > Could you please expand on this? We do have GIST support for ranges, too.
> >
>
> Expand in what way? This is meant to be AM-specific, so if GiST wants to
> collect some additional stats, it's free to do so - perhaps some of the
> ideas from the stats collected for BRIN would be applicable, but it's
> also bound to the index structure.

I don't quite understand the flow of the comment, as I don't clearly
see what the "BRIN for ranges" tries to refer to. In my view, that
makes it a bad example which needs further explanation or rewriting,
aka "expanding on".

> >> + * brin_minmax_stats
> >> + *        Calculate custom statistics for a BRIN minmax index.
> >> + *
> >> + * At the moment this calculates:
> >> + *
> >> + *  - number of summarized/not-summarized and all/has nulls ranges
> >
> > I think statistics gathering of an index should be done at the AM
> > level, not attribute level. The docs currently suggest that the user
> > builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> > one column each, which would make the statistics gathering use 16x
> > more IO if the scanned data cannot be reused.
> >
>
> Why? The row sample is collected only once and used for building all the
> index AM stats - it doesn't really matter if we analyze 16 single-column
> indexes or 1 index with 16 columns. Yes, we'll need to scan more
> indexes, but with 16 columns the summaries will be larger, so the
> total amount of I/O will be almost the same I think.
>
> Or maybe I don't understand what I/O you're talking about?

With the proposed patch, we do O(ncols_statsenabled) scans over the
BRIN index. Each scan reads all ncol columns of all block ranges from
disk, so in effect the data scan does on the order of
O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
all columns have statistics enabled.

> > It is possible to build BRIN indexes on more than one column with more
> > than one opclass family like `USING brin (id int8_minmax_ops, id
> > int8_bloom_ops)`. This would mean various duplicate statistics fields,
> > no?
> > It seems to me that it's more useful to do the null- and n_summarized
> > on the index level instead of duplicating that inside the opclass.
>
> I don't think it's worth it. The amount of data this would save is tiny,
> and it'd only apply to cases where the index includes the same attribute
> multiple times, and that's pretty rare I think. I don't think it's worth
> the extra complexity.

Not necessarily, it was just an example of where we'd save IO.
Note that the current gathering method already retrieves all tuple
attribute data, so from a basic processing perspective we'd save some
time decoding as well.

> >
> > I'm planning on reviewing the other patches, and noticed that a lot of
> > the patches are marked WIP. Could you share a status on those, because
> > currently that status is unknown: Are these patches you don't plan on
> > including, or are these patches only (or mostly) included for
> > debugging?
> >
>
> I think the WIP label is a bit misleading, I used it mostly to mark
> patches that are not meant to be committed on their own. A quick overview:
>
> [...]

Thanks for the explanation, that's quite helpful. I'll see to further
reviewing 0004 and 0005 when I have additional time.

Kind regards,

Matthias van de Meent.



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
> 
> One more comment:
> 
>>>> + range->min_value = bval->bv_values[0];
>>>> + range->max_value = bval->bv_values[1];
> 
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
> 

Yeah, that sounds like a bug. Also a sign the tests should have some
by-ref data types (presumably there are none, as that would certainly
trip some asserts etc.).

>>>> +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why that's
>>> not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
> 
> Thanks in advance.
> 
>>>> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GIST support for ranges, too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants to
>> collect some additional stats, it's free to do so - perhaps some of the
>> ideas from the stats collected for BRIN would be applicable, but it's
>> also bound to the index structure.
> 
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
> 

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say only BRIN, but I was focused on the minmax use case,
so I mentioned the ranges.

>>>> + * brin_minmax_stats
>>>> + *        Calculate custom statistics for a BRIN minmax index.
>>>> + *
>>>> + * At the moment this calculates:
>>>> + *
>>>> + *  - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
>>> one column each, which would make the statistics gathering use 16x
>>> more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all the
>> index AM stats - it doesn't really matter if we analyze 16 single-column
>> indexes or 1 index with 16 columns. Yes, we'll need to scan more
>> indexes, but with 16 columns the summaries will be larger, so the
>> total amount of I/O will be almost the same I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
> 
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
> 

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like

  O(ncols_statsenabled * nranges)

assuming nranges is the number of page ranges. But even that's likely a
significant overestimate because most of the tuples will be served from
shared buffers.

Considering how tiny BRIN indexes are, this is likely orders of
magnitude less I/O than we expend on sampling rows from the table. I
mean, with the default statistics target we read ~30000 pages (~240MB)
or more just to sample the rows. Randomly, while the BRIN index is
likely scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.


>>> It is possible to build BRIN indexes on more than one column with more
>>> than one opclass family like `USING brin (id int8_minmax_ops, id
>>> int8_bloom_ops)`. This would mean various duplicate statistics fields,
>>> no?
>>> It seems to me that it's more useful to do the null- and n_summarized
>>> on the index level instead of duplicating that inside the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is tiny,
>> and it'd only apply to cases where the index includes the same attribute
>> multiple times, and that's pretty rare I think. I don't think it's worth
>> the extra complexity.
> 
> Not necessarily, it was just an example of where we'd save IO.
> Note that the current gathering method already retrieves all tuple
> attribute data, so from a basic processing perspective we'd save some
> time decoding as well.
> 

[shrug] I still think it's a negligible fraction of the time.

>>>
>>> I'm planning on reviewing the other patches, and noticed that a lot of
>>> the patches are marked WIP. Could you share a status on those, because
>>> currently that status is unknown: Are these patches you don't plan on
>>> including, or are these patches only (or mostly) included for
>>> debugging?
>>>
>>
>> I think the WIP label is a bit misleading, I used it mostly to mark
>> patches that are not meant to be committed on their own. A quick overview:
>>
>> [...]
> 
> Thanks for the explanation, that's quite helpful. I'll see to further
> reviewing 0004 and 0005 when I have additional time.
> 

Cool, thank you!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-23, Matthias van de Meent wrote:

> > + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.

They are definitely not very common -- BlockNumber wraps around at 32 TB
IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
a way to write these loops in a more robust manner.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 09:39, Alvaro Herrera wrote:
> On 2023-Feb-23, Matthias van de Meent wrote:
> 
>>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
>>
>> I am not familiar with the frequency of max-sized relations, but this
>> would go into an infinite loop for pagesPerRange values >1 for
>> max-sized relations due to BlockNumber wraparound. I think there
>> should be some additional overflow checks here.
> 
> They are definitely not very common -- BlockNumber wraps around at 32 TB
> IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
> a way to write these loops in a more robust manner.
> 

I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
eliminate the overflow.
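
For example (a sketch, simply widening the loop variable):

  uint64      heapBlk;

  for (heapBlk = 0; heapBlk < (uint64) nblocks; heapBlk += pagesPerRange)
  {
    ...
  }

Since nblocks is a uint32 BlockNumber, the uint64 arithmetic can never
wrap around.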

Alternatively, we could do something like

  prevHeapBlk = 0;
  for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
       heapBlk += pagesPerRange)
  {
    ...
    prevHeapBlk = heapBlk;
  }


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-24, Tomas Vondra wrote:

> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
> eliminate the overflow.

Yeah, that might be easy to set up.  We then don't have to worry about
it until BlockNumber is enlarged to 64 bits ... but by that time surely
we can just grow it again to a 128 bit loop variable.

> Alternatively, we could do something like
> 
>   prevHeapBlk = 0;
>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>        heapBlk += pagesPerRange)
>   {
>     ...
>     prevHeapBlk = heapBlk;
>   }

I think a formulation of this kind has the benefit that it works after
BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
again (assuming it is correct).

... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
this will neglect the last range in the table.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 16:14, Alvaro Herrera wrote:
> On 2023-Feb-24, Tomas Vondra wrote:
> 
>> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
>> eliminate the overflow.
> 
> Yeah, that might be easy to set up.  We then don't have to worry about
> it until BlockNumber is enlarged to 64 bits ... but by that time surely
> we can just grow it again to a 128 bit loop variable.
> 
>> Alternatively, we could do something like
>>
>>   prevHeapBlk = 0;
>>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>>        heapBlk += pagesPerRange)
>>   {
>>     ...
>>     prevHeapBlk = heapBlk;
>>   }
> 
> I think a formulation of this kind has the benefit that it works after
> BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> again (assuming it is correct).
> 

Did anyone even propose doing that? I suspect this is unlikely to be the
only place that might be broken by that.

> ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> this will neglect the last range in the table.
> 

Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
are 10 pages per range. That's 25 "full" ranges, and the last range
being just 5 pages. So we get into

   prevHeapBlk = 240
   heapBlk = 250

and we read the last 5 pages. And then we update

   prevHeapBlk = 250
   heapBlk = (250 + 10) % 255 = 5

and we don't do that loop. Or did I get this wrong, somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 24 Feb 2023 at 17:04, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/24/23 16:14, Alvaro Herrera wrote:
> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> >
>
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
>
>    prevHeapBlk = 240
>    heapBlk = 250
>
> and we read the last 5 pages. And then we update
>
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
>
> and we don't do that loop. Or did I get this wrong, somehow?

The result is off-by-one due to (u)int8 overflows being mod-256, but
apart from that your result is accurate.

The condition only stops the loop when we wrap around or when we go
past the last block, but no earlier than that.


Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/23/23 17:44, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 2/23/23 15:19, Matthias van de Meent wrote:
> >>> Comments on 0001, mostly comments and patch design:
> >
> > One more comment:
> >
> >>>> + range->min_value = bval->bv_values[0];
> >>>> + range->max_value = bval->bv_values[1];
> >
> > This produces dangling pointers for minmax indexes on by-ref types
> > such as text and bytea, due to the memory context of the decoding
> > tuple being reset every iteration. :-(
> >
>
> Yeah, that sounds like a bug. Also a sign the tests should have some
> by-ref data types (presumably there are none, as that would certainly
> trip some asserts etc.).

I'm not sure we currently trip asserts, as the data we store in the
memory context is very limited, making it unlikely we actually release
the memory region back to the OS.
I did get assertion failures by adding the attached patch, but I don't
think that's something we can do in the final release.

> > With the proposed patch, we do O(ncols_statsenabled) scans over the
> > BRIN index. Each scan reads all ncol columns of all block ranges from
> > disk, so in effect the data scan does on the order of
> > O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> > all columns have statistics enabled.
> >
>
> I don't think that's the number of I/O operations we'll do, because we
> always read the whole BRIN tuple at once. So I believe it should rather
> be something like
>
>   O(ncols_statsenabled * nranges)
>
> assuming nranges is the number of page ranges. But even that's likely a
> significant overestimate because most of the tuples will be served from
> shared buffers.

We store some data per index attribute, which makes the IO required
for a single indexed range proportional to the number of index
attributes.
If we then scan the index a number of times proportional to the number
of attributes, the cumulative IO load of statistics gathering for that
index is quadratic on the number of attributes, not linear, right?

> Considering how tiny BRIN indexes are, this is likely orders of
> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
> or more just to sample the rows. Randomly, while the BRIN index is
> likely scanned mostly sequentially.

Mostly agreed; except I think it's not too difficult to imagine a BRIN
index on that scale - take, for example, the bloom filters, which
easily take up several hundred bytes.

With the default configuration of 128 pages_per_range,
n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
bloom filter size is 4.36 KiB - each indexed item on its own page. It
is still only 1% of the original table's size, but there are enough
tables that are larger than 24GB that this could be a significant
issue.
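
(For the record, that 4.36 KiB follows from the usual bloom sizing
formula m = -n * ln(p) / (ln 2)^2, assuming the default of ~291 heap
tuples per 8kB page:

  n = 0.1 * 128 * 291            ~= 3725 values per range
  m = 3725 * 9.585 bits/value    ~= 35700 bits ~= 4.36 KiB
)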

Note that most of my concerns are related to our current
documentation's statement that there are no demerits to multi-column
BRIN indexes:

"""
11.3. Multicolumn Indexes

[...] The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter.
"""

Wide BRIN indexes add IO overhead for single-attribute scans when
compared to single-attribute indexes, so having N single-attribute
index scans to build the statistics on an N-attribute index
is not great.

> Maybe there are cases where this would be an issue, but I haven't seen
> one when working on this patch (and I did a lot of experiments). I'd
> like to see one before we start optimizing it ...

I'm not only worried about optimizing it, I'm also worried that we're
putting this abstraction at the wrong level in a way that is difficult
to modify.

> This also reminds me that the issues I actually saw (e.g. memory
> consumption) would be made worse by processing all columns at once,
> because then you need to keep more columns in memory.

Yes, I think that can be a valid concern, but don't we already do similar
things in the current table statistics gathering?

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 2/24/23 19:03, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>>>> Comments on 0001, mostly comments and patch design:
>>>
>>> One more comment:
>>>
>>>>>> + range->min_value = bval->bv_values[0];
>>>>>> + range->max_value = bval->bv_values[1];
>>>
>>> This produces dangling pointers for minmax indexes on by-ref types
>>> such as text and bytea, due to the memory context of the decoding
>>> tuple being reset every iteration. :-(
>>>
>>
>> Yeah, that sounds like a bug. Also a sign the tests should have some
>> by-ref data types (presumably there are none, as that would certainly
>> trip some asserts etc.).
> 
> I'm not sure we currently trip asserts, as the data we store in the
> memory context is very limited, making it unlikely we actually release
> the memory region back to the OS.
> I did get assertion failures by adding the attached patch, but I don't
> think that's something we can do in the final release.
> 

But we should randomize the memory if we ever do pfree(), and it's
strange valgrind didn't complain when I ran tests with it. But yeah,
I'll take a look and see if we can add some tests covering this.

>>> With the proposed patch, we do O(ncols_statsenabled) scans over the
>>> BRIN index. Each scan reads all ncol columns of all block ranges from
>>> disk, so in effect the data scan does on the order of
>>> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
>>> all columns have statistics enabled.
>>>
>>
>> I don't think that's the number of I/O operations we'll do, because we
>> always read the whole BRIN tuple at once. So I believe it should rather
>> be something like
>>
>>   O(ncols_statsenabled * nranges)
>>
>> assuming nranges is the number of page ranges. But even that's likely a
>> significant overestimate because most of the tuples will be served from
>> shared buffers.
> 
> We store some data per index attribute, which makes the IO required
> for a single indexed range proportional to the number of index
> attributes.
> If we then scan the index a number of times proportional to the number
> of attributes, the cumulative IO load of statistics gathering for that
> index is quadratic on the number of attributes, not linear, right?
> 

Ah, OK. I was focusing on number of I/O operations while you're talking
about amount of I/O performed. You're right the amount of I/O is
quadratic, but I think what's more interesting is the comparison of the
alternative ANALYZE approaches. The current simple approach does a
multiple of that I/O amount, linear in the number of attributes.

Which is not great, ofc.


>> Considering how tiny BRIN indexes are, this is likely orders of
>> magnitude less I/O than we expend on sampling rows from the table. I
>> mean, with the default statistics target we read ~30000 pages (~240MB)
>> or more just to sample the rows. Randomly, while the BRIN index is
>> likely scanned mostly sequentially.
> 
> Mostly agreed; except I think it's not too difficult to imagine a BRIN
> index on that scale - take, for example, the bloom filters, which
> easily take up several hundred bytes.
> 
> With the default configuration of 128 pages_per_range,
> n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> bloom filter size is 4.36 KiB - each indexed item on its own page. It
> is still only 1% of the original table's size, but there are enough
> tables that are larger than 24GB that this could be a significant
> issue.
> 

Right, it's certainly true BRIN indexes may be made fairly large (either
by using something like bloom or by making the ranges much smaller). But
those are exactly the indexes where building statistics for all columns
at once is going to cause issues with memory usage etc.

Note: Obviously, that depends on how much data per range we need to keep
in memory. For bloom I doubt we'd actually want to keep all the filters,
we'd probably calculate just some statistics (e.g. number of bits set),
so maybe the memory consumption is not that bad.

In fact, for such indexes the memory consumption may already be an issue
even when analyzing the index column by column. My feeling is we'll need
to do something about that, e.g. by reading only a sample of the ranges,
or something like that. That might help with the I/O too, I guess.

> Note that most of my concerns are related to our current
> documentation's statement that there are no demerits to multi-column
> BRIN indexes:
> 
> """
> 11.3. Multicolumn Indexes
> 
> [...] The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter.
> """
> 

True, we may need to clarify this in the documentation.

> Wide BRIN indexes add IO overhead for single-attribute scans when
> compared to single-attribute indexes, so having N single-attribute
> index scans to build the statistics on an N-attribute index
> is not great.
> 

Perhaps, but it's not like the alternative is perfect either
(complexity, memory consumption, ...). IMHO it's a reasonable tradeoff.

>> Maybe there are cases where this would be an issue, but I haven't seen
>> one when working on this patch (and I did a lot of experiments). I'd
>> like to see one before we start optimizing it ...
> 
> I'm not only worried about optimizing it, I'm also worried that we're
> putting this abstraction at the wrong level in a way that is difficult
> to modify.
> 

Yeah, that's certainly a valid concern. I admit I only did the minimum
amount of work on this part, as I was focused on the sorting part.

>> This also reminds me that the issues I actually saw (e.g. memory
>> consumption) would be made worse by processing all columns at once,
>> because then you need to keep more columns in memory.
> 
> Yes, I think that can be a valid concern, but don't we already do similar
> things in the current table statistics gathering?
> 

Not really, I think. We sample a bunch of rows once, but then we build
statistics on this sample for each attribute / expression independently.
We could of course read the whole index into memory and then run the
analysis, but I think tuples tend to be much smaller (thanks to TOAST
etc.) and we only really scan a limited amount of them (30k).

But if we're concerned about the I/O, the BRIN is likely fairly large,
so maybe reading it into memory at once is not a great idea.

It'd be more similar if we sampled the BRIN ranges - I think
we'll have to do something like that anyway.
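
The simplest form of that might be to process only every k-th range (a
sketch, with made-up variable names):

  /* analyze only a uniform subset of the ranges */
  int         step = Max(1, nranges / targranges);
  int         rangeno;

  for (rangeno = 0; rangeno < nranges; rangeno += step)
  {
      /* read and analyze the summary of the range starting at
       * block (rangeno * pagesPerRange) */
  }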


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 2/24/23 20:14, Tomas Vondra wrote:
> 
> 
> On 2/24/23 19:03, Matthias van de Meent wrote:
>> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>>>>> Comments on 0001, mostly comments and patch design:
>>>>
>>>> One more comment:
>>>>
>>>>>>> + range->min_value = bval->bv_values[0];
>>>>>>> + range->max_value = bval->bv_values[1];
>>>>
>>>> This produces dangling pointers for minmax indexes on by-ref types
>>>> such as text and bytea, due to the memory context of the decoding
>>>> tuple being reset every iteration. :-(
>>>>
>>>
>>> Yeah, that sounds like a bug. Also a sign the tests should have some
>>> by-ref data types (presumably there are none, as that would certainly
>>> trip some asserts etc.).
>>
>> I'm not sure we currently trip asserts, as the data we store in the
>> memory context is very limited, making it unlikely we actually release
>> the memory region back to the OS.
>> I did get assertion failures by adding the attached patch, but I don't
>> think that's something we can do in the final release.
>>
> 
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it. But yeah,
> I'll take a look and see if we can add some tests covering this.
> 

There was no patch attached to trigger the assertions, but the attached
patches should fix the bug, I think. They pretty much just do datumCopy()
after reading the BRIN tuple from disk.
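
Roughly like this (a sketch - "attr" standing for the attribute's
pg_attribute entry, not the patch's actual variable):

  /*
   * Copy the datums out of the per-tuple memory context, so the values
   * survive the context reset on the next iteration.
   */
  range->min_value = datumCopy(bval->bv_values[0],
                               attr->attbyval, attr->attlen);
  range->max_value = datumCopy(bval->bv_values[1],
                               attr->attbyval, attr->attlen);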

It's interesting that I've been unable to hit the usual asserts checking
freed memory, even with text columns. I guess the BRIN tuple memory
happens to be aligned in a way that just works.

It did, however, trigger an assert failure in minval_end, because it
didn't use the proper comparator.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 24 Feb 2023, 20:14 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:
>
> On 2/24/23 19:03, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> >> Yeah, that sounds like a bug. Also a sign the tests should have some
> >> by-ref data types (presumably there are none, as that would certainly
> >> trip some asserts etc.).
> >
> > I'm not sure we currently trip asserts, as the data we store in the
> > memory context is very limited, making it unlikely we actually release
> > the memory region back to the OS.
> > I did get assertion failures by adding the attached patch, but I don't
> > think that's something we can do in the final release.
> >
>
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it.

Well, we don't clean up the decoding tuple immediately after our last
iteration, so the memory context (and the last tuple's attributes) are
still valid memory addresses. And, assuming that min/max values for
all brin ranges all have the same max-aligned length, the attribute
pointers are likely to point to the same offset within the decoding
tuple's memory context's memory segment, which would mean the dangling
pointers still point to valid memory - just not memory with the
contents they expected to be pointing to.

> >> Considering how tiny BRIN indexes are, this is likely orders of
> >> magnitude less I/O than we expend on sampling rows from the table. I
> >> mean, with the default statistics target we read ~30000 pages (~240MB)
> >> or more just to sample the rows. Randomly, while the BRIN index is
> >> likely scanned mostly sequentially.
> >
> > Mostly agreed; except I think it's not too difficult to imagine a BRIN
> > index on that scale - take, for example, the bloom filters, which
> > easily take up several hundred bytes.
> >
> > With the default configuration of 128 pages_per_range,
> > n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> > bloom filter size is 4.36 KiB - each indexed item on its own page. It
> > is still only 1% of the original table's size, but there are enough
> > tables that are larger than 24GB that this could be a significant
> > issue.
> >
>
> Right, it's certainly true BRIN indexes may be made fairly large (either
> by using something like bloom or by making the ranges much smaller). But
> those are exactly the indexes where building statistics for all columns
> at once is going to cause issues with memory usage etc.
>
> Note: Obviously, that depends on how much data per range we need to keep
> in memory. For bloom I doubt we'd actually want to keep all the filters,
> we'd probably calculate just some statistics (e.g. number of bits set),
> so maybe the memory consumption is not that bad.

Yes, I was thinking something along the same lines for bloom as well.
Something like 'average number of bits set' (or a histogram of the
number of bits set), and/or, for each bit, a count (or %) of how many
times it is set, etc.
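
Computing such a fill fraction per filter would be cheap (a sketch,
assuming we have the filter's bitmap and its length in bytes at hand):

  /* fraction of bits set in one bloom filter's bitmap */
  double      fill = (double) pg_popcount((const char *) bitmap, nbytes)
                     / (nbytes * 8);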

> >> Maybe there are cases where this would be an issue, but I haven't seen
> >> one when working on this patch (and I did a lot of experiments). I'd
> >> like to see one before we start optimizing it ...
> >
> > I'm not only worried about optimizing it, I'm also worried that we're
> > putting this abstraction at the wrong level in a way that is difficult
> > to modify.
> >
>
> Yeah, that's certainly a valid concern. I admit I only did the minimum
> amount of work on this part, as I was focused on the sorting part.
>
> >> This also reminds me that the issues I actually saw (e.g. memory
> >> consumption) would be made worse by processing all columns at once,
> >> because then you need to keep more columns in memory.
> >
> > Yes, I think that can be a valid concern, but don't we already do similar
> > things in the current table statistics gathering?
> >
>
> Not really, I think. We sample a bunch of rows once, but then we build
> statistics on this sample for each attribute / expression independently.
> We could of course read the whole index into memory and then run the
> analysis, but I think tuples tend to be much smaller (thanks to TOAST
> etc.) and we only really scan a limited amount of them (30k).

Just to note, with default settings, sampling 30k index entries from
BRIN would cover ~29 GiB of a table. This isn't a lot, but it also
isn't exactly a small table. I think that it would be difficult to get
accurate avg_overlap statistics for some shapes of BRIN data...

> But if we're concerned about the I/O, the BRIN is likely fairly large,
> so maybe reading it into memory at once is not a great idea.

Agreed, we can't always expect that the interesting parts of the BRIN
index always fit in the available memory.

Kind regards,

Matthias van de Meent



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
Hi,

On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> I'll see to further reviewing 0004 and 0005 when I have additional time.

Some initial comments on 0004:

> +/*
> + * brin_minmax_ranges
> + *        Load the BRIN ranges and sort them.
> + */
> +Datum
> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> +{

Like in 0001, this seems to focus on only single columns. Can't we put
the scan and sort infrastructure in brin, and put the weight- and
compare-operators in the opclasses? I.e.
brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
brin_minmax_compare(order, order) -> int? I'm thinking of something
similar to GIST's distance operators, which would make implementing
ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
brin infrastructure.
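
In rough C terms, I'm thinking of something like this (hypothetical
names and signatures, not actual patch code):

  /* opclass support procs: extract sort keys from one BRIN summary */
  Datum brin_minmax_minorder(PG_FUNCTION_ARGS); /* (brintuple) -> min order */
  Datum brin_minmax_maxorder(PG_FUNCTION_ARGS); /* (brintuple) -> max order */

  /* compare two such order values */
  Datum brin_minmax_compare(PG_FUNCTION_ARGS);  /* (order, order) -> int */

with brin.c doing the scan itself, calling these procedures for each
range, and feeding the results into the sort.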

Note: One big reason I don't really like the current
brin_minmax_ranges (and the analyze code in 0001) is because it breaks
the operatorclass-vs-index abstraction layer. Operator classes don't
(shouldn't) know or care about which attribute number they have, nor
what the index does with the data.
Scanning the index is not something that I expect the operator class
to do, I expect that the index code organizes the scan, and forwards
the data to the relevant operator classes.
Scanning the index N times for N attributes can be excused if there
are good reasons, but I'd rather have that decision made in the
index's core code rather than at the design level.

> +/*
> + * XXX Does it make sense (is it possible) to have a sort by more than one
> + * column, using a BRIN index?
> + */

Yes, even if only one prefix column is included in the BRIN index
(e.g. `company` in `ORDER BY company, date`), the tuplesort with table
tuples can add additional sorting without first reading the whole
table, potentially (likely) reducing the total resource usage of the
query. That utilizes the same idea as incremental sorts, but with the
caveat that the input sort order is approximately likely but not at
all guaranteed. So, even if the range sort is on a single index
column, we can still do the table's tuplesort on all ORDER BY
attributes, as long as a prefix of ORDER BY columns are included in
the BRIN index.

> +            /*
> +             * XXX We can be a bit smarter for LIMIT queries - once we
> +             * know we have more rows in the tuplesort than we need to
> +             * output, we can stop spilling - those rows are not going
> +             * to be needed. We can discard the tuplesort (no need to
> +             * respill) and stop spilling.
> +             */

Shouldn't that be "discard the tuplestore"?

> +#define BRIN_PROCNUM_RANGES         12    /* optional */

It would be useful to add documentation for this in this patch.


Kind regards,

Matthias van de Meent.



Re: PATCH: Using BRIN indexes for sorted output

From
Alvaro Herrera
Date:
On 2023-Feb-24, Tomas Vondra wrote:

> On 2/24/23 16:14, Alvaro Herrera wrote:

> > I think a formulation of this kind has the benefit that it works after
> > BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> > again (assuming it is correct).
> 
> Did anyone even propose doing that? I suspect this is unlikely to be the
> only place that might be broken by that.

True about other places also needing fixes, and no, I haven't seen
anyone propose it; but while 32 TB does seem very far away to us now, it
might not be *that* far away.  So I think doing it the other way is better.

> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> 
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
> 
>    prevHeapBlk = 240
>    heapBlk = 250
> 
> and we read the last 5 pages. And then we update
> 
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
> 
> and we don't do that loop. Or did I get this wrong, somehow?

I stand corrected.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
Hi,

I finally had time to look at this patch again. There's a bit of bitrot,
so here's a rebased version (no other changes).

[more comments inline]

On 2/27/23 16:40, Matthias van de Meent wrote:
> Hi,
> 
> On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>>
>> I'll see to further reviewing 0004 and 0005 when I have additional time.
> 
> Some initial comments on 0004:
> 
>> +/*
>> + * brin_minmax_ranges
>> + *        Load the BRIN ranges and sort them.
>> + */
>> +Datum
>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>> +{
> 
> Like in 0001, this seems to focus on only single columns. Can't we put
> the scan and sort infrastructure in brin, and put the weight- and
> compare-operators in the opclasses? I.e.
> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> brin_minmax_compare(order, order) -> int? I'm thinking of something
> similar to GIST's distance operators, which would make implementing
> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> brin infrastructure.
> 
> Note: One big reason I don't really like the current
> brin_minmax_ranges (and the analyze code in 0001) is because it breaks
> the operatorclass-vs-index abstraction layer. Operator classes don't
> (shouldn't) know or care about which attribute number they have, nor
> what the index does with the data.
> Scanning the index is not something that I expect the operator class
> to do, I expect that the index code organizes the scan, and forwards
> the data to the relevant operator classes.
> Scanning the index N times for N attributes can be excused if there
> are good reasons, but I'd rather have that decision made in the
> index's core code rather than at the design level.
> 

I think you're right. It'd be more appropriate to have most of the core
scanning logic in brin.c, and then delegate only some minor decisions to
the opclass. Like, comparisons, extraction of min/max from ranges etc.

However, it's not quite clear to me what you mean by the weight- and
compare-operators? That is, what are

 - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
 - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
 - brin_minmax_compare(order, order) -> int

supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

In principle we just need a procedure that tells us min/max for a given
page range - I guess that's what the minorder/maxorder functions do? But
why would we need the compare one? We're comparing by the known data
type, so we can just delegate the comparison to that, no?

Also, the existence of these opclass procedures should be enough to
identify the opclasses which can support this.

>> +/*
>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>> + * column, using a BRIN index?
>> + */
> 
> Yes, even if only one prefix column is included in the BRIN index
> (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> tuples can add additional sorting without first reading the whole
> table, potentially (likely) reducing the total resource usage of the
> query. That utilizes the same idea as incremental sorts, but with the
> caveat that the input sort order is approximately likely but not at
> all guaranteed. So, even if the range sort is on a single index
> column, we can still do the table's tuplesort on all ORDER BY
> attributes, as long as a prefix of ORDER BY columns are included in
> the BRIN index.
> 

That's not quite what I meant, though. When I mentioned sorting by more
than one column, I meant using a multi-column BRIN index on those
columns. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t ...
   CREATE INDEX ON t USING brin (a,b);

   SELECT * FROM t ORDER BY a, b;

Now, what I think you described is using BRIN to sort by "a", and then
do incremental sort for "b". What I had in mind is whether it's possible
to use BRIN to sort by "b" too.

I was suspecting it might be made to work, but now that I think about it
again it probably can't - BRIN pretty much sorts the columns separately,
it's not like having an ordering by (a,b) - first by "a", then "b".

>> +            /*
>> +             * XXX We can be a bit smarter for LIMIT queries - once we
>> +             * know we have more rows in the tuplesort than we need to
>> +             * output, we can stop spilling - those rows are not going
>> +             * to be needed. We can discard the tuplesort (no need to
>> +             * respill) and stop spilling.
>> +             */
> 
> Shouldn't that be "discard the tuplestore"?
> 

Yeah, definitely.

>> +#define BRIN_PROCNUM_RANGES         12    /* optional */
> 
> It would be useful to add documentation for this in this patch.
> 

Right, this should be documented in doc/src/sgml/brin.sgml.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I finally had time to look at this patch again. There's a bit of bitrot,
> so here's a rebased version (no other changes).

Thanks!

> On 2/27/23 16:40, Matthias van de Meent wrote:
> > Some initial comments on 0004:
> >
> >> +/*
> >> + * brin_minmax_ranges
> >> + *        Load the BRIN ranges and sort them.
> >> + */
> >> +Datum
> >> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> >> +{
> >
> > Like in 0001, this seems to focus on only single columns. Can't we put
> > the scan and sort infrastructure in brin, and put the weight- and
> > compare-operators in the opclasses? I.e.
> > brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> > brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> > brin_minmax_compare(order, order) -> int? I'm thinking of something
> > similar to GIST's distance operators, which would make implementing
> > ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> > brin infrastructure.
> >
>
> However, it's not quite clear to me what you mean by the weight- and
> compare-operators? That is, what are
>
>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>  - brin_minmax_compare(order, order) -> int
>
> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

_minorder/_maxorder is for extracting the minimum/maximum relative
order of each range, used for ASC/DESC sorting of operator results
(e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
though I did forget to describe the second operator argument for the
distance function. We might also want to use only one such "order
extraction function" with DESC/ASC indicated by an argument.

> In principle we just need a procedure that tells us min/max for a given
> page range - I guess that's what the minorder/maxorder functions do? But
> why would we need the compare one? We're comparing by the known data
> type, so we can just delegate the comparison to that, no?

Is there a comparison function for any custom orderable type that we
can just use? GIST distance ordering uses floats, and I don't quite
like that from a user perspective, as it makes ordering operations
imprecise. I'd rather allow (but discourage) any type with its own
compare function.

> Also, the existence of these opclass procedures should be enough to
> identify the opclasses which can support this.

Agreed.

> >> +/*
> >> + * XXX Does it make sense (is it possible) to have a sort by more than one
> >> + * column, using a BRIN index?
> >> + */
> >
> > Yes, even if only one prefix column is included in the BRIN index
> > (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> > tuples can add additional sorting without first reading the whole
> > table, potentially (likely) reducing the total resource usage of the
> > query. That utilizes the same idea as incremental sorts, but with the
> > caveat that the input sort order is approximately likely but not at
> > all guaranteed. So, even if the range sort is on a single index
> > column, we can still do the table's tuplesort on all ORDER BY
> > attributes, as long as a prefix of ORDER BY columns are included in
> > the BRIN index.
> >
>
> That's not quite what I meant, though. When I mentioned sorting by more
> than one column, I meant using a multi-column BRIN index on those
> columns. Something like this:
>
>    CREATE TABLE t (a int, b int);
>    INSERT INTO t ...
>    CREATE INDEX ON t USING brin (a,b);
>
>    SELECT * FROM t ORDER BY a, b;
>
> Now, what I think you described is using BRIN to sort by "a", and then
> do incremental sort for "b". What I had in mind is whether it's possible
> to use BRIN to sort by "b" too.
>
> I was suspecting it might be made to work, but now that I think about it
> again it probably can't - BRIN pretty much sorts the columns separately,
> it's not like having an ordering by (a,b) - first by "a", then "b".

I imagine it would indeed be limited to an extremely small subset of
cases, and probably not worth the effort to implement in an initial
version.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/10/23 12:22, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I finally had time to look at this patch again. There's a bit of bitrot,
>> so here's a rebased version (no other changes).
> 
> Thanks!
> 
>> On 2/27/23 16:40, Matthias van de Meent wrote:
>>> Some initial comments on 0004:
>>>
>>>> +/*
>>>> + * brin_minmax_ranges
>>>> + *        Load the BRIN ranges and sort them.
>>>> + */
>>>> +Datum
>>>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>>>> +{
>>>
>>> Like in 0001, this seems to focus on only single columns. Can't we put
>>> the scan and sort infrastructure in brin, and put the weight- and
>>> compare-operators in the opclasses? I.e.
>>> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
>>> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
>>> brin_minmax_compare(order, order) -> int? I'm thinking of something
>>> similar to GIST's distance operators, which would make implementing
>>> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
>>> brin infrastructure.
>>>
>>
>> However, it's not quite clear to me what you mean by the weight- and
>> compare-operators? That is, what are
>>
>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>  - brin_minmax_compare(order, order) -> int
>>
>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
> 
> _minorder/_maxorder is for extracting the minimum/maximum relative
> order of each range, used for ASC/DESC sorting of operator results
> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
> though I did forget to describe the second operator argument for the
> distance function. We might also want to use only one such "order
> extraction function" with DESC/ASC indicated by an argument.
> 

I'm still not sure I understand what "minimum/maximum relative order"
is. Isn't it the same as returning min/max value that can appear in the
range? Although, that wouldn't work for points/boxes.

>> In principle we just need a procedure that tells us min/max for a given
>> page range - I guess that's what the minorder/maxorder functions do? But
>> why would we need the compare one? We're comparing by the known data
>> type, so we can just delegate the comparison to that, no?
> 
> Is there a comparison function for any custom orderable type that we
> can just use? GIST distance ordering uses floats, and I don't quite
> like that from a user perspective, as it makes ordering operations
> imprecise. I'd rather allow (but discourage) any type with its own
> compare function.
> 

I haven't really thought about geometric types, just about minmax and
minmax-multi. It's not clear to me what the benefit for these types would be.
I mean, we can probably sort points lexicographically, but is anyone
doing that in queries? It seems useless for order by distance.

>> Also, the existence of these opclass procedures should be enough to
>> identify the opclasses which can support this.
> 
> Agreed.
> 
>>>> +/*
>>>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>>>> + * column, using a BRIN index?
>>>> + */
>>>
>>> Yes, even if only one prefix column is included in the BRIN index
>>> (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
>>> tuples can add additional sorting without first reading the whole
>>> table, potentially (likely) reducing the total resource usage of the
>>> query. That utilizes the same idea as incremental sorts, but with the
>>> caveat that the input sort order is approximately likely but not at
>>> all guaranteed. So, even if the range sort is on a single index
>>> column, we can still do the table's tuplesort on all ORDER BY
>>> attributes, as long as a prefix of ORDER BY columns are included in
>>> the BRIN index.
>>>
>>
>> That's not quite what I meant, though. When I mentioned sorting by more
>> than one column, I meant using a multi-column BRIN index on those
>> columns. Something like this:
>>
>>    CREATE TABLE t (a int, b int);
>>    INSERT INTO t ...
>>    CREATE INDEX ON t USING brin (a,b);
>>
>>    SELECT * FROM t ORDER BY a, b;
>>
>> Now, what I think you described is using BRIN to sort by "a", and then
>> do incremental sort for "b". What I had in mind is whether it's possible
>> to use BRIN to sort by "b" too.
>>
>> I was suspecting it might be made to work, but now that I think about it
>> again it probably can't - BRIN pretty much sorts the columns separately,
>> it's not like having an ordering by (a,b) - first by "a", then "b".
> 
> I imagine it would indeed be limited to an extremely small subset of
> cases, and probably not worth the effort to implement in an initial
> version.
> 

OK, let's stick to order by a single column.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 12:22, Matthias van de Meent wrote:
>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>> However, it's not quite clear to me what you mean by the weight- and
>>> compare-operators? That is, what are
>>>
>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>  - brin_minmax_compare(order, order) -> int
>>>
>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>
>> _minorder/_maxorder is for extracting the minimum/maximum relative
>> order of each range, used for ASC/DESC sorting of operator results
>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>> though I did forget to describe the second operator argument for the
>> distance function. We might also want to use only one such "order
>> extraction function" with DESC/ASC indicated by an argument.
>>
>
> I'm still not sure I understand what "minimum/maximum relative order"
> is. Isn't it the same as returning min/max value that can appear in the
> range? Although, that wouldn't work for points/boxes.

Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
only need to extract the normal min/max values for ASC/DESC sorts,
which are readily available in the summary. But for multi-dimensional
and distance searches (nearest neighbour) we need to calculate the
distance between the indexed value(s) and the origin value to compare
the summary against, and the order would thus be asc/desc on distance
- a distance which may not be precisely represented by float types -
thus 'relative order' with its own order operation.

>>> In principle we just need a procedure that tells us min/max for a given
>>> page range - I guess that's what the minorder/maxorder functions do? But
>>> why would we need the compare one? We're comparing by the known data
>>> type, so we can just delegate the comparison to that, no?
>>
>> Is there a comparison function for any custom orderable type that we
>> can just use? GIST distance ordering uses floats, and I don't quite
>> like that from a user perspective, as it makes ordering operations
>> imprecise. I'd rather allow (but discourage) any type with its own
>> compare function.
>>
>
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types would be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

Yes, that's why you would sort them by distance, where the distance is
generated by the opclass as min/max distance between the summary and
the distance's origin, and then inserted into the tuplesort.

(previously)
>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>> so here's a rebased version (no other changes).

It seems like you forgot to attach the rebased patch, so unless you're
actively working on updating the patchset right now, could you send
the rebase to make CFBot happy?


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:
On 7/10/23 14:38, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 12:22, Matthias van de Meent wrote:
>>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>> However, it's not quite clear to me what you mean by the weight- and
>>>> compare-operators? That is, what are
>>>>
>>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>>  - brin_minmax_compare(order, order) -> int
>>>>
>>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>>
>>> _minorder/_maxorder is for extracting the minimum/maximum relative
>>> order of each range, used for ASC/DESC sorting of operator results
>>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>>> though I did forget to describe the second operator argument for the
>>> distance function. We might also want to use only one such "order
>>> extraction function" with DESC/ASC indicated by an argument.
>>>
>>
>> I'm still not sure I understand what "minimum/maximum relative order"
>> is. Isn't it the same as returning min/max value that can appear in the
>> range? Although, that wouldn't work for points/boxes.
> 
> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
> only need to extract the normal min/max values for ASC/DESC sorts,
> which are readily available in the summary. But for multi-dimensional
> and distance searches (nearest neighbour) we need to calculate the
> distance between the indexed value(s) and the origin value to compare
> the summary against, and the order would thus be asc/desc on distance
> - a distance which may not be precisely represented by float types -
> thus 'relative order' with its own order operation.
> 

Can you give some examples of such data / queries, and how they would
leverage the BRIN sort stuff?

For distance searches, I imagine this as data indexed by BRIN inclusion
opclass, which creates a bounding box. We could return closest/furthest
point on the bounding box (from the point used in the query). Which
seems a bit like an R-tree ...

But I have no idea what this would do for multi-dimensional searches, or
what would those searches do? How would you sort such data other than
lexicographically? Which I think is covered by the current BRIN Sort,
because the data is either stored as multiple columns, in which case we
use the BRIN on the first column. Or it's indexed using BRIN minmax as a
tuple of values, but then it's sorted lexicographically.


>>>> In principle we just need a procedure that tells us min/max for a given
>>>> page range - I guess that's what the minorder/maxorder functions do? But
>>>> why would we need the compare one? We're comparing by the known data
>>>> type, so we can just delegate the comparison to that, no?
>>>
>>> Is there a comparison function for any custom orderable type that we
>>> can just use? GIST distance ordering uses floats, and I don't quite
>>> like that from a user perspective, as it makes ordering operations
>>> imprecise. I'd rather allow (but discourage) any type with its own
>>> compare function.
>>>
>>
>> I haven't really thought about geometric types, just about minmax and
>> minmax-multi. It's not clear to me what the benefit for these types would be.
>> I mean, we can probably sort points lexicographically, but is anyone
>> doing that in queries? It seems useless for order by distance.
> 
> Yes, that's why you would sort them by distance, where the distance is
> generated by the opclass as min/max distance between the summary and
> the distance's origin, and then inserted into the tuplesort.
> 

OK, so the query says "order by distance from point X" and we calculate
the min/max distance of values in a given page range.

> (previously)
>>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>>> so here's a rebased version (no other changes).
> 
> It seems like you forgot to attach the rebased patch, so unless you're
> actively working on updating the patchset right now, could you send
> the rebase to make CFBot happy?
> 

Yeah, I forgot about the attachment. So here it is.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 14:38, Matthias van de Meent wrote:
>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>> only need to extract the normal min/max values for ASC/DESC sorts,
>> which are readily available in the summary. But for multi-dimensional
>> and distance searches (nearest neighbour) we need to calculate the
>> distance between the indexed value(s) and the origin value to compare
>> the summary against, and the order would thus be asc/desc on distance
>> - a distance which may not be precisely represented by float types -
>> thus 'relative order' with its own order operation.
>>
>
> Can you give some examples of such data / queries, and how they would
> leverage the BRIN sort stuff?

Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
the opclass would then decide that `<->(box, point) ASC` means it has
to return the closest distance from the point to the summary, for some
measure of 'distance' (in this case L2, <#> for other types, etc.). For DESC,
that would return the distance from `'(1,2)'::point` to the furthest
edge of the summary away from that point. Etc.
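
To make that concrete, here is a sketch of the kind of schema and
queries this support could enable (table and index names hypothetical,
and the current patchset does not implement distance ordering yet),
assuming the box_inclusion_ops BRIN opclass:

  create table pts (b box);
  create index pts_b_idx on pts using brin (b box_inclusion_ops);

  -- ASC: rank each range by the closest possible distance to the point
  select * from pts order by b <-> point '(1,2)' asc limit 10;

  -- DESC: rank each range by the distance to its furthest edge/corner
  select * from pts order by b <-> point '(1,2)' desc limit 10;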

> For distance searches, I imagine this as data indexed by BRIN inclusion
> opclass, which creates a bounding box. We could return closest/furthest
> point on the bounding box (from the point used in the query). Which
> seems a bit like an R-tree ...

Kind of; it would allow us to utilize such orderings without the
expensive 1 tuple = 1 index entry and without scanning the full table
before getting results. No tree involved, just a sequential scan on
the index to allow some sketch-based pre-sort on the data. Again, this
would work similarly to how GiST's internal pages work: each downlink in
GiST contains a summary of the entries on the downlinked page, and
distance searches use a priority queue where the priority is the
distance computed by the opclass-provided distance operator - lower distance
means higher priority. For BRIN, we'd have to build a priority queue
for the whole table at once, but presorting table sections is part of
the design of BRIN sort, right?

> But I have no idea what this would do for multi-dimensional searches, or
> what would those searches do? How would you sort such data other than
> lexicographically? Which I think is covered by the current BRIN Sort,
> because the data is either stored as multiple columns, in which case we
> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
> tuple of values, but then it's sorted lexicographically.

Yes, just any BRIN summary that allows distance operators and the like
should be enough. MINMAX is easy to understand, and box inclusion is
IMO also fairly easy to understand.

>>> I haven't really thought about geometric types, just about minmax and
>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>> I mean, we can probably sort points lexicographically, but is anyone
>>> doing that in queries? It seems useless for order by distance.
>>
>> Yes, that's why you would sort them by distance, where the distance is
>> generated by the opclass as min/max distance between the summary and
>> the distance's origin, and then inserted into the tuplesort.
>>
>
> OK, so the query says "order by distance from point X" and we calculate
> the min/max distance of values in a given page range.

Yes, and because it's BRIN that's an approximation, which should
generally be fine.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/10/23 18:18, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>>> only need to extract the normal min/max values for ASC/DESC sorts,
>>> which are readily available in the summary. But for multi-dimensional
>>> and distance searches (nearest neighbour) we need to calculate the
>>> distance between the indexed value(s) and the origin value to compare
>>> the summary against, and the order would thus be asc/desc on distance
>>> - a distance which may not be precisely represented by float types -
>>> thus 'relative order' with its own order operation.
>>>
>>
>> Can you give some examples of such data / queries, and how they would
>> leverage the BRIN sort stuff?
> 
> Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
> the opclass would then decide that `<->(box, point) ASC` means it has
> to return the closest distance from the point to the summary, for some
> measure of 'distance' (in this case L2, <#> for other types, etc.). For DESC,
> that would return the distance from `'(1,2)'::point` to the furthest
> edge of the summary away from that point. Etc.
> 

Thanks.

>> For distance searches, I imagine this as data indexed by BRIN inclusion
>> opclass, which creates a bounding box. We could return closest/furthest
>> point on the bounding box (from the point used in the query). Which
>> seems a bit like an R-tree ...
> 
> Kind of; it would allow us to utilize such orderings without the
> expensive 1 tuple = 1 index entry and without scanning the full table
> before getting results. No tree involved, just a sequential scan on
> the index to allow some sketch-based pre-sort on the data. Again, this
> would work similarly to how GiST's internal pages work: each downlink in
> GiST contains a summary of the entries on the downlinked page, and
> distance searches use a priority queue where the priority is the
> distance computed by the opclass-provided distance operator - lower distance
> means higher priority.

Yes, that's roughly how I understood this too - a tradeoff that won't
give the same performance as GiST, but the index is much smaller and
cheaper to maintain.

> For BRIN, we'd have to build a priority queue
> for the whole table at once, but presorting table sections is part of
> the design of BRIN sort, right?

Yes, that's kinda the whole point of BRIN sort.

> 
>> But I have no idea what this would do for multi-dimensional searches, or
>> what would those searches do? How would you sort such data other than
>> lexicographically? Which I think is covered by the current BRIN Sort,
>> because the data is either stored as multiple columns, in which case we
>> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
>> tuple of values, but then it's sorted lexicographically.
> 
> Yes, just any BRIN summary that allows distance operators and the like
> should be enough. MINMAX is easy to understand, and box inclusion is
> IMO also fairly easy to understand.
> 

True. If minmax is interpreted as inclusion with simple 1D points, it
kinda does the same thing. (Of course, minmax works with data types that
don't have distances, but there's a similarity.)

>>>> I haven't really thought about geometric types, just about minmax and
>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>> doing that in queries? It seems useless for order by distance.
>>>
>>> Yes, that's why you would sort them by distance, where the distance is
>>> generated by the opclass as min/max distance between the summary and
>>> the distance's origin, and then inserted into the tuplesort.
>>>
>>
>> OK, so the query says "order by distance from point X" and we calculate
>> the min/max distance of values in a given page range.
> 
> Yes, and because it's BRIN that's an approximation, which should
> generally be fine.
> 

Approximation in what sense? My understanding was we'd get a range of
distances that we know covers all rows in that range. So the results
should be accurate, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 7/10/23 18:18, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>>>> I haven't really thought about geometric types, just about minmax and
>>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>>> doing that in queries? It seems useless for order by distance.
>>>>
>>>> Yes, that's why you would sort them by distance, where the distance is
>>>> generated by the opclass as min/max distance between the summary and
>>>> the distance's origin, and then inserted into the tuplesort.
>>>>
>>>
>>> OK, so the query says "order by distance from point X" and we calculate
>>> the min/max distance of values in a given page range.
>>
>> Yes, and because it's BRIN that's an approximation, which should
>> generally be fine.
>>
>
> Approximation in what sense? My understanding was we'd get a range of
> distances that we know covers all rows in that range. So the results
> should be accurate, no?

The distance is going to be accurate only to the degree that the
summary can produce accurate distances for the datapoints it
represents. That can be quite imprecise due to the nature of the
contained datapoints: a summary of the points (-1, -1) and (1, 1) will
have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
and (-1, 0.5) would have a much more accurate distance of 1. The point
I was making is that the summary can only approximate the distance,
and that approximation is fine w.r.t. the BRIN sort algorithm.
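
Those numbers are easy to sanity-check with the usual per-axis clamping
formula for the minimum distance to an axis-aligned box (plain SQL,
just as an illustration, not patch code):

  -- summary of (-1,-1) and (1,1): the origin lies inside, so distance 0
  select sqrt(greatest(-1 - 0, 0, 0 - 1)^2 + greatest(-1 - 0, 0, 0 - 1)^2);

  -- summary of (-1,0) and (-1,0.5): closest point is (-1,0), distance 1
  select sqrt(greatest(-1 - 0, 0, 0 - (-1))^2 + greatest(0 - 0, 0, 0 - 0.5)^2);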

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:


On 7/11/23 13:20, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 7/10/23 18:18, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>>>>> I haven't really thought about geometric types, just about minmax and
>>>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>>>> doing that in queries? It seems useless for order by distance.
>>>>>
>>>>> Yes, that's why you would sort them by distance, where the distance is
>>>>> generated by the opclass as min/max distance between the summary and
>>>>> the distance's origin, and then inserted into the tuplesort.
>>>>>
>>>>
>>>> OK, so the query says "order by distance from point X" and we calculate
>>>> the min/max distance of values in a given page range.
>>>
>>> Yes, and because it's BRIN that's an approximation, which should
>>> generally be fine.
>>>
>>
>> Approximation in what sense? My understanding was we'd get a range of
>> distances that we know covers all rows in that range. So the results
>> should be accurate, no?
> 
> The distance is going to be accurate only to the degree that the
> summary can produce accurate distances for the datapoints it
> represents. That can be quite imprecise due to the nature of the
> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
> and (-1, 0.5) would have a much more accurate distance of 1.

Ummm, I'm probably missing something, or maybe my mental model of this
is just wrong, but why would the distance for the second summary be more
accurate? Or what does "more accurate" mean?

Is that about the range of distances for the summary? For the first
range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
that the points may have distance in the range [0, sqrt(2)], while for
the second summary it's [1, sqrt(1.25)].
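
Double-checking those upper bounds in SQL, given that the maximum is
attained at the corner farthest from the origin:

  select sqrt(1^2 + 1^2);    -- first summary:  sqrt(2)    ~ 1.4142
  select sqrt(1^2 + 0.5^2);  -- second summary: sqrt(1.25) ~ 1.1180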

> The point I was making is that the summary can only approximate the
> distance, and that approximation is fine w.r.t. the BRIN sort
> algorithm.
> 

I think as long as the approximation (whatever it means) does not cause
differences in results (compared to not using an index), it's OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Matthias van de Meent
Date:
On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
>
> On 7/11/23 13:20, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> Approximation in what sense? My understanding was we'd get a range of
>>> distances that we know covers all rows in that range. So the results
>>> should be accurate, no?
>>
>> The distance is going to be accurate only to the degree that the
>> summary can produce accurate distances for the datapoints it
>> represents. That can be quite imprecise due to the nature of the
>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
>> and (-1, 0.5) would have a much more accurate distance of 1.
>
> Ummm, I'm probably missing something, or maybe my mental model of this
> is just wrong, but why would the distance for the second summary be more
> accurate? Or what does "more accurate" mean?
>
> Is that about the range of distances for the summary? For the first
> range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
> that the points may have distance in the range [0, sqrt(2)], while for
> the second summary it's [1, sqrt(1.25)].

Yes; I was trying to refer to the difference between what results you
get from the summary vs what results you get from the actual
datapoints: In this case, for finding points which are closest to the
origin, the first bounding box has a less accurate estimate than the
second.

> > The point I was making is that the summary can only approximate the
> > distance, and that approximation is fine w.r.t. the BRIN sort
> > algorithm.
> >
>
> I think as long as the approximation (whatever it means) does not cause
> differences in results (compared to not using an index), it's OK.

Agreed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 7/14/23 16:42, Matthias van de Meent wrote:
> On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>>
>> On 7/11/23 13:20, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Approximation in what sense? My understanding was we'd get a range of
>>>> distances that we know covers all rows in that range. So the results
>>>> should be accurate, no?
>>>
>>> The distance is going to be accurate only to the degree that the
>>> summary can produce accurate distances for the datapoints it
>>> represents. That can be quite imprecise due to the nature of the
>>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>>> have a minimum distance of 0 to the origin, whereas the summary of (-1, 0)
>>> and (-1, 0.5) would have a much more accurate distance of 1.
>>
>> Ummm, I'm probably missing something, or maybe my mental model of this
>> is just wrong, but why would the distance for the second summary be more
>> accurate? Or what does "more accurate" mean?
>>
>> Is that about the range of distances for the summary? For the first
>> range the summary is a bounding box [(-1,-1), (1,1)], so all we know is
>> that the points may have distance in the range [0, sqrt(2)], while for
>> the second summary it's [1, sqrt(1.25)].
> 
> Yes; I was trying to refer to the difference between what results you
> get from the summary vs what results you get from the actual
> datapoints: In this case, for finding points which are closest to the
> origin, the first bounding box has a less accurate estimate than the
> second.
> 

OK. I think regular minmax indexes have a similar issue with
non-distance ordering, because we don't know if the min/max values are
still in the page range (or deleted/updated).

>>> The point I was making is that the summary can only approximate the
>>> distance, and that approximation is fine w.r.t. the BRIN sort
>>> algorithm.
>>>
>>
>> I think as long as the approximation (whatever it means) does not cause
>> differences in results (compared to not using an index), it's OK.
> 

I haven't written any code yet, but I think if we don't try to find the
exact min/max distances for the summary (e.g. by calculating the closest
point exactly) but rather use "estimates" that are guaranteed to bound
the actual min/max, that's good enough for the sorting.

For the max, this probably is not an issue, as we can just calculate the
distance for the corners and use the maximum of that. At least with
reasonable euclidean distance ... in 2D I'm imagining the bounding box
summary as a rectangle, with the "max distance" being the minimum radius
of a circle (centered on the query point) containing it (the rectangle).

For min we're looking for the largest radius that does not intersect the
box, which seems harder to calculate, I think.
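
Although, for an axis-aligned box the min may be simpler than it looks:
clamp the point to the box per axis and measure the residual. A
throwaway SQL sketch (function name hypothetical, purely illustrative;
assumes x1 <= x2 and y1 <= y2):

  create function dist_range(px float8, py float8,
                             x1 float8, y1 float8, x2 float8, y2 float8,
                             out dmin float8, out dmax float8)
  language sql immutable as $$
    select
      -- min: per-axis distance to the interval, 0 if the point is inside
      sqrt(greatest(x1 - px, 0, px - x2)^2 + greatest(y1 - py, 0, py - y2)^2),
      -- max: farthest corner, i.e. per-axis maximum of the edge distances
      sqrt(greatest(abs(px - x1), abs(px - x2))^2
         + greatest(abs(py - y1), abs(py - y2))^2);
  $$;

  select * from dist_range(0, 0, -1, -1, 1, 1);    -- dmin 0, dmax sqrt(2)
  select * from dist_range(0, 0, -1, 0, -1, 0.5);  -- dmin 1, dmax sqrt(1.25)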

However, now that I'm thinking about it - don't (SP-)GiST indexes
already do pretty much exactly this?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
Sergey Dudoladov
Date:
Hello,

> Parallel version is not supported, but I think it should be possible.

@Tomas, are you working on this? If not, I would like to give it a try.

> static void
> AssertCheckRanges(BrinSortState *node)
> {
> #ifdef USE_ASSERT_CHECKING
>
> #endif
> }

I guess it should not be empty at this stage of development.

Attached a small modification of the patch with a draft of the docs.

Regards,
Sergey Dudoladov

Attachment

Re: PATCH: Using BRIN indexes for sorted output

From
Tomas Vondra
Date:

On 8/2/23 17:25, Sergey Dudoladov wrote:
> Hello,
> 
>> Parallel version is not supported, but I think it should be possible.
> 
> @Tomas, are you working on this? If not, I would like to give it a try.
> 

Feel free to try. Just keep it in a separate part/patch, to make it
easier to combine the work later.

>> static void
>> AssertCheckRanges(BrinSortState *node)
>> {
>> #ifdef USE_ASSERT_CHECKING
>>
>> #endif
>> }
> 
> I guess it should not be empty at this stage of development.
> 
> Attached a small modification of the patch with a draft of the docs.
> 

Thanks. FWIW it's generally better to always post the whole patch
series, otherwise the cfbot gets confused as it's unable to combine
stuff from different messages.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: Using BRIN indexes for sorted output

From
vignesh C
Date:
On Wed, 2 Aug 2023 at 21:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 8/2/23 17:25, Sergey Dudoladov wrote:
> > Hello,
> >
> >> Parallel version is not supported, but I think it should be possible.
> >
> > @Tomas, are you working on this? If not, I would like to give it a try.
> >
>
> Feel free to try. Just keep it in a separate part/patch, to make it
> easier to combine the work later.
>
> >> static void
> >> AssertCheckRanges(BrinSortState *node)
> >> {
> >> #ifdef USE_ASSERT_CHECKING
> >>
> >> #endif
> >> }
> >
> > I guess it should not be empty at this stage of development.
> >
> > Attached a small modification of the patch with a draft of the docs.
> >
>
> Thanks. FWIW it's generally better to always post the whole patch
> series, otherwise the cfbot gets confused as it's unable to combine
> stuff from different messages.

Are we planning to take this patch forward? It has been nearly 5
months since the last discussion on this. If interest has gone down
and there are no plans to handle this, I'm thinking of returning this
entry in the current commitfest; it can be reopened when there is more
interest.

Regards,
Vignesh



Re: PATCH: Using BRIN indexes for sorted output

From
vignesh C
Date:
On Sun, 21 Jan 2024 at 07:32, vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 2 Aug 2023 at 21:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >
> >
> >
> > On 8/2/23 17:25, Sergey Dudoladov wrote:
> > > Hello,
> > >
> > >> Parallel version is not supported, but I think it should be possible.
> > >
> > > @Tomas, are you working on this? If not, I would like to give it a try.
> > >
> >
> > Feel free to try. Just keep it in a separate part/patch, to make it
> > easier to combine the work later.
> >
> > >> static void
> > >> AssertCheckRanges(BrinSortState *node)
> > >> {
> > >> #ifdef USE_ASSERT_CHECKING
> > >>
> > >> #endif
> > >> }
> > >
> > > I guess it should not be empty at this stage of development.
> > >
> > > Attached a small modification of the patch with a draft of the docs.
> > >
> >
> > Thanks. FWIW it's generally better to always post the whole patch
> > series, otherwise the cfbot gets confused as it's unable to combine
> > stuff from different messages.
>
> Are we planning to take this patch forward? It has been nearly 5
> months since the last discussion on this. If interest has gone down
> and there are no plans to handle this, I'm thinking of returning this
> entry in the current commitfest; it can be reopened when there is more
> interest.

Since neither the author nor anyone else has shown interest in taking
it forward, and the patch has had no activity for more than 5 months, I
have changed the status to RWF. Feel free to add a new CF entry when
someone plans to resume work more actively, starting off with a rebased
version.

Regards,
Vignesh