Thread: [HACKERS] WIP: BRIN multi-range indexes

[HACKERS] WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi all,

A couple of days ago I've shared a WIP patch [1] implementing BRIN
indexes based on bloom filters. One inherent limitation of that approach
is that it can only support equality conditions - that's perfectly fine
in many cases (e.g. with UUIDs it's rare to use range queries, etc.).

[1]
https://www.postgresql.org/message-id/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com

But in other cases that restriction is pretty unacceptable, e.g. with
timestamps that are queried mostly using range conditions. A common
issue is that while the data is initially well correlated (giving us
nice narrow min/max ranges in the BRIN index), this degrades over time
(typically due to DELETE/UPDATE and then new rows routed to free space).
There are not many options to prevent this, and fixing it pretty much
requires CLUSTER on the table.

This patch addresses this by extending BRIN indexes with a more complex
"summary". Instead of keeping just a single "minmax interval", we
maintain a list of "minmax intervals", which allows us to track "gaps"
in the data.

To illustrate the improvement, consider this table:

    create table a (val float8) with (fillfactor = 90);
    insert into a select i::float from generate_series(1,10000000) s(i);
    update a set val = 1 where random() < 0.01;
    update a set val = 10000000 where random() < 0.01;

This makes the column 'val' almost perfectly correlated with the
position in the table (which would be great for BRIN minmax indexes),
but then about 1% of the values are set to 1 and another 1% to
10,000,000. That means pretty much every range will be [1,10000000],
which makes this BRIN index mostly useless, as illustrated by these
explain plans:

    create index on a using brin (val) with (pages_per_range = 16);

    explain analyze select * from a where val = 100;
                                  QUERY PLAN
    --------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=54.01..10691.02 rows=8 width=8)
                            (actual time=5.901..785.520 rows=1 loops=1)
       Recheck Cond: (val = '100'::double precision)
       Rows Removed by Index Recheck: 9999999
       Heap Blocks: lossy=49020
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..54.00 rows=3400 width=0)
             (actual time=5.792..5.792 rows=490240 loops=1)
             Index Cond: (val = '100'::double precision)
     Planning time: 0.119 ms
     Execution time: 785.583 ms
    (8 rows)

    explain analyze select * from a where val between 100 and 10000;
                                  QUERY PLAN
    ------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=55.94..25132.00 rows=7728 width=8)
                      (actual time=5.939..858.125 rows=9695 loops=1)
       Recheck Cond: ((val >= '100'::double precision) AND
                      (val <= '10000'::double precision))
       Rows Removed by Index Recheck: 9990305
       Heap Blocks: lossy=49020
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..54.01 rows=10200 width=0)
             (actual time=5.831..5.831 rows=490240 loops=1)
             Index Cond: ((val >= '100'::double precision) AND
                          (val <= '10000'::double precision))
     Planning time: 0.139 ms
     Execution time: 871.132 ms
    (8 rows)

Obviously, the queries do scan the whole table and then eliminate most
of the rows in "Index Recheck". Decreasing pages_per_range does not
really make a measurable difference in this case - it eliminates maybe
10% of the rechecks, but most page ranges still have a very wide
minmax range.

With the patch, it looks about like this:

    create index on a using brin (val float8_minmax_multi_ops)
                            with (pages_per_range = 16);

    explain analyze select * from a where val = 100;
                                  QUERY PLAN
    -------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=830.01..11467.02 rows=8 width=8)
                            (actual time=7.772..8.533 rows=1 loops=1)
       Recheck Cond: (val = '100'::double precision)
       Rows Removed by Index Recheck: 3263
       Heap Blocks: lossy=16
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..830.00 rows=3400 width=0)
             (actual time=7.729..7.729 rows=160 loops=1)
             Index Cond: (val = '100'::double precision)
     Planning time: 0.124 ms
     Execution time: 8.580 ms
    (8 rows)


    explain analyze select * from a where val between 100 and 10000;
                                 QUERY PLAN
    ------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=831.94..25908.00 rows=7728 width=8)
                        (actual time=9.318..23.715 rows=9695 loops=1)
       Recheck Cond: ((val >= '100'::double precision) AND
                      (val <= '10000'::double precision))
       Rows Removed by Index Recheck: 3361
       Heap Blocks: lossy=64
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..830.01 rows=10200 width=0)
             (actual time=9.274..9.274 rows=640 loops=1)
             Index Cond: ((val >= '100'::double precision) AND
                          (val <= '10000'::double precision))
     Planning time: 0.138 ms
     Execution time: 36.100 ms
    (8 rows)

That is, the timings drop from 785ms/871ms to 9ms/36ms. The index is a
bit larger (1700kB instead of 150kB), but it's still orders of
magnitude smaller than a btree index (which is ~214MB in this case).

The index build is slower than for regular BRIN indexes (roughly
comparable to btree), but I'm sure it can be significantly improved.
Also, I'm sure it's not bug-free.

Two additional notes:

1) The patch does break the current BRIN indexes, because it requires
passing all ScanKeys to the "consistent" BRIN function at once
(otherwise we couldn't eliminate individual intervals in the summary),
while currently BRIN only deals with one ScanKey at a time. And I
haven't modified the existing brin_minmax_consistent() function (yeah,
I'm lazy, but this should be enough for interested people to try it
out, I believe).

2) It only works with float8 (and float8-based types such as
timestamp) for now, but it should be straightforward to add support for
additional data types. Most of that will be about adding catalog
definitions anyway.


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Apparently I've managed to botch the git format-patch thing :-( Attached
are both patches (the first one adding BRIN bloom indexes, the other one
adding the BRIN multi-range). Hopefully I got it right this time ;-)

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

attached is a patch series that includes both the BRIN multi-range
minmax indexes discussed in this thread, and the BRIN bloom indexes
initially posted in [1].

It seems easier to just deal with a single patch series, although we may
end up adding just one of those proposed opclasses.

There are 4 parts:

0001 - Modifies bringetbitmap() to pass all scan keys to the consistent
function at once. This is actually needed by the multi-minmax indexes,
but not really required for the others.

I'm wondering if this is a safe change, considering it affects the BRIN
interface. I.e. custom BRIN opclasses (perhaps in extensions) will be
broken by this change. Maybe we should extend the BRIN API to support
two versions of the consistent function - one that processes scan keys
one by one, and the other one that processes all of them at once.
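
Just to illustrate the idea (with made-up toy types, not the actual
BRIN structures or API), the dispatch between the two variants could
look roughly like this:

    /*
     * Toy sketch of supporting two "consistent" variants (hypothetical
     * names and types, not the real BRIN API): an opclass may provide a
     * function that sees all scan keys for an attribute at once, or only
     * the traditional one-key-at-a-time function.
     */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct ToyKey { double lo; double hi; } ToyKey;
    typedef struct ToySummary { double min; double max; } ToySummary;

    typedef bool (*consistent_one_fn) (const ToySummary *, const ToyKey *);
    typedef bool (*consistent_all_fn) (const ToySummary *, const ToyKey *, int);

    typedef struct ToyOpclass
    {
        consistent_one_fn consistent_one;   /* classic variant */
        consistent_all_fn consistent_all;   /* new variant, may be NULL */
    } ToyOpclass;

    static bool
    range_consistent(const ToyOpclass *opc, const ToySummary *summary,
                     const ToyKey *keys, int nkeys)
    {
        if (opc->consistent_all != NULL)
            return opc->consistent_all(summary, keys, nkeys);

        /* fall back to ANDing the per-key results together */
        for (int i = 0; i < nkeys; i++)
        {
            if (!opc->consistent_one(summary, &keys[i]))
                return false;
        }
        return true;
    }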

0002 - Adds BRIN bloom indexes, along with opclasses for all built-in
data types (or at least those that also have regular BRIN opclasses).

0003 - Adds BRIN multi-minmax indexes, but only with float8 opclasses
(which also includes timestamp etc.). That should be good enough for
now, but adding support for other data types will require adding some
sort of "distance" operator which is needed for merging ranges (to pick
the two "closest" ones). For float8 it's simply a subtraction.

0004 - Moves dealing with IS [NOT] NULL search keys from opclasses to
bringetbitmap(). The code was exactly the same in all opclasses, so
moving it to bringetbitmap() seems right. It also allows some nice
optimizations where we can skip the consistent() function entirely,
although maybe that's useless. Also, maybe there are opclasses that
actually need to deal with NULL values in the consistent() function?


regards


[1]
https://www.postgresql.org/message-id/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Apparently there was some minor breakage due to duplicate OIDs, so here
is the patch series updated to current master.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Michael Paquier
Date:
On Sun, Nov 19, 2017 at 5:45 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Apparently there was some minor breakage due to duplicate OIDs, so here
> is the patch series updated to current master.

Moved to CF 2018-01.
-- 
Michael


Re: WIP: BRIN multi-range indexes

From
Mark Dilger
Date:
> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Apparently there was some minor breakage due to duplicate OIDs, so here
> is the patch series updated to current master.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
<0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>


After applying these four patches to my copy of master, the regression
tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.

mark


Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 12/19/2017 08:38 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Apparently there was some minor breakage due to duplicate OIDs, so here
>> is the patch series updated to current master.
>>
>> regards
>>
>> -- 
>> Tomas Vondra                  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
<0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
> 
> 
> After applying these four patches to my copy of master, the regression
> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
> 

D'oh! There was an incorrect OID referenced in pg_opclass, which was
also used by the satisfies_hash_partition() function. Fixed patches
attached.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Mark Dilger
Date:
> On Dec 19, 2017, at 5:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>
> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>>
>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> Hi,
>>>
>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>> is the patch series updated to current master.
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra                  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>
<0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>>
>>
>> After applying these four patches to my copy of master, the regression
>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>>
>
> D'oh! There was an incorrect OID referenced in pg_opclass, which was
> also used by the satisfies_hash_partition() function. Fixed patches
> attached.

Thanks!  These fix the regression test failures.  On my mac, all tests are now
passing.  I have not yet looked any further into the merits of these patches,
however.

mark

Re: WIP: BRIN multi-range indexes

From
Mark Dilger
Date:
> On Dec 19, 2017, at 5:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>
> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>>
>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> Hi,
>>>
>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>> is the patch series updated to current master.
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra                  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>
<0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>>
>>
>> After applying these four patches to my copy of master, the regression
>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>>
>
> D'oh! There was an incorrect OID referenced in pg_opclass, which was
> also used by the satisfies_hash_partition() function. Fixed patches
> attached.

Your use of type ScanKey in src/backend/access/brin/brin.c is a bit confusing.  A
ScanKey is defined elsewhere as a pointer to ScanKeyData.  When you define
an array of ScanKeys, you use pointer-to-pointer style:

+   ScanKey   **keys,
+              **nullkeys;

But when you allocate space for the array, you don't treat it that way:

+   keys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
+   nullkeys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);

But then again when you use nullkeys, you treat it as a two-dimensional array:

+   nullkeys[keyattno - 1][nnullkeys[keyattno - 1]] = key;

and likewise when you allocate space within keys:

+    keys[keyattno - 1] = palloc0(sizeof(ScanKey) * scan->numberOfKeys);

Could you please clarify?  I have been awake a bit too long; hopefully, I am
not merely missing the obvious.

mark



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 12/20/2017 03:37 AM, Mark Dilger wrote:
> 
>> On Dec 19, 2017, at 5:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>>
>> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>>>
>>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>>> is the patch series updated to current master.
>>>>
>>>> regards
>>>>
>>>> -- 
>>>> Tomas Vondra                  http://www.2ndQuadrant.com
>>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>>
<0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>>>
>>>
>>> After applying these four patches to my copy of master, the regression
>>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>>>
>>
>> D'oh! There was an incorrect OID referenced in pg_opclass, which was
>> also used by the satisfies_hash_partition() function. Fixed patches
>> attached.
> 
> Your use of type ScanKey in src/backend/access/brin/brin.c is a bit confusing.  A
> ScanKey is defined elsewhere as a pointer to ScanKeyData.  When you define
> an array of ScanKeys, you use pointer-to-pointer style:
> 
> +   ScanKey   **keys,
> +              **nullkeys;
> 
> But when you allocate space for the array, you don't treat it that way:
> 
> +   keys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
> +   nullkeys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
> 
> But then again when you use nullkeys, you treat it as a two-dimensional array:
> 
> +   nullkeys[keyattno - 1][nnullkeys[keyattno - 1]] = key;
> 
> and likewise when you allocate space within keys:
> 
> +    keys[keyattno - 1] = palloc0(sizeof(ScanKey) * scan->numberOfKeys);
> 
> Could you please clarify?  I have been awake a bit too long; hopefully, I am
> not merely missing the obvious.
> 

Yeah, that's wrong - it should be "sizeof(ScanKey *)" instead. It's
harmless, though, because ScanKey itself is a pointer, so the size is
the same.
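
For the record, since ScanKey is a pointer typedef, both expressions
have the same size; a tiny standalone illustration (with a stand-in
struct instead of the real ScanKeyData):

    #include <assert.h>

    typedef struct ScanKeyData { int sk_flags; } ScanKeyData;  /* stand-in */
    typedef ScanKeyData *ScanKey;

    int
    main(void)
    {
        /* both are pointer types, so the allocations end up the same size */
        assert(sizeof(ScanKey) == sizeof(ScanKey *));
        return 0;
    }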

Attached is an updated version of the patch series, significantly
reworking and improving the multi-minmax part (the rest of the patch is
mostly as it was before).

I've refactored and cleaned up the multi-minmax code, and I'm much
happier with it - no doubt there's room for further improvement, but
overall it's much better.

I've also added proper sgml docs for this part, and support for more
data types including variable-length ones (all integer types, numeric,
float-based types including timestamps, uuid, and a couple of others).

At the API level, I needed to add one extra support procedure that
measures the distance between two values of the data type. This is needed
because we only keep a limited number of intervals for each range, and
once in a while we need to decide which of them to "merge" (and we
simply merge the closest ones).

I've passed the indexes through significant testing and fixed a couple
of silly bugs / memory leaks. Let's see if there are more.

Performance-wise, the CREATE INDEX seems a bit slow - it's about an
order of magnitude slower than regular BRIN. Some of that is expected
(we simply need to do more stuff to maintain multiple ranges), but
perhaps there's room for additional improvements - that's something I'd
like to work on next.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
This stuff sounds pretty nice.  However, have a look at this report:

https://codecov.io/gh/postgresql-cfbot/postgresql/commit/2aa632dae3066900e15d2d42a4aad811dec11f08

it seems to me that the new code is not tested at all.  Shouldn't you
add a few more tests?

I think 0004 should apply to unpatched master (except for the parts that
concern files not in master); sounds like a good candidate for first
apply.  Then 0001, which seems mostly just refactoring.  0002 and 0003
are the really interesting ones (minus the code removed by 0004).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 01/23/2018 09:05 PM, Alvaro Herrera wrote:
> This stuff sounds pretty nice.  However, have a look at this report:
> 
> https://codecov.io/gh/postgresql-cfbot/postgresql/commit/2aa632dae3066900e15d2d42a4aad811dec11f08
> 
> it seems to me that the new code is not tested at all.  Shouldn't you
> add a few more tests?
> 

I have a hard time reading the report, but you're right I haven't added
any tests for the new opclasses (bloom and minmax_multi). I agree that's
something that needs to be addressed.

> I think 0004 should apply to unpatched master (except for the parts
> that concern files not in master); sounds like a good candidate for
> first apply. Then 0001, which seems mostly just refactoring. 0002 and
> 0003 are the really interesting ones (minus the code removed by
> 0004).
> 

That sounds like a reasonable plan. I'll reorder the patch series along
those lines in the next few days.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 01/23/2018 10:07 PM, Tomas Vondra wrote:
> 
> 
> On 01/23/2018 09:05 PM, Alvaro Herrera wrote:
>> This stuff sounds pretty nice.  However, have a look at this report:
>>
>> https://codecov.io/gh/postgresql-cfbot/postgresql/commit/2aa632dae3066900e15d2d42a4aad811dec11f08
>>
>> it seems to me that the new code is not tested at all.  Shouldn't you
>> add a few more tests?
>>
> 
> I have a hard time reading the report, but you're right I haven't added
> any tests for the new opclasses (bloom and minmax_multi). I agree that's
> something that needs to be addressed.
> 
>> I think 0004 should apply to unpatched master (except for the parts
>> that concern files not in master); sounds like a good candidate for
>> first apply. Then 0001, which seems mostly just refactoring. 0002 and
>> 0003 are the really interesting ones (minus the code removed by
>> 0004).
>>
> 
> That sounds like a reasonable plan. I'll reorder the patch series along
> those lines in the next few days.
> 

And here we go. Attached is a reworked patch series that moves the IS
NULL tweak to the beginning of the series, and also adds proper
regression tests both for the bloom and multi-minmax opclasses. I've
simply copied the brin.sql tests and tweaked them for the new opclasses.

I've also added a bunch of missing multi-minmax opclasses. At this point
all data types that have a minmax opclass should also have a multi-minmax
one, except for these types:

* bytea
* char
* name
* text
* bpchar
* bit
* varbit

The reason is that I'm not quite sure how to define the 'distance'
function, which is needed when picking ranges to merge when
building/updating the index.

BTW while working on the regression tests, I've noticed that brin.sql
fails to test a couple of minmax opclasses (e.g. abstime/reltime). Is
that intentional or is that something we should fix eventually?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Mark Dilger
Date:
> BTW while working on the regression tests, I've noticed that brin.sql
> fails to test a couple of minmax opclasses (e.g. abstime/reltime). Is
> that intentional or is that something we should fix eventually?

I believe abstime/reltime are deprecated.  Perhaps nobody wanted to
bother adding test coverage for deprecated classes?  There was another
thread that discussed removing these types.  The consensus seemed to
be in favor of removing them, though I have not seen a patch for that yet.

mark



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 02/05/2018 09:27 PM, Mark Dilger wrote:
> 
>> BTW while working on the regression tests, I've noticed that brin.sql
>> fails to test a couple of minmax opclasses (e.g. abstime/reltime). Is
>> that intentional or is that something we should fix eventually?
> 
> I believe abstime/reltime are deprecated.  Perhaps nobody wanted to
> bother adding test coverage for deprecated classes?  There was another
> thread that discussed removing these types.  The consensus seemed to
> be in favor of removing them, though I have not seen a patch for that yet.
> 

Yeah, that's what I've been wondering about too. There's also this
comment in nabstime.h:

/*
 * Although time_t generally is a long int on 64 bit systems, these two
 * types must be 4 bytes, because that's what pg_type.h assumes. They
 * should be yanked (long) before 2038 and be replaced by timestamp and
 * interval.
 */

But then why add BRIN opclasses at all? And if we add them, why not
test them? We all know how long deprecation takes, particularly for
data types.

For me the question is whether to bother with adding the multi-minmax
opclasses, of course.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Yeah, that's what I've been wondering about too. There's also this
> comment in nabstime.h:

> /*
>  * Although time_t generally is a long int on 64 bit systems, these two
>  * types must be 4 bytes, because that's what pg_type.h assumes. They
>  * should be yanked (long) before 2038 and be replaced by timestamp and
>  * interval.
>  */

> But then why add BRIN opclasses at all? And if we add them, why not
> test them? We all know how long deprecation takes, particularly for
> data types.

There was some pretty recent chatter about removing these types; IIRC
Andres was annoyed about their lack of overflow checks.

I would definitely vote against adding any BRIN support for these types,
or indeed doing any work on them at all other than removal.

            regards, tom lane


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 02/06/2018 12:40 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Yeah, that's what I've been wondering about too. There's also this
>> comment in nabstime.h:
> 
>> /*
>>  * Although time_t generally is a long int on 64 bit systems, these two
>>  * types must be 4 bytes, because that's what pg_type.h assumes. They
>>  * should be yanked (long) before 2038 and be replaced by timestamp and
>>  * interval.
>>  */
> 
>> But then why add BRIN opclasses at all? And if we add them, why not
>> test them? We all know how long deprecation takes, particularly for
>> data types.
> 
> There was some pretty recent chatter about removing these types;
> IIRC Andres was annoyed about their lack of overflow checks.
> 
> I would definitely vote against adding any BRIN support for these
> types, or indeed doing any work on them at all other than removal.
> 

Works for me. Ripping out the two opclasses from the patch is trivial.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Attached is an updated patch series, fixing duplicate OIDs and removing
opclasses for reltime/abstime data types, as discussed.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Andres Freund
Date:
Hi,

On 2018-02-25 01:30:47 +0100, Tomas Vondra wrote:
> Note: Currently, this only works with float8-based data types.
> Supporting additional data types is not a big issue, but will
> require extending the opclass with "subtract" operator (used to
> compute distance between values when merging ranges).

Based on Tom's past stances I'm a bit doubtful he'd be happy with such a
restriction.  Note that something similar-ish also has come up in
0a459cec96.

I kinda wonder if there's any way to not have two similar but not equal
types of logic here?

    That problem is
    resolved here by adding the ability for btree operator classes to provide
    an "in_range" support function that defines how to add or subtract the
    RANGE offset value.  Factoring it this way also allows the operator class
    to avoid overflow problems near the ends of the datatype's range, if it
    wishes to expend effort on that.  (In the committed patch, the integer
    opclasses handle that issue, but it did not seem worth the trouble to
    avoid overflow failures for datetime types.)


- Andres


Re: WIP: BRIN multi-range indexes

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-02-25 01:30:47 +0100, Tomas Vondra wrote:
>> Note: Currently, this only works with float8-based data types.
>> Supporting additional data types is not a big issue, but will
>> require extending the opclass with "subtract" operator (used to
>> compute distance between values when merging ranges).

> Based on Tom's past stances I'm a bit doubtful he'd be happy with such a
> restriction.  Note that something similar-ish also has come up in
> 0a459cec96.

> I kinda wonder if there's any way to not have two similar but not equal
> types of logic here?

Hm.  I wonder what the patch intends to do with subtraction overflow,
or infinities, or NaNs.  Just as with the RANGE patch, it does not
seem to me that failure is really an acceptable option.  Indexes are
supposed to be able to index whatever the column datatype can store.

            regards, tom lane


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 03/02/2018 05:08 AM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-02-25 01:30:47 +0100, Tomas Vondra wrote:
>>> Note: Currently, this only works with float8-based data types.
>>> Supporting additional data types is not a big issue, but will
>>> require extending the opclass with "subtract" operator (used to
>>> compute distance between values when merging ranges).
> 
>> Based on Tom's past stances I'm a bit doubtful he'd be happy with such a
>> restriction.  Note that something similar-ish also has come up in
>> 0a459cec96.
> 

That restriction was lifted quite a long time ago, so now both index
types support pretty much the same data types as the original BRIN (with
the reltime/abstime exception, discussed in this thread earlier).

>> I kinda wonder if there's any way to not have two similar but not
>> equal types of logic here?
> 
> Hm. I wonder what the patch intends to do with subtraction overflow, 
> or infinities, or NaNs. Just as with the RANGE patch, it does not 
> seem to me that failure is really an acceptable option. Indexes are 
> supposed to be able to index whatever the column datatype can store.
> 

I admit that's something I haven't thought about very much. I'll look
into that, of course, but the indexes are only using the deltas to pick
which ranges to merge, so I think in the worst case it may result in a
sub-optimal index. But let me check what the RANGE patch did.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Attached is a patch version fixing breakage due to pg_proc changes
commited in fd1a421fe661.

On 03/02/2018 05:08 AM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-02-25 01:30:47 +0100, Tomas Vondra wrote:
>>> Note: Currently, this only works with float8-based data types.
>>> Supporting additional data types is not a big issue, but will
>>> require extending the opclass with "subtract" operator (used to
>>> compute distance between values when merging ranges).
> 
>> Based on Tom's past stances I'm a bit doubtful he'd be happy with 
>> such a restriction. Note that something similar-ish also has come 
>> up in 0a459cec96.
> 
>> I kinda wonder if there's any way to not have two similar but not 
>> equal types of logic here?
> 

I don't think it's very similar to what 0a459cec96 is doing. It's true
both deal with ranges of values, but that's about it - I don't see how
this patch could reuse some bits from 0a459cec96.

To elaborate, 0a459cec96 only really needs to know "does this value fall
into this range" while this patch needs to compare ranges by length.
That is, given a bunch of ranges (summary of values for a section of a
table), it needs to decide which ranges to merge - and it picks the
ranges with the smallest gap.

So for example with ranges [1,10], [15,20], [30,200], [250,300] it would
merge [1,10] and [15,20] because the gap between them is only 5, which
is shorter than the other gaps. This is used when the summary for a
range of pages gets "full" (the patch only keeps up to 32 ranges or so).

Not sure how I could reuse 0a459cec96 to do this.
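
Just to make the merging rule concrete, a toy sketch of the gap-picking
step (made-up types, not the actual implementation):

    /*
     * Toy sketch: given ranges sorted by lower bound, return the index i
     * such that the gap between ranges[i] and ranges[i+1] is the smallest,
     * i.e. the pair we'd merge first once the summary is full.
     * Assumes nranges >= 2.
     */
    typedef struct ToyRange { double lo; double hi; } ToyRange;

    static int
    pick_merge_pair(const ToyRange *ranges, int nranges)
    {
        int     best = 0;
        double  best_gap = ranges[1].lo - ranges[0].hi;

        for (int i = 1; i < nranges - 1; i++)
        {
            double  gap = ranges[i + 1].lo - ranges[i].hi;

            if (gap < best_gap)
            {
                best = i;
                best_gap = gap;
            }
        }

        /* e.g. {1,10},{15,20},{30,200},{250,300} => returns 0 (gap of 5) */
        return best;
    }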

> Hm. I wonder what the patch intends to do with subtraction overflow, 
> or infinities, or NaNs. Just as with the RANGE patch, it does not 
> seem to me that failure is really an acceptable option. Indexes are 
> supposed to be able to index whatever the column datatype can store.
> 

I've been thinking about this after looking at 0a459cec96, and I don't
think this patch has the same issues. One reason is that just like the
original minmax opclass, it does not really mess with the data it
stores. It only does min/max on the values, and stores that, so if there
was NaN or Infinity, it will index NaN or Infinity.

The subtraction is used only to decide which ranges to merge first, and
if the subtraction returns Infinity/NaN the ranges will be considered
very distant and merged last. Which is pretty much the desired behavior,
because it means -Infinity, Infinity and NaN will be keps as individual
"points" as long as possible.

Perhaps there is some other danger/thinko here, that I don't see?


The one overflow issue I found in the patch is that the numeric
"distance" function does this:

    d = DirectFunctionCall2(numeric_sub, a2, a1);    /* a2 - a1 */

    PG_RETURN_FLOAT8(DirectFunctionCall1(numeric_float8, d));

which can overflow, of course. But that is not fatal - the index may get
inefficient due to non-optimal merging of ranges, but it will still
return correct results. But I think this can be easily improved by
passing not only the two values, but also minimum and maximum, and use
that to normalize the values to [0,1].


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 03/04/2018 01:14 AM, Tomas Vondra wrote:
> ...
> 
> The one overflow issue I found in the patch is that the numeric
> "distance" function does this:
> 
>     d = DirectFunctionCall2(numeric_sub, a2, a1);    /* a2 - a1 */
> 
>     PG_RETURN_FLOAT8(DirectFunctionCall1(numeric_float8, d));
> 
> which can overflow, of course. But that is not fatal - the index may get
> inefficient due to non-optimal merging of ranges, but it will still
> return correct results. But I think this can be easily improved by
> passing not only the two values, but also minimum and maximum, and use
> that to normalize the values to [0,1].
> 

Attached is an updated patch series, addressing this possible overflow
the way I proposed - by computing (a2 - a1) / (b2 - b1), which is
guaranteed to produce a value between 0 and 1.

The two new arguments are ignored for most "distance" functions, because
those can't overflow or underflow in double precision AFAICS.
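
For illustration, the normalization boils down to this (a toy
double-based sketch of the formula only, not the actual per-type
implementations):

    /*
     * Toy sketch of the normalized distance: instead of (a2 - a1), return
     * (a2 - a1) / (b2 - b1), where [b1, b2] covers all values in the
     * summary, so the result always falls into [0, 1].
     */
    static double
    toy_normalized_distance(double a1, double a2, double b1, double b2)
    {
        if (b2 <= b1)               /* degenerate case: all values equal */
            return 0.0;

        return (a2 - a1) / (b2 - b1);
    }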

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

attached is updated and slightly improved version of the two BRIN
opclasses (bloom and multi-range minmax). Given the lack of reviews I
think it's likely to get bumped to 2018-09, which I guess is OK - it
surely needs more feedback regarding some decisions. So let me share
some thoughts about those, before I forget all of it, and some test
results showing the pros/cons of those indexes.


1) index parameters

The main improvement of this version is an introduction of a couple of
BRIN index parameters, next to pages_per_range and autosummarize.

a) n_distinct_per_range - used to size Bloom index
b) false_positive_rate - used to size Bloom index
c) values_per_range - number of values in the minmax-multi summary

Until now those parameters were pretty much hard-coded; this allows easy
customization depending on the data set. There are some basic rules to
clamp the values (e.g. not to allow ndistinct to be less than 128 or
more than MaxHeapTuplesPerPage * pages_per_range), but that's about it.
I'm sure we could devise more elaborate heuristics (e.g. when building
index on an existing table, we could inspect table statistics first),
but the patch does not do that.

One disadvantage is that those parameters are per-index. It's possible
to define a multi-column BRIN index, possibly with different opclasses:

  CREATE INDEX ON t USING brin (a int4_bloom_ops,
                                b int8_bloom_ops,
                                c int4_minmax_multi_ops,
                                d int8_minmax_multi_ops)
  WITH (false_positive_rate = 0.01,
        n_distinct_per_range = 1024,
        values_per_range = 32);

in which case the parameters apply to all columns (with the relevant
opclass type). So for example false_positive_rate applies to both "a"
and "b".

This is somewhat unfortunate, but I don't think it's worth inventing a
more complex solution. If you need to specify different parameters, you
can simply build separate indexes, which is more practical anyway
because all the summaries must fit on the same index page, which limits
the per-column space. So people are more likely to define single-column
bloom indexes anyway.

There's room for improvement when it comes to validating the
parameters. For example, it's possible to specify parameters that would
produce bloom filters larger than 8kB, which may lead to over-sized
index rows later. For minmax-multi indexes this should be relatively
safe (maximum number of values is 256, which is low enough for all
fixed-length types). Of course, varlena columns can break it, but we
can't really validate those anyway.


2) test results

The attached spreadsheet shows results comparing these opclasses to
existing BRIN indexes, and also to BTREE/GIN. Clearly, the datasets were
picked to show advantages of those approaches, e.g. on data sets where
regular minmax fails to deliver any benefits.

Overall I think it looks nice - the indexes are larger than minmax
(expected, the summaries are larger), but still orders of magnitude
smaller than BTREE or even GIN. For bloom the build time is comparable
to minmax, for minmax-multi it's somewhat slower - again, I'm sure
there's room for improvements.

For query performance, it's clearly better than plain minmax (but well,
the datasets were constructed to demonstrate that, so no surprise here).

One interesting thing I hadn't realized initially is the relationship
between the false positive rate of Bloom indexes and the fraction of the
table scanned by a query on average. Essentially, a bloom index with 1%
false positive rate is expected to scan about 1% of the table on
average. That pretty accurately determines the performance of bloom
indexes.
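
To put that into a simple formula (just an illustration of the argument
above, with the fraction of genuinely matching ranges as an extra
input):

    /*
     * Rough estimate of the fraction of heap scanned by an equality query:
     * ranges that genuinely contain the value are always scanned, and each
     * remaining range still passes the filter with probability "fpr".
     */
    static double
    toy_fraction_scanned(double fpr, double matching_fraction)
    {
        /* e.g. fpr = 0.01 and a rare value  =>  roughly 1% of the heap */
        return matching_fraction + (1.0 - matching_fraction) * fpr;
    }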



regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Attached is rebased version of this BRIN patch series, fixing mostly the
breakage due to 372728b0 (aka initial-catalog-data format changes). As
2018-07 CF is meant for almost-ready patches, this is more a 2018-09
material. But perhaps someone would like to take a look - and I'd have
to fix it anyway ...

At the pgcon dev meeting I suggested that long-running patches should
have a "summary" post once in a while, so that reviewers don't have to
reread the whole thread and follow all the various discussions. So let
me start with this thread, although it's not a particularly long or
complex one, nor does it have a long discussion. But anyway ...


The patches introduce two new BRIN opclasses - minmax-multi and bloom.


minmax-multi
============

minmax-multi is a variant of the current minmax opclass that handles
cases where the plain minmax opclass degrades due to outlier values.

Imagine almost perfectly correlated data (say, timestamps in a log
table) - that works great with regular minmax indexes. But if you go and
delete a bunch of historical messages (for whatever reason), new rows
with new timestamps will be routed to the empty space and the minmax
indexes will degrade because the ranges will get much "wider" due to the
new values.

The minmax-multi indexes deal with that by maintaining not a single
minmax range, but several of them. That allows tracking the outlier
values separately, without constructing one wide minmax range.

Consider this artificial example:

    create table t (a bigint, b int);

    alter table t set (fillfactor=95);

    insert into t select i + 1000*random(), i+1000*random()
     from generate_series(1,100000000) s(i);

    update t set a = 1, b = 1 where random() < 0.001;
    update t set a = 100000000, b = 100000000 where random() < 0.001;

Now if you create a regular minmax index, it's going to perform
terribly, because pretty much every minmax range is [1,100000000] thanks
to the update of 0.1% of rows.

    create index on t using brin (a);

    explain analyze select * from t
    where a between 1923300::int and 1923600::int;

                                  QUERY PLAN
  -----------------------------------------------------------------
   Bitmap Heap Scan on t  (cost=75.11..75884.45 rows=319 width=12)
                (actual time=948.906..101739.892 rows=308 loops=1)
     Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
     Rows Removed by Index Recheck: 99999692
     Heap Blocks: lossy=568182
     ->  Bitmap Index Scan on t_a_idx  (cost=0.00..75.03 rows=22587
          width=0) (actual time=89.357..89.357 rows=5681920 loops=1)
           Index Cond: ((a >= 1923300) AND (a <= 1923600))
   Planning Time: 2.161 ms
   Execution Time: 101740.776 ms
  (8 rows)

But with the minmax-multi opclass, this is not an issue:

    create index on t using brin (a int8_minmax_multi_ops);

    explain analyze select * from t
    where a between 1923300::int and 1923600::int;

                                  QUERY PLAN
  -------------------------------------------------------------------
   Bitmap Heap Scan on t  (cost=1067.11..76876.45 rows=319 width=12)
                       (actual time=38.906..49.763 rows=308 loops=1)
     Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
     Rows Removed by Index Recheck: 22220
     Heap Blocks: lossy=128
     ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1067.03 rows=22587
               width=0) (actual time=28.069..28.069 rows=1280 loops=1)
           Index Cond: ((a >= 1923300) AND (a <= 1923600))
   Planning Time: 1.715 ms
   Execution Time: 50.866 ms
  (8 rows)

Which is clearly a big improvement.

Doing this required some changes to how BRIN evaluates conditions on
page ranges. With a single minmax range it was enough to evaluate them
one by one, but minmax-multi needs to see all of them at once (to match
them against the partial ranges).

Most of the complexity is in building the summary, particularly picking
which values (partial ranges) to merge. The max number of values in the
summary is specified as values_per_range index reloption, and by default
it's set to 64, so there can be either 64 points or 32 intervals or some
combination of those.

I've been thinking about some automated way to tune this (either
globally or for each page range independently), but so far I have not
been very successful. The challenge is that making good decisions
requires global information about values in the column (e.g. global
minimum and maximum). I think the reloption with 64 as a default is a
good enough solution for now.

Perhaps the stats from pg_statistic would be useful for improving this
in the future, but I'm not sure.


bloom
=====

As the name suggests, this opclass uses bloom filter for the summary.
Compared to the minmax-multi it's a bit more of an experimental idea,
but I believe the foundations are safe.

Using a bloom filter means that the index can only support equalities, but
for many use cases that's an acceptable limitation - UUID, IP addresses,
... (various identifiers in general).

Of course, the question is how to size the bloom filter. It's worth
noting that the false positive rate of the filter is essentially the
fraction of the table that will be scanned every time.

Similarly to the minmax-multi case, the parameters for computing the
optimal filter size are set as reloptions (false_positive_rate,
n_distinct_per_range) with some reasonable defaults (1% false positive
rate, and a number of distinct values equal to 10% of the maximum heap
tuples in a page range).
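
For reference, the textbook bloom filter sizing these parameters feed
into looks like this (an illustration, not necessarily the exact code
in the patch):

    #include <math.h>

    /*
     * Standard bloom filter sizing: for n distinct values and a target
     * false positive rate p, use m bits and k hash functions.
     */
    static void
    toy_bloom_size(double ndistinct, double fpr, long *nbits, int *nhashes)
    {
        double  m = ceil(-(ndistinct * log(fpr)) / (log(2.0) * log(2.0)));
        double  k = rint((m / ndistinct) * log(2.0));

        /* e.g. ndistinct = 1000, fpr = 0.01  =>  m ~ 9586 bits, k = 7 */
        *nbits = (long) m;
        *nhashes = (int) k;
    }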

Note: When building the filter, we don't compute the bloom hashes
directly from the original values; we first use the type-specific hash
function (the same one we'd use for hash indexes or hash joins) and then
use that hash as the input for the bloom filter. This generally works
fine, but if "our" hash function generates a lot of collisions, it
increases the false positive rate of the whole filter. I'm not aware of
a case where this would be an issue, though.
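
Schematically (the exact scheme for deriving the k bit positions from
that single hash is an implementation detail; the double hashing shown
here is just one common choice):

    #include <stdint.h>

    /*
     * Toy sketch: the datatype's own hash function gives one 32-bit hash,
     * and the k bloom probe positions are derived from it, here via the
     * common h1 + i*h2 double-hashing trick.
     */
    static void
    toy_bloom_add_hash(uint8_t *filter, long nbits, int nhashes, uint32_t hash)
    {
        uint32_t    h1 = hash;
        uint32_t    h2 = hash * 0x85ebca6b + 1;     /* cheap secondary hash */

        for (int i = 0; i < nhashes; i++)
        {
            uint64_t    bit = (h1 + (uint64_t) i * h2) % (uint64_t) nbits;

            filter[bit / 8] |= (uint8_t) (1 << (bit % 8));
        }
    }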

What further complicates sizing of the bloom filter is available space -
the whole bloom filter needs to fit onto an 8kB page, and "full" bloom
filters with about 1/2 the bits set are pretty non-compressible. So
there's maybe ~8000 bytes for the bitmap. So for columns with many
distinct values, it may be necessary to make the page range smaller, to
reduce the number of distinct values in it.

And of course it requires good ndistinct estimates, not just for the
column as a whole, but for a single page range (because that's what
matters for sizing the bloom filter). Which is not a particularly
reliable estimate, I'm afraid.

So reloptions seem like a sufficient solution, at least for now.


open questions
==============

* I suspect the definitions of the cross-type opclasses (int2 vs. int8)
are not entirely correct. That probably needs another look.

* The bloom filter now works in two modes - sorted (where it stores the
hashes directly) and hashed (the usual bloom filter behavior). The idea
is that for ranges with significantly fewer distinct values, we only
store the hashes themselves to save space (instead of allocating the
whole bloom filter with mostly 0 bits).
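
A toy sketch of what that mode switch might look like (made-up names
and sizes, reusing the toy_bloom_add_hash() sketch above):

    #include <stdbool.h>
    #include <stdint.h>

    #define TOY_MAX_EXACT   64          /* hashes kept verbatim */

    typedef struct ToyBloom
    {
        bool        hashed;             /* false = exact hashes, true = bitmap */
        int         nvalues;
        uint32_t    values[TOY_MAX_EXACT];
        uint8_t     bitmap[1024];       /* used once hashed == true */
    } ToyBloom;

    /* assumes a zero-initialized ToyBloom */
    static void
    toy_bloom_add(ToyBloom *f, uint32_t hash)
    {
        if (!f->hashed && f->nvalues < TOY_MAX_EXACT)
        {
            f->values[f->nvalues++] = hash;     /* real code keeps these sorted */
            return;
        }

        if (!f->hashed)
        {
            /* switch modes: re-insert the stored hashes into the bitmap */
            f->hashed = true;
            for (int i = 0; i < f->nvalues; i++)
                toy_bloom_add_hash(f->bitmap, 8192, 7, f->values[i]);
        }

        toy_bloom_add_hash(f->bitmap, 8192, 7, hash);
    }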


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Thomas Munro
Date:
On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is rebased version of this BRIN patch series, fixing mostly the
> breakage due to 372728b0 (aka initial-catalog-data format changes). As
> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09
> material. But perhaps someone would like to take a look - and I'd have
> to fix it anyway ...

Hi Tomas,

FYI Windows doesn't like this:

  src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round'
undefined; assuming extern returning int
[C:\projects\postgresql\postgres.vcxproj]

  brin_bloom.obj : error LNK2019: unresolved external symbol round
referenced in function bloom_init
[C:\projects\postgresql\postgres.vcxproj]

-- 
Thomas Munro
http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 06/24/2018 11:39 PM, Thomas Munro wrote:
> On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is rebased version of this BRIN patch series, fixing mostly the
>> breakage due to 372728b0 (aka initial-catalog-data format changes). As
>> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09
>> material. But perhaps someone would like to take a look - and I'd have
>> to fix it anyway ...
> 
> Hi Tomas,
> 
> FYI Windows doesn't like this:
> 
>   src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round'
> undefined; assuming extern returning int
> [C:\projects\postgresql\postgres.vcxproj]
> 
>   brin_bloom.obj : error LNK2019: unresolved external symbol round
> referenced in function bloom_init
> [C:\projects\postgresql\postgres.vcxproj]
> 

Thanks, I've noticed the failure before, but wasn't sure what the exact
cause was. It seems there's still no 'round' on Windows, so I'll
probably fix that by using rint() instead, or something like that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 06/25/2018 12:31 AM, Tomas Vondra wrote:
> On 06/24/2018 11:39 PM, Thomas Munro wrote:
>> On Sun, Jun 24, 2018 at 2:01 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Attached is rebased version of this BRIN patch series, fixing mostly the
>>> breakage due to 372728b0 (aka initial-catalog-data format changes). As
>>> 2018-07 CF is meant for almost-ready patches, this is more a 2018-09
>>> material. But perhaps someone would like to take a look - and I'd have
>>> to fix it anyway ...
>>
>> Hi Tomas,
>>
>> FYI Windows doesn't like this:
>>
>>   src/backend/access/brin/brin_bloom.c(146): warning C4013: 'round'
>> undefined; assuming extern returning int
>> [C:\projects\postgresql\postgres.vcxproj]
>>
>>   brin_bloom.obj : error LNK2019: unresolved external symbol round
>> referenced in function bloom_init
>> [C:\projects\postgresql\postgres.vcxproj]
>>
> 
> Thanks, I've noticed the failure before, but was not sure what's the
> exact cause. It seems there's still no 'round' on Windows, so I'll
> probably fix that by using rint() instead, or something like that.
> 

OK, here is a version tweaked to use floor()/ceil() instead of round().
Let's see if the Windows machine likes that more.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Michael Paquier
Date:
Hi Tomas,

On Mon, Jun 25, 2018 at 02:14:20AM +0200, Tomas Vondra wrote:
> OK, here is a version tweaked to use floor()/ceil() instead of round().
> Let's see if the Windows machine likes that more.

The latest patch set does not apply cleanly.  Could you rebase it?  I
have moved the patch to CF 2018-10 for now, waiting on author.
--
Michael

Attachment

Re: WIP: BRIN multi-range indexes

From
Michael Paquier
Date:
On Tue, Oct 02, 2018 at 11:49:05AM +0900, Michael Paquier wrote:
> The latest patch set does not apply cleanly.  Could you rebase it?  I
> have moved the patch to CF 2018-10 for now, waiting on author.

It's been some time since that request, so I am marking the patch as
returned with feedback.
--
Michael

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 2/4/19 6:54 AM, Michael Paquier wrote:
> On Tue, Oct 02, 2018 at 11:49:05AM +0900, Michael Paquier wrote:
>> The latest patch set does not apply cleanly.  Could you rebase it?  I
>> have moved the patch to CF 2018-10 for now, waiting on author.
> 
> It's been some time since that request, so I am marking the patch as
> returned with feedback. 

But that's not the most recent version of the patch. On 28/12 I
submitted an updated/rebased patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2019-Feb-04, Tomas Vondra wrote:

> On 2/4/19 6:54 AM, Michael Paquier wrote:
> > On Tue, Oct 02, 2018 at 11:49:05AM +0900, Michael Paquier wrote:
> >> The latest patch set does not apply cleanly.  Could you rebase it?  I
> >> have moved the patch to CF 2018-10 for now, waiting on author.
> > 
> > It's been some time since that request, so I am marking the patch as
> > returned with feedback. 
> 
> But that's not the most recent version of the patch. On 28/12 I've
> submitted an updated / rebased patch.

Moved to next commitfest instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Apparently cputube did not pick up the last version of the patches I
submitted in December (and I don't see the message in the thread in the
archive either), so it's listed as broken.

So here we go again, hopefully this time everything will go through ...

regards

On 12/28/18 12:45 AM, Tomas Vondra wrote:
> Hi all,
> 
> Attached is an updated/rebased version of the patch series. There are no
> changes to behavior, but let me briefly summarize the current state:
> 
> 0001 and 0002
> -------------
> 
> The first two parts are "just" refactoring the existing code to pass all
> scankeys to the opclass at once - this is needed by the new minmax-like
> opclass, but per discussion with Alvaro it seems worthwhile even
> independently. I tend to agree with that. Similarly for the second part,
> which moves all IS NULL checks entirely to bringetbitmap().
> 
> 0003 bloom opclass
> ------------------
> 
> The first new opclasss, based on bloom filters. For each page range
> (i.e. 1MB by default) a small bloom filter is built (with hash values of
> the original values as inputs), and then used to evaluate equality
> queries. A small optimization is that initially the actual (hash) values
> are kept until reaching the bloom filter size. This improves behavior in
> low-cardinality data sets.
> 
> Picking the bloom filter parameters is the tricky part - we don't have a
> reliable source of such information (namely number of distinct values
> per range), and e.g. the false positive rate actually has to be picked
> by the user because it's a compromise between index size and accuracy.
> Essentially, false positive rate is the fraction of the table that has
> to be scanned for a random value (on average). But it also makes the
> index larger, because the per-range bloom filters will be larger.
> 
> Another reason why this needs to be defined by the user is that the
> space for index tuple is limited by one page (8kB by default), so we
> can't allow the bloom filter to be larger (we have to assume it's
> non-compressible, because in the optimal fill it's 50% 0s and 1s). But
> the BRIN index may be multi-column, and the limit applies to the whole
> tuple. And we don't know what the opclasses or parameters of other
> columns are.
> 
> So the patch simply adds two reloptions
> 
> a) n_distinct_per_range - number of distinct values per range
> b) false_positive_rate - false positive rate of the filter
> 
> There are some simple heuristics to ensure the values are reasonable
> (e.g. upper limit for number of distinct values, etc.) and perhaps we
> might consider stats from the underlying table (when not empty), but the
> patch does not do that.
> 
> 
> 0004 multi-minmax opclass
> -------------------------
> 
> The second opclass addresses a common issue for minmax indexes, where
> the table is initially nicely correlated with the index, and it works
> fine. But then deletes/updates route data into other parts of the table
> making the ranges very wide ad rendering the BRIN index inefficient.
> 
> One way to improve this would be to consider the index(es) while
> routing the new tuple, i.e. looking not only for a page with enough free
> space, but for pages in already matching ranges (or close to them).
> 
> Partitioning is a possible approach to segregate the data, but it has
> certainly much higher overhead, both in terms of maintenance and
> planning (particularly with a 1:1 mapping of ranges to partitions).
> 
> So the new multi-minmax opclass takes a different approach, replacing
> the one minmax range with multiple ranges (64 boundary values or 32
> ranges by default). Initially individual values are stored, and after
> reaching the maximum number of values the values are merged into ranges
> by distance. This allows handling outliers very efficiently, because
> they will not be merged with the "main" range for as long as possible.
> 
> Similarly to the bloom opclass, the main challenge here is deciding the
> parameter - in this case, it's "number of values per range". Again, it's
> a compromise vs. index size and efficiency. The default (64 values) is
> fairly reasonable, but ultimately it's up to the user - there is a new
> reloption "values_per_range".
> 
> 
> 
> regards
> 

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
Hi!

I'm starting to look at this patchset.  In general, I think it's
very cool!  We definitely need this.

On Tue, Apr 3, 2018 at 10:51 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 1) index parameters
>
> The main improvement of this version is an introduction of a couple of
> BRIN index parameters, next to pages_per_range and autosummarize.
>
> a) n_distinct_per_range - used to size Bloom index
> b) false_positive_rate - used to size Bloom index
> c) values_per_range - number of values in the minmax-multi summary
>
> Until now those parameters were pretty much hard-coded; this allows easy
> customization depending on the data set. There are some basic rules to
> clamp the values (e.g. not to allow ndistinct to be less than 128 or
> more than MaxHeapTuplesPerPage * pages_per_range), but that's about it.
> I'm sure we could devise more elaborate heuristics (e.g. when building
> index on an existing table, we could inspect table statistics first),
> but the patch does not do that.
>
> One disadvantage is that those parameters are per-index.

For me, the main disadvantage of this solution is that we put
opclass-specific parameters into the access method, and that is generally
bad design.  A user can specify such a parameter even when not using the
corresponding opclass, which may cause confusion (and even if we forbid
that, the check needs to be hardcoded).  Also, extension opclasses can't
do the same thing.  Thus, it appears that extension opclasses are not
first-class citizens anymore.  Have you taken a look at the opclass
parameters patch [1]?  I think it's the proper solution to this problem.
I think we should postpone this parameterization until we push the
opclass parameters patch.

1. https://www.postgresql.org/message-id/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Mar 4, 2018 at 3:15 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've been thinking about this after looking at 0a459cec96, and I don't
> think this patch has the same issues. One reason is that just like the
> original minmax opclass, it does not really mess with the data it
> stores. It only does min/max on the values, and stores that, so if there
> was NaN or Infinity, it will index NaN or Infinity.

FWIW, I think the closest similar functionality is subtype_diff
function of range type.  But I don't think we should require range
type here just in order to fetch subtype_diff function out of it.  So,
opclass distance function looks OK for me, assuming it's not
AM-defined function, but function used for inter-opclass
compatibility.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 3/2/19 10:05 AM, Alexander Korotkov wrote:
> On Sun, Mar 4, 2018 at 3:15 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I've been thinking about this after looking at 0a459cec96, and I don't
>> think this patch has the same issues. One reason is that just like the
>> original minmax opclass, it does not really mess with the data it
>> stores. It only does min/max on the values, and stores that, so if there
>> was NaN or Infinity, it will index NaN or Infinity.
> 
> FWIW, I think the closest similar functionality is subtype_diff
> function of range type.  But I don't think we should require range
> type here just in order to fetch subtype_diff function out of it.  So,
> opclass distance function looks OK for me,

OK, agreed.

> assuming it's not AM-defined function, but function used for
> inter-opclass compatibility.
> 

I'm not sure I understand what you mean by this. Can you elaborate? Does
the current implementation (i.e. distance function being implemented as
an opclass support procedure) work for you or not?

Thanks for looking at the patch!


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 3/2/19 10:00 AM, Alexander Korotkov wrote:
> Hi!
> 
> I'm starting to look at this patchset.  In general, I think it's
> very cool!  We definitely need this.
> 
> On Tue, Apr 3, 2018 at 10:51 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) index parameters
>>
>> The main improvement of this version is an introduction of a couple of
>> BRIN index parameters, next to pages_per_range and autosummarize.
>>
>> a) n_distinct_per_range - used to size Bloom index
>> b) false_positive_rate - used to size Bloom index
>> c) values_per_range - number of values in the minmax-multi summary
>>
>> Until now those parameters were pretty much hard-coded; this allows easy
>> customization depending on the data set. There are some basic rules to
>> clamp the values (e.g. not to allow ndistinct to be less than 128 or
>> more than MaxHeapTuplesPerPage * pages_per_range), but that's about it.
>> I'm sure we could devise more elaborate heuristics (e.g. when building
>> index on an existing table, we could inspect table statistics first),
>> but the patch does not do that.
>>
>> One disadvantage is that those parameters are per-index.
> 
> For me, the main disadvantage of this solution is that we put
> opclass-specific parameters into the access method, and that is generally
> bad design.  A user can specify such a parameter even when not using the
> corresponding opclass, which may cause confusion (and even if we forbid
> that, the check needs to be hardcoded).  Also, extension opclasses can't
> do the same thing.  Thus, it appears that extension opclasses are not
> first-class citizens anymore.  Have you taken a look at the opclass
> parameters patch [1]?  I think it's the proper solution to this problem.
> I think we should postpone this parameterization until we push the
> opclass parameters patch.
> 
> 1. https://www.postgresql.org/message-id/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru
> 

I've looked at that patch only very briefly so far, but I agree it's
likely a better solution than what my patch does at the moment (which I
agree is a misuse of the AM-level options). I'll take a closer look.

I agree it makes sense to re-use that infrastructure for this patch, but
I'm hesitant to rebase it on top of that patch right away, because that
would make this thread dependent on it, which would confuse cputube,
make it bitrot faster, etc.

So I suggest we ignore this aspect of the patch for now, and let's talk
about the other bits first.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Mar 3, 2019 at 12:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've looked at that patch only very briefly so far, but I agree it's
> likely a better solution than what my patch does at the moment (which I
> agree is a misuse of the AM-level options). I'll take a closer look.
>
> I agree it makes sense to re-use that infrastructure for this patch, but
> I'm hesitant to rebase it on top of that patch right away. Because it
> would mean this thread dependent on it, which would confuse cputube,
> make it bitrot faster etc.
>
> So I suggest we ignore this aspect of the patch for now, and let's talk
> about the other bits first.

Works for me.  We don't need to make all the work done by this patch
dependent on opclass parameters.  It's OK to ignore this aspect for now
and come back to it when opclass parameters get committed.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Mar 3, 2019 at 12:12 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 3/2/19 10:05 AM, Alexander Korotkov wrote:
> > assuming it's not AM-defined function, but function used for
> > inter-opclass compatibility.
>
> I'm not sure I understand what you mean by this. Can you elaborate? Does
> the current implementation (i.e. distance function being implemented as
> an opclass support procedure) work for you or not?

I mean that unlike other index access methods, BRIN allows opclasses to
define custom support procedures.  These support procedures are not
called directly from the AM, but might be called from other opclass
support procedures.  That allows re-using the same high-level support
procedures in multiple opclasses.

So, the distance support procedure is not called directly from the AM,
and we don't have to change the interface between the AM and the opclass
for that.  This is why I'm OK with it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Nikita Glukhov
Date:
Hi!

I have looked at this patch set too, but so far only at the first two
infrastructure patches.

First of all, I agree that opclass parameters patch is needed here.


0001. Pass all keys to BRIN consistent function at once.

I think that changing the signature of consistent function is bad, because then
the authors of existing BRIN opclasses will need to maintain two variants of
the function for different versions of PostgreSQL.  Moreover, we can easily
distinguish two variants by the number of parameters.  So I returned back a
call to old 3-argument variant of consistent() in bringetbitmap().  Also I
fixed brinvalidate() adding support for new 4-argument variant, and fixed
catalog entries for brin_minmax_consistent() and brin_inclusion_consistent()
which remained 3-argument.  And also I removed unneeded indentation shift in
these two functions, which makes it difficult to compare changes, by extracting
subroutines minmax_consistent_key() and inclusion_consistent_key().


0002. Move IS NOT NULL checks to bringetbitmap()

I believe that removing duplicate code is always good.  But in this case it
seems a bit inconsistent to refactor only bringetbitmap().  I think we can't
guarantee that existing opclasses work with null flags in add_value() and
union() in the expected way.

So I refactored the work with the BrinValues flags in other places in patch
0003.  I added a flag BrinOpcInfo.oi_regular_nulls which enables regular
processing of NULLs before the support functions are called.  Now support
functions don't need to care about bv_hasnulls at all; add_value(), for
example, now works only with non-NULL values.

Patches 0002 and 0003 should be merged, I put 0003 in a separate patch just 
for ease of review.


0004. BRIN bloom indexes
0005. BRIN multi-range minmax indexes

I have not looked carefully at these patches yet; I only fixed the catalog
entries and removed the NULL processing according to patch 0003.  I also
noticed that the following functions contain a lot of duplicated code, which
needs to be extracted into a common subroutine:
inclusion_get_procinfo()
bloom_get_procinfo()
minmax_multi_get_procinfo()



Attached patches with all my changes.

-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi Nikita,

Thanks for looking at the patch.

On 3/12/19 11:33 AM, Nikita Glukhov wrote:
> Hi!
> 
> I have looked at this patch set too, but so far only at first two 
> infrastructure patches.
> 
> First of all, I agree that opclass parameters patch is needed here.
> 

OK.

> 
> 0001. Pass all keys to BRIN consistent function at once.
> 
> I think that changing the signature of consistent function is bad, because then
> the authors of existing BRIN opclasses will need to maintain two variants of
> the function for different versions of PostgreSQL.  Moreover, we can easily
> distinguish two variants by the number of parameters.  So I returned back a
> call to old 3-argument variant of consistent() in bringetbitmap().  Also I
> fixed brinvalidate() adding support for new 4-argument variant, and fixed
> catalog entries for brin_minmax_consistent() and brin_inclusion_consistent()
> which remained 3-argument.  And also I removed unneeded indentation shift in
> these two functions, which makes it difficult to compare changes, by extracting
> subroutines minmax_consistent_key() and inclusion_consistent_key().
> 

Hmmm. I admit I rather dislike functions that change the signature based
on the number of arguments, for some reason. But maybe it's better than
changing the consistent function. Not sure.

> 
> 0002. Move IS NOT NULL checks to bringetbitmap()
> 
> I believe that removing duplicate code is always good.  But in this case it
> seems a bit inconsistent to refactor only bringetbitmap().  I think we can't
> guarantee that existing opclasses work with null flags in add_value() and
> union() in the expected way.
> 
> So I refactored the work with BrinValues flags in other places in patch 0003.
> I added flag BrinOpcInfo.oi_regular_nulls which enables regular processing of
> NULLs before calling of support functions.  Now support functions don't need to
> care about bv_hasnulls at all. add_value(), for example, works now only with
> non-NULL values.
> 

That seems like unnecessary complexity to me. We can't really guarantee
much about opclasses in extensions anyway. I don't know if there's some
sort of precedent but IMHO it's reasonable to expect the opclasses to be
updated accordingly.

> Patches 0002 and 0003 should be merged, I put 0003 in a separate patch just 
> for ease of review.
> 

Thanks.

> 
> 0004. BRIN bloom indexes
> 0005. BRIN multi-range minmax indexes
> 
> I have not looked carefully at these patches yet; I only fixed the catalog
> entries and removed the NULL processing according to patch 0003.  I also
> noticed that the following functions contain a lot of duplicated code, which
> needs to be extracted into a common subroutine:
> inclusion_get_procinfo()
> bloom_get_procinfo()
> minmax_multi_get_procinfo()
> 

Yes. The reason for the duplicate code is that initially this was
submitted as two separate patches, so there was no obvious need for
sharing code.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Tue, Mar 12, 2019 at 8:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> > 0001. Pass all keys to BRIN consistent function at once.
> >
> > I think that changing the signature of consistent function is bad, because then
> > the authors of existing BRIN opclasses will need to maintain two variants of
> > the function for different versions of PostgreSQL.  Moreover, we can easily
> > distinguish two variants by the number of parameters.  So I returned back a
> > call to old 3-argument variant of consistent() in bringetbitmap().  Also I
> > fixed brinvalidate() adding support for new 4-argument variant, and fixed
> > catalog entries for brin_minmax_consistent() and brin_inclusion_consistent()
> > which remained 3-argument.  And also I removed unneeded indentation shift in
> > these two functions, which makes it difficult to compare changes, by extracting
> > subroutines minmax_consistent_key() and inclusion_consistent_key().
> >
>
> Hmmm. I admit I rather dislike functions that change the signature based
> on the number of arguments, for some reason. But maybe it's better than
> changing the consistent function. Not sure.

I also kind of dislike changing the signature based on the number of
arguments.  But it's still good to let extensions use the old interface if
needed.  What do you think about inventing a new consistent method, so
that an extension can implement either of them?  We did a similar thing
for GIN (bistate consistent vs. tristate consistent).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 3/13/19 9:15 AM, Alexander Korotkov wrote:
> On Tue, Mar 12, 2019 at 8:15 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> 0001. Pass all keys to BRIN consistent function at once.
>>>
>>> I think that changing the signature of consistent function is bad, because then
>>> the authors of existing BRIN opclasses will need to maintain two variants of
>>> the function for different versions of PostgreSQL.  Moreover, we can easily
>>> distinguish two variants by the number of parameters.  So I returned back a
>>> call to old 3-argument variant of consistent() in bringetbitmap().  Also I
>>> fixed brinvalidate() adding support for new 4-argument variant, and fixed
>>> catalog entries for brin_minmax_consistent() and brin_inclusion_consistent()
>>> which remained 3-argument.  And also I removed unneeded indentation shift in
>>> these two functions, which makes it difficult to compare changes, by extracting
>>> subroutines minmax_consistent_key() and inclusion_consistent_key().
>>>
>>
>> Hmmm. I admit I rather dislike functions that change the signature based
>> on the number of arguments, for some reason. But maybe it's better than
>> changing the consistent function. Not sure.
> 
> I also kind of dislike signature change based on the number of
> arguments.  But it's still good to let extensions use old interface if
> needed.  What do you think about invention new consistent method, so
> that extension can implement one of them?  We did similar thing for
> GIN (bistate consistent vs tristate consistent).
> 

Possibly. The other annoyance of course is that to support the current
consistent method we'll have to keep all the code I guess :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Wed, Mar 13, 2019 at 12:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 3/13/19 9:15 AM, Alexander Korotkov wrote:
> > On Tue, Mar 12, 2019 at 8:15 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> >>> 0001. Pass all keys to BRIN consistent function at once.
> >>>
> >>> I think that changing the signature of consistent function is bad, because then
> >>> the authors of existing BRIN opclasses will need to maintain two variants of
> >>> the function for different versions of PostgreSQL.  Moreover, we can easily
> >>> distinguish two variants by the number of parameters.  So I returned back a
> >>> call to old 3-argument variant of consistent() in bringetbitmap().  Also I
> >>> fixed brinvalidate() adding support for new 4-argument variant, and fixed
> >>> catalog entries for brin_minmax_consistent() and brin_inclusion_consistent()
> >>> which remained 3-argument.  And also I removed unneeded indentation shift in
> >>> these two functions, which makes it difficult to compare changes, by extracting
> >>> subroutines minmax_consistent_key() and inclusion_consistent_key().
> >>>
> >>
> >> Hmmm. I admit I rather dislike functions that change the signature based
> >> on the number of arguments, for some reason. But maybe it's better than
> >> changing the consistent function. Not sure.
> >
> > I also kind of dislike signature change based on the number of
> > arguments.  But it's still good to let extensions use old interface if
> > needed.  What do you think about invention new consistent method, so
> > that extension can implement one of them?  We did similar thing for
> > GIN (bistate consistent vs tristate consistent).
> >
>
> Possibly. The other annoyance of course is that to support the current
> consistent method we'll have to keep all the code I guess :-(

Yes, because an incompatible change of an opclass support function
signature is something we have never done before.  We did have to add new
optional arguments to GiST functions, but that was a compatible change.

If we make an incompatible change to the opclass interface, it becomes
unclear how to do pg_upgrade with the extension installed.  Imagine: if we
don't require the function signature to match, we could easily get a
segfault because of extension incompatibility.  If we do require the
function signature to match, extension upgrades become complex.  It would
be necessary not only to adjust the C code, but also to write some custom
script which changes the opclass (and users would have to run this script
manually?).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Mar 03, 2019 at 07:29:26AM +0300, Alexander Korotkov wrote:
>On Sun, Mar 3, 2019 at 12:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I've looked at that patch only very briefly so far, but I agree it's
>> likely a better solution than what my patch does at the moment (which I
>> agree is a misuse of the AM-level options). I'll take a closer look.
>>
>> I agree it makes sense to re-use that infrastructure for this patch, but
>> I'm hesitant to rebase it on top of that patch right away. Because it
>> would mean this thread dependent on it, which would confuse cputube,
>> make it bitrot faster etc.
>>
>> So I suggest we ignore this aspect of the patch for now, and let's talk
>> about the other bits first.
>
>Works for me.  We don't need to make the whole work made by this patch
>to be dependent on opclass parameters.  It's OK to ignore this aspect
>for now and come back when opclass parameters get committed.
>

Attached is this patch series, rebased on top of current master and the
opclass parameters patch [1]. I previously planned to keep those two
efforts separate for a while, but I decided to give it a try and the
breakage is fairly minor so I'll keep it this way - this patch has zero
chance of getting committed without the opclass parameters patch anyway.

Aside from rebase and changes due to adopting opclass parameters, the
patch is otherwise unchanged.

0001-0004 are just the opclass parameters patch series.

0005 adds opclass parameters to BRIN indexes (similarly to what the
preceding parts do for GIN/GiST indexes).

0006-0010 are the original patch series (BRIN tweaks, bloom and
multi-minmax) rebased and switched to opclass parameters.


So now, we can do things like this:

  CREATE INDEX x ON t USING brin (
      col1 int4_bloom_ops(false_positive_rate = 0.05),
      col2 int4_minmax_multi_ops(values_per_range = 16)
  ) WITH (pages_per_range = 32);

and so on. I think the patch [1] works fine - I only have some minor
comments, that I'll post to that thread.

The other challenges (e.g. how to pick the values for opclass parameters
automatically, based on the data) are still open.


regards


[1] https://www.postgresql.org/message-id/flat/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru 

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2019-Jun-11, Tomas Vondra wrote:

> Attached is this patch series, rebased on top of current master and the
> opclass parameters patch [1]. I previously planned to keep those two
> efforts separate for a while, but I decided to give it a try and the
> breakage is fairly minor so I'll keep it this way - this patch has zero
> chance of getting committed with the opclass parameters patch anyway.
> 
> Aside from rebase and changes due to adopting opclass parameters, the
> patch is otherwise unchanged.

This patch series doesn't apply, but I'm leaving it alone since the
brokenness is the opclass part, for which I have pinged the other
thread.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
Hi, Tomas!

I took a look at this patchset.

On Tue, Jun 11, 2019 at 8:31 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is this patch series, rebased on top of current master and the
> opclass parameters patch [1]. I previously planned to keep those two
> efforts separate for a while, but I decided to give it a try and the
> breakage is fairly minor so I'll keep it this way - this patch has zero
> chance of getting committed with the opclass parameters patch anyway.

Great.  As you may have noticed, Nikita updated the opclass parameters
patchset, providing a uniform way of passing opclass parameters for all
index access methods.  We would appreciate it if you shared feedback on
that.

> Aside from rebase and changes due to adopting opclass parameters, the
> patch is otherwise unchanged.
>
> 0001-0004 are just the opclass parameters patch series.
>
> 0005 adds opclass parameters to BRIN indexes (similarly to what the
> preceding parts to for GIN/GiST indexes).

I see this patch changes validation and catalog entries for the addvalue,
consistent and union procs.  However, I don't see an additional argument
being passed to those functions in this patch.  0009 adds an argument to
addvalue; for consistent and union, the new argument doesn't seem to be
added in any patch.  It's probably not so important if you're going to
rebase onto the current version of opclass parameters, because it provides
a new way of passing opclass parameters to support functions.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Tue, Sep 03, 2019 at 06:05:04PM -0400, Alvaro Herrera wrote:
>On 2019-Jun-11, Tomas Vondra wrote:
>
>> Attached is this patch series, rebased on top of current master and the
>> opclass parameters patch [1]. I previously planned to keep those two
>> efforts separate for a while, but I decided to give it a try and the
>> breakage is fairly minor so I'll keep it this way - this patch has zero
>> chance of getting committed with the opclass parameters patch anyway.
>>
>> Aside from rebase and changes due to adopting opclass parameters, the
>> patch is otherwise unchanged.
>
>This patch series doesn't apply, but I'm leaving it alone since the
>brokenness is the opclass part, for which I have pinged the other
>thread.
>

Attached is an updated version of this patch series, rebased on top of
the opclass parameter patches, shared by Nikita a couple of days ago.
There's one extra fixup patch, addressing a bug in those patches.

Firstly, while I have some comments on the opclass parameters patches
(shared in the other thread), I think that patch series is moving in the
right direction. After rebase the code is somewhat simpler and easier to
read, which is good. I'm sure there's some more work needed on the APIs
and so on, but I'm optimistic about that.

The rest of this patch series (0007-0011) is mostly unchanged. I've
fixed a couple of bugs and added some comments (particularly to the
bloom opclass), but that's about it.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
This patch fails to apply (or the opclass params one, maybe).  Please
update.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 25, 2019 at 05:07:48PM -0300, Alvaro Herrera wrote:
>This patch fails to apply (or the opclass params one, maybe).  Please
>update.
>

Yeah, the opclass params patches got broken by 773df883e adding enum
reloptions. The breakage is somewhat extensive so I'll leave it up to
Nikita to fix it in [1]. Until that happens, apply the patches on
top of caba97a9d9 for review.

Thanks

[1] https://commitfest.postgresql.org/24/2183/

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Michael Paquier
Date:
On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> Yeah, the opclass params patches got broken by 773df883e adding enum
> reloptions. The breakage is somewhat extensive so I'll leave it up to
> Nikita to fix it in [1]. Until that happens, apply the patches on
> top of caba97a9d9 for review.

This has been close to two months now, so I have marked the patch as RwF.
Feel free to update it if you think that's incorrect.
--
Michael

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> Yeah, the opclass params patches got broken by 773df883e adding enum
>> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> Nikita to fix it in [1]. Until that happens, apply the patches on
>> top of caba97a9d9 for review.
>
>This has been close to two months now, so I have the patch as RwF.
>Feel free to update if you think that's incorrect.

I see the opclass parameters patch got committed a couple of days ago, so
I've rebased the patch series on top of it. The patch has been marked RwF
since 2019-11, so I'll add it to the next CF.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
Hi!

On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> top of caba97a9d9 for review.
> >
> >This has been close to two months now, so I have the patch as RwF.
> >Feel free to update if you think that's incorrect.
>
> I see the opclass parameters patch got committed a couple days ago, so
> I've rebased the patch series on top of it. The pach was marked RwF
> since 2019-11, so I'll add it to the next CF.

I think this patchset was marked RwF mainly because of slow progress on
opclass parameters.  Now that opclass parameters are committed, I think
this patchset is in pretty good shape.  Moreover, the opclass parameters
patch comes with only very small examples.  This patchset would be a
great showcase for opclass parameters.

I'd like to give this patchset a chance for v13.  I'm going to make
another pass through this patchset.  If I don't find serious issues,
I'm going to commit it.  Any objections?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: BRIN multi-range indexes

From
Tom Lane
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> I'd like to give this patchset a chance for v13.  I'm going to make
> another pass trough this patchset.  If I wouldn't find serious issues,
> I'm going to commit it.  Any objections?

I think it is way too late to be reviving major features that nobody
has been looking at for months, that indeed were never even listed
in the final CF.  At this point in the cycle I think we should just be
trying to get small stuff over the line, not shove in major patches
and figure they can be stabilized later.

In this particular case, the last serious work on the patchset seems
to have been Tomas' revision of 2019-09-14, and he specifically stated
then that the APIs still needed work.  That doesn't sound like
"it's about ready to commit" to me.

            regards, tom lane



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
>Hi!
>
>On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> >> Yeah, the opclass params patches got broken by 773df883e adding enum
>> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> >> Nikita to fix it in [1]. Until that happens, apply the patches on
>> >> top of caba97a9d9 for review.
>> >
>> >This has been close to two months now, so I have the patch as RwF.
>> >Feel free to update if you think that's incorrect.
>>
>> I see the opclass parameters patch got committed a couple days ago, so
>> I've rebased the patch series on top of it. The pach was marked RwF
>> since 2019-11, so I'll add it to the next CF.
>
>I think this patchset was marked RwF mainly because slow progress on
>opclass parameters.  Now we got opclass parameters committed, and I
>think this patchset is in a pretty good shape.  Moreover, opclass
>parameters patch comes with very small examples.  This patchset would
>be great showcase for opclass parameters.
>
>I'd like to give this patchset a chance for v13.  I'm going to make
>another pass trough this patchset.  If I wouldn't find serious issues,
>I'm going to commit it.  Any objections?
>

I'm the author of the patchset and I'd love to see it committed, but I
think that might be a bit too rushed and unfair (considering it was not
included in the current CF at all).

I think the code is correct and I'm not aware of any bugs, but I'm not
sure there was sufficient discussion about things like costing or choosing
parameter values (e.g. the number of values in the multi-minmax summary,
or the bloom filter parameters).

That being said, I think the first couple of patches (that modify how
BRIN deals with multi-key scans and IS NULL clauses) are simple enough
and non-controversial, so maybe we could get 0001-0003 committed, and
leave the bloom/multi-minmax opclasses for v14.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Apr 5, 2020 at 6:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > I'd like to give this patchset a chance for v13.  I'm going to make
> > another pass trough this patchset.  If I wouldn't find serious issues,
> > I'm going to commit it.  Any objections?
>
> I think it is way too late to be reviving major features that nobody
> has been looking at for months, that indeed were never even listed
> in the final CF.  At this point in the cycle I think we should just be
> trying to get small stuff over the line, not shove in major patches
> and figure they can be stabilized later.
>
> In this particular case, the last serious work on the patchset seems
> to have been Tomas' revision of 2019-09-14, and he specifically stated
> then that the APIs still needed work.  That doesn't sound like
> "it's about ready to commit" to me.

OK, got it.  Thank you for the feedback.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> top of caba97a9d9 for review.
> >> >
> >> >This has been close to two months now, so I have the patch as RwF.
> >> >Feel free to update if you think that's incorrect.
> >>
> >> I see the opclass parameters patch got committed a couple days ago, so
> >> I've rebased the patch series on top of it. The pach was marked RwF
> >> since 2019-11, so I'll add it to the next CF.
> >
> >I think this patchset was marked RwF mainly because slow progress on
> >opclass parameters.  Now we got opclass parameters committed, and I
> >think this patchset is in a pretty good shape.  Moreover, opclass
> >parameters patch comes with very small examples.  This patchset would
> >be great showcase for opclass parameters.
> >
> >I'd like to give this patchset a chance for v13.  I'm going to make
> >another pass trough this patchset.  If I wouldn't find serious issues,
> >I'm going to commit it.  Any objections?
> >
>
> I'm an author of the patchset and I'd love to see it committed, but I
> think that might be a bit too rushed and unfair (considering it was not
> included in the current CF at all).
>
> I think the code is correct and I'm not aware of any bugs, but I'm not
> sure there was sufficient discussion about things like costing, choosing
> parameter values (e.g.  number of values in the multi-minmax or bloom
> filter parameters).

Ok!

> That being said, I think the first couple of patches (that modify how
> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> and non-controversial, so maybe we could get 0001-0003 committed, and
> leave the bloom/multi-minmax opclasses for v14.

Regarding 0001-0003 I've got a couple of notes:
1) They should revise the BRIN extensibility documentation section.
2) I think 0002 and 0003 should be merged.  NULL ScanKeys should still be
passed to the consistent function when oi_regular_nulls == false.

Assuming we're not going to get 0001-0003 into v13, I'm not inclined to
rush on these three either.  But if you're willing to commit them, you
can count on a round of review from me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:
>On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
>> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
>> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
>> >> >> top of caba97a9d9 for review.
>> >> >
>> >> >This has been close to two months now, so I have the patch as RwF.
>> >> >Feel free to update if you think that's incorrect.
>> >>
>> >> I see the opclass parameters patch got committed a couple days ago, so
>> >> I've rebased the patch series on top of it. The pach was marked RwF
>> >> since 2019-11, so I'll add it to the next CF.
>> >
>> >I think this patchset was marked RwF mainly because slow progress on
>> >opclass parameters.  Now we got opclass parameters committed, and I
>> >think this patchset is in a pretty good shape.  Moreover, opclass
>> >parameters patch comes with very small examples.  This patchset would
>> >be great showcase for opclass parameters.
>> >
>> >I'd like to give this patchset a chance for v13.  I'm going to make
>> >another pass trough this patchset.  If I wouldn't find serious issues,
>> >I'm going to commit it.  Any objections?
>> >
>>
>> I'm an author of the patchset and I'd love to see it committed, but I
>> think that might be a bit too rushed and unfair (considering it was not
>> included in the current CF at all).
>>
>> I think the code is correct and I'm not aware of any bugs, but I'm not
>> sure there was sufficient discussion about things like costing, choosing
>> parameter values (e.g.  number of values in the multi-minmax or bloom
>> filter parameters).
>
>Ok!
>
>> That being said, I think the first couple of patches (that modify how
>> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
>> and non-controversial, so maybe we could get 0001-0003 committed, and
>> leave the bloom/multi-minmax opclasses for v14.
>
>Regarding 0001-0003 I've couple of notes:
>1) They should revise BRIN extensibility documentation section.
>2) I think 0002 and 0003 should be merged.  NULL ScanKeys should be
>still passed to consistent function when oi_regular_nulls == false.
>
>Assuming we're not going to get 0001-0003 into v13, I'm not so
>inclined to rush on these three as well.  But you're willing to commit
>them, you can count round of review on me.
>

I have no intention to get 0001-0003 committed. I think those changes
are beneficial on their own, but the primary reason was to support the
new opclasses (which require those changes). And those parts are not
going to make it into v13 ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> >> top of caba97a9d9 for review.
> >> >> >
> >> >> >This has been close to two months now, so I have the patch as RwF.
> >> >> >Feel free to update if you think that's incorrect.
> >> >>
> >> >> I see the opclass parameters patch got committed a couple days ago, so
> >> >> I've rebased the patch series on top of it. The pach was marked RwF
> >> >> since 2019-11, so I'll add it to the next CF.
> >> >
> >> >I think this patchset was marked RwF mainly because slow progress on
> >> >opclass parameters.  Now we got opclass parameters committed, and I
> >> >think this patchset is in a pretty good shape.  Moreover, opclass
> >> >parameters patch comes with very small examples.  This patchset would
> >> >be great showcase for opclass parameters.
> >> >
> >> >I'd like to give this patchset a chance for v13.  I'm going to make
> >> >another pass trough this patchset.  If I wouldn't find serious issues,
> >> >I'm going to commit it.  Any objections?
> >> >
> >>
> >> I'm an author of the patchset and I'd love to see it committed, but I
> >> think that might be a bit too rushed and unfair (considering it was not
> >> included in the current CF at all).
> >>
> >> I think the code is correct and I'm not aware of any bugs, but I'm not
> >> sure there was sufficient discussion about things like costing, choosing
> >> parameter values (e.g.  number of values in the multi-minmax or bloom
> >> filter parameters).
> >
> >Ok!
> >
> >> That being said, I think the first couple of patches (that modify how
> >> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> >> and non-controversial, so maybe we could get 0001-0003 committed, and
> >> leave the bloom/multi-minmax opclasses for v14.
> >
> >Regarding 0001-0003 I've couple of notes:
> >1) They should revise BRIN extensibility documentation section.
> >2) I think 0002 and 0003 should be merged.  NULL ScanKeys should be
> >still passed to consistent function when oi_regular_nulls == false.
> >
> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >inclined to rush on these three as well.  But you're willing to commit
> >them, you can count round of review on me.
> >
>
> I have no intention to get 0001-0003 committed. I think those changes
> are beneficial on their own, but the primary reason was to support the
> new opclasses (which require those changes). And those parts are not
> going to make it into v13 ...

OK, no problem.
Let's do this for v14.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

here is an updated patch series, fixing duplicate OIDs etc.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
...
>> >
>> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >inclined to rush on these three as well.  But you're willing to commit
>> >them, you can count round of review on me.
>> >
>>
>> I have no intention to get 0001-0003 committed. I think those changes
>> are beneficial on their own, but the primary reason was to support the
>> new opclasses (which require those changes). And those parts are not
>> going to make it into v13 ...
>
>OK, no problem.
>Let's do this for v14.
>

Hi Alexander,

Are you still interested in reviewing those patches? I'll take a look at
0001-0003 to check that your previous feedback was addressed. Do you
have any comments about 0004 / 0005, which I think are the more
interesting parts of this series?


Attached is a rebased version - I realized I forgot to include 0005 in
the last update, for some reason.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Masahiko Sawada
Date:
On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> ...
> >> >
> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >inclined to rush on these three as well.  But you're willing to commit
> >> >them, you can count round of review on me.
> >> >
> >>
> >> I have no intention to get 0001-0003 committed. I think those changes
> >> are beneficial on their own, but the primary reason was to support the
> >> new opclasses (which require those changes). And those parts are not
> >> going to make it into v13 ...
> >
> >OK, no problem.
> >Let's do this for v14.
> >
>
> Hi Alexander,
>
> Are you still interested in reviewing those patches? I'll take a look at
> 0001-0003 to check that your previous feedback was addressed. Do you
> have any comments about 0004 / 0005, which I think are the more
> interesting parts of this series?
>
>
> Attached is a rebased version - I realized I forgot to include 0005 in
> the last update, for some reason.
>

I've done a quick test with this patch set. I wonder if we can improve
the brin_page_items() SQL function in pageinspect as well. Currently,
brin_page_items() is hard-coded to support only normal brin indexes.
When we pass a brin-bloom or brin-multi-range index to that function, the
binary values are shown in the 'value' column, but that doesn't seem
helpful for users. For instance, here is the output of brin_page_items()
with a brin-multi-range index:

postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
2), 'mul');
-[ RECORD 1
]----------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------
----------------------------
itemoffset  | 1
blknum      | 0
attnum      | 1
allnulls    | f
hasnulls    | f
placeholder | f
value       |
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef

700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
00000710000}

Also, I got an assertion failure when setting false_positive_rate reloption:

postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
(false_positive_rate = 1));
TRAP: FailedAssertion("(false_positive_rate > 0) &&
(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)

I'll look at the code in depth and let you know if I find a problem.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> >> >
>> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >inclined to rush on these three as well.  But you're willing to commit
>> >> >them, you can count round of review on me.
>> >> >
>> >>
>> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> are beneficial on their own, but the primary reason was to support the
>> >> new opclasses (which require those changes). And those parts are not
>> >> going to make it into v13 ...
>> >
>> >OK, no problem.
>> >Let's do this for v14.
>> >
>>
>> Hi Alexander,
>>
>> Are you still interested in reviewing those patches? I'll take a look at
>> 0001-0003 to check that your previous feedback was addressed. Do you
>> have any comments about 0004 / 0005, which I think are the more
>> interesting parts of this series?
>>
>>
>> Attached is a rebased version - I realized I forgot to include 0005 in
>> the last update, for some reason.
>>
>
>I've done a quick test with this patch set. I wonder if we can improve
>brin_page_items() SQL function in pageinspect as well. Currently,
>brin_page_items() is hard-coded to support only normal brin indexes.
>When we pass brin-bloom or brin-multi-range to that function the
>binary values are shown in 'value' column but it seems not helpful for
>users. For instance, here is an output of brin_page_items() with a
>brin-multi-range index:
>
>postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>2), 'mul');
>-[ RECORD 1
]----------------------------------------------------------------------------------------------------------------------

>-----------------------------------------------------------------------------------------------------------------------------------
>----------------------------
>itemoffset  | 1
>blknum      | 0
>attnum      | 1
>allnulls    | f
>hasnulls    | f
>placeholder | f
>value       |
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef

>700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>00000710000}
>

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?


>Also, I got an assertion failure when setting false_positive_rate reloption:
>
>postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
>(false_positive_rate = 1));
>TRAP: FailedAssertion("(false_positive_rate > 0) &&
>(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)
>
>I'll look at the code in depth and let you know if I find a problem.
>

Yeah, the assert should say (f_p_r <= 1.0).

But I'm not convinced we should allow values up to 1.0, really. The
f_p_r is the fraction of the table that will always get matched, so 1.0
would mean we get to scan the whole table. Seems kinda pointless. So
maybe we should cap it at something like 0.1 or so, but I agree that
value seems kinda arbitrary.
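
Roughly speaking, for a value that is not in the table at all, the bitmap
heap scan is expected to visit about false_positive_rate * heap pages, so
with f_p_r = 1.0 that's simply the whole table. A quick sketch, using the
table from your example:

    -- expected heap pages visited for a value not present in the table
    SELECT relpages * 0.01 AS pages_at_1pct_fpr,
           relpages        AS pages_at_fpr_1
    FROM pg_class WHERE relname = 't';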


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Sascha Kuhl
Date:


Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote on Fri, 10 Jul 2020, 14:09:
On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> >> >
>> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >inclined to rush on these three as well.  But you're willing to commit
>> >> >them, you can count round of review on me.
>> >> >
>> >>
>> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> are beneficial on their own, but the primary reason was to support the
>> >> new opclasses (which require those changes). And those parts are not
>> >> going to make it into v13 ...
>> >
>> >OK, no problem.
>> >Let's do this for v14.
>> >
>>
>> Hi Alexander,
>>
>> Are you still interested in reviewing those patches? I'll take a look at
>> 0001-0003 to check that your previous feedback was addressed. Do you
>> have any comments about 0004 / 0005, which I think are the more
>> interesting parts of this series?
>>
>>
>> Attached is a rebased version - I realized I forgot to include 0005 in
>> the last update, for some reason.
>>
>
>I've done a quick test with this patch set. I wonder if we can improve
>brin_page_items() SQL function in pageinspect as well. Currently,
>brin_page_items() is hard-coded to support only normal brin indexes.
>When we pass brin-bloom or brin-multi-range to that function the
>binary values are shown in 'value' column but it seems not helpful for
>users. For instance, here is an output of brin_page_items() with a
>brin-multi-range index:
>
>postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>2), 'mul');
>-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------
>-----------------------------------------------------------------------------------------------------------------------------------
>----------------------------
>itemoffset  | 1
>blknum      | 0
>attnum      | 1
>allnulls    | f
>hasnulls    | f
>placeholder | f
>value       | {\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>00000710000}
>

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?
You can be sure with the next node. The value is in can be false positive. The value is out is clear. You can detect the change between in and out. 


>Also, I got an assertion failure when setting false_positive_rate reloption:
>
>postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
>(false_positive_rate = 1));
>TRAP: FailedAssertion("(false_positive_rate > 0) &&
>(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)
>
>I'll look at the code in depth and let you know if I find a problem.
>

Yeah, the assert should say (f_p_r <= 1.0).

But I'm not convinced we should allow values up to 1.0, really. The
f_p_r is the fraction of the table that will get matched always, so 1.0
would mean we get to scan the whole table. Seems kinda pointless. So
maybe we should cap it to something like 0.1 or so, but I agree the
value seems kinda arbitrary.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote on Fri, 10 Jul 2020,
>14:09:
>
>> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> wrote:
>> >>
>> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> ...
>> >> >> >
>> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >inclined to rush on these three as well.  But you're willing to
>> commit
>> >> >> >them, you can count round of review on me.
>> >> >> >
>> >> >>
>> >> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> >> are beneficial on their own, but the primary reason was to support
>> the
>> >> >> new opclasses (which require those changes). And those parts are not
>> >> >> going to make it into v13 ...
>> >> >
>> >> >OK, no problem.
>> >> >Let's do this for v14.
>> >> >
>> >>
>> >> Hi Alexander,
>> >>
>> >> Are you still interested in reviewing those patches? I'll take a look at
>> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> have any comments about 0004 / 0005, which I think are the more
>> >> interesting parts of this series?
>> >>
>> >>
>> >> Attached is a rebased version - I realized I forgot to include 0005 in
>> >> the last update, for some reason.
>> >>
>> >
>> >I've done a quick test with this patch set. I wonder if we can improve
>> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >When we pass brin-bloom or brin-multi-range to that function the
>> >binary values are shown in 'value' column but it seems not helpful for
>> >users. For instance, here is an output of brin_page_items() with a
>> >brin-multi-range index:
>> >
>> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >2), 'mul');
>> >-[ RECORD 1
>>
]----------------------------------------------------------------------------------------------------------------------
>>
>>
>-----------------------------------------------------------------------------------------------------------------------------------
>> >----------------------------
>> >itemoffset  | 1
>> >blknum      | 0
>> >attnum      | 1
>> >allnulls    | f
>> >hasnulls    | f
>> >placeholder | f
>> >value       |
>>
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>>
>>
>700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> >00000710000}
>> >
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>>
>You can be sure with the next node. The value is in can be false positiv.
>The value is out is clear. You can detect the change between in and out.
>

I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?

The problem here is that while plain BRIN opclasses have a fairly simple
summary that can be stored using a fixed number of simple data types
(e.g. minmax will store two values with the same data type as the
indexed column):

     result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
                      sizeof(MinmaxOpaque));
     result->oi_nstored = 2;
     result->oi_opaque = (MinmaxOpaque *)
         MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
     result->oi_typcache[0] = result->oi_typcache[1] =
         lookup_type_cache(typoid, 0);

The opclasses introduced here have a somewhat more complex summary, stored
as a single bytea value - which is what gets printed by brin_page_items.

To print something easier to read (for humans) we'd either have to teach
brin_page_items about the different opclasses (multi-range, bloom) and
how to parse the summary bytea, or we'd have to extend the opclasses
with a function formatting the summary. Or rework how the summary is
stored, but that seems like the worst option.
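
For contrast, a sketch of what the same call prints for a plain minmax
opclass (the index name here is hypothetical):

    -- a plain minmax summary comes out as a readable "{min .. max}" value
    -- rather than a raw bytea
    select itemoffset, value
      from brin_page_items(get_raw_page('a_minmax_idx', 2), 'a_minmax_idx');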


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Sascha Kuhl
Date:



Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Sa., 11. Juli 2020, 13:24:
On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Fr., 10. Juli 2020,
>14:09:
>
>> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> wrote:
>> >>
>> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> ...
>> >> >> >
>> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >inclined to rush on these three as well.  But you're willing to
>> commit
>> >> >> >them, you can count round of review on me.
>> >> >> >
>> >> >>
>> >> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> >> are beneficial on their own, but the primary reason was to support
>> the
>> >> >> new opclasses (which require those changes). And those parts are not
>> >> >> going to make it into v13 ...
>> >> >
>> >> >OK, no problem.
>> >> >Let's do this for v14.
>> >> >
>> >>
>> >> Hi Alexander,
>> >>
>> >> Are you still interested in reviewing those patches? I'll take a look at
>> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> have any comments about 0004 / 0005, which I think are the more
>> >> interesting parts of this series?
>> >>
>> >>
>> >> Attached is a rebased version - I realized I forgot to include 0005 in
>> >> the last update, for some reason.
>> >>
>> >
>> >I've done a quick test with this patch set. I wonder if we can improve
>> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >When we pass brin-bloom or brin-multi-range to that function the
>> >binary values are shown in 'value' column but it seems not helpful for
>> >users. For instance, here is an output of brin_page_items() with a
>> >brin-multi-range index:
>> >
>> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >2), 'mul');
>> >-[ RECORD 1
>> ]----------------------------------------------------------------------------------------------------------------------
>>
>> >-----------------------------------------------------------------------------------------------------------------------------------
>> >----------------------------
>> >itemoffset  | 1
>> >blknum      | 0
>> >attnum      | 1
>> >allnulls    | f
>> >hasnulls    | f
>> >placeholder | f
>> >value       |
>> {\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>>
>> >700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> >00000710000}
>> >
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>>
>You can be sure with the next node. The value is in can be false positiv.
>The value is out is clear. You can detect the change between in and out.
>

I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?

Hi,

You check the bloom filter to see if a value you're searching for is part of the node, right?

In case the value is in the bloom filter, you could be mistaken, because another value could have the same hash profile, no?

However, if the value is out, the filter cannot be wrong. You can be sure that the value is out.

If you're looking for a range, or many ranges of values, you traverse many nodes. By knowing a value is out, you can state a clear set of nodes that form the range. However, the border is somewhat unsharp because of the false positives.

I am not sure if we are writing about the same thing. Please confirm whether this is relevant here.

I will try to understand what you wrote. Interesting.

Sascha

The problem here is that while plain BRIN opclasses have a fairly simple
summary that can be stored using a fixed number of simple data types
(e.g. minmax will store two values with the same data type as the
indexed column):

     result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
                      sizeof(MinmaxOpaque));
     result->oi_nstored = 2;
     result->oi_opaque = (MinmaxOpaque *)
         MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
     result->oi_typcache[0] = result->oi_typcache[1] =
         lookup_type_cache(typoid, 0);

The opclasses introduced here have a somewhat more complex summary, stored
as a single bytea value - which is what gets printed by brin_page_items.

To print something easier to read (for humans) we'd either have to teach
brin_page_items about the different opclasses (multi-range, bloom) and
how to parse the summary bytea, or we'd have to extend the opclasses
with a function formatting the summary. Or rework how the summary is
stored, but that seems like the worst option.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: BRIN multi-range indexes

From
Sascha Kuhl
Date:
Sorry, my topic is different 

Sascha Kuhl <yogidabanli@gmail.com> schrieb am Sa., 11. Juli 2020, 15:32:



Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Sa., 11. Juli 2020, 13:24:
On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Fr., 10. Juli 2020,
>14:09:
>
>> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> wrote:
>> >>
>> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> ...
>> >> >> >
>> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >inclined to rush on these three as well.  But you're willing to
>> commit
>> >> >> >them, you can count round of review on me.
>> >> >> >
>> >> >>
>> >> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> >> are beneficial on their own, but the primary reason was to support
>> the
>> >> >> new opclasses (which require those changes). And those parts are not
>> >> >> going to make it into v13 ...
>> >> >
>> >> >OK, no problem.
>> >> >Let's do this for v14.
>> >> >
>> >>
>> >> Hi Alexander,
>> >>
>> >> Are you still interested in reviewing those patches? I'll take a look at
>> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> have any comments about 0004 / 0005, which I think are the more
>> >> interesting parts of this series?
>> >>
>> >>
>> >> Attached is a rebased version - I realized I forgot to include 0005 in
>> >> the last update, for some reason.
>> >>
>> >
>> >I've done a quick test with this patch set. I wonder if we can improve
>> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >When we pass brin-bloom or brin-multi-range to that function the
>> >binary values are shown in 'value' column but it seems not helpful for
>> >users. For instance, here is an output of brin_page_items() with a
>> >brin-multi-range index:
>> >
>> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >2), 'mul');
>> >-[ RECORD 1
>> ]----------------------------------------------------------------------------------------------------------------------
>>
>> >-----------------------------------------------------------------------------------------------------------------------------------
>> >----------------------------
>> >itemoffset  | 1
>> >blknum      | 0
>> >attnum      | 1
>> >allnulls    | f
>> >hasnulls    | f
>> >placeholder | f
>> >value       |
>> {\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>>
>> >700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> >00000710000}
>> >
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>>
>You can be sure with the next node. The value is in can be false positiv.
>The value is out is clear. You can detect the change between in and out.
>

I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?

Hi,

You check the bloom filter to see if a value you're searching for is part of the node, right?

In case the value is in the bloom filter, you could be mistaken, because another value could have the same hash profile, no?

However, if the value is out, the filter cannot be wrong. You can be sure that the value is out.

If you're looking for a range, or many ranges of values, you traverse many nodes. By knowing a value is out, you can state a clear set of nodes that form the range. However, the border is somewhat unsharp because of the false positives.

I am not sure if we are writing about the same thing. Please confirm whether this is relevant here.

I will try to understand what you wrote. Interesting.

Sascha

The problem here is that while plain BRIN opclasses have a fairly simple
summary that can be stored using a fixed number of simple data types
(e.g. minmax will store two values with the same data type as the
indexed column):

     result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
                      sizeof(MinmaxOpaque));
     result->oi_nstored = 2;
     result->oi_opaque = (MinmaxOpaque *)
         MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
     result->oi_typcache[0] = result->oi_typcache[1] =
         lookup_type_cache(typoid, 0);

The opclasses introduced here have a somewhat more complex summary, stored
as a single bytea value - which is what gets printed by brin_page_items.

To print something easier to read (for humans) we'd either have to teach
brin_page_items about the different opclasses (multi-range, bloom) and
how to parse the summary bytea, or we'd have to extend the opclasses
with a function formatting the summary. Or rework how the summary is
stored, but that seems like the worst option.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Sa., 11. Juli 2020,
>13:24:
>
>> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>> >Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Fr., 10. Juli
>> 2020,
>> >14:09:
>> >
>> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
>> tomas.vondra@2ndquadrant.com>
>> >> wrote:
>> >> >>
>> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >> ...
>> >> >> >> >
>> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >> >inclined to rush on these three as well.  But you're willing to
>> >> commit
>> >> >> >> >them, you can count round of review on me.
>> >> >> >> >
>> >> >> >>
>> >> >> >> I have no intention to get 0001-0003 committed. I think those
>> changes
>> >> >> >> are beneficial on their own, but the primary reason was to support
>> >> the
>> >> >> >> new opclasses (which require those changes). And those parts are
>> not
>> >> >> >> going to make it into v13 ...
>> >> >> >
>> >> >> >OK, no problem.
>> >> >> >Let's do this for v14.
>> >> >> >
>> >> >>
>> >> >> Hi Alexander,
>> >> >>
>> >> >> Are you still interested in reviewing those patches? I'll take a
>> look at
>> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> >> have any comments about 0004 / 0005, which I think are the more
>> >> >> interesting parts of this series?
>> >> >>
>> >> >>
>> >> >> Attached is a rebased version - I realized I forgot to include 0005
>> in
>> >> >> the last update, for some reason.
>> >> >>
>> >> >
>> >> >I've done a quick test with this patch set. I wonder if we can improve
>> >> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >> >When we pass brin-bloom or brin-multi-range to that function the
>> >> >binary values are shown in 'value' column but it seems not helpful for
>> >> >users. For instance, here is an output of brin_page_items() with a
>> >> >brin-multi-range index:
>> >> >
>> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >> >2), 'mul');
>> >> >-[ RECORD 1
>> >>
>>
]----------------------------------------------------------------------------------------------------------------------
>> >>
>> >>
>>
>-----------------------------------------------------------------------------------------------------------------------------------
>> >> >----------------------------
>> >> >itemoffset  | 1
>> >> >blknum      | 0
>> >> >attnum      | 1
>> >> >allnulls    | f
>> >> >hasnulls    | f
>> >> >placeholder | f
>> >> >value       |
>> >>
>>
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>> >>
>> >>
>>
>700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> >> >00000710000}
>> >> >
>> >>
>> >> Hmm. I'm not sure we can do much better, without making the function
>> >> much more complicated. I mean, even with regular BRIN indexes we don't
>> >> really know if the value is plain min/max, right?
>> >>
>> >You can be sure with the next node. The value is in can be false positiv.
>> >The value is out is clear. You can detect the change between in and out.
>> >
>>
>> I'm sorry, I don't understand what you're suggesting. How is any of this
>> related to false positive rate, etc?
>>
>
>Hi,
>
>You check by the bloom filter if a value you're searching is part of the
>node, right?
>
>In case, the value is in the bloom filter you could be mistaken, because
>another value could have the same hash profile, no?
>
>However if the value is out, the filter can not react. You can be sure that
>the value is out.
>
>If you looking for a range or many ranges of values, you traverse many
>nodes. By knowing the value is out, you can state a clear set of nodes that
>form the range. However the border is somehow unsharp because of the false
>positives.
>
>I am not sure if we write about the same. Please confirm, this can be
>needed. Please.
>

Probably not. Masahiko-san pointed out that pageinspect (which also has
a function to print pages from a BRIN index) does not understand the
summary of the new opclasses and just prints the bytea verbatim.

That has nothing to do with inspecting the bloom filter, or anything
like that. So I think there's some confusion ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Sascha Kuhl
Date:
Thanks, I see there is some understanding, though. 

Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am So., 12. Juli 2020, 01:30:
On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Sa., 11. Juli 2020,
>13:24:
>
>> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>> >Tomas Vondra <tomas.vondra@2ndquadrant.com> schrieb am Fr., 10. Juli
>> 2020,
>> >14:09:
>> >
>> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
>> tomas.vondra@2ndquadrant.com>
>> >> wrote:
>> >> >>
>> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >> ...
>> >> >> >> >
>> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >> >inclined to rush on these three as well.  But you're willing to
>> >> commit
>> >> >> >> >them, you can count round of review on me.
>> >> >> >> >
>> >> >> >>
>> >> >> >> I have no intention to get 0001-0003 committed. I think those
>> changes
>> >> >> >> are beneficial on their own, but the primary reason was to support
>> >> the
>> >> >> >> new opclasses (which require those changes). And those parts are
>> not
>> >> >> >> going to make it into v13 ...
>> >> >> >
>> >> >> >OK, no problem.
>> >> >> >Let's do this for v14.
>> >> >> >
>> >> >>
>> >> >> Hi Alexander,
>> >> >>
>> >> >> Are you still interested in reviewing those patches? I'll take a
>> look at
>> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> >> have any comments about 0004 / 0005, which I think are the more
>> >> >> interesting parts of this series?
>> >> >>
>> >> >>
>> >> >> Attached is a rebased version - I realized I forgot to include 0005
>> in
>> >> >> the last update, for some reason.
>> >> >>
>> >> >
>> >> >I've done a quick test with this patch set. I wonder if we can improve
>> >> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >> >When we pass brin-bloom or brin-multi-range to that function the
>> >> >binary values are shown in 'value' column but it seems not helpful for
>> >> >users. For instance, here is an output of brin_page_items() with a
>> >> >brin-multi-range index:
>> >> >
>> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >> >2), 'mul');
>> >> >-[ RECORD 1
>> >>
>> ]----------------------------------------------------------------------------------------------------------------------
>> >>
>> >>
>> >-----------------------------------------------------------------------------------------------------------------------------------
>> >> >----------------------------
>> >> >itemoffset  | 1
>> >> >blknum      | 0
>> >> >attnum      | 1
>> >> >allnulls    | f
>> >> >hasnulls    | f
>> >> >placeholder | f
>> >> >value       |
>> >>
>> {\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>> >>
>> >>
>> >700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> >> >00000710000}
>> >> >
>> >>
>> >> Hmm. I'm not sure we can do much better, without making the function
>> >> much more complicated. I mean, even with regular BRIN indexes we don't
>> >> really know if the value is plain min/max, right?
>> >>
>> >You can be sure with the next node. The value is in can be false positiv.
>> >The value is out is clear. You can detect the change between in and out.
>> >
>>
>> I'm sorry, I don't understand what you're suggesting. How is any of this
>> related to false positive rate, etc?
>>
>
>Hi,
>
>You check by the bloom filter if a value you're searching is part of the
>node, right?
>
>In case, the value is in the bloom filter you could be mistaken, because
>another value could have the same hash profile, no?
>
>However if the value is out, the filter can not react. You can be sure that
>the value is out.
>
>If you looking for a range or many ranges of values, you traverse many
>nodes. By knowing the value is out, you can state a clear set of nodes that
>form the range. However the border is somehow unsharp because of the false
>positives.
>
>I am not sure if we write about the same. Please confirm, this can be
>needed. Please.
>

Probably not. Masahiko-san pointed out that pageinspect (which also has
a function to print pages from a BRIN index) does not understand the
summary of the new opclasses and just prints the bytea verbatim.

That has nothing to do with inspecting the bloom filter, or anything
like that. So I think there's some confusion ...


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Jul-10, Tomas Vondra wrote:

> > postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> > 2), 'mul');
> > -[ RECORD 1
]----------------------------------------------------------------------------------------------------------------------
> >
-----------------------------------------------------------------------------------------------------------------------------------
> > ----------------------------
> > itemoffset  | 1
> > blknum      | 0
> > attnum      | 1
> > allnulls    | f
> > hasnulls    | f
> > placeholder | f
> > value       |
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
> >
700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
> > 00000710000}
> 
> Hmm. I'm not sure we can do much better, without making the function
> much more complicated. I mean, even with regular BRIN indexes we don't
> really know if the value is plain min/max, right?

Maybe we can try to handle this with some other function that interprets
the bytea in 'value' and returns a user-readable text.  I think it'd
have to be a superuser-only function, because otherwise you could easily
cause a crash by passing a value of a different opclass.  But since this
seems a developer-only thing, that restriction seems fine to me.

(I don't know what's a good way to represent a bloom filter, mind.)
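
For the superuser-only part, the usual pageinspect pattern would suffice -
a sketch with a hypothetical function name, not something from the patch:

    -- hypothetical function name; pageinspect already ships its functions
    -- with EXECUTE revoked from PUBLIC, limiting them to superusers and
    -- explicit grantees
    revoke execute on function brin_summary_info(bytea, regclass) from public;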

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:
>On 2020-Jul-10, Tomas Vondra wrote:
>
>> > postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> > 2), 'mul');
>> > -[ RECORD 1
]----------------------------------------------------------------------------------------------------------------------
>> >
-----------------------------------------------------------------------------------------------------------------------------------
>> > ----------------------------
>> > itemoffset  | 1
>> > blknum      | 0
>> > attnum      | 1
>> > allnulls    | f
>> > hasnulls    | f
>> > placeholder | f
>> > value       |
{\x010000001b0000002000000001000000e5700000e6700000e7700000e8700000e9700000ea700000eb700000ec700000ed700000ee700000ef
>> >
700000f0700000f1700000f2700000f3700000f4700000f5700000f6700000f7700000f8700000f9700000fa700000fb700000fc700000fd700000fe700000ff700
>> > 00000710000}
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>
>Maybe we can try to handle this with some other function that interprets
>the bytea in 'value' and returns a user-readable text.  I think it'd
>have to be a superuser-only function, because otherwise you could easily
>cause a crash by passing a value of a different opclass.  But since this
>seems a developer-only thing, that restriction seems fine to me.
>

Ummm, I disagree a superuser check is sufficient protection from a
segfault or similar issues. If we really want to print something nicer,
I'd say it needs to be a special function in the BRIN opclass.

>(I don't know what's a good way to represent a bloom filter, mind.)
>

Me neither, but I guess we could print some stats (size, number
of bits set, etc.) and/or the bitmap itself.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Jul-13, Tomas Vondra wrote:

> On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

> > Maybe we can try to handle this with some other function that interprets
> > the bytea in 'value' and returns a user-readable text.  I think it'd
> > have to be a superuser-only function, because otherwise you could easily
> > cause a crash by passing a value of a different opclass.  But since this
> > seems a developer-only thing, that restriction seems fine to me.
> 
> Ummm, I disagree a superuser check is sufficient protection from a
> segfault or similar issues.

My POV there is that it's the user's responsibility to call the right
function; and if they fail to do so, it's their fault.  I agree it's not
ideal, but frankly these pageinspect things are not critical to get 100%
user-friendly.

> If we really want to print something nicer, I'd say it needs to be a
> special function in the BRIN opclass.

If that can be done, then +1.  We just need to ensure that the function
knows and can verify the type of index that the value comes from.  I
guess we can pass the index OID so that it can extract the opclass from
catalogs to verify.
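
The catalog lookup itself is straightforward - a sketch, using the 'mul'
index from the example above and looking only at its first column:

    -- map an index OID to the opclass of its first column
    select i.indexrelid::regclass as index_name, opc.opcname
      from pg_index i
      join pg_opclass opc on opc.oid = i.indclass[0]
     where i.indexrelid = 'mul'::regclass;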

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Masahiko Sawada
Date:
On Mon, 13 Jul 2020 at 09:33, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jul-13, Tomas Vondra wrote:
>
> > On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:
>
> > > Maybe we can try to handle this with some other function that interprets
> > > the bytea in 'value' and returns a user-readable text.  I think it'd
> > > have to be a superuser-only function, because otherwise you could easily
> > > cause a crash by passing a value of a different opclass.  But since this
> > > seems a developer-only thing, that restriction seems fine to me.
> >
> > Ummm, I disagree a superuser check is sufficient protection from a
> > segfault or similar issues.
>
> My POV there is that it's the user's responsibility to call the right
> function; and if they fail to do so, it's their fault.  I agree it's not
> ideal, but frankly these pageinspect things are not critical to get 100%
> user-friendly.
>
> > If we really want to print something nicer, I'd say it needs to be a
> > special function in the BRIN opclass.
>
> If that can be done, then +1.  We just need to ensure that the function
> knows and can verify the type of index that the value comes from.  I
> guess we can pass the index OID so that it can extract the opclass from
> catalogs to verify.

+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
doesn't have it, the 'values' can be null.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Mon, Jul 13, 2020 at 02:54:56PM +0900, Masahiko Sawada wrote:
>On Mon, 13 Jul 2020 at 09:33, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> On 2020-Jul-13, Tomas Vondra wrote:
>>
>> > On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:
>>
>> > > Maybe we can try to handle this with some other function that interprets
>> > > the bytea in 'value' and returns a user-readable text.  I think it'd
>> > > have to be a superuser-only function, because otherwise you could easily
>> > > cause a crash by passing a value of a different opclass.  But since this
>> > > seems a developer-only thing, that restriction seems fine to me.
>> >
>> > Ummm, I disagree a superuser check is sufficient protection from a
>> > segfault or similar issues.
>>
>> My POV there is that it's the user's responsibility to call the right
>> function; and if they fail to do so, it's their fault.  I agree it's not
>> ideal, but frankly these pageinspect things are not critical to get 100%
>> user-friendly.
>>
>> > If we really want to print something nicer, I'd say it needs to be a
>> > special function in the BRIN opclass.
>>
>> If that can be done, then +1.  We just need to ensure that the function
>> knows and can verify the type of index that the value comes from.  I
>> guess we can pass the index OID so that it can extract the opclass from
>> catalogs to verify.
>
>+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
>doesn't have it, the 'values' can be null.
>

I'd say that if the opclass does not have it, then we should print the
bytea value (or whatever the opclass uses to store the summary) using
the type functions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
On Mon, Jul 13, 2020 at 5:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> >> > If we really want to print something nicer, I'd say it needs to be a
> >> > special function in the BRIN opclass.
> >>
> >> If that can be done, then +1.  We just need to ensure that the function
> >> knows and can verify the type of index that the value comes from.  I
> >> guess we can pass the index OID so that it can extract the opclass from
> >> catalogs to verify.
> >
> >+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
> >doesn't have it, the 'values' can be null.
> >
>
> I'd say that if the opclass does not have it, then we should print the
> bytea value (or whatever the opclass uses to store the summary) using
> the type functions.

I've read the recent messages in this thread and I'd like to share my thoughts.

I think the way brin_page_items() displays values is not really
generic.  It uses a range-like textual representation of an array of
values, while that array doesn't necessarily have range semantics.

However, I think it's good that brin_page_items() uses a type output
function to display values.  So, it's not necessary to introduce a new
BRIN opclass function in order to get values displayed in a
human-readable way.  Instead, we could just make a standard of BRIN
value to be human readable.  I see at least two possibilities for
that.
1. Use standard container data-types to represent BRIN values.  For
instance we could use an array of ranges instead of bytea for
multirange.  Not about how convenient/performant it would be.
2. Introduce new data-type to represent values in BRIN index. And for
that type we can define output function with user-readable output. We
did similar things for GiST.  For instance, pg_trgm defines gtrgm
type, which has no input and no output. But for BRIN opclass we can
define type with just output.

BTW, I've applied the patchset to the current master, but I got a lot
of duplicate oids.  Could you please resolve these conflicts.  I think
it would be good to use high oid numbers to evade conflicts during
development/review, and rely on committer to set final oids (as
discussed in [1]).

Links
1. https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com

------
Regards,
Alexander Korotkov



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Jul 15, 2020 at 05:34:05AM +0300, Alexander Korotkov wrote:
>On Mon, Jul 13, 2020 at 5:59 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> >> > If we really want to print something nicer, I'd say it needs to be a
>> >> > special function in the BRIN opclass.
>> >>
>> >> If that can be done, then +1.  We just need to ensure that the function
>> >> knows and can verify the type of index that the value comes from.  I
>> >> guess we can pass the index OID so that it can extract the opclass from
>> >> catalogs to verify.
>> >
>> >+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
>> >doesn't have it, the 'values' can be null.
>> >
>>
>> I'd say that if the opclass does not have it, then we should print the
>> bytea value (or whatever the opclass uses to store the summary) using
>> the type functions.
>
>I've read the recent messages in this thread and I'd like to share my thoughts.
>
>I think the way brin_page_items() displays values is not really
>generic.  It uses a range-like textual representation of an array of
>values, while that array doesn't necessarily have range semantics.
>
>However, I think it's good that brin_page_items() uses a type output
>function to display values.  So, it's not necessary to introduce a new
>BRIN opclass function in order to get values displayed in a
>human-readable way.  Instead, we could just make a standard of BRIN
>value to be human readable.  I see at least two possibilities for
>that.
>1. Use standard container data-types to represent BRIN values.  For
>instance we could use an array of ranges instead of bytea for
>multirange.  Not about how convenient/performant it would be.
>2. Introduce new data-type to represent values in BRIN index. And for
>that type we can define output function with user-readable output. We
>did similar things for GiST.  For instance, pg_trgm defines gtrgm
>type, which has no input and no output. But for BRIN opclass we can
>define type with just output.
>

I think there's a number of weak points in this approach.

Firstly, it assumes the summaries can be represented as arrays of
built-in types, which I'm not really sure about. It clearly is not true
for the bloom opclasses, for example. But even for minmax opclasses it's
going to be tricky because the ranges may be on different data types, so
presumably we'd need a somewhat nested data structure.

Moreover, a multi-minmax summary contains either points or intervals,
which requires additional fields/flags to indicate that. That further
complicates things ...

Maybe we could decompose that into separate arrays or something, but
honestly it seems somewhat premature - there are far more important
aspects to discuss, I think (e.g. how the ranges are built/merged in
multi-minmax, or whether bloom opclasses are useful at all).


>BTW, I've applied the patchset to the current master, but I got a lot
>of duplicate oids.  Could you please resolve these conflicts.  I think
>it would be good to use high oid numbers to evade conflicts during
>development/review, and rely on committer to set final oids (as
>discussed in [1]).
>
>Links
>1. https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com
>

Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
with it, and it's already using quite high OIDs (part 4 uses >= 8000,
part 5 uses >= 9000).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Alexander Korotkov
Date:
Hi, Tomas!

Sorry for the late reply.

On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think there's a number of weak points in this approach.
>
> Firstly, it assumes the summaries can be represented as arrays of
> built-in types, which I'm not really sure about. It clearly is not true
> for the bloom opclasses, for example. But even for minmax oclasses it's
> going to be tricky because the ranges may be on different data types so
> presumably we'd need somewhat nested data structure.
>
> Moreover, multi-minmax summary contains either points or intervals,
> which requires additional fields/flags to indicate that. That further
> complicates the things ...
>
> maybe we could decompose that into separate arrays or something, but
> honestly it seems somewhat premature - there are far more important
> aspects to discuss, I think (e.g. how the ranges are built/merged in
> multi-minmax, or whether bloom opclasses are useful at all).

I see.  But there is at least a second option to introduce a new
datatype with just an output function.  In the similar way
gist/tsvector_ops uses gtsvector key type.  I think it would be more
transparent than using just bytea.  Also, this is the way we already
use in the core.

> >BTW, I've applied the patchset to the current master, but I got a lot
> >of duplicate oids.  Could you please resolve these conflicts.  I think
> >it would be good to use high oid numbers to evade conflicts during
> >development/review, and rely on committer to set final oids (as
> >discussed in [1]).
> >
> >Links
> >1. https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com
>
> Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
> with it, and it's already using quite high OIDs (part 4 uses >= 8000,
> part 5 uses >= 9000).

Yep, it appears that I was using the wrong version of patchset.
Patchset from 2020/07/03 works good on the current master.

------
Regards,
Alexander Korotkov



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Tue, Aug 04, 2020 at 05:36:51PM +0300, Alexander Korotkov wrote:
>Hi, Tomas!
>
>Sorry for the late reply.
>
>On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I think there's a number of weak points in this approach.
>>
>> Firstly, it assumes the summaries can be represented as arrays of
>> built-in types, which I'm not really sure about. It clearly is not true
>> for the bloom opclasses, for example. But even for minmax oclasses it's
>> going to be tricky because the ranges may be on different data types so
>> presumably we'd need somewhat nested data structure.
>>
>> Moreover, multi-minmax summary contains either points or intervals,
>> which requires additional fields/flags to indicate that. That further
>> complicates the things ...
>>
>> maybe we could decompose that into separate arrays or something, but
>> honestly it seems somewhat premature - there are far more important
>> aspects to discuss, I think (e.g. how the ranges are built/merged in
>> multi-minmax, or whether bloom opclasses are useful at all).
>
>I see.  But there is at least a second option to introduce a new
>datatype with just an output function.  In the similar way
>gist/tsvector_ops uses gtsvector key type.  I think it would be more
>transparent than using just bytea.  Also, this is the way we already
>use in the core.
>

So you're proposing to have new data types "brin_minmax_multi_summary"
and "brin_bloom_summary" (or some other names), with output functions
printing something nicer? I suppose that could work, and we could even
add pageinspect functions returning the value as raw bytea.

Good idea!

>> >BTW, I've applied the patchset to the current master, but I got a lot
>> >of duplicate oids.  Could you please resolve these conflicts.  I think
>> >it would be good to use high oid numbers to evade conflicts during
>> >development/review, and rely on committer to set final oids (as
>> >discussed in [1]).
>> >
>> >Links
>> >1. https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com
>>
>> Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
>> with it, and it's already using quite high OIDs (part 4 uses >= 8000,
>> part 5 uses >= 9000).
>
>Yep, it appears that I was using the wrong version of patchset.
>Patchset from 2020/07/03 works good on the current master.
>

OK, good.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Tue, Aug 04, 2020 at 05:17:43PM +0200, Tomas Vondra wrote:
>On Tue, Aug 04, 2020 at 05:36:51PM +0300, Alexander Korotkov wrote:
>>Hi, Tomas!
>>
>>Sorry for the late reply.
>>
>>On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>I think there's a number of weak points in this approach.
>>>
>>>Firstly, it assumes the summaries can be represented as arrays of
>>>built-in types, which I'm not really sure about. It clearly is not true
>>>for the bloom opclasses, for example. But even for minmax oclasses it's
>>>going to be tricky because the ranges may be on different data types so
>>>presumably we'd need somewhat nested data structure.
>>>
>>>Moreover, multi-minmax summary contains either points or intervals,
>>>which requires additional fields/flags to indicate that. That further
>>>complicates the things ...
>>>
>>>maybe we could decompose that into separate arrays or something, but
>>>honestly it seems somewhat premature - there are far more important
>>>aspects to discuss, I think (e.g. how the ranges are built/merged in
>>>multi-minmax, or whether bloom opclasses are useful at all).
>>
>>I see.  But there is at least a second option to introduce a new
>>datatype with just an output function.  In the similar way
>>gist/tsvector_ops uses gtsvector key type.  I think it would be more
>>transparent than using just bytea.  Also, this is the way we already
>>use in the core.
>>
>
>So you're proposing to have a new data types "brin_minmax_multi_summary"
>and "brin_bloom_summary" (or some other names), with output functions
>printing something nicer? I suppose that could work, and we could even
>add pageinspect functions returning the value as raw bytea.
>
>Good idea!
>

Attached is an updated version of the patch series, implementing this.
Adding the extra data types was fairly simple, because both bloom and
minmax-multi indexes already used the "struct as varlena" approach, so all
that was needed was a bunch of in/out functions and catalog records.

I've left the changes in separate patches for clarity, ultimately it'll
get merged into the other parts.
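
For reference, the SQL-level shape of such a type would be roughly the
following - a sketch only, with hypothetical function names (the patch
itself does this via catalog records rather than CREATE TYPE):

    CREATE TYPE brin_bloom_summary (
        INPUT          = brin_bloom_summary_in,   -- may simply reject input
        OUTPUT         = brin_bloom_summary_out,  -- prints a human-readable form
        INTERNALLENGTH = VARIABLE,
        STORAGE        = EXTENDED
    );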


This reminded me that the current costing may not quite work, because
it depends on how well the index is correlated to the table. That may
be OK for minmax-multi in most cases, but for bloom it makes almost no
sense - correlation does not really matter for bloom filters; what
matters is the number of distinct values in each range.

Consider this example:

create table t (a int);

insert into t select x from (
   select (i/10) as x from generate_series(1,10000000) s(i)
   order by random()
) foo;

create index on t using brin(
   a int4_bloom_ops(n_distinct_per_range=6000,
                    false_positive_rate=0.05))
with (pages_per_range = 16);

vacuum analyze t;

test=# explain analyze select * from t where a = 10000;
                                              QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
  Seq Scan on t  (cost=0.00..169247.71 rows=10 width=4) (actual time=38.088..513.654 rows=10 loops=1)
    Filter: (a = 10000)
    Rows Removed by Filter: 9999990
  Planning Time: 0.060 ms
  Execution Time: 513.719 ms
(5 rows)

test=# set enable_seqscan = off;
SET
test=# explain analyze select * from t where a = 10000;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on t  (cost=5553.07..174800.78 rows=10 width=4) (actual time=7.790..27.585 rows=10 loops=1)
    Recheck Cond: (a = 10000)
    Rows Removed by Index Recheck: 224182
    Heap Blocks: lossy=992
    ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5553.06 rows=9999977 width=0) (actual time=7.006..7.007 rows=9920 loops=1)
          Index Cond: (a = 10000)
  Planning Time: 0.052 ms
  Execution Time: 27.658 ms
(8 rows)

Clearly, the main problem is in brincostestimate relying on correlation
to tweak the selectivity estimates, leading to an estimate of almost the
whole table, when in practice we only scan a tiny fraction.

Part 0008 is an experimental tweaks the logic to ignore correlation for
bloom and minmax-multi opclasses, producing this plan:

test=# explain analyze select * from t where a = 10000;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on t  (cost=5542.01..16562.95 rows=10 width=4) (actual time=12.013..34.705 rows=10 loops=1)
    Recheck Cond: (a = 10000)
    Rows Removed by Index Recheck: 224182
    Heap Blocks: lossy=992
    ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5542.00 rows=3615 width=0) (actual time=11.108..11.109 rows=9920 loops=1)
          Index Cond: (a = 10000)
  Planning Time: 0.386 ms
  Execution Time: 34.778 ms
(8 rows)

which is way closer to reality, of course. I'm not entirely sure it
behaves correctly for multi-column BRIN indexes, but I think as a PoC
it's sufficient.

For bloom, I think we can be a bit smarter - we could use the false
positive rate as the "minimum expected selectivity" or something like
that. After all, the false positive rate essentially means "Given a
random value, what's the chance that a bloom filter matches?" So given a
table with N ranges, we expect about (N * fpr) to match. Of course, the
problem is that this only works for "full" bloom filters. Ranges with
fewer distinct values will have much lower probability, and ranges with
unexpectedly many distinct values will have much higher probability.

But I think we can ignore that and assume the index was created with good
parameters, so the bloom filters won't degrade and the target fpr is
probably a defensive value.
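
As a back-of-the-envelope illustration only (not code from the patch),
reusing the example above with pages_per_range = 16 and
false_positive_rate = 0.05:

    -- expected number of ranges matched by a random value, assuming "full" filters
    select relpages / 16                 as n_ranges,
           round(relpages / 16 * 0.05)   as expected_ranges_matched
      from pg_class
     where relname = 't';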

For minmax-multi, we probably should not ignore correlation entirely.
It does handle imperfect correlation much more gracefully than plain
minmax, but it still depends on reasonably ordered data.

A possible improvement would be to compute average "covering" of ranges,
i.e. given the length of a column domain

     D = MAX(column) - MIN(column)

compute what fraction of that is covered by a range by summing lengths
of intervals in the range, and dividing it by D. And then averaging it
over all BRIN ranges.

This would allow us to estimate how many ranges are matched by a random
value from the column domain, I think. But this requires extending what
data ANALYZE collects for indexes - I don't think any BRIN-specific
stats are collected at the moment.
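
A sketch of the "domain width" input to that estimate, for the example
table above (the per-range interval sums are the part that would need
the new BRIN-specific statistics):

    -- D = MAX(column) - MIN(column)
    select max(a) - min(a) as domain_width_d from t;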


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Michael Paquier
Date:
On Fri, Aug 07, 2020 at 06:27:01PM +0200, Tomas Vondra wrote:
> Attached is an updated version of the patch series, implementing this.
> Adding the extra data types was fairly simple, because both bloom and
> minmax-multi indexes already used "struct as varlena" approach, so all
> that needed was a bunch of in/out functions and catalog records.
>
> I've left the changes in separate patches for clarity, ultimately it'll
> get merged into the other parts.

This fails to apply per the CF bot, so please provide a rebase.
--
Michael

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Sat, Sep 05, 2020 at 10:46:48AM +0900, Michael Paquier wrote:
>On Fri, Aug 07, 2020 at 06:27:01PM +0200, Tomas Vondra wrote:
>> Attached is an updated version of the patch series, implementing this.
>> Adding the extra data types was fairly simple, because both bloom and
>> minmax-multi indexes already used "struct as varlena" approach, so all
>> that needed was a bunch of in/out functions and catalog records.
>>
>> I've left the changes in separate patches for clarity, ultimately it'll
>> get merged into the other parts.
>
>This fails to apply per the CF bot, so please provide a rebase.

OK, here is a rebased version. Most of the breakage was due to changes
to the BRIN sgml docs.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> OK, here is a rebased version. Most of the breakage was due to changes
> to the BRIN sgml docs.

Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
  from generate_series(1,10000000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 10000000, b = 10000000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

                                                        QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 9999712
   Heap Blocks: lossy=56819
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
         Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.328 ms
 Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 22240
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
         Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.250 ms
 Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.

The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
  where a between 1923300::int and 1923600::int;
                                                      QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)
         Filter: ((a >= 1923300) AND (a <= 1923600))
         Rows Removed by Filter: 3333237
 Planning Time: 0.089 ms
 Execution Time: 380.434 ms
(8 rows)

And just to compare size:

BRIN         32kB
BRIN multi  136kB
Btree       188MB

[1] https://www.postgresql.org/message-id/459eef3e-48c7-0f5a-8356-992442a78bb6%402ndquadrant.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 09, 2020 at 12:04:28PM -0400, John Naylor wrote:
>On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> OK, here is a rebased version. Most of the breakage was due to changes
>> to the BRIN sgml docs.
>
>Hi Tomas,
>
>I plan on trying some different queries on different data
>distributions to get a sense of when the planner chooses a
>multi-minmax index, and whether the choice is good.
>
>Just to start, I used the artificial example in [1], but scaled down a
>bit to save time. Config is at the default except for:
>shared_buffers = 1GB
>random_page_cost = 1.1;
>effective_cache_size = 4GB;
>
>create table t (a bigint, b int) with (fillfactor=95);
>
>insert into t select i + 1000*random(), i+1000*random()
>  from generate_series(1,10000000) s(i);
>
>update t set a = 1, b = 1 where random() < 0.001;
>update t set a = 10000000, b = 10000000 where random() < 0.001;
>
>analyze t;
>
>create index on t using brin (a);
>CREATE INDEX
>Time: 1631.452 ms (00:01.631)
>
>explain analyze select * from t
>  where a between 1923300::int and 1923600::int;
>
>                                                        QUERY PLAN

>--------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
>(actual time=217.770..1131.366 rows=288 loops=1)
>   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
>   Rows Removed by Index Recheck: 9999712
>   Heap Blocks: lossy=56819
>   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
>width=0) (actual time=3.054..3.055 rows=568320 loops=1)
>         Index Cond: ((a >= 1923300) AND (a <= 1923600))
> Planning Time: 0.328 ms
> Execution Time: 1131.411 ms
>(8 rows)
>
>Now add the multi-minmax:
>
>create index on t using brin (a int8_minmax_multi_ops);
>CREATE INDEX
>Time: 6521.026 ms (00:06.521)
>
>The first interesting thing is, with both BRIN indexes available, the
>planner still chooses the conventional BRIN index. Only when I disable
>it, does it choose the multi-minmax index:
>
>explain analyze select * from t
>  where a between 1923300::int and 1923600::int;
>
>                                                       QUERY PLAN

>-------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
>(actual time=1.835..4.196 rows=288 loops=1)
>   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
>   Rows Removed by Index Recheck: 22240
>   Heap Blocks: lossy=128
>   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
>width=0) (actual time=0.691..0.691 rows=1280 loops=1)
>         Index Cond: ((a >= 1923300) AND (a <= 1923600))
> Planning Time: 0.250 ms
> Execution Time: 4.244 ms
>(8 rows)
>
>I wonder if this is a clue that something in the costing unfairly
>penalizes a multi-minmax index. Maybe not enough to matter in
>practice, since I wouldn't expect a user to put different kinds of
>index on the same column.
>

I think this is much more an estimation issue than a costing one. Notice
that in the "regular" BRIN minmax index we have this:

    ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
        width=0) (actual time=3.054..3.055 rows=568320 loops=1)

while for the multi-minmax we have this:

    ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
        width=0) (actual time=0.691..0.691 rows=1280 loops=1)

So yes, the multi-minmax index is costed a bit higher, mostly because
the index is a bit larger. (There's also a tweak to the correlation, but
that does not make much difference because it's just 0.99 vs. 1.0.)

The main difference is that for minmax the bitmap index scan actually
matches ~568k rows (a bit confusing, considering the heap scan then has
to process almost 10M rows during recheck). But the multi-minmax only
matches ~1300 rows, with a recheck of ~22k.

I'm not sure how to consider this during costing, as we only see these
numbers at execution time. One way would be to also consider the "size"
of the ranges (i.e. max-min) vs. the range of the whole column. But
that's not something we currently have.
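
Just to illustrate the idea, something along these lines (entirely made
up, not anything in the patch - it merely shows how wide a summarized
range is relative to the column's domain):

    #include <math.h>

    /*
     * Fraction of the column's domain covered by one summarized range.
     * A range covering most of the domain matches almost any query,
     * which is exactly what the current costing cannot see.
     */
    static double
    range_hit_fraction(double range_min, double range_max,
                       double col_min, double col_max)
    {
        double  col_width = col_max - col_min;

        if (col_width <= 0)
            return 1.0;

        return fmin(1.0, (range_max - range_min) / col_width);
    }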

I'm not sure how troublesome this issue really is - I don't think people
are very likely to have both minmax and multi-minmax indexes on the same
column.

>The second thing is, with parallel seq scan, the query is faster than
>a BRIN bitmap scan, with this pathological data distribution, but the
>planner won't choose it unless forced to:
>
>set enable_bitmapscan = 'off';
>explain analyze select * from t
>  where a between 1923300::int and 1923600::int;
>                                                      QUERY PLAN

>-----------------------------------------------------------------------------------------------------------------------
> Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
>time=372.766..380.364 rows=288 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
>width=12) (actual time=268.326..366.228 rows=96 loops=3)
>         Filter: ((a >= 1923300) AND (a <= 1923600))
>         Rows Removed by Filter: 3333237
> Planning Time: 0.089 ms
> Execution Time: 380.434 ms
>(8 rows)
>

I think this is the same root cause - the planner does not realize how
bad the minmax index actually is in this case, so it uses a somewhat too
optimistic estimate for costing. And then it essentially has to do a
seqscan, with the extra work of the bitmap index/heap scan on top.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Sep-09, John Naylor wrote:

> create index on t using brin (a);
> CREATE INDEX
> Time: 1631.452 ms (00:01.631)

> create index on t using brin (a int8_minmax_multi_ops);
> CREATE INDEX
> Time: 6521.026 ms (00:06.521)

It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?

> The second thing is, with parallel seq scan, the query is faster than
> a BRIN bitmap scan, with this pathological data distribution, but the
> planner won't choose it unless forced to:
> 
> set enable_bitmapscan = 'off';
> explain analyze select * from t
>   where a between 1923300::int and 1923600::int;

This is probably explained by the fact that you likely have the whole
table in shared buffers, or at least in OS cache.  I'm not sure if the
costing should necessarily account for this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 09, 2020 at 03:40:41PM -0300, Alvaro Herrera wrote:
>On 2020-Sep-09, John Naylor wrote:
>
>> create index on t using brin (a);
>> CREATE INDEX
>> Time: 1631.452 ms (00:01.631)
>
>> create index on t using brin (a int8_minmax_multi_ops);
>> CREATE INDEX
>> Time: 6521.026 ms (00:06.521)
>
>It seems strange that the multi-minmax index takes so much longer to
>build.  I wonder if there's some obvious part of the algorithm that can
>be improved?
>

There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.

The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue - it just swaps
the Datum value and that's it.

I see two possible optimizations - firstly, adding some sort of batch
variant of the add_value function, which would get a bunch of values
instead of just a single one, amortizing the serialization costs.

Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.
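
To illustrate the second option, here is a toy sketch of the pattern I
have in mind (this is not the actual BRIN code, just the amortization
idea - collect values cheaply, serialize once per page range):

    #include <stdlib.h>

    typedef struct MemSummary
    {
        double     *values;         /* expanded, in-memory form */
        int         nvalues;
        int         maxvalues;
    } MemSummary;

    /* per-value work stays trivial - no serialization here */
    static void
    add_value(MemSummary *s, double v)
    {
        if (s->nvalues == s->maxvalues)
        {
            s->maxvalues = s->maxvalues ? s->maxvalues * 2 : 64;
            s->values = realloc(s->values, s->maxvalues * sizeof(double));
        }
        s->values[s->nvalues++] = v;
    }

    /*
     * Called once at the end of the page range - sort, merge the values
     * into intervals and build the on-disk representation (elided here).
     */
    static void
    serialize_summary(MemSummary *s)
    {
        /* qsort() + merging of adjacent values would go here */
        (void) s;
    }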


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Sep-09, Tomas Vondra wrote:

> There are some minor optimizations possible - for example I noticed we
> call minmax_multi_get_strategy_procinfo often because it happens in a
> loop, and we could easily do it just once. But that saves only about 10%
> or so, it's not a ground-breaking optimization.

Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.

> The main reason for the slowness is that we pass the values one by one
> to brin_minmax_multi_add_value - and on each call we need to deserialize
> (and then sometimes also serialize) the summary, which may be quite
> expensive. The regular minmax does not have this issue, it just swaps
> the Datum value and that's it.

Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated on
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:

> Another option would be to teach add_value to keep the deserialized
> summary somewhere, and then force serialization at the end of the BRIN
> page range. The end result would be roughly the same, I think.


Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.


On a different train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:
>On 2020-Sep-09, Tomas Vondra wrote:
>
>> There are some minor optimizations possible - for example I noticed we
>> call minmax_multi_get_strategy_procinfo often because it happens in a
>> loop, and we could easily do it just once. But that saves only about 10%
>> or so, it's not a ground-breaking optimization.
>
>Well, I guess this kind of thing should be fixed regardless while we
>still know it's there, just to avoid an obvious inefficiency.
>

Sure. I was just suggesting it's not something that'd make this very
close to the plain minmax opclass.

>> The main reason for the slowness is that we pass the values one by one
>> to brin_minmax_multi_add_value - and on each call we need to deserialize
>> (and then sometimes also serialize) the summary, which may be quite
>> expensive. The regular minmax does not have this issue, it just swaps
>> the Datum value and that's it.
>
>Ah, right, that's more interesting.  The original dumb BRIN code
>separates BrinMemTuple from BrinTuple so that things can be operated
>efficiently in memory.  Maybe something similar can be done in this
>case, which also sounds like your second suggestion:
>
>> Another option would be to teach add_value to keep the deserialized
>> summary somewhere, and then force serialization at the end of the BRIN
>> page range. The end result would be roughly the same, I think.
>

Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.

>
>Also, I think you could get a few initial patches pushed soon, since
>they look like general improvements rather than specific to multi-range.
>

Yeah, I agree. I plan to review those once again in a couple days and
then push them.

>
>On a different train of thought, I wonder if we shouldn't drop the idea
>of there being two minmax opclasses; just have one (still called
>"minmax") and have the multi-range version be the v2 of it.  We would
>still need to keep code to operate on the old one, but if you ever
>REINDEX then your index is upgraded to the new one.  I see no reason to
>keep the dumb minmax version around, assuming the performance is roughly
>similar.
>

I'm not a huge fan of that. I think it's unlikely we'll ever make this
new set of opclasses just as fast as the plain minmax, and moreover it
does have some additional requirements (e.g. the distance proc, which
may not make sense for some data types).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 09, 2020 at 10:26:00PM +0200, Tomas Vondra wrote:
>On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:
>>On 2020-Sep-09, Tomas Vondra wrote:
>>
>>>There are some minor optimizations possible - for example I noticed we
>>>call minmax_multi_get_strategy_procinfo often because it happens in a
>>>loop, and we could easily do it just once. But that saves only about 10%
>>>or so, it's not a ground-breaking optimization.
>>
>>Well, I guess this kind of thing should be fixed regardless while we
>>still know it's there, just to avoid an obvious inefficiency.
>>
>
>Sure. I was just suggesting it's not something that'd make this very
>close to plain minmax opclass.
>
>>>The main reason for the slowness is that we pass the values one by one
>>>to brin_minmax_multi_add_value - and on each call we need to deserialize
>>>(and then sometimes also serialize) the summary, which may be quite
>>>expensive. The regular minmax does not have this issue, it just swaps
>>>the Datum value and that's it.
>>
>>Ah, right, that's more interesting.  The original dumb BRIN code
>>separates BrinMemTuple from BrinTuple so that things can be operated
>>efficiently in memory.  Maybe something similar can be done in this
>>case, which also sounds like your second suggestion:
>>
>>>Another option would be to teach add_value to keep the deserialized
>>>summary somewhere, and then force serialization at the end of the BRIN
>>>page range. The end result would be roughly the same, I think.
>>
>
>Well, the patch already has Ranges (memory) and SerializedRanges (disk)
>but it's not very clear to me where to stash the in-memory data and
>where to make the conversion.
>

I've spent a bit of time experimenting with this. My idea was to allow
keeping an "expanded" version of the summary somewhere. As the addValue
function only receives BrinValues, I guess one option would be to just
add a bv_mem_values field. Or do you have a better idea?

Of course, more would need to be done:

1) We'd need to also pass the right memory context (bt_context seems
like the right thing, but that's not something addValue sees now).

2) We'd also need to specify some sort of callback that serializes the
in-memory value into bt_values. That's not something addValue can do,
because it doesn't know whether it's the last value in the range etc. I
guess one option would be to add yet another support proc, but I guess a
simple callback would be enough.

I've hacked together an experimental version of this to see how much
it would help, and it reduces the duration from ~4.6s to ~3.3s. Which is
nice, but plain minmax is ~1.1s. I suppose there's room for further
improvements in compare_combine_ranges/reduce_combine_ranges and so on,
but I still think there'll always be a gap compared to plain minmax.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Sep-10, Tomas Vondra wrote:

> I've spent a bit of time experimenting with this. My idea was to allow
> keeping an "expanded" version of the summary somewhere. As the addValue
> function only receives BrinValues I guess one option would be to just
> add bv_mem_values field. Or do you have a better idea?

Maybe it's okay to pass the BrinMemTuple to the add_value function, and
keep something there.  Or maybe that's pointless and just a new field in
BrinValues is okay.

> Of course, more would need to be done:
> 
> 1) We'd need to also pass the right memory context (bt_context seems
> like the right thing, but that's not something addValue sees now).

You could use GetMemoryChunkContext() for that.

> 2) We'd also need to specify some sort of callback that serializes the
> in-memory value into bt_values. That's not something addValue can do,
> because it doesn't know whether it's the last value in the range etc. I
> guess one option would be to add yet another support proc, but I guess a
> simple callback would be enough.

Hmm.

> I've hacked together an experimental version of this to see how much
> would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
> nice, but plain minmax is ~1.1s. I suppose there's room for further
> improvements in compare_combine_ranges/reduce_combine_ranges and so on,
> but I still think there'll always be a gap compared to plain minmax.

The main reason I'm talking about desupporting plain minmax is that,
even if it's amazingly fast, it loses quite quickly in real-world cases
because of loss of correlation.  Minmax's build time is pretty much
determined by speed at which you can seqscan the table.  I don't think
we lose much if we add overhead in order to create an index that is 100x
more useful.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
Ok, here's an attempt at a somewhat more natural test, to see what
happens after bulk updates and deletes, followed by more inserts. The
short version is that multi-minmax is resilient to a change that
causes a 4x degradation for simple minmax.


shared_buffers = 1GB
random_page_cost = 1.1
effective_cache_size = 4GB
work_mem = 64MB
maintenance_work_mem = 512MB

create unlogged table iot (
    id bigint generated by default as identity primary key,
    num double precision not null,
    create_dt timestamptz not null,
    stuff text generated always as (md5(id::text)) stored
)
with (fillfactor = 95);

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-01 0:00'::timestamptz +'49000999 seconds'::interval,
  '2 seconds'::interval) x;

INSERT 0 24500500

(01:18s, 2279 MB)

-- done in separate tests so the planner can choose each in turn
create index cd_single on iot using brin(create_dt);
6.7s
create index cd_multi on iot using brin(create_dt timestamptz_minmax_multi_ops);
34s

vacuum analyze;

-- aggregate February
-- single minmax and multi-minmax same plan and same Heap Blocks
below, so only one plan shown
-- query times between the opclasses within noise of variation

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;


      QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=357664.79..388181.83 rows=1232234 width=16)
(actual time=559.805..561.649 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=323.74..313622.05 rows=1232234
width=16) (actual time=1.787..368.256 rows=1252800 loops=1)
         Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
         Rows Removed by Index Recheck: 15936
         Heap Blocks: lossy=15104
         ->  Bitmap Index Scan on cd_single  (cost=0.00..15.68
rows=1236315 width=0) (actual time=0.933..0.934 rows=151040 loops=1)
               Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.118 ms
 Execution Time: 568.653 ms
(11 rows)


-- delete first month and hi/lo values to create some holes in the table
delete from iot
where create_dt < '2020-02-01 0:00'::timestamptz;

DELETE 1339200

delete from iot
where num < 0.05
or num > 0.95;

DELETE 2316036

vacuum analyze iot;

-- add back the first month, but with double density (1s step rather
than 2s) so it spills over into other parts of the table, causing more
block ranges to have a lower bound in this month.

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-31 23:59'::timestamptz,
  '1 second'::interval) x;

INSERT 0 2678341

vacuum analyze;

-- aggregate February again

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;

-- simple minmax:

      QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=354453.63..383192.38 rows=1160429 width=16)
(actual time=2375.075..2376.982 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=305.85..312977.36 rows=1160429
width=16) (actual time=8.162..2201.547 rows=1127668 loops=1)
         Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
         Rows Removed by Index Recheck: 12278985
         Heap Blocks: lossy=159616
         ->  Bitmap Index Scan on cd_single  (cost=0.00..15.74
rows=1206496 width=0) (actual time=7.177..7.178 rows=1596160 loops=1)
               Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.117 ms
 Execution Time: 2383.685 ms
(11 rows)

-- multi minmax:

      QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=354089.57..382932.46 rows=1164634 width=16)
(actual time=535.773..537.731 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=376.07..312463.00 rows=1164634
width=16) (actual time=3.731..363.116 rows=1127117 loops=1)
         Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
         Rows Removed by Index Recheck: 141619
         Heap Blocks: lossy=15104
         ->  Bitmap Index Scan on cd_multi  (cost=0.00..84.92
rows=1166823 width=0) (actual time=3.048..3.048 rows=151040 loops=1)
               Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.117 ms
 Execution Time: 545.246 ms
(11 rows)

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Thu, Sep 10, 2020 at 05:01:37PM -0300, Alvaro Herrera wrote:
>On 2020-Sep-10, Tomas Vondra wrote:
>
>> I've spent a bit of time experimenting with this. My idea was to allow
>> keeping an "expanded" version of the summary somewhere. As the addValue
>> function only receives BrinValues I guess one option would be to just
>> add bv_mem_values field. Or do you have a better idea?
>
>Maybe it's okay to pass the BrinMemTuple to the add_value function, and
>keep something there.  Or maybe that's pointless and just a new field in
>BrinValues is okay.
>

OK. I don't like changing the add_value API very much, so for the
experimental version I simply added three new fields into the BrinValues
struct - the deserialized value, serialization callback and the memory
context. This seems to be working well enough for a WIP version.

With the original (non-batched) patch version a build of an index took
about 4s. With the minmax_multi_get_strategy_procinfo optimization and
batch build it now takes ~2.6s, which is quite nice. It's still ~2.5x as
much compared to plain minmax though.

I think there's still room for a bit more improvement (in how we merge
the ranges etc.) and maybe we can get to ~2s or something like that.


>> Of course, more would need to be done:
>>
>> 1) We'd need to also pass the right memory context (bt_context seems
>> like the right thing, but that's not something addValue sees now).
>
>You could use GetMemoryChunkContext() for that.
>

Maybe, although I prefer to just pass the memory context explicitly.

>> 2) We'd also need to specify some sort of callback that serializes the
>> in-memory value into bt_values. That's not something addValue can do,
>> because it doesn't know whether it's the last value in the range etc. I
>> guess one option would be to add yet another support proc, but I guess a
>> simple callback would be enough.
>
>Hmm.
>

I added a simple serialization callback. It works but it's a bit weird
that we have most functions defined as support procedures, and then
this extra C callback.

>> I've hacked together an experimental version of this to see how much
>> would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
>> nice, but plain minmax is ~1.1s. I suppose there's room for further
>> improvements in compare_combine_ranges/reduce_combine_ranges and so on,
>> but I still think there'll always be a gap compared to plain minmax.
>
>The main reason I'm talking about desupporting plain minmax is that,
>even if it's amazingly fast, it loses quite quickly in real-world cases
>because of loss of correlation.  Minmax's build time is pretty much
>determined by speed at which you can seqscan the table.  I don't think
>we lose much if we add overhead in order to create an index that is 100x
>more useful.
>

I understand. I just feel a bit uneasy about replacing an index with
something that may or may not be better for a certain use case. I mean,
if you have a data set for which regular minmax works fine, wouldn't you
be annoyed if we just switched it for something slower?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Fri, Sep 11, 2020 at 6:14 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

> I understand. I just feel a bit uneasy about replacing an index with
> something that may or may not be better for a certain use case. I mean,
> if you have data set for which regular minmax works fine, wouldn't you
> be annoyed if we just switched it for something slower?

How about making multi minmax the default for new indexes, and those
who know their data will stay very well correlated can specify simple
minmax ops for speed? Upgraded indexes would stay the same, and only
new ones would have the risk of slowdown if not attended to.

Also, I wonder if the slowdown in building a new index is similar to
the slowdown for updates. I'd like to run some TPC-H tests (that will
take some time).

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Fri, Sep 11, 2020 at 10:08:15AM -0400, John Naylor wrote:
>On Fri, Sep 11, 2020 at 6:14 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>
>> I understand. I just feel a bit uneasy about replacing an index with
>> something that may or may not be better for a certain use case. I mean,
>> if you have data set for which regular minmax works fine, wouldn't you
>> be annoyed if we just switched it for something slower?
>
>How about making multi minmax the default for new indexes, and those
>who know their data will stay very well correlated can specify simple
>minmax ops for speed? Upgraded indexes would stay the same, and only
>new ones would have the risk of slowdown if not attended to.
>

That might work, I think. I like that it's an explicit choice, i.e. we
may change what the default opclass is, but the behavior won't change
unexpectedly during REINDEX etc. It might still be a bit surprising
after dump/restore, but that's probably fine.

It would be ideal if the opclasses were binary compatible, allowing a
more seamless transition. Unfortunately that seems impossible, because
plain minmax uses two Datums to store the range, while multi-minmax uses
a more complex structure.

>Also, I wonder if the slowdown in building a new index is similar to
>the slowdown for updates. I'd like to run some TPC-H tests (that will
>take some time).
>

It might be, because it needs to deserialize/serialize the summary too,
and there's no option to amortize the costs over many inserts. OTOH the
insert probably needs to do various other things, so maybe it won't be
that bad. But yeah, testing and benchmarking it would be nice. Do you
plan to test just the minmax-multi opclass, or will you look at the
bloom one too?

Attached is a slightly improved version - I've merged the various pieces
into the "main" patches, and made some minor additional optimizations.
I've left the cost tweak as a separate part for now, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Fri, Sep 11, 2020 at 2:05 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> that bad. But yeah, testing and benchmarking it would be nice. Do you
> plan to test just the minmax-multi opclass, or will you look at the
> bloom one too?

Yeah, I'll start looking at bloom next week, and I'll include it when
I do perf testing.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Fri, Sep 11, 2020 at 03:19:58PM -0400, John Naylor wrote:
>On Fri, Sep 11, 2020 at 2:05 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> that bad. But yeah, testing and benchmarking it would be nice. Do you
>> plan to test just the minmax-multi opclass, or will you look at the
>> bloom one too?
>
>Yeah, I'll start looking at bloom next week, and I'll include it when
>I do perf testing.
>

OK. Here is a slightly improved version of the patch series, with better
commit messages and comments, and with the two patches tweaking handling
of NULL values merged into one.

As mentioned in my reply to Alvaro, I'm hoping to get the first two
parts (general improvements) committed soon, so that we can focus on the
new opclasses. I now recall why I was postponing pushing those parts -
it's primarily "just" a preparation for the new opclasses. Neither the
scan key changes nor the NULL handling tweaks are going to help existing
opclasses very much, I think.

The NULL-handling might help a bit, but the scan key changes are mostly
irrelevant. So I'm wondering if we should even change the two existing
opclasses, instead of keeping them as they are (the code actually
supports that by checking the number of arguments of the consistent
function). Opinions?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

while running some benchmarks to see if the first two patches cause any
regressions, I found a bug in 0002 which reworks the NULL handling. The
code failed to eliminate ranges early using the IS NULL scan keys,
resulting in expensive recheck. The attached version fixes that.

I also noticed that some of the queries seem to be slightly slower, most
likely due to bringetbitmap having to split the scan keys per attribute,
which also requires some allocations etc. The regression is fairly small
and might be just noise (less than 2-3% in most cases), but it seems that
just allocating everything in a single chunk eliminates most of it - this is
what the new 0003 patch does.

OTOH the rework also helps in other cases - I've measured ~2-3% speedups
for cases where moving the IS NULL handling to bringetbitmap eliminates
calls to the consistent function (e.g. IS NULL queries on columns with
no NULL values).

These results seem very dependent on the hardware (especially CPU),
though, and the differences are pretty small in general (1-2%).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> <20200913 patch set>

Hi Tomas,

The cfbot fails to apply this, but with 0001 from 0912 it works on my
end, so going with that.

One problem I have is I don't get success with the new reloptions:

create index cd_multi on iot using brin(create_dt
timestamptz_minmax_multi_ops) with (values_per_range = 64);
ERROR:  unrecognized parameter "values_per_range"

create index  on iot using brin(create_dt timestamptz_bloom_ops) with
(n_distinct_per_range = 16);
ERROR:  unrecognized parameter "n_distinct_per_range"


Aside from that, I'm going to try to understand the code, and ask
questions. Some of the code might still change, but I don't think it's
too early to do some comment and docs proofreading. I'll do this in
separate emails for bloom and multi-minmax to keep it from being too
long. Perf testing will come sometime later.


Bloom
-----

+     greater than 0.0 and smaller than 1.0. The default values is 0.01,

+     rows per block). The default values is <literal>-0.1</literal>, and

s/values/value/

+     the minimum number of distinct non-null values is <literal>128</literal>.

I don't see 128 in the code, but I do see this, is this the intention?:

#define BLOOM_MIN_NDISTINCT_PER_RANGE 16

+ * Bloom filters allow efficient test whether a given page range contains
+ * a particular value. Therefore, if we summarize each page range into a
+ * bloom filter, we can easily and cheaply test wheter it containst values
+ * we get later.

s/test/testing/
s/wheter it containst/whether it contains/

+ * The index only supports equality operator, similarly to hash indexes.

s/operator/operators/

+ * The number of distinct elements (in a page range) depends on the data,
+ * we can consider it fixed. This simplifies the trade-off to just false
+ * positive rate vs. size.

Sounds like the first sentence should start with "although".

+ * of distinct items to be stored in the filter. We can't quite the input
+ * data, of course, but we may make the BRIN page ranges smaller - instead

I think you accidentally a word.

+ * Of course, good sizing decisions depend on having the necessary data,
+ * i.e. number of distinct values in a page range (of a given size) and
+ * table size (to estimate cost change due to change in false positive
+ * rate due to having larger index vs. scanning larger indexes). We may
+ * not have that data - for example when building an index on empty table
+ * it's not really possible. And for some data we only have estimates for
+ * the whole table and we can only estimate per-range values (ndistinct).

and

+ * The current logic, implemented in brin_bloom_get_ndistinct, attempts to
+ * make some basic sizing decisions, based on the table ndistinct estimate.

+ * XXX We might also fetch info about ndistinct estimate for the column,
+ * and compute the expected number of distinct values in a range. But

Maybe I'm missing something, but the first two comments don't match
the last one -- I don't see where we get table ndistinct, which I take
to mean from the stats catalog?

+ * To address these issues, the opclass stores the raw values directly, and
+ * only switches to the actual bloom filter after reaching the same space
+ * requirements.

IIUC, it's after reaching a certain size (BLOOM_MAX_UNSORTED * 4), so
"same" doesn't make sense here.

+ /*
+ * The 1% value is mostly arbitrary, it just looks nice.
+ */
+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */

I think we'd want better stated justification for this default, even
if just precedence in other implementations. Maybe I can test some
other values here?

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

Yeah, I think it's obvious enough to leave out.

+ m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 /
(pow(2.0, log(2.0)))));

I find this pretty hard to read and pgindent is going to mess it up
further. I would have a comment with the formula in math notation
(note that you can dispense with the reciprocal and just use
negation), but in code fold the last part to a constant. That might go
against project style, though:

m = ceil(ndistinct * log(false_positive_rate) * -2.08136);
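
For the archives, this is just the textbook sizing math; a throwaway
standalone check of that constant (my own snippet, not patch code):

    #include <math.h>
    #include <stdio.h>

    /* m = -n * ln(p) / (ln 2)^2, and -1 / (ln 2)^2 ~= -2.0813689 */
    int
    main(void)
    {
        double  ndistinct = 1000.0;
        double  fpr = 0.01;
        double  m = ceil(-ndistinct * log(fpr) / (log(2.0) * log(2.0)));
        double  k = round((m / ndistinct) * log(2.0));  /* optimal hash count */

        printf("bits = %.0f, hash functions = %.0f\n", m, k);
        return 0;
    }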

+ * XXX Maybe the 64B min size is not really needed?

Something to figure out before commit?

+ /* assume 'not updated' by default */
+ Assert(filter);

I don't see how these are related.

+ big_h = ((uint32) DatumGetInt64(hash_uint32(value)));

I'm curious about the Int64 part -- most callers use the bare value or
with DatumGetUInt32().

Also, is there a reference for the algorithm for hash values that
follows? I didn't see anything like it in my cursory reading of the
topic. Might be good to include it in the comments.

+ * Tweak the ndistinct value based on the pagesPerRange value. First,

Nitpick: "Tweak" to my mind means to adjust an existing value. The
above is only true if ndistinct is negative, but we're really not
tweaking, but using it as a scale factor. Otherwise it's not adjusted,
only clamped.

+ * XXX We can only do this when the pagesPerRange value was supplied.
+ * If it wasn't, it has to be a read-only access to the index, in which
+ * case we don't really care. But perhaps we should fall-back to the
+ * default pagesPerRange value?

I don't understand this.

+static double
+brin_bloom_get_fp_rate(BrinDesc *bdesc, BloomOptions *opts)
+{
+ return BloomGetFalsePositiveRate(opts);
+}

The body of the function is just a macro not used anywhere else -- is
there a reason for having the macro? Also, what's the first parameter
for?

Similarly, BloomGetNDistinctPerRange could just be inlined or turned
into a function for readability.

+ * or null if it is not exists.

s/is not exists/does not exist/

+ /*
+ * XXX not sure the detoasting is necessary (probably not, this
+ * can only be in an index).
+ */

Something to find out before commit?

+ /* TODO include the sorted/unsorted values */

Patch TODO or future TODO?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
>On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> <20200913 patch set>
>
>Hi Tomas,
>
>The cfbot fails to apply this, but with 0001 from 0912 it works on my
>end, so going with that.
>

Hmm, it seems to fail because of some whitespace errors. Attached is an
updated version resolving that.

>One problem I have is I don't get success with the new reloptions:
>
>create index cd_multi on iot using brin(create_dt
>timestamptz_minmax_multi_ops) with (values_per_range = 64);
>ERROR:  unrecognized parameter "values_per_range"
>
>create index  on iot using brin(create_dt timestamptz_bloom_ops) with
>(n_distinct_per_range = 16);
>ERROR:  unrecognized parameter "n_distinct_per_range"
>

But those are opclass parameters, so the parameters are not specified in
the WITH clause, but right after the opclass name:

CREATE INDEX idx ON table USING brin (
    bigint_col int8_minmax_multi_ops(values_per_range = 15)
);


>
>Aside from that, I'm going to try to understand the code, and ask
>questions. Some of the code might still change, but I don't think it's
>too early to do some comment and docs proofreading. I'll do this in
>separate emails for bloom and multi-minmax to keep it from being too
>long. Perf testing will come sometime later.
>

OK.

>
>Bloom
>-----
>
>+     greater than 0.0 and smaller than 1.0. The default values is 0.01,
>
>+     rows per block). The default values is <literal>-0.1</literal>, and
>
>s/values/value/
>
>+     the minimum number of distinct non-null values is <literal>128</literal>.
>
>I don't see 128 in the code, but I do see this, is this the intention?:
>
>#define BLOOM_MIN_NDISTINCT_PER_RANGE 16
>

Ah, that's right - I might have lowered the default after writing the
comment. Will fix.

>+ * Bloom filters allow efficient test whether a given page range contains
>+ * a particular value. Therefore, if we summarize each page range into a
>+ * bloom filter, we can easily and cheaply test wheter it containst values
>+ * we get later.
>
>s/test/testing/
>s/wheter it containst/whether it contains/
>

OK, will reword.

>+ * The index only supports equality operator, similarly to hash indexes.
>
>s/operator/operators/
>

Hmmm, are there really multiple equality operators?

>+ * The number of distinct elements (in a page range) depends on the data,
>+ * we can consider it fixed. This simplifies the trade-off to just false
>+ * positive rate vs. size.
>
>Sounds like the first sentence should start with "although".
>

Yeah, probably. Or maybe there should be "but" at the beginning of the
second sentence.

>+ * of distinct items to be stored in the filter. We can't quite the input
>+ * data, of course, but we may make the BRIN page ranges smaller - instead
>
>I think you accidentally a word.
>

Seems like that.

>+ * Of course, good sizing decisions depend on having the necessary data,
>+ * i.e. number of distinct values in a page range (of a given size) and
>+ * table size (to estimate cost change due to change in false positive
>+ * rate due to having larger index vs. scanning larger indexes). We may
>+ * not have that data - for example when building an index on empty table
>+ * it's not really possible. And for some data we only have estimates for
>+ * the whole table and we can only estimate per-range values (ndistinct).
>
>and
>
>+ * The current logic, implemented in brin_bloom_get_ndistinct, attempts to
>+ * make some basic sizing decisions, based on the table ndistinct estimate.
>
>+ * XXX We might also fetch info about ndistinct estimate for the column,
>+ * and compute the expected number of distinct values in a range. But
>
>Maybe I'm missing something, but the first two comments don't match
>the last one -- I don't see where we get table ndistinct, which I take
>to mean from the stats catalog?
>

Ah, right. The part suggesting we're looking at the table n_distinct
estimate is obsolete - some older version of the patch attempted to do
that, but I decided to remove it at some point. We can add it in the
future, but I'll fix the comment for now.

>+ * To address these issues, the opclass stores the raw values directly, and
>+ * only switches to the actual bloom filter after reaching the same space
>+ * requirements.
>
>IIUC, it's after reaching a certain size (BLOOM_MAX_UNSORTED * 4), so
>"same" doesn't make sense here.
>

Ummm, no. BLOOM_MAX_UNSORTED has nothing to do with the switch from
sorted mode to hashing (which is storing an actual bloom filter).

BLOOM_MAX_UNSORTED only determines the number of new items that may not
be sorted - we don't sort after each insertion, but only once in a while
to amortize the costs. So for example you may have 1000 sorted values
and then we allow adding 32 new ones before sorting the array again
(using a merge sort). But all of this is in the "sorted" mode.

The number of items the comment refers to is this:

     /* how many uint32 hashes can we fit into the bitmap */
     int maxvalues = filter->nbits / (8 * sizeof(uint32));

where nbits is the size of the bloom filter. So I think the "same" is
quite right here.
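
For example (made-up numbers): with a ~9600-bit filter the bitmap itself
takes ~1200 bytes, and 9600 / (8 * sizeof(uint32)) = 300, so we keep
storing raw uint32 hashes until we have about 300 of them - at which
point the raw values would occupy the same space as the bitmap, and we
switch to the actual bloom filter.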

>+ /*
>+ * The 1% value is mostly arbitrary, it just looks nice.
>+ */
>+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
>
>I think we'd want better stated justification for this default, even
>if just precedence in other implementations. Maybe I can test some
>other values here?
>

Well, I don't know how to pick a better default :-( Ultimately it's a
trade-off between a larger index (lower false positive rate) and scanning
a larger fraction of the table (higher false positive rate). Then there's
the restriction that the whole
index tuple needs to fit into a single 8kB page.

>+ * XXX Perhaps we could save a few bytes by using different data types, but
>+ * considering the size of the bitmap, the difference is negligible.
>
>Yeah, I think it's obvious enough to leave out.
>
>+ m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 /
>(pow(2.0, log(2.0)))));
>
>I find this pretty hard to read and pgindent is going to mess it up
>further. I would have a comment with the formula in math notation
>(note that you can dispense with the reciprocal and just use
>negation), but in code fold the last part to a constant. That might go
>against project style, though:
>
>m = ceil(ndistinct * log(false_positive_rate) * -2.08136);
>

Hmm, maybe. I've mostly just copied this out from some bloom filter
paper, but maybe it's not readable.

>+ * XXX Maybe the 64B min size is not really needed?
>
>Something to figure out before commit?
>

Probably. I think this optimization is somewhat pointless and we should
just allocate the right amount of space, and repalloc if needed.

>+ /* assume 'not updated' by default */
>+ Assert(filter);
>
>I don't see how these are related.
>

I think they are not related, although the formatting might make it seem
like that.

>+ big_h = ((uint32) DatumGetInt64(hash_uint32(value)));
>
>I'm curious about the Int64 part -- most callers use the bare value or
>with DatumGetUInt32().
>

Yeah, that formula should use DatumGetUInt32.

>Also, is there a reference for the algorithm for hash values that
>follows? I didn't see anything like it in my cursory reading of the
>topic. Might be good to include it in the comments.
>

This was suggested by Yura Sokolov [1] in 2017. The post refers to a
paper [2] but I don't recall which part describes "our" algorithm.

[1] https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f03e@postgrespro.ru
[2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf


>+ * Tweak the ndistinct value based on the pagesPerRange value. First,
>
>Nitpick: "Tweak" to my mind means to adjust an existing value. The
>above is only true if ndistinct is negative, but we're really not
>tweaking, but using it as a scale factor. Otherwise it's not adjusted,
>only clamped.
>

OK. Perhaps 'adjust' would be a better term?

>+ * XXX We can only do this when the pagesPerRange value was supplied.
>+ * If it wasn't, it has to be a read-only access to the index, in which
>+ * case we don't really care. But perhaps we should fall-back to the
>+ * default pagesPerRange value?
>
>I don't understand this.
>

IIRC I thought there were situations when the pagesPerRange value is not
defined, e.g. in read-only access. But I'm not sure about this, and
considering the code actually does not check for that (in fact, it has an
assert enforcing a valid block number) I think it's a stale comment.


>+static double
>+brin_bloom_get_fp_rate(BrinDesc *bdesc, BloomOptions *opts)
>+{
>+ return BloomGetFalsePositiveRate(opts);
>+}
>
>The body of the function is just a macro not used anywhere else -- is
>there a reason for having the macro? Also, what's the first parameter
>for?
>

No reason. I think the function used to be more complicated at some
point, but it got simpler.

>Similarly, BloomGetNDistinctPerRange could just be inlined or turned
>into a function for readability.
>

Same story.

>+ * or null if it is not exists.
>
>s/is not exists/does not exist/
>
>+ /*
>+ * XXX not sure the detoasting is necessary (probably not, this
>+ * can only be in an index).
>+ */
>
>Something to find out before commit?
>
>+ /* TODO include the sorted/unsorted values */
>

This was implemented as part of the discussion about pageinspect, and
I wanted some confirmation if the approach is acceptable or not before
spending more time on it. Also, the values are really just hashes of the
column values, so I'm not quite sure it makes sense to include that.
What would you suggest?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
OK,

cfbot was not quite happy with the last version either - there was a bug
in the 0003 part, allocating a smaller chunk of memory than needed. Attached
is a version fixing that; hopefully cfbot will be happy with this one.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Thu, Sep 17, 2020 at 12:34 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
> >On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> <20200913 patch set>
> But those are opclass parameters, so the parameters are not specified in
> WITH clause, but right after the opclass name:
>
> CREATE INDEX idx ON table USING brin (
>         bigint_col int8_minmax_multi_ops(values_per_range = 15)
> );

D'oh!

> >+ * The index only supports equality operator, similarly to hash indexes.
> >
> >s/operator/operators/
> >
>
> Hmmm, are there really multiple equality operators?

Ah, I see what you meant -- then "_the_ equality operator" is what we want.

> The number of items the comment refers to is this:
>
>      /* how many uint32 hashes can we fit into the bitmap */
>      int maxvalues = filter->nbits / (8 * sizeof(uint32));
>
> where nbits is the size of the bloom filter. So I think the "same" is
> quite right here.

Ok, I get it now.

> >+ /*
> >+ * The 1% value is mostly arbitrary, it just looks nice.
> >+ */
> >+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
> >
> >I think we'd want better stated justification for this default, even
> >if just precedence in other implementations. Maybe I can test some
> >other values here?
> >
>
> Well, I don't know how to pick a better default :-( Ultimately it's a
> tarde-off between larger indexes and scanning larger fraction of a table
> due to lower false positive. Then there's the restriction that the whole
> index tuple needs to fit into a single 8kB page.

Well, it might be a perfectly good default, and it seems common in
articles on the topic, but the comment is referring to aesthetics. :-)
I still intend to test some cases.

> >Also, is there a reference for the algorithm for hash values that
> >follows? I didn't see anything like it in my cursory reading of the
> >topic. Might be good to include it in the comments.
> >
>
> This was suggested by Yura Sokolov [1] in 2017. The post refers to a
> paper [2] but I don't recall which part describes "our" algorithm.
>
> [1] https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f03e@postgrespro.ru
> [2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

 big_h = DatumGetUint32(hash_uint32(value));
 h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.

Also, I take it that f(i) = 1 in the patch, hence the increment here:

+ h += d++;

But it's a little hidden. That might be worth commenting, if I haven't
completely missed something.
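
For what it's worth, here's roughly how I'm reading the probing loop, as
a standalone sketch (names invented, not the patch's actual code):

    #include <stdint.h>

    static void
    bloom_set_bits(uint8_t *bitmap, uint32_t nbits, int nhashes, uint32_t big_h)
    {
        uint32_t    h = big_h % nbits;          /* first probe position */
        uint32_t    d = big_h % (nbits - 1);    /* step, from the same hash */

        for (int i = 0; i < nhashes; i++)
        {
            bitmap[h / 8] |= (uint8_t) (1 << (h % 8));
            h = (h + d) % nbits;    /* h1 + i*h2 ... */
            d++;                    /* ... plus the growing f(i) term */
        }
    }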

> >+ * Tweak the ndistinct value based on the pagesPerRange value. First,
> >
> >Nitpick: "Tweak" to my mind means to adjust an existing value. The
> >above is only true if ndistinct is negative, but we're really not
> >tweaking, but using it as a scale factor. Otherwise it's not adjusted,
> >only clamped.
> >
>
> OK. Perhaps 'adjust' would be a better term?

I felt like rewriting the whole thing, but your original gets the
point across ok, really.

"If the passed ndistinct value is positive, we can just use that, but
we also clamp the value to prevent over-sizing the bloom filter
unnecessarily. If it's negative, it represents a multiplier to apply
to the maximum number of tuples in the range (assuming each page gets
MaxHeapTuplesPerPage tuples, which is likely a significant
over-estimate), similar to the corresponding value in table
statistics."
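
i.e. roughly this, in throwaway pseudo-C (the names and the clamp limit
are invented, just to spell out my reading of it):

    #include <math.h>

    static double
    effective_ndistinct(double ndistinct, int pages_per_range,
                        int max_tuples_per_page)
    {
        if (ndistinct > 0)
            return fmin(ndistinct, 10000.0);    /* clamp; this limit is made up */

        /* a negative value acts as a multiplier on the max tuples in the range */
        return -ndistinct * (double) max_tuples_per_page * pages_per_range;
    }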

> >+ /* TODO include the sorted/unsorted values */
> >
>
> This was simplemented as part of the discussion about pageinspect, and
> I wanted some confirmation if the approach is acceptable or not before
> spending more time on it. Also, the values are really just hashes of the
> column values, so I'm not quite sure it makes sense to include that.
> What would you suggest?

My gut feeling is the hashes are not useful for this purpose, but I
don't feel strongly either way.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
I wrote:

> Hmm, I came across that paper while doing background reading. Okay,
> now I get that "% (filter->nbits - 1)" is the second hash function in
> that scheme. But now I wonder if that second function should actually
> act on the passed "value" (the original hash), so that they are
> actually independent, as required. In the language of that paper, the
> patch seems to have
>
> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>
> instead of
>
> g(x) = h1(x) + i*h2(x) + f(i)
>
> Concretely, I'm wondering if it should be:
>
>  big_h = DatumGetUint32(hash_uint32(value));
>  h = big_h % filter->nbits;
> -d = big_h % (filter->nbits - 1);
> +d = value % (filter->nbits - 1);
>
> But I could be wrong.

I'm wrong -- if we use different operands to the moduli, we throw away
the assumption of co-primeness. But I'm still left wondering why we
have to re-hash the hash for this to work. In any case, there should
be some more documentation around the core algorithm, so that future
readers are not left scratching their heads.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Thu, Sep 17, 2020 at 05:42:59PM -0400, John Naylor wrote:
>On Thu, Sep 17, 2020 at 12:34 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
>> >On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> <20200913 patch set>
>> But those are opclass parameters, so the parameters are not specified in
>> WITH clause, but right after the opclass name:
>>
>> CREATE INDEX idx ON table USING brin (
>>         bigint_col int8_minmax_multi_ops(values_per_range = 15)
>> );
>
>D'oh!
>
>> >+ * The index only supports equality operator, similarly to hash indexes.
>> >
>> >s/operator/operators/
>> >
>>
>> Hmmm, are there really multiple equality operators?
>
>Ah, I see what you meant -- then "_the_ equality operator" is what we want.
>
>> The number of items the comment refers to is this:
>>
>>      /* how many uint32 hashes can we fit into the bitmap */
>>      int maxvalues = filter->nbits / (8 * sizeof(uint32));
>>
>> where nbits is the size of the bloom filter. So I think the "same" is
>> quite right here.
>
>Ok, I get it now.
>
>> >+ /*
>> >+ * The 1% value is mostly arbitrary, it just looks nice.
>> >+ */
>> >+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
>> >
>> >I think we'd want better stated justification for this default, even
>> >if just precedence in other implementations. Maybe I can test some
>> >other values here?
>> >
>>
>> Well, I don't know how to pick a better default :-( Ultimately it's a
>> trade-off between larger indexes (lower false-positive rate) and scanning
>> a larger fraction of the table (higher rate). Then there's the restriction that the whole
>> index tuple needs to fit into a single 8kB page.
>
>Well, it might be a perfectly good default, and it seems common in
>articles on the topic, but the comment is referring to aesthetics. :-)
>I still intend to test some cases.
>

I think we may formulate this as a question of how much I/O we need to
do for a random query, and pick the false positive rate minimizing that.
For a single BRIN range an approximation might look like this:

   bloom_size(fpr, ...) + (fpr * range_size) + (selectivity * range_size)

The "selectivity" shows the true selectivity of ranges, and it might be
esimated from a per-row selectivity I guess. But it does not matter much
because this is constant and independent of the false-positive rate, so
we can ignore it. Which leaves us with

   bloom_size(fpr, ...) + (fpr * range_size)

We might solve this for fixed parameters (range_size, ndistinct, ...),
either analytically or by brute force, giving us the "optimal" fpr.

The trouble is the bloom_size is restricted, and we don't really know
the limit - the whole index tuple needs to fit on a single 8kB page, and
there may be other BRIN summaries etc. So I've opted to use a somewhat
defensive default for the false positive rate.
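
Just to illustrate the idea, a brute-force version of that could be as
simple as this (standalone sketch, using the usual bloom sizing formula and
made-up numbers, not anything from the patch):

    #include <math.h>
    #include <stdio.h>

    /* bits needed for n distinct items at false positive rate p */
    static double
    bloom_size_bits(double n, double p)
    {
        return -n * log(p) / (log(2) * log(2));
    }

    int
    main(void)
    {
        double  ndistinct = 3712;           /* items per range (example) */
        double  range_bytes = 128 * 8192;   /* 128 pages of 8kB */
        double  best_fpr = 0;
        double  best_cost = INFINITY;

        for (double fpr = 0.0001; fpr <= 0.25; fpr *= 1.05)
        {
            /* filter size plus expected extra heap I/O for one range */
            double  cost = bloom_size_bits(ndistinct, fpr) / 8
                           + fpr * range_bytes;

            if (cost < best_cost)
            {
                best_cost = cost;
                best_fpr = fpr;
            }
        }

        printf("optimal fpr ~ %f\n", best_fpr);
        return 0;
    }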

>> >Also, is there a reference for the algorithm for hash values that
>> >follows? I didn't see anything like it in my cursory reading of the
>> >topic. Might be good to include it in the comments.
>> >
>>
>> This was suggested by Yura Sokolov [1] in 2017. The post refers to a
>> paper [2] but I don't recall which part describes "our" algorithm.
>>
>> [1] https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f03e@postgrespro.ru
>> [2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
>
>Hmm, I came across that paper while doing background reading. Okay,
>now I get that "% (filter->nbits - 1)" is the second hash function in
>that scheme. But now I wonder if that second function should actually
>act on the passed "value" (the original hash), so that they are
>actually independent, as required. In the language of that paper, the
>patch seems to have
>
>g(x) = h1(x) + i*h2(h1(x)) + f(i)
>
>instead of
>
>g(x) = h1(x) + i*h2(x) + f(i)
>
>Concretely, I'm wondering if it should be:
>
> big_h = DatumGetUint32(hash_uint32(value));
> h = big_h % filter->nbits;
>-d = big_h % (filter->nbits - 1);
>+d = value % (filter->nbits - 1);
>
>But I could be wrong.
>
>Also, I take it that f(i) = 1 in the patch, hence the increment here:
>
>+ h += d++;
>
>But it's a little hidden. That might be worth commenting, if I haven't
>completely missed something.
>

OK
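
To spell it out for future readers, the mapping under discussion amounts to
roughly this (a standalone sketch, not the actual patch code):

    #include <stdint.h>

    /*
     * Set nhashes bits for one hashed value. Both h and d are derived from
     * the same 32-bit hash, and f(i) = 1 shows up as the d++ increment.
     */
    static void
    bloom_set_bits(uint8_t *bitmap, uint32_t nbits, int nhashes,
                   uint32_t value_hash)
    {
        uint32_t    h = value_hash % nbits;          /* h1 */
        uint32_t    d = value_hash % (nbits - 1);    /* "h2", derived from h1 */
        int         i;

        for (i = 0; i < nhashes; i++)
        {
            bitmap[h / 8] |= (1 << (h % 8));         /* set bit h */

            h = (h + d) % nbits;                     /* g(x) = h1 + i*h2 + f(i) */
            d++;
        }
    }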

>> >+ * Tweak the ndistinct value based on the pagesPerRange value. First,
>> >
>> >Nitpick: "Tweak" to my mind means to adjust an existing value. The
>> >above is only true if ndistinct is negative, but we're really not
>> >tweaking, but using it as a scale factor. Otherwise it's not adjusted,
>> >only clamped.
>> >
>>
>> OK. Perhaps 'adjust' would be a better term?
>
>I felt like rewriting the whole thing, but your original gets the
>point across ok, really.
>
>"If the passed ndistinct value is positive, we can just use that, but
>we also clamp the value to prevent over-sizing the bloom filter
>unnecessarily. If it's negative, it represents a multiplier to apply
>to the maximum number of tuples in the range (assuming each page gets
>MaxHeapTuplesPerPage tuples, which is likely a significant
>over-estimate), similar to the corresponding value in table
>statistics."
>
>> >+ /* TODO include the sorted/unsorted values */
>> >
>>
>> This was implemented as part of the discussion about pageinspect, and
>> I wanted some confirmation if the approach is acceptable or not before
>> spending more time on it. Also, the values are really just hashes of the
>> column values, so I'm not quite sure it makes sense to include that.
>> What would you suggest?
>
>My gut feeling is the hashes are not useful for this purpose, but I
>don't feel strongly either way.
>

OK. I share this feeling.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
>I wrote:
>
>> Hmm, I came across that paper while doing background reading. Okay,
>> now I get that "% (filter->nbits - 1)" is the second hash function in
>> that scheme. But now I wonder if that second function should actually
>> act on the passed "value" (the original hash), so that they are
>> actually independent, as required. In the language of that paper, the
>> patch seems to have
>>
>> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>>
>> instead of
>>
>> g(x) = h1(x) + i*h2(x) + f(i)
>>
>> Concretely, I'm wondering if it should be:
>>
>>  big_h = DatumGetUint32(hash_uint32(value));
>>  h = big_h % filter->nbits;
>> -d = big_h % (filter->nbits - 1);
>> +d = value % (filter->nbits - 1);
>>
>> But I could be wrong.
>
>I'm wrong -- if we use different operands to the moduli, we throw away
>the assumption of co-primeness. But I'm still left wondering why we
>have to re-hash the hash for this to work. In any case, there should
>be some more documentation around the core algorithm, so that future
>readers are not left scratching their heads.
>

Hmm, good question. I think we don't really need to hash it twice. It
does not really achieve anything - it won't reduce the number of collisions
or anything like that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
> >I wrote:
> >
> >> Hmm, I came across that paper while doing background reading. Okay,
> >> now I get that "% (filter->nbits - 1)" is the second hash function in
> >> that scheme. But now I wonder if that second function should actually
> >> act on the passed "value" (the original hash), so that they are
> >> actually independent, as required. In the language of that paper, the
> >> patch seems to have
> >>
> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
> >>
> >> instead of
> >>
> >> g(x) = h1(x) + i*h2(x) + f(i)
> >>
> >> Concretely, I'm wondering if it should be:
> >>
> >>  big_h = DatumGetUint32(hash_uint32(value));
> >>  h = big_h % filter->nbits;
> >> -d = big_h % (filter->nbits - 1);
> >> +d = value % (filter->nbits - 1);
> >>
> >> But I could be wrong.
> >
> >I'm wrong -- if we use different operands to the moduli, we throw away
> >the assumption of co-primeness. But I'm still left wondering why we
> >have to re-hash the hash for this to work. In any case, there should
> >be some more documentation around the core algorithm, so that future
> >readers are not left scratching their heads.
> >
>
> Hmm, good question. I think we don't really need to hash it twice. It
> does not really achieve anything - it won't reduce the number of collisions
> or anything like that.

Yeah, looking back at the discussion you linked previously, I think
it's a holdover from when the uint32 was rehashed with k different
seeds. Anyway, after thinking about it some more, I still have doubts
about the mapping algorithm. There are two stages to a hash mapping --
hashing and modulus. I don't think a single hash function (whether
rehashed or not) can be turned into two independent functions via a
choice of second modulus. At least, that's not what the Kirsch &
Mitzenmacher paper is claiming. Since we're not actually applying two
independent hash functions on the scan key, we're kind of shooting in
the dark.

It turns out there is something called a one-hash bloom filter, and
the paper in [1] has a straightforward algorithm. Since we can
implement it exactly as stated in the paper, that gives me more
confidence in the real-world false positive rate. It goes like this:

Partition the filter bitmap into k partitions of similar but unequal
length, corresponding to consecutive prime numbers. Use the primes for
moduli of the uint32 value and map it to the bit of the corresponding
partition. For a simple example, let's use 7, 11, 13 for partitions in
a filter of size 31. The three bits are:

value % 7
7 + (value % 11)
7 + 11 + (value % 13)
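
In code, the mapping could be as simple as this (a sketch with the example
primes above hard-coded; the real thing would pick the primes based on the
requested filter size):

    #include <stdint.h>

    /* map one 32-bit hash to one bit in each prime-sized partition */
    static void
    one_hash_set_bits(uint8_t *bitmap, uint32_t value_hash)
    {
        static const uint32_t partitions[] = {7, 11, 13};   /* example primes */
        uint32_t    offset = 0;
        int         i;

        for (i = 0; i < 3; i++)
        {
            uint32_t    bit = offset + (value_hash % partitions[i]);

            bitmap[bit / 8] |= (1 << (bit % 8));
            offset += partitions[i];    /* next partition starts after this one */
        }
    }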

We could store a const array of the first 256 primes. The largest such
prime is 1613, so with k=7 we can support up to ~11k bits, which is
more than we'd like to store anyway. Then we store the array index of
the largest prime in the 8bits of padding we currently have in
BloomFilter struct.

One wrinkle is that the sum of k primes is not going to match m
exactly. If the sum is too small, we can trim a few bits off of the
filter bitmap. If the sum is too large, the last partition can spill
into the front of the first one. This shouldn't matter much in the
common case since we need to round m to the nearest byte anyway.

This should be pretty straightforward to turn into code and I can take
a stab at it. Thoughts?

[1] https://www.researchgate.net/publication/284283336_One-Hashing_Bloom_Filter

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Fri, Sep 18, 2020 at 05:06:49PM -0400, John Naylor wrote:
>On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
>> >I wrote:
>> >
>> >> Hmm, I came across that paper while doing background reading. Okay,
>> >> now I get that "% (filter->nbits - 1)" is the second hash function in
>> >> that scheme. But now I wonder if that second function should actually
>> >> act on the passed "value" (the original hash), so that they are
>> >> actually independent, as required. In the language of that paper, the
>> >> patch seems to have
>> >>
>> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>> >>
>> >> instead of
>> >>
>> >> g(x) = h1(x) + i*h2(x) + f(i)
>> >>
>> >> Concretely, I'm wondering if it should be:
>> >>
>> >>  big_h = DatumGetUint32(hash_uint32(value));
>> >>  h = big_h % filter->nbits;
>> >> -d = big_h % (filter->nbits - 1);
>> >> +d = value % (filter->nbits - 1);
>> >>
>> >> But I could be wrong.
>> >
>> >I'm wrong -- if we use different operands to the moduli, we throw away
>> >the assumption of co-primeness. But I'm still left wondering why we
>> >have to re-hash the hash for this to work. In any case, there should
>> >be some more documentation around the core algorithm, so that future
>> >readers are not left scratching their heads.
>> >
>>
>> Hmm, good question. I think we don't really need to hash it twice. It
>> does not really achieve anything - it won't reduce the number of collisions
>> or anything like that.
>
>Yeah, looking back at the discussion you linked previously, I think
>it's a holdover from when the uint32 was rehashed with k different
>seeds. Anyway, after thinking about it some more, I still have doubts
>about the mapping algorithm. There are two stages to a hash mapping --
>hashing and modulus. I don't think a single hash function (whether
>rehashed or not) can be turned into two independent functions via a
>choice of second modulus. At least, that's not what the Kirsch &
>Mitzenmacher paper is claiming. Since we're not actually applying two
>independent hash functions on the scan key, we're kind of shooting in
>the dark.
>

OK. I admit the modulo by nbits and (nbits - 1) is a bit suspicious, so
you may be right that this is not quite a correct construction.

The current scheme was meant to reduce the number of expensive hashing
calls (especially for low fpr values we may require quite a few of
those, easily 10 or more).

But maybe we could still use this scheme by actually computing

    h1 = hash_uint32_extended(value, seed1);
    h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.
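
In other words, something along these lines (a sketch; the seed values are
arbitrary and the function name is made up, but hash_uint32_extended is the
existing extended hashing call):

    /*
     * Derive two independent hashes from the same value, then use the usual
     * g_i(x) = h1 + i * h2 construction to pick the bits.
     */
    static void
    bloom_set_bits_two_seeds(uint8 *bitmap, uint64 nbits, int nhashes,
                             uint32 value)
    {
        uint64  h1 = DatumGetUInt64(hash_uint32_extended(value, 0x5eed0001));
        uint64  h2 = DatumGetUInt64(hash_uint32_extended(value, 0x5eed0002));
        int     i;

        for (i = 0; i < nhashes; i++)
        {
            uint64  bit = (h1 + (uint64) i * h2) % nbits;

            bitmap[bit / 8] |= (1 << (bit % 8));
        }
    }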

>It turns out there is something called a one-hash bloom filter, and
>the paper in [1] has a straightforward algorithm. Since we can
>implement it exactly as stated in the paper, that gives me more
>confidence in the real-world false positive rate. It goes like this:
>
>Partition the filter bitmap into k partitions of similar but unequal
>length, corresponding to consecutive prime numbers. Use the primes for
>moduli of the uint32 value and map it to the bit of the corresponding
>partition. For a simple example, let's use 7, 11, 13 for partitions in
>a filter of size 31. The three bits are:
>
>value % 7
>7 + (value % 11)
>7 + 11 + (value % 13)
>
>We could store a const array of the first 256 primes. The largest such
>prime is 1613, so with k=7 we can support up to ~11k bits, which is
>more than we'd like to store anyway. Then we store the array index of
>the largest prime in the 8bits of padding we currently have in
>BloomFilter struct.
>

Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits. But I guess we can simply make the table
of primes a bit larger, right?

FWIW I don't think we need to be that careful about the space to store
stuff in padding etc. If we can - great, but compared to the size of the
filter it's negligible and I'd prioritize simplicity over a byte or two.

>One wrinkle is that the sum of k primes is not going to match m
>exactly. If the sum is too small, we can trim a few bits off of the
>filter bitmap. If the sum is too large, the last partition can spill
>into the front of the first one. This shouldn't matter much in the
>common case since we need to round m to the nearest byte anyway.
>

AFAIK the paper simply says that as long as the sum of partitions is
close to the requested nbits, it's good enough. So I guess we could just
roll with that, no need to trim/wrap or something like that.

>This should be pretty straightforward to turn into code and I can take
>a stab at it. Thoughts?
>

Sure, go ahead. I'm happy someone is actually looking at those patches
and proposing alternative solutions, and this might easily be a better
hashing scheme.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

> But maybe we could still use this scheme by actually computing
>
>     h1 = hash_uint32_extended(value, seed1);
>     h2 = hash_uint32_extended(value, seed2);
>
> and then use this as the independent hash functions. I think that would
> meet the requirements of the paper.

Yeah, that would work algorithmically. It would be trivial to add to
the patch, too, of course. There'd be a higher up-front CPU cost. Also,
I'm a bit cautious of rehashing hashes, and whether the two values
above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).

> Why would 11k bits be more than we'd like to store? Assuming we could
> use the whole 8kB page for the bloom filter, that'd be about 64k bits.
> In practice there'd be a bit of overhead (page header ...) but it's
> still much more than 11k bits.

Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: At the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page. Here MT = max tuples
per 128 8k pages, or 37120, so default ndistinct is 3712.

n      k  m      p
MT/10  7  35580  0.01
MT/10  7  64000  0.0005
MT/10  12 64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x less false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.

Going the other direction, capping nbits to 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.

n       k  p
4500    9  0.001
6000    7  0.006
7500    6  0.017
15000   3  0.129  (probably useless by now)
MT      1  0.440
64000   1  0.63   (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document to check with
pageinspect. And that's not even considering multi-column indexes,
like you mentioned.

> But I guess we can simply make the table
> of primes a bit larger, right?

If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (When k = 1 we
could degenerate to one modulus on the whole filter). That would be a
table of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.

We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would
require the filter to store the actual primes used, so 20 more bytes
at max k = 10. We could hard-code space for that, or to keep from
hard-coding maximum k and thus lowest possible false positive rate,
we'd need more bookkeeping.
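
FWIW, a minimal sieve really is just a few lines, e.g. (standalone sketch;
callers would walk the resulting array to pick the k primes they want):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* sieve of Eratosthenes: mark which numbers up to 'limit' are prime */
    static bool *
    sieve_primes(int limit)
    {
        bool   *is_prime = malloc((limit + 1) * sizeof(bool));

        memset(is_prime, 1, (limit + 1) * sizeof(bool));
        is_prime[0] = is_prime[1] = false;

        for (int i = 2; i * i <= limit; i++)
        {
            if (!is_prime[i])
                continue;
            for (int j = i * i; j <= limit; j += i)
                is_prime[j] = false;
        }

        return is_prime;
    }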

So, with the partition approach, we'd likely have to set in stone
either max nbits, or min target false positive rate. The latter option
seems more principled, not only for the block size, but also since the
target fp rate is already fixed by the reloption, and as I've
demonstrated, we can often go above and beyond the reloption even
without changing k.

> >One wrinkle is that the sum of k primes is not going to match m
> >exactly. If the sum is too small, we can trim a few bits off of the
> >filter bitmap. If the sum is too large, the last partition can spill
> >into the front of the first one. This shouldn't matter much in the
> >common case since we need to round m to the nearest byte anyway.
> >
>
> AFAIK the paper simply says that as long as the sum of partitions is
> close to the requested nbits, it's good enough. So I guess we could just
> roll with that, no need to trim/wrap or something like that.

Hmm, I'm not sure I understand you. I can see not caring to trim
wasted bits, but we can't set/read off the end of the filter. If we
don't wrap, we could just skip reading/writing that bit. So a tiny
portion of access would be at k - 1. The paper is not clear on what to
do here, but they are explicit in minimizing the absolute value, which
could go on either side.

Also I found a bug:

+ add_local_real_reloption(relopts, "false_positive_rate",
+ "desired false-positive rate for the bloom filters",
+ BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
+ 0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));

When I set fp to 1.0, the reloption code is okay with that, but then
later the assertion gets triggered.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
>On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>
>> But maybe we could still use this scheme by actually computing
>>
>>     h1 = hash_uint32_extended(value, seed1);
>>     h2 = hash_uint32_extended(value, seed2);
>>
>> and then use this as the independent hash functions. I think that would
>> meet the requirements of the paper.
>
>Yeah, that would work algorithmically. It would be trivial to add to
>the patch, too of course. There'd be a higher up-front cpu cost. Also,
>I'm a bit cautious of rehashing hashes, and whether the two values
>above are independent enough. I'm not sure either of these points
>matters. My guess is the partition approach is more sound, but it has
>some minor organizational challenges (see below).
>

OK. I don't think rehashing hashes is an issue as long as the original
hash has sufficiently low collision rate (and while we know it's not
perfect we know it works well enough for hash indexes etc.). And I doubt
the cost of the extra hash of uint32 would be noticeable.

That being said the partitioning approach might be more sound and it's
definitely worth giving it a try.

>> Why would 11k bits be more than we'd like to store? Assuming we could
>> use the whole 8kB page for the bloom filter, that'd be about 64k bits.
>> In practice there'd be a bit of overhead (page header ...) but it's
>> still much more than 11k bits.
>
>Brain fade -- I guess I thought we were avoiding being toasted, but
>now I see that's not possible for BRIN storage. So, we'll want to
>guard against this:
>
>ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"
>
>While playing around with the numbers I had an epiphany: At the
>defaults, the filter already takes up ~4.3kB, over half the page.
>There is no room for another tuple, so if we're only indexing one
>column, we might as well take up the whole page.

Hmm, yeah. I may be wrong but IIRC indexes don't support external
storage but compression is still allowed. So even if those defaults are
a bit higher than needed that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.

>Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct
>is 3712.
>
>n      k  m      p
>MT/10  7  35580  0.01
>MT/10  7  64000  0.0005
>MT/10  12 64000  0.00025
>
>Assuming ndistinct isn't way off reality, we get 20x-40x lower false
>positive rate almost for free, and it'd be trivial to code! Keeping k
>at 7 would be more robust, since it's equivalent to starting with n =
>~6000, p = 0.006, which is still almost 2x less false positives than
>you asked for. It also means nearly doubling the number of sorted
>values before switching.
>
>Going the other direction, capping nbits to 64k bits when ndistinct
>gets too high, the false positive rate we can actually support starts
>to drop. Here, the user requested 0.001 fpr.
>
>n       k  p
>4500    9  0.001
>6000    7  0.006
>7500    6  0.017
>15000   3  0.129  (probably useless by now)
>MT      1  0.440
>64000   1  0.63 (possible with > 128 pages per range)
>
>I imagine smaller pages_per_range settings are going to be useful for
>skinny tables (note to self -- test). Maybe we could provide a way for
>the user to see that their combination of pages_per_range, false
>positive rate, and ndistinct is supportable, like
>brin_bloom_get_supported_fpr(). Or document to check with pageinspect.
>And that's not even considering multi-column indexes, like you
>mentioned.
>

I agree giving users visibility into this would be useful.

Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.

>> But I guess we can simply make the table of primes a bit larger,
>> right?
>
>If we want to support all the above cases without falling down
>entirely, it would have to go up to 32k to be safe (When k = 1 we could
>degenerate to one modulus on the whole filter). That would be a table
>of about 7kB, which we could binary search. [thinks for a
>moment]...Actually, if we wanted to be clever, maybe we could
>precalculate the primes needed for the 64k bit cases and stick them at
>the end of the array. The usual algorithm will find them. That way, we
>could keep the array around 2kB. However, for >8kB block size, we
>couldn't adjust the 64k number, which might be okay, but not really
>project policy.
>
>We could also generate the primes via a sieve instead, which is really
>fast (and more code). That would be more robust, but that would require
>the filter to store the actual primes used, so 20 more bytes at max k =
>10. We could hard-code space for that, or to keep from hard-coding
>maximum k and thus lowest possible false positive rate, we'd need more
>bookkeeping.
>

I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.

>So, with the partition approach, we'd likely have to set in stone
>either max nbits, or min target false positive rate. The latter option
>seems more principled, not only for the block size, but also since the
>target fp rate is already fixed by the reloption, and as I've
>demonstrated, we can often go above and beyond the reloption even
>without changing k.
>

That seems like a rather annoying limitation, TBH.

>> >One wrinkle is that the sum of k primes is not going to match m
>> >exactly. If the sum is too small, we can trim a few bits off of the
>> >filter bitmap. If the sum is too large, the last partition can spill
>> >into the front of the first one. This shouldn't matter much in the
>> >common case since we need to round m to the nearest byte anyway.
>> >
>>
>> AFAIK the paper simply says that as long as the sum of partitions is
>> close to the requested nbits, it's good enough. So I guess we could
>> just roll with that, no need to trim/wrap or something like that.
>
>Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>bits, but we can't set/read off the end of the filter. If we don't
>wrap, we could just skip reading/writing that bit. So a tiny portion of
>access would be at k - 1. The paper is not clear on what to do here,
>but they are explicit in minimizing the absolute value, which could go
>on either side.
>

What I meant is that the paper says this:

     Given a planned overall length mp for a Bloom filter, we usually
     cannot get k prime numbers to make their sum mf to be exactly mp. As
     long as the difference between mp and mf is small enough, it neither
     causes any trouble for the software implementation nor noticeably
     shifts the false positive ratio.

Which I think means we can pick mp, generate k primes with sum mf close
to mp, and just use that with mf bits.

>Also I found a bug:
>
>+ add_local_real_reloption(relopts, "false_positive_rate", + "desired
>false-positive rate for the bloom filters", +
>BLOOM_DEFAULT_FALSE_POSITIVE_RATE, + 0.001, 1.0, offsetof(BloomOptions,
>falsePositiveRate));
>
>When I set fp to 1.0, the reloption code is okay with that, but then
>later the assertion gets triggered.
>

Hmm, yeah. I wonder what to do about that, considering indexes with fp
1.0 are essentially useless.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

> >While playing around with the numbers I had an epiphany: At the
> >defaults, the filter already takes up ~4.3kB, over half the page.
> >There is no room for another tuple, so if we're only indexing one
> >column, we might as well take up the whole page.
>
> Hmm, yeah. I may be wrong but IIRC indexes don't support external
> storage but compression is still allowed. So even if those defaults are
> a bit higher than needed that should make the bloom filters a bit more
> compressible, and thus fit multiple BRIN tuples on a single page.

> Not sure about how much we want to rely on these optimizations, though,
> considering multi-column indexes kinda break this.

Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.

Hmm, how ugly would it be to change the default range size depending
on the opclass?

If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.

There is already this item from the brin readme:

* Different-size page ranges?
  In the current design, each "index entry" in a BRIN index covers the same
  number of pages.  There's no hard reason for this; it might make sense to
  allow the index to self-tune so that some index entries cover smaller page
  ranges, if this allows the summary values to be more compact.  This
would incur
  larger BRIN overhead for the index itself, but might allow better pruning of
  page ranges during scan.  In the limit of one index tuple per page, the index
  itself would occupy too much space, even though we would be able to skip
  reading the most heap pages, because the summary values are tight; in the
  opposite limit of a single tuple that summarizes the whole table, we wouldn't
  be able to prune anything even though the index is very small.  This can
  probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.

Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.

> >We could also generate the primes via a sieve instead, which is really
> >fast (and more code). That would be more robust, but that would require
> >the filter to store the actual primes used, so 20 more bytes at max k =
> >10. We could hard-code space for that, or to keep from hard-coding
> >maximum k and thus lowest possible false positive rate, we'd need more
> >bookkeeping.
> >
>
> I don't think the efficiency of this code matters too much - it's only
> used once when creating the index, so the simpler the better. Certainly
> for now, while testing the partitioning approach.

To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity so done this way.

> >So, with the partition approach, we'd likely have to set in stone
> >either max nbits, or min target false positive rate. The latter option
> >seems more principled, not only for the block size, but also since the
> >target fp rate is already fixed by the reloption, and as I've
> >demonstrated, we can often go above and beyond the reloption even
> >without changing k.
> >
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

n    k   m      p
928   7  8895   0.01
928  10  13343  0.001  (lowest p supported in patch set)
928  13  17790  0.0001
928  10  18280  0.0001 (still works with lower k, needs higher m)
928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
degrade p much)

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of
primes variable length, which could be done by putting the filter
offset calculation into a macro. But having two variable-length arrays
seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
> >bits, but we can't set/read off the end of the filter. If we don't
> >wrap, we could just skip reading/writing that bit. So a tiny portion of
> >access would be at k - 1. The paper is not clear on what to do here,
> >but they are explicit in minimizing the absolute value, which could go
> >on either side.
> >
>
> What I meant is that the paper says this:
>
>      Given a planned overall length mp for a Bloom filter, we usually
>      cannot get k prime numbers to make their sum mf to be exactly mp. As
>      long as the difference between mp and mf is small enough, it neither
>      causes any trouble for the software implementation nor noticeably
>      shifts the false positive ratio.
>
> Which I think means we can pick mp, generate k primes with sum mf close
> to mp, and just use that with mf bits.

Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
Yeah, we can bump it up as well for the other case. I've done it that
way.

> >+ add_local_real_reloption(relopts, "false_positive_rate", + "desired
> >false-positive rate for the bloom filters", +
> >BLOOM_DEFAULT_FALSE_POSITIVE_RATE, + 0.001, 1.0, offsetof(BloomOptions,
> >falsePositiveRate));
> >
> >When I set fp to 1.0, the reloption code is okay with that, but then
> >later the assertion gets triggered.
> >
>
> Hmm, yeah. I wonder what to do about that, considering indexes with fp
> 1.0 are essentially useless.

Not just useless -- they're degenerate. When p = 1.0, m = k = 0, so we
cannot accept this value from the user. Looking upthread, 0.1 was
suggested as a limit. That might be a good starting point.
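
Concretely, that would just mean tightening the bounds in the reloption
definition quoted above, e.g. (assuming we settle on 0.1; the exact cap is
still up for discussion):

    add_local_real_reloption(relopts, "false_positive_rate",
                             "desired false-positive rate for the bloom filters",
                             BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
                             0.001, 0.1,    /* was 1.0, which allows a degenerate filter */
                             offsetof(BloomOptions, falsePositiveRate));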

This is interesting work! Having gone this far, I'm going to put more
attention to the multi-minmax patch and actually start performance
testing.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:
>On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
>
>> >While playing around with the numbers I had an epiphany: At the
>> >defaults, the filter already takes up ~4.3kB, over half the page.
>> >There is no room for another tuple, so if we're only indexing one
>> >column, we might as well take up the whole page.
>>
>> Hmm, yeah. I may be wrong but IIRC indexes don't support external
>> storage but compression is still allowed. So even if those defaults are
>> a bit higher than needed that should make the bloom filters a bit more
>> compressible, and thus fit multiple BRIN tuples on a single page.
>
>> Not sure about how much we want to rely on these optimizations, though,
>> considering multi-column indexes kinda break this.
>
>Yeah. Okay, then it sounds like we should go in the other direction,
>as the block comment at the top of brin_bloom.c implies. Indexes with
>multiple bloom-indexed columns already don't fit in one 8kB page, so I
>think every documented example should have a much lower
>pages_per_range. Using 32 pages per range with max tuples gives n =
>928. With default p, that's about 1.1 kB per brin tuple, so one brin
>page can index 224 pages, much more than with the default 128.
>
>Hmm, how ugly would it be to change the default range size depending
>on the opclass?
>

Not sure. What would happen for multi-column BRIN indexes with different
opclasses?

>If indexes don't support external storage, that sounds like a pain to
>add. Also, with very small fpr, you can easily get into many megabytes
>of filter space, which kind of defeats the purpose of brin in the
>first place.
>

True.

>There is already this item from the brin readme:
>
>* Different-size page ranges?
>  In the current design, each "index entry" in a BRIN index covers the same
>  number of pages.  There's no hard reason for this; it might make sense to
>  allow the index to self-tune so that some index entries cover smaller page
>  ranges, if this allows the summary values to be more compact.  This
>would incur
>  larger BRIN overhead for the index itself, but might allow better pruning of
>  page ranges during scan.  In the limit of one index tuple per page, the index
>  itself would occupy too much space, even though we would be able to skip
>  reading the most heap pages, because the summary values are tight; in the
>  opposite limit of a single tuple that summarizes the whole table, we wouldn't
>  be able to prune anything even though the index is very small.  This can
>  probably be made to work by using the range map as an index in itself.
>
>This sounds like a lot of work, but would be robust.
>

Yeah. I think it's a fairly independent / orthogonal project.

>Anyway, given that this is a general problem and not specific to the
>prime partition algorithm, I'll leave that out of the attached patch,
>named as a .txt to avoid confusing the cfbot.
>
>> >We could also generate the primes via a sieve instead, which is really
>> >fast (and more code). That would be more robust, but that would require
>> >the filter to store the actual primes used, so 20 more bytes at max k =
>> >10. We could hard-code space for that, or to keep from hard-coding
>> >maximum k and thus lowest possible false positive rate, we'd need more
>> >bookkeeping.
>> >
>>
>> I don't think the efficiency of this code matters too much - it's only
>> used once when creating the index, so the simpler the better. Certainly
>> for now, while testing the partitioning approach.
>
>To check my understanding, isn't bloom_init() called for every tuple?
>Agreed on simplicity so done this way.
>

No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).

>> >So, with the partition approach, we'd likely have to set in stone
>> >either max nbits, or min target false positive rate. The latter option
>> >seems more principled, not only for the block size, but also since the
>> >target fp rate is already fixed by the reloption, and as I've
>> >demonstrated, we can often go above and beyond the reloption even
>> >without changing k.
>> >
>>
>> That seems like a rather annoying limitation, TBH.
>
>I don't think the latter is that bad. I've capped k at 10 for
>demonstration's sake.:
>
>(928 is from using 32 pages per range)
>
>n    k   m      p
>928   7  8895   0.01
>928  10  13343  0.001  (lowest p supported in patch set)
>928  13  17790  0.0001
>928  10  18280  0.0001 (still works with lower k, needs higher m)
>928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
>degrade p much)
>
>Also, k seems pretty robust against small changes as long as m isn't
>artificially constrained and as long as p is small.
>
>So I *think* it's okay to cap k at 10 or 12, and not bother with
>adjusting m, which worsens space issues. As I found before, lowering k
>raises target fpr, but seems more robust to overshooting ndistinct. In
>any case, we only need k * 2 bytes to store the partition lengths.
>
>The only way I see to avoid any limitation is to make the array of
>primes variable length, which could be done by putting the filter
>offset calculation into a macro. But having two variable-length arrays
>seems messy.
>

Hmmm. I wonder how these limitations would impact the conclusions from
the one-hashing paper? Or was this just for the sake of a demonstration?

I'd suggest we just do the simplest thing possible (be it a hard-coded
table of primes or a sieve) and then evaluate if we need to do something
more sophisticated.

>> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>> >bits, but we can't set/read off the end of the filter. If we don't
>> >wrap, we could just skip reading/writing that bit. So a tiny portion of
>> >access would be at k - 1. The paper is not clear on what to do here,
>> >but they are explicit in minimizing the absolute value, which could go
>> >on either side.
>> >
>>
>> What I meant is that the paper says this:
>>
>>      Given a planned overall length mp for a Bloom filter, we usually
>>      cannot get k prime numbers to make their sum mf to be exactly mp. As
>>      long as the difference between mp and mf is small enough, it neither
>>      causes any trouble for the software implementation nor noticeably
>>      shifts the false positive ratio.
>>
>> Which I think means we can pick mp, generate k primes with sum mf close
>> to mp, and just use that with mf bits.
>
>Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
>Yeah, we can bump it up as well for the other case. I've done it that
>way.
>

OK

>> >+ add_local_real_reloption(relopts, "false_positive_rate", + "desired
>> >false-positive rate for the bloom filters", +
>> >BLOOM_DEFAULT_FALSE_POSITIVE_RATE, + 0.001, 1.0, offsetof(BloomOptions,
>> >falsePositiveRate));
>> >
>> >When I set fp to 1.0, the reloption code is okay with that, but then
>> >later the assertion gets triggered.
>> >
>>
>> Hmm, yeah. I wonder what to do about that, considering indexes with fp
>> 1.0 are essentially useless.
>
>Not just useless -- they're degenerate. When p = 1.0, m = k = 0 -- We
>cannot accept this value from the user. Looking up thread, 0.1 was
>suggested as a limit. That might be a good starting point.
>

Makes sense, I'll fix it that way.

>This is interesting work! Having gone this far, I'm going to put more
>attention to the multi-minmax patch and actually start performance
>testing.
>

Cool, thanks! I'll take a look at your one-hashing patch tomorrow.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Thu, Sep 24, 2020 at 7:50 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:

> >Hmm, how ugly would it be to change the default range size depending
> >on the opclass?
> >
>
> Not sure. What would happen for multi-column BRIN indexes with different
> opclasses?

Sounds like a can of worms. In any case I suspect that if there is no more
graceful way to handle too-large filters than to ERROR out the first time
someone tries to write to the index, this feature might meet some resistance.
Not sure what to suggest, though.

> >> I don't think the efficiency of this code matters too much - it's only
> >> used once when creating the index, so the simpler the better. Certainly
> >> for now, while testing the partitioning approach.
> >
> >To check my understanding, isn't bloom_init() called for every tuple?
> >Agreed on simplicity so done this way.
> >
>
> No, it's only called for the first non-NULL value in the page range
> (unless I made a boo boo when writing that code).

Ok, then I basically understood -- by tuple I meant BRIN tuple, pardon
my ambiguity. After thinking about it, I agree that CPU cost is
probably trivial (and if not, something is seriously wrong).

> >n    k   m      p
> >928   7  8895   0.01
> >928  10  13343  0.001  (lowest p supported in patch set)
> >928  13  17790  0.0001
> >928  10  18280  0.0001 (still works with lower k, needs higher m)
> >928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
> >degrade p much)
> >
> >Also, k seems pretty robust against small changes as long as m isn't
> >artificially constrained and as long as p is small.
> >
> >So I *think* it's okay to cap k at 10 or 12, and not bother with
> >adjusting m, which worsens space issues. As I found before, lowering k
> >raises target fpr, but seems more robust to overshooting ndistinct. In
> >any case, we only need k * 2 bytes to store the partition lengths.
> >
> >The only way I see to avoid any limitation is to make the array of
> >primes variable length, which could be done by putting the filter
> >offset calculation into a macro. But having two variable-length arrays
> >seems messy.
> >
>
> Hmmm. I wonder how would these limitations impact the conclusions from
> the one-hashing paper? Or was this just for the sake of a demonstration?

Using "10" in the patch is a demonstration, which completely supports
the current fpr allowed by the reloption, and showing what happens if
fpr is allowed to go lower. But for your question, I *think* this
consideration is independent from the conclusions. The n, k, m values
give a theoretical false positive rate, assuming a completely perfect
hashing scheme. The numbers I'm playing with show the consequences for
that theoretical fpr. The point of the paper (and others like it) is how to
get the real fpr as close as possible to the fpr predicted by the
theory. My understanding anyway.
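
(The theoretical rate in question is just the usual approximation, i.e.
roughly this in code:)

    #include <math.h>

    /* standard bloom filter approximation: p = (1 - e^(-k*n/m))^k */
    static double
    bloom_theoretical_fpr(double n, double m, int k)
    {
        return pow(1.0 - exp(-(double) k * n / m), k);
    }

    /* e.g. bloom_theoretical_fpr(928, 8895, 7) comes out at roughly 0.01 */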

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Mon, Sep 28, 2020 at 04:42:39PM -0400, John Naylor wrote:
>On Thu, Sep 24, 2020 at 7:50 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:
>
>> >Hmm, how ugly would it be to change the default range size depending
>> >on the opclass?
>> >
>>
>> Not sure. What would happen for multi-column BRIN indexes with different
>> opclasses?
>
>Sounds like a can of worms. In any case I suspect if there is no more
>graceful way to handle too-large filters than ERROR out the first time
>trying to write to the index, this feature might meet some resistance.
>Not sure what to suggest, though.
>

Is it actually all that different from the existing BRIN indexes?
Consider this example:

create table x (a text, b text, c text);

create index on x using brin (a,b,c);

create or replace function random_str(p_len int) returns text as $$
select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from generate_series(1,$1)) foo;
$$ language sql;

test=# insert into x select random_str(1000), random_str(1000), random_str(1000);
ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"


I'm a bit puzzled, though, because both of these things seem to work:

1) insert before creating the index

create table x (a text, b text, c text);
insert into x select random_str(1000), random_str(1000), random_str(1000);
create index on x using brin (a,b,c);
-- and there actually is a non-empty summary with real data
select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 'x_a_b_c_idx'::regclass);


2) insert "small" row before inserting the over-sized one

create table x (a text, b text, c text);
insert into x select random_str(10), random_str(10), random_str(10);
insert into x select random_str(1000), random_str(1000), random_str(1000);
create index on x using brin (a,b,c);
-- and there actually is a non-empty summary with the "big" values
select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 'x_a_b_c_idx'::regclass);


I find this somewhat strange - how come we don't fail here too?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

> Is it actually all that different from the existing BRIN indexes?
> Consider this example:
>
> create table x (a text, b text, c text);
>
> create index on x using brin (a,b,c);
>
> create or replace function random_str(p_len int) returns text as $$
> select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from generate_series(1,$1)) foo;
> $$ language sql;
>
> test=# insert into x select random_str(1000), random_str(1000), random_str(1000);
> ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"

Hmm, okay. As for which comes first, insert or index creation, I'm
baffled, too. I also would have expected the example above to take up a
bit over 6000 bytes, not 9000.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Wed, Sep 30, 2020 at 07:57:19AM -0400, John Naylor wrote:
>On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>
>> Is it actually all that different from the existing BRIN indexes?
>> Consider this example:
>>
>> create table x (a text, b text, c text);
>>
>> create index on x using brin (a,b,c);
>>
>> create or replace function random_str(p_len int) returns text as $$
>> select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from generate_series(1,$1)) foo;
>> $$ language sql;
>>
>> test=# insert into x select random_str(1000), random_str(1000), random_str(1000);
>> ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"
>
>Hmm, okay. As for which comes first, insert or index creation, I'm
>baffled, too. I also would expect the example above would take up a
>bit over 6000 bytes, but not 9000.
>

OK, so this seems like a data corruption bug in BRIN, actually.

The ~9000 bytes is actually about right, because the strings are in
UTF-8, so roughly 1.5 bytes per character. And we have 6
values to store (3 columns, min/max for each), so 6 * 1500 = 9000.

The real question is how come INSERT + CREATE INDEX actually manages to
create an index tuple. And the answer is pretty simple - brin_form_tuple
kinda ignores toasting, happily building index tuples where some values
are toasted.

Consider this:

     create table x (a text, b text, c text);
     insert into x select random_str(1000), random_str(1000), random_str(1000);

     create index on x using brin (a,b,c);
     delete from x;
     vacuum x;

     set enable_seqscan=off;

     insert into x select random_str(10), random_str(10), random_str(10);
     ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

     explain analyze select * from x where a = 'xxx';
     ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

     select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 'x_a_b_c_idx'::regclass);
     ERROR:  missing chunk number 0 for toast value 16547 in pg_toast_16541


Interestingly enough, running the select before the insert seems to be
working - not sure why.

Anyway, it behaves like this since 9.5 :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: WIP: BRIN multi-range indexes

From
Alvaro Herrera
Date:
On 2020-Oct-01, Tomas Vondra wrote:

> OK, so this seems like a data corruption bug in BRIN, actually.

Oh crap.  You're right -- the data needs to be detoasted before being
put in the index.

I'll have a look at how this can be fixed.



Re: WIP: BRIN multi-range indexes

From
Anastasia Lubennikova
Date:
Status update for a commitfest entry.

According to cfbot the patch no longer compiles.
Tomas, can you send an update, please?

I also see that a few last messages mention a data corruption bug. Sounds pretty serious.
Alvaro, have you had a chance to look at it? I don't see anything committed yet, nor any active discussion in other
threads.

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On Mon, Nov 02, 2020 at 06:05:27PM +0000, Anastasia Lubennikova wrote:
>Status update for a commitfest entry.
>
>According to cfbot the patch no longer compiles.  Tomas, can you send
>an update, please?
>

Yep, here's an updated patch series. It got broken by f90149e6285aa
which disallowed OID macros in pg_type, but fixing it was simple.

I've also included the patch adopting the one-hash bloom, as implemented
by John Naylor. I didn't have time to do any testing / evaluation yet,
so I've kept it as a separate part - ultimately we should either merge
it into the other bloom patch or discard it.

>I also see that a few last messages mention a data corruption bug.
>Sounds pretty serious.  Alvaro, have you had a chance to look at it? I
>don't see anything committed yet, nor any active discussion in other
>threads.

Yeah, I'm not aware of any fix addressing this - my understanding was
Alvaro plans to handle that, but maybe I misinterpreted his response.
Anyway, I think the fix is simple - we need to de-TOAST the data while
adding it to the index, and we need to consider what to do with
existing possibly-broken indexes.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Here's a rebased version of the patch series, to keep the cfbot happy.
I've also restricted the false positive rate to [0.001, 0.1] instead of
the original [0.0, 1.0], per discussion on this thread.


I've done a bunch of experiments, comparing the "regular" bloom indexes
with the one-hashing scheme proposed by John Naylor. I've been wondering
if there's some measurable difference, especially in:

* efficiency (query duration)

* false positive rate depending on the fill factor

So I ran a bunch of tests on synthetic data sets, varying parameters
affecting the BRIN bloom indexes:

1) different pages_per_range

2) number of distinct values per range

3) fill factor of the bloom filter (66%, 100%, 200%)

Attached is a script I used to test this, and a simple spreadsheet
summarizing the results, comparing the results for each combination of
parameters. For each combination it shows average query duration (over
10 runs) and scan fraction (what fraction of table was scanned).

Overall, I think there's very little difference, particularly in the
"match" cases when we're searching for a value that we know is in the
table. The one-hash variant seems to perform a bit better, but the
difference is fairly small.

In the "mismatch" cases (searching for value that is not in the table)
the differences are more significant, but it might be noise. It does
seem much more "green" than "red", i.e. the one-hash variant seems to be
faster (although this does depend on the values for formatting).

To sum this up, I think the one-hash approach seems interesting. It's
not going to give us huge speedups because we're only hashing int32
values anyway (not the source data), but it's worth exploring.


I've started looking at the one-hash code changes, and I've discovered a
couple of issues. I've been wondering how expensive the naive prime sieve
is - it's not an extremely hot code path, as we're only running it for each
page range. But still. So my plan was to create the largest bloom filter
possible, and see how much time generate_primes() takes.

So I initialized a cluster with 32kB blocks and tried to do this:

  create index on t
  using brin (a int4_bloom_ops(n_distinct_per_range=120000,
                               false_positive_rate=0.1));

which ends up using nbits=575104 (which is 2x the page size, but let's
ignore that) and nhashes=3. That however crashes and burns, because:

a) set_bloom_partitions does this:

    while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
       pidx++;

which is broken, because the second part of the condition only checks
the current index - we may end up using nhashes primes after that, and
some of them may be 0. So this needs to be:

    while (primes[pidx + nhashes - 1] <= target &&
           primes[pidx + nhashes] > 0)
       pidx++;

(We know there's always at least one 0 at the end, so it's OK not to
check the length explicitly.)

b) set_bloom_partitions does this to generate primes:

    /*
     * Increase the limit to ensure we have some primes higher than
     * the target partition length. The 100 value is arbitrary, but
     * should be well over what we need.
     */
    primes = generate_primes(target_partlen + 100);

It's not clear to me why 100 is sufficient, particularly for large page
sizes. AFAIK the primes get more and more sparse, so how does this
guarantee we'll get enough "sufficiently large" primes?


c) generate_primes uses uint16 to store the primes, so it can only
generate primes up to 32768. That's (probably) enough for 8kB pages, but
for 32kB pages it's clearly insufficient.


I've fixed these issues in a separate WIP patch, with some simple
debug logging.


As for the original question of how expensive this naive sieve is, I
haven't been able to measure any significant timings. The logging around
generate_primes usually looks like this:

2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
575104 nhashes 3 target_partlen 191701
2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated

So it takes 0.000 seconds for this extreme page size. I don't think we
need to invent anything more elaborate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Seems I forgot to replace uint16 with uint32 in a couple of places when
fixing the one-hash code, so it was triggering SIGFPE because of
division by 0. Here's a fixed patch series.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> Overall, I think there's very little difference, particularly in the
> "match" cases when we're searching for a value that we know is in the
> table. The one-hash variant seems to perform a bit better, but the
> difference is fairly small.
>
> In the "mismatch" cases (searching for value that is not in the table)
> the differences are more significant, but it might be noise. It does
> seem much more "green" than "red", i.e. the one-hash variant seems to be
> faster (although this does depend on the values for formatting).
>
> To sum this up, I think the one-hash approach seems interesting. It's
> not going to give us huge speedups because we're only hashing int32
> values anyway (not the source data), but it's worth exploring.

Thanks for testing! It seems you tested against the version with two moduli, and not the alternative discussed in

https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development

which would in fact be rehashing the 32 bit values. I think that would be the way to go if we don't use the one-hashing approach.

> a) set_bloom_partitions does this:
>
>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>        pidx++;
>
> which is broken, because the second part of the condition only checks
> the current index - we may end up using nhashes primes after that, and
> some of them may be 0. So this needs to be:
>
>     while (primes[pidx + nhashes - 1] <= target &&
>            primes[pidx + nhashes] > 0)
>        pidx++;

Good catch.

> b) set_bloom_partitions does this to generate primes:
>
>     /*
>      * Increase the limit to ensure we have some primes higher than
>      * the target partition length. The 100 value is arbitrary, but
>      * should be well over what we need.
>      */
>     primes = generate_primes(target_partlen + 100);
>
> It's not clear to me why 100 is sufficient, particularly for large page
> sizes. AFAIK the primes get more and more sparse, so how does this
> guarantee we'll get enough "sufficiently large" primes?

This value is not rigorous and should be improved, but I started with that by looking at the table in section 3 in

https://primes.utm.edu/notes/gaps.html

I see two ways to make a stronger guarantee:

1. Take the average gap between primes near n, which is log(n), and multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target seems a better heuristic than a constant, and is still simple to calculate.

With the pathological example you gave of n=575104, k=3 (target_partlen = 191701), the number to add is log(191701) * 10 = 122.  By the table referenced above, the largest prime gap under 360653 is 95, so we're guaranteed to find at least one prime in the space of 122 above the target. That will likely be enough to find the closest-to-target filter size for k=3. Even if it weren't, nbits is so large that the relative difference is tiny. I'd say a heuristic like this is most likely to be off precisely when it matters the least. At this size, even if we find zero primes above our target, the relative filter size is close to 

(575104 - 3 * 95) / 575104 = 0.9995

For a more realistic bad-case target partition length, log(1327) * 10 = 72. There are 33 composites after 1327, the largest such gap below 9551. That would give five primes larger than the target
1361   1367   1373   1381   1399

which is more than enough for k<=10:

1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 + 1373 = 13276

2. Use a "segmented range" algorithm for the sieve and iterate until we get k*2 primes, half below and half above the target. This would be an absolute guarantee, but also more code, so I'm inclined against that.
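To make option (1) above concrete, here's a minimal sketch of how the sieve
limit could be derived from the target partition length. The function name and
the value of BLOOM_MAX_NUM_PARTITIONS are assumptions for illustration, not
taken from the patch:

    #include <math.h>

    /* assumed maximum number of partitions, for illustration only */
    #define BLOOM_MAX_NUM_PARTITIONS 10

    /*
     * Hypothetical helper: extend the sieve limit past the target by the
     * average prime gap near the target, log(target), once per partition.
     */
    static int
    prime_sieve_limit(int target_partlen)
    {
        return target_partlen +
            (int) ceil(log((double) target_partlen)) * BLOOM_MAX_NUM_PARTITIONS;
    }

For the pathological example (target_partlen = 191701) this adds 130, in the
same ballpark as the 122 computed above.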

> c) generate_primes uses uint16 to store the primes, so it can only
> generate primes up to 32768. That's (probably) enough for 8kB pages, but
> for 32kB pages it's clearly insufficient.

Okay.

> As for the original question how expensive this naive sieve is, I
> haven't been able to measure any significant timings. The logging aroung
> generate_primes usually looks like this:
>
> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
> 575104 nhashes 3 target_partlen 191701
> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>
> So it takes 0.000 second for this extreme page size. I don't think we
> need to invent anything more elaborate.

Okay, good to know. If we were concerned about memory, we could have it check only odd numbers. That's a common feature of sieves, but also makes the code a bit harder to understand if you haven't seen it before.
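Just to illustrate what that would look like, here's a minimal odd-only sieve
sketch - independent example code, not the patch's generate_primes(). It
assumes limit >= 3 and ignores the prime 2, which would have to be handled
separately:

    #include <stdlib.h>

    /*
     * Sketch of an odd-only Eratosthenes sieve: entry i of the work array
     * represents the odd number 2*i + 3, so the array is half the size of
     * a sieve over all integers. Returns the number of odd primes <= limit
     * and stores up to maxout of them in out[].
     */
    static int
    generate_odd_primes(int limit, int *out, int maxout)
    {
        int     ncandidates = (limit - 1) / 2;  /* odd numbers 3, 5, ... */
        char   *composite = calloc(ncandidates, 1);
        int     nprimes = 0;

        for (int i = 0; i < ncandidates; i++)
        {
            int     p = 2 * i + 3;

            if (composite[i])
                continue;

            if (nprimes < maxout)
                out[nprimes] = p;
            nprimes++;

            /* mark odd multiples of p, starting at p * p */
            for (long j = ((long) p * p - 3) / 2; j < ncandidates; j += p)
                composite[j] = 1;
        }

        free(composite);
        return nprimes;
    }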

Also to fill in something I left for later, the reference for this

/* upper bound of number of primes below limit */
/* WIP: reference for this number */
int numprimes = 1.26 * limit / log(limit);

is

Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for some functions of prime numbers". Illinois J. Math. 6: 64–94. doi:10.1215/ijm/1255631807

More precisely, it's 30*log(113)/113 rounded up.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 11/9/20 3:29 PM, John Naylor wrote:
> On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>> Overall, I think there's very little difference, particularly in the
>> "match" cases when we're searching for a value that we know is in the
>> table. The one-hash variant seems to perform a bit better, but the
>> difference is fairly small.
>>
>> In the "mismatch" cases (searching for value that is not in the table)
>> the differences are more significant, but it might be noise. It does
>> seem much more "green" than "red", i.e. the one-hash variant seems to be
>> faster (although this does depend on the values for formatting).
>>
>> To sum this up, I think the one-hash approach seems interesting. It's
>> not going to give us huge speedups because we're only hashing int32
>> values anyway (not the source data), but it's worth exploring.
> 
> Thanks for testing! It seems you tested against the version with two
> moduli, and not the alternative discussed in
> 
> https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development
> <https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development>
> 
> which would in fact be rehashing the 32 bit values. I think that would
> be the way to go if we don't use the one-hashing approach.
> 

Yeah. I forgot about this detail, and I may try again with the two-hash
variant, but I wonder how much difference it would make, considering the
results match expectations (that is, the "scan fraction" results
for fill_factor=100 match the target fpr almost perfectly).

I think there's a possibly more important omission in the testing - I
forgot about the "sort mode" used initially, when the filter keeps the
actual hash values and only switches to hashing later. I wonder if that
plays a role in some of the cases.

I'll investigate this a bit in the next round of tests.


>> a) set_bloom_partitions does this:
>>
>>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>>        pidx++;
>>
>> which is broken, because the second part of the condition only checks
>> the current index - we may end up using nhashes primes after that, and
>> some of them may be 0. So this needs to be:
>>
>>     while (primes[pidx + nhashes - 1] <= target &&
>>            primes[pidx + nhashes] > 0)
>>        pidx++;
> 
> Good catch.
> 
>> b) set_bloom_partitions does this to generate primes:
>>
>>     /*
>>      * Increase the limit to ensure we have some primes higher than
>>      * the target partition length. The 100 value is arbitrary, but
>>      * should be well over what we need.
>>      */
>>     primes = generate_primes(target_partlen + 100);
>>
>> It's not clear to me why 100 is sufficient, particularly for large page
>> sizes. AFAIK the primes get more and more sparse, so how does this
>> guarantee we'll get enough "sufficiently large" primes?
> 
> This value is not rigorous and should be improved, but I started with
> that by looking at the table in section 3 in
> 
> https://primes.utm.edu/notes/gaps.html
> <https://primes.utm.edu/notes/gaps.html>
> 
> I see two ways to make a stronger guarantee:
> 
> 1. Take the average gap between primes near n, which is log(n), and
> multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target
> seems a better heuristic than a constant, and is still simple to calculate.
> 
> With the pathological example you gave of n=575104, k=3 (target_partlen
> = 191701), the number to add is log(191701) * 10 = 122.  By the table
> referenced above, the largest prime gap under 360653 is 95, so we're
> guaranteed to find at least one prime in the space of 122 above the
> target. That will likely be enough to find the closest-to-target filter
> size for k=3. Even if it weren't, nbits is so large that the relative
> difference is tiny. I'd say a heuristic like this is most likely to be
> off precisely when it matters the least. At this size, even if we find
> zero primes above our target, the relative filter size is close to 
> 
> (575104 - 3 * 95) / 575104 = 0.9995
> 
> For a more realistic bad-case target partition length, log(1327) * 10 =
> 72. There are 33 composites after 1327, the largest such gap below 9551.
> That would give five primes larger than the target
> 1361   1367   1373   1381   1399
> 
> which is more than enough for k<=10:
> 
> 1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 +
> 1373 = 13276
> 
> 2. Use a "segmented range" algorithm for the sieve and iterate until we
> get k*2 primes, half below and half above the target. This would be an
> absolute guarantee, but also more code, so I'm inclined against that.
> 

Thanks, that makes sense.

While investigating the failures, I've tried increasing the values a
lot, without observing any measurable increase in runtime. IIRC I've
even used (10 * target_partlen) or something like that. That tells me
it's not a very sensitive part of the code, so I'd suggest simply using
something that we know is large enough to be safe.

For example, the largest bloom filter we can have is 32kB, i.e. ~262k
bits, at which point the largest gap is less than 95 (per the gap table).
And we may use up to BLOOM_MAX_NUM_PARTITIONS, so let's just use

    BLOOM_MAX_NUM_PARTITIONS * 100

on the basis that we may need BLOOM_MAX_NUM_PARTITIONS partitions
before/after the target. We could consider the actual target being lower
(essentially 1/npartitions of the nbits), which decreases the maximum gap,
but I don't think that's worth the extra complexity here.


FWIW I wonder if we should do something about bloom filters that we know
can get larger than the page size. In the example I used, we know that
nbits=575104 is larger than a page, so as the filter gets more full (and
thus more random and less compressible) it can't possibly fit. Maybe we
should reject that right away, instead of "delaying it" until later, on
the basis that it's easier to fix at CREATE INDEX time (compared to when
inserts/updates start failing at a random time).

The problem with this is of course that if the index is multi-column,
this may not be strict enough (i.e. each filter would fit independently,
but the whole index row is too large). But it's probably better to do at
least something, and maybe improve that later with some whole-row check.


>> c) generate_primes uses uint16 to store the primes, so it can only
>> generate primes up to 32768. That's (probably) enough for 8kB pages, but
>> for 32kB pages it's clearly insufficient.
> 
> Okay.
> 
>> As for the original question how expensive this naive sieve is, I
>> haven't been able to measure any significant timings. The logging aroung
>> generate_primes usually looks like this:
>>
>> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
>> 575104 nhashes 3 target_partlen 191701
>> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>>
>> So it takes 0.000 second for this extreme page size. I don't think we
>> need to invent anything more elaborate.
> 
> Okay, good to know. If we were concerned about memory, we could have it
> check only odd numbers. That's a common feature of sieves, but also
> makes the code a bit harder to understand if you haven't seen it before.
> 

IMO if we were concerned about memory we'd use Bitmapset instead of an
array of bools. That's 1:8 compression, not just 1:2.

> Also to fill in something I left for later, the reference for this
> 
> /* upper bound of number of primes below limit */
> /* WIP: reference for this number */
> int numprimes = 1.26 * limit / log(limit);
> 
> is
> 
> Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for
> some functions of prime numbers". Illinois J. Math. 6: 64–94.
> doi:10.1215/ijm/1255631807
> 
> More precisely, it's 30*log(113)/113 rounded up.
> 

Thanks, I was wondering where that came from.


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Mon, Nov 9, 2020 at 1:39 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
> While investigating the failures, I've tried increasing the values a
> lot, without observing any measurable increase in runtime. IIRC I've
> even used (10 * target_partlen) or something like that. That tells me
> it's not very sensitive part of the code, so I'd suggest to simply use
> something that we know is large enough to be safe.

Okay, then it's not worth being clever.

> For example, the largest bloom filter we can have is 32kB, i.e. 262kb,
> at which point the largest gap is less than 95 (per the gap table). And
> we may use up to BLOOM_MAX_NUM_PARTITIONS, so let's just use
>     BLOOM_MAX_NUM_PARTITIONS * 100

Sure.

> FWIW I wonder if we should do something about bloom filters that we know
> can get larger than page size. In the example I used, we know that
> nbits=575104 is larger than page, so as the filter gets more full (and
> thus more random and less compressible) it won't possibly fit. Maybe we
> should reject that right away, instead of "delaying it" until later, on
> the basis that it's easier to fix at CREATE INDEX time (compared to when
> inserts/updates start failing at a random time).

Yeah, I'd be inclined to reject that right away.

> The problem with this is of course that if the index is multi-column,
> this may not be strict enough (i.e. each filter would fit independently,
> but the whole index row is too large). But it's probably better to do at
> least something, and maybe improve that later with some whole-row check.

A whole-row check would be nice, but I don't know how hard that would be.

As a Devil's advocate proposal, how awful would it be to not allow multicolumn brin-bloom indexes?

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Attached is an updated version of the patch series, rebased on current 
master, and results for benchmark comparing the various bloom variants.

The improvements are fairly minor:

1) Rejecting bloom filters that are clearly too large (larger than page) 
early. This is imperfect, as it works for individual index keys, not the 
whole row. But per discussion it seems useful.

2) I've added a sort_mode opclass parameter, which allows disabling the sorted 
mode that bloom indexes start in by default. I'm not convinced we should 
commit this; I only needed it for the benchmarking.


The benchmarking compares the three parts with different Bloom variants:

0004 - single hash with mod by (nbits) and (nbits-1)
0005 - two independent hashes (two random seeds)
0006 - partitioned approach, proposed by John Naylor

I'm attaching the shell script used to run the benchmark, and a summary 
of the results. The 0004 is used as a baseline, and the comparisons show 
speedups for 0005 and 0006 relative to that (if you scroll to the 
right). Essentially, green means "faster than 0004" while red means slower.

I don't think any of those approaches comes out as clearly superior. The 
results for most queries are within 2%, which is mostly just noise. 
There are cases where the differences are more significant (~10%), but 
they go in either direction, and if you compare the duration of the whole 
benchmark (by summing per-query averages) it's within 1% again.

For the "mismatch" case (i.e. looking for values not contained in the 
table) the differences are larger, but that's mostly due to luck and 
hitting false positives for that particular query - on average the 
differences are negligible, just like for the "match" case.

So based on this I'm tempted to just use the version with two hashes, as 
implemented in 0005. It's much simpler than the partitioning scheme, and 
does not need any of the logic to generate primes, etc.
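For reference, the two-hash idea in 0005 boils down to something like the
following sketch. hash_bytes_uint32_extended() is the existing hash function
declared in common/hashfn.h; the seeds and the way the two hashes are combined
here are just made up for illustration - the actual patch may differ:

    #include "postgres.h"
    #include "common/hashfn.h"

    /*
     * Compute the nhashes bit positions to set/test for one 32-bit value,
     * deriving all of them from two hashes of the value with different seeds.
     */
    static void
    bloom_bit_positions(uint32 value, int nbits, int nhashes, uint32 *positions)
    {
        uint64      h1 = hash_bytes_uint32_extended(value, 0);
        uint64      h2 = hash_bytes_uint32_extended(value, 1);

        for (int i = 0; i < nhashes; i++)
            positions[i] = (uint32) ((h1 + (uint64) i * h2) % nbits);
    }

The point is simply that all probes come from two hash calls per value, with
no need for prime-sized partitions.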


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
Thomas Munro
Date:
On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Attached is an updated version of the patch series, rebased on current
> master, and results for benchmark comparing the various bloom variants.

Perhaps src/include/utils/inet.h needs to include <sys/socket.h>,
because FreeBSD says:

brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
if (ip_family(ipa) == PGSQL_AF_INET)
^
../../../../src/include/utils/inet.h:39:24: note: expanded from macro
'PGSQL_AF_INET'
#define PGSQL_AF_INET (AF_INET + 0)
^



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

On 1/2/21 7:42 AM, Thomas Munro wrote:
> On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Attached is an updated version of the patch series, rebased on current
>> master, and results for benchmark comparing the various bloom variants.
> 
> Perhaps src/include/utils/inet.h needs to include <sys/socket.h>,
> because FreeBSD says:
> 
> brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
> if (ip_family(ipa) == PGSQL_AF_INET)
> ^
> ../../../../src/include/utils/inet.h:39:24: note: expanded from macro
> 'PGSQL_AF_INET'
> #define PGSQL_AF_INET (AF_INET + 0)
> ^

Not sure. The other files using PGSQL_AF_INET just include sys/socket.h
directly, so maybe this should just do the same thing ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> [12-20 version]

Hi Tomas,

The measurements look good. In case it fell through the cracks, my earlier review comments for Bloom BRIN indexes regarding minor details don't seem to have been addressed in this version. I'll point to earlier discussion for convenience:

https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com

https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com

> The improvements are fairly minor:
>
> 1) Rejecting bloom filters that are clearly too large (larger than page)
> early. This is imperfect, as it works for individual index keys, not the
> whole row. But per discussion it seems useful.

I think this is good enough.

> So based on this I'm tempted to just use the version with two hashes, as
> implemented in 0005. It's much simpler than the partitioning scheme,
> does not need any of the logic to generate primes etc.

Sounds like the best engineering decision.

Circling back to multi-minmax build times, I ran a couple quick tests on bigger hardware, and found that not only is multi-minmax slower than minmax, which is to be expected, but also slower than btree. (unlogged table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)

btree          38.3s
minmax         26.2s
multi-minmax  110s

Since btree indexes are much larger, I imagine something algorithmic is involved. Is it worth digging further to see if some code path is taking more time than we would expect?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 1/12/21 6:28 PM, John Naylor wrote:
> On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  > [12-20 version]
> 
> Hi Tomas,
> 
> The measurements look good. In case it fell through the cracks, my 
> earlier review comments for Bloom BRIN indexes regarding minor details 
> don't seem to have been addressed in this version. I'll point to earlier 
> discussion for convenience:
> 
> https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com 
> <https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com>
> 
> https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com 
> <https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com>
> 

Whooops :-( I'll go through those again, thanks for reminding me.

>  > The improvements are fairly minor:
>  >
>  > 1) Rejecting bloom filters that are clearly too large (larger than page)
>  > early. This is imperfect, as it works for individual index keys, not the
>  > whole row. But per discussion it seems useful.
> 
> I think this is good enough.
> 
>  > So based on this I'm tempted to just use the version with two hashes, as
>  > implemented in 0005. It's much simpler than the partitioning scheme,
>  > does not need any of the logic to generate primes etc.
> 
> Sounds like the best engineering decision.
> 
> Circling back to multi-minmax build times, I ran a couple quick tests on 
> bigger hardware, and found that not only is multi-minmax slower than 
> minmax, which is to be expected, but also slower than btree. (unlogged 
> table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)
> 
> btree          38.3s
> minmax         26.2s
> multi-minmax  110s
> 
> Since btree indexes are much larger, I imagine something algorithmic is 
> involved. Is it worth digging further to see if some code path is taking 
> more time than we would expect?
> 

I suspect it's due to minmax having to decide which "ranges" to merge, 
which requires repeated sorting, etc. I certainly don't dare to claim 
the current algorithm is perfect. I wouldn't have expected such a big 
difference, though - so definitely worth investigating.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 1/12/21 6:28 PM, John Naylor wrote:
> On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  > [12-20 version]
> 
> Hi Tomas,
> 
> The measurements look good. In case it fell through the cracks, my 
> earlier review comments for Bloom BRIN indexes regarding minor details 
> don't seem to have been addressed in this version. I'll point to earlier 
> discussion for convenience:
> 
> https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com 
> <https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com>
> 
> https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com 
> <https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com>
> 

Attached is a patch addressing those issues - particularly those from 
the first link; the second one is mostly a discussion about how to do 
the hashing properly, etc. It also switches to the two-hash variant, as 
discussed earlier.

I've changed the range to allow false positive rates between 0.0001 and 0.25, 
instead of the original range (0.001 and 0.1). The default (0.01) remains 
the same. I was worried that the original range was too narrow, and 
would prevent even sensible combinations of parameter values. But now 
that we reject bloom filters that are obviously too large, it's less of 
an issue I think.

I'm not entirely convinced the sort_mode option should be committed. It 
was meant only to allow benchmarking the hash approaches. In fact, I'm 
thinking about removing the sorted mode entirely - if the bloom filter 
contains only a few distinct values:

a) it's going to be almost entirely 0 bits, so easy to compress

b) it does not eliminate collisions entirely (we store hashes, not the 
original values)



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Here is a slightly improved version of the patch series.

Firstly, I realized the PG_DETOAST_DATUM() in brin_bloom_summary_out is
actually needed - the value can't be toasted, but it might be stored
with just a 1B header. So we need to expand it to a 4B header, because
the struct has an int32 as its first field.

I've also removed the sort mode from bloom filters. I've thought about
this for a long time, and ultimately concluded that it's not worth the
extra complexity. It might work for ranges with very few distinct
values, but that also means the bloom filter will be mostly 0 and thus
easy to compress (and with very low false-positive rate). There probably
are cases where it might be a bit better/smaller, but I had a hard time
constructing such cases. So I ditched it for now. I've kept the "flags"
field, which is unused and reserved for the future, to allow such improvements.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I suspect it'd due to minmax having to decide which "ranges" to merge,
> which requires repeated sorting, etc. I certainly don't dare to claim
> the current algorithm is perfect. I wouldn't have expected such big
> difference, though - so definitely worth investigating.

It seems that monotonically increasing (or decreasing) values in a table are a worst case scenario for multi-minmax indexes, or basically, unique values within a range. I'm guessing it's because it requires many passes to fit all the values into a limited number of ranges. I tried using smaller pages_per_range numbers, 32 and 8, and that didn't help.

Now, with a different data distribution, using only 10 values that repeat over and over, the results are much more sympathetic to multi-minmax:

insert into iot (num, create_dt)
select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' seconds')::interval
from generate_series(1,5*365*24*60*60) x;

create index cd_single on iot using brin(create_dt);
27.2s

create index cd_multi on iot using brin(create_dt timestamptz_minmax_multi_ops);
30.4s

create index cd_bt on iot using btree(create_dt);
61.8s

Circling back to the monotonic case, I tried running a simple perf record on a backend creating a multi-minmax index on a timestamptz column and these were the highest non-kernel calls:
+   21.98%    21.91%  postgres         postgres            [.] FunctionCall2Coll
+    9.31%     9.29%  postgres         postgres            [.] compare_combine_ranges
+    8.60%     8.58%  postgres         postgres            [.] qsort_arg
+    5.68%     5.66%  postgres         postgres            [.] brin_minmax_multi_add_value
+    5.63%     5.60%  postgres         postgres            [.] timestamp_lt
+    4.73%     4.71%  postgres         postgres            [.] reduce_combine_ranges
+    3.80%     0.00%  postgres         [unknown]           [.] 0x0320016800040000
+    3.51%     3.50%  postgres         postgres            [.] timestamp_eq

There's no one place that's pathological enough to explain the 4x slowness over traditional BRIN and nearly 3x slowness over btree when using a large number of unique values per range, so making progress here would have to involve a more holistic approach.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 1/19/21 9:44 PM, John Naylor wrote:
> On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  > I suspect it'd due to minmax having to decide which "ranges" to merge,
>  > which requires repeated sorting, etc. I certainly don't dare to claim
>  > the current algorithm is perfect. I wouldn't have expected such big
>  > difference, though - so definitely worth investigating.
> 
> It seems that monotonically increasing (or decreasing) values in a table 
> are a worst case scenario for multi-minmax indexes, or basically, unique 
> values within a range. I'm guessing it's because it requires many passes 
> to fit all the values into a limited number of ranges. I tried using 
> smaller pages_per_range numbers, 32 and 8, and that didn't help.
> 
> Now, with a different data distribution, using only 10 values that 
> repeat over and over, the results are much more sympathetic to multi-minmax:
> 
> insert into iot (num, create_dt)
> select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' 
> seconds')::interval
> from generate_series(1,5*365*24*60*60) x;
> 
> create index cd_single on iot using brin(create_dt);
> 27.2s
> 
> create index cd_multi on iot using brin(create_dt 
> timestamptz_minmax_multi_ops);
> 30.4s
> 
> create index cd_bt on iot using btree(create_dt);
> 61.8s
> 
> Circling back to the monotonic case, I tried running a simple perf 
> record on a backend creating a multi-minmax index on a timestamptz 
> column and these were the highest non-kernel calls:
> +   21.98%    21.91%  postgres         postgres            [.] 
> FunctionCall2Coll
> +    9.31%     9.29%  postgres         postgres            [.] 
> compare_combine_ranges
> +    8.60%     8.58%  postgres         postgres            [.] qsort_arg
> +    5.68%     5.66%  postgres         postgres            [.] 
> brin_minmax_multi_add_value
> +    5.63%     5.60%  postgres         postgres            [.] timestamp_lt
> +    4.73%     4.71%  postgres         postgres            [.] 
> reduce_combine_ranges
> +    3.80%     0.00%  postgres         [unknown]           [.] 
> 0x0320016800040000
> +    3.51%     3.50%  postgres         postgres            [.] timestamp_eq
> 
> There's no one place that's pathological enough to explain the 4x 
> slowness over traditional BRIN and nearly 3x slowness over btree when 
> using a large number of unique values per range, so making progress here 
> would have to involve a more holistic approach.
> 

Yeah. It very much seems like the primary problem is in how we build 
the ranges incrementally - with monotonic sequences, we end up having to 
merge the ranges over and over again. I don't know what the structure 
of the table was, but I guess it was kinda narrow (very few 
columns), which exacerbates the problem further, because the number of 
rows per range will be way higher than in real-world tables.

I do think the solution to this might be to allow more values during 
batch index creation, and only "compress" to the requested number at the 
very end (when serializing to the on-disk format).

There are a couple of additional comments about possibly replacing the 
sequential scan with a binary search, which could help a bit too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 1/20/21 1:07 AM, Tomas Vondra wrote:
> On 1/19/21 9:44 PM, John Naylor wrote:
>> On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra 
>> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
>> wrote:
>>  > I suspect it'd due to minmax having to decide which "ranges" to merge,
>>  > which requires repeated sorting, etc. I certainly don't dare to claim
>>  > the current algorithm is perfect. I wouldn't have expected such big
>>  > difference, though - so definitely worth investigating.
>>
>> It seems that monotonically increasing (or decreasing) values in a 
>> table are a worst case scenario for multi-minmax indexes, or 
>> basically, unique values within a range. I'm guessing it's because it 
>> requires many passes to fit all the values into a limited number of 
>> ranges. I tried using smaller pages_per_range numbers, 32 and 8, and 
>> that didn't help.
>>
>> Now, with a different data distribution, using only 10 values that 
>> repeat over and over, the results are muchs more sympathetic to 
>> multi-minmax:
>>
>> insert into iot (num, create_dt)
>> select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' 
>> seconds')::interval
>> from generate_series(1,5*365*24*60*60) x;
>>
>> create index cd_single on iot using brin(create_dt);
>> 27.2s
>>
>> create index cd_multi on iot using brin(create_dt 
>> timestamptz_minmax_multi_ops);
>> 30.4s
>>
>> create index cd_bt on iot using btree(create_dt);
>> 61.8s
>>
>> Circling back to the monotonic case, I tried running a simple perf 
>> record on a backend creating a multi-minmax index on a timestamptz 
>> column and these were the highest non-kernel calls:
>> +   21.98%    21.91%  postgres         postgres            [.] 
>> FunctionCall2Coll
>> +    9.31%     9.29%  postgres         postgres            [.] 
>> compare_combine_ranges
>> +    8.60%     8.58%  postgres         postgres            [.] qsort_arg
>> +    5.68%     5.66%  postgres         postgres            [.] 
>> brin_minmax_multi_add_value
>> +    5.63%     5.60%  postgres         postgres            [.] 
>> timestamp_lt
>> +    4.73%     4.71%  postgres         postgres            [.] 
>> reduce_combine_ranges
>> +    3.80%     0.00%  postgres         [unknown]           [.] 
>> 0x0320016800040000
>> +    3.51%     3.50%  postgres         postgres            [.] 
>> timestamp_eq
>>
>> There's no one place that's pathological enough to explain the 4x 
>> slowness over traditional BRIN and nearly 3x slowness over btree when 
>> using a large number of unique values per range, so making progress 
>> here would have to involve a more holistic approach.
>>
> 
> Yeah. This very much seems like the primary problem is in how we build 
> the ranges incrementally - with monotonic sequences, we end up having to 
> merge the ranges over and over again. I don't know what was the 
> structure of the table, but I guess it was kinda narrow (very few 
> columns), which exacerbates the problem further, because the number of 
> rows per range will be way higher than in real-world.
> 
> I do think the solution to this might be to allow more values during 
> batch index creation, and only "compress" to the requested number at the 
> very end (when serializing to on-disk format).
> > There are a couple additional comments about possibly replacing
> sequential scan with a binary search, that could help a bit too.
> 

OK, I took a look at this, and I came up with two optimizations that 
improve things for the pathological cases. I've kept them as patches on 
top of the last patch version, to allow easier review of the changes.


0007 - This reworks how the ranges are reduced by merging the closest 
ranges. Instead of doing that iteratively in a fairly expensive loop, 
the new reduce_combine_ranges() uses a much simpler approach.

There are a couple more optimizations (skipping expensive code when not 
needed, etc.) which should help a bit too.
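To give an idea of the kind of simplification I mean, here's a rough sketch
of reducing a set of ranges by repeatedly merging the two adjacent ranges
with the smallest gap between them. This is purely illustrative, with made-up
names - it is not the actual reduce_combine_ranges() code:

    #include <string.h>

    typedef struct MergeRange
    {
        double  minval;
        double  maxval;
    } MergeRange;

    /*
     * Reduce a sorted, non-overlapping array of ranges to at most
     * max_ranges entries by repeatedly merging the adjacent pair with
     * the smallest gap. Returns the new number of ranges.
     */
    static int
    reduce_ranges(MergeRange *ranges, int nranges, int max_ranges)
    {
        while (nranges > max_ranges)
        {
            int     best = 0;
            double  best_gap = ranges[1].minval - ranges[0].maxval;

            /* find the adjacent pair with the smallest gap */
            for (int i = 1; i < nranges - 1; i++)
            {
                double  gap = ranges[i + 1].minval - ranges[i].maxval;

                if (gap < best_gap)
                {
                    best = i;
                    best_gap = gap;
                }
            }

            /* merge ranges[best] and ranges[best + 1], shift the rest down */
            ranges[best].maxval = ranges[best + 1].maxval;
            memmove(&ranges[best + 1], &ranges[best + 2],
                    sizeof(MergeRange) * (nranges - best - 2));
            nranges--;
        }

        return nranges;
    }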


0008 - This is a WIP version of the batch mode. Originally, when 
building an index we'd "fill" the small buffer, then combine some of the 
ranges to free ~25% of the space for new values, and we'd do this over and 
over. This involves some expensive steps (sorting, etc.) and for some 
pathological cases (like monotonic sequences) it performed particularly 
poorly. The new code simply collects all values in the range, and then 
does the expensive stuff only once.

Note: These parts are fairly new, with minimal testing so far.

When measured on a table with 10M rows, using a number of data sets with 
different patterns, the results look like this:

     dataset              btree  minmax  unpatched  patched    diff
     --------------------------------------------------------------
     monotonic-100-asc    3023     1002       1281     1722    1.34
     monotonic-100-desc   3245     1042       1363     1674    1.23
     monotonic-10000-asc  2597     1028       2469     2272    0.92
     monotonic-10000-desc 2591     1021       2157     2246    1.04
     monotonic-asc        1863      968       4884     1106    0.23
     monotonic-desc       2546     1017       3520     2337    0.66
     random-100           3648     1133       1594     1797    1.13
     random-10000         3507     1124       1651     2657    1.61

The btree and minmax are the current indexes. unpatched means minmax 
multi from the previous patch version, patched is with 0007 and 0008 
applied. The diff shows patched/unpatched. The benchmarking script is 
attached.

The pathological case (monotonic-asc) is now 4x faster, roughly equal to 
regular minmax index build. The opposite (monotonic-desc) is about 33% 
faster, roughly in line with btree.

There are a couple of cases where it's actually a bit slower - those are 
the cases with very few distinct values per range. I believe this 
happens because in the batch mode the code does not check whether the summary 
already contains the value, it just adds it to the buffer, and the last 
step ends up being more expensive than that check would have been.

I believe there's some "compromise" between those two extremes, i.e. we 
should use a buffer that is neither too small nor too large, but something in 
between, so that the reduction happens once in a while but not too often 
(as with the original aggressive approach).

FWIW, none of this is likely to be an issue in practice, because (a) 
tables usually don't have such strictly monotonic patterns, (b) people 
should stick to plain minmax for cases that do. And (c) regular tables 
tend to have much wider rows, so there are fewer values per range (so 
that other stuff is likely more expensive than building BRIN).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> [wip optimizations]

> The pathological case (monotonic-asc) is now 4x faster, roughly equal to
> regular minmax index build. The opposite (monotonic-desc) is about 33%
> faster, roughly in line with btree.

Those numbers look good. I get similar results, shown below. I've read 0007-8 briefly but not in depth.

> There are a couple cases where it's actually a bit slower - those are
> the cases with very few distinct values per range. I believe this
> happens because in the batch mode the code does not check if the summary
> already contains this value, adds it to the buffer and the last step
> ends up being more expensive than this.

I think if it's worst case a bit faster than btree and best case a bit slower than traditional minmax, that's acceptable.

> I believe there's some "compromise" between those two extremes, i.e. we
> should use buffer that is too small or too large, but something in
> between, so that the reduction happens once in a while but not too often
> (as with the original aggressive approach).

This sounds good also.

> FWIW, none of this is likely to be an issue in practice, because (a)
> tables usually don't have such strictly monotonic patterns, (b) people
> should stick to plain minmax for cases that do.

Still, it would be great if multi-minmax could be a drop-in replacement. I know there was a sticking point of a distance function not being available for all types, but I wonder if that can be remedied or worked around somehow.

> And (c) regular tables
> tend to have much wider rows, so there are fewer values per range (so
> that other stuff is likely more expensive than building BRIN).

True. I'm still puzzled that it didn't help to use 8 pages per range, but it's moot now.


Here are some numbers (median of 3) with a similar scenario as before, repeated here with some added details. I didn't bother with what you call "unpatched":

               btree   minmax   multi
monotonic-asc  44.4    26.5     27.8
mono-del-ins   38.7    24.6     30.4
mono-10-asc    61.8    25.6     33.5


create unlogged table iot (
    id bigint generated by default as identity primary key,
    num double precision not null,
    create_dt timestamptz not null,
    stuff text generated always as (md5(id::text)) stored
)
with (fillfactor = 95);

-- monotonic-asc:

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-01 0:00'::timestamptz +'5 years'::interval,
  '1 second'::interval) x;

-- mono-del-ins:
-- Here I deleted a few values from (likely) each page in the above table, and reinserted values that shouldn't be in existing ranges:

delete from iot
where num < 0.05
or num > 0.95;

vacuum iot;

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-02-01 23:59'::timestamptz,
  '1 second'::interval) x;

-- mono-10-asc

truncate table iot;

insert into iot (num, create_dt)
select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' seconds')::interval
from generate_series(1,5*365*24*60*60) x;

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 1/23/21 12:27 AM, John Naylor wrote:
> On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  > [wip optimizations]
> 
>  > The pathological case (monotonic-asc) is now 4x faster, roughly equal to
>  > regular minmax index build. The opposite (monotonic-desc) is about 33%
>  > faster, roughly in line with btree.
> 
> Those numbers look good. I get similar results, shown below. I've read 
> 0007-8 briefly but not in depth.
>  >  > There are a couple cases where it's actually a bit slower - those are
>  > the cases with very few distinct values per range. I believe this
>  > happens because in the batch mode the code does not check if the summary
>  > already contains this value, adds it to the buffer and the last step
>  > ends up being more expensive than this.
> 
> I think if it's worst case a bit faster than btree and best case a bit 
> slower than traditional minmax, that's acceptable.
> 
>  > I believe there's some "compromise" between those two extremes, i.e. we
>  > should use buffer that is too small or too large, but something in
>  > between, so that the reduction happens once in a while but not too often
>  > (as with the original aggressive approach).
> 
> This sounds good also.
> 

Yeah, I agree.

I think the reason why some of the cases got a bit slower is that in 
those cases, with the original approach (ranges being built fairly frequently, 
not just once at the end), we quickly built something that represented 
the whole range, so adding a new value was often a no-op. The add_value 
callback found that the range already "includes" the new value, etc.

With the batch mode, that's no longer true - we accumulate everything, 
so we have to sort it etc. Which I guess may be fairly expensive, thanks 
to calling comparator functions etc. I wonder if this could be optimized 
a bit, e.g. by first "deduplicating" the values using memcmp() or so.
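Something along these lines, say - sort the accumulated values by their raw
bytes (any consistent order groups duplicates together) and drop the exact
duplicates before invoking the type's comparator. Hypothetical code for a
simple fixed-length type, not part of the patch:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* byte-wise order - not the logical order, but duplicates end up adjacent */
    static int
    cmp_raw_bytes(const void *a, const void *b)
    {
        return memcmp(a, b, sizeof(int64_t));
    }

    /* sort and drop exact duplicates in place, return the new count */
    static int
    dedup_values(int64_t *values, int nvalues)
    {
        int     nkept = 0;

        qsort(values, nvalues, sizeof(int64_t), cmp_raw_bytes);

        for (int i = 0; i < nvalues; i++)
        {
            if (i == 0 ||
                memcmp(&values[i], &values[nkept - 1], sizeof(int64_t)) != 0)
                values[nkept++] = values[i];
        }

        return nkept;
    }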

But ultimately, I think the right solution will be to limit the buffer 
size to something like 10x the target, and roll with that. Typically, 
increasing the buffer size from e.g. 100B to 1000B brings much clearer 
improvement than increasing it from 1000B to 10000B. I'd bet this 
follows the pattern.

>  > FWIW, none of this is likely to be an issue in practice, because (a)
>  > tables usually don't have such strictly monotonic patterns, (b) people
>  > should stick to plain minmax for cases that do.
> 
> Still, it would be great if multi-minmax can be a drop in replacement. I 
> know there was a sticking point of a distance function not being 
> available on all types, but I wonder if that can be remedied or worked 
> around somehow.
> 

Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in 
replacement for minmax (essentially, using these opclasses as the 
default ones, with the option to switch back to plain minmax). I'm not 
convinced we should do that, though. Imagine you have minmax indexes in 
your existing DB, working perfectly fine, and then we come and just 
silently change that during dump/restore. Is there some past example 
where we did something similar and it turned out to be OK?

As for the distance functions, I'm pretty sure there are data types 
without "natural" distance - like most strings, for example. We could 
probably invent something, but the question is how much we can rely on 
it working well enough in practice.

Of course, is minmax even the right index type for such data types? 
Strings are usually "labels" and not queried using range queries, 
although sometimes people encode stuff as strings (but then it's very 
unlikely we'll define the distance function well). So maybe for those 
types a hash / bloom would be a better fit anyway.


But I do have an idea - maybe we can do without distances in those 
cases. Essentially, the primary issue of minmax indexes is outliers, so 
what if we simply sort the values, keep one range in the middle and 
store a few single points on each tail?

Imagine we have N values, and we want to represent them by K values. We 
simply sort the N values, keep (K-2)/2 values on each tail as outliers, 
and use 2 values for the values in between.

Example:

input: [1, 2, 100, 110, 111, ..., 120, , ..., 130, 201, 202]

Given k = 6, we would keep 2 values on tails, and range for the rest:

   [1, 2, (100, 130), 201, 202]

Of course, this does not optimize for the same thing as when we have 
distance - in that case we try to minimize the "covering" of the input 
data, something like

      sum(length(r) for r in ranges) / (max(ranges) - min(ranges))

But maybe it's good enough when there's no distance function ...
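A minimal sketch of that idea - hypothetical code just to illustrate the
shape of the summary, assuming the values are already sorted and N is
comfortably larger than K:

    /*
     * Summarize n sorted values into (k-2)/2 exact points on each tail
     * plus one [min,max] range covering everything in between.
     */
    static void
    summarize_without_distance(const double *sorted, int n, int k,
                               double *points, int *npoints,
                               double *range_min, double *range_max)
    {
        int     tail = (k - 2) / 2;     /* exact values kept on each tail */
        int     np = 0;

        for (int i = 0; i < tail; i++)
            points[np++] = sorted[i];           /* low outliers */

        for (int i = n - tail; i < n; i++)
            points[np++] = sorted[i];           /* high outliers */

        *npoints = np;
        *range_min = sorted[tail];              /* middle range */
        *range_max = sorted[n - tail - 1];
    }

For the example above (K = 6), the tails keep {1, 2} and {201, 202} and the
middle collapses to [100, 130].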


> And (c) regular tables
>  > tend to have much wider rows, so there are fewer values per range (so
>  > that other stuff is likely more expensive than building BRIN).
> 
> True. I'm still puzzled that it didn't help to use 8 pages per range, 
> but it's moot now.
> 

I'd bet that even with just 8 pages, there were quite a few values in 
the range - possibly hundreds per page. I haven't tested whether the patches 
help with smaller ranges, so maybe we should check.

> 
> Here are some numbers (median of 3) with a similar scenario as before, 
> repeated here with some added details. I didn't bother with what you 
> call "unpatched":
> 
>                 btree   minmax   multi
> monotonic-asc  44.4    26.5     27.8
> mono-del-ins   38.7    24.6     30.4
> mono-10-asc    61.8    25.6     33.5
> 

Thanks. Those numbers seem reasonable.


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
Hi Tomas, 
I took another look through the Bloom opclass portion (0004) with sorted_mode omitted, and it looks good to me code-wise. I think this part is close to commit-ready. I also did one more proofreading pass for minor details.

+     rows per block). The default values is <literal>-0.1</literal>, and

+     greater than 0.0 and smaller than 1.0. The default values is 0.01,

s/values/value/

+ * bloom filter, we can easily and cheaply test wheter it contains values

s/wheter/whether/

+ * XXX We assume the bloom filters have the same parameters fow now. In the

s/fow/for/

+ * or null if it does not exists.

s/exists/exist/

+ * We do expect the bloom filter to eventually switch to hashing mode,
+ * and it's bound to be almost perfectly random, so not compressible.

Leftover from when it started out in sorted mode.

+ if ((m/8) > BLCKSZ)

It seems we need something safer, to account for the page header and tuple header at least. As the comment before says, the filter will eventually not be compressible. I remember we can't be exact here, given the possibility of multiple columns, but we can leave a little slack space.

--

Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
> On 1/23/21 12:27 AM, John Naylor wrote:

> > Still, it would be great if multi-minmax can be a drop in replacement. I
> > know there was a sticking point of a distance function not being
> > available on all types, but I wonder if that can be remedied or worked
> > around somehow.
> >
>
> Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
> replacement for minmax (essentially, using these opclasses as the
> default ones, with the option to switch back to plain minmax). I'm not
> convinced we should do that - though. Imagine you have minmax indexes in
> your existing DB, it's working perfectly fine, and then we come and just
> silently change that during dump/restore. Is there some past example
> when we did something similar and it turned it to be OK?

I was assuming pg_dump can be taught to insert explicit opclasses for minmax indexes, so that upgrade would not cause surprises. If that's true, only new indexes would have the different default opclass.

> As for the distance functions, I'm pretty sure there are data types
> without "natural" distance - like most strings, for example. We could
> probably invent something, but the question is how much we can rely on
> it working well enough in practice.
>
> Of course, is minmax even the right index type for such data types?
> Strings are usually "labels" and not queried using range queries,
> although sometimes people encode stuff as strings (but then it's very
> unlikely we'll define the distance definition well). So maybe for those
> types a hash / bloom would be a better fit anyway.

Right.

> But I do have an idea - maybe we can do without distances, in those
> cases. Essentially, the primary issue of minmax indexes are outliers, so
> what if we simply sort the values, keep one range in the middle and as
> many single points on each tail?

That's an interesting idea. I think it would be a nice bonus to try to do something along these lines. On the other hand, I'm not the one volunteering to do the work, and the patch is useful as is.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 1/26/21 7:52 PM, John Naylor wrote:
> On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  >
>  >
>  > On 1/23/21 12:27 AM, John Naylor wrote:
> 
>  > > Still, it would be great if multi-minmax can be a drop in 
> replacement. I
>  > > know there was a sticking point of a distance function not being
>  > > available on all types, but I wonder if that can be remedied or worked
>  > > around somehow.
>  > >
>  >
>  > Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
>  > replacement for minmax (essentially, using these opclasses as the
>  > default ones, with the option to switch back to plain minmax). I'm not
>  > convinced we should do that - though. Imagine you have minmax indexes in
>  > your existing DB, it's working perfectly fine, and then we come and just
>  > silently change that during dump/restore. Is there some past example
>  > when we did something similar and it turned it to be OK?
> 
> I was assuming pg_dump can be taught to insert explicit opclasses for 
> minmax indexes, so that upgrade would not cause surprises. If that's 
> true, only new indexes would have the different default opclass.
> 

Maybe, I suppose we could do that. But I've always found such changes 
happening silently in the background a bit suspicious, because it may be 
quite confusing. I certainly wouldn't expect such a difference between 
creating a new index and an index created by dump/restore. Have we made such 
changes in the past? That might be a precedent, but I don't recall any 
example ...

>  > As for the distance functions, I'm pretty sure there are data types
>  > without "natural" distance - like most strings, for example. We could
>  > probably invent something, but the question is how much we can rely on
>  > it working well enough in practice.
>  >
>  > Of course, is minmax even the right index type for such data types?
>  > Strings are usually "labels" and not queried using range queries,
>  > although sometimes people encode stuff as strings (but then it's very
>  > unlikely we'll define the distance definition well). So maybe for those
>  > types a hash / bloom would be a better fit anyway.
> 
> Right.
> 
>  > But I do have an idea - maybe we can do without distances, in those
>  > cases. Essentially, the primary issue of minmax indexes are outliers, so
>  > what if we simply sort the values, keep one range in the middle and as
>  > many single points on each tail?
> 
> That's an interesting idea. I think it would be a nice bonus to try to 
> do something along these lines. On the other hand, I'm not the one 
> volunteering to do the work, and the patch is useful as is.
> 

IMO it's a fairly small amount of code, so I'll take a stab at it in the 
next version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Tue, Jan 26, 2021 at 6:59 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 1/26/21 7:52 PM, John Naylor wrote:
> > On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> > wrote:
> >  > Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
> >  > replacement for minmax (essentially, using these opclasses as the
> >  > default ones, with the option to switch back to plain minmax). I'm not
> >  > convinced we should do that - though. Imagine you have minmax indexes in
> >  > your existing DB, it's working perfectly fine, and then we come and just
> >  > silently change that during dump/restore. Is there some past example
> >  > when we did something similar and it turned it to be OK?
> >
> > I was assuming pg_dump can be taught to insert explicit opclasses for
> > minmax indexes, so that upgrade would not cause surprises. If that's
> > true, only new indexes would have the different default opclass.
> >
>
> Maybe, I suppose we could do that. But I've always found such changes
> happening silently in the background a bit suspicious, because they may be
> quite confusing. I certainly wouldn't expect such a difference between
> creating a new index and an index created by dump/restore. Have we made
> such changes in the past? That might be a precedent, but I don't recall
> any example ...

I couldn't think of a comparable example either. It comes down to evaluating risk: on the one hand, it's nice if users get an enhancement without having to know about it; on the other hand, if there is some kind of noticeable regression, that's bad.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Here's an updated and significantly improved version of the patch
series, particularly the multi-minmax part. I've fixed a number of
stupid bugs in that, discovered by either valgrind or stress-tests.

I was surprised by some of the bugs, or rather that the existing
regression tests failed to catch them, so it's probably worth briefly
discussing the details. There were two main classes of such bugs:


1) missing datumCopy

AFAICS this happened because there were a couple of missing datumCopy
calls - a BRIN range covers multiple pages, so with by-ref data types we
kept a pointer into a buffer that might have gone away unexpectedly.
Regular regression tests passed just fine, because brin_multi runs
almost separately, so the chance of the buffer being evicted was low.
Valgrind reported this (with a rather enigmatic message, as usual), and
so did a simple stress-test creating many indexes concurrently. Anything
causing aggressive eviction of buffers would do the trick, I think,
triggering segfaults, asserts, etc.
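
The fix is essentially the usual pattern - a sketch only, with
illustrative variable names, not the exact patch code:

    /*
     * Copy the datum out of the shared buffer into the current memory
     * context, so it stays valid even if the buffer gets evicted.
     * (For by-val types datumCopy simply returns the value.)
     */
    value = datumCopy(value, attr->attbyval, attr->attlen);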


2) bogus opclass definitions

There were a couple of opclasses referencing an incorrect distance
function, intended for a different data type. I was scratching my head
wondering WTH the regression tests pass, as there is a table used to
build a multi-minmax index on all supported data types. The reason is
pretty silly - the table is very small, just 100 rows, with a very low
fillfactor (so only a couple of values per page), and the index was
created with pages_per_range=1. So the compaction was not triggered and
the distance function was never actually called. I've decided to build
the indexes on a larger data set first, to test this. But maybe this
needs a somewhat different approach.


BLOOM
-----

The attached patch series addresses comments from the last review. As
for the size limit, I've defined a new macro

    #define BloomMaxFilterSize \
        MAXALIGN_DOWN(BLCKSZ - \
                      (MAXALIGN(SizeOfPageHeaderData + \
                                sizeof(ItemIdData)) + \
                       MAXALIGN(sizeof(BrinSpecialSpace)) + \
                       SizeOfBrinTuple))

and use that to determine if the bloom filter is too large. IMO that's
close enough, considering that this is a best-effort check anyway (due
to not being able to consider multi-column indexes).
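
The check itself then looks something like this (a sketch - the variable
name and the message wording are illustrative, not the exact patch code):

    /* reject bloom filters that could never fit on a single BRIN page */
    if (filter_size > BloomMaxFilterSize)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("bloom filter exceeds maximum BRIN tuple size")));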


MINMAX-MULTI
------------

As mentioned, there's a lot of fixes and improvements in this part, but
the basic principle is still the same. I've kept it split into three
parts with different approaches to building, so that it's possible to do
benchmarks and comparisons, and pick the best one.

a) 0005 - Aggressively compacts the summary, by always keeping it within
the limit defined by values_per_range. So if the range contains more
values, this may trigger compaction very often in some cases (e.g. for
monotonic series).

One drawback is that the more often the compactions happen, the less 
optimal the result is - the algorithm is kinda greedy, picking something 
like local optima in each step.

b) 0006 - Batch build, exactly the opposite of 0005. Accumulates all
values in a buffer, then does a single round of compaction at the very
end. This obviously doesn't have the "greediness" issues, but it may
consume quite a bit of memory for some data types and/or indexes with
large BRIN ranges.

c) 0007 - A hybrid approach, using a buffer that is a multiple of the
user-specified value, with some safety min/max limits. IMO this is what
we should use, although perhaps with some tuning of the exact limits.
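
To illustrate the hybrid sizing, it's roughly this (the constants and
names are made up for illustration, the patch may use different values):

    /* illustrative limits, not the values used by the patch */
    #define MINMAX_BUFFER_FACTOR    10
    #define MINMAX_BUFFER_MIN       256
    #define MINMAX_BUFFER_MAX       8192

    /* buffer capacity derived from the user-specified values_per_range */
    static int
    minmax_buffer_size(int values_per_range)
    {
        int     nvalues = values_per_range * MINMAX_BUFFER_FACTOR;

        return Min(Max(nvalues, MINMAX_BUFFER_MIN), MINMAX_BUFFER_MAX);
    }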


Attached is a spreadsheet with benchmark results for each of those three
approaches, on different data types (byval/byref), data set types, index
parameters (pages/values per range) etc. I think 0007 is a reasonable
compromise overall, with performance somewhere in between 0005 and 0006.
Of course, there are cases where it's somewhat slow, e.g. for data types
with expensive comparisons and data sets forcing frequent compactions,
in which case it's ~10x slower compared to regular minmax (in most cases
it's ~1.5x). Compared to btree, it's usually much faster - ~2-3x as fast
(except for some extreme cases, of course).


As for the opclasses for indexes without "natural" distance function,
implemented in 0008, I think we should drop it. In theory it works, but
I'm far from convinced it's actually useful in practice. Essentially, if
you have a data type with ordering but without a meaningful concept of a
distance, it's hard to say what is an outlier. I'd bet most of those
data types are used as "labels" where even the ordering is kinda
useless, i.e. hardly anyone uses range queries on things like names,
it's all just equality searches. Which means the bloom indexes are a
much better match for this use case.


The other thing we were considering was using the new multi-minmax
opclasses as default ones, replacing the existing minmax ones. IMHO we
shouldn't do that either. For existing minmax indexes that's useless
(the opclass seems to be working, otherwise the index would be dropped).
But even for new indexes I'm not sure it's the right thing, so I don't
plan to change this.


I'm also attaching the stress-test that I used to test the hell out of
this, creating indexes on various data sets, data types, with varying
index parameters, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
Zhihong Yu
Date:
Hi,
For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch :

+       /* same as preceding value, so store it */
+       if (compare_values(&range->values[start + i - 1],
+                          &range->values[start + i],
+                          (void *) &cxt) == 0)
+           continue;
+
+       range->values[start + n] = range->values[start + i];

It seems the comment doesn't match the code: the value is stored when the subsequent value is different from the previous one.

For has_matching_range():
+       int     midpoint = (start + end) / 2;

I think the standard idiom for the midpoint is start + (end - start) / 2.

+       /* this means we ran out of ranges in the last step */
+       if (start > end)
+           return false;

It seems the above check should come ahead of the computation of the midpoint.

Similar comment for the code in AssertCheckRanges().
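
That is, something like this (illustrative only):

    /* check this first, so we don't compute the midpoint needlessly */
    if (start > end)
        return false;   /* we ran out of ranges in the last step */

    /* overflow-safe equivalent of (start + end) / 2 */
    midpoint = start + (end - start) / 2;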

Cheers


Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 2/4/21 1:49 AM, Zhihong Yu wrote:
> Hi,
> For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch :
> 
> +       /* same as preceding value, so store it */
> +       if (compare_values(&range->values[start + i - 1],
> +                          &range->values[start + i],
> +                          (void *) &cxt) == 0)
> +           continue;
> +
> +       range->values[start + n] = range->values[start + i];
> 
> It seems the comment doesn't match the code: the value is stored when
> the subsequent value is different from the previous one.
> 

Yeah, you're right the comment is wrong - the code is doing exactly the
opposite. I'll need to go through this more carefully.

> For has_matching_range():
> +       int     midpoint = (start + end) / 2;
> 
> I think the standard idiom for the midpoint is start + (end - start) / 2.
> 
> +       /* this means we ran out of ranges in the last step */
> +       if (start > end)
> +           return false;
> 
> It seems the above check should come ahead of the computation of the
> midpoint.
> 

Not sure why that would be an issue, as we're not using the value and 
the values are just plain integers (so no overflows ...).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
Zhihong Yu
Date:
Hi,
bq. Not sure why that would be an issue

Moving the (start > end) check is up to your discretion.

But the midpoint computation should follow the textbook :-)

Cheers


Re: WIP: BRIN multi-range indexes

From
John Naylor
Date:
On Wed, Feb 3, 2021 at 7:54 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> [v-20210203]

Hi Tomas,

I have some random comments from reading the patch, but haven't gone into detail in the newer aspects. I'll do so in the near future.

The cfbot seems to crash on this patch during make check, but it doesn't crash for me. I'm not even sure what date that cfbot status is from.

> BLOOM
> -----

Looks good, but make sure you change the commit message -- it still refers to sorted mode.

+ * not entirely clear how to distrubute the space between those columns.

s/distrubute/distribute/

> MINMAX-MULTI
> ------------

> c) 0007 - A hybrid approach, using a buffer that is a multiple of the
> user-specified value, with some safety min/max limits. IMO this is what
> we should use, although perhaps with some tuning of the exact limits.

That seems like a good approach.

+#include "access/hash.h" /* XXX strange that it fails because of BRIN_AM_OID without this */

I think you want #include "catalog/pg_am.h" here.

> Attached is a spreadsheet with benchmark results for each of those three
> approaches, on different data types (byval/byref), data set types, index
> parameters (pages/values per range) etc. I think 0007 is a reasonable
> compromise overall, with performance somewhere in between 0005 and 0006.
> Of course, there are cases where it's somewhat slow, e.g. for data types
> with expensive comparisons and data sets forcing frequent compactions,
> in which case it's ~10x slower compared to regular minmax (in most cases
> it's ~1.5x). Compared to btree, it's usually much faster - ~2-3x as fast
> (except for some extreme cases, of course).
>
>
> As for the opclasses for indexes without "natural" distance function,
> implemented in 0008, I think we should drop it. In theory it works, but

Sounds reasonable.

> The other thing we were considering was using the new multi-minmax
> opclasses as default ones, replacing the existing minmax ones. IMHO we
> shouldn't do that either. For existing minmax indexes that's useless
> (the opclass seems to be working, otherwise the index would be dropped).
> But even for new indexes I'm not sure it's the right thing, so I don't
> plan to change this.

Okay.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:

On 2/9/21 3:46 PM, John Naylor wrote:
> On Wed, Feb 3, 2021 at 7:54 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
>  >
>  > [v-20210203]
> 
> Hi Tomas,
> 
> I have some random comments from reading the patch, but haven't gone 
> into detail in the newer aspects. I'll do so in the near future.
> 
> The cfbot seems to crash on this patch during make check, but it doesn't 
> crash for me. I'm not even sure what date that cfbot status is from.
> 

Yeah, I noticed that too, and I'm investigating.

I tried running the regression tests on a 32-bit machine (rpi4), which 
sometimes uncovers strange failures, and that indeed fails. There are 
two or three bugs.

Firstly, the allocation optimization patch does this:

     MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys * natts)

instead of

     MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys) * natts

and that sometimes produces the wrong result, triggering an assert.


Secondly, there seems to be an issue with cross-type bloom indexes. 
Imagine you have an int8 column, with a bloom index on it, and then you 
do this:

    WHERE column = '122'::int4;

Currently, we end up passing this to the consistent function, which 
tries to call hashint8 on the int4 datum - that succeeds on 64 bits 
(because both types are byval), but fails on 32 bits (where int8 is 
byref, while the int4 datum is not a pointer). Which causes a segfault.
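
To show the gist of it in code (illustrative only, not the actual call 
site in the patch):

    /*
     * Roughly what happens internally: hashint8 expects an int8 datum,
     * which is pass-by-reference on 32-bit builds, so it dereferences
     * the pass-by-value int4 datum as if it were a pointer - and crashes.
     */
    Datum   h = DirectFunctionCall1(hashint8, Int32GetDatum(122));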

I think I included those cross-type operators as a copy-paste from 
minmax indexes, but I see hash indexes don't allow this. And removing 
those cross-type rows from pg_amop.dat makes the crashes go away.

It's also possible I simplified get_strategy_procinfo a bit too much. I 
see the minmax variant has a subtype, so maybe we could do that instead 
(I recall the integer types should have "compatible" results).

There are a couple of failures where the index does not produce the right 
number of results, though. I haven't investigated that yet. Once I fix 
this, I'll post an updated patch - hopefully that'll make cfbot happy.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
Hi,

Attached is an updated version of the patch series, addressing all the 
failures on cfbot (at least I hope so). This turned out to be more fun 
than I expected, as the issues went unnoticed on 64-bits and only failed 
on 32-bits. That's also why I'm not entirely sure this will make cfbot 
happy as that seems to be x86_64, but the issues are real so let's see.

1) I already outlined the issue in the previous message:

     MAXALIGN(a * b) != MAXALIGN(a) * b

and there's an assert that we used exactly the same amount of memory we 
allocated, so this caused a crash. Strange that it'd fail on 32-bits and 
not 64-bits, but perhaps there's some math reason for that, or maybe it 
was just pure luck.
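
For illustration, assuming 8-byte alignment, the two expressions only 
differ when the element size is not already a multiple of MAXIMUM_ALIGNOF 
(e.g. a 4-byte value on 32-bit), which might explain it:

    MAXALIGN(4 * 3) = 16    vs.    MAXALIGN(4) * 3 = 24    (differ)
    MAXALIGN(8 * 3) = 24    vs.    MAXALIGN(8) * 3 = 24    (same)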


2) The rest of the issues generally boil down to types that are byval 
on 64-bits but byref on 32-bits, like int8 or float8. The first place 
causing issues was cross-type operators, i.e. the bloom opclasses did 
things like this in pg_amop.dat:

   { amopfamily => 'brin/integer_bloom_ops', amoplefttype => 'int2',
     amoprighttype => 'int8', amopstrategy => '1',
     amopopr => '=(int2,int8)', amopmethod => 'brin' },

so it was possible to do this:

     WHERE int8column = 1234::int2

in which case we used the int8 opclass, so the consistent function 
thought it was working with int8, and used the hash function defined for 
that opclass in pg_amproc. That's hashint8, of course, but we called it 
on a Datum storing an int2. Dereferencing that pointer is pretty much 
guaranteed to fail with a segfault.

I think there are two options to fix this. Firstly, we can remove the 
cross-type operators, so that the left/right type is always the same. 
That'll work fine for most cases, and it's pretty simple. It's also what 
the hash_ops opclasses do, so I've done that.

An alternative would be to do something like minmax does for strategies, 
and consider the subtype (i.e. the type of the right argument). It's a 
bit tricky, though, because it assumes the hash functions for the two 
types are "compatible" and produce the same hash for the same value. 
AFAIK that's correct for the usual cases (int2/int4/int8) and it'd be 
restricted by pg_amop. But hash_ops don't do that for some reason, so I 
wonder what I am missing. (The other thing is where to define these hash 
functions - right now pg_amproc only tracks the hash function for the 
"base" data type, and there may be multiple supported subtypes, so where 
to store those? Perhaps we could use the hash function from the default 
hash opclass for each type.)

Anyway, I've decided to keep this simple for now, and I've ripped out 
the cross-type operators. We can add them back later, if needed.


3) There were a couple of byref failures in the distance functions, 
which generally used "double" internally (which I'm not sure is 
guaranteed to be a 64-bit type) instead of float8, and used plain 
"return" instead of PG_RETURN_FLOAT8() in a couple of places. Silly 
mistakes.
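
So the correct shape is the usual fmgr pattern, something like this 
(just a sketch with a made-up function name, not the patch's actual code):

    /* distance between two float8 values, returned as float8 */
    Datum
    brin_minmax_multi_distance_example(PG_FUNCTION_ARGS)
    {
        float8      a1 = PG_GETARG_FLOAT8(0);
        float8      a2 = PG_GETARG_FLOAT8(1);

        PG_RETURN_FLOAT8(a2 - a1);
    }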


4) A particularly funny mistake was in actually calculating the hashes 
for the bloom filter, which uses hash_uint32_extended (so that we can 
seed it). The trouble is that while hash_uint32() returns a uint32, 
hash_uint32_extended returns ... a uint64. So we calculated a hash, but 
then used the *pointer* to the uint64 value, not the value. I have to 
say, the "uint32" in the function name is somewhat misleading.


This passes all my tests, including valgrind on the 32-bit rpi4 machine 
and the stress test (testing both the bloom and multi-minmax opclasses, 
etc.).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: WIP: BRIN multi-range indexes

From
Tomas Vondra
Date:
On 2/11/21 3:51 PM, Tomas Vondra wrote:
>
> ...
> 
> This passes all my tests, including valgrind on the 32-bit rpi4 machine
> and the stress test (testing both the bloom and multi-minmax opclasses,
> etc.).
> 

OK, the cfbot seems happy with it, but I forgot to address the minor
issues mentioned in the review from 2021/02/09, so here's a patch series
addressing that.


Overall, I think the plan is to eventually commit 0001-0004 as is,
squash 0005-0007 (so the minmax-multi uses the "hybrid" approach). I
don't intend to commit 0008, because I have doubts those opclasses are
really useful for anything.

As for 0009, I think it's a fairly small tweak - the correlation made
sense for regular brin indexes, but those new opclasses are meant exactly
for cases where the data is not well correlated.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment