Thread: [HACKERS] WIP: BRIN bloom indexes

[HACKERS] WIP: BRIN bloom indexes

From: Tomas Vondra

Hi,

The BRIN minmax opclasses work well only for data where the column is
somewhat correlated to physical location in a table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns) the minmax ranges get very "wide" and
we end up scanning large portions of the tables.

Typical examples are columns with types that are inherently random (or
rather non-correlated), like UUIDs, IP or MAC addresses, and
so on. But it can just as easily happen even with simple IDs generated
from a sequence.

This WIP patch implements another type of BRIN index, with the "summary"
represented by a bloom filter instead of a [min,max] range for each page
range. The index is of course larger than the
regular "minmax" BRIN index, but it's still orders of magnitude smaller
than a btree index.
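
To make the idea concrete, here's a minimal stand-alone sketch of such a
per-range summary in plain C. This is not the patch code - the hash mixing
and the names are made up for illustration, and the real opclass of course
goes through the regular BRIN add_value / consistent interface:

    #include <stdbool.h>
    #include <stdint.h>

    #define NBITS   8192    /* bits in one per-range summary (illustrative) */
    #define NHASHES 4       /* number of hash probes (illustrative) */

    typedef struct BloomSummary
    {
        unsigned char bitmap[NBITS / 8];
    } BloomSummary;

    /* cheap integer mixer, standing in for a real hash function */
    static uint32_t
    mix32(uint32_t x, uint32_t seed)
    {
        x ^= seed;
        x ^= x >> 16;
        x *= 0x85ebca6bu;
        x ^= x >> 13;
        x *= 0xc2b2ae35u;
        x ^= x >> 16;
        return x;
    }

    /* add one value from the page range to the range's summary */
    static void
    bloom_add(BloomSummary *f, uint32_t value)
    {
        for (uint32_t i = 0; i < NHASHES; i++)
        {
            uint32_t h = mix32(value, i) % NBITS;

            f->bitmap[h / 8] |= (unsigned char) (1 << (h % 8));
        }
    }

    /* false = "definitely not in this range", true = "maybe, scan it" */
    static bool
    bloom_maybe_contains(const BloomSummary *f, uint32_t value)
    {
        for (uint32_t i = 0; i < NHASHES; i++)
        {
            uint32_t h = mix32(value, i) % NBITS;

            if ((f->bitmap[h / 8] & (1 << (h % 8))) == 0)
                return false;
        }
        return true;
    }

During a bitmap index scan the consistent function only needs the "maybe
contains" side: page ranges that return false are skipped entirely, and
ranges that return true get the usual lossy recheck.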

Note: This is different from the index type implemented in the "bloom"
extension. That extension is not related to BRIN at all; it builds a
bloom filter on individual rows (combining values from different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented. Adding the new opclass was
extremely straightforward.


To demonstrate this on a small example, consider this table:

    CREATE TABLE bloom_test (id uuid, padding text);

    INSERT INTO bloom_test
    SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
      FROM generate_series(1,200000000) s(i);

    VACUUM ANALYZE bloom_test;

                         List of relations
     Schema |    Name    | Type  | Owner | Size  | Description
    --------+------------+-------+-------+-------+-------------
     public | bloom_test | table | tomas | 16 GB |
    (1 row)

The table was built so that each range contains a relatively small number
of distinct UUID values - this is typical e.g. for user activity with
"bursts" of actions from a particular user separated by long periods of
inactivity (with actions from other users).

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

    create index test_brin_idx on bloom_test
     using brin (id);

    create index test_bloom_idx on bloom_test
     using brin (id uuid_bloom_ops);

    create index test_btree_idx on bloom_test (id);

which gives us this:

                          List of relations
     Schema |      Name      | Type  | Owner |   Table    |  Size
    --------+----------------+-------+-------+------------+---------
     public | test_bloom_idx | index | tomas | bloom_test | 12 MB
     public | test_brin_idx  | index | tomas | bloom_test | 832 kB
     public | test_btree_idx | index | tomas | bloom_test | 6016 MB
    (3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" filter can only handle equality conditions.

Let's see a query like this:

    select * from bloom_test
     where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

                                QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on bloom_test
     (cost=390.31..2130910.82 rows=20593 width=49)
     (actual time=197.974..22707.311 rows=20000 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 199980000
   Heap Blocks: lossy=2061856
   ->  Bitmap Index Scan on test_brin_idx
           (cost=0.00..385.16 rows=5493161 width=0)
           (actual time=133.463..133.463 rows=20619520 loops=1)
         Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.165 ms
 Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

                                QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on bloom_test
     (cost=5898.31..2136418.82 rows=20593 width=49)
     (actual time=24.229..338.324 rows=20000 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 2500448
   Heap Blocks: lossy=25984
   ->  Bitmap Index Scan on test_bloom_idx
         (cost=0.00..5893.16 rows=5493161 width=0)
         (actual time=22.811..22.811 rows=259840 loops=1)
         Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.201 ms
 Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
takes only about 10ms in this case, but that's mostly thanks to the
whole dataset being entirely in-memory.

There are some remaining open questions.

1) sizing the bloom filter

Firstly, the size of the filter is currently static, based on 1000
distinct values per range with a 5% false-positive rate. Those are mostly
arbitrary values, and I don't have any clear idea how to determine the
optimal values.

Maybe we could do the sizing based on the ndistinct value for the indexed
column? It also depends on the pages_per_range value, so perhaps it
should be a user-defined option too.
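
For reference, the textbook formulas tie the filter size to the expected
number of distinct values n and the target false-positive rate p, so an
ndistinct-based sizing could look roughly like this (the math is standard,
the function and the defaults are just an illustration):

    #include <math.h>
    #include <stdio.h>

    /*
     * Standard bloom filter sizing:
     *   m = -n * ln(p) / (ln 2)^2     (number of bits)
     *   k =  (m / n) * ln 2           (number of hash functions)
     */
    static void
    bloom_size_for(double ndistinct, double fp_rate, long *nbits, int *nhashes)
    {
        double ln2 = log(2.0);
        double m = -ndistinct * log(fp_rate) / (ln2 * ln2);
        double k = (m / ndistinct) * ln2;

        *nbits = (long) ceil(m);
        *nhashes = (int) round(k);
    }

    int
    main(void)
    {
        long nbits;
        int  nhashes;

        /* the current defaults: 1000 distinct values, 5% false positives */
        bloom_size_for(1000, 0.05, &nbits, &nhashes);
        printf("nbits = %ld (~%ld bytes), nhashes = %d\n",
               nbits, nbits / 8, nhashes);   /* about 6236 bits, 4 hashes */
        return 0;
    }

That works out to roughly 780 bytes of bitmap per range, which is broadly
consistent with the 12 MB bloom index above (about 2M heap pages / 128
pages per range, i.e. roughly 16k ranges).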

An interesting feature is that the bloom indexes "degrade locally". If a
page range has significantly more distinct values, the bloom filter may
be way too small and will suffer from a high false-positive rate. But it
only affects that one page range, and the rest may be perfectly fine.

I was thinking about disabling the bloom filter when it gets "too full"
(too many bits set, resulting in a high false-positive rate). But that
seems like a bad idea - the false-positive rate automatically jumps to
100% for that range, and we only save a much smaller amount of space in
the index. So even if the false-positive rate is 50% it still seems
efficient to keep the bloom filter.

Another thing to consider is what happens when there are very few
distinct values in a given page range. The current code tries to be a
bit smart in this case - instead of building the bloom filter right
away, it initially keeps the exact values and only switches to a bloom
filter once the exact values would fill the same space.
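
A simplified sketch of that switch (not the actual patch code - the sizes
and names are illustrative, the patch apparently keeps the exact values
sorted while this sketch just does a linear scan, and a real implementation
would overlay the two arrays rather than store both):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define NBITS     6240                              /* illustrative bitmap size */
    #define NHASHES   4
    #define MAX_EXACT (NBITS / 8 / sizeof(uint32_t))    /* same space as the bitmap */

    typedef struct RangeSummary
    {
        bool          is_bloom;             /* false: exact hashes, true: bitmap */
        uint32_t      nvalues;
        uint32_t      exact[MAX_EXACT];     /* used in exact mode */
        unsigned char bitmap[NBITS / 8];    /* used in bloom mode */
    } RangeSummary;

    /* set NHASHES bits for one hash value (stand-in for the real hashing) */
    static void
    bitmap_add(RangeSummary *s, uint32_t hash)
    {
        for (uint32_t i = 0; i < NHASHES; i++)
        {
            uint32_t h = (hash ^ (i * 0x9e3779b9u)) % NBITS;

            s->bitmap[h / 8] |= (unsigned char) (1 << (h % 8));
        }
    }

    static void
    summary_add(RangeSummary *s, uint32_t hash)
    {
        if (!s->is_bloom)
        {
            /* exact mode: duplicates are common in a range, store each value once */
            for (uint32_t i = 0; i < s->nvalues; i++)
                if (s->exact[i] == hash)
                    return;

            if (s->nvalues < MAX_EXACT)
            {
                s->exact[s->nvalues++] = hash;
                return;
            }

            /* exact values would now fill the bitmap's space: switch to bloom mode */
            s->is_bloom = true;
            memset(s->bitmap, 0, sizeof(s->bitmap));
            for (uint32_t i = 0; i < s->nvalues; i++)
                bitmap_add(s, s->exact[i]);
        }

        bitmap_add(s, hash);
    }

For ranges with only a handful of distinct values this keeps the summary
both exact (no false positives at all) and small.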

2) costing

The other thing is costing of BRIN indexes. At this point it's rather
simplistic and independent of the operator class, so the only thing that
matters is the size of the BRIN index. That means a "minmax" index is always
considered cheaper than a "bloom" index, because the optimizer has no idea
the "minmax" indexes are more vulnerable to "wide ranges".

But perhaps this is a non-issue, as the bloom indexes are meant for cases
where minmax indexes don't work. And when minmax indexes work, they work
better than bloom. So people are unlikely to build both of them at the
same time.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Robert Haas

On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Let's see a query like this:
>
>     select * from bloom_test
>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>
> The minmax index produces this plan
>
>    Heap Blocks: lossy=2061856
>  Execution time: 22707.891 ms
>
> Now, the bloom index:
>
>    Heap Blocks: lossy=25984
>  Execution time: 338.946 ms

It's neat to see BRIN being extended like this.  Possibly we could
consider making it a contrib module rather than including it in core,
although I don't have strong feelings about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: BRIN bloom indexes

From: Simon Riggs

On 27 October 2017 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Let's see a query like this:
>>
>>     select * from bloom_test
>>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>
>> The minmax index produces this plan
>>
>>    Heap Blocks: lossy=2061856
>>  Execution time: 22707.891 ms
>>
>> Now, the bloom index:
>>
>>    Heap Blocks: lossy=25984
>>  Execution time: 338.946 ms
>
> It's neat to see BRIN being extended like this.  Possibly we could
> consider making it a contrib module rather than including it in core,
> although I don't have strong feelings about it.

I see that SP-GIST includes two operator classes in core, one default.

Makes sense to do the same thing with BRIN and add this new op class
as a non-default option in core.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Tomas Vondra

Hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:
> On 27 October 2017 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Let's see a query like this:
>>>
>>>     select * from bloom_test
>>>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>>
>>> The minmax index produces this plan
>>>
>>>    Heap Blocks: lossy=2061856
>>>  Execution time: 22707.891 ms
>>>
>>> Now, the bloom index:
>>>
>>>    Heap Blocks: lossy=25984
>>>  Execution time: 338.946 ms
>>
>> It's neat to see BRIN being extended like this.  Possibly we could
>> consider making it a contrib module rather than including it in core,
>> although I don't have strong feelings about it.
> 
> I see that SP-GIST includes two operator classes in core, one default.
> 
> Makes sense to do the same thing with BRIN and add this new op class
> as a non-default option in core.
> 

Not sure "a number of in-core opclasses" is a good reason to (not) add
new ones. Also, we already have two built-in BRIN opclasses (minmax and
inclusion).

In general, "BRIN bloom" can be packaged as a contrib module (at least I
believe so). That's not the case for "BRIN multi-range", which also
requires some changes to the code in brin.c (but the rest can be moved
to a contrib module, of course).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Alvaro Herrera

Tomas Vondra wrote:

> Not sure "a number of in-core opclasses" is a good reason to (not) add
> new ones. Also, we already have two built-in BRIN opclasses (minmax and
> inclusion).
>
> In general, "BRIN bloom" can be packed as a contrib module (at least I
> believe so). That's not the case for the "BRIN multi-range" which also
> requires some changes to some code in brin.c (but the rest can be moved
> to contrib module, of course).

I was rather thinking that if we can make this very robust against the
index growing out of proportion, we should consider ditching the
original minmax and replacing it with multirange minmax, which seems like
it'd have much better behavior.

I don't see any reason to put any of this in contrib.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Robert Haas

On Fri, Oct 27, 2017 at 2:55 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I was rather thinking that if we can make this very robust against the
> index growing out of proportion, we should consider ditching the
> original minmax and replace it with multirange minmax, which seems like
> it'd have much better behavior.

If the multirange stuff can be done in such a way that it's just an
updated version of the same opclass, and backward-compatible on disk,
then I think this would be OK.  But otherwise I don't think we should ditch
what already exists.  That would break upgrades via both pg_upgrade
and pg_dump, which seems like too high a price to pay to get rid of
some arguably-worse code.  It's actually WORSE to drop an opclass
(which will make dumps not restore) than to do something like bump
HASH_VERSION (which doesn't affect pg_dump at all and for pg_upgrade
only requires post-upgrade steps rather than failing outright).

> I don't see any reason to put any of this in contrib.

Well, for one thing, it makes it easier to drop stuff later if we
decide we don't really want it.  I think that whichever BRIN opclasses
are thought to be high quality and of general utility can go into
core, just as we've done with other index AMs.  However, if we think
that something is useful-ish but maybe not something to which we want
to make a permanent commitment, putting it into contrib is good for
that.

Upgrades are easier for things in contrib, too, because there's a
built-in mechanism for people to try updating the SQL extensions
(ALTER EXTENSION .. UPDATE) and if it fails they can adjust things and
try again.  When you just make a hard change to SQL definitions in a
new release, any failures that result from those changes just turn
into upgrade failures, which is IMHO a lot more painful than a failure
to update an extension version while the database is still up and
usable the whole time.

For instance, if pg_stat_activity were bundled in an extension and we
made the C code backward-compatible with old extension versions,
then some of the upgrade pain users have had with that over the years
could have been avoided.  People could update to the new version
without failures and then at their leisure try to update.  If the
update failed due to dependencies, then they would have time to figure
out what to do about it and try again later; in the meantime, they'd
be on the new version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: BRIN bloom indexes

From: Sokolov Yura

On 2017-10-19 23:15, Tomas Vondra wrote:
> [...]

Hi, Tomas

The BRIN bloom index is a really cool feature that definitely should be in
the core distribution (either in contrib or builtin)!!!

Small suggestion for algorithm:

It is a well-known practice not to calculate the whole hash function for
every probe, but to use a double hashing approach.
A paper proving the double hashing approach works for bloom filters:
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

So, instead of:

    for (i = 0; i < filter->nhashes; i++)
    {
        /* compute the hashes with a seed, used for the bloom filter */
        uint32 h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;

        /* set or check bit h */
    }

such construction could be used:

    /* compute the hashes, used for the bloom filter */
    uint32 big_h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i)));
    uint32 h = big_h % filter->nbits;
    /* ensures d is never >= filter->nbits */
    uint32 d = big_h % (filter->nbits - 1);

    for (i = 0; i < filter->nhashes; i++)
    {
        /* set or check bit h */

        /* compute next bit h */
        h += d++;
        if (h >= filter->nbits) h -= filter->nbits;
        if (d == filter->nbits) d = 0;
    }

Taking the modulo of one 64-bit hash result by two coprime numbers (`nbits`
and `nbits-1`) gives two good-quality functions usable for double hashing.

With regards,
-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company



Re: [HACKERS] WIP: BRIN bloom indexes

From: Nico Williams

On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:

A bloom filter index would, indeed, be wonderful.

Comments:

+ * We use an optimisation that initially we store the uint32 values directly,
+ * without the extra hashing step. And only later filling the bitmap space,
+ * we switch to the regular bloom filter mode.

I don't think that optimization is worthwhile.  If I'm going to be using
a Bloom filter, it's because I expect to have a lot of rows.

(Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
table just won't have lots of rows, then I might want this optimization,
but I can always just drop the Bloom filter index, or not include
indexes in the LIKE.)

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

A bytea is all that's needed.  See below.

+ * XXX We could also implement "sparse" bloom filters, keeping only the
+ * bytes that are not entirely 0. That might make the "sorted" phase
+ * mostly unnecessary.

Filter compression is not worthwhile.  You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full.  That means that when near 50% full the filter
will not be very sparse at all.  Optimizing for the uncommon case
doesn't seem worthwhile, and adds complexity.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

A Scalable Bloom filter is actually a series of Bloom filters where all
but the newest filter are closed to additions, and the newest filter is
where you do all the additions.  You generally want to make each new
filter bigger than the preceding one because when searching a Scalable
Bloom filter you have to search *all* of them, so you want to minimize
the number of filters.

Eventually, when you have enough sub-filters, you may want to re-write
the whole thing so that you start with a single sub-filter that is large
enough.

The format of the bytea might then be something like:

<size><bitmap>[<size><bitmap>[...]]

where the last bitmap is the filter that is open to additions.
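
As an illustration of that layout, a lookup over such a concatenated value
could look like this in C (hypothetical names, a single hash probe per
sub-filter to keep the sketch short - a real filter would use k probes and
proper varlena headers):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * The value is a sequence of sub-filters, each a uint32 bit count
     * followed by the bitmap bytes.  A hit in any sub-filter means "maybe
     * present"; only if every sub-filter misses is the value definitely
     * absent.
     */
    static bool
    scalable_maybe_contains(const unsigned char *data, size_t len, uint32_t hash)
    {
        size_t offset = 0;

        while (offset < len)
        {
            uint32_t nbits;
            const unsigned char *bitmap;
            uint32_t h;

            memcpy(&nbits, data + offset, sizeof(nbits));   /* <size> */
            bitmap = data + offset + sizeof(nbits);         /* <bitmap> */

            h = hash % nbits;
            if (bitmap[h / 8] & (1 << (h % 8)))
                return true;                /* possible match in this sub-filter */

            offset += sizeof(nbits) + (nbits + 7) / 8;      /* next sub-filter */
        }

        return false;                       /* definitely not present */
    }

Inserts only ever touch the last sub-filter, which is what makes the earlier
(closed) sub-filters effectively read-only.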

I wonder if there are write concurrency performance considerations
here...

It might be better to have a bytea value per-sub-filter so that there is
no lock contention for the closed filters.  The closed sub-filters are
constant, thus not even shared locks are needed for them, and especially
not exclusive locks.

Writing to the filter will necessitate locking the entire open filter,
or else byte-range locking on it.  Something to think about.

> Now, what about query performance? Unlike the "minmax" indexes, the
> "bloom" filter can only handle equality conditions.

A Bloom filter has non-zero false positives for existence, but zero
false positives for non-existence (a "not present" answer is always correct).

This means that a Bloom filter index can only be used for:

a) non-existence tests (with equality predicates, as you point out),

b) as an optimization to avoid slower index checks (or heap scans) when
   the Bloom filter indicates non-existence.

(b) is really just an internal application of (a).

There might be applications where false positives might be ok in a query
like:
 SELECT a.* FROM a a JOIN b b USING (some_col);

but for most real-world queries like that, allowing false positives by
default would be very bad.  An option in the index declaration could be
used to enable existence equality predicates, but I would not include
such an option initially -- let's see if anyone actually has a use case
for it first.

Of course, for something like:
 SELECT a.*, b.* FROM a a JOIN b b USING (some_col);

a Bloom filter can only be used as an optimization to avoid using a
slower index (or heap scan) on the inner table source.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

Nico
-- 



Re: [HACKERS] WIP: BRIN bloom indexes

From: Sokolov Yura

On 2017-10-27 20:17, Nico Williams wrote:
> [...]
>
> What I'm getting at is that the query planner really needs to understand
> that a Bloom filter is a probabilistic data structure.
> 
> Nico
> --

PostgreSQL has a lot of probabilistic indexes.

Some GiST opclasses return false positives and tell the executor to
recheck their results.
A bitmap index scan, when there are a lot of result tuples, falls back to
storing only page numbers, without actual TIDs, and the executor then scans
the heap pages to find the matching tuples.

A BRIN index just "recommends the executor to scan that heap page". Thus
a BRIN index is as a whole just an optimisation (regardless of whether it
is `minmax` or `bloom`). So that is ok.

-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company



Re: [HACKERS] WIP: BRIN bloom indexes

From: Tomas Vondra

Hi,

On 10/27/2017 07:17 PM, Nico Williams wrote:
> On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:
> 
> A bloom filter index would, indeed, be wonderful.
> 
> Comments:
> 
> + * We use an optimisation that initially we store the uint32 values directly,
> + * without the extra hashing step. And only later filling the bitmap space,
> + * we switch to the regular bloom filter mode.
> 
> I don't think that optimization is worthwhile.  If I'm going to be using
> a Bloom filter, it's because I expect to have a lot of rows.
> 
> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> table just won't have lots of rows, then I might want this optimization,
> but I can always just drop the Bloom filter index, or not include
> indexes in the LIKE.)
> 

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

> + * XXX Perhaps we could save a few bytes by using different data types, but
> + * considering the size of the bitmap, the difference is negligible.
> 
> A bytea is all that's needed.  See below.
> 
> + * XXX We could also implement "sparse" bloom filters, keeping only the
> + * bytes that are not entirely 0. That might make the "sorted" phase
> + * mostly unnecessary.
> 
> Filter compression is not worthwhile.  You want to have a fairly uniform
> hash distribution, and you want to end up with a Bloom filter that's not
> much more than 50% full.  That means that when near 50% full the filter
> will not be very sparse at all.  Optimizing for the not common case
> doesn't seem worthwhile, and adds complexity.
> 

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we don't even know
many of the values useful for estimating it (how many rows will fit into
a range, the number of distinct values in the whole table, etc.).

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.

> + * XXX We can also watch the number of bits set in the bloom filter, and
> + * then stop using it (and not store the bitmap, to save space) when the
> + * false positive rate gets too high.
> 
> Ah, right, what you want is a "Scalable Bloom Filter".
> 

That's not what I had in mind. My idea was to size the bloom filter on a
"best effort" basis, and if one range gets a bit too inaccurate,
just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

> A Scalable Bloom filter is actually a series of Bloom filters where all
> but the newest filter are closed to additions, and the newest filter is
> where you do all the additions.  You generally want to make each new
> filter bigger than the preceding one because when searching a Scalable
> Bloom filter you have to search *all* of them, so you want to minimize
> the number of filters.
> 
> Eventually, when you have enough sub-filters, you may want to re-write
> the whole thing so that you start with a single sub-filter that is large
> enough.
> 
> The format of the bytea might then be something like:
> 
> <size><bitmap>[[<size><bitmap>[...]]
> 
> where the last bitmap is the filter that is open to additions.
>

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

> ...
> 
> Of course, for something like:
> 
>   SELECT a.*, b.* FROM a a JOIN b b USING (some_col);
> 
> a Bloom filter can only be used as an optimization to avoid using a
> slower index (or heap scan) on the inner table source.
> 
> What I'm getting at is that the query planner really needs to understand
> that a Bloom filter is a probabilistic data structure.
> 

It does, and we never produce incorrect results. Which seems like the
right thing to do.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Tomas Vondra

Hi,

On 10/27/2017 05:22 PM, Sokolov Yura wrote:
> 
> Hi, Tomas
> 
> BRIN bloom index is a really cool feature, that definitely should be in
> core distribution (either in contrib or builtin)!!!
> 
> Small suggestion for algorithm:
> 
> It is well known practice not to calculate whole hash function for every
> point, but use double hashing approach.
> Example of proving double hashing approach for bloom filters:
> https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
> 
> So, instead of:
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* compute the hashes with a seed, used for the bloom filter */
>         uint32 h = ((uint32)DatumGetInt64(hash_uint32_extended(value,
> i))) % filter->nbits;
> 
>         /* set or check bit h */
>     }
> 
> such construction could be used:
>     /* compute the hashes, used for the bloom filter */
>     uint32 big_h = ((uint32)DatumGetInt64(hash_uint32_extended(value, i)));
>     uint32 h = big_h % filter->nbits;
>     /* ensures d is never >= filter->nbits */
>     uint32 d = big_h % (filter->nbits - 1);
> 
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* set or check bit h */
> 
>         /* compute next bit h */
>         h += d++;
>         if (h >= filter->nbits) h -= filter->nbits;
>         if (d == filter->nbits) d = 0;
>     }
> 
> Modulo of one 64bit result by two coprime numbers (`nbits` and `nbits-1`)
> gives two good-quality functions usable for double hashing.
> 

OK, thanks for the suggestion.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: BRIN bloom indexes

From: Nico Williams

On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
> > + * We use an optimisation that initially we store the uint32 values directly,
> > + * without the extra hashing step. And only later filling the bitmap space,
> > + * we switch to the regular bloom filter mode.
> > 
> > I don't think that optimization is worthwhile.  If I'm going to be using
> > a Bloom filter, it's because I expect to have a lot of rows.
> > 
> > (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> > table just won't have lots of rows, then I might want this optimization,
> > but I can always just drop the Bloom filter index, or not include
> > indexes in the LIKE.)
> 
> I think you're confusing "rows" and "distinct values". Furthermore, it's
> rather irrelevant how many rows are in the table because BRIN indexes
> split the table into "ranges" that are 1MB by default. And you can only
> squash certain number of rows into such range.
> 
> The idea of the optimization is to efficiently support cases where each
> range contains only small number of distinct values - which is quite
> common in the cases I described in my initial message (with bursts of
> rows with the same IP / client identifier etc.).

What range?  It's just bits to set.

The point is that Bloom filters should ideally be about 50% full -- much
less than that and you are wasting space, much more than that and your
false positive rate becomes useless.

> > Filter compression is not worthwhile.  You want to have a fairly uniform
> > hash distribution, and you want to end up with a Bloom filter that's not
> > much more than 50% full.  That means that when near 50% full the filter
> > will not be very sparse at all.  Optimizing for the not common case
> > doesn't seem worthwhile, and adds complexity.
> 
> Properly sizing the bloom filter requires knowledge of many variables,
> in particular the number of distinct values expected to be added to the
> filter. But we don't really know that number, and we also don't even
> know many values useful for estimating that (how many rows will fit into
> a range, number of distinct values in the whole table, etc.)

This is why Scalable Bloom filters exist: so you can adapt.

> So the idea was to oversize the bloom filter, and then use the sparse
> representation to reduce the size.

A space-efficient representation of sparse bitmaps is not as simple as a
Scalable Bloom filter.

And a filter that requires user intervention to size correctly, or which
requires rebuilding when it gets too full, is also not as convenient as
a Scalable Bloom filter.

> > + * XXX We can also watch the number of bits set in the bloom filter, and
> > + * then stop using it (and not store the bitmap, to save space) when the
> > + * false positive rate gets too high.
> > 
> > Ah, right, what you want is a "Scalable Bloom Filter".
> 
> That's not what I had in mind. My idea was to size the bloom filter on
> "best effort" basis, and then if one range gets a bit too inaccurate
> then just get rid of the filter. If that happens for many ranges, we
> should probably allow specifying parameters as relopts for the index.

Scalable Bloom filters are way more convenient than that.  They're
always not-too-full, and only the open filter is less-than-full-enough.

And since you should grow them somewhat exponentially (but not quite as
much as a doubling in each generation), there should never be too many
filters to search.  But you can always "vacuum" (rebuild) the filter
starting with the size of the last filter added prior to the vacuum.

> I think this is really an over-engineering, and I certainly don't plan
> to extend the patch in this direction.

Whereas I think a sparse bitmap representation is overly complex and
"over-engineering".  Scalable Bloom filters are very well understood in
the literature -- there's nothing terribly complex to them.

> I do not expect these parameters (particularly the number of distinct
> values in a range) to significantly change over time, so the easiest
> solution is to provide a reloption to specify that number in
> CREATE/ALTER index.

Doesn't this depend on the use-case though?  I think a self-tuning
system is better than one that doesn't self-tune.  Call that
over-engineering if you like, but it doesn't make it not desirable :)

> Alternatively, we could allow the summarization process to gather some
> statistics (either on the whole range, or perhaps the whole table), and
> store them somewhere for later use. For example to compute the number of
> distinct values per range (in the existing data), and use that later.

Again, Scalable Bloom filters automatically adapt without needing a
statistics gathering exercise.  All you need is a good hash function
(that's another topic).

Scalable Bloom filters are a trivial extension of Bloom filters.

> > What I'm getting at is that the query planner really needs to understand
> > that a Bloom filter is a probabilistic data structure.
> 
> It does, and we never produce incorrect results. Which seems like the
> right thing to do.

OK, I saw your other response about this.  I didn't know that PG already
has support for probabilistic indexes.

Nico
-- 



Re: [HACKERS] WIP: BRIN bloom indexes

From: Tomas Vondra

Hi,

On 10/28/2017 02:41 AM, Nico Williams wrote:
> On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
>>> + * We use an optimisation that initially we store the uint32 values directly,
>>> + * without the extra hashing step. And only later filling the bitmap space,
>>> + * we switch to the regular bloom filter mode.
>>>
>>> I don't think that optimization is worthwhile.  If I'm going to be using
>>> a Bloom filter, it's because I expect to have a lot of rows.
>>>
>>> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
>>> table just won't have lots of rows, then I might want this optimization,
>>> but I can always just drop the Bloom filter index, or not include
>>> indexes in the LIKE.)
>>
>> I think you're confusing "rows" and "distinct values". Furthermore, it's
>> rather irrelevant how many rows are in the table because BRIN indexes
>> split the table into "ranges" that are 1MB by default. And you can only
>> squash certain number of rows into such range.
>>
>> The idea of the optimization is to efficiently support cases where each
>> range contains only small number of distinct values - which is quite
>> common in the cases I described in my initial message (with bursts of
>> rows with the same IP / client identifier etc.).
> 
> What range?  It's just bits to set.
> 
> The point is that Bloom filters should ideally be about 50% full -- much
> less than that and you are wasting space, much more than than and your
> false positive rate becomes useless.

The range defined by BRIN indexes. BRIN indexes split the table into a
sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom
filters are built on those table "chunks". So it's irrelevant how many
rows are in the table in total; what matters is just the page range.

> 
>>> Filter compression is not worthwhile.  You want to have a fairly uniform
>>> hash distribution, and you want to end up with a Bloom filter that's not
>>> much more than 50% full.  That means that when near 50% full the filter
>>> will not be very sparse at all.  Optimizing for the not common case
>>> doesn't seem worthwhile, and adds complexity.
>>
>> Properly sizing the bloom filter requires knowledge of many variables,
>> in particular the number of distinct values expected to be added to the
>> filter. But we don't really know that number, and we also don't even
>> know many values useful for estimating that (how many rows will fit into
>> a range, number of distinct values in the whole table, etc.)
> 
> This is why Scalable Bloom filters exist: so you can adapt.
> 
>> So the idea was to oversize the bloom filter, and then use the sparse
>> representation to reduce the size.
> 
> A space-efficient representation of sparse bitmaps is not as simple as a
> Scalable Bloom filter.
> 
> And a filter that requires user intervention to size correctly, or which
> requires rebuilding when it gets too full, is also not as convenient as
> a Scalable Bloom filter.
> 

Maybe. For now I'm fine with the simple relopts-based approach and don't
plan to spend much time on these improvements. Which is not to say that
they may not be worthwhile, but it's not the only thing I'm working on.

I also suspect there are practical implications, as some things are only
possible with equally-sized bloom filters. I believe the "union"
(missing in the current patch) will rely on that when merging bloom filters.
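
For illustration, with equally-sized filters that share the same hash
functions the union is just a bitwise OR of the bitmaps (a sketch, not the
patch code):

    #include <stddef.h>

    /*
     * Merge two bloom filters built with the same nbits and the same hash
     * functions: every bit set by either input stays set, so the result
     * behaves like a filter built over the union of the two page ranges.
     */
    static void
    bloom_union(unsigned char *dst, const unsigned char *src, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++)
            dst[i] |= src[i];
    }

With differently sized (or differently hashed) sub-filters the bit positions
no longer line up, so a union would have to rehash or rebuild the filters,
which is the practical concern mentioned above.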

>>> + * XXX We can also watch the number of bits set in the bloom filter, and
>>> + * then stop using it (and not store the bitmap, to save space) when the
>>> + * false positive rate gets too high.
>>>
>>> Ah, right, what you want is a "Scalable Bloom Filter".
>>
>> That's not what I had in mind. My idea was to size the bloom filter on
>> "best effort" basis, and then if one range gets a bit too inaccurate
>> then just get rid of the filter. If that happens for many ranges, we
>> should probably allow specifying parameters as relopts for the index.
> 
> Scalable Bloom filters are way more convenient than that.  They're
> always not-too-full, and only the open filter is less-than-full-enough.
> 
> And since you should grow them somewhat exponentially (but not quite as
> much as a doubling in each generation), there should never be too many
> filters to search.  But you can always "vacuum" (rebuild) the filter
> starting with the size of the last filter added prior to the vacuum.
> 
>> I think this is really an over-engineering, and I certainly don't plan
>> to extend the patch in this direction.
> 
> Whereas I think a sparse bitmap representation is overly complex and
> "over-engineering".  Scalable Bloom filters are very well understood in
> the literature -- there's nothing terribly complex to them.
> 

That "sparse bitmap" is mentioned merely in an XXX comment as a thing to
consider, not something I plan to work on right now. Perhaps it's not
really a good idea in general, or maybe it's addressing a non-issue. In
which case we don't need scalable bloom filters either.

>> I do not expect these parameters (particularly the number of distinct
>> values in a range) to significantly change over time, so the easiest
>> solution is to provide a reloption to specify that number in
>> CREATE/ALTER index.
> 
> Doesn't this depend on the use-case though?  I think a self-tuning
> system is better than one that doesn't self-tune.  Call that
> over-engineering if you like, but it doesn't make it not desirable :)
> 

I'm simply not convinced we need that additional complexity, which has
its cost too. Also, if "self-tuning" means creating multiple
bloom filters with different sizes, then it may have an impact on some
operations (like "union").

>> Alternatively, we could allow the summarization process to gather some
>> statistics (either on the whole range, or perhaps the whole table), and
>> store them somewhere for later use. For example to compute the number of
>> distinct values per range (in the existing data), and use that later.
> 
> Again, Scalable Bloom filters automatically adapt without needing a
> statistics gathering exercise.  All you need is a good hash function
> (that's another topic).
> 
> Scalable Bloom filters are a trivial extension of Bloom filters.
> 

Sure. But it's additional complexity and I'm not convinced we need it
(considering how BRIN splits the table into ranges of limited size).
Hence I'll adopt the simple approach in my patch, and if it turns out to
be necessary, it can be improved in this direction.

>>> What I'm getting at is that the query planner really needs to understand
>>> that a Bloom filter is a probabilistic data structure.
>>
>> It does, and we never produce incorrect results. Which seems like the
>> right thing to do.
> 
> OK, I saw your other response about this.  I didn't know that PG already
> has support for probabilistic indexes.
> 

It's not so much "probabilistic" as understanding that some indexes may
produce false positives, in which case a recheck is needed.

> Nico
> 

Thanks for the comments and feedback!

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

