From: Tomas Vondra
Subject: [HACKERS] WIP: BRIN multi-range indexes
Msg-id: c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
List: pgsql-hackers

Hi all,

A couple of days ago I shared a WIP patch [1] implementing BRIN
indexes based on bloom filters. One inherent limitation of that
approach is that it can only support equality conditions - which is
perfectly fine in many cases (e.g. with UUIDs it's rare to use range
queries).

[1]
https://www.postgresql.org/message-id/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com

But in other cases that restriction is pretty unacceptable, e.g. with
timestamps that are queried mostly using range conditions. A common
issue is that while the data is initially well correlated (giving us
nice narrow min/max ranges in the BRIN index), this degrades over time
(typically due to DELETEs/UPDATEs, with the new row versions then
routed to free space anywhere in the table). There are not many
options to prevent this, and fixing it pretty much requires a CLUSTER
on the table.
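
Just to illustrate the "fix" - it means rewriting the table in value
order, which requires a btree index and takes an exclusive lock on the
table (sketched here for a hypothetical table t with column val):

    create index t_val_btree on t (val);
    cluster t using t_val_btree;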

This patch addresses this by introducing BRIN indexes with a more
complex "summary". Instead of keeping just a single minmax interval
per page range, we maintain a list of minmax intervals, which allows
us to track "gaps" in the data. For example, instead of summarizing a
page range as [1, 10000000], the summary may be a list like {[1,1],
[326400, 329663], [10000000, 10000000]}, so a condition like
(val = 100) can skip the range entirely.

To illustrate the improvement, consider this table:

    -- fillfactor 90 leaves free space on each page, so the updated
    -- rows (the outlier values) end up spread across the whole table
    create table a (val float8) with (fillfactor = 90);
    insert into a select i::float from generate_series(1,10000000) s(i);
    -- random() is evaluated per row, so each update hits ~1% of rows
    update a set val = 1 where random() < 0.01;
    update a set val = 10000000 where random() < 0.01;
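
As a quick sanity check (not part of the example itself), the
remaining physical correlation of the column can be inspected through
pg_stats:

    analyze a;
    select correlation from pg_stats
     where tablename = 'a' and attname = 'val';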

This means the column 'val' is almost perfectly correlated with the
position in the table (which would be great for BRIN minmax indexes),
but then ~1% of the values are set to 1 and another ~1% to 10,000,000.
As a result, pretty much every page range ends up as [1, 10000000],
which makes this BRIN index mostly useless, as illustrated by these
explain plans:

    create index on a using brin (val) with (pages_per_range = 16);

    explain analyze select * from a where val = 100;
                                  QUERY PLAN
    --------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=54.01..10691.02 rows=8 width=8)
                            (actual time=5.901..785.520 rows=1 loops=1)
       Recheck Cond: (val = '100'::double precision)
       Rows Removed by Index Recheck: 9999999
       Heap Blocks: lossy=49020
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..54.00 rows=3400 width=0)
             (actual time=5.792..5.792 rows=490240 loops=1)
             Index Cond: (val = '100'::double precision)
     Planning time: 0.119 ms
     Execution time: 785.583 ms
    (8 rows)

    explain analyze select * from a where val between 100 and 10000;
                                  QUERY PLAN
    ------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=55.94..25132.00 rows=7728 width=8)
                      (actual time=5.939..858.125 rows=9695 loops=1)
       Recheck Cond: ((val >= '100'::double precision) AND
                      (val <= '10000'::double precision))
       Rows Removed by Index Recheck: 9990305
       Heap Blocks: lossy=49020
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..54.01 rows=10200 width=0)
             (actual time=5.831..5.831 rows=490240 loops=1)
             Index Cond: ((val >= '100'::double precision) AND
                          (val <= '10000'::double precision))
     Planning time: 0.139 ms
     Execution time: 871.132 ms
    (8 rows)

Obviously, the queries scan the whole table and then eliminate most of
the rows in the "Index Recheck" step. Decreasing pages_per_range does
not make a measurable difference in this case - it eliminates maybe
10% of the rechecks, but most page ranges still end up with a very
wide minmax interval.
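
One way to confirm this is to look at the stored summaries directly,
e.g. using the pageinspect extension (block 0 is the metapage and
block 1 the revmap, so the regular BRIN data pages start at block 2):

    create extension pageinspect;

    select itemoffset, blknum, value
      from brin_page_items(get_raw_page('a_val_idx', 2), 'a_val_idx')
     limit 5;

With the skewed updates above, pretty much every range should show a
value like {1 .. 10000000}.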

With the patch, it looks like this:

    create index on a using brin (val float8_minmax_multi_ops)
                            with (pages_per_range = 16);

    explain analyze select * from a where val = 100;
                                  QUERY PLAN
    -------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=830.01..11467.02 rows=8 width=8)
                            (actual time=7.772..8.533 rows=1 loops=1)
       Recheck Cond: (val = '100'::double precision)
       Rows Removed by Index Recheck: 3263
       Heap Blocks: lossy=16
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..830.00 rows=3400 width=0)
             (actual time=7.729..7.729 rows=160 loops=1)
             Index Cond: (val = '100'::double precision)
     Planning time: 0.124 ms
     Execution time: 8.580 ms
    (8 rows)


    explain analyze select * from a where val between 100 and 10000;
                                 QUERY PLAN
    ------------------------------------------------------------------
     Bitmap Heap Scan on a  (cost=831.94..25908.00 rows=7728 width=8)
                        (actual time=9.318..23.715 rows=9695 loops=1)
       Recheck Cond: ((val >= '100'::double precision) AND
                      (val <= '10000'::double precision))
       Rows Removed by Index Recheck: 3361
       Heap Blocks: lossy=64
       ->  Bitmap Index Scan on a_val_idx
             (cost=0.00..830.01 rows=10200 width=0)
             (actual time=9.274..9.274 rows=640 loops=1)
             Index Cond: ((val >= '100'::double precision) AND
                          (val <= '10000'::double precision))
     Planning time: 0.138 ms
     Execution time: 36.100 ms
    (8 rows)
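
A side note: the opclasses added by the patch are regular catalog
entries, so it's easy to see what's available (the LIKE pattern
matches the float8 opclass used above, and presumably any other
"multi" opclasses the patch defines):

    select opcname from pg_opclass
     where opcname like '%minmax_multi%';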

That is, the timings drop from 785ms/871ms to 9ms/36ms. The index is a
bit larger (1700kB instead of 150kB), but it's still orders of
magnitude smaller than a btree index (which is ~214MB in this case).
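
For reference, the sizes are easy to check with the standard size
functions:

    select pg_size_pretty(pg_relation_size('a_val_idx'));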

The index build is slower than for regular BRIN indexes (roughly
comparable to a btree build), but I'm sure it can be significantly
improved. Also, I'm sure it's not bug-free.

Two additional notes:

1) The patch does break the current BRIN indexes, because it requires
passing all the ScanKeys to the "consistent" BRIN support function at
once (otherwise we couldn't eliminate individual intervals in the
summary), while currently BRIN only deals with one ScanKey at a time.
I haven't modified the existing brin_minmax_consistent() function
accordingly (yeah, I'm lazy), but this should be enough for interested
people to try it out, I believe.

2) It only works with float8 (and also the timestamp types) for now,
but it should be straightforward to add support for additional data
types. Most of that will be about adding catalog definitions anyway.
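
For example, for timestamps the usage should look about like this
(assuming the opclass follows the same naming convention as the float8
one - timestamp_minmax_multi_ops is my guess based on
float8_minmax_multi_ops, so double-check the actual name in the
patch):

    create table t (ts timestamp);
    insert into t
    select timestamp '2017-01-01' + i * interval '1 second'
      from generate_series(1,1000000) s(i);
    create index on t using brin (ts timestamp_minmax_multi_ops)
                            with (pages_per_range = 16);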


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
