From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Msg-id: fbb3583b-35cb-8415-e719-1836865543f7@enterprisedb.com
In response to: Re: WIP: BRIN multi-range indexes (John Naylor <john.naylor@enterprisedb.com>)

Hi,

Here's an updated and significantly improved version of the patch
series, particularly the multi-minmax part. I've fixed a number of
stupid bugs in it, discovered either by valgrind or by stress-tests.

I was surprised by some of the bugs - or rather, by the fact that the
existing regression tests failed to catch them - so it's probably worth
briefly discussing the details. There were two main classes of such bugs:


1) missing datumCopy

AFAICS this happened because a couple of datumCopy calls were missing.
A BRIN range covers multiple pages, so with by-ref data types we kept a
pointer into a shared buffer, but the buffer might have been evicted in
the meantime. The regular regression tests passed just fine, because
brin_multi runs almost on its own, so the chance of the buffer being
evicted was low. Valgrind reported this (with a rather enigmatic
message, as usual), and so did a simple stress-test creating many
indexes concurrently. I think anything causing aggressive eviction of
buffers would do the trick, triggering segfaults, assert failures, etc.
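
For illustration, the missing piece amounts to roughly this (a sketch
with illustrative variable names, not the patch's actual code):

    Form_pg_attribute attr = TupleDescAttr(tupdesc, i);
    Datum       value = values[i];

    /*
     * For by-ref types, copy the value into the current memory context.
     * Otherwise we'd keep a pointer into a shared buffer, which may get
     * evicted before the summary is used again.
     */
    if (!attr->attbyval)
        value = datumCopy(value, attr->attbyval, attr->attlen);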


2) bogus opclass definitions

A couple of opclasses referenced an incorrect distance function, one
intended for a different data type. I was scratching my head over why
the regression tests pass, as there is a table for building a
multi-minmax index on every supported data type. The reason is pretty
silly - the table is very small, just 100 rows, with a very low
fillfactor (so only a couple of values per page), and the index was
created with pages_per_range=1. So the compaction was never triggered
and the distance function was never actually called. I've decided to
build the indexes on a larger data set first, to test this, but maybe
this needs a somewhat different approach.
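
To illustrate what the opclass entries are supposed to reference, a
minmax-multi distance function looks roughly like this (a sketch - the
actual function names and details in the patch may differ):

    /*
     * Distance between two int4 values, as a float8. Each data type
     * needs its own version - an opclass referencing the function for
     * a different type produces bogus distances, but only once the
     * compaction actually runs.
     */
    Datum
    brin_minmax_multi_distance_int4(PG_FUNCTION_ARGS)
    {
        int32       a = PG_GETARG_INT32(0);
        int32       b = PG_GETARG_INT32(1);

        PG_RETURN_FLOAT8((float8) b - (float8) a);
    }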


BLOOM
-----

The attached patch series addresses comments from the last review. As
for the size limit, I've defined a new macro

    #define BloomMaxFilterSize \
        MAXALIGN_DOWN(BLCKSZ - \
                      (MAXALIGN(SizeOfPageHeaderData + \
                                sizeof(ItemIdData)) + \
                       MAXALIGN(sizeof(BrinSpecialSpace)) + \
                       SizeOfBrinTuple))

and use that to determine whether the bloom filter is too large. IMO
that's close enough, considering this is a best-effort check anyway (it
cannot account for multi-column indexes).
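
The check itself is then trivial - something like this sketch, where
filter_size and the error wording are illustrative:

    /*
     * Best-effort check: reject bloom filters that could not possibly
     * fit on a single index page.
     */
    if (filter_size > BloomMaxFilterSize)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("bloom filter size %zu exceeds maximum %zu",
                        (Size) filter_size, (Size) BloomMaxFilterSize)));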


MINMAX-MULTI
------------

As mentioned, there's a lot of fixes and improvements in this part, but
the basic principle is still the same. I've kept it split into three
parts with different approaches to building, so that it's possible to do
benchmarks and comparisons, and pick the best one.

a) 0005 - Aggressively compacts the summary, always keeping it within
the limit defined by values_per_range. So if a range contains more
values than that, this may trigger compaction very often (e.g. for
monotonic series).

One drawback is that the more often compactions happen, the less
optimal the result is - the algorithm is kinda greedy, picking something
like a local optimum in each step.

b) 0006 - Batch build, exactly the opposite of 0005. Accumulates all
values in a buffer, then does a single round of compaction at the very
end. This obviously doesn't have the "greediness" issues, but it may
consume quite a bit of memory for some data types and/or indexes with
large BRIN ranges.

c) 0007 - A hybrid approach, using a buffer that is a multiple of the
user-specified value, with some safety min/max limits. IMO this is what
we should use, although perhaps with some tuning of the exact limits.
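
To make the hybrid approach concrete, the buffer sizing in 0007 boils
down to something like this (a sketch with made-up constants - the
actual limits in the patch differ):

    /*
     * Buffer a multiple of the user-specified values_per_range, clamped
     * to sanity limits so that neither tiny nor huge settings lead to
     * pathological behavior. All constants here are illustrative.
     */
    #define MINMAX_BUFFER_FACTOR    10
    #define MINMAX_BUFFER_MIN       256
    #define MINMAX_BUFFER_MAX       8192

    static int
    minmax_multi_buffer_size(int values_per_range)
    {
        int         size = values_per_range * MINMAX_BUFFER_FACTOR;

        return Min(Max(size, MINMAX_BUFFER_MIN), MINMAX_BUFFER_MAX);
    }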


Attached is a spreadsheet with benchmark results for each of the three
approaches, on different data types (byval/byref), data set types, index
parameters (pages/values per range), etc. I think 0007 is a reasonable
compromise overall, with performance somewhere between 0005 and 0006. Of
course, there are cases where it's somewhat slow, e.g. data types with
expensive comparisons combined with data sets forcing frequent
compactions, where it's ~10x slower than regular minmax (in most cases
it's ~1.5x). Compared to btree it's usually much faster, typically ~2-3x
(except for some extreme cases, of course).


As for the opclasses for data types without a "natural" distance
function, implemented in 0008, I think we should drop that part. In
theory it works, but I'm far from convinced it's actually useful in
practice. Essentially, if you have a data type with an ordering but no
meaningful concept of distance, it's hard to say what an outlier is.
I'd bet most such data types are used as "labels" where even the
ordering is kinda useless - hardly anyone runs range queries on things
like names; it's all just equality searches. Which means the bloom
indexes are a much better match for this use case.


The other thing we were considering was making the new multi-minmax
opclasses the defaults, replacing the existing minmax ones. IMHO we
shouldn't do that either. For existing minmax indexes it makes no
difference (the opclass apparently works well enough, otherwise the
index would have been dropped). But even for new indexes I'm not sure
it's the right thing to do, so I don't plan to change this.


I'm also attaching the stress-test I used to test the hell out of this,
creating indexes on various data sets and data types, with varying
index parameters, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
