Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WIP: BRIN multi-range indexes
Msg-id 20200924235045.bgxsilvf7askukph@development
In response to Re: WIP: BRIN multi-range indexes  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: WIP: BRIN multi-range indexes
List pgsql-hackers
On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:
>On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
>
>> >While playing around with the numbers I had an epiphany: At the
>> >defaults, the filter already takes up ~4.3kB, over half the page.
>> >There is no room for another tuple, so if we're only indexing one
>> >column, we might as well take up the whole page.
>>
>> Hmm, yeah. I may be wrong, but IIRC indexes don't support external
>> storage, though compression is still allowed. So even if those defaults
>> are a bit higher than needed, that should make the bloom filters a bit
>> more compressible, and thus fit multiple BRIN tuples on a single page.
>
>> Not sure about how much we want to rely on these optimizations, though,
>> considering multi-column indexes kinda break this.
>
>Yeah. Okay, then it sounds like we should go in the other direction,
>as the block comment at the top of brin_bloom.c implies. Indexes with
>multiple bloom-indexed columns already don't fit in one 8kB page, so I
>think every documented example should have a much lower
>pages_per_range. Using 32 pages per range with max tuples gives n =
>928. With default p, that's about 1.1 kB per brin tuple, so one brin
>page can index 224 pages, much more than with the default 128.
>
>Hmm, how ugly would it be to change the default range size depending
>on the opclass?
>

Not sure. What would happen for multi-column BRIN indexes with different
opclasses?
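
FWIW, the sizing math above checks out against the textbook formulas;
a standalone sketch (plain C, not code from the patch):

#include <math.h>
#include <stdio.h>

/*
 * Textbook Bloom filter sizing: for n items and a target false
 * positive rate p, the optimal size is m = -n * ln(p) / ln(2)^2 bits
 * and the optimal number of hash functions is k = (m / n) * ln(2).
 * With n = 928 and p = 0.01 this prints m = 8895 bits (~1.1 kB) and
 * k = 7, matching the numbers quoted above.
 */
int
main(void)
{
    double  n = 928;    /* rows per 32-page range, per the mail */
    double  p = 0.01;   /* default false positive rate */
    double  m = ceil(-n * log(p) / (M_LN2 * M_LN2));
    int     k = (int) round(m / n * M_LN2);

    printf("m = %.0f bits (%.2f kB), k = %d\n", m, m / 8192.0, k);
    return 0;
}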

>If indexes don't support external storage, that sounds like a pain to
>add. Also, with very small fpr, you can easily get into many megabytes
>of filter space, which kind of defeats the purpose of brin in the
>first place.
>

True.

>There is already this item from the brin readme:
>
>* Different-size page ranges?
>  In the current design, each "index entry" in a BRIN index covers the same
>  number of pages.  There's no hard reason for this; it might make sense to
>  allow the index to self-tune so that some index entries cover smaller page
>  ranges, if this allows the summary values to be more compact.  This would
>  incur larger BRIN overhead for the index itself, but might allow better
>  pruning of
>  page ranges during scan.  In the limit of one index tuple per page, the index
>  itself would occupy too much space, even though we would be able to skip
>  reading the most heap pages, because the summary values are tight; in the
>  opposite limit of a single tuple that summarizes the whole table, we wouldn't
>  be able to prune anything even though the index is very small.  This can
>  probably be made to work by using the range map as an index in itself.
>
>This sounds like a lot of work, but would be robust.
>

Yeah. I think it's a fairly independent / orthogonal project.

>Anyway, given that this is a general problem and not specific to the
>prime partition algorithm, I'll leave that out of the attached patch,
>named as a .txt to avoid confusing the cfbot.
>
>> >We could also generate the primes via a sieve instead, which is really
>> >fast (and more code). That would be more robust, but that would require
>> >the filter to store the actual primes used, so 20 more bytes at max k =
>> >10. We could hard-code space for that, or to keep from hard-coding
>> >maximum k and thus lowest possible false positive rate, we'd need more
>> >bookkeeping.
>> >
>>
>> I don't think the efficiency of this code matters too much - it's only
>> used once when creating the index, so the simpler the better. Certainly
>> for now, while testing the partitioning approach.
>
>To check my understanding, isn't bloom_init() called for every tuple?
>Agreed on simplicity so done this way.
>

No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).
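
So the flow is roughly this shape (hypothetical names throughout,
apart from bloom_init(), just to illustrate the lazy initialization):

typedef struct BloomFilter BloomFilter;     /* opaque here */

extern BloomFilter *bloom_init(int ndistinct, double false_positive_rate);
extern void bloom_add_element(BloomFilter *filter, unsigned int hash);

/*
 * The filter is created lazily: bloom_init() runs only for the first
 * non-NULL value in the page range, and each later value merely sets
 * bits in the already-built filter.
 */
static BloomFilter *
add_value(BloomFilter *filter, unsigned int hash,
          int ndistinct, double false_positive_rate)
{
    if (filter == NULL)
        filter = bloom_init(ndistinct, false_positive_rate);
    bloom_add_element(filter, hash);
    return filter;
}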

>> >So, with the partition approach, we'd likely have to set in stone
>> >either max nbits, or min target false positive rate. The latter option
>> >seems more principled, not only for the block size, but also since the
>> >target fp rate is already fixed by the reloption, and as I've
>> >demonstrated, we can often go above and beyond the reloption even
>> >without changing k.
>> >
>>
>> That seems like a rather annoying limitation, TBH.
>
>I don't think the latter is that bad. I've capped k at 10 for
>demonstration's sake:
>
>(928 is from using 32 pages per range)
>
>  n   k      m  p
>928   7   8895  0.01
>928  10  13343  0.001    (lowest p supported in patch set)
>928  13  17790  0.0001
>928  10  18280  0.0001   (still works with lower k, needs higher m)
>928  10  17790  0.00012  (even keeping m from row #3, capping k
>                          doesn't degrade p much)
>
>Also, k seems pretty robust against small changes as long as m isn't
>artificially constrained and as long as p is small.
>
>So I *think* it's okay to cap k at 10 or 12, and not bother with
>adjusting m, which worsens space issues. As I found before, lowering k
>raises target fpr, but seems more robust to overshooting ndistinct. In
>any case, we only need k * 2 bytes to store the partition lengths.
>
>The only way I see to avoid any limitation is to make the array of
>primes variable length, which could be done by putting the filter
>offset calculation into a macro. But having two variable-length arrays
>seems messy.
>

Hmmm. I wonder how these limitations would impact the conclusions from
the one-hashing paper? Or was this just for the sake of a demonstration?
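
FWIW, the table itself checks out against the usual approximation
p = (1 - e^(-k*n/m))^k. A quick standalone check (not patch code):

#include <math.h>
#include <stdio.h>

/* Recompute p for each (n, k, m) row of the table above. */
int
main(void)
{
    double  n = 928;
    struct { int k; int m; } rows[] = {
        {7, 8895}, {10, 13343}, {13, 17790}, {10, 18280}, {10, 17790}
    };

    for (int i = 0; i < 5; i++)
    {
        double  p = pow(1.0 - exp(-rows[i].k * n / rows[i].m), rows[i].k);

        printf("k = %2d  m = %5d  p = %g\n", rows[i].k, rows[i].m, p);
    }
    return 0;
}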

I'd suggest we just do the simplest thing possible (be it a hard-coded
table of primes or a sieve) and then evaluate if we need to do something
more sophisticated.
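
For the record, the sieve really is just a few lines; a minimal
standalone sketch (not patch code):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Sieve of Eratosthenes up to 'limit': afterwards is_prime[i] is true
 * exactly for the primes. The filter sizes discussed here are only
 * ~10-20k bits, so the sieve is tiny and effectively instant.
 */
static void
sieve(bool *is_prime, int limit)
{
    memset(is_prime, true, limit + 1);
    is_prime[0] = is_prime[1] = false;
    for (int i = 2; i * i <= limit; i++)
        if (is_prime[i])
            for (int j = i * i; j <= limit; j += i)
                is_prime[j] = false;
}

int
main(void)
{
    static bool is_prime[20001];

    sieve(is_prime, 20000);

    /* e.g. the largest prime not exceeding m/k = 13343/10 */
    for (int p = 1334; p >= 2; p--)
        if (is_prime[p])
        {
            printf("%d\n", p);  /* prints 1327 */
            break;
        }
    return 0;
}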

>> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>> >bits, but we can't set/read off the end of the filter. If we don't
>> >wrap, we could just skip reading/writing that bit. So a tiny portion of
>> >access would be at k - 1. The paper is not clear on what to do here,
>> >but they are explicit in minimizing the absolute value, which could go
>> >on either side.
>> >
>>
>> What I meant is that the paper says this:
>>
>>      Given a planned overall length mp for a Bloom filter, we usually
>>      cannot get k prime numbers to make their sum mf to be exactly mp. As
>>      long as the difference between mp and mf is small enough, it neither
>>      causes any trouble for the software implementation nor noticeably
>>      shifts the false positive ratio.
>>
>> Which I think means we can pick mp, generate k primes with sum mf close
>> to mp, and just use that with mf bits.
>
>Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
>Yeah, we can bump it up as well for the other case. I've done it that
>way.
>

OK
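
To make that concrete: plan an overall size mp, walk down from mp/k
collecting k primes, and size the filter at their sum mf. A minimal
standalone sketch (trial division for brevity; it only descends from
mp/k, so mf <= mp -- bumping up for the other case works the same way):

#include <stdbool.h>
#include <stdio.h>

static bool
is_prime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

int
main(void)
{
    int     mp = 8895;      /* planned size: the n = 928, p = 0.01 row */
    int     k = 7;
    int     mf = 0;

    /* collect the k consecutive primes at or below mp/k */
    for (int i = 0, p = mp / k; i < k; p--)
    {
        if (is_prime(p))
        {
            printf("partition %d: %d bits\n", i, p);
            mf += p;
            i++;
        }
    }
    printf("mf = %d bits (planned mp = %d)\n", mf, mp);
    return 0;
}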

>> >+    add_local_real_reloption(relopts, "false_positive_rate",
>> >+                             "desired false-positive rate for the bloom filters",
>> >+                             BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
>> >+                             0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));
>> >
>> >When I set fp to 1.0, the reloption code is okay with that, but then
>> >later the assertion gets triggered.
>> >
>>
>> Hmm, yeah. I wonder what to do about that, considering indexes with fp
>> 1.0 are essentially useless.
>
>Not just useless -- they're degenerate. When p = 1.0, m = k = 0, so we
>cannot accept this value from the user. Looking up-thread, 0.1 was
>suggested as a limit. That might be a good starting point.
>

Makes sense, I'll fix it that way.
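
I.e. keep the reloption as in the quoted hunk, just with the maximum
capped at 0.1 instead of 1.0, something like:

    add_local_real_reloption(relopts, "false_positive_rate",
                             "desired false-positive rate for the bloom filters",
                             BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
                             0.001, 0.1,
                             offsetof(BloomOptions, falsePositiveRate));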

>This is interesting work! Having gone this far, I'm going to put more
>attention to the multi-minmax patch and actually start performance
>testing.
>

Cool, thanks! I'll take a look at your one-hashing patch tomorrow.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


