Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Msg-id: 20200924235045.bgxsilvf7askukph@development
In response to: Re: WIP: BRIN multi-range indexes (John Naylor <john.naylor@2ndquadrant.com>)
Responses: Re: WIP: BRIN multi-range indexes
List: pgsql-hackers
On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:
>On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
>
>> >While playing around with the numbers I had an epiphany: At the
>> >defaults, the filter already takes up ~4.3kB, over half the page.
>> >There is no room for another tuple, so if we're only indexing one
>> >column, we might as well take up the whole page.
>>
>> Hmm, yeah. I may be wrong, but IIRC indexes don't support external
>> storage, though compression is still allowed. So even if those
>> defaults are a bit higher than needed, that should make the bloom
>> filters a bit more compressible, and thus fit multiple BRIN tuples
>> on a single page.
>
>> Not sure how much we want to rely on these optimizations, though,
>> considering multi-column indexes kinda break this.
>
>Yeah. Okay, then it sounds like we should go in the other direction,
>as the block comment at the top of brin_bloom.c implies. Indexes with
>multiple bloom-indexed columns already don't fit in one 8kB page, so I
>think every documented example should have a much lower
>pages_per_range. Using 32 pages per range with max tuples gives n =
>928. With default p, that's about 1.1 kB per brin tuple, so one brin
>page can index 224 pages, much more than with the default 128.
>
>Hmm, how ugly would it be to change the default range size depending
>on the opclass?
>

Not sure. What would happen for multi-column BRIN indexes with
different opclasses?
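As an aside, John's figures drop straight out of the textbook Bloom
filter sizing formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
A trivial standalone sanity check of the arithmetic (textbook math only,
nothing from the patch):

    #include <math.h>
    #include <stdio.h>

    /*
     * Textbook Bloom filter sizing: given n distinct values and a target
     * false positive rate p, compute the optimal number of bits m and
     * number of hash functions k.
     */
    int
    main(void)
    {
        double  n = 928;        /* max tuples in a 32-page range */
        double  p = 0.01;       /* default false positive rate */
        double  ln2 = log(2.0);

        double  m = ceil(-n * log(p) / (ln2 * ln2));
        int     k = (int) round(m / n * ln2);

        /* prints: m = 8895 bits (~1.1 kB), k = 7 */
        printf("m = %.0f bits (~%.1f kB), k = %d\n", m, m / 8192.0, k);
        return 0;
    }

That prints m = 8895 bits (~1.1 kB) and k = 7, which agrees with the
~1.1 kB per brin tuple above and with the first row of the table
further down.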
>If indexes don't support external storage, that sounds like a pain to
>add. Also, with very small fpr, you can easily get into many megabytes
>of filter space, which kind of defeats the purpose of brin in the
>first place.
>

True.

>There is already this item from the brin readme:
>
>* Different-size page ranges?
>  In the current design, each "index entry" in a BRIN index covers the same
>  number of pages. There's no hard reason for this; it might make sense to
>  allow the index to self-tune so that some index entries cover smaller page
>  ranges, if this allows the summary values to be more compact. This would
>  incur larger BRIN overhead for the index itself, but might allow better
>  pruning of page ranges during scan. In the limit of one index tuple per
>  page, the index itself would occupy too much space, even though we would
>  be able to skip reading most heap pages, because the summary values are
>  tight; in the opposite limit of a single tuple that summarizes the whole
>  table, we wouldn't be able to prune anything even though the index is
>  very small. This can probably be made to work by using the range map as
>  an index in itself.
>
>This sounds like a lot of work, but would be robust.
>

Yeah. I think it's a fairly independent / orthogonal project.

>Anyway, given that this is a general problem and not specific to the
>prime partition algorithm, I'll leave that out of the attached patch,
>named as a .txt to avoid confusing the cfbot.
>
>> >We could also generate the primes via a sieve instead, which is really
>> >fast (and more code). That would be more robust, but that would require
>> >the filter to store the actual primes used, so 20 more bytes at max
>> >k = 10. We could hard-code space for that, or to keep from hard-coding
>> >maximum k and thus lowest possible false positive rate, we'd need more
>> >bookkeeping.
>> >
>>
>> I don't think the efficiency of this code matters too much - it's only
>> used once when creating the index, so the simpler the better. Certainly
>> for now, while testing the partitioning approach.
>
>To check my understanding, isn't bloom_init() called for every tuple?
>Agreed on simplicity so done this way.
>

No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).

>> >So, with the partition approach, we'd likely have to set in stone
>> >either max nbits, or min target false positive rate. The latter option
>> >seems more principled, not only for the block size, but also since the
>> >target fp rate is already fixed by the reloption, and as I've
>> >demonstrated, we can often go above and beyond the reloption even
>> >without changing k.
>> >
>>
>> That seems like a rather annoying limitation, TBH.
>
>I don't think the latter is that bad. I've capped k at 10 for
>demonstration's sake:
>
>(928 is from using 32 pages per range)
>
> n    k   m      p
> 928   7   8895  0.01
> 928  10  13343  0.001    (lowest p supported in patch set)
> 928  13  17790  0.0001
> 928  10  18280  0.0001   (still works with lower k, needs higher m)
> 928  10  17790  0.00012  (even keeping m from row #3, capping k
>                           doesn't degrade p much)
>
>Also, k seems pretty robust against small changes as long as m isn't
>artificially constrained and as long as p is small.
>
>So I *think* it's okay to cap k at 10 or 12, and not bother with
>adjusting m, which worsens space issues. As I found before, lowering k
>raises the target fpr, but seems more robust to overshooting ndistinct.
>In any case, we only need k * 2 bytes to store the partition lengths.
>
>The only way I see to avoid any limitation is to make the array of
>primes variable length, which could be done by putting the filter
>offset calculation into a macro. But having two variable-length arrays
>seems messy.
>

Hmmm. I wonder how these limitations would impact the conclusions from
the one-hashing paper? Or was this just for the sake of a demonstration?

I'd suggest we just do the simplest thing possible (be it a hard-coded
table of primes or a sieve -- see the sketch further down) and then
evaluate if we need to do something more sophisticated.

>> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>> >bits, but we can't set/read off the end of the filter. If we don't
>> >wrap, we could just skip reading/writing that bit. So a tiny portion of
>> >access would be at k - 1. The paper is not clear on what to do here,
>> >but they are explicit in minimizing the absolute value, which could go
>> >on either side.
>> >
>>
>> What I meant is that the paper says this:
>>
>>   Given a planned overall length mp for a Bloom filter, we usually
>>   cannot get k prime numbers to make their sum mf to be exactly mp.
>>   As long as the difference between mp and mf is small enough, it
>>   neither causes any trouble for the software implementation nor
>>   noticeably shifts the false positive ratio.
>>
>> Which I think means we can pick mp, generate k primes with sum mf close
>> to mp, and just use that with mf bits.
>
>Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
>Yeah, we can bump it up as well for the other case. I've done it that
>way.
>

OK
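To make the "simplest thing possible" concrete, here is roughly what I
have in mind: a plain Eratosthenes sieve, then greedily take the k
largest primes at or below mp / k so their sum mf lands close to mp.
This is only an illustrative sketch under those assumptions -- not the
patch code, and the names are made up:

    #include <stdbool.h>
    #include <stdlib.h>

    /* Mark is_prime[0..limit] using the sieve of Eratosthenes. */
    static void
    sieve(bool *is_prime, int limit)
    {
        is_prime[0] = is_prime[1] = false;
        for (int i = 2; i <= limit; i++)
            is_prime[i] = true;
        for (int i = 2; (long) i * i <= limit; i++)
            if (is_prime[i])
                for (int j = i * i; j <= limit; j += i)
                    is_prime[j] = false;
    }

    /* Largest prime at or below n (assumes n >= 2). */
    static int
    prev_prime(const bool *is_prime, int n)
    {
        while (n > 2 && !is_prime[n])
            n--;
        return n;
    }

    /*
     * Pick k distinct primes just below mp / k as the partition lengths,
     * and return their sum mf, i.e. the actual number of filter bits.
     * Per the paper, a small |mp - mf| neither causes trouble for the
     * implementation nor noticeably shifts the false positive rate.
     */
    static int
    choose_partitions(int mp, int k, int *partlens)
    {
        bool   *is_prime = malloc((mp / k + 1) * sizeof(bool));
        int     next = mp / k;
        int     mf = 0;

        sieve(is_prime, mp / k);
        for (int i = 0; i < k; i++)
        {
            partlens[i] = prev_prime(is_prime, next);
            next = partlens[i] - 1;     /* keep the primes distinct */
            mf += partlens[i];
        }
        free(is_prime);
        return mf;
    }

Querying then needs just a single hash value h per element -- within
partition i we test the bit at (h % partlens[i]) plus that partition's
offset. A fancier version could pick consecutive primes straddling
mp / k to minimize |mp - mf|, but since this only runs when a range's
filter is first initialized, simple seems fine.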
>> >+ add_local_real_reloption(relopts, "false_positive_rate",
>> >+                          "desired false-positive rate for the bloom filters",
>> >+                          BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
>> >+                          0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));
>> >
>> >When I set fp to 1.0, the reloption code is okay with that, but then
>> >later the assertion gets triggered.
>> >
>>
>> Hmm, yeah. I wonder what to do about that, considering indexes with
>> fp = 1.0 are essentially useless.
>
>Not just useless -- they're degenerate. When p = 1.0, m = k = 0 -- we
>cannot accept this value from the user. Looking up-thread, 0.1 was
>suggested as a limit. That might be a good starting point.
>

Makes sense, I'll fix it that way (i.e. cap the reloption maximum at
0.1 instead of 1.0).

>This is interesting work! Having gone this far, I'm going to put more
>attention to the multi-minmax patch and actually start performance
>testing.
>

Cool, thanks! I'll take a look at your one-hashing patch tomorrow.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services