Home > mailing lists

Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From	John Naylor
Subject	Re: WIP: BRIN multi-range indexes
Date	September 18, 2020 21:06:49
Msg-id	CACPNZCvjt_Td2q3OqZdo_qZUjX_j+9Gotru1eEe4UVq02fLM+A@mail.gmail.com Whole thread Raw
In response to	Re: WIP: BRIN multi-range indexes (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: WIP: BRIN multi-range indexes
List	pgsql-hackers

Tree view

On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
> >I wrote:
> >
> >> Hmm, I came across that paper while doing background reading. Okay,
> >> now I get that "% (filter->nbits - 1)" is the second hash function in
> >> that scheme. But now I wonder if that second function should actually
> >> act on the passed "value" (the original hash), so that they are
> >> actually independent, as required. In the language of that paper, the
> >> patch seems to have
> >>
> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
> >>
> >> instead of
> >>
> >> g(x) = h1(x) + i*h2(x) + f(i)
> >>
> >> Concretely, I'm wondering if it should be:
> >>
> >>  big_h = DatumGetUint32(hash_uint32(value));
> >>  h = big_h % filter->nbits;
> >> -d = big_h % (filter->nbits - 1);
> >> +d = value % (filter->nbits - 1);
> >>
> >> But I could be wrong.
> >
> >I'm wrong -- if we use different operands to the moduli, we throw away
> >the assumption of co-primeness. But I'm still left wondering why we
> >have to re-hash the hash for this to work. In any case, there should
> >be some more documentation around the core algorithm, so that future
> >readers are not left scratching their heads.
> >
>
> Hmm, good question. I think we don't really need to hash it twice. It
> does not rally achieve anything - it won't reduce number of collisions
> or anything like that.

Yeah, looking back at the discussion you linked previously, I think
it's a holdover from when the uint32 was rehashed with k different
seeds. Anyway, after thinking about it some more, I still have doubts
about the mapping algorithm. There are two stages to a hash mapping --
hashing and modulus. I don't think a single hash function (whether
rehashed or not) can be turned into two independent functions via a
choice of second modulus. At least, that's not what the Kirsch &
Mitzenmacher paper is claiming. Since we're not actually applying two
independent hash functions on the scan key, we're kind of shooting in
the dark.

It turns out there is something called a one-hash bloom filter, and
the paper in [1] has a straightforward algorithm. Since we can
implement it exactly as stated in the paper, that gives me more
confidence in the real-world false positive rate. It goes like this:

Partition the filter bitmap into k partitions of similar but unequal
length, corresponding to consecutive prime numbers. Use the primes for
moduli of the uint32 value and map it to the bit of the corresponding
partition. For a simple example, let's use 7, 11, 13 for partitions in
a filter of size 31. The three bits are:

value % 7
7 + (value % 11)
7 + 11 + (value % 13)

We could store a const array of the first 256 primes. The largest such
prime is 1613, so with k=7 we can support up to ~11k bits, which is
more than we'd like to store anyway. Then we store the array index of
the largest prime in the 8bits of padding we currently have in
BloomFilter struct.

One wrinkle is that the sum of k primes is not going to match m
exactly. If the sum is too small, we can trim a few bits off of the
filter bitmap. If the sum is too large, the last partition can spill
into the front of the first one. This shouldn't matter much in the
common case since we need to round m to the nearest byte anyway.

This should be pretty straightforward to turn into code and I can take
a stab at it. Thoughts?

[1] https://www.researchgate.net/publication/284283336_One-Hashing_Bloom_Filter

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Tom Lane
Date: 18 September 2020, 20:48:25
Subject: Re: factorial function/phase out postfix operators?

From: Tom Lane
Date: 18 September 2020, 22:05:47
Subject: Re: XversionUpgrade tests broken by postfix operator removal

Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

Previous

Next