Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From John Naylor
Subject Re: WIP: BRIN multi-range indexes
Date
Msg-id CAFBsxsFwLTg3qfs6RcoQouGDh54GUZLr7+BEXfo86qYPuNaZjA@mail.gmail.com
In response to Re: WIP: BRIN multi-range indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: WIP: BRIN multi-range indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> Overall, I think there's very little difference, particularly in the
> "match" cases when we're searching for a value that we know is in the
> table. The one-hash variant seems to perform a bit better, but the
> difference is fairly small.
>
> In the "mismatch" cases (searching for value that is not in the table)
> the differences are more significant, but it might be noise. It does
> seem much more "green" than "red", i.e. the one-hash variant seems to be
> faster (although this does depend on the values for formatting).
>
> To sum this up, I think the one-hash approach seems interesting. It's
> not going to give us huge speedups because we're only hashing int32
> values anyway (not the source data), but it's worth exploring.

Thanks for testing! It seems you tested against the version with two moduli, and not the alternative discussed in


which would in fact be rehashing the 32-bit values. I think that would be the way to go if we don't use the one-hash approach.
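
For concreteness, here's a minimal sketch of what that rehashing could look like, borrowing the fmix32 finalizer steps from MurmurHash3. The function and variable names are mine, for illustration only, not the patch's:

#include <stdint.h>

/* MurmurHash3 fmix32 finalizer: mixes all 32 input bits */
static uint32_t
rehash32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/* Derive nhashes bit positions from one stored 32-bit hash value */
static void
bloom_positions(uint32_t hash, int nhashes, int nbits, int *positions)
{
    int     i;

    for (i = 0; i < nhashes; i++)
    {
        hash = rehash32(hash);
        positions[i] = hash % nbits;
    }
}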

> a) set_bloom_partitions does this:
>
>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>        pidx++;
>
> which is broken, because the second part of the condition only checks
> the current index - we may end up using nhashes primes after that, and
> some of them may be 0. So this needs to be:
>
>     while (primes[pidx + nhashes - 1] <= target &&
>            primes[pidx + nhashes] > 0)
>        pidx++;

Good catch.

> b) set_bloom_partitions does this to generate primes:
>
>     /*
>      * Increase the limit to ensure we have some primes higher than
>      * the target partition length. The 100 value is arbitrary, but
>      * should be well over what we need.
>      */
>     primes = generate_primes(target_partlen + 100);
>
> It's not clear to me why 100 is sufficient, particularly for large page
> sizes. AFAIK the primes get more and more sparse, so how does this
> guarantee we'll get enough "sufficiently large" primes?

This value is not rigorous and should be improved; I arrived at it by looking at the table in section 3 of

https://primes.utm.edu/notes/gaps.html

I see two ways to make a stronger guarantee:

1. Take the average gap between primes near n, which is log(n), and multiply it by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target seems a better heuristic than a constant, and is still simple to calculate (see the sketch after this list).

With the pathological example you gave of n=575104, k=3 (target_partlen = 191701), the number to add is log(191701) * 10 = 122.  By the table referenced above, the largest prime gap under 360653 is 95, so we're guaranteed to find at least one prime in the space of 122 above the target. That will likely be enough to find the closest-to-target filter size for k=3. Even if it weren't, nbits is so large that the relative difference is tiny. I'd say a heuristic like this is most likely to be off precisely when it matters the least. At this size, even if we find zero primes above our target, the relative filter size is close to 

(575104 - 3 * 95) / 575104 = 0.9995

For a more realistic bad case, take a target partition length of 1327: log(1327) * 10 = 72. There are 33 composites after 1327 (the largest such gap below 9551), so searching 72 above the target still finds five primes larger than it:

1361   1367   1373   1381   1399

which is more than enough for k<=10:

1297 + 1301 + 1303 + 1307 + 1319 + 1321 + 1327 + 1361 + 1367 + 1373 = 13276

2. Use a "segmented range" algorithm for the sieve and iterate until we get k*2 primes, half below and half above the target. This would be an absolute guarantee, but also more code, so I'm inclined against that.
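
To make option 1 concrete (the sketch referenced above), the limit calculation could look like this; standalone C with an illustrative function name, and BLOOM_MAX_NUM_PARTITIONS = 10 matching the k <= 10 assumption:

#include <math.h>

#define BLOOM_MAX_NUM_PARTITIONS 10     /* assumed, to match the numbers above */

/*
 * Extend the sieve limit past the target by the average prime gap
 * near it, log(target), once per partition we might need.
 */
static int
bloom_sieve_limit(int target_partlen)
{
    return target_partlen +
        (int) ceil(log(target_partlen) * BLOOM_MAX_NUM_PARTITIONS);
}

For the examples above this gives 191701 + 122 and 1327 + 72, matching the numbers quoted.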

> c) generate_primes uses uint16 to store the primes, so it can only
> generate primes up to 32768. That's (probably) enough for 8kB pages, but
> for 32kB pages it's clearly insufficient.

Okay.

> As for the original question how expensive this naive sieve is, I
> haven't been able to measure any significant timings. The logging around
> generate_primes usually looks like this:
>
> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
> 575104 nhashes 3 target_partlen 191701
> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>
> So it takes 0.000 seconds for this extreme page size. I don't think we
> need to invent anything more elaborate.

Okay, good to know. If we were concerned about memory, we could have it check only odd numbers. That's a common feature of sieves, but also makes the code a bit harder to understand if you haven't seen it before.
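
In case it's useful, here's what an odds-only sieve could look like; this is a standalone sketch, not the patch's generate_primes(), and it assumes limit >= 3:

#include <stdbool.h>
#include <stdlib.h>

/*
 * Odds-only sieve of Eratosthenes: slot i stands for the odd number
 * 2*i + 3, so the bookkeeping array is half the size of a naive
 * sieve.  Writes up to maxprimes primes <= limit into primes[] and
 * returns how many were found.
 */
static int
generate_odd_primes(int limit, int *primes, int maxprimes)
{
    int     nslots = (limit - 3) / 2 + 1;   /* odds 3, 5, ..., limit */
    bool   *composite = calloc(nslots, sizeof(bool));
    int     nprimes = 0;
    int     i;

    if (nprimes < maxprimes)
        primes[nprimes++] = 2;  /* the only even prime */

    for (i = 0; i < nslots; i++)
    {
        int     p = 2 * i + 3;
        long    j;

        if (composite[i])
            continue;
        if (nprimes < maxprimes)
            primes[nprimes++] = p;

        /* cross off the odd multiples p*p, p*(p+2), ... */
        for (j = (long) p * p; j <= limit; j += 2 * p)
            composite[(j - 3) / 2] = true;
    }

    free(composite);
    return nprimes;
}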

Also to fill in something I left for later, the reference for this

/* upper bound of number of primes below limit */
/* WIP: reference for this number */
int numprimes = 1.26 * limit / log(limit);

is

Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for some functions of prime numbers". Illinois J. Math. 6: 64–94. doi:10.1215/ijm/1255631807

More precisely, it's 30*log(113)/113 rounded up.
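
(Spelled out, the bound from that paper is pi(x) < 1.25506 * x / log(x) for x > 1, and 30 * log(113) / 113 = 1.25506..., so rounding the constant up to 1.26 keeps numprimes an overestimate.)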

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
