Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WIP: BRIN multi-range indexes
Date
Msg-id 1fdcbd2a-ff0f-fb81-d227-22279d5cbcb7@enterprisedb.com
In response to Re: WIP: BRIN multi-range indexes  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: WIP: BRIN multi-range indexes  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers

On 11/9/20 3:29 PM, John Naylor wrote:
> On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com>>
> wrote:
> 
>> Overall, I think there's very little difference, particularly in the
>> "match" cases when we're searching for a value that we know is in the
>> table. The one-hash variant seems to perform a bit better, but the
>> difference is fairly small.
>>
>> In the "mismatch" cases (searching for value that is not in the table)
>> the differences are more significant, but it might be noise. It does
>> seem much more "green" than "red", i.e. the one-hash variant seems to be
>> faster (although this does depend on the values for formatting).
>>
>> To sum this up, I think the one-hash approach seems interesting. It's
>> not going to give us huge speedups because we're only hashing int32
>> values anyway (not the source data), but it's worth exploring.
> 
> Thanks for testing! It seems you tested against the version with two
> moduli, and not the alternative discussed in
> 
> https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development
> 
> which would in fact be rehashing the 32 bit values. I think that would
> be the way to go if we don't use the one-hashing approach.
> 

Yeah. I forgot about this detail, and I may try again with the two-hash
variant, but I wonder how much difference it would make, considering the
results match what we expect (that is, the "scan fraction" results for
fill_factor=100 match the target fpr almost perfectly).

I think there's a possibly more important omission in the testing - I
forgot about the "sort mode" used initially, when the filter keeps the
actual hash values and only switches to hashing later. I wonder if that
plays a role in some of the cases.

I'll investigate this a bit in the next round of tests.
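
For the record, by the "two-hash variant" I mean deriving the probe
positions by rehashing the stored 32-bit value, roughly like this (a
minimal sketch - the helper name and the mixing constant are illustrative,
not taken from the patch):

    #include "common/hashfn.h"

    /*
     * Sketch: derive the nhashes probe positions from the stored 32-bit
     * hash by rehashing it, instead of using several different moduli.
     */
    static void
    bloom_probe_positions(uint32 hash, int nhashes, int nbits, int *positions)
    {
        uint32      h1 = hash_bytes_uint32(hash);
        uint32      h2 = hash_bytes_uint32(hash ^ 0x9E3779B9);
        int         i;

        for (i = 0; i < nhashes; i++)
            positions[i] = (h1 + (uint32) i * h2) % (uint32) nbits;
    }

The point is that all probe positions come from hashes of the (already
hashed) 32-bit value, not from the source data.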


>> a) set_bloom_partitions does this:
>>
>>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>>        pidx++;
>>
>> which is broken, because the second part of the condition only checks
>> the current index - we may end up using nhashes primes after that, and
>> some of them may be 0. So this needs to be:
>>
>>     while (primes[pidx + nhashes - 1] <= target &&
>>            primes[pidx + nhashes] > 0)
>>        pidx++;
> 
> Good catch.
> 
>> b) set_bloom_partitions does this to generate primes:
>>
>>     /*
>>      * Increase the limit to ensure we have some primes higher than
>>      * the target partition length. The 100 value is arbitrary, but
>>      * should be well over what we need.
>>      */
>>     primes = generate_primes(target_partlen + 100);
>>
>> It's not clear to me why 100 is sufficient, particularly for large page
>> sizes. AFAIK the primes get more and more sparse, so how does this
>> guarantee we'll get enough "sufficiently large" primes?
> 
> This value is not rigorous and should be improved, but I started with
> that by looking at the table in section 3 in
> 
> https://primes.utm.edu/notes/gaps.html
> 
> I see two ways to make a stronger guarantee:
> 
> 1. Take the average gap between primes near n, which is log(n), and
> multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target
> seems a better heuristic than a constant, and is still simple to calculate.
> 
> With the pathological example you gave of n=575104, k=3 (target_partlen
> = 191701), the number to add is log(191701) * 10 = 122.  By the table
> referenced above, the largest prime gap under 360653 is 95, so we're
> guaranteed to find at least one prime in the space of 122 above the
> target. That will likely be enough to find the closest-to-target filter
> size for k=3. Even if it weren't, nbits is so large that the relative
> difference is tiny. I'd say a heuristic like this is most likely to be
> off precisely when it matters the least. At this size, even if we find
> zero primes above our target, the relative filter size is close to 
> 
> (575104 - 3 * 95) / 575104 = 0.9995
> 
> For a more realistic bad-case target partition length, log(1327) * 10 =
> 72. There are 33 composites after 1327, the largest such gap below 9551.
> That would give five primes larger than the target
> 1361   1367   1373   1381   1399
> 
> which is more than enough for k<=10:
> 
> 1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 +
> 1373 = 13276
> 
> 2. Use a "segmented range" algorithm for the sieve and iterate until we
> get k*2 primes, half below and half above the target. This would be an
> absolute guarantee, but also more code, so I'm inclined against that.
> 

Thanks, that makes sense.

While investigating the failures, I tried increasing the values a lot,
without observing any measurable increase in runtime. IIRC I even used
(10 * target_partlen) or something like that. That tells me this is not
a very sensitive part of the code, so I'd suggest simply using something
that we know is large enough to be safe.

For example, the largest bloom filter we can have is 32kB, i.e. 262,144
bits, at which point the largest prime gap is still less than 95 (per the
gap table). And we may use up to BLOOM_MAX_NUM_PARTITIONS partitions, so
let's just use

    BLOOM_MAX_NUM_PARTITIONS * 100

as the slack, on the basis that we may need up to BLOOM_MAX_NUM_PARTITIONS
primes before/after the target. We could consider the actual target being
lower (essentially 1/npartitions of nbits), which decreases the maximum
gap, but I don't think that's worth the extra complexity here.
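
In code, that's just a tweak to the sieve limit in set_bloom_partitions -
a sketch of what I mean (keeping the generate_primes() call from the
patch):

    /*
     * Sketch: extend the sieve limit by enough slack to find up to
     * BLOOM_MAX_NUM_PARTITIONS primes above the target. With filters
     * capped at 32kB (262,144 bits) the maximum prime gap in this range
     * is below 100, so BLOOM_MAX_NUM_PARTITIONS * 100 is always enough.
     */
    primes = generate_primes(target_partlen + BLOOM_MAX_NUM_PARTITIONS * 100);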


FWIW I wonder if we should do something about bloom filters that we know
can get larger than the page size. In the example I used, we know that
nbits=575104 is larger than the page, so as the filter gets more full (and
thus more random and less compressible) it can't possibly fit. Maybe we
should reject that right away, instead of "delaying it" until later, on
the basis that it's easier to fix at CREATE INDEX time (compared to when
inserts/updates start failing at a random time).

The problem with this is of course that if the index is multi-column,
this may not be strict enough (i.e. each filter would fit independently,
but the whole index row is too large). But it's probably better to do at
least something, and maybe improve that later with some whole-row check.
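
Roughly what I have in mind is a check at CREATE INDEX / parameter
validation time, something like this (just a sketch - the nbits variable
and the threshold are illustrative, and a real check would need to account
for the BRIN tuple overhead too):

    /*
     * Sketch: reject filters that cannot possibly fit on a page once they
     * stop being compressible. nbits is the filter size in bits derived
     * from n_distinct_per_range and false_positive_rate.
     */
    if ((nbits + 7) / 8 > BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("BRIN bloom filter would exceed page size"),
                 errhint("Decrease n_distinct_per_range or increase false_positive_rate.")));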


>> c) generate_primes uses uint16 to store the primes, so it can only
>> generate primes up to 32768. That's (probably) enough for 8kB pages, but
>> for 32kB pages it's clearly insufficient.
> 
> Okay.
> 
>> As for the original question how expensive this naive sieve is, I
>> haven't been able to measure any significant timings. The logging around
>> generate_primes usually looks like this:
>>
>> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
>> 575104 nhashes 3 target_partlen 191701
>> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>>
>> So it takes 0.000 second for this extreme page size. I don't think we
>> need to invent anything more elaborate.
> 
> Okay, good to know. If we were concerned about memory, we could have it
> check only odd numbers. That's a common feature of sieves, but also
> makes the code a bit harder to understand if you haven't seen it before.
> 

IMO if we were concerned about memory we'd use Bitmapset instead of an
array of bools. That's 1:8 compression, not just 1:2.
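
For illustration, the sieve over a Bitmapset might look something like
this (just a sketch using the bitmapset.h API, with composites stored as
members - the function name is made up):

    #include "nodes/bitmapset.h"

    /* Sketch: sieve of Eratosthenes marking composites in a Bitmapset. */
    static Bitmapset *
    sieve_composites(int limit)
    {
        Bitmapset  *composite = NULL;
        int         i,
                    j;

        for (i = 2; i * i <= limit; i++)
        {
            /* skip numbers already known to be composite */
            if (bms_is_member(i, composite))
                continue;

            /* i is prime, mark all its multiples as composite */
            for (j = i * i; j <= limit; j += i)
                composite = bms_add_member(composite, j);
        }

        return composite;
    }

A number n (with 2 <= n <= limit) is then prime iff
!bms_is_member(n, composite), at ~1 bit per candidate instead of a byte.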

> Also to fill in something I left for later, the reference for this
> 
> /* upper bound of number of primes below limit */
> /* WIP: reference for this number */
> int numprimes = 1.26 * limit / log(limit);
> 
> is
> 
> Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for
> some functions of prime numbers". Illinois J. Math. 6: 64–94.
> doi:10.1215/ijm/1255631807
> 
> More precisely, it's 30*log(113)/113 rounded up.
> 

Thanks, I was wondering where that came from.
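
For the archives, if I'm reading the reference correctly, the bound is

    pi(x) < 1.25506 * x / ln(x)    for x > 1

where the constant 1.25506 is the maximum of pi(x) * ln(x) / x, attained
at x = 113 with pi(113) = 30, i.e. 30 * ln(113) / 113 ~= 1.25506 - which
rounded up gives the 1.26 in the code.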


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


