From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes

Hi,

Here's a rebased version of the patch series, to keep the cfbot happy.
I've also restricted the false positive rate to [0.001, 0.1] instead of
the original [0.0, 1.0], per discussion on this thread.


I've done a bunch of experiments, comparing the "regular" bloom indexes
with the one-hashing scheme proposed by John Naylor. I've been wondering
if there's some measurable difference, especially in:

* efficiency (query duration)

* false positive rate depending on the fill factor

So I ran a bunch of tests on synthetic data sets, varying parameters
affecting the BRIN bloom indexes:

1) different pages_per_range

2) number of distinct values per range

3) fill factor of the bloom filter (66%, 100%, 200%)

Attached is the script I used for these tests, along with a simple
spreadsheet comparing the results for each combination of parameters.
For each combination it shows the average query duration (over 10
runs) and the scan fraction (what fraction of the table was scanned).
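
To give an idea of the shape of each test, here's a simplified sketch
(illustrative values only - the attached script varies all three
parameters):

  create table t (a int);
  insert into t select i / 10 from generate_series(1, 10000000) s(i);

  create index on t using brin
    (a int4_bloom_ops(n_distinct_per_range=16000,
                      false_positive_rate=0.01))
    with (pages_per_range=128);

  -- "match" case - the value exists somewhere in the table
  select count(*) from t where a = 500000;

  -- "mismatch" case - the value does not exist
  select count(*) from t where a = -1;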

Overall, I think there's very little difference, particularly in the
"match" cases when we're searching for a value that we know is in the
table. The one-hash variant seems to perform a bit better, but the
difference is fairly small.

In the "mismatch" cases (searching for value that is not in the table)
the differences are more significant, but it might be noise. It does
seem much more "green" than "red", i.e. the one-hash variant seems to be
faster (although this does depend on the values for formatting).

To sum this up, I think the one-hash approach seems interesting. It's
not going to give us huge speedups because we're only hashing int32
values anyway (not the source data), but it's worth exploring.
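
For context, my understanding of the one-hash scheme is roughly this
(a sketch of the idea, not the actual patch code): the filter is split
into nhashes partitions with distinct prime lengths, and a single
32-bit hash is reduced modulo each partition length - the distinct
prime moduli play the role of the k independent hash functions in a
regular bloom filter.

    #include <stdint.h>

    /*
     * Set one bit per partition for the given hash. The partitions
     * have distinct prime lengths (partlens) and are laid out back
     * to back in a single bitmap of sum(partlens) bits.
     */
    static void
    bloom_add_hash(uint64_t *bitmap, uint32_t hash,
                   const uint32_t *partlens, int npartitions)
    {
        uint32_t    offset = 0;

        for (int i = 0; i < npartitions; i++)
        {
            uint32_t    bit = offset + hash % partlens[i];

            bitmap[bit / 64] |= (uint64_t) 1 << (bit % 64);
            offset += partlens[i];
        }
    }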


I've started looking at the one-hash code changes, and I've discovered
a couple of issues. I've been wondering how expensive the naive prime
sieve is - it's not an extremely hot code path, as we only run it once
per page range. But still. So my plan was to create the largest bloom
filter possible, and see how much time generate_primes() takes.

So I initialized a cluster with 32kB blocks (built with
--with-blocksize=32) and tried to do this:

  create index on t
  using brin (a int4_bloom_ops(n_distinct_per_range=120000,
                               false_positive_rate=0.1));

which ends up using nbits=575104 (which is 2x the page size, but let's
ignore that) and nhashes=3.
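
Incidentally, those values match the standard bloom filter sizing
formulas, so the parameters are applied as expected:

    nbits   = -n * ln(p) / (ln 2)^2
            = -120000 * ln(0.1) / 0.4805 ~= 575104

    nhashes = (nbits / n) * ln(2)
            = 4.79 * 0.693 ~= 3.3, i.e. 3 after rounding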

That, however, crashes and burns, because:

a) set_bloom_partitions does this:

    while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
       pidx++;

which is broken, because the second part of the condition only checks
the current index - but we then use nhashes primes starting at pidx,
and some of them may be 0. So this needs to be:

    while (primes[pidx + nhashes - 1] <= target &&
           primes[pidx + nhashes] > 0)
       pidx++;

(We know there's always at least one 0 at the end, so it's OK not to
check the length explicitly.)
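
A toy standalone reproduction of the old behavior (made-up primes and
target, just to show the mechanism):

    #include <stdio.h>

    int
    main(void)
    {
        /* toy zero-terminated primes array, as generate_primes returns */
        int     primes[] = {2, 3, 5, 7, 11, 0, 0, 0};
        int     nhashes = 3;
        int     target = 1000;  /* larger than any prime in the array */
        int     pidx = 0;

        /* the old (broken) condition */
        while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
            pidx++;

        /* stops at pidx = 5, so all three "primes" are 0 */
        printf("%d %d %d\n",
               primes[pidx], primes[pidx + 1], primes[pidx + 2]);

        return 0;
    }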

b) set_bloom_partitions does this to generate primes:

    /*
     * Increase the limit to ensure we have some primes higher than
     * the target partition length. The 100 value is arbitrary, but
     * should be well over what we need.
     */
    primes = generate_primes(target_partlen + 100);

It's not clear to me why 100 is sufficient, particularly for large
page sizes. AFAIK the primes get more and more sparse - the average
gap near n grows like ln(n), and IIRC the first gap larger than 100
already occurs just above 370000 - so how does this guarantee we'll
get enough "sufficiently large" primes?
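
One simple way to make this robust would be to retry with a larger
limit until the sieve actually returns enough primes above the target.
A sketch (the helper name is made up, and I'm assuming the uint32
version of generate_primes from issue (c), returning a zero-terminated
array as the existing code relies on):

    static uint32 *
    generate_primes_above(uint32 target, int nneeded)
    {
        uint32      limit = target + 100;

        for (;;)
        {
            uint32     *primes = generate_primes(limit);
            int         nabove = 0;

            /* count primes above the target partition length */
            for (int i = 0; primes[i] != 0; i++)
            {
                if (primes[i] > target)
                    nabove++;
            }

            if (nabove >= nneeded)
                return primes;

            /* not enough primes - free and retry with a larger sieve */
            pfree(primes);
            limit *= 2;
        }
    }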


c) generate_primes uses uint16 to store the primes, so it can only
generate primes up to 32768. That's (probably) enough for 8kB pages, but
for 32kB pages it's clearly insufficient.
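
Widening the stored type fixes that. For reference, a minimal
standalone sieve storing primes as uint32 might look like this
(illustrative plain C, not the actual patch):

    #include <stdint.h>
    #include <stdlib.h>

    /* returns a zero-terminated array of all primes <= limit */
    static uint32_t *
    sieve_primes(uint32_t limit)
    {
        char       *composite = calloc(limit + 1, 1);
        uint32_t   *primes = malloc((limit / 2 + 2) * sizeof(uint32_t));
        int         n = 0;

        for (uint32_t i = 2; i <= limit; i++)
        {
            if (composite[i])
                continue;

            primes[n++] = i;

            /* j is 64-bit, as i * i may overflow uint32 */
            for (uint64_t j = (uint64_t) i * i; j <= limit; j += i)
                composite[j] = 1;
        }

        primes[n] = 0;
        free(composite);
        return primes;
    }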


I've fixed these issues in a separate WIP patch, with some simple
debugging logging.


As for the original question of how expensive this naive sieve is, I
haven't been able to measure any significant timings. The logging
around generate_primes usually looks like this:

2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits 575104 nhashes 3 target_partlen 191701
2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated

So it takes 0.000 seconds even for this extreme page size. I don't
think we need to invent anything more elaborate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
