Re: [HACKERS] A design for amcheck heapam verification - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] A design for amcheck heapam verification
Date
Msg-id CAH2-Wznm5ZOjS0_DJoWrcm9Us19gzbkm0aTKt5hHprvjHFVHpQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] A design for amcheck heapam verification  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] A design for amcheck heapam verification  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> Would that address your concern? There would be an SQL interface, but
>>> it would be trivial.
>>
>> That's exactly what I think you should do, and mentioned so upthread.
>> A SQL interface can also show a good example of how developers can use
>> this API.

Attach revision, v5, adds a new test harness -- test_bloomfilter.

This can be used to experimentally verify that the meets the well
known "1% false positive rate with 9.6 bits per element" standard. It
manages to do exactly that:

postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG:  beginning test #1...
DEBUG:  bloom_work_mem (KB): 1024
DEBUG:  false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG:  beginning test #2...
DEBUG:  bloom_work_mem (KB): 1024
DEBUG:  false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG:  beginning test #3...
DEBUG:  bloom_work_mem (KB): 1024
WARNING:  false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
 test_bloomfilter
------------------

(1 row)

Here, we repeat the same test 3 times, varying only the seed value
used for each run.

The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).

I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.

There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.

-- 
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Transform for pl/perl
Next
From: "Joshua D. Drake"
Date:
Subject: Re: Logical replication without a Primary Key