From: Andres Freund
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2018-03-30
Msg-id: 20180330011834.4hnr7bhvid26zgpt@alap3.anarazel.de
In response to: Re: [HACKERS] A design for amcheck heapam verification (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi,

On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
> From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Thu, 24 Aug 2017 20:58:21 -0700
> Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
> 
> A Bloom filter is a space-efficient, probabilistic data structure that
> can be used to test set membership.  Callers will sometimes incur false
> positives, but never false negatives.  The rate of false positives is a
> function of the total number of elements and the amount of memory
> available for the Bloom filter.
> 
> Two classic applications of Bloom filters are cache filtering, and data
> synchronization testing.  Any user of Bloom filters must accept the
> possibility of false positives as a cost worth paying for the benefit in
> space efficiency.
> 
> This commit adds a test harness extension module, test_bloomfilter.  It
> can be used to get a sense of how the Bloom filter implementation
> performs under varying conditions.

Maybe add a short paragraph explaining what this'll be used for soon.

> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>  top_builddir = ../../..
>  include $(top_builddir)/src/Makefile.global
>  
> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
> -       knapsack.o pairingheap.o rbtree.o stringinfo.o
> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
> +       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o

*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per line. It would make
things like this easier to view, and conflicts easier to resolve.
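
I.e. something like:

OBJS = \
	binaryheap.o \
	bipartite_match.o \
	bloomfilter.o \
	dshash.o \
	hyperloglog.o \
	ilist.o \
	knapsack.o \
	pairingheap.o \
	rbtree.o \
	stringinfo.o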


> --- /dev/null
> +++ b/src/backend/lib/bloomfilter.c
> @@ -0,0 +1,304 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.c
> + *        Space-efficient set membership testing
> + *
> + * A Bloom filter is a probabilistic data structure that is used to test an
> + * element's membership of a set.

s/of/in/?


> False positives are possible, but false
> + * negatives are not; a test of membership of the set returns either "possibly
> + * in set" or "definitely not in set".  This can be very space efficient when
> + * individual elements are larger than a few bytes, because elements are hashed
> + * in order to set bits in the Bloom filter bitset.

The second half of this paragraph isn't easy to understand.


> + * Elements can be added to the set, but not removed.  The more elements that
> + * are added, the larger the probability of false positives.  Caller must hint
> + * an estimated total size of the set when its Bloom filter is initialized.
> + * This is used to balance the use of memory against the final false positive
> + * rate.

s/its Bloom/the Bloom/?


> + * The implementation is well suited to data synchronization problems between
> + * unordered sets, especially where predictable performance is important and
> + * some false positives are acceptable.

I'm not finding "data synchronization" very descriptive. Makes me think
of multi-threaded races and such.



> +/*
> + * Create Bloom filter in caller's memory context.  This should get a false
> + * positive rate of between 1% and 2% when bitset is not constrained by memory.

s/should/aims at/?


> + * total_elems is an estimate of the final size of the set.  It ought to be
> + * approximately correct, but we can cope well with it being off by perhaps a
> + * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
> + * (Dillinger & Manolios, 2004) for details of why this is the case.

I'd simplify the language here. Replace "ought" with "should" at the
very least, and replace "we" with "the bloom filter" or similar?


> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
> + * This determines the size of the underlying bitset (trivial bookkeeping space
> + * isn't counted).  The bitset is always sized as a power-of-two number of
> + * bits, and the largest possible bitset is 512MB.  The implementation rounds
> + * down as needed.

"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?


> +bloom_filter *
> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
> +{
> +    bloom_filter *filter;
> +    int            bloom_power;
> +    uint64        bitset_bytes;
> +    uint64        bitset_bits;
> +
> +    /*
> +     * Aim for two bytes per element; this is sufficient to get a false
> +     * positive rate below 1%, independent of the size of the bitset or total
> +     * number of elements.  Also, if rounding down the size of the bitset to
> +     * the next lowest power of two turns out to be a significant drop, the
> +     * false positive rate still won't exceed 2% in almost all cases.
> +     */
> +    bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> +    /* Minimum allowable size is 1MB */
> +    bitset_bytes = Max(1024L * 1024L, bitset_bytes);

Some upcasting might be advisable, to avoid dangers of overflows?
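
I.e. something like this, forcing the arithmetic into 64 bits (just a
sketch):

	bitset_bytes = Min((uint64) bloom_work_mem * 1024,
					   (uint64) total_elems * 2);
	/* Minimum allowable size is 1MB */
	bitset_bytes = Max(UINT64CONST(1024) * 1024, bitset_bytes);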


> +/*
> + * Generate k hash values for element.
> + *
> + * Caller passes array, which is filled-in with k values determined by hashing
> + * caller's element.
> + *
> + * Only 2 real independent hash functions are actually used to support an
> + * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
> + * used to make this work.  The main reason we prefer enhanced double hashing
> + * to classic double hashing is that the latter has an issue with collisions
> + * when using power-of-two sized bitsets.  See Dillinger & Manolios for full
> + * details.
> + */
> +static void
> +k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
> +{
> +    uint64        hash;
> +    uint32        x, y;
> +    uint64        m;
> +    int            i;
> +
> +    /* Use 64-bit hashing to get two independent 32-bit hashes */
> +    hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));

Hm. Is that smart given how some hash functions are defined? E.g. for
int8 the higher bits aren't really that independent for small values:

Datum
hashint8(PG_FUNCTION_ARGS)
{
    /*
     * The idea here is to produce a hash value compatible with the values
     * produced by hashint4 and hashint2 for logically equal inputs; this is
     * necessary to support cross-type hash joins across these input types.
     * Since all three types are signed, we can xor the high half of the int8
     * value if the sign is positive, or the complement of the high half when
     * the sign is negative.
     */
    int64        val = PG_GETARG_INT64(0);
    uint32        lohalf = (uint32) val;
    uint32        hihalf = (uint32) (val >> 32);

    lohalf ^= (val >= 0) ? hihalf : ~hihalf;

    return hash_uint32(lohalf);
}

Datum
hashint8extended(PG_FUNCTION_ARGS)
{
    /* Same approach as hashint8 */
    int64        val = PG_GETARG_INT64(0);
    uint32        lohalf = (uint32) val;
    uint32        hihalf = (uint32) (val >> 32);

    lohalf ^= (val >= 0) ? hihalf : ~hihalf;

    return hash_uint32_extended(lohalf, PG_GETARG_INT64(1));
}
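
For context: enhanced double hashing uses the two 32-bit halves as its
two base hashes, so their independence matters. Roughly like this,
given the bitset size m and number of hash functions k - a sketch of
the general scheme from the paper, not necessarily the patch's exact
code:

	x = mod_m((uint32) hash, m);			/* low half */
	y = mod_m((uint32) (hash >> 32), m);	/* high half */

	hashes[0] = x;
	for (i = 1; i < k; i++)
	{
		/* equivalent to h1 + i * h2 + (i^3 - i) / 6, all mod m */
		x = mod_m(x + y, m);
		y = mod_m(y + i, m);
		hashes[i] = x;
	}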




> +/*
> + * Calculate "val MOD m" inexpensively.
> + *
> + * Assumes that m (which is bitset size) is a power-of-two.
> + *
> + * Using a power-of-two number of bits for bitset size allows us to use bitwise
> + * AND operations to calculate the modulo of a hash value.  It's also a simple
> + * way of avoiding the modulo bias effect.
> + */
> +static inline uint32
> +mod_m(uint32 val, uint64 m)
> +{
> +    Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
> +    Assert(((m - 1) & m) == 0);
> +
> +    return val & (m - 1);
> +}

What's up with the two different widths here?
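
I.e. I'd have expected uniform widths, something like (a sketch,
assuming the callers can just pass 64-bit values):

static inline uint64
mod_m(uint64 val, uint64 m)
{
	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
	Assert(((m - 1) & m) == 0);

	return val & (m - 1);
}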


> @@ -0,0 +1,27 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.h
> + *      Space-efficient set membership testing
> + *
> + * Copyright (c) 2018, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *    src/include/lib/bloomfilter.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef _BLOOMFILTER_H_
> +#define _BLOOMFILTER_H_

Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further ones.
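
I.e. just:

#ifndef BLOOMFILTER_H
#define BLOOMFILTER_H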



> From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Tue, 2 May 2017 00:19:24 -0700
> Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
> 
> Add a new, optional capability to bt_index_check() and
> bt_index_parent_check():  callers can check that each heap tuple that
> ought to have an index entry does in fact have one.  This happens at the
> end of the existing verification checks.

And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple
functions.


> +--
> +-- bt_index_check()
> +--
> +DROP FUNCTION bt_index_check(regclass);
> +CREATE FUNCTION bt_index_check(index regclass,
> +    heapallindexed boolean DEFAULT false)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT PARALLEL RESTRICTED;

This breaks functions etc. that reference the existing function. Also,
I don't quite recall the rules: don't we have to drop the function from
the extension first?
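
I.e. whether the upgrade script first needs something like:

ALTER EXTENSION amcheck DROP FUNCTION bt_index_check(regclass);

before the DROP FUNCTION.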



>  /*
> - * bt_index_check(index regclass)
> + * bt_index_check(index regclass, heapallindexed boolean)
>   *
>   * Verify integrity of B-Tree index.
>   *
>   * Acquires AccessShareLock on heap & index relations.  Does not consider
> - * invariants that exist between parent/child pages.
> + * invariants that exist between parent/child pages.  Optionally verifies
> + * that heap does not contain any unindexed or incorrectly indexed tuples.
>   */
>  Datum
>  bt_index_check(PG_FUNCTION_ARGS)
>  {
>      Oid            indrelid = PG_GETARG_OID(0);
> +    bool        heapallindexed = false;
>  
> -    bt_index_check_internal(indrelid, false);
> +    if (PG_NARGS() == 2)
> +        heapallindexed = PG_GETARG_BOOL(1);
> +
> +    bt_index_check_internal(indrelid, false, heapallindexed);
>  
>      PG_RETURN_VOID();
>  }

Given the PG_NARGS() checks I don't understand why you don't just
create a separate two-argument SQL function above? If you rely on the
default value you don't need it anyway?
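
I.e. since the SQL definition supplies the DEFAULT, the C function
should always see two arguments, so something like this ought to
suffice (a sketch):

Datum
bt_index_check(PG_FUNCTION_ARGS)
{
	Oid			indrelid = PG_GETARG_OID(0);
	bool		heapallindexed = PG_GETARG_BOOL(1);

	bt_index_check_internal(indrelid, false, heapallindexed);

	PG_RETURN_VOID();
}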


> +    /*
> +     * * Heap contains unindexed/malformed tuples check *
> +     */

I'd reorder this to "Check whether heap contains ...".


> +    if (state->heapallindexed)
> +    {
> +        IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
> +        HeapScanDesc scan;
> +
> +        /*
> +         * Create our own scan for IndexBuildHeapScan(), like a parallel index
> +         * build.  We do things this way because it lets us use the MVCC
> +         * snapshot we acquired before index fingerprinting began in the
> +         * !readonly case.
> +         */

I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we
can register an MVCC snapshot acquired before..."


> +        scan = heap_beginscan_strat(state->heaprel, /* relation */
> +                                    snapshot,    /* snapshot */
> +                                    0,    /* number of keys */
> +                                    NULL,    /* scan key */
> +                                    true,    /* buffer access strategy OK */
> +                                    true);    /* syncscan OK? */
> +
> +        /*
> +         * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
> +         * behaves when only AccessShareLock held.  This is really only needed
> +         * to prevent confusion within IndexBuildHeapScan() about how to
> +         * interpret the state we pass.
> +         */
> +        indexinfo->ii_Concurrent = !state->readonly;

That's not very descriptive.


> +        /* Fingerprint leaf page tuples (those that point to the heap) */
> +        if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
> +            bloom_add_element(state->filter, (unsigned char *) itup,
> +                              IndexTupleSize(itup));

So, if we didn't use IndexBuildHeapScan(), we could also check whether
dead entries at least have a corresponding item on the page, right? I'm
not asking you to change it, mostly curious.


> +/*
> + * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
> + * all the entries that definitely should have been observed in leaf pages of
> + * the target index (that is, all IndexTuples that were fingerprinted by our
> + * Bloom filter).  All heapallindexed checks occur here.

The last sentence isn't entirely true ;), given that the bloom filter
insertions happen above. s/checks/verification/?



We should be able to get this into v11...


Greetings,

Andres Freund

