Re: [HACKERS] A design for amcheck heapam verification - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: [HACKERS] A design for amcheck heapam verification
Date
Msg-id A08BC5DC-6659-48D0-B765-08E012F20E0C@yandex-team.ru
Whole thread Raw
In response to Re: [HACKERS] A design for amcheck heapam verification  (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses Re: [HACKERS] A design for amcheck heapam verification
List pgsql-hackers
Hi, Peter!

> 11 янв. 2018 г., в 15:14, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
> I like heapam verification functionality and use it right now. So, I'm planning to provide review for this patch,
probably,this week. 

I've looked into the code and here's my review.

The new functionality works and is useful right now. I believe it should be shipped in the Postgres box (currently, I
installit with packet managers). 
The code is nice and well documented.

I'd be happy, if I could pass argument like ErrorLevel, which would help me to estimate scale of corruption. Or any
otherway to list more than one error. But, clearly, this functionality is not necessary for this patch. 
Also, I couldn't find where is check that ensures that there is a heap tuple for every B-tree leaf tuple. Is it there?

I've found few neatpicks in bloom filter:
1. Choice of sdbmhash does not seems like a best option from my point of view. I'd stick with MurmurX, with any
availableX. Or anything doing 32-bit alligned computations. Hacks like (hash << 6) + (hash << 16) - hash are cool, but
nowadaysthere is no point not to use hash * 65599. 
2.
+    bitset_bytes = Max(1024L * 1024L, bitset_bytes);
bloom_work_mem was supposed to be the limit? Why we do not honor this limit?
3.
+    filter = palloc0(offsetof(bloom_filter, bitset) +
+                     sizeof(unsigned char) * bitset_bytes);
sizeof(unsigned char) == 1 by C standard
4. function my_bloom_power() returns bit numers, then it's result is powered INT64CONST(1) << bloom_power; back. I'd
stikwith removing bits in a loop by while(target_bitset_bits & (target_bitset_bits-1)) {
target_bitset_bits&=target_bitset_bits-1;} or something like that. Or, if we use code like sdbm hash, fallback to
bithacks:) https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2 
5. I would implement k_hashed with the folllowing code:
static void
k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
{
    uint64  hasha,
            hashb;
    int     i;

    hasha = DatumGetUInt32(hash_any(elem, len));
    hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);

    /*
     * Mix seed value using XOR.  Mixing with addition instead would defeat the
     * purpose of having a seed (false positives would never change for a given
     * set of input elements).
     */
    hasha ^= filter->seed;

    for (i = 0; i < filter->k_hash_funcs; i++)
    {
        /* Accumulate hash value for caller */
        hashes[i] = (hasha + i * hashb + i) % filter->bitset_bits;
    }
}
It produces almost same result (hashes 1..k-1 are +1'd), but uses a lot less % operations. Potential overflow is
governedby uint64 type. 

That's all what I've found. I do not know did the patch had all necessary reviewers attention. Please, feel free to
changestatus if you think that patch is ready. From my point of view, the patch is in a good shape. 


It was a pleasure to read amcheck code, thanks for writing it.


Best regards, Andrey Borodin.

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Enhance pg_stat_wal_receiver view to display connected host
Next
From: Amit Langote
Date:
Subject: Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning