Re: [HACKERS] A design for amcheck heapam verification - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: [HACKERS] A design for amcheck heapam verification |
Date | |
Msg-id | CAB7nPqSUKppzvNSHY1OM_TdSj0UE18xNFCrOwPC3E8svq7Mb_Q@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] A design for amcheck heapam verification (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: [HACKERS] A design for amcheck heapam verification
|
List | pgsql-hackers |
On Thu, Sep 7, 2017 at 11:26 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Eh, if you want to optimize it for the case where debug output is not >>> enabled, make sure to use ereport() not elog(). ereport() >>> short-circuits evaluation of arguments, whereas elog() does not. >> >> I should do that, but it's still not really noticeable. > > Since this patch has now bit-rotted, I attach a new revision, V2. > Apart from fixing some Makefile bitrot, this revision also makes some > small tweaks as suggested by Thomas and Alvaro. The documentation is > also revised and expanded, and now discusses practical aspects of the > set membership being tested using a Bloom filter, how that relates to > maintenance_work_mem, and so on. > > Note that this revision does not let the Bloom filter caller use their > own dynamic shared memory, which is something that Thomas asked about. > While that could easily be added, I think it should happen later. I > really just wanted to make sure that my Bloom filter was not in some > way fundamentally incompatible with Thomas' planned enhancements to > (parallel) hash join. I have signed up as a reviewer of this patch, and I have looked at the bloom filter implementation for now. This is the kind of facility that people have asked for on this list for many years. One first thing striking me is that there is no test for this implementation, which would be a base stone for other things, it would be nice to validate that things are working properly before moving on with 0002, and 0001 is a feature on its own. I don't think that it would be complicated to have a small module in src/test/modules which plugs in a couple of SQL functions on top of bloomfilter.h. +#define MAX_HASH_FUNCS 10 Being able to define the number of hash functions used at initialization would be nicer. Usually this number is kept way lower than the number of elements to check as part of a set, but I see no reason to not allow people to play with this API in a more extended way. You can then let your module decide what it wants to use. + * work_mem is sized in KB, in line with the general convention. In what is that a general convention? Using bytes would be more intuitive IMO.. Still I think that this could be removed, see below points. +/* + * Hash function is taken from sdbm, a public-domain reimplementation of the + * ndbm database library. + */ Reference link? + bitset_bytes = bitset_bits / BITS_PER_BYTE; + filter->bitset = palloc0(bitset_bytes); + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems); + filter->seed = seed; I think that doing the allocation within the initialization phase is a mistake. Being able to use a DSA would be nice, but I predict as well cases where a module may want to have a bloom filter that persists as well across multiple transactions, so the allocation should be able to live across memory contexts. What I think you should do instead to make this bloom implementation more modular is to let the caller give a pointer to a memory area as well as its size. Then what bloom_init should do is to just initialize this area of memory with zeros. This approach would give a lot of freedom. Not linking a bloom definition to work_mem would be nice as well. + hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0); I am wondering if it would make sense to be able to enforce the hash function being used. The default does not look bad to me though, so we could live with that. +typedef struct bloom_filter bloom_filter; Not allowing callers have a look at the structure contents is definitely the right approach. So, in my opinion, this bloom facility still needs more work. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
pgsql-hackers by date: