Re: reducing the footprint of ScanKeyword (was Re: Large writable variables) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date
Msg-id 20181228085449.v2veeelcdpstmlsq@alap3.anarazel.de
In response to Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2018-12-27 14:22:11 -0500, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> > A smaller project might be to see if we can replace the binary keyword
> > search in ScanKeyword with a perfect hashing function generated by
> > gperf, or something similar. I had a quick look at that, too.
> 
> Yeah, we've looked at gperf before, eg
> 
> https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de
> 
> Perhaps it'd be a win but I'm not very convinced.

Note that the tradeoffs mentioned there, from memory, aren't necessarily
applicable here. As we're dealing with strings anyway, the fact that
gperf wants to deal with strings rather than numbers isn't problematic.


> I don't know much about the theory of perfect hashing, but I wonder
> if we could just roll our own tool for that.  Since we're not dealing
> with extremely large keyword sets, perhaps brute force search for a
> set of multipliers for a hash computation like
> (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size
> would work.
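
Tom's brute-force idea can be sketched directly in a few lines of Python. Everything below is hypothetical illustration, not PostgreSQL code: the keyword sample, the table size, and the range the multipliers are drawn from are all arbitrary choices, and random search stands in for a systematic enumeration of primes.

```python
import random

# Hypothetical keyword sample; the real ScanKeyword list is much larger.
KEYWORDS = ["select", "insert", "update", "delete", "from",
            "where", "group", "order", "limit", "join"]

TABLE_SIZE = 32  # some slack over len(KEYWORDS) makes a perfect mapping easier to find

def make_hash(multipliers):
    # (char[0] * m0 + char[1] * m1 + ...) mod table_size, as in Tom's sketch
    def h(word):
        return sum(ord(c) * m for c, m in zip(word, multipliers)) % TABLE_SIZE
    return h

def find_perfect_multipliers(keys, max_tries=100000, seed=0):
    """Brute-force search for per-position multipliers that map all keys
    to distinct buckets (i.e. a perfect hash over this key set)."""
    rng = random.Random(seed)
    max_len = max(len(k) for k in keys)
    for _ in range(max_tries):
        multipliers = [rng.randrange(1, 1000) for _ in range(max_len)]
        h = make_hash(multipliers)
        if len({h(k) for k in keys}) == len(keys):
            return multipliers
    raise RuntimeError("no perfect multipliers found within the try budget")

mults = find_perfect_multipliers(KEYWORDS)
h = make_hash(mults)
```

For a key set this small, a random try succeeds with decent probability (10 keys into 32 slots collide less than most intuitions suggest), so the search terminates almost immediately; whether this scales to the full keyword list is exactly the open question.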

The usual way to do perfect hashing is to basically have a two-stage
hashtable, with the first stage keyed by a "normal" hash function, and
the second one disambiguating the values that hash into the same bucket,
by additionally keying a hash function with the value in the cell in the
intermediate hash table. Determining the parameters in the intermediate
table is what takes time.  That most perfect hash functions look like
that is also a good part of the reason why I doubt it's worthwhile to go
there over a simple linear probing hashtable with a good hash function -
computing two hash values will usually be worse than linear probing for
*small* and *not modified* hashtables.
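
For comparison, the linear-probing alternative is about as simple as a lookup structure gets: one hash computation, then a short scan forward. This is only a sketch under assumed parameters (a hypothetical keyword sample, a toy 31-based string hash, and a table kept half empty so probe chains stay short):

```python
# Hypothetical keyword sample and parameters, for illustration only.
KEYWORDS = ["select", "insert", "update", "delete", "from",
            "where", "group", "order", "limit", "join"]
SIZE = 32  # ~2x the number of keywords, so the table is never close to full

def simple_hash(word):
    # A generic multiplicative string hash; any decent hash function works here.
    h = 0
    for c in word:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Build: on a collision, step forward to the next free slot (linear probing).
table = [None] * SIZE
for kw in KEYWORDS:
    i = simple_hash(kw) % SIZE
    while table[i] is not None:
        i = (i + 1) % SIZE
    table[i] = kw

def is_keyword(word):
    # Probe from the home slot until we find the word or an empty slot.
    i = simple_hash(word) % SIZE
    while table[i] is not None:
        if table[i] == word:
            return True
        i = (i + 1) % SIZE
    return False
```

Since the keyword table is small and built once and never modified, the probe chains are fixed at build time and typically one or two slots long, which is the basis of the claim that this beats computing two hash values.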

A simple (i.e. slow for large numbers of keys) implementation for
generating a perfect hash function isn't particularly hard. E.g. look at
the python implementation at http://iswsa.acm.org/mphf/index.html and
http://stevehanov.ca/blog/index.php?id=119 for an easy explanation with
graphics.
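The displacement-based construction described in the stevehanov.ca post above can be condensed to the following sketch. The keyword sample is hypothetical and the FNV-style hash is one arbitrary choice of parameterizable hash; the structure (first-stage buckets, then a per-bucket displacement searched by brute force, stored in an intermediate table G) is the two-stage scheme described earlier in this mail.

```python
# Hypothetical keyword sample.
KEYWORDS = ["select", "insert", "update", "delete", "from",
            "where", "group", "order", "limit", "join"]

FNV_PRIME = 0x01000193

def fnv_hash(d, s):
    """Both hash stages; the parameter d distinguishes second-stage variants."""
    if d == 0:
        d = FNV_PRIME
    for c in s:
        d = ((d * FNV_PRIME) ^ ord(c)) & 0xFFFFFFFF
    return d

def build(keys):
    size = len(keys)
    # Stage 1: distribute keys into buckets with the "normal" hash (d = 0).
    buckets = [[] for _ in range(size)]
    for key in keys:
        buckets[fnv_hash(0, key) % size].append(key)
    # Resolve the largest buckets first: search for a displacement d that
    # sends every key in the bucket to a still-empty slot.  This search is
    # the "determining the parameters" step that takes the time.
    buckets.sort(key=len, reverse=True)
    G = [0] * size         # intermediate table of displacement values
    table = [None] * size  # final slot assignment
    for bucket in buckets:
        if len(bucket) <= 1:
            break
        d = 1
        while True:
            used = []
            for key in bucket:
                s = fnv_hash(d, key) % size
                if table[s] is not None or s in used:
                    break
                used.append(s)
            else:
                break  # every key in the bucket found a free slot
            d += 1
        G[fnv_hash(0, bucket[0]) % size] = d
        for key, s in zip(bucket, used):
            table[s] = key
    # Single-key buckets go straight into the remaining free slots;
    # a negative G entry encodes a direct index into the final table.
    free = [i for i, v in enumerate(table) if v is None]
    for bucket in (b for b in buckets if len(b) == 1):
        s = free.pop()
        G[fnv_hash(0, bucket[0]) % size] = -s - 1
        table[s] = bucket[0]
    return G, table

def lookup(G, table, key):
    size = len(G)
    d = G[fnv_hash(0, key) % size]
    if d < 0:
        return table[-d - 1]
    return table[fnv_hash(d, key) % size]

G, table = build(KEYWORDS)
```

Note that lookup always lands on *some* slot, so a real keyword scanner would still compare the probe string against the stored keyword, much as ScanKeyword does after its binary search. The two hash computations per lookup are also visible here, which is the cost being weighed against linear probing above.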

Greetings,

Andres Freund

