Re: Hash indexes - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: Hash indexes
Date
Msg-id D425483C2C5C9F49B5B7A41F8944154757DADB@postal.corporate.connx.com
Whole thread Raw
List pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Greg Stark
> Sent: Tuesday, August 01, 2006 4:42 PM
> To: Hannu Krosing
> Cc: Andrew Dunstan; Gregory Stark; Tom Lane; Alvaro Herrera; Jim C. Nasby;
> Luke Lonergan; mark@mark.mielke.cc; Bruce Momjian; Jie Zhang; Gavin
> Sherry; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Hash indexes
> 
> Hannu Krosing <hannu@skype.net> writes:
> 
> > Ühel kenal päeval, T, 2006-08-01 kell 10:54, kirjutas Andrew Dunstan:
> > > Gregory Stark wrote:
> > > >
> > > > I looked a while back and was suspicious about the actual hash
> functions too.
> > > > It seemed like a lot of them were vastly suboptimal. That would mean
> we're
> > > > often dealing with mostly empty and mostly full buckets instead of
> well
> > > > distributed hash tables.
> > > >
> > > >
> > > >
> > >
> > > This is now sounding like a lot of low hanging fruit ... highly
> > > performant hash indexed tables could possibly be a very big win.
> > >
> >
> > Are you sure about the badness of our hash functions ?
> >
> > I just tested and hashtext(text) has about 1.4% of collisions on about
> > 120M distinct texts, which is not bad considering thet total space for
> > hashes is 4G, meaning that 120M covers itself already 3% of possible
> > hash space.
> 
> I don't recall exactly what triggered my suspicions, but I think it had
> more
> to do with things like how int4 and int8 were mapped to hash codes and
> what
> the hash function looks like that compresses the hash codes into the
> actual
> bucket. IIRC it's just the low order bits for int8 and it's the actual
> int4
> which then only takes the low order bits. I seem to recall wondering what
> would happen if you had, say, a list of /24 network addresses for example.
> 
> I didn't actually do any tests though so it's quite possible I missed
> something that mitigated the issues I feared.

There is a program called QDBM:
http://qdbm.sourceforge.net/

That has very effective disk based hashing.  A clever extension that he uses is to use btrees to chain hash collisions
(verysimilar to an idea I like to use from time to time which is using a hash table of skiplists).  The project is LGPL
butit is possible that Mikio Hirabayashi could be talked into letting the on disk hash table portion be made available
witha Berkeley style license.
 

As for effectiveness of hash functions, there are test programs on this page:
http://www.burtleburtle.net/bob/hash/index.html

and this hash library:
http://eternallyconfuzzled.com/libs/jsw_hlib.html
has statistics built into it (you can easily add and test your own hash functions using the provided API).

> --
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [PATCHES] Replication Documentation
Next
From: "Joshua D. Drake"
Date:
Subject: Re: [PATCHES] Replication Documentation