Home > mailing lists

Re: Hash indexes - Mailing list pgsql-hackers

From	Dann Corbit
Subject	Re: Hash indexes
Date	August 2, 2006 00:46:00
Msg-id	D425483C2C5C9F49B5B7A41F8944154757DADB@postal.corporate.connx.com Whole thread Raw
List	pgsql-hackers

Tree view

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Greg Stark
> Sent: Tuesday, August 01, 2006 4:42 PM
> To: Hannu Krosing
> Cc: Andrew Dunstan; Gregory Stark; Tom Lane; Alvaro Herrera; Jim C. Nasby;
> Luke Lonergan; mark@mark.mielke.cc; Bruce Momjian; Jie Zhang; Gavin
> Sherry; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Hash indexes
> 
> Hannu Krosing <hannu@skype.net> writes:
> 
> > Ãœhel kenal pÃ¤eval, T, 2006-08-01 kell 10:54, kirjutas Andrew Dunstan:
> > > Gregory Stark wrote:
> > > >
> > > > I looked a while back and was suspicious about the actual hash
> functions too.
> > > > It seemed like a lot of them were vastly suboptimal. That would mean
> we're
> > > > often dealing with mostly empty and mostly full buckets instead of
> well
> > > > distributed hash tables.
> > > >
> > > >
> > > >
> > >
> > > This is now sounding like a lot of low hanging fruit ... highly
> > > performant hash indexed tables could possibly be a very big win.
> > >
> >
> > Are you sure about the badness of our hash functions ?
> >
> > I just tested and hashtext(text) has about 1.4% of collisions on about
> > 120M distinct texts, which is not bad considering thet total space for
> > hashes is 4G, meaning that 120M covers itself already 3% of possible
> > hash space.
> 
> I don't recall exactly what triggered my suspicions, but I think it had
> more
> to do with things like how int4 and int8 were mapped to hash codes and
> what
> the hash function looks like that compresses the hash codes into the
> actual
> bucket. IIRC it's just the low order bits for int8 and it's the actual
> int4
> which then only takes the low order bits. I seem to recall wondering what
> would happen if you had, say, a list of /24 network addresses for example.
> 
> I didn't actually do any tests though so it's quite possible I missed
> something that mitigated the issues I feared.

There is a program called QDBM:
http://qdbm.sourceforge.net/

That has very effective disk based hashing.  A clever extension that he uses is to use btrees to chain hash collisions
(verysimilar to an idea I like to use from time to time which is using a hash table of skiplists).  The project is LGPL
butit is possible that Mikio Hirabayashi could be talked into letting the on disk hash table portion be made available
witha Berkeley style license.
 

As for effectiveness of hash functions, there are test programs on this page:
http://www.burtleburtle.net/bob/hash/index.html

and this hash library:
http://eternallyconfuzzled.com/libs/jsw_hlib.html
has statistics built into it (you can easily add and test your own hash functions using the provided API).

> --
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

pgsql-hackers by date:

From: Alvaro Herrera
Date: 02 August 2006, 00:39:28
Subject: Re: [PATCHES] Replication Documentation

From: "Joshua D. Drake"
Date: 02 August 2006, 00:47:03
Subject: Re: [PATCHES] Replication Documentation

Re: Hash indexes - Mailing list pgsql-hackers

Previous

Next