Thread: documentation for hashtext?

documentation for hashtext?

From
"Massa, Harald Armin"
Date:
hello,

just tried to lookup the documentation of hashtext.

And I did not find it by:

a) searching the online docs with the integrated searchengine

b) reading through the chapter 9 "functions and operators" 
   -> within string functions, there is md5(string), but not hashtext.

c) searching postgresql.org with google
 -> leads to some mailing-list discussions about hashtext in comparison to md5

Am I searching wrong or is the documentation in fact missing? 

Harald
-- 
Harald Armin Massa     www.2ndQuadrant.de
PostgreSQL  Training, Services  and Support

2ndQuadrant Deutschland GmbH     
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

Re: documentation for hashtext?

From
Magnus Hagander
Date:
On Thu, Aug 25, 2011 at 13:20, Massa, Harald Armin <chef@ghum.de> wrote:
> hello,
> just tried to lookup the documentation of hashtext.
> And I did not find it by:
> a) searching the online docs with the integrated searchengine
> b) reading through the chapter 9 "functions and operators"
>    -> within string functions, there is md5(string), but not hashtext.
> c) searching postgresql.org with google
>  -> leads to some mailing-list discussions about hashtext in comparison to
> md5
> Am I searching wrong or is the documentation in fact missing?

I believe it's considered an internal function, and not one to rely
on. If you search the archives, you will find reports around a version
upgrade when it changed the output for a certain input and thus broke
things for those using it as partitioning key in for example plproxy.
IIRC (but I'm too lazy to lookup the discussion right now) the
conclusion was that it's not documented because it's internal and
you're not supposed to use/rely on it.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: documentation for hashtext?

From
"Massa, Harald Armin"
Date:
 hashtext


I believe it's considered an internal function, and not one to rely
on. If you search the archives, you will find reports around a version
upgrade when it changed the output for a certain input and thus broke

yes. I did find that discussion, and a blog post by Peter E. about this breakage.

BUT... 

then there is skytools ... using hashtext for various things, and the report:


describing it is fact roughly 40% faster then md5 AND is an integer, so

select sum(hastext(t.*::text)) as signature from datable t

as a fast way of comparing 2 replicated tables is really something to consider....
 
conclusion was that it's not documented because it's internal and
you're not supposed to use/rely on it.

My impression is that people are allready using it, relying their sharding on it, even building indexes on it.

So... I suggest we start documenting it. Even if there was a recommendation not to use it, people will get hurt anyway if their application rely on it and it breaks.

Could I persuade?

Harald

-- 
Harald Armin Massa     www.2ndQuadrant.de
PostgreSQL  Training, Services  and Support

2ndQuadrant Deutschland GmbH     
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

Re: documentation for hashtext?

From
pasman pasmański
Date:
The better solution is you write own hashing function.

--
------------
pasman

Re: documentation for hashtext?

From
Peter Eisentraut
Date:
On tor, 2011-08-25 at 14:05 +0200, Massa, Harald Armin wrote:
> > conclusion was that it's not documented because it's internal and
> > you're not supposed to use/rely on it.
> >
> > My impression is that people are allready using it, relying their sharding
> on it, even building indexes on it.

I think a better solution is either to implement lookup3 in PostgreSQL,
which is what hashtext was originally based on, so you have a hash
function that won't change and is comparable to the current one in
behavior.  Or you use a standard cryptographic hash function such as md5
or sha1 and shard by that.  They are slower than the lookup3-type hash
functions, but for multiple-node applications, it's probably OK.

> So... I suggest we start documenting it. Even if there was a
> recommendation not to use it, people will get hurt anyway if their
> application rely on it and it breaks.

Well, it's already broken for what people are using it or are thinking
about using it, and documenting it won't help that.  Better come up with
a purpose-built solution, as per above.