Thread: hash options

hash options

From
"Little, Douglas"
Date:

Hello,

I’m working on a data warehouse dimensionalization process where I need to hash a text string to use as the key.

I’ve implemented it with MD5. It works fine, but the MD5 value (32 bytes) is often longer than the original string, so it does not achieve what I want: space savings.

Does anybody have alternative hash function recommendations?  

I looked at the options I knew of:

select length(encode('ar=514','hex')); -- 12

select length(decode('ar=514','base64')); -- 24

select length(DIGEST('ar=514', 'md5')); -- 16 bytes

select length(DIGEST('ar=514', 'sha1')); -- 20 bytes
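
For comparison, the same size check can be sketched in Python with the standard hashlib module. This is only an illustration, not the poster's function: it assumes the value is hashed as bytes, and it shows that the 32-byte figure is the hex text form of MD5, while the raw digest is 16 bytes.

```python
import hashlib

value = b"ar=514"  # same sample value as in the queries above

md5_raw = hashlib.md5(value).digest()      # raw binary digest
md5_hex = hashlib.md5(value).hexdigest()   # hex text form of the same digest
sha1_raw = hashlib.sha1(value).digest()

sizes = {
    "input": len(value),        # 6
    "md5 raw": len(md5_raw),    # 16
    "md5 hex": len(md5_hex),    # 32
    "sha1 raw": len(sha1_raw),  # 20
}
print(sizes)
```

Storing the raw digest rather than its hex text halves the key size, but it is still longer than short inputs such as this one.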

 

The function is currently written in PL/pgSQL, but I’m considering switching to Python for a broader choice of libraries.

The source data is a delimited list of name/value pairs. Lengths range from 0 to 2500 bytes.

ar=514,cc=CA,ci=Montreal,cn=North+America,co=Sympatico,cs=Canada,nt=Xdsl,rc=QC,rs=Quebec,tp=High,tz=GMT%2D5

 

Thanks in advance

Doug Little

 

Sr. Data Warehouse Architect | Business Intelligence Architecture | Orbitz Worldwide

Douglas.Little@orbitz.com


Re: hash options

From
Chris Angelico
Date:
On Mon, Jan 23, 2012 at 2:59 AM, Little, Douglas
<DOUGLAS.LITTLE@orbitz.com> wrote:
>
> I’ve implemented with MD5.  It works fine,  the problem I have is the size of the md5 (32 bytes) is often longer than
> the original string – thus not accomplishing what I want – space savings.

You can always use a truncated hash - for instance, take the first 6-8
hex digits of the MD5 or SHA1 hash. For human readability, that's
likely to be all you need (for instance, git references commits by
their SHA1 hashes, but you can work with just the first six digits
quite happily). Otherwise, can you provide more details on why you
need a hash, and why it wants to be shorter than the original?
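
A minimal Python sketch of the truncation idea, for illustration only. The helper name short_key and the default of 8 hex digits are assumptions, not something from the original function; the sample row is the one posted at the start of the thread.

```python
import hashlib

def short_key(text: str, hex_digits: int = 8) -> str:
    """Return the first `hex_digits` hex characters of the MD5 of `text`.

    short_key and its default length are hypothetical, chosen for illustration.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:hex_digits]

row = ("ar=514,cc=CA,ci=Montreal,cn=North+America,co=Sympatico,"
       "cs=Canada,nt=Xdsl,rc=QC,rs=Quebec,tp=High,tz=GMT%2D5")

key = short_key(row)
print(key, len(key))  # an 8-character hex string and its length
```

Eight hex digits are only 32 bits, so by the birthday bound collisions become likely once there are some tens of thousands of distinct strings. If the truncated value is used as a key rather than as a human-readable label, the load process would need to detect and handle collisions.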

Chris Angelico

Re: hash options

From
David W Noon
Date:
On Sun, 22 Jan 2012 09:59:55 -0600, Little, Douglas wrote about
[GENERAL] hash options:

>I'm working on a data warehouse dimensionalization process   where I
>need to hash a text string to use as the key. I've implemented with
>MD5.  It works fine,  the problem I have is the size of the md5 (32
>bytes) is often longer than the original string - thus not
>accomplishing what I want - space savings.
>
>Does anybody have alternative hash function recommendations?

Try CRC32, possibly augmented by a CRC16 in a separate attribute.

I have CRC functions for PostgreSQL, written in C, and will make them
available to anybody who wants them.
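
The CRC32-plus-CRC16 idea can be sketched in Python using only the standard library. This is not the C code mentioned above: the helper name crc_key is hypothetical, and the CRC-CCITT variant provided by binascii.crc_hqx is an assumed choice of 16-bit CRC.

```python
import binascii
import zlib
from typing import Tuple

def crc_key(text: str) -> Tuple[int, int]:
    """Return (crc32, crc16) for `text`; crc_key is a hypothetical helper."""
    data = text.encode("utf-8")
    crc32 = zlib.crc32(data) & 0xFFFFFFFF  # 32-bit CRC as an unsigned value
    crc16 = binascii.crc_hqx(data, 0)      # 16-bit CRC-CCITT, initial value 0
    return crc32, crc16

row = ("ar=514,cc=CA,ci=Montreal,cn=North+America,co=Sympatico,"
       "cs=Canada,nt=Xdsl,rc=QC,rs=Quebec,tp=High,tz=GMT%2D5")

crc32, crc16 = crc_key(row)
print(f"{crc32:08x} {crc16:04x}")  # 4 bytes + 2 bytes = 6 bytes of key material
```

The two values together occupy 6 bytes, which matches the size of the shortest sample value in the thread. CRCs are checksums, not collision-resistant hashes, so a collision check on load would still be prudent.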
--
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
dwnoon@ntlworld.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

