Re: Using hashtext and unique constraints together - Mailing list pgsql-general

From Mason Hale
Subject Re: Using hashtext and unique constraints together
Date
Msg-id 8bca3aa10712111600w70fe53bfg202ae2d55e7818cd@mail.gmail.com
Whole thread Raw
In response to Using hashtext and unique constraints together  ("Mason Hale" <masonhale@gmail.com>)
List pgsql-general


I recently discovered the hashtext() function, and I'm interested in using it to reduce the size of one of the biggest indexes in my database. 
...

The problem with either MD5 or hashtext() is that neither can guarantee unique output even if the inputs are all unique.
...

The problem I need help with is guaranteeing uniqueness of the URL on insert, with a non-unique index on hashtext(url) and *without* creating a unique index on page(url).

I'm thinking that an insert trigger that ensures (SELECT count(*) FROM page WHERE hashtext(url) = hashtext('http://www.postgresql.org') AND url = ' http://www.postgresql.org' ) = 0 won't work given MVCC, as two transactions could simultaneously insert the same url at the same time.

Can anyone think of a way to guarantee uniqueness of the URL without adding a unique constraint on the page(url) column?

I gather from the lack of response the answer is "no" ?

Sorry to respond to my own post... but I had an idea -- what if hash the same URL two different ways, and create a unique compound index on both hashes?

For example

CREATE unique index on page( hashtext(url), hashtext(md5(url)) )

The idea here is that the odds for two different string to hash to the same value with hashtext *and* md5 is remote enough to rely on for uniqueness. I'm converting the md5 to hashtext to save space.

or even

CREATE unique index on page( hashtext(url), hashtext(url || 'SALT') )

An alternative approach, assumes that two different strings that hash to the same value, will not also hash to the same value if the concatenated with some additional (constant) data. Again, doesn't absolutely guarantee uniqueness, but seems like it should be safe enough. I guess it depends on how truly random the hashtext output is.

Both of the above would require 8 bytes for each index entry (plus overhead), which is still a big savings over our current implementation.

Any thoughts on these ideas?

thanks in advance,
Mason

pgsql-general by date:

Previous
From: Gregory Stark
Date:
Subject: Re: Hijack!
Next
From: "Gregory Williamson"
Date:
Subject: Re: Hijack!