Thread: Question regarding fast-hashing in PGSQL

Question regarding fast-hashing in PGSQL

From: Stephen Conley
Hey there;

I have a weird use case where I am basically taking data from many different sources and merging it into a single table, while trying to avoid duplicates as much as possible.  None of them share any kind of primary key, but I have determined 3 columns that, together, will almost always be unique so I am planning on using those 3 columns as a composite primary key.

Two of those columns are integers, which is great.  The third column is a UTF-8 string, which may be quite long (though probably no longer than 50 characters ... on average around 10 - 30 characters).  The strings could be practically anything, and they absolutely will not be unique on their own (the three data points are basically x, y coordinates plus some name; for a given x, y coordinate there may be multiple names, but the likelihood of the same name at the same x, y is almost 0).

I really don't want to do a string comparison if possible because this DB will be getting very large, very quickly -- 20 million or so rows anticipated in the near future (i.e. next few weeks), with possible growth up to around 200 million (1+ year later).

My idea was to hash the string to a bigint, because the likelihood of all 3 columns colliding is almost 0, and if a duplicate does crop up, it isn't the end of the world.
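
For context, the table layout I have in mind is roughly the following (names are just placeholders):

  create table placements (
      x         integer not null,
      y         integer not null,
      name_hash bigint  not null,  -- 64-bit hash of "name"
      name      text    not null,
      primary key (x, y, name_hash)
  );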

However, Postgresql doesn't seem to have any 'native' hashing calls that result in a bigint.  The closest I've found is pgcrypto's 'digest' call -- I could theoretically take an md5 hash, and just use the first 8 bytes of it to make a bigint.

HOWEVER... there is no straightforward way to do this.  The most straightforward way I've seen is md5 -> hex string -> substring -> bigint.  This is ridiculous to me -- I'm basically converting binary to text, then converting text back to binary.  However, you can't convert a bytea to a bigint in any fashion that I can tell, so I have to eat a bunch of overhead for fun.
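
Concretely, the chain I'm describing looks something like this (going through a hex bit-string to get from the md5 hex text to a bigint):

  -- take the first 16 hex digits (8 bytes) of the md5 and reinterpret as bigint
  select ('x' || substr(md5('some arbitrary name'), 1, 16))::bit(64)::bigint;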

What would be the fastest way to do this?  I will be generating potentially a LOT of these keys, so I want to do it the least dumb way.  I am using DigitalOcean's hosted PostgreSQL so I can't use my own C code -- but I can use PL/pgSQL, PL/Perl or any of these extensions:


If my concerns about string comparisons are unfounded and I'm working way too hard to avoid something that doesn't matter ... feel free to tell me that as well.  Basically, PostgreSQL performance guys, how would you tackle this one?


Thanks,

Stephen

Re: Question regarding fast-hashing in PGSQL

From: Adam Brusselback
I've had a similar issue in the past.

I used the md5 hash function and stored it in a UUID column for my comparisons. Bigger than a bigint, but still much faster than string comparisons directly for my use case.
UUID works fine for storing md5 hashes and gives you the ability to piggyback on all the index support built for them.
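Roughly like this (table and column names are made up for illustration):

  -- the 32-hex-digit md5 text casts directly to uuid
  select md5('some arbitrary name')::uuid;

  -- e.g. populated at insert time as the hashed part of the key
  insert into your_table (x, y, name, name_hash)
  values (42, 7, 'some arbitrary name', md5('some arbitrary name')::uuid);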

Hope that helps,
-Adam

Re: Question regarding fast-hashing in PGSQL

From: Stephen Conley
This should work perfectly for me.  Thank you so much!

On Wed, Sep 18, 2019 at 12:50 PM Adam Brusselback <adambrusselback@gmail.com> wrote:
> I've had a similar issue in the past.
>
> I used the md5 hash function and stored it in a UUID column for my comparisons. Bigger than a bigint, but still much faster than string comparisons directly for my use case.
> UUID works fine for storing md5 hashes and gives you the ability to piggyback on all the index support built for them.
>
> Hope that helps,
> -Adam

Re: Question regarding fast-hashing in PGSQL

From: Tom Lane
Stephen Conley <cheetah@tanabi.org> writes:
> My idea was to hash the string to a bigint, because the likelihood of all 3
> columns colliding is almost 0, and if a duplicate does crop up, it isn't
> the end of the world.

> However, Postgresql doesn't seem to have any 'native' hashing calls that
> result in a bigint.

regression=# \df hashtext*
                               List of functions
   Schema   |       Name       | Result data type | Argument data types | Type
------------+------------------+------------------+---------------------+------
 pg_catalog | hashtext         | integer          | text                | func
 pg_catalog | hashtextextended | bigint           | text, bigint        | func
(2 rows)

The "extended" hash API has only been there since v11, so you
couldn't rely on it if you need portability to old servers.
But otherwise it seems to respond precisely to your question.
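For instance:

  -- second argument is a seed; the result is a bigint
  select hashtextextended('some arbitrary name', 0);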

If you do need portability ... does the text string's part of the
hash *really* have to be 64 bits wide?  Why not just concatenate
it with a 32-bit hash of the other fields?
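As a rough sketch of that (hashint4 is the built-in 32-bit hash for integers; table and column names are illustrative):

  -- high 32 bits from the integer columns, low 32 bits from the name
  select (hashint4(x # y)::bigint << 32)
       | (hashtext(name)::bigint & 4294967295)
  from your_table;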

            regards, tom lane