Thread: Creating large database of MD5 hash values

From: "Jon Stewart"
Hello,

I am creating a large database of MD5 hash values. I am a relative
newbie with PostgreSQL (or any database, for that matter). The schema
and operation will be quite simple -- only a few tables, probably no
stored procedures -- but I may easily end up with several hundred
million rows of hash values, possibly even billions. The hash values
will be organized into logical sets, with a many-to-many relationship
between hashes and sets. I have some questions before I set out on
this endeavor, however, and would appreciate any and all feedback,
including SWAGs, WAGs, and outright lies. :-) I am trying to batch up
operations as much as possible, so I will largely be doing comparisons
of whole sets, with bulk COPY importing. I hope to avoid single-hash
lookups as much as possible.

1. Which datatype should I use to represent the hash value? UUIDs are
also 16 bytes...
2. Does it make sense to denormalize the hash set relationships?
3. Should I index?
4. What other data structure options would it make sense for me to choose?

Thanks in advance,


Jon

Re: Creating large database of MD5 hash values

From: Chris

> 1. Which datatype should I use to represent the hash value? UUIDs are
> also 16 bytes...

MD5s are always 32 characters long in hex form, so probably varchar(32).
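
For example (untested):

    CREATE TABLE hashes (
        md5 varchar(32) NOT NULL  -- a hex digest is exactly 32 chars
    );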

> 2. Does it make sense to denormalize the hash set relationships?

The general rule is to normalize as much as possible, then denormalize
only when absolutely necessary.

> 3. Should I index?

What sort of queries are you going to be running?

> 4. What other data structure options would it make sense for me to choose?

What sort of other data will you be needing to store?

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Creating large database of MD5 hash values

From: Florian Weimer
* Jon Stewart:

> 1. Which datatype should I use to represent the hash value? UUIDs are
> also 16 bytes...

BYTEA is slower to load and a bit inconvenient to use from DBI, but
occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
bytes with PostgreSQL 8.3).
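
If you do go the BYTEA route, conversion at the boundaries is easy
with decode()/encode(); a sketch (made-up table and column names):

    -- hex text -> bytea on the way in
    INSERT INTO hash_values (md5)
    VALUES (decode('0cc175b9c0f1b6a831c399e269772661', 'hex'));

    -- bytea -> hex text on the way out
    SELECT encode(md5, 'hex') FROM hash_values;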

> 2. Does it make sense to denormalize the hash set relationships?

That depends entirely on your application.

> 3. Should I index?

Depends.  B-tree is generally faster than Hash, even for randomly
distributed keys (like the output of a hash function).
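
I.e., just the default index type -- a sketch with made-up names:

    CREATE INDEX hashes_md5_idx ON hashes USING btree (md5);
    -- rather than: CREATE INDEX hashes_md5_idx ON hashes USING hash (md5);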

> 4. What other data structure options would it make sense for me to
> choose?

See 2.

In general, hashing is bad because it destroys locality.  But in some
cases, there is no other choice.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: Creating large database of MD5 hash values

From: Alvaro Herrera
Jon Stewart wrote:
> Hello,
>
> I am creating a large database of MD5 hash values. I am a relative
> newbie with PostgreSQL (or any database, for that matter). The schema
> and operation will be quite simple -- only a few tables, probably no
> stored procedures -- but I may easily end up with several hundred
> million rows of hash values, possibly even billions. The hash values
> will be organized into logical sets, with a many-to-many relationship
> between hashes and sets. I have some questions before I set out on
> this endeavor, however, and would appreciate any and all feedback,
> including SWAGs, WAGs, and outright lies. :-) I am trying to batch up
> operations as much as possible, so I will largely be doing comparisons
> of whole sets, with bulk COPY importing. I hope to avoid single-hash
> lookups as much as possible.

If MD5 values will be your primary data and you'll be storing millions
of them, it would be wise to create your own datatype and operators with
the most compact and efficient representation possible.
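
A full custom type needs C-level input and output functions. As a
lightweight stopgap, a domain over bytea at least enforces the fixed
length; an untested sketch:

    CREATE DOMAIN md5_t AS bytea
        CHECK (octet_length(VALUE) = 16);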

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Creating large database of MD5 hash values

From: "Jon Stewart"
>  > 1. Which datatype should I use to represent the hash value? UUIDs are
>  > also 16 bytes...
>
>  BYTEA is slower to load and a bit inconvenient to use from DBI, but
>  occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>  bytes with PostgreSQL 8.3).


Can you clarify the "slower to load" point? Where is that pain point
in the postgres architecture?

Storing the values in binary makes intuitive sense to me, since the
data is twice as dense, thus getting you more bang for the buck on
comparisons, caching, and streaming reads. I'm not too concerned about
raw convenience, as there's not going to be a lot of code around my
application. I haven't built the thing yet, so it's hard to say what
performance will be like, but for the users the difference between an
8-hour query that can run overnight and a 16-hour query that they must
wait on is significant.


>  > 2. Does it make sense to denormalize the hash set relationships?
>
>  That depends entirely on your application.


General schema would be as such:

CREATE TABLE hash_values (
    id  bigint PRIMARY KEY,
    md5 bytea NOT NULL          -- exact datatype TBD, per question 1
);

CREATE TABLE hash_sets (
    id   integer PRIMARY KEY,
    name varchar NOT NULL
    -- other data here
);

CREATE TABLE set_link (
    hash_value_id bigint  NOT NULL REFERENCES hash_values (id),
    hash_set_id   integer NOT NULL REFERENCES hash_sets (id)
);

The idea is that you have named sets of hash values, and hash values
can be in multiple sets.

The big operations will be to calculate the unions, intersections, and
differences between sets. That is, I'll load a new set into the
database and then see whether it has anything in common with another
set (probably throw the results into a temp table and then dump it
out). I will also periodically run queries to determine the size of
the intersection of two sets for all pairs of sets (in order to
generate some nice graphs).
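
In SQL terms I'm picturing something like this (a rough sketch against
the schema above, with made-up set ids):

    -- what do sets 1 and 2 have in common?
    CREATE TEMP TABLE common AS
        SELECT hash_value_id FROM set_link WHERE hash_set_id = 1
        INTERSECT
        SELECT hash_value_id FROM set_link WHERE hash_set_id = 2;

    -- intersection sizes for all pairs of sets
    SELECT a.hash_set_id AS set_a, b.hash_set_id AS set_b,
           count(*) AS overlap
    FROM set_link a
    JOIN set_link b ON b.hash_value_id = a.hash_value_id
                   AND b.hash_set_id > a.hash_set_id
    GROUP BY a.hash_set_id, b.hash_set_id;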

The number of sets could grow into the thousands, but will start
small. I expect one of the sets to be very large (it could account for
50%-90% of all hashes); the others will all be smaller, ranging from
10,000 to 1,000,000 entries. The total number of hashes could get into
the hundreds of millions, possibly billions.

One problem I'm worried about is the lack of concurrency in the
application. It will be relatively rare for more than one query to be
in flight at a time; this is not a high-connection application. It
doesn't sound like I'd get any marginal performance improvement out of
postgres by throwing more cores at the problem (other than dual-core;
always nice to have a spare core handling everything else).

Thanks very much for the comments from all. It's a pretty simple
application conceptually, just at a large scale. Other approaches
(map-reduce-ish, or straight-ahead turnkey storage) could potentially
provide better performance, but the users feel more comfortable
maintaining databases, and the overall convenience of a database over
other systems is nice.


Jon

Re: Creating large database of MD5 hash values

From: Florian Weimer
* Jon Stewart:

>>  BYTEA is slower to load and a bit inconvenient to use from DBI, but
>>  occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>>  bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average instead of 2, and a
complex form of double-decoding is necessary.
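
To illustrate the double-decoding: in COPY's text format a backslash
must itself be escaped, so with 8.3's escape format a digest byte such
as 0x0c travels through the data file as

    \\014

which COPY first unescapes to \014, and which the bytea input routine
then decodes back to the single byte; printable bytes pass through as
one character each.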

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version
of $sth->execute (or things like $sth->selectarray, $sth->do); you
must use explicit binding by parameter index in order to specify the
type information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think.  So it's better to
denormalize in this case (if that's what you were alluding to in your
original post).
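
That is, something like this instead of the three-table layout (a
sketch; the table name is made up):

    CREATE TABLE set_member (
        hash_set_id integer NOT NULL,
        md5         bytea   NOT NULL  -- digest stored inline, no surrogate id
    );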

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).

In this case, PostgreSQL's in-memory bitmap index scans should give you
most of the effect of your hash <-> ID mapping anyway.
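
E.g., with plain btree indexes such as

    CREATE INDEX set_link_set_idx  ON set_link (hash_set_id);
    CREATE INDEX set_link_hash_idx ON set_link (hash_value_id);

the executor can build in-memory bitmaps from each index and combine
them with BitmapAnd/BitmapOr at query time.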

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it.  This type of query might benefit from your
hash <-> ID mapping, however, because the working set is smaller.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: Creating large database of MD5 hash values

From: Decibel!
On Apr 11, 2008, at 10:25 AM, Alvaro Herrera wrote:

Sorry, yes, I'm behind on email... :(

> If MD5 values will be your primary data and you'll be storing millions
> of them, it would be wise to create your own datatype and operators
> with the most compact and efficient representation possible.


If you do this *please* post it. I really think it would be
worthwhile for us to have fixed-size data types for common forms of
binary data; MD5, SHA1 and SHA256 come to mind.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Creating large database of MD5 hash values

From: Bruce Momjian
Decibel! wrote:
> On Apr 11, 2008, at 10:25 AM, Alvaro Herrera wrote:
>
> Sorry, yes, I'm behind on email... :(
>
> > If MD5 values will be your primary data and you'll be storing millions
> > of them, it would be wise to create your own datatype and operators
> > with the most compact and efficient representation possible.
>
>
> If you do this *please* post it. I really think it would be
> worthwhile for us to have fixed-size data types for common forms of
> binary data; MD5, SHA1 and SHA256 come to mind.

Why do you think it would be worthwhile?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Creating large database of MD5 hash values

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> Decibel! wrote:
>> If you do this *please* post it. I really think it would be
>> worthwhile for us to have fixed-size data types for common forms of
>> binary data; MD5, SHA1 and SHA256 come to mind.

> Why do you think it would be worth while?

Given that the overhead for short bytea values is now only one byte,
not four plus padding, the argument for such types is surely a lot
weaker than it was before 8.3.
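
A quick way to see this (assuming 8.3):

    CREATE TEMP TABLE t (h bytea);
    INSERT INTO t VALUES (decode(md5('abc'), 'hex'));
    SELECT pg_column_size(h) FROM t;  -- 17 = 16 data bytes + 1 header byte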

            regards, tom lane