Re: Creating large database of MD5 hash values - Mailing list pgsql-performance
From:           Jon Stewart
Subject:        Re: Creating large database of MD5 hash values
Date:
Msg-id:         c0787d8a0804110828w52a0f614v6130fca72b85956c@mail.gmail.com
In response to: Re: Creating large database of MD5 hash values (Florian Weimer <fweimer@bfk.de>)
Responses:      Re: Creating large database of MD5 hash values
List:           pgsql-performance
> > 1. Which datatype should I use to represent the hash value? UUIDs are
> > also 16 bytes...
>
> BYTEA is slower to load and a bit inconvenient to use from DBI, but
> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
> bytes with PostgreSQL 8.3).

Can you clarify the "slower to load" point? Where is that pain point in
the postgres architecture? Storing the values in binary makes intuitive
sense to me, since the data is twice as dense, which gets you more bang
for the buck on comparisons, caching, and streaming reads. I'm not too
concerned about raw convenience, as there's not going to be a lot of code
around my application. I haven't built the thing yet, so it's hard to say
what performance will be like, but for the users the difference between
an 8-hour query that can run overnight and a 16-hour query that they must
wait on is significant.

> > 2. Does it make sense to denormalize the hash set relationships?
>
> That depends entirely on your application.

The general schema would be as follows:

HASH_VALUES
    datatype md5;
    bigint   id;

SET_LINK
    integer hash_value_id;
    integer hash_set_id;

HASH_SETS
    integer id;
    varchar name;
    // other data here

The idea is that you have named sets of hash values, and a hash value can
belong to multiple sets. The big operations will be calculating the
unions, intersections, and differences between sets. That is, I'll load a
new set into the database and then see whether it has anything in common
with another set (probably throw the results into a temp table and then
dump it out). I will also periodically run queries to determine the size
of the intersection of every pair of sets, in order to generate some nice
graphs. (A rough SQL sketch of all this is at the end of this message.)

The number of sets could grow into the thousands, but will start small.
I expect one of the sets to be very large (it could account for 50%-90%
of all hashes); the others will all be smaller, ranging from 10,000 to
1,000,000 hashes each. The total number of hashes could get into the
hundreds of millions, possibly billions.

One problem I'm worried about is the lack of concurrency in the
application. It will be relatively rare for more than one query to be in
flight at a time; this is not a high-connection application. It doesn't
sound like I'd get much marginal performance improvement out of postgres
by throwing more cores at the problem (other than dual-core; it's always
nice to have a spare core handling everything else).

Thanks very much for the comments from all. It's a pretty simple
application conceptually, just at a large scale. Other approaches
(map-reduce-ish or straightahead turnkey storage) could potentially
provide better performance, but the users feel more comfortable
maintaining databases, and the overall convenience of a database over
other systems is nice.

Jon
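P.S. For concreteness, here is a rough SQL sketch of the schema and the
set operations described above. The names simply follow the outline, and
the bytea type for the hash column is only an assumption for illustration;
uuid is the other 16-byte candidate, and nothing is settled yet.

-- Hypothetical DDL following the outline above; bytea is assumed for the
-- hash column (uuid would be the other 16-byte option).
CREATE TABLE hash_values (
    id   bigint PRIMARY KEY,
    hash bytea NOT NULL UNIQUE        -- 16-byte MD5 digest
);

CREATE TABLE hash_sets (
    id   integer PRIMARY KEY,
    name varchar NOT NULL
    -- other data here
);

CREATE TABLE set_link (
    hash_value_id bigint  NOT NULL REFERENCES hash_values(id),
    hash_set_id   integer NOT NULL REFERENCES hash_sets(id),
    PRIMARY KEY (hash_set_id, hash_value_id)
);

-- Intersection of a newly loaded set (id 1) with an existing set (id 2),
-- dumped into a temp table as described:
CREATE TEMP TABLE common_hashes AS
SELECT a.hash_value_id
FROM   set_link a
JOIN   set_link b ON b.hash_value_id = a.hash_value_id
WHERE  a.hash_set_id = 1
  AND  b.hash_set_id = 2;

-- Size of the intersection for every pair of sets (for the graphs):
SELECT a.hash_set_id, b.hash_set_id, count(*)
FROM   set_link a
JOIN   set_link b ON b.hash_value_id = a.hash_value_id
WHERE  a.hash_set_id < b.hash_set_id
GROUP  BY a.hash_set_id, b.hash_set_id;

The composite primary key on (hash_set_id, hash_value_id) is just a guess
at the access pattern; whether an additional index the other way around
is worth its cost is exactly the kind of thing I'd want to test.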