Thread: Creating large database of MD5 hash values
Hello,

I am creating a large database of MD5 hash values. I am a relative newb
with PostgreSQL (or any database, for that matter). The schema and
operation will be quite simple -- only a few tables, probably no stored
procedures -- but I may easily end up with several hundred million rows
of hash values, possibly even getting into the billions. The hash values
will be organized into logical sets, with a many-to-many relationship.
I have some questions before I set out on this endeavor, however, and
would appreciate any and all feedback, including SWAGs, WAGs, and
outright lies. :-)

I am trying to batch up operations as much as possible, so I will largely
be doing comparisons of whole sets, with bulk COPY importing. I hope to
avoid single hash value lookups as much as possible.

1. Which datatype should I use to represent the hash value? UUIDs are
   also 16 bytes...

2. Does it make sense to denormalize the hash set relationships?

3. Should I index?

4. What other data structure options would it make sense for me to choose?

Thanks in advance,

Jon
> 1. Which datatype should I use to represent the hash value? UUIDs are
> also 16 bytes...

MD5s are always 32 characters long in hex form, so probably varchar(32).

> 2. Does it make sense to denormalize the hash set relationships?

The general rule is to normalize as much as possible, then only
denormalize when absolutely necessary.

> 3. Should I index?

What sort of queries are you going to be running?

> 4. What other data structure options would it make sense for me to choose?

What sort of other data will you be needing to store?

--
Postgresql & php tutorials
http://www.designmagick.com/
* Jon Stewart:

> 1. Which datatype should I use to represent the hash value? UUIDs are
> also 16 bytes...

BYTEA is slower to load and a bit inconvenient to use from DBI, but
occupies less space on disk than TEXT or VARCHAR in hex form (17 vs. 33
bytes with PostgreSQL 8.3).

> 2. Does it make sense to denormalize the hash set relationships?

That depends entirely on your application.

> 3. Should I index?

Depends. B-tree is generally faster than Hash, even for randomly
distributed keys (like the output of a hash function).

> 4. What other data structure options would it make sense for me to
> choose?

See 2. In general, hashing is bad because it destroys locality. But in
some cases, there is no other choice.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
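As an illustration of the BYTEA-versus-hex-text point above, here is a
minimal sketch (the table and column names are made up for the example);
decode(..., 'hex') turns the 32-character hex string that md5() produces
into the 16-byte binary form:

    -- Hex text form: 32 bytes of payload per value (plus per-row overhead).
    CREATE TABLE hashes_hex (
        md5_hex varchar(32) NOT NULL
    );

    -- Binary form: 16 bytes of payload per value.
    CREATE TABLE hashes_bin (
        md5_bin bytea NOT NULL
    );

    -- md5() returns hex text; decode(..., 'hex') yields 16 raw bytes.
    INSERT INTO hashes_hex VALUES (md5('example payload'));
    INSERT INTO hashes_bin VALUES (decode(md5('example payload'), 'hex'));

    -- Converting back to hex for display:
    SELECT encode(md5_bin, 'hex') FROM hashes_bin;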
Jon Stewart wrote:

> Hello,
>
> I am creating a large database of MD5 hash values. I am a relative
> newb with PostgreSQL (or any database, for that matter). The schema
> and operation will be quite simple -- only a few tables, probably no
> stored procedures -- but I may easily end up with several hundred
> million rows of hash values, possibly even getting into the billions.
> The hash values will be organized into logical sets, with a
> many-to-many relationship. I have some questions before I set out on
> this endeavor, however, and would appreciate any and all feedback,
> including SWAGs, WAGs, and outright lies. :-) I am trying to batch up
> operations as much as possible, so I will largely be doing comparisons
> of whole sets, with bulk COPY importing. I hope to avoid single hash
> value lookups as much as possible.

If MD5 values will be your primary data and you'll be storing millions
of them, it would be wise to create your own datatype and operators with
the most compact and efficient representation possible.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
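A true custom datatype of the kind suggested above needs C-level
input/output functions and operators, which is more than a quick sketch
can show. As a lighter-weight stand-in, a domain over bytea with a length
check at least fixes the size and documents intent (the domain and table
names here are illustrative, not from the thread):

    -- A domain is not a real custom datatype with its own operators, but
    -- it enforces the fixed 16-byte length of an MD5 value.
    CREATE DOMAIN md5_hash AS bytea
        CHECK (octet_length(VALUE) = 16);

    -- Usage: the column rejects anything that is not exactly 16 bytes.
    CREATE TABLE hash_values_demo (
        hash md5_hash NOT NULL
    );

    INSERT INTO hash_values_demo VALUES (decode(md5('some data'), 'hex'));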
> > 1. Which datatype should I use to represent the hash value? UUIDs are
> > also 16 bytes...
>
> BYTEA is slower to load and a bit inconvenient to use from DBI, but
> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs. 33
> bytes with PostgreSQL 8.3).

Can you clarify the "slower to load" point? Where is that pain point in
the postgres architecture?

Storing the values in binary makes intuitive sense to me since the data
is twice as dense, thus getting you more bang for the buck on
comparisons, caching, and streaming reads. I'm not too concerned about
raw convenience, as there's not going to be a lot of code around my
application. I haven't built the thing yet, so it's hard to say what
performance will be like, but for the users the difference between an
8-hour query that can run overnight and a 16-hour query that they must
wait on is significant.

> > 2. Does it make sense to denormalize the hash set relationships?
>
> That depends entirely on your application.

The general schema would be as follows (a fuller SQL sketch appears just
after this message):

    HASH_VALUES
        datatype md5;
        bigint   id;

    SET_LINK
        integer  hash_value_id;
        integer  hash_set_id;

    HASH_SETS
        integer  id;
        varchar  name;
        // other data here

The idea is that you have named sets of hash values, and hash values can
be in multiple sets. The big operations will be to calculate the unions,
intersections, and differences between sets. That is, I'll load a new set
into the database and then see whether it has anything in common with
another set (probably throw the results into a temp table and then dump
it out). I will also periodically run queries to determine the size of
the intersection of two sets for all pairs of sets (in order to generate
some nice graphs).

The number of sets could grow into the thousands, but will start small.
One of the sets I expect to be very large (it could account for 50%-90%
of all hashes); the others will all be smaller, ranging from 10,000 to
1,000,000 in size. The total number of hashes could get into the hundreds
of millions, possibly billions.

One problem I'm worried about is the lack of concurrency in the
application. It will be relatively rare for more than one query to be in
flight at a time; this is not a high-connection application. It doesn't
sound like I'd get any marginal performance improvement out of postgres
by throwing more cores at the problem (other than dual-core; it's always
nice to have a spare core handling everything else).

Thanks very much for the comments from all. It's a pretty simple
application conceptually, just one at a large scale. Other approaches
(MapReduce-ish or straight-ahead turnkey storage) could potentially
provide better performance, but the users feel more comfortable
maintaining databases, and the overall convenience of a database over
other systems is nice.

Jon
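A minimal DDL sketch of the schema described in the message above,
assuming bytea for the hash column; the table and column names follow
the message, but the key, type, and constraint choices are assumptions:

    CREATE TABLE hash_values (
        id   bigint PRIMARY KEY,
        md5  bytea  NOT NULL          -- 16 raw bytes per hash
    );

    CREATE TABLE hash_sets (
        id   integer PRIMARY KEY,
        name varchar NOT NULL
        -- other data here
    );

    CREATE TABLE set_link (
        hash_value_id bigint  NOT NULL REFERENCES hash_values (id),
        hash_set_id   integer NOT NULL REFERENCES hash_sets (id),
        PRIMARY KEY (hash_value_id, hash_set_id)
    );

    -- Intersection of two sets (the ids 1 and 2 are placeholders):
    SELECT hash_value_id
    FROM   set_link
    WHERE  hash_set_id = 1
    INTERSECT
    SELECT hash_value_id
    FROM   set_link
    WHERE  hash_set_id = 2;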
* Jon Stewart:

>> BYTEA is slower to load and a bit inconvenient to use from DBI, but
>> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs. 33
>> bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average instead of 2, and a complex
form of double-decoding is necessary.

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version of
$sth->execute (or things like $sth->selectarray, $sth->do); you must use
explicit binding by parameter index in order to specify the type
information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think. So it's better to denormalize
in this case (if that's what you were alluding to in your original
post).

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).

In this case, PostgreSQL's in-memory bitmap indices should give you most
of the effect of your hash <-> ID mapping anyway.

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it. This type of query might benefit from your hash
<-> ID mapping, however, because the working set is smaller.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
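For the all-pairs intersection sizes discussed above, one straightforward
formulation is a self-join on the link table from the earlier sketch;
whether this stays efficient at hundreds of millions of rows is exactly
the open question raised here:

    -- Pairwise intersection sizes: self-join the link table on the hash,
    -- grouped by the pair of set ids. The < condition keeps each pair once.
    SELECT a.hash_set_id AS set_a,
           b.hash_set_id AS set_b,
           count(*)      AS common_hashes
    FROM   set_link a
    JOIN   set_link b ON a.hash_value_id = b.hash_value_id
                     AND a.hash_set_id < b.hash_set_id
    GROUP BY a.hash_set_id, b.hash_set_id;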
On Apr 11, 2008, at 10:25 AM, Alvaro Herrera wrote:

Sorry, yes, I'm behind on email... :(

> If MD5 values will be your primary data and you'll be storing millions
> of them, it would be wise to create your own datatype and operators
> with the most compact and efficient representation possible.

If you do this *please* post it. I really think it would be worthwhile
for us to have fixed-size data types for common forms of binary data;
MD5, SHA1 and SHA256 come to mind.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! wrote:

> On Apr 11, 2008, at 10:25 AM, Alvaro Herrera wrote:
>
> Sorry, yes, I'm behind on email... :(
>
> > If MD5 values will be your primary data and you'll be storing millions
> > of them, it would be wise to create your own datatype and operators
> > with the most compact and efficient representation possible.
>
> If you do this *please* post it. I really think it would be worthwhile
> for us to have fixed-size data types for common forms of binary data;
> MD5, SHA1 and SHA256 come to mind.

Why do you think it would be worthwhile?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:

> Decibel! wrote:
>> If you do this *please* post it. I really think it would be worthwhile
>> for us to have fixed-size data types for common forms of binary data;
>> MD5, SHA1 and SHA256 come to mind.

> Why do you think it would be worthwhile?

Given that the overhead for short bytea values is now only one byte, not
four plus padding, the argument for such types is surely a lot weaker
than it was before 8.3.

            regards, tom lane