Thread: Using bytea field...
Hi, I'm using a bytea field to store small images in base64. I need to know if I can compare bytea rows for equality. To know for example if the file already exists in the database, this is possible with bytea? Best Regads,
On 3/8/2011 12:28 PM, Andre Lopes wrote: > Hi, > > I'm using a bytea field to store small images in base64. I need to > know if I can compare bytea rows for equality. To know for example if > the file already exists in the database, this is possible with bytea? > > Best Regads, > You dont need to use both base64 and bytea. You can store base64 in text field... or just store the photo as-is into bytea. To answer your question: it would be faster if you computed an md5 (or sha or whatever) and stored it in the db, then you could check to see if an image exists by searching for the md5, which would be way faster, an send a lot less data over the wire. -Andy
Not sure if it is possible directly but have you considered (or you might have to) generating an MD5 hash of the data (possiblyafter encoding) and then comparing the hashes? For a small image it may not matter but if you plan on making thecheck with any frequency (and multiple times against the same record) doing a one-time hash generation is going to bequite a bit more efficient. Just make sure you know how you are going to keep the hash and the binary contents in-sync. You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to opineon that particular option. David J. -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andre Lopes Sent: Tuesday, March 08, 2011 1:29 PM To: postgresql Forums Subject: [GENERAL] Using bytea field... Hi, I'm using a bytea field to store small images in base64. I need to know if I can compare bytea rows for equality. To knowfor example if the file already exists in the database, this is possible with bytea? Best Regads, -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Tue, Mar 8, 2011 at 8:00 PM, David Johnston <polobo@yahoo.com> wrote: > You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to opineon that particular option. Yes, a functional index on MD5(your_bytea_column) will work and is the way to go for equality comparisons on bytea values. You can use the built-in MD5() function or one of the various hash functions in pgcrypto. You can also save some index size by only storing the bytea-encoded md5 result in the index, something like: CREATE INDEX bigcol_idx ON foo (decode(MD5(bigcol), 'hex')); And then run formulate your queries similarly so they use the index: test=# EXPLAIN ANALYZE SELECT * FROM foo WHERE (decode(md5(bigcol), 'hex')) = (decode(md5('4'), 'hex')); QUERY PLAN ------------------------------------------------------------------------------- --------------------------------- Index Scan using bigcol_idx on foo (cost=0.00..8.28 rows=1 width=4) (actual t ime=0.032..0.034 rows=1 loops=1) Index Cond: (decode(md5(bigcol), 'hex'::text) = '\xa87ff679a2f3e71d9181a67b7 542122c'::bytea) Total runtime: 0.095 ms (3 rows) Josh
MD5 is not collision resistant (using the immortal words of wikipedia http://en.wikipedia.org/wiki/MD5). This means that it is possible that multiple images will return the same md5 hash. The question is, if it screws up and says that an image already exists and then returns a different image when querying for it, how bad would that be. I've seen a lot of discussions in the past few years about how problematic the md5 approach is. Sim On 03/08/2011 09:06 PM, Andy Colson wrote: > On 3/8/2011 12:28 PM, Andre Lopes wrote: >> Hi, >> >> I'm using a bytea field to store small images in base64. I need to >> know if I can compare bytea rows for equality. To know for example if >> the file already exists in the database, this is possible with bytea? >> >> Best Regads, >> > > You dont need to use both base64 and bytea. You can store base64 in > text field... or just store the photo as-is into bytea. > > To answer your question: it would be faster if you computed an md5 (or > sha or whatever) and stored it in the db, then you could check to see > if an image exists by searching for the md5, which would be way > faster, an send a lot less data over the wire. > > -Andy >
On 03/09/2011 01:27 PM, Sim Zacks wrote: > This means that it is possible that multiple images will return the > same md5 hash. OTOH, if you had an indexed md5 hash and compared the image only to the matches, that would be a fast and accurate querying method
On Wed, 09 Mar 2011 13:27:16 +0200, Sim Zacks wrote: > MD5 is not collision resistant (using the immortal words of wikipedia > http://en.wikipedia.org/wiki/MD5). > > This means that it is possible that multiple images will return the > same md5 hash. > > The question is, if it screws up and says that an image already > exists and then returns a different image when querying for it, how > bad would that be. > > > I've seen a lot of discussions in the past few years about how > problematic the md5 approach is. > > > Sim > > > On 03/08/2011 09:06 PM, Andy Colson wrote: > >> On 3/8/2011 12:28 PM, Andre Lopes wrote: >>> Hi, >>> >>> I'm using a bytea field to store small images in base64. I need to >>> know if I can compare bytea rows for equality. To know for example >>> if >>> the file already exists in the database, this is possible with >>> bytea? >>> >>> Best Regads, >>> >> >> You dont need to use both base64 and bytea. You can store base64 in >> text field... or just store the photo as-is into bytea. >> >> To answer your question: it would be faster if you computed an md5 >> (or sha or whatever) and stored it in the db, then you could check to >> see if an image exists by searching for the md5, which would be way >> faster, an send a lot less data over the wire. >> >> -Andy >> Every hash algorithm is not collision resistant, and in theory You need to check equality of images. Hash function says that if hashes are different, then images are different. It not says if hashes are same then images are same. Regards, Radek
> On 03/08/2011 09:06 PM, Andy Colson wrote: > >> On 3/8/2011 12:28 PM, Andre Lopes wrote: >>> Hi, >>> >>> I'm using a bytea field to store small images in base64. I need to >>> know if I can compare bytea rows for equality. To know for example if >>> the file already exists in the database, this is possible with bytea? >>> >>> Best Regads, >>> >> >> You dont need to use both base64 and bytea. You can store base64 in >> text field... or just store the photo as-is into bytea. >> >> To answer your question: it would be faster if you computed an md5 (or >> sha or whatever) and stored it in the db, then you could check to see >> if an image exists by searching for the md5, which would be way >> faster, an send a lot less data over the wire. >> >> -Andy >> > > On 3/9/2011 5:27 AM, Sim Zacks wrote: > MD5 is not collision resistant (using the immortal words of wikipedia > http://en.wikipedia.org/wiki/MD5). > > This means that it is possible that multiple images will return the same > md5 hash. > > The question is, if it screws up and says that an image already exists > and then returns a different image when querying for it, how bad would > that be. > > > I've seen a lot of discussions in the past few years about how > problematic the md5 approach is. > > > Sim > > It'll never happen: http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability Sure you CAN go out of your way to generate collisions, but I'd bet money you never see one from your setup. The probability is extremely slim. And if thats too much of a chance, use sha2, its mind numbingly slim. If you were doing cryptography it would be a problem, yes, but not checking file equality. -Andy
> > The question is, if it screws up and says that an image already exists > > and then returns a different image when querying for it, how bad would > > that be. > > > > > It'll never happen: > > http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability > > > Sure you CAN go out of your way to generate collisions, but I'd bet > money you never see one from your setup. > > The probability is extremely slim. And if thats too much of a chance, > use sha2, its mind numbingly slim. > > If you were doing cryptography it would be a problem, yes, but not > checking file equality. > > -Andy Never is a long time. The question that I asked is precisely: how much money you would bet that you'll never hit a collision. It depends on the use case. If you are talking about privacy issues, which can include lawsuits, loss of reputation and/or damages, then I wouldn't take that risk, even on sha2. Especially not with all the publicly available documentation explaining why not to do it. If you are talking about a minor inconvenience or professional pride because the wrong image showed up, or the right image was never stored, then it may be worth the risk. Sim
On Thu, Mar 10, 2011 at 2:09 AM, Sim Zacks <sim@compulab.co.il> wrote: > >> > The question is, if it screws up and says that an image already exists >> > and then returns a different image when querying for it, how bad would >> > that be. >> > >> >> >> It'll never happen: >> >> >> http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability >> >> >> Sure you CAN go out of your way to generate collisions, but I'd bet >> money you never see one from your setup. >> >> The probability is extremely slim. And if thats too much of a chance, >> use sha2, its mind numbingly slim. >> >> If you were doing cryptography it would be a problem, yes, but not >> checking file equality. >> >> -Andy > > Never is a long time. The question that I asked is precisely: how much money > you would bet that you'll never hit a collision. It depends on the use case. > If you are talking about privacy issues, which can include lawsuits, loss of > reputation and/or damages, then I wouldn't take that risk, even on sha2. > Especially not with all the publicly available documentation explaining why > not to do it. If you are talking about a minor inconvenience or > professional pride because the wrong image showed up, or the right image was > never stored, then it may be worth the risk. Regardless of the intended use, I would bet every dollar I've ever made, will make, could borrow, beg steal, etc vs 1 of your dollars and happily collect it when I won the bet. See here: (http://en.wikipedia.org/wiki/Birthday_attack) and look at the table of odds vs population size...your statement is not in line with mathematical reality, and from a risk standpoint there is a large number of things to be looking at before sha2 collision such as drive bit error rates, spontaneous combustion, etc. AFAIK, even sha1 collisions have never been found in the wild, and the zfs deduplication system uses sha1 to deduplicate disk blocks, as does bit torrent. In fact many computing systems you rely on make hash safety assumptions weaker than sha2. Schneier speculates that we may see a collision soon here: http://blog.valerieaurora.org/2009/06/25/sha-1-collision-expected-within-a-year/. A small number of duplicate accidental md5 hashes have been found in the wild. merlin