Thread: Using bytea field...

Using bytea field...

From
Andre Lopes
Date:
Hi,

I'm using a bytea field to store small images in base64. I need to
know if I can compare bytea rows for equality. To know for example if
the file already exists in the database, this is possible with bytea?

Best Regads,

Re: Using bytea field...

From
Andy Colson
Date:
On 3/8/2011 12:28 PM, Andre Lopes wrote:
> Hi,
>
> I'm using a bytea field to store small images in base64. I need to
> know if I can compare bytea rows for equality. To know for example if
> the file already exists in the database, this is possible with bytea?
>
> Best Regads,
>

You dont need to use both base64 and bytea.  You can store base64 in
text field... or just store the photo as-is into bytea.

To answer your question: it would be faster if you computed an md5 (or
sha or whatever) and stored it in the db, then you could check to see if
an image exists by searching for the md5, which would be way faster, an
send a lot less data over the wire.

-Andy

Re: Using bytea field...

From
"David Johnston"
Date:
Not sure if it is possible directly but have you considered (or you might have to) generating an MD5 hash of the data
(possiblyafter encoding) and then comparing the hashes?  For a small image it may not matter but if you plan on making
thecheck with any frequency (and multiple times against the same record) doing a one-time hash generation is going to
bequite a bit more efficient. 

Just make sure you know how you are going to keep the hash and the binary contents in-sync.

You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to
opineon that particular option. 

David J.

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andre Lopes
Sent: Tuesday, March 08, 2011 1:29 PM
To: postgresql Forums
Subject: [GENERAL] Using bytea field...

Hi,

I'm using a bytea field to store small images in base64. I need to know if I can compare bytea rows for equality. To
knowfor example if the file already exists in the database, this is possible with bytea? 

Best Regads,

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: Using bytea field...

From
Josh Kupershmidt
Date:
On Tue, Mar 8, 2011 at 8:00 PM, David Johnston <polobo@yahoo.com> wrote:
> You could avoid the synchronization issues by putting the hash in an index...in theory...I'd wait for someone else to
opineon that particular option. 

Yes, a functional index on MD5(your_bytea_column) will work and is the
way to go for equality comparisons on bytea values. You can use the
built-in MD5() function or one of the various hash functions in
pgcrypto. You can also save some index size by only storing the
bytea-encoded md5 result in the index, something like:

CREATE INDEX bigcol_idx ON foo (decode(MD5(bigcol), 'hex'));

And then run formulate your queries similarly so they use the index:

test=# EXPLAIN ANALYZE SELECT * FROM foo
            WHERE (decode(md5(bigcol), 'hex')) = (decode(md5('4'), 'hex'));

                                                   QUERY PLAN

-------------------------------------------------------------------------------
---------------------------------
 Index Scan using bigcol_idx on foo  (cost=0.00..8.28 rows=1 width=4) (actual t
ime=0.032..0.034 rows=1 loops=1)
   Index Cond: (decode(md5(bigcol), 'hex'::text) = '\xa87ff679a2f3e71d9181a67b7
542122c'::bytea)
 Total runtime: 0.095 ms
(3 rows)


Josh

Re: Using bytea field...

From
Sim Zacks
Date:
MD5 is not collision resistant (using the immortal words of wikipedia
http://en.wikipedia.org/wiki/MD5).

This means that it is possible that multiple images will return the same
md5 hash.

The question is, if it screws up and says that an image already exists
and then returns a different image when querying for it, how bad would
that be.


I've seen a lot of discussions in the past few years about how
problematic the md5 approach is.


Sim


On 03/08/2011 09:06 PM, Andy Colson wrote:

> On 3/8/2011 12:28 PM, Andre Lopes wrote:
>> Hi,
>>
>> I'm using a bytea field to store small images in base64. I need to
>> know if I can compare bytea rows for equality. To know for example if
>> the file already exists in the database, this is possible with bytea?
>>
>> Best Regads,
>>
>
> You dont need to use both base64 and bytea.  You can store base64 in
> text field... or just store the photo as-is into bytea.
>
> To answer your question: it would be faster if you computed an md5 (or
> sha or whatever) and stored it in the db, then you could check to see
> if an image exists by searching for the md5, which would be way
> faster, an send a lot less data over the wire.
>
> -Andy
>


Re: Using bytea field...

From
Sim Zacks
Date:
On 03/09/2011 01:27 PM, Sim Zacks wrote:
> This means that it is possible that multiple images will return the
> same md5 hash.
OTOH, if you had an indexed md5 hash and compared the image only to the
matches, that would be a fast and accurate querying method

Re: Using bytea field...

From
rsmogura
Date:
 On Wed, 09 Mar 2011 13:27:16 +0200, Sim Zacks wrote:
> MD5 is not collision resistant (using the immortal words of wikipedia
> http://en.wikipedia.org/wiki/MD5).
>
> This means that it is possible that multiple images will return the
> same md5 hash.
>
> The question is, if it screws up and says that an image already
> exists and then returns a different image when querying for it, how
> bad would that be.
>
>
> I've seen a lot of discussions in the past few years about how
> problematic the md5 approach is.
>
>
> Sim
>
>
> On 03/08/2011 09:06 PM, Andy Colson wrote:
>
>> On 3/8/2011 12:28 PM, Andre Lopes wrote:
>>> Hi,
>>>
>>> I'm using a bytea field to store small images in base64. I need to
>>> know if I can compare bytea rows for equality. To know for example
>>> if
>>> the file already exists in the database, this is possible with
>>> bytea?
>>>
>>> Best Regads,
>>>
>>
>> You dont need to use both base64 and bytea.  You can store base64 in
>> text field... or just store the photo as-is into bytea.
>>
>> To answer your question: it would be faster if you computed an md5
>> (or sha or whatever) and stored it in the db, then you could check to
>> see if an image exists by searching for the md5, which would be way
>> faster, an send a lot less data over the wire.
>>
>> -Andy
>>

 Every hash algorithm is not collision resistant, and in theory You need
 to check equality of images. Hash function says that if hashes are
 different, then images are different. It not says if hashes are same
 then images are same.

 Regards,
 Radek

Re: Using bytea field...

From
Andy Colson
Date:
> On 03/08/2011 09:06 PM, Andy Colson wrote:
>
>> On 3/8/2011 12:28 PM, Andre Lopes wrote:
>>> Hi,
>>>
>>> I'm using a bytea field to store small images in base64. I need to
>>> know if I can compare bytea rows for equality. To know for example if
>>> the file already exists in the database, this is possible with bytea?
>>>
>>> Best Regads,
>>>
>>
>> You dont need to use both base64 and bytea. You can store base64 in
>> text field... or just store the photo as-is into bytea.
>>
>> To answer your question: it would be faster if you computed an md5 (or
>> sha or whatever) and stored it in the db, then you could check to see
>> if an image exists by searching for the md5, which would be way
>> faster, an send a lot less data over the wire.
>>
>> -Andy
>>
>
>

On 3/9/2011 5:27 AM, Sim Zacks wrote:
 > MD5 is not collision resistant (using the immortal words of wikipedia
 > http://en.wikipedia.org/wiki/MD5).
 >
 > This means that it is possible that multiple images will return the same
 > md5 hash.
 >
 > The question is, if it screws up and says that an image already exists
 > and then returns a different image when querying for it, how bad would
 > that be.
 >
 >
 > I've seen a lot of discussions in the past few years about how
 > problematic the md5 approach is.
 >
 >
 > Sim
 >
 >


It'll never happen:

http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability

Sure you CAN go out of your way to generate collisions, but I'd bet
money you never see one from your setup.

The probability is extremely slim.  And if thats too much of a chance,
use sha2, its mind numbingly slim.

If you were doing cryptography it would be a problem, yes, but not
checking file equality.

-Andy

Re: Using bytea field...

From
Sim Zacks
Date:
> > The question is, if it screws up and says that an image already exists
> > and then returns a different image when querying for it, how bad would
> > that be.
> >
>
>
> It'll never happen:
>
> http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability
>
>
> Sure you CAN go out of your way to generate collisions, but I'd bet
> money you never see one from your setup.
>
> The probability is extremely slim.  And if thats too much of a chance,
> use sha2, its mind numbingly slim.
>
> If you were doing cryptography it would be a problem, yes, but not
> checking file equality.
>
> -Andy
Never is a long time. The question that I asked is precisely: how much
money you would bet that you'll never hit a collision. It depends on the
use case. If you are talking about privacy issues, which can include
lawsuits, loss of reputation and/or damages, then I wouldn't take that
risk, even on sha2. Especially not with all the publicly available
documentation explaining why not to do it.  If you are talking about a
minor inconvenience or professional pride because the wrong image showed
up, or the right image was never stored, then it may be worth the risk.

Sim

Re: Using bytea field...

From
Merlin Moncure
Date:
On Thu, Mar 10, 2011 at 2:09 AM, Sim Zacks <sim@compulab.co.il> wrote:
>
>> > The question is, if it screws up and says that an image already exists
>> > and then returns a different image when querying for it, how bad would
>> > that be.
>> >
>>
>>
>> It'll never happen:
>>
>>
>> http://stackoverflow.com/questions/862346/how-do-i-assess-the-hash-collision-probability
>>
>>
>> Sure you CAN go out of your way to generate collisions, but I'd bet
>> money you never see one from your setup.
>>
>> The probability is extremely slim.  And if thats too much of a chance,
>> use sha2, its mind numbingly slim.
>>
>> If you were doing cryptography it would be a problem, yes, but not
>> checking file equality.
>>
>> -Andy
>
> Never is a long time. The question that I asked is precisely: how much money
> you would bet that you'll never hit a collision. It depends on the use case.
> If you are talking about privacy issues, which can include lawsuits, loss of
> reputation and/or damages, then I wouldn't take that risk, even on sha2.
> Especially not with all the publicly available documentation explaining why
> not to do it.  If you are talking about a minor inconvenience or
> professional pride because the wrong image showed up, or the right image was
> never stored, then it may be worth the risk.

Regardless of the intended use, I would bet every dollar I've ever
made, will make, could borrow, beg steal, etc vs 1 of your dollars and
happily collect it when I won the bet.  See  here:
(http://en.wikipedia.org/wiki/Birthday_attack) and look at the table
of odds vs population size...your statement is not in line with
mathematical reality, and from a risk standpoint there is a large
number of things to be looking at before sha2 collision such as drive
bit error rates, spontaneous combustion, etc.

AFAIK, even sha1 collisions have never been found in the wild, and the
zfs deduplication system uses sha1 to deduplicate disk blocks, as does
bit torrent. In fact many computing systems you rely on make hash
safety assumptions weaker than sha2.

Schneier speculates that we may see a collision soon here:
http://blog.valerieaurora.org/2009/06/25/sha-1-collision-expected-within-a-year/.
A small number of duplicate accidental md5 hashes have been found in
the wild.

merlin