Thread: cyclical redundancy checksum algorithm(s)?
I just finished reading one of Ralph Kimball's books. In it he mentions
something called a cyclic redundancy checksum (CRC) function. A CRC function
is a hash function that generates a checksum.

I am wondering a few things. A CRC function would be extremely useful and
time saving in determining if a row needs to be updated or not (are the
values the same? if yes, don't update; if not, update). In fact Ralph
Kimball states that this is a way to check for changes. You just have an
extra column for the CRC checksum. When you go to update data, generate a
CRC checksum and compare it to the one in the CRC column. If they are the
same, your data has not changed.

Yet what happens if there is a collision of the checksum for a row?

Ralph Kimball did not mention which algorithm to use, nor how to create a
CRC function that would not have collisions. He does have a PhD, and is a
leader in the OLAP data warehouse world, so I assume there is a good
solution.

Is there a CRC function in postgresql? If not, what algorithm would I need
to use to create one in pl/pgsql?

regards,
karen
"Karen Hill" <karen_hill22@yahoo.com> writes: > Ralph Kimball states that this is a way to check for changes. You just > have an extra column for the crc checksum. When you go to update data, > generate a crc checksum and compare it to the one in the crc column. > If they are same, your data has not changed. You sure that's actually what he said? A change in CRC proves the data changed, but lack of a change does not prove it didn't. People do sometimes use this logic in connection with much wider "summary" functions, such as an MD5 hash. I wouldn't trust it at all with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of collision. regards, tom lane
Gene Wirchenko wrote:
> >I just finished reading one of Ralph Kimball's books. In it he
> >mentions something called a cyclic redundancy checksum (CRC)
> >function. A CRC function is a hash function that generates a checksum.
> >
> >I am wondering a few things. A CRC function would be extremely useful
> >and time saving in determining if a row needs to be updated or not (are
> >the values the same? if yes, don't update; if not, update). In fact
> >Ralph Kimball states that this is a way to check for changes. You just
> >have an extra column for the CRC checksum. When you go to update data,
> >generate a CRC checksum and compare it to the one in the CRC column.
> >If they are the same, your data has not changed.
> >
> >Yet what happens if there is a collision of the checksum for a row?
>
>      Then you get told that no change has occurred when one has. I
> would call this an error.

That's exactly what I thought when I read that in his book. I was thinking
back to the sha1 and md5 algorithms; maybe a special CRC algorithm is safe
from this.

> >Ralph Kimball did not mention which algorithm to use, nor how to create
> >a CRC function that would not have collisions. He does have a PhD,
> >and is a leader in the OLAP data warehouse world, so I assume there is
> >a good solution.
>
>      Your error. Having a Ph.D. does not stop someone from being
> wrong.

> >Is there a CRC function in postgresql? If not, what algorithm would I
> >need to use to create one in pl/pgsql?
>
>      I think you are focusing on irrelevant minutiae. Is the
> performance really that bad that you have to go to odd lengths to up it?

It is not for performance. It is to save time writing a lot of stored
procedure code. When you have an updatable view with 70 values that need to
be checked for changes, a checksum starts to sound very appealing.
In response to Tom Lane <tgl@sss.pgh.pa.us>:
> "Karen Hill" <karen_hill22@yahoo.com> writes:
> > Ralph Kimball states that this is a way to check for changes. You just
> > have an extra column for the CRC checksum. When you go to update data,
> > generate a CRC checksum and compare it to the one in the CRC column.
> > If they are the same, your data has not changed.
>
> You sure that's actually what he said? A change in CRC proves the data
> changed, but lack of a change does not prove it didn't.
>
> People do sometimes use this logic in connection with much wider
> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> collision.

I remember reading some high-level stuff on rsync and how it uses multiple
checksums in tandem. Don't know the math, though -- if both the SHA256 and
the MD5 match, what are the chances that it's changed?

You also hit diminishing returns ... after calculating so many checksums,
you might be better off just checking the data itself, unless that data is
very large.

--
Bill Moran
Collaborative Fusion Inc.
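As a rough answer to the question above: under the idealized assumption
that the two functions behave as independent uniform hashes, the chance
that a particular changed row slips past both checks is the product of the
individual collision probabilities,

    P(\text{both match} \mid \text{data changed}) \approx 2^{-256} \cdot 2^{-128} = 2^{-384}

which also shows why the returns diminish so quickly: the first wide hash
already pushes the risk far below any practical concern, so a second one
buys almost nothing. (Real hash functions are not truly independent, so
treat this as an idealized figure rather than a guarantee.)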
On 09/27/06 16:02, Karen Hill wrote:
> Gene Wirchenko wrote:
[snip]
>>> Yet what happens if there is a collision of the checksum for a row?
>> Then you get told that no change has occurred when one has. I
>> would call this an error.
>
> That's exactly what I thought when I read that in his book. I was
> thinking back to the sha1 and md5 algorithms; maybe a special CRC
> algorithm is safe from this.

I doubt it. The typical CRC is 32 bits, whereas the MD5 hash is 128 bits
and SHA1 is 160 bits.

--
Ron Johnson, Jr.
Jefferson LA USA
Tom Lane wrote:
> "Karen Hill" <karen_hill22@yahoo.com> writes:
> > Ralph Kimball states that this is a way to check for changes. You just
> > have an extra column for the CRC checksum. When you go to update data,
> > generate a CRC checksum and compare it to the one in the CRC column.
> > If they are the same, your data has not changed.
>
> You sure that's actually what he said? A change in CRC proves the data
> changed, but lack of a change does not prove it didn't.

On page 100 in the book, "The Data Warehouse Toolkit" Second Edition, Ralph
Kimball writes the following:

"Rather than checking each field to see if something has changed, we
instead compute a checksum for the entire row all at once. A cyclic
redundancy checksum (CRC) algorithm helps us quickly recognize that a wide
messy row has changed without looking at each of its constituent fields."

On page 360 he writes:

"To quickly determine if rows have changed, we rely on a cyclic redundancy
checksum (CRC) algorithm. If the CRC is identical for the extracted record
and the most recent row in the master table, then we ignore the extracted
record. We don't need to check every column to be certain that the two rows
match exactly."

> People do sometimes use this logic in connection with much wider
> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> collision.
Stepping back a bit...

Why not use an update trigger on the affected tables to record a
lastupdated timestamp value when the record is changed. Surely this is
simpler than computing some kind of row hash?

John

Karen Hill wrote:
> Tom Lane wrote:
>> "Karen Hill" <karen_hill22@yahoo.com> writes:
>>> Ralph Kimball states that this is a way to check for changes. You just
>>> have an extra column for the CRC checksum. When you go to update data,
>>> generate a CRC checksum and compare it to the one in the CRC column.
>>> If they are the same, your data has not changed.
>> You sure that's actually what he said? A change in CRC proves the data
>> changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
>> People do sometimes use this logic in connection with much wider
>> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
>> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
>> collision.
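John's trigger idea as a minimal sketch; the table and column names
continue the hypothetical customer_dim example above, and EXECUTE PROCEDURE
is the traditional PostgreSQL spelling:

    ALTER TABLE customer_dim ADD COLUMN lastupdated timestamptz;

    CREATE OR REPLACE FUNCTION touch_lastupdated() RETURNS trigger AS $$
    BEGIN
        -- Stamp the row on every UPDATE; note the caveat raised
        -- downthread that this fires even when nothing actually changed.
        NEW.lastupdated := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_dim_touch
        BEFORE UPDATE ON customer_dim
        FOR EACH ROW EXECUTE PROCEDURE touch_lastupdated();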
>> You sure that's actually what he said? A change in CRC proves the data >> changed, but lack of a change does not prove it didn't. > > "To quickly determine if rows have changed, we rely on a cyclic > redundancy checksum (CRC) algorithm. If the CRC is identical for the > >> "summary" functions, such as an MD5 hash. I wouldn't trust it at all >> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of >> collision. Small example of collisions for crc32: 0x38ee5531 Hundemarke 92294 0x59471e4f raciner tranchefiler 0x947bb6c0 Betriebsteile 4245 I had make a lot of work when choosing hash function for tsearch2. Also, I had find that popular hash algorithms produce more collision for non-ascii languages... CRC32 is more "smooth". On dictionary with 332296 unique words CRC32 produces 11 collisions, perl's hash function - 35, pgsql's hash_any - 12. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
> Why not use an update trigger on the affected tables to record a
> lastupdated timestamp value when the record is changed.
>
> Surely this is simpler than computing some kind of row hash?

It depends on how you define "change". With the triggers you propose, an

    UPDATE table SET col = col;

is a change because there was a write operation. Any hash function's output
would be "no change" because the actual data did not change. An update
might entail an expensive update of some external data, so you might want
to make sure that data really got modified.

--
Joachim Wieland
joe@mcknight.de
GPG key available
Ah, good point! Missed the subtlety of what was being asked.

John

Joachim Wieland wrote:
> On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
>> Why not use an update trigger on the affected tables to record a
>> lastupdated timestamp value when the record is changed.
>
>> Surely this is simpler than computing some kind of row hash?
>
> It depends on how you define "change". With the triggers you propose, an
>
>     UPDATE table SET col = col;
>
> is a change because there was a write operation. Any hash function's
> output would be "no change" because the actual data did not change. An
> update might entail an expensive update of some external data, so you
> might want to make sure that data really got modified.
Joachim Wieland schrieb:
> On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
>> Why not use an update trigger on the affected tables to record a
>> lastupdated timestamp value when the record is changed.
>
>> Surely this is simpler than computing some kind of row hash?
>
> It depends on how you define "change". With the triggers you propose, an
>
>     UPDATE table SET col = col;
>
> is a change because there was a write operation. Any hash function's
> output would be "no change" because the actual data did not change. An
> update might entail an expensive update of some external data, so you
> might want to make sure that data really got modified.

You can always compare the old and new values in the trigger function to
detect that kind of non-update.

bkw
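A sketch of that refinement, assuming a PostgreSQL release where row-wise
IS DISTINCT FROM is available (8.2 or later); it revises the hypothetical
touch_lastupdated() trigger from earlier in the thread:

    CREATE OR REPLACE FUNCTION touch_if_changed() RETURNS trigger AS $$
    BEGIN
        -- Only stamp the row when some column value actually differs,
        -- so "UPDATE t SET col = col" no longer counts as a change.
        IF OLD IS DISTINCT FROM NEW THEN
            NEW.lastupdated := now();
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;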
"Karen Hill" <karen_hill22@yahoo.com> wrote: >I just finished reading one of Ralph Kimball's books. In it he >mentions something called a cyclical redundancy checksum (crc) >function. A crc function is a hash function that generates a checksum. > >I am wondering a few things. A crc function would be extremely useful >and time saving in determining if a row needs to be updated or not (are >the values the same, if yes don't update, if not update). In fact >Ralph Kimball states that this is a way to check for changes. You just >have an extra column for the crc checksum. When you go to update data, >generate a crc checksum and compare it to the one in the crc column. >If they are same, your data has not changed. > >Yet what happens if there is a collision of the checksum for a row? Then you get told that no change has occurred when one has. I would call this an error. >Ralph Kimball did not mention which algorithm to use, nor how to create >a crc function that would not have collisions. He does have a PhD, >and a leader in the OLAP datawarehouse world, so I assume there is a >good solution. Your error. Having a Ph.D. does not stop someone from being wrong. >Is there a crc function in postgresql? If not what algorithm would I >need to use to create one in pl/pgsql? I think you are focusing on irrelevant minutiae. Is the performance really that bad that you have go to odd lengths to up it? If you think so, is this because you have actually tested it, or is it just a feeling? You could be setting yourself up for a lot of work that may be error-prone and just plain not work and for very little. Sincerely, Gene Wirchenko Computerese Irregular Verb Conjugation: I have preferences. You have biases. He/She has prejudices.
Karen Hill wrote:
> Gene Wirchenko wrote:
>
>>> Yet what happens if there is a collision of the checksum for a row?
>>
>> Then you get told that no change has occurred when one has. I
>> would call this an error.
>
> That's exactly what I thought when I read that in his book. I was
> thinking back to the sha1 and md5 algorithms; maybe a special CRC
> algorithm is safe from this.

Because of the Pigeonhole Principle, it is not possible that some careful
selection of hash algorithm will be "safe" from collisions. Collisions are
always possible (assuming the input size is larger than the output size,
which in practice it always is, because hashing would be pointless
otherwise). One addresses the risk of collisions probabilistically.

Although sha1 and md5 are currently a la mode, they are specifically
cryptographic hashes, and as a result they have been chosen for properties
that you don't care about if all you want is a regular hash. CRC is
probably a fine choice; it's popular and well-understood. The way one deals
with the possibility of hash collisions is to choose an appropriate hash
size for the task at hand.

In general, though, I have to agree with Gene; I don't see that this is a
fruitful avenue to pursue unless there are some *very* specific
circumstances that warrant it.

Marshall
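Concretely, the probabilistic argument is the birthday bound: assuming an
ideal uniform b-bit hash over n distinct rows, the chance of at least one
collision somewhere among them is approximately

    P \approx 1 - e^{-n(n-1)/2^{b+1}}

so "choosing an appropriate hash size" means picking b large enough to push
P below an acceptable threshold for the largest n you expect to handle.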
"Karen Hill" <karen_hill22@yahoo.com> wrote in message news:1159387540.543806.256650@i3g2000cwc.googlegroups.com... > I just finished reading one of Ralph Kimball's books. In it he > mentions something called a cyclical redundancy checksum (crc) > function. A crc function is a hash function that generates a checksum. > > I am wondering a few things. A crc function would be extremely useful > and time saving in determining if a row needs to be updated or not (are > the values the same, if yes don't update, if not update). In fact > Ralph Kimball states that this is a way to check for changes. You just > have an extra column for the crc checksum. When you go to update data, > generate a crc checksum and compare it to the one in the crc column. > If they are same, your data has not changed. > > Yet what happens if there is a collision of the checksum for a row? > > Ralph Kimball did not mention which algorithm to use, nor how to create > a crc function that would not have collisions. He does have a PhD, > and a leader in the OLAP datawarehouse world, so I assume there is a > good solution. > > Is there a crc function in postgresql? If not what algorithm would I > need to use to create one in pl/pgsql? > > regards, > karen > I generally like Kimball, but this idea sounds bogus. Of course, if you only want your data warehouse to be right some of the time....
Karen Hill wrote:
> Tom Lane wrote:
>> "Karen Hill" <karen_hill22@yahoo.com> writes:
>>> Ralph Kimball states that this is a way to check for changes. You just
>>> have an extra column for the CRC checksum. When you go to update data,
>>> generate a CRC checksum and compare it to the one in the CRC column.
>>> If they are the same, your data has not changed.
>> You sure that's actually what he said? A change in CRC proves the data
>> changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
>> People do sometimes use this logic in connection with much wider
>> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
>> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
>> collision.

An MD5 hash value has 128 bits; an SHA1 hash value has 160 bits. Roughly
speaking, a 32-bit checksum gives you a 1 in 4 billion chance that a change
won't be detected; a 64-bit checksum gives you a 1 in 16 billion billion
chance; and a 128-bit checksum therefore a 1 in 256 billion billion billion
billion chance.

Every hash is subject to collisions - there are an indefinitely large
number of possible input values (of widely differing lengths) that produce
the same output. However, the chance of seeing two such values is very
small. And, don't forget, if you update a row with a few changes (so some
bytes change but most do not), the chance that the new row produces the
same checksum as the old row is very small with a well-designed checksum.
Most updates will produce small changes in the data, but big changes in the
checksum.

The other issue is how long does it take to compute the checksum compared
with doing a field-by-field check? There are many facets to that answer -
related to caching of old values and comparison with the new.

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler@earthlink.net, jleffler@us.ibm.com
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/
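Plugging numbers into the birthday bound given earlier makes the scale
difference vivid (idealized uniform hashes assumed). For a table of a
million rows, a 32-bit checksum gives an exponent of

    n(n-1)/2^{b+1} = 10^6(10^6 - 1)/2^{33} \approx 116

so a collision somewhere in the table is essentially certain; with a
128-bit hash such as MD5 the same quantity, and hence the collision
probability, is only about 1.5 \cdot 10^{-27}. The per-row figures quoted
above (1 in 4 billion, and so on) answer the separate question of one
particular changed row going undetected, which is 2^{-b} for a b-bit hash.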
Karen Hill wrote:
> Tom Lane wrote:
> > "Karen Hill" <karen_hill22@yahoo.com> writes:
> > > Ralph Kimball states that this is a way to check for changes. You just
> > > have an extra column for the CRC checksum. When you go to update data,
> > > generate a CRC checksum and compare it to the one in the CRC column.
> > > If they are the same, your data has not changed.
> >
> > You sure that's actually what he said? A change in CRC proves the data
> > changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."

Be careful with Kimball. Read him to get the industry argot, but treat his
ideas on design and implementation with some healthy scepticism. His bases
are often shaky or obscure, and sometimes just plain wrong.

--
David Portas
Karen Hill wrote:
> Tom Lane wrote:
> > "Karen Hill" <karen_hill22@yahoo.com> writes:
> > > Ralph Kimball states that this is a way to check for changes. You just
> > > have an extra column for the CRC checksum. When you go to update data,
> > > generate a CRC checksum and compare it to the one in the CRC column.
> > > If they are the same, your data has not changed.
> >
> > You sure that's actually what he said? A change in CRC proves the data
> > changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
> > People do sometimes use this logic in connection with much wider
> > "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> > with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> > collision.

How do you calculate a checksum on a binary file stored in a database?
What part of the file are you going to use for computations?
Karen Hill schrieb:
> I just finished reading one of Ralph Kimball's books. In it he
> mentions something called a cyclic redundancy checksum (CRC)
> function. A CRC function is a hash function that generates a checksum.
>
> I am wondering a few things. A CRC function would be extremely useful
> and time saving in determining if a row needs to be updated or not (are
> the values the same? if yes, don't update; if not, update). In fact
> Ralph Kimball states that this is a way to check for changes. You just
> have an extra column for the CRC checksum. When you go to update data,
> generate a CRC checksum and compare it to the one in the CRC column.
> If they are the same, your data has not changed.
>
> Yet what happens if there is a collision of the checksum for a row?

It's simple: if a CRC indicates a difference, you can rely on the answer.
If a CRC indicates equality, you have to do a real comparison. Therefore a
CRC can give you a large speedup, depending on the situation and CRC size.

Small CRCs (like 2 or 3 bits) are very fast to compare, but the false
positive percentage is high, necessitating lots of full comparisons. CRCs
almost as long as the data sets themselves are almost as slow as exhaustive
comparison but are almost always correct. Given a certain average data
record size, hash size, number of data records, and relative frequency of
comparison versus data change (i.e. CRC calculation), you can figure out
how much time you save when using CRCs. But in most circumstances you don't
have to bother because...

> Is there a CRC function in postgresql? If not, what algorithm would I
> need to use to create one in pl/pgsql?

...normally a simple 32-bit CRC is amply sufficient. Given 2^16 data
records, about /one/ pair out of them will share one CRC, and, given one
data record, /one/ other out of 2^32 data records will have the same CRC.
They are fast to compute and can be indexed as integers.

That's what I did when I once had to implement a simple disk mirroring
application using mysql for the comparisons. This compared several million
file names (daily) using the method indicated above and worked flawlessly
and fast.

Lots of Greetings!
Volker

--
For email replies, please substitute the obvious.
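Volker's 2^16 figure is the birthday threshold for a 32-bit hash. The
expected number of colliding pairs among n records under an ideal uniform
32-bit hash is

    E[\text{pairs}] = \binom{n}{2} / 2^{32}

and with n = 2^16 that works out to roughly 2^{31}/2^{32} = 1/2, i.e. about
one colliding pair appears once the record count passes a little beyond
2^16. For the several-million-file-name case above, many colliding pairs
are expected, which is exactly why the CRC is used there only to prove
difference cheaply, with a full comparison confirming any apparent match.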