Thread: cyclical redundancy checksum algorithm(s)?
I just finished reading one of Ralph Kimball's books. In it he mentions
something called a cyclic redundancy checksum (CRC) function. A CRC function
is a hash function that generates a checksum.

I am wondering a few things. A CRC function would be extremely useful and
time saving in determining if a row needs to be updated or not (are the
values the same? if yes, don't update; if not, update). In fact Ralph
Kimball states that this is a way to check for changes. You just have an
extra column for the CRC checksum. When you go to update data, generate a
CRC checksum and compare it to the one in the CRC column. If they are the
same, your data has not changed.

Yet what happens if there is a collision of the checksum for a row?

Ralph Kimball did not mention which algorithm to use, nor how to create a
CRC function that would not have collisions. He does have a PhD, and is a
leader in the OLAP data warehouse world, so I assume there is a good
solution.

Is there a CRC function in postgresql? If not, what algorithm would I need
to use to create one in pl/pgsql?

regards,
karen
"Karen Hill" <karen_hill22@yahoo.com> writes: > Ralph Kimball states that this is a way to check for changes. You just > have an extra column for the crc checksum. When you go to update data, > generate a crc checksum and compare it to the one in the crc column. > If they are same, your data has not changed. You sure that's actually what he said? A change in CRC proves the data changed, but lack of a change does not prove it didn't. People do sometimes use this logic in connection with much wider "summary" functions, such as an MD5 hash. I wouldn't trust it at all with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of collision. regards, tom lane
Gene Wirchenko wrote:
> >I just finished reading one of Ralph Kimball's books. In it he
> >mentions something called a cyclic redundancy checksum (CRC)
> >function. A CRC function is a hash function that generates a checksum.
> >
> >I am wondering a few things. A CRC function would be extremely useful
> >and time saving in determining if a row needs to be updated or not (are
> >the values the same? if yes, don't update; if not, update). In fact
> >Ralph Kimball states that this is a way to check for changes. You just
> >have an extra column for the CRC checksum. When you go to update data,
> >generate a CRC checksum and compare it to the one in the CRC column.
> >If they are the same, your data has not changed.
> >
> >Yet what happens if there is a collision of the checksum for a row?
>
>      Then you get told that no change has occurred when one has. I
> would call this an error.

That's exactly what I thought when I read that in his book. I was thinking
back to the sha1 and md5 algorithms; maybe a special CRC algorithm is safe
from this.

> >Ralph Kimball did not mention which algorithm to use, nor how to create
> >a CRC function that would not have collisions. He does have a PhD,
> >and is a leader in the OLAP data warehouse world, so I assume there is
> >a good solution.
>
>      Your error. Having a Ph.D. does not stop someone from being
> wrong.

> >Is there a CRC function in postgresql? If not, what algorithm would I
> >need to use to create one in pl/pgsql?
>
>      I think you are focusing on irrelevant minutiae. Is the
> performance really that bad that you have to go to odd lengths to up it?

It is not for performance. It is to save time writing a lot of stored
procedure code. When you have an updatable view with 70 values that need to
be checked for changes, a checksum starts to sound very appealing.
In response to Tom Lane <tgl@sss.pgh.pa.us>:
> "Karen Hill" <karen_hill22@yahoo.com> writes:
> > Ralph Kimball states that this is a way to check for changes. You just
> > have an extra column for the CRC checksum. When you go to update data,
> > generate a CRC checksum and compare it to the one in the CRC column.
> > If they are the same, your data has not changed.
>
> You sure that's actually what he said? A change in CRC proves the data
> changed, but lack of a change does not prove it didn't.
>
> People do sometimes use this logic in connection with much wider
> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> collision.

I remember reading some high-level stuff on rsync and how it uses multiple
checksums in tandem. Don't know the math, though -- if both the SHA256 and
the MD5 match, what are the chances that it's changed?

You also hit diminishing returns ... after calculating so many checksums,
you might be better off just checking the data itself, unless that data is
very large.

--
Bill Moran
Collaborative Fusion Inc.
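As a rough answer to the question above: under the idealized assumption
that the two functions behave as independent uniform hashes, the chance
that a particular changed row slips past both checks is the product of the
individual collision probabilities,

    P(\text{both match} \mid \text{data changed}) \approx 2^{-256} \cdot 2^{-128} = 2^{-384}

which also shows why the returns diminish so quickly: the first wide hash
already pushes the risk far below any practical concern, so a second one
buys almost nothing. (Real hash functions are not truly independent, so
treat this as an idealized figure rather than a guarantee.)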
On 09/27/06 16:02, Karen Hill wrote:
> Gene Wirchenko wrote:
[snip]
>>> Yet what happens if there is a collision of the checksum for a row?
>> Then you get told that no change has occurred when one has. I
>> would call this an error.
>
> That's exactly what I thought when I read that in his book. I was
> thinking back to the sha1 and md5 algorithms; maybe a special CRC
> algorithm is safe from this.

I doubt it. The typical CRC is 32 bits, whereas the MD5 hash is 128 bits
and SHA1 is 160 bits.

--
Ron Johnson, Jr.
Jefferson LA USA
Tom Lane wrote:
> "Karen Hill" <karen_hill22@yahoo.com> writes:
> > Ralph Kimball states that this is a way to check for changes. You just
> > have an extra column for the CRC checksum. When you go to update data,
> > generate a CRC checksum and compare it to the one in the CRC column.
> > If they are the same, your data has not changed.
>
> You sure that's actually what he said? A change in CRC proves the data
> changed, but lack of a change does not prove it didn't.

On page 100 in the book, "The Data Warehouse Toolkit" Second Edition, Ralph
Kimball writes the following:

"Rather than checking each field to see if something has changed, we
instead compute a checksum for the entire row all at once. A cyclic
redundancy checksum (CRC) algorithm helps us quickly recognize that a wide
messy row has changed without looking at each of its constituent fields."

On page 360 he writes:

"To quickly determine if rows have changed, we rely on a cyclic redundancy
checksum (CRC) algorithm. If the CRC is identical for the extracted record
and the most recent row in the master table, then we ignore the extracted
record. We don't need to check every column to be certain that the two rows
match exactly."

> People do sometimes use this logic in connection with much wider
> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> collision.
Stepping back a bit...

Why not use an update trigger on the affected tables to record a
lastupdated timestamp value when the record is changed. Surely this is
simpler than computing some kind of row hash?

John

Karen Hill wrote:
> Tom Lane wrote:
>> "Karen Hill" <karen_hill22@yahoo.com> writes:
>>> Ralph Kimball states that this is a way to check for changes. You just
>>> have an extra column for the CRC checksum. When you go to update data,
>>> generate a CRC checksum and compare it to the one in the CRC column.
>>> If they are the same, your data has not changed.
>> You sure that's actually what he said? A change in CRC proves the data
>> changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
>> People do sometimes use this logic in connection with much wider
>> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
>> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
>> collision.
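John's trigger idea as a minimal sketch; the table and column names
continue the hypothetical customer_dim example above, and EXECUTE PROCEDURE
is the traditional PostgreSQL spelling:

    ALTER TABLE customer_dim ADD COLUMN lastupdated timestamptz;

    CREATE OR REPLACE FUNCTION touch_lastupdated() RETURNS trigger AS $$
    BEGIN
        -- Stamp the row on every UPDATE; note the caveat raised
        -- downthread that this fires even when nothing actually changed.
        NEW.lastupdated := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_dim_touch
        BEFORE UPDATE ON customer_dim
        FOR EACH ROW EXECUTE PROCEDURE touch_lastupdated();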
>> You sure that's actually what he said? A change in CRC proves the data >> changed, but lack of a change does not prove it didn't. > > "To quickly determine if rows have changed, we rely on a cyclic > redundancy checksum (CRC) algorithm. If the CRC is identical for the > >> "summary" functions, such as an MD5 hash. I wouldn't trust it at all >> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of >> collision. Small example of collisions for crc32: 0x38ee5531 Hundemarke 92294 0x59471e4f raciner tranchefiler 0x947bb6c0 Betriebsteile 4245 I had make a lot of work when choosing hash function for tsearch2. Also, I had find that popular hash algorithms produce more collision for non-ascii languages... CRC32 is more "smooth". On dictionary with 332296 unique words CRC32 produces 11 collisions, perl's hash function - 35, pgsql's hash_any - 12. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
> Why not use an update trigger on the affected tables to record a
> lastupdated timestamp value when the record is changed.
>
> Surely this is simpler than computing some kind of row hash?

It depends on how you define "change". With the triggers you propose, an

    UPDATE table SET col = col;

is a change because there was a write operation. Any hash function's output
would be "no change" because the actual data did not change. An update
might entail an expensive update of some external data, so you might want
to make sure that data really got modified.

--
Joachim Wieland
joe@mcknight.de
GPG key available
Ah, good point! Missed the subtlety of what was being asked.

John

Joachim Wieland wrote:
> On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
>> Why not use an update trigger on the affected tables to record a
>> lastupdated timestamp value when the record is changed.
>
>> Surely this is simpler than computing some kind of row hash?
>
> It depends on how you define "change". With the triggers you propose, an
>
>     UPDATE table SET col = col;
>
> is a change because there was a write operation. Any hash function's
> output would be "no change" because the actual data did not change. An
> update might entail an expensive update of some external data, so you
> might want to make sure that data really got modified.
Joachim Wieland schrieb:
> On Thu, Sep 28, 2006 at 07:09:43AM +0100, John Sidney-Woollett wrote:
>> Why not use an update trigger on the affected tables to record a
>> lastupdated timestamp value when the record is changed.
>
>> Surely this is simpler than computing some kind of row hash?
>
> It depends on how you define "change". With the triggers you propose, an
>
>     UPDATE table SET col = col;
>
> is a change because there was a write operation. Any hash function's
> output would be "no change" because the actual data did not change. An
> update might entail an expensive update of some external data, so you
> might want to make sure that data really got modified.

You can always compare the old and new values in the trigger function to
detect that kind of non-update.

bkw
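A sketch of that refinement, assuming a PostgreSQL release where row-wise
IS DISTINCT FROM is available (8.2 or later); it revises the hypothetical
touch_lastupdated() trigger from earlier in the thread:

    CREATE OR REPLACE FUNCTION touch_if_changed() RETURNS trigger AS $$
    BEGIN
        -- Only stamp the row when some column value actually differs,
        -- so "UPDATE t SET col = col" no longer counts as a change.
        IF OLD IS DISTINCT FROM NEW THEN
            NEW.lastupdated := now();
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;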
"Karen Hill" <karen_hill22@yahoo.com> wrote: >I just finished reading one of Ralph Kimball's books. In it he >mentions something called a cyclical redundancy checksum (crc) >function. A crc function is a hash function that generates a checksum. > >I am wondering a few things. A crc function would be extremely useful >and time saving in determining if a row needs to be updated or not (are >the values the same, if yes don't update, if not update). In fact >Ralph Kimball states that this is a way to check for changes. You just >have an extra column for the crc checksum. When you go to update data, >generate a crc checksum and compare it to the one in the crc column. >If they are same, your data has not changed. > >Yet what happens if there is a collision of the checksum for a row? Then you get told that no change has occurred when one has. I would call this an error. >Ralph Kimball did not mention which algorithm to use, nor how to create >a crc function that would not have collisions. He does have a PhD, >and a leader in the OLAP datawarehouse world, so I assume there is a >good solution. Your error. Having a Ph.D. does not stop someone from being wrong. >Is there a crc function in postgresql? If not what algorithm would I >need to use to create one in pl/pgsql? I think you are focusing on irrelevant minutiae. Is the performance really that bad that you have go to odd lengths to up it? If you think so, is this because you have actually tested it, or is it just a feeling? You could be setting yourself up for a lot of work that may be error-prone and just plain not work and for very little. Sincerely, Gene Wirchenko Computerese Irregular Verb Conjugation: I have preferences. You have biases. He/She has prejudices.
Karen Hill wrote:
> Gene Wirchenko wrote:
>
>>> Yet what happens if there is a collision of the checksum for a row?
>>
>> Then you get told that no change has occurred when one has. I
>> would call this an error.
>
> That's exactly what I thought when I read that in his book. I was
> thinking back to the sha1 and md5 algorithms; maybe a special CRC
> algorithm is safe from this.

Because of the Pigeonhole Principle, it is not possible that some careful
selection of hash algorithm will be "safe" from collisions. Collisions are
always possible (assuming the input size is larger than the output size,
which in practice it always is, because hashing would be pointless
otherwise). One addresses the risk of collisions probabilistically.

Although sha1 and md5 are currently a la mode, they are specifically
cryptographic hashes, and as a result they have been chosen for properties
that you don't care about if all you want is a regular hash. CRC is
probably a fine choice; it's popular and well-understood. The way one deals
with the possibility of hash collisions is to choose an appropriate hash
size for the task at hand.

In general, though, I have to agree with Gene; I don't see that this is a
fruitful avenue to pursue unless there are some *very* specific
circumstances that warrant it.

Marshall
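Concretely, the probabilistic argument is the birthday bound: assuming an
ideal uniform b-bit hash over n distinct rows, the chance of at least one
collision somewhere among them is approximately

    P \approx 1 - e^{-n(n-1)/2^{b+1}}

so "choosing an appropriate hash size" means picking b large enough to push
P below an acceptable threshold for the largest n you expect to handle.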
"Karen Hill" <karen_hill22@yahoo.com> wrote in message news:1159387540.543806.256650@i3g2000cwc.googlegroups.com... > I just finished reading one of Ralph Kimball's books. In it he > mentions something called a cyclical redundancy checksum (crc) > function. A crc function is a hash function that generates a checksum. > > I am wondering a few things. A crc function would be extremely useful > and time saving in determining if a row needs to be updated or not (are > the values the same, if yes don't update, if not update). In fact > Ralph Kimball states that this is a way to check for changes. You just > have an extra column for the crc checksum. When you go to update data, > generate a crc checksum and compare it to the one in the crc column. > If they are same, your data has not changed. > > Yet what happens if there is a collision of the checksum for a row? > > Ralph Kimball did not mention which algorithm to use, nor how to create > a crc function that would not have collisions. He does have a PhD, > and a leader in the OLAP datawarehouse world, so I assume there is a > good solution. > > Is there a crc function in postgresql? If not what algorithm would I > need to use to create one in pl/pgsql? > > regards, > karen > I generally like Kimball, but this idea sounds bogus. Of course, if you only want your data warehouse to be right some of the time....
Karen Hill wrote:
> Tom Lane wrote:
>> "Karen Hill" <karen_hill22@yahoo.com> writes:
>>> Ralph Kimball states that this is a way to check for changes. You just
>>> have an extra column for the CRC checksum. When you go to update data,
>>> generate a CRC checksum and compare it to the one in the CRC column.
>>> If they are the same, your data has not changed.
>> You sure that's actually what he said? A change in CRC proves the data
>> changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
>> People do sometimes use this logic in connection with much wider
>> "summary" functions, such as an MD5 hash. I wouldn't trust it at all
>> with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
>> collision.

An MD5 hash value has 128 bits; an SHA1 hash value has 160 bits. Roughly
speaking, a 32-bit checksum gives you a 1 in 4 billion chance that a change
won't be detected; a 64-bit checksum gives you a 1 in 16 billion billion
chance; and a 128-bit checksum therefore a 1 in 256 billion billion billion
billion chance.

Every hash is subject to collisions - there are an indefinitely large
number of possible input values (of widely differing lengths) that produce
the same output. However, the chance of seeing two such values is very
small. And, don't forget, if you update a row with a few changes (so some
bytes change but most do not), the chance that the new row produces the
same checksum as the old row is very small with a well-designed checksum.
Most updates will produce small changes in the data, but big changes in the
checksum.

The other issue is how long does it take to compute the checksum compared
with doing a field-by-field check? There are many facets to that answer -
related to caching of old values and comparison with the new.

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler@earthlink.net, jleffler@us.ibm.com
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/
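Plugging numbers into the birthday bound given earlier makes the scale
difference vivid (idealized uniform hashes assumed). For a table of a
million rows, a 32-bit checksum gives an exponent of

    n(n-1)/2^{b+1} = 10^6(10^6 - 1)/2^{33} \approx 116

so a collision somewhere in the table is essentially certain; with a
128-bit hash such as MD5 the same quantity, and hence the collision
probability, is only about 1.5 \cdot 10^{-27}. The per-row figures quoted
above (1 in 4 billion, and so on) answer the separate question of one
particular changed row going undetected, which is 2^{-b} for a b-bit hash.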
Karen Hill wrote:
> Tom Lane wrote:
> > "Karen Hill" <karen_hill22@yahoo.com> writes:
> > > Ralph Kimball states that this is a way to check for changes. You just
> > > have an extra column for the CRC checksum. When you go to update data,
> > > generate a CRC checksum and compare it to the one in the CRC column.
> > > If they are the same, your data has not changed.
> >
> > You sure that's actually what he said? A change in CRC proves the data
> > changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."

Be careful with Kimball. Read him to get the industry argot, but treat his
ideas on design and implementation with some healthy scepticism. His bases
are often shaky or obscure, and sometimes just plain wrong.

--
David Portas
Karen Hill wrote:
> Tom Lane wrote:
> > "Karen Hill" <karen_hill22@yahoo.com> writes:
> > > Ralph Kimball states that this is a way to check for changes. You just
> > > have an extra column for the CRC checksum. When you go to update data,
> > > generate a CRC checksum and compare it to the one in the CRC column.
> > > If they are the same, your data has not changed.
> >
> > You sure that's actually what he said? A change in CRC proves the data
> > changed, but lack of a change does not prove it didn't.
>
> On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
> Ralph Kimball writes the following:
>
> "Rather than checking each field to see if something has changed, we
> instead compute a checksum for the entire row all at once. A cyclic
> redundancy checksum (CRC) algorithm helps us quickly recognize that a
> wide messy row has changed without looking at each of its constituent
> fields."
>
> On page 360 he writes:
>
> "To quickly determine if rows have changed, we rely on a cyclic
> redundancy checksum (CRC) algorithm. If the CRC is identical for the
> extracted record and the most recent row in the master table, then we
> ignore the extracted record. We don't need to check every column to be
> certain that the two rows match exactly."
>
> > People do sometimes use this logic in connection with much wider
> > "summary" functions, such as an MD5 hash. I wouldn't trust it at all
> > with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
> > collision.

How do you calculate a checksum on a binary file stored in a database?
What part of the file are you going to use for computations?
Karen Hill schrieb:
> I just finished reading one of Ralph Kimball's books. In it he
> mentions something called a cyclic redundancy checksum (CRC)
> function. A CRC function is a hash function that generates a checksum.
>
> I am wondering a few things. A CRC function would be extremely useful
> and time saving in determining if a row needs to be updated or not (are
> the values the same? if yes, don't update; if not, update). In fact
> Ralph Kimball states that this is a way to check for changes. You just
> have an extra column for the CRC checksum. When you go to update data,
> generate a CRC checksum and compare it to the one in the CRC column.
> If they are the same, your data has not changed.
>
> Yet what happens if there is a collision of the checksum for a row?

It's simple: if a CRC indicates a difference, you can rely on the answer.
If a CRC indicates equality, you have to do a real comparison. Therefore a
CRC can give you a large speedup, depending on the situation and CRC size.

Small CRCs (like 2 or 3 bits) are very fast to compare, but the false
positive percentage is high, necessitating lots of full comparisons. CRCs
almost as long as the data sets themselves are almost as slow as exhaustive
comparison but are almost always correct. Given a certain average data
record size, hash size, number of data records, and relative frequency of
comparison versus data change (i.e. CRC calculation), you can figure out
how much time you save when using CRCs. But in most circumstances you don't
have to bother because...

> Is there a CRC function in postgresql? If not, what algorithm would I
> need to use to create one in pl/pgsql?

...normally a simple 32-bit CRC is amply sufficient. Given 2^16 data
records, about /one/ pair out of them will share one CRC, and, given one
data record, /one/ other out of 2^32 data records will have the same CRC.
They are fast to compute and can be indexed as integers.

That's what I did when I once had to implement a simple disk mirroring
application using mysql for the comparisons. This compared several million
file names (daily) using the method indicated above and worked flawlessly
and fast.

Lots of Greetings!
Volker

--
For email replies, please substitute the obvious.
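Volker's 2^16 figure is the birthday threshold for a 32-bit hash. The
expected number of colliding pairs among n records under an ideal uniform
32-bit hash is

    E[\text{pairs}] = \binom{n}{2} / 2^{32}

and with n = 2^16 that works out to roughly 2^{31}/2^{32} = 1/2, i.e. about
one colliding pair appears once the record count passes a little beyond
2^16. For the several-million-file-name case above, many colliding pairs
are expected, which is exactly why the CRC is used there only to prove
difference cheaply, with a full comparison confirming any apparent match.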