Thread: Is there an md5sum for tables?
Hi all, I need to know if multiple tables have (may have most probably) identical content. Since I want a fast solution (which means not comparing tables row by row), I thought it would be a good idea to have an sql function operating on a table or view similar to md5sum on a file and only compare the generated hashes. I did not find such a function in the documentation. Is such a function already available? If not, does it make no sense to use such an approach? Regards, Michael -- Wincor Nixdorf International GmbH Sitz der Gesellschaft: Paderborn Registergericht Paderborn HRB 3507 Gesch�ftsf�hrer: Eckard Heidloff (Vorsitzender), Stefan Auerbach, Dr. J�rgen Wunram Vorsitzender des Aufsichtsrats: Karl-Heinz Stiller Steuernummer: 339/5884/0020 - Ust-ID Nr.: DE812927716 - WEEE-Reg.-Nr. DE44477193 Diese E-Mail enth�lt vertrauliche Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrt�mlich erhaltenhaben, informieren Sie bitte sofort den Absender und vernichten Sie diese E-Mail. Das unerlaubte Kopieren sowie dieunbefugte Weitergabe dieser E-Mail ist nicht gestattet. This e-mail may contain confidential information. If you are not the intended recipient (or have received this e-mail inerror) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure or distributionof the material in this e-mail is strictly forbidden.
am Wed, dem 02.04.2008, um 11:56:38 +0800 mailte Michael Enke folgendes: > Hi all, > I need to know if multiple tables have (may have most probably) identical > content. > Since I want a fast solution (which means not comparing tables row by row), > I thought it would be a good idea to have an sql function operating on a > table or view > similar to md5sum on a file and only compare the generated hashes. > I did not find such a function in the documentation. > > Is such a function already available? No. > If not, does it make no sense to use such an approach? I think, no, because there is no order in the tables and there are maybe dead tuples (MVCC) -> different md5-sums for the data files. Andreas -- Andreas Kretschmer Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header) GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net
Michael Enke <michael.enke@wincor-nixdorf.com> writes: > I need to know if multiple tables have (may have most probably) > identical content. That sounds me you should re-consider your database design. Having PK/FK relations is the reason we divide common parts into separate tables to avoid from data duplication. > Since I want a fast solution (which means not comparing tables row by > row), I thought it would be a good idea to have an sql function > operating on a table or view similar to md5sum on a file and only > compare the generated hashes. I did not find such a function in the > documentation. First alter related tables and prepend a hash field to each one. After that, you can implement your own PL/pgSQL procedure which takes a table name and queries supplied table's columns from information_schema and builds an SQL query string to concatenate MD5 hash of available table fields into a single field, and updates hash field of each field according to computed hash. Regards.
Michael Enke wrote: > I need to know if multiple tables have (may have most probably) > identical content. > Since I want a fast solution (which means not comparing tables row by row), > I thought it would be a good idea to have an sql function operating on a > table or view > similar to md5sum on a file and only compare the generated hashes. > I did not find such a function in the documentation. > > Is such a function already available? I think what I would do is develop a shell script that ran select queries (with an "order by" clause, of course) against the tables, then compare the output the result sets using diff. If you need to monitor for differences over time, like to catch changes, then drive that script with cron and have it generate an email warning message when diff notices that the output are no longer identical. -- BMT
On Apr 1, 2008, at 8:56 PM, Michael Enke wrote: > Hi all, > I need to know if multiple tables have (may have most probably) > identical content. > Since I want a fast solution (which means not comparing tables row > by row), > I thought it would be a good idea to have an sql function operating > on a table or view > similar to md5sum on a file and only compare the generated hashes. > I did not find such a function in the documentation. > > Is such a function already available? > If not, does it make no sense to use such an approach? I've found that it's pretty easy to dump both tables to text files and then diff them. Then, not only do you see if there are differences, ou can see what's different about them, which is probably going to be your next question when you see they're not the same. :)
On Apr 1, 2008, at 8:56 PM, Michael Enke wrote: > Hi all, > I need to know if multiple tables have (may have most probably) > identical content. > Since I want a fast solution (which means not comparing tables row > by row), > I thought it would be a good idea to have an sql function operating > on a table or view > similar to md5sum on a file and only compare the generated hashes. > I did not find such a function in the documentation. > > Is such a function already available? No, but you could probably build it fairly easily from parts that are already there. > > If not, does it make no sense to use such an approach? Possibly, but generating the hash might be time consuming, as you'd need to iterate through the whole table, probably in a specific order, to generate the hash. So every time you modified the table you'd have to do something like "select * from table order by <something unique>" to maintain a current hash. (If you used a weak hash, something like XOR of the md5 of each row, then you could do it cheaply incrementally - and that might even be a usable hash if there's a unique constraint on some field). Overall, though, this is a very non-database thing to want to do. Iit sounds like you may have made some bad decisions mapping your data into the database. Usually you'd abstract out common data and refer to it via a foreign key, or somesuch. What's the actual application? Cheers, Steve