Thread: Is there an md5sum for tables?

Is there an md5sum for tables?

From
Michael Enke
Date:
Hi all,
I need to know if multiple tables have (may have most probably) identical content.
Since I want a fast solution (which means not comparing tables row by row),
I thought it would be a good idea to have an sql function operating on a table or view
similar to md5sum on a file and only compare the generated hashes.
I did not find such a function in the documentation.

Is such a function already available?
If not, does it make no sense to use such an approach?

Regards,
Michael

--
Wincor Nixdorf International GmbH
Sitz der Gesellschaft: Paderborn
Registergericht Paderborn HRB 3507
Gesch�ftsf�hrer: Eckard Heidloff (Vorsitzender), Stefan Auerbach, Dr. J�rgen Wunram
Vorsitzender des Aufsichtsrats: Karl-Heinz Stiller
Steuernummer: 339/5884/0020 - Ust-ID Nr.: DE812927716 - WEEE-Reg.-Nr. DE44477193

Diese E-Mail enth�lt vertrauliche Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrt�mlich
erhaltenhaben, informieren Sie bitte sofort den Absender und vernichten Sie diese E-Mail. Das unerlaubte Kopieren sowie
dieunbefugte Weitergabe dieser E-Mail ist nicht gestattet. 

This e-mail may contain confidential information. If you are not the intended recipient (or have received this e-mail
inerror) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure or
distributionof the material in this e-mail is strictly forbidden. 


Re: Is there an md5sum for tables?

From
"A. Kretschmer"
Date:
am  Wed, dem 02.04.2008, um 11:56:38 +0800 mailte Michael Enke folgendes:
> Hi all,
> I need to know if multiple tables have (may have most probably) identical
> content.
> Since I want a fast solution (which means not comparing tables row by row),
> I thought it would be a good idea to have an sql function operating on a
> table or view
> similar to md5sum on a file and only compare the generated hashes.
> I did not find such a function in the documentation.
>
> Is such a function already available?

No.


> If not, does it make no sense to use such an approach?

I think, no, because there is no order in the tables and there are maybe
dead tuples (MVCC) -> different md5-sums for the data files.

Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: Is there an md5sum for tables?

From
Volkan YAZICI
Date:
Michael Enke <michael.enke@wincor-nixdorf.com> writes:
> I need to know if multiple tables have (may have most probably)
> identical content.

That sounds me you should re-consider your database design. Having PK/FK
relations is the reason we divide common parts into separate tables to
avoid from data duplication.

> Since I want a fast solution (which means not comparing tables row by
> row), I thought it would be a good idea to have an sql function
> operating on a table or view similar to md5sum on a file and only
> compare the generated hashes.  I did not find such a function in the
> documentation.

First alter related tables and prepend a hash field to each one. After
that, you can implement your own PL/pgSQL procedure which takes a table
name and queries supplied table's columns from information_schema and
builds an SQL query string to concatenate MD5 hash of available table
fields into a single field, and updates hash field of each field
according to computed hash.


Regards.

Re: Is there an md5sum for tables?

From
btober@ct.metrocast.net
Date:
Michael Enke wrote:
> I need to know if multiple tables have (may have most probably)
> identical content.
> Since I want a fast solution (which means not comparing tables row by row),
> I thought it would be a good idea to have an sql function operating on a
> table or view
> similar to md5sum on a file and only compare the generated hashes.
> I did not find such a function in the documentation.
>
> Is such a function already available?

I think what I would do is develop a shell script that ran select
queries (with an "order by" clause, of course) against the tables, then
compare the output the result sets using diff.

If you need to monitor for differences over time, like to catch changes,
then drive that script with cron and have it generate an email warning
message when diff notices that the output are no longer identical.

-- BMT



Re: Is there an md5sum for tables?

From
Ben Chobot
Date:
On Apr 1, 2008, at 8:56 PM, Michael Enke wrote:

> Hi all,
> I need to know if multiple tables have (may have most probably)
> identical content.
> Since I want a fast solution (which means not comparing tables row
> by row),
> I thought it would be a good idea to have an sql function operating
> on a table or view
> similar to md5sum on a file and only compare the generated hashes.
> I did not find such a function in the documentation.
>
> Is such a function already available?
> If not, does it make no sense to use such an approach?

I've found that it's pretty easy to dump both tables to text files and
then diff them. Then, not only do you see if there are differences,
ou can see what's different about them, which is probably going to be
your next question when you see they're not the same. :)

Re: Is there an md5sum for tables?

From
Steve Atkins
Date:
On Apr 1, 2008, at 8:56 PM, Michael Enke wrote:
> Hi all,
> I need to know if multiple tables have (may have most probably)
> identical content.
> Since I want a fast solution (which means not comparing tables row
> by row),
> I thought it would be a good idea to have an sql function operating
> on a table or view
> similar to md5sum on a file and only compare the generated hashes.
> I did not find such a function in the documentation.
>
> Is such a function already available?

No, but you could probably build it fairly easily from parts that
are already there.

>
> If not, does it make no sense to use such an approach?

Possibly, but generating the hash might be time consuming,
as you'd need to iterate through the whole table, probably
in a specific order, to generate the hash. So every time you
modified the table you'd have to do something like
"select * from table order by <something unique>" to
maintain a current hash. (If you used a weak hash, something
like XOR of the md5 of each row, then you could do it cheaply
incrementally - and that might even be a usable hash if there's
a unique constraint on some field).

Overall, though, this is a very non-database thing to want to
do. Iit sounds like you may have made some bad
decisions mapping your data into the database. Usually
you'd abstract out common data and refer to it via a foreign
key, or somesuch.

What's the actual application?

Cheers,
   Steve