Home > mailing lists

Re: Table checksum proposal - Mailing list pgsql-general

From	hubert depesz lubaczewski
Subject	Re: Table checksum proposal
Date	July 24, 2014 11:43:39
Msg-id	CAKrjmhcorNLdR-6fxYakfdeD2n0=sJqdsivn9g7XwVx5o6mzgQ@mail.gmail.com Whole thread Raw
In response to	Table checksum proposal (matt@byrney.com)
Responses	Re: Table checksum proposal Re: Table checksum proposal
List	pgsql-general

Tree view

On Thu, Jul 24, 2014 at 3:35 AM, <matt@byrney.com> wrote:

I have a suggestion for a table checksumming facility within PostgreSQL.
The applications are reasonably obvious - detecting changes to tables,
validating data migrations, unit testing etc. A possible algorithm is as
follows:

1. For each row of the table, take the binary representations of the
values and serialise them to CSV.
2. Calculate the MD5 sum of each CSV-serialised row.
3. XOR the row MD5 sums together.
4. CSV-serialise and MD5 a list of representations (of some sort) of the
types of the table's columns and XOR it with the rest.
5. Output the result as the table's checksum.

Advantages of this approach:

1. Easily implemented using SPI.
2. Since XOR is commutative and associative, order of ingestion of rows
doesn't matter; therefore, unlike some other table checksumming methods,
this doesn't need an expensive ORDER BY *. So, this should be pretty much
as fast as a SELECT * FROM, which is probably as fast as a table checksum
can be.
3. Using a cursor in SPI, rows can be ingested a few at a time. So memory
footprint is low even for large tables.
4. Output has a convenient fixed size of 128 bits.

Questions:

1. Should this be a contrib module which provides a function, or should it
be a built-in piece of functionality?
2. Is MD5 too heavyweight for this? Would using a non-cryptographic
checksum be worth the speed boost?
3. Is there a risk of different architectures/versions returning different
checksums for tables which could be considered identical? If so, is this
worth worrying about?

Hmm - Do you really think we need an extension for something that can be done using query as simple as:

select md5(string_agg(md5(c::text), '' order by md5(c::text))) from pg_class c;

(of course you can do it on any table, not only pg_class).

If you want to use the xor idea (which make sense), all you need is to write xor aggregate.

depesz

pgsql-general by date:

From: hubert depesz lubaczewski
Date: 24 July 2014, 11:34:07
Subject: Re: event triggers in 9.3.4

From: Adrian Klaver
Date: 24 July 2014, 13:57:04
Subject: Re: tab_to_sting

Re: Table checksum proposal - Mailing list pgsql-general

Previous

Next