Thread: Table checksum proposal

Table checksum proposal

From: matt@byrney.com
Date: Thu, Jul 24, 2014 at 3:35 AM
I have a suggestion for a table checksumming facility within PostgreSQL.
The applications are reasonably obvious - detecting changes to tables,
validating data migrations, unit testing, etc.  A possible algorithm is as
follows (a rough SQL sketch follows the list):

1. For each row of the table, take the binary representations of the
values and serialise them to CSV.
2. Calculate the MD5 sum of each CSV-serialised row.
3. XOR the row MD5 sums together.
4. CSV-serialise and MD5 a list of representations (of some sort) of the
types of the table's columns and XOR it with the rest.
5. Output the result as the table's checksum.
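
To make this concrete, here is a rough SQL-level sketch of what steps 1-2 and
4 compute (purely illustrative - the real implementation would run through SPI
on the binary values, "some_table" is only a placeholder, and step 3 would
additionally need a 128-bit XOR aggregate, which plain SQL doesn't provide out
of the box):

    -- Steps 1-2: one MD5 per row, here of a text serialisation of the row.
    SELECT md5(t::text) AS row_md5 FROM some_table t;

    -- Step 4: a fingerprint of the column types, taken from the catalogue.
    SELECT md5(string_agg(format_type(atttypid, atttypmod), ',' ORDER BY attnum))
    FROM pg_attribute
    WHERE attrelid = 'some_table'::regclass AND attnum > 0 AND NOT attisdropped;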

Advantages of this approach:

1. Easily implemented using SPI.
2. Since XOR is commutative and associative, the order in which rows are
ingested doesn't matter; unlike some other table checksumming methods, this
doesn't need an expensive ORDER BY *.  It should therefore be pretty much as
fast as a plain SELECT * FROM the table, which is probably as fast as a table
checksum can be.
3. Using a cursor in SPI, rows can be ingested a few at a time, so the memory
footprint stays low even for large tables.
4. Output has a convenient fixed size of 128 bits.

Questions:

1. Should this be a contrib module which provides a function, or should it
be a built-in piece of functionality?
2. Is MD5 too heavyweight for this?  Would using a non-cryptographic
checksum be worth the speed boost?
3. Is there a risk of different architectures/versions returning different
checksums for tables which could be considered identical?  If so, is this
worth worrying about?

I have knocked up some sample code if anyone is interested.

Regards,

Matthew


Re: Table checksum proposal

From: hubert depesz lubaczewski
Date: Thu, Jul 24, 2014 at 1:43 PM (+0200)

On Thu, Jul 24, 2014 at 3:35 AM, <matt@byrney.com> wrote:
> I have a suggestion for a table checksumming facility within PostgreSQL.
> [...]

Hmm - Do you really think we need an extension for something that can be done with a query as simple as:

select md5(string_agg(md5(c::text), '' order by md5(c::text))) from pg_class c;

(of course you can do it on any table, not only pg_class).

If you want to use the XOR idea (which makes sense), all you need to do is write an XOR aggregate.
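
A minimal sketch of such an aggregate (untested, name is arbitrary) - bitxor()
is the built-in function behind the # operator for bit strings, and with no
INITCOND the first row's value seeds the state:

create aggregate md5_xor(bit) (
    sfunc = bitxor,
    stype = bit
);

-- then no ORDER BY (and no big string_agg buffer) is needed:
select md5_xor(('x' || md5(c::text))::bit(128)) from pg_class c;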

depesz

Re: Table checksum proposal

From: matt@byrney.com
Date: Thu, Jul 24, 2014 at 3:06 PM (+0100)
> On Thu, Jul 24, 2014 at 3:35 AM, <matt@byrney.com> wrote:
>
>> [...]
>
> Hmm - Do you really think we need an extension for something that can be
> done with a query as simple as:
>
> select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
> pg_class c;
>
> (of course you can do it on any table, not only pg_class).
>
> If you want to use the XOR idea (which makes sense), all you need to do is
> write an XOR aggregate.
>
> depesz
>

This is nice and neat but there are some major disadvantages with this
approach:

1. It can't detect differences in types, e.g. converting an INT column to
TEXT will leave the checksum unchanged (see the one-line example after this
list).
2. The string_agg requires a string of length 32 * (number of rows) to be
built and then MD5ed, so on a 100-million-row table that means roughly 3.2GB
of memory, which seems unnecessarily heavy.
3. You have used an ORDER BY, which adds further memory usage and time cost.
4. A built-in checksum facility also opens up the possibility of a common
standard, which is of course a major factor in making a checksum useful - one
can supply a database dump and a list of tables with their "standard"
checksums without also having to supply sample code.

Matthew


Re: Table checksum proposal

From: Karsten Hilbert
On Thu, Jul 24, 2014 at 01:43:29PM +0200, hubert depesz lubaczewski wrote:

> > [...]
> > 3. Is there a risk of different architectures/versions returning different
> > checksums for tables which could be considered identical?  If so, is this
> > worth worrying about?
>
> Hmm - Do you really think we need an extension for something that can be
> done with a query as simple as:
>
> select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
> pg_class c;

Just a nitpick, and not really relevant to the OP's question,
but probably useful to people trying to actually use this
across potentially different locale settings: throw in a

    DECODE(md5(...), 'hex')

for the ORDER BY, like

    select md5(string_agg(md5(c::text), '' order by decode(md5(c::text), 'hex'))) from pg_class c;

so that rows are ordered by the raw bytes rather than by hex text, whose
sort order can depend on the collation in use.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: Table checksum proposal

From: Karsten Hilbert
On Thu, Jul 24, 2014 at 03:06:28PM +0100, matt@byrney.com wrote:

> > select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
> > pg_class c;
> >
> > (of course you can do it on any table, not only pg_class).
> >
> > If you want to use the XOR idea (which makes sense), all you need to do is
> > write an XOR aggregate.
>
> This is nice and neat but there are some major disadvantages with this
> approach:
>
> 1. It can't detect differences in types, e.g. converting an INT column to
> TEXT will leave the checksum unchanged.

Unless you apply it to pg_attribute as well, no?
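
For example (the table name is only a placeholder), something along these
lines would fold the column definitions into the comparison as well:

    select md5(string_agg(md5(a::text), '' order by decode(md5(a::text), 'hex')))
    from pg_attribute a
    where attrelid = 'some_table'::regclass and attnum > 0 and not attisdropped;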

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346