Thread: Checking for data changes across a very large table

Checking for data changes across a very large table

From
"Stephen Hutchings"
Date:

I’d like some general guidance on a security issue please. This may belong in the another list so please push me there if appropriate.

 

We have an application design which includes a potential 2 billion row table (A). When the application user kicks off an analysis process, there is a requirement to perform a check on that table to verify that the data within hasn’t changed.

 

The initial intent was to hold a single “checksum” or equivalent in another small table (B). This checksum value would have been calculated over a subset of row values held within the 2bn row table. When the user chooses to kick off their processing, the application would re-calculate this checksum for table A in real time and compare the new result with the original value stored in table B. We might obtain the subset from table A like this:

 

SELECT * FROM (

            SELECT            *,

                        (row_number() OVER( ORDER BY "ID"))%10000000 AS rn

            FROM "TableA") aa

WHERE aa.rn=1 LIMIT 20000

 

... and then perform an operation on column data from that subset (in this example, a 100,000th of the original data) to calculate a checksum value. Obviously the data within the subset would have to have been affected in order for a difference to be detected.

 

It is a requirement that users have to have direct access to the DB (I hate this, but am overruled). I envisage that non-admin users would have read-only access, and the above validation would be an extra check on the data.

 

But my main concern is performance - I fear that this won’t be able to perform (in a few seconds rather than minutes). Does the general approach make any sense? Can anyone suggest a better starting point?

 

 

This is Postgres 9.0. The minimum hardware spec is small and Windows-based (64-bit, Core i7 processor, 8GB RAM, single 1TB hard disk). My evaluation DB has ended up with the following config tweaks:

 

shared_buffers 2048 MB

work_mem 2047 MB

checkpoint_segments 128

 

 

Thanks in advance.

 


JBA Consulting

 

JBA Consulting

South Barn

Broughton Hall

Skipton

North Yorkshire

BD23 3AE

United Kingdom

NCE Consultant of the Year 2010

t: +44 (0)1756 799919 | f: +44 (0)1756 799449

 

JBA is a Carbon Neutral Company. Please don't print this e-mail unless you really need to.

This email is covered by JBA Consulting's email disclaimer.

 

Attachment

Re: Checking for data changes across a very large table

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> We have an application design which includes a potential 2
> billion row table (A). When the application user kicks off
> an analysis process, there is a requirement to perform a
> check on that table to verify that the data within hasn't changed.
...
> But my main concern is performance - I fear that this won't be able to
> perform (in a few seconds rather than minutes). Does the general
> approach make any sense? Can anyone suggest a better starting point?

It's not clear if you need to check the entire table, or just a
subset related to that user each time, but one solution for either
is a trigger (after insert, update, delete) that simply increments
a sequence. Lock the sequence down and make the function security
definer. Don't increment if an update hasn't actually changed
anything. If the sequence number is not the same as last time the
app checked, then the data is not the same. You can store the sequence
value in a table if you need some persistence, or add multiple columns
if you need to check for a user-derived subset of the data (with
multiple sequences or simply increment the values in the table itself
like a version control number).

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201012100942
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk0CPLYACgkQvJuQZxSWSsi8NQCgz4+bmWPMZm+aIX9maelZhj/+
wycAoNT32GFwudXF1Totvpw25+TXsu+E
=jc8n
-----END PGP SIGNATURE-----