I am working with a client who is facing database corruption after a physical hard power-off (the machines are at remote sites, so this could be a power outage or user error).
They have an environment made up of many of the following consumer-grade standalone machines:
They have already shown that the corruption is reproducible in a testbed, both with and without OS/RAID controller caching, but I am working with them to make the process more rigorous.
The new process will be:
Power on machine
If PostgreSQL doesn't start, archive $PGDATA and initdb
Perform a pg_dumpall to test for corruption
If pg_dumpall fails, archive $PGDATA and initdb
Start the test suite (which mimics high load from their application), which INSERTs and DELETEs records both inside and outside of transactions
After 15 minutes, cut power and repeat the process
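If it helps to make the discussion concrete, the per-boot steps above could be scripted roughly as below. The paths, the archive naming, and the use of pg_ctl/pg_dumpall directly (rather than an init script) are assumptions on my part; DRY_RUN defaults to just printing the commands so the control flow can be sanity-checked off-box.

```shell
#!/bin/sh
# Rough sketch of the per-boot test cycle. PGDATA and the archive
# naming are placeholders. DRY_RUN=1 (the default here) only prints
# each command; set DRY_RUN=0 on a real test machine.
PGDATA=${PGDATA:-/var/lib/pgsql/data}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Archive the (possibly corrupt) cluster and re-create it.
archive_and_initdb() {
    run mv "$PGDATA" "$PGDATA.corrupt.$(date +%s)"
    run initdb -D "$PGDATA"
}

# Step 1: start PostgreSQL; if it won't start, archive and re-init.
if ! run pg_ctl -D "$PGDATA" -w start; then
    archive_and_initdb
    run pg_ctl -D "$PGDATA" -w start
fi

# Step 2: pg_dumpall as a corruption check; discard the output.
if ! run pg_dumpall -f /dev/null; then
    run pg_ctl -D "$PGDATA" -m fast stop
    archive_and_initdb
    run pg_ctl -D "$PGDATA" -w start
fi

# Step 3: kick off the load generator here; power is cut
# externally after 15 minutes and the whole script re-runs on boot.
```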
We are hoping to get about 20 machines into this testbed, giving us around 1500 power cycles per day (roughly 75 cycles per machine, i.e. the 15-minute run plus a few minutes of boot and verification per cycle).
Test scenarios that have been floated so far:
As described above, all caching off
As described above, all caching off, 9.2 stable
As described above, all caching off, 9.5 stable with checksums
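One note on the checksum scenario: as far as I know, in 9.5 data checksums can only be enabled cluster-wide at initdb time, so the re-init step in the test loop would need the flag:

```shell
# For the 9.5-with-checksums scenario, enable page checksums
# when the cluster is (re)created:
initdb -D "$PGDATA" --data-checksums
```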
Can anyone think of anything else we should be considering, testing, or factoring in?
Cheers,
James Sewell,
PostgreSQL Team Lead / Solutions Architect
Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009