Thread: Verify Option with pg_dump
Hi,
recently I had problems with a corrupt pg_dump file. The problem with the file was due to a faulty disk. The trouble with this is that I was unaware of the disk problem and the pg_dump file corruption so I did not have a full valid backup. In order to reduce the chances of this I was hoping that there could be a verify option as in SQL Server for the backups. This could be as simple as checking the CRC/MD5 as the stream is created. So pg_dump | crc_save
The idea being that the pg_dump is crc'd before it is streamed to disk, and then the file re-read from disk to check the CRC.
Is there a linux utility to do this or would it be simple to modify pg_dump to do this?
Thanks
Howard.
On Wed, Nov 30, 2016 at 12:00:07PM +0000, Howard News wrote:
> recently I had problems with a corrupt pg_dump file. The problem with the
> file was due to a faulty disk. The trouble with this is that I was unaware
> of the disk problem and the pg_dump file corruption so I did not have a full
> valid backup. In order to reduce the chances of this I was hoping that there
> could be a verify option as in SQL Server for the backups. This could be as
> simple as checking the CRC/MD5 as the stream is created. So pg_dump |
> crc_save
>
> The idea being that the pg_dump is crc'd before it is streamed to disk, and
> then the file re-read from disk to check the CRC.
>
> Is there a linux utility to do this or would it be simple to modify pg_dump
> to do this?

You can try to suitably combine "pg_dump --format=plain" with
"tee" and "md5sum" such that the output stream is diverted to
both a file and a pipe-into-CRC-algorithm and eventually
compare the pipe's sum with the sum generated from the file.

But the better solution might be to stream to a filesystem
that verifies disk writes immediately. Or to a suitable RAID
array.

Regards,
Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
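The tee/md5sum combination described above could be sketched roughly as follows. This is only an illustration, assuming bash and GNU coreutils; the function name dump_and_verify and the database name mydb are made up for the example and are not part of pg_dump.

```shell
#!/usr/bin/env bash
# Make a pipeline fail if any stage (e.g. pg_dump itself) fails.
set -o pipefail

# dump_and_verify OUTFILE CMD [ARGS...]   (hypothetical helper)
# Runs CMD, writes its stdout to OUTFILE via tee, hashes the same bytes
# as they pass through the pipe, then re-reads OUTFILE and compares.
dump_and_verify() {
    local outfile=$1
    shift
    local stream_sum file_sum

    # Checksum of the stream as produced, before it is read back from disk.
    stream_sum=$("$@" | tee "$outfile" | md5sum | cut -d' ' -f1) || return 1

    # Checksum of what can be read back from the file.
    file_sum=$(md5sum < "$outfile" | cut -d' ' -f1) || return 1

    if [ "$stream_sum" = "$file_sum" ]; then
        echo "OK $file_sum"
    else
        echo "MISMATCH stream=$stream_sum file=$file_sum" >&2
        return 1
    fi
}

# Example call (assumed database name):
# dump_and_verify mydb.sql pg_dump --format=plain mydb
```

One caveat: a re-read immediately after the write may be served from the operating system's page cache rather than from the disk, so a match here does not prove the on-disk copy is good.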
On 30/11/2016 12:27, Karsten Hilbert wrote:
> You can try to suitably combine "pg_dump --format=plain" with
> "tee" and "md5sum" such that the output stream is diverted to
> both a file and a pipe-into-CRC-algorithm and eventually
> compare the pipe's sum with the sum generated from the file.
>
> But the better solution might be to stream to a filesystem
> that verifies disk writes immediately. Or to a suitable RAID
> array.
>
> Regards,
> Karsten

Thanks for this info Karsten. I will look into using "tee". As a matter of
interest, why does the format need to be plain?

Regarding the filesystem solution, the dump is currently written to a HP
RAID 10 array with an NTFS partition. What filesystems / raid arrays have
this ability?

Thanks.
On Wed, Nov 30, 2016 at 01:11:58PM +0000, Howard News wrote:
> > You can try to suitably combine "pg_dump --format=plain" with
> > "tee" and "md5sum" such that the output stream is diverted to
> > both a file and a pipe-into-CRC-algorithm and eventually
> > compare the pipe's sum with the sum generated from the file.
> >
> > But the better solution might be to stream to a filesystem
> > that verifies disk writes immediately. Or to a suitable RAID
> > array.
> Thanks for this info Karsten. I will look into using "tee". As a matter of
> interest, why does the format need to be plain?

Actually, any of the formats producing a _single_ file right away
are likely to work. So, any but "directory", I guess.

> Regarding the filesystem solution, the dump is currently written to a HP
> RAID 10 array with an NTFS partition. What filesystems / raid arrays have
> this ability?

If you can't trust your RAID 10 (1 meaning mirrored) to
actually store what you told it to you've got problems beyond
somehow verifying a pg_dump.

Regards,
Karsten
>> Regarding the filesystem solution, the dump is currently written to a HP
>> RAID 10 array with an NTFS partition. What filesystems / raid arrays have
>> this ability?
> If you can't trust your RAID 10 (1 meaning mirrored) to
> actually store what you told it to you've got problems beyond
> somehow verifying a pg_dump.
>
> Regards,
> Karsten

I am told RAID can only protect you against disk failure. File writes to one
or more of the disks in an array are not typically compared, so a RAID array
carries on until the disk failure, or error count gets to a certain level. So
RAID does not fully protect you from data corruption.

So you can't trust RAID!
On Wed, Nov 30, 2016 at 01:53:21PM +0000, Howard News wrote:
> > > Regarding the filesystem solution, the dump is currently written to a HP
> > > RAID 10 array with an NTFS partition. What filesystems / raid arrays have
> > > this ability?
> > If you can't trust your RAID 10 (1 meaning mirrored) to
> > actually store what you told it to you've got problems beyond
> > somehow verifying a pg_dump.
> >
> > Regards,
> > Karsten
> I am told RAID can only protect you against disk failure. File writes to one
> or more of the disks in an array are not typically compared, so a RAID array
> carries on until the disk failure, or error count gets to a certain level. So
> RAID does not fully protect you from data corruption.

True enough. So it seems you are referring to "silent data corruption".

Does this link help?

http://www.raidix.com/knowledge-base/silent-data-corruption/

This link also seems relevant:

http://stackoverflow.com/questions/13107783/pipe-output-to-two-different-commands

Regards,
Karsten
Also this https://en.wikipedia.org/wiki/Silent_data_corruption#Countermeasures
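Since silent corruption may only show up some time after a dump was written, one possible approach is to store the checksum of the stream next to the dump file and re-check it later, for example before a restore. The sketch below assumes bash and GNU coreutils (md5sum -c with --status); the helper names dump_with_sidecar and verify_sidecar, and the database name mydb, are hypothetical.

```shell
#!/usr/bin/env bash
# Make a pipeline fail if any stage (e.g. pg_dump itself) fails.
set -o pipefail

# dump_with_sidecar OUTFILE CMD [ARGS...]   (hypothetical helper)
# Writes CMD's stdout to OUTFILE and the checksum of the stream,
# taken from the pipe and not from the file, to OUTFILE.md5.
dump_with_sidecar() {
    local outfile=$1
    shift
    local sum
    sum=$("$@" | tee "$outfile" | md5sum | cut -d' ' -f1) || return 1
    # "HASH  NAME" is the layout md5sum prints, so md5sum -c can read it.
    printf '%s  %s\n' "$sum" "$(basename "$outfile")" > "$outfile.md5"
}

# verify_sidecar OUTFILE   (hypothetical helper)
# Re-reads OUTFILE and checks it against OUTFILE.md5.
# Returns non-zero if the file no longer matches the stored checksum.
verify_sidecar() {
    local outfile=$1
    (
        cd "$(dirname "$outfile")" &&
        md5sum -c --status "$(basename "$outfile").md5"
    )
}

# Example calls (assumed database name):
# dump_with_sidecar /backups/mydb.dump pg_dump --format=custom mydb
# verify_sidecar /backups/mydb.dump || echo "dump is corrupt" >&2
```

A plain pipe is used here instead of bash process substitution so that the function only returns once md5sum has finished. Keeping the .md5 file on the same disk as the dump means both can be damaged together, so copying it elsewhere may be worthwhile.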