Greg Smith <greg@2ndQuadrant.com> wrote:
> But if you need all that infrastructure just to get the feature
> launched, that's a bit hard to stomach.
Which infrastructure do you mean? Triggering a vacuum, or some
hypothetical "scrubbing" feature?
> Also, as someone who follows Murphy's Law as my chosen religion,
If you don't think I pay attention to Murphy's Law, I should recap
our backup procedures -- which involve three separate forms of
backup, each to multiple servers in different buildings, real-time,
plus idle-time comparison of the databases of origin to all replicas
with reporting of any discrepancies. And off-line "snapshot"
backups on disk at a records center controlled by a different
department. That's in addition to RAID redundancy and hardware
health and performance monitoring. Some people think I border on
the paranoid on this issue.
> I would expect this situation could be exactly how flaky hardware
> would first manifest itself: server crash and a bad CRC on the
> last thing written out. And in that case, the last thing you want
> to do is assume things are fine, then kick off a VACUUM that might
> overwrite more good data with bad.
Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue. And certainly there are way too many people who don't ensure
that they have a good backup before firing up PostgreSQL after a
failure, so I can see not making autovacuum more aggressive than
usual, and perhaps even disabling it until there is some sort of
confirmation (I have no idea how) that a backup has been made. That
said, a database VACUUM would be one of my first steps after
ensuring that I had a copy of the data directory tree, personally.
I guess I could even live with that as recommended procedure rather
than something triggered through autovacuum and not feel that the
rest of my posts on this are too far off track.
> The main way I expect to validate this sort of thing is with an as
> yet unwritten function to grab information about a data block from
> a standby server for this purpose, something like this:
>
> Master: Computed CRC A, Stored CRC B; error raised because A!=B
> Standby: Computed CRC C, Stored CRC D
>
> If C==D && A==C, the corruption is probably overwritten bits of
> the CRC B.
Are you arguing we need *that* infrastructure to get the feature
launched?
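
For reference, here is a minimal sketch of the comparison quoted above,
as I read it. Everything in it is an assumption made for illustration:
the names BlockCheck, check_block and stored_crc_probably_damaged are
hypothetical, and zlib.crc32 is only a stand-in for whatever checksum
the feature would really use. It encodes only the one case spelled out
in the quote and says nothing about any other combination.

```python
import zlib
from dataclasses import dataclass


@dataclass
class BlockCheck:
    """Computed and stored CRC for one copy of a data block (hypothetical)."""
    computed: int
    stored: int


def check_block(data: bytes, stored_crc: int) -> BlockCheck:
    # zlib.crc32 is a stand-in checksum, not the server's actual algorithm.
    return BlockCheck(computed=zlib.crc32(data), stored=stored_crc)


def stored_crc_probably_damaged(master: BlockCheck, standby: BlockCheck) -> bool:
    """True only for the quoted case: A != B on the master, C == D and A == C.

    Any other combination returns False, meaning "this rule does not apply",
    not "the block is fine".
    """
    a, b = master.computed, master.stored
    c, d = standby.computed, standby.stored
    return a != b and c == d and a == c


if __name__ == "__main__":
    page = b"example page contents"
    good_crc = zlib.crc32(page)
    # Master: same data, but one bit of the stored CRC has been flipped.
    master = check_block(page, good_crc ^ 0x1)
    # Standby: data and stored CRC both intact.
    standby = check_block(page, good_crc)
    print(stored_crc_probably_damaged(master, standby))
```

Note that the function needs the standby's computed and stored CRC for
the same block, which is exactly the "as yet unwritten function" part.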
-Kevin