Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) |
Date | |
Msg-id | 5193D8EF.3060000@vmware.com |
In response to | Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) (David Powers <dpowers@janestreet.com>) |
Responses | Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) |
List | pgsql-hackers |
On 15.05.2013 15:42, David Powers wrote:
> First, thanks for the replies. This sort of thing is frustrating and hard
> to diagnose at a distance, and any help is appreciated.
>
> Here is some more background:
>
> We have 3 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the servers? There was a fix to a bug related to starting a standby server from a filesystem snapshot. I don't think it was quite the case you have, but pretty close.

> - A primary box
> - A standby box running as a hot streaming replica from the primary
> - A testing box restored nightly from a static backup
>
> As noted, the static backup is taken off of the standby by taking an LVM
> snapshot of the database filesystem and rsyncing. I don't think it's a
> likely problem but the rsync leverages the previous backup (using
> --link-dest) to make the rsync faster and the resulting backup smaller.

I guess that should work, as long as you make a full copy when you restore the backup, and never start a postgres server directly against a backup that contains the hardlinks.

> We've been using the same system for quite some time, but previously (~ 1
> month ago) had been taking the backup off of the primary (still using the
> LVM snapshot). The replication is a recent addition, and a very helpful
> one. LVM snapshots aren't lightweight in the face of writes and in some
> circumstances a long running rsync would spike the IO load on the
> production box.
>
> Results of some additional tests:
>
> After the user noticed that the test restore showed the original problem we
> ran `vacuum analyze` on all three testing databases thinking that it had a
> good chance of quickly touching most of the underlying files. That gave us
> errors on two of the testing restores similar to:
>
> ERROR: invalid page header in block 5427 of relation base/16417/199732075

Huh, that's surprising. You have to be quite unlucky to end up with a corrupt page header, even if there's something wrong with the underlying storage or backup routines.

Could you grab a copy of that block? Something like:

    dd skip=5427 bs=8192 count=1 if=base/16417/199732075 of=corrupt-block

I'm especially curious what the first 20 bytes or so look like. I'm guessing that it's all zeros, while some later part of the page contains data. I.e. the page was torn at some point in the process.

> I'm also running the vacuum analyze on the production machines to double
> check that the base databases don't have a subtle corruption that simply
> hasn't been noticed. They run with normal autovacuum settings, so I
> suspect that they are fine/this won't show anything because we should have
> seen this from the autovacuum daemon before.

Vacuum analyze doesn't always scan the whole table with default settings, so that's no proof that there are no missing pages. "set vacuum_freeze_table_age=0; vacuum <table>" will force a full scan of the table.

> I'm happy to share the scripts we use for the backup/restore process if the
> information above isn't enough, as well as the logs - though the postgres
> logs don't seem to contain much of interest (the database system doesn't
> really get involved).

Yeah, please share the scripts. What you've described seems correct to me, but maybe there's something I'm missing.

> I also have the rsyncs of the failed snapshots available and could restore
> them for testing purposes. It's also easy to look in them (they are just
> saved as normal directories on a big SAN) if I know what to look for.

Great. Not sure what to look at right now, but good that you have kept the evidence.

- Heikki
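For reference, a rough sketch of how that block could be captured and examined, assuming the relation path and block number from the ERROR message above, the default 8 kB block size, and that the commands are run against a copy of the data directory rather than the live cluster:

    # Pull out block 5427 of the relation named in the error message.
    dd if=base/16417/199732075 of=corrupt-block bs=8192 skip=5427 count=1

    # Dump the first 24 bytes, i.e. the 9.2 page header (pd_lsn, pd_tli,
    # pd_flags, pd_lower, pd_upper, pd_special, pd_pagesize_version,
    # pd_prune_xid). All zeros here would fit the torn/zeroed-page guess.
    od -A d -t x1 -N 24 corrupt-block

hexdump -C corrupt-block | head would show much the same thing; the point is simply to see whether the header bytes are zeroed while later parts of the page still contain data.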