Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers

From David Powers
Subject Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date
Msg-id CAJpcCMhxK56fyjj708Q2x-8F8Q2nacJ5gs9ALMFW13K9sqjeoQ@mail.gmail.com
In response to Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
First, thanks for the replies.  This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have three 9.2.4 databases using the following setup:

- A primary box
- A standby box running as a hot streaming replica from the primary
- A testing box restored nightly from a static backup

As noted, the static backup is taken off of the standby by taking an LVM snapshot of the database filesystem and rsyncing it.  I don't think it's a likely source of the problem, but the rsync uses the previous backup (via --link-dest) to make the transfer faster and the resulting backup smaller.  Each database is ~1.5T, so this is necessary to keep static backup times reasonable.
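For readers unfamiliar with the --link-dest approach, here is a minimal sketch of how such an rsync invocation might be assembled. The actual backup script is not shown in this thread, so the function name, the mount point, and the directory names below are all hypothetical; only the rsync options themselves (--archive, --link-dest) are real.

```python
import shlex
from pathlib import Path


def build_rsync_command(snapshot_mount, backup_root, previous_name, new_name):
    """Return an rsync argument list for a hard-linked incremental backup.

    All four parameters are hypothetical placeholders; the real script
    used for these backups is not shown in the thread.
    """
    backup_root = Path(backup_root)
    return [
        "rsync",
        "--archive",  # preserve permissions, ownership, timestamps
        # Files rsync considers unchanged are hard-linked to the previous backup
        f"--link-dest={backup_root / previous_name}",
        f"{Path(snapshot_mount)}/",  # trailing slash: copy the directory contents
        str(backup_root / new_name),
    ]


if __name__ == "__main__":
    cmd = build_rsync_command(
        "/mnt/pgsnap", "/backups/db1", "2013-05-14", "2013-05-15"
    )
    print(shlex.join(cmd))
```

Note that rsync by default decides whether a file is unchanged from its size and modification time only; the --checksum option compares file contents instead, at the cost of reading every file on both sides.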

We've been using the same system for quite some time, but previously (until ~1 month ago) had been taking the backup off of the primary (still using the LVM snapshot).  The replication is a recent addition, and a very helpful one.  LVM snapshots aren't lightweight in the face of writes, and in some circumstances a long-running rsync would spike the IO load on the production box.

Results of some additional tests:

After the user noticed that the test restore showed the original problem, we ran `vacuum analyze` on all three testing databases, thinking that it had a good chance of quickly touching most of the underlying files.  That gave us errors on two of the testing restores similar to:

ERROR:  invalid page header in block 5427 of relation base/16417/199732075
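To inspect the reported page directly in one of the saved backups, the block number can be translated into a file name and byte offset. This is a minimal sketch that assumes the default 8 kB block size and 1 GB segment files; neither is confirmed for these builds, and locate_block is a hypothetical helper, not a PostgreSQL function.

```python
BLOCK_SIZE = 8192                 # assumption: default BLCKSZ
SEGMENT_SIZE = 1024 ** 3          # assumption: default 1 GB segment files
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_block(relation_path, block_number):
    """Map a relation block number to (segment file, byte offset in that file).

    Segment 0 is the bare relation file; later segments carry a numeric
    suffix such as ".1", ".2".
    """
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    filename = relation_path if segment == 0 else f"{relation_path}.{segment}"
    return filename, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    print(locate_block("base/16417/199732075", 5427))
```

Under those assumptions, block 5427 falls in the first segment file at byte offset 5427 * 8192 = 44457984.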

In the meantime I modified the static backup procedure to shut the standby down completely before taking the LVM snapshot, and am trying a restore using that snapshot now.  I'll test that using the same vacuum analyze test, and if that passes, a full vacuum.

I'm also running the vacuum analyze on the production machines to double check that the base databases don't have a subtle corruption that simply hasn't been noticed.  They run with normal autovacuum settings, so I suspect that they are fine and that this won't show anything, because we should already have seen such errors from the autovacuum daemon.

I'm happy to share the scripts we use for the backup/restore process if the information above isn't enough, as well as the logs - though the postgres logs don't seem to contain much of interest (the database system doesn't really get involved).

I also have the rsyncs of the failed snapshots available and could restore them for testing purposes.  It's also easy to look in them (they are just saved as normal directories on a big SAN) if I know what to look for.
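Since the saved backups are plain directories, one simple thing to look for is a file that exists in one tree but not in another (for example, a relation file present on the standby but absent from a backup). A minimal sketch follows; the function names are hypothetical and the two directory roots are whatever trees are being compared.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def missing_from_backup(reference_root, backup_root):
    """List files present under reference_root but absent under backup_root."""
    return sorted(relative_files(reference_root) - relative_files(backup_root))
```

This only compares file names, not contents or sizes, so it can show a missing file but not a damaged one.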

-David


On Wed, May 15, 2013 at 2:24 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 23:47, Benedikt Grundmann wrote:
The only thing that is *new* is that we took the snapshot from the streaming replica.  So again my best guess as of now is that if the database crashes while it is in streaming standby an invalid disk state can result during the following startup (in rare and as of now unclear circumstances).

A bug is certainly possible. There isn't much detail here to debug with, I'm afraid. Can you share the full logs on all three systems? I'm particularly interest


You seem to be quite convinced that it must be LVM; can you elaborate why?

Well, you said that there was a file in the original filesystem, but not in the snapshot. If you didn't do anything in between, then surely the snapshot is broken, if it skipped a file. Or was the file created in the original filesystem after the snapshot was taken? You probably left out some crucial details on how exactly the snapshot and rsync are performed. Can you share the scripts you're using?

Can you reproduce this problem with a new snapshot? Do you still have the failed snapshot unchanged?

- Heikki
