Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers

From	David Powers
Subject	Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date	May 15, 2013 20:44:55
Msg-id	CAJpcCMhxK56fyjj708Q2x-8F8Q2nacJ5gs9ALMFW13K9sqjeoQ@mail.gmail.com
In response to	Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses	Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) (Heikki Linnakangas <hlinnakangas@vmware.com>)
List	pgsql-hackers

First, thanks for the replies. This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have three 9.2.4 databases, each using the following setup:

- A primary box

- A standby box running as a hot streaming replica from the primary

- A testing box restored nightly from a static backup

As noted, the static backup is taken off of the standby by taking an LVM snapshot of the database filesystem and rsyncing it. I don't think it's a likely source of the problem, but the rsync leverages the previous backup (using --link-dest) to make the rsync faster and the resulting backup smaller. Each database is ~1.5T, so this is necessary to keep static backup times reasonable.
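For readers unfamiliar with this kind of procedure, here is a rough sketch of the command sequence it implies. This is not our actual script: the function name, device paths, snapshot size and directory names are all hypothetical, and the `nouuid` mount option is assumed only because an XFS snapshot carries the same filesystem UUID as its origin. The function only builds the commands and does not run them.

```python
from typing import List


def snapshot_backup_commands(
    volume: str,           # hypothetical origin LV, e.g. "/dev/vg0/pgdata"
    snap_name: str,        # hypothetical snapshot name
    snap_size: str,        # copy-on-write space reserved for the snapshot
    mount_point: str,      # where the snapshot is mounted read-only
    previous_backup: str,  # last completed backup, used by --link-dest
    new_backup: str,       # destination directory for this run
) -> List[List[str]]:
    """Return the commands for one snapshot-and-rsync backup run, in order."""
    # The snapshot device lives in the same volume group as the origin LV.
    vg_dir = volume.rsplit("/", 1)[0]
    snap_dev = f"{vg_dir}/{snap_name}"
    return [
        # 1. Freeze a point-in-time view of the database filesystem.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, volume],
        # 2. Mount it; nouuid lets XFS mount a filesystem with a duplicate UUID.
        ["mount", "-o", "ro,nouuid", snap_dev, mount_point],
        # 3. Copy it; files unchanged since the previous backup are
        #    hard-linked into the new backup instead of being copied again.
        ["rsync", "-a", f"--link-dest={previous_backup}",
         f"{mount_point}/", f"{new_backup}/"],
        # 4. Clean up so the snapshot stops absorbing writes.
        ["umount", mount_point],
        ["lvremove", "-f", snap_dev],
    ]


if __name__ == "__main__":
    for cmd in snapshot_backup_commands(
        "/dev/vg0/pgdata", "pgsnap", "50G", "/mnt/pgsnap",
        "/backups/prev", "/backups/new",
    ):
        print(" ".join(cmd))
```

Returning the commands as data rather than executing them makes it easy to review exactly what would run before pointing it at a real volume.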

We've been using the same system for quite some time, but previously (~ 1 month ago) had been taking the backup off of the primary (still using the LVM snapshot). The replication is a recent addition, and a very helpful one. LVM snapshots aren't lightweight in the face of writes and in some circumstances a long running rsync would spike the IO load on the production box.

Results of some additional tests:

After the user noticed that the test restore showed the original problem, we ran `vacuum analyze` on all three testing databases, thinking that it had a good chance of quickly touching most of the underlying files. That gave us errors on two of the testing restores similar to:

ERROR: invalid page header in block 5427 of relation base/16417/199732075
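To inspect the affected page directly in a saved backup, the error message can be mapped to a file and byte offset. The sketch below assumes a default build (8 kB blocks, 1 GB relation segments); the function and type names are made up for illustration.

```python
import re
from typing import NamedTuple

BLOCK_SIZE = 8192            # default BLCKSZ; assumed, not checked against the server
BLOCKS_PER_SEGMENT = 131072  # default 1 GB segment / 8 kB blocks; also assumed


class BlockLocation(NamedTuple):
    path: str    # segment file, relative to the data directory
    offset: int  # byte offset of the page within that file


_ERROR_RE = re.compile(
    r"invalid page header in block (\d+) of relation (\S+)"
)


def locate_block(message: str) -> BlockLocation:
    """Map an 'invalid page header' error to a segment file and byte offset."""
    match = _ERROR_RE.search(message)
    if match is None:
        raise ValueError("not an 'invalid page header' message")
    block = int(match.group(1))
    relation = match.group(2)
    # Relations larger than one segment are split into files named
    # <relfilenode>, <relfilenode>.1, <relfilenode>.2, ...
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    path = relation if segment == 0 else f"{relation}.{segment}"
    return BlockLocation(path, block_in_segment * BLOCK_SIZE)


if __name__ == "__main__":
    msg = ("ERROR: invalid page header in block 5427 "
           "of relation base/16417/199732075")
    print(locate_block(msg))
```

For the error above this points at the first segment file of the relation, in the directory of database OID 16417, which can then be compared between the failed backup and the source at that offset.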

In the meantime, I modified the static backup procedure to shut the standby down completely before taking the LVM snapshot, and I am trying a restore using that snapshot now. I'll test that using the same vacuum analyze test, and if that passes, a full vacuum.

I'm also running the vacuum analyze on the production machines to double check that the base databases don't have a subtle corruption that simply hasn't been noticed. They run with normal autovacuum settings, so I suspect that they are fine and that this won't show anything, because the autovacuum daemon should already have hit the same errors.

I'm happy to share the scripts we use for the backup/restore process if the information above isn't enough, as well as the logs - though the postgres logs don't seem to contain much of interest (the database system doesn't really get involved).

I also have the rsyncs of the failed snapshots available and could restore them for testing purposes. It's also easy to look in them (they are just saved as normal directories on a big SAN) if I know what to look for.
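Since the original symptom was a relation file missing from the backup, one simple thing to look for is files present in one tree but absent from the other. A minimal sketch, with hypothetical function names and paths, that compares two directory trees by relative path only:

```python
import os
from typing import List, Set


def relative_files(root: str) -> Set[str]:
    """Collect every regular file under root, as paths relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_files(reference_root: str, backup_root: str) -> List[str]:
    """List files that exist under reference_root but not under backup_root."""
    return sorted(relative_files(reference_root) - relative_files(backup_root))


if __name__ == "__main__":
    # Hypothetical locations: a mounted snapshot and the rsynced copy of it.
    for path in missing_files("/mnt/pgsnap", "/backups/new"):
        print(path)
```

This only says something when both trees represent the same point in time, such as a still-mounted snapshot and the backup made from it. Against a live data directory, files come and go legitimately, so differences there prove nothing by themselves.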

-David

On Wed, May 15, 2013 at 2:24 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 14.05.2013 23:47, Benedikt Grundmann wrote:
The only thing that is *new* is that we took the snapshot from the
streaming replica. So again my best guess as of now is that if the
database crashes while it is in streaming standby an invalid disk state can
result during the following startup (in rare and as of now unclear
circumstances).

A bug is certainly possible. There isn't much detail here to debug with, I'm afraid. Can you share the full logs on all three systems? I'm particularly interest

You seem to be quite convinced that it must be LVM; can you elaborate why?

Well, you said that there was a file in the original filesystem, but not in the snapshot. If you didn't do anything in between, then surely the snapshot is broken, if it skipped a file. Or was the file created in the original filesystem after the snapshot was taken? You probably left out some crucial details on how exactly the snapshot and rsync are performed. Can you share the scripts you're using?

Can you reproduce this problem with a new snapshot? Do you still have the failed snapshot unchanged?

- Heikki
