Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers

From David Powers
Subject Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Msg-id CAJpcCMjAZ7r0Tbs2f9gwtjN573GON4WcE4eeu1UqDmKYyDKpPQ@mail.gmail.com
In response to Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
It's another possibility, but I think it's still somewhat remote given how long we've been using this method with this code.  Sadly, it's hard to test, because taking a full backup without the hard linking is fairly expensive (the databases comprise multiple terabytes).

As a possibly unsatisfying solution, I've spent the last day reworking the backups to use the low-level API and the pg_basebackup method, which takes both the snapshots and the streaming replica out of the picture entirely.
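
For anyone following along, here is a rough sketch of what the pg_basebackup half of that might look like. It is written in Python and only builds the command line; it does not run anything. The host, user and target directory are hypothetical placeholders, not our real setup. The flags themselves (-h, -U, -D, -F, -X, -P) are ones pg_basebackup accepts in 9.2.

```python
def basebackup_command(host, user, target_dir):
    """Build (but do not run) a pg_basebackup command line.

    All argument values are hypothetical placeholders for illustration.
    """
    return [
        "pg_basebackup",
        "-h", host,        # server to take the base backup from
        "-U", user,        # role with replication privileges
        "-D", target_dir,  # directory that receives the backup
        "-F", "p",         # plain format: a copy of the data directory
        "-X", "stream",    # stream the WAL needed to make the backup consistent
        "-P",              # report progress
    ]


if __name__ == "__main__":
    # Hypothetical values; substitute your own before using this for real.
    print(" ".join(basebackup_command("primary.example", "replicator", "/backups/base")))
```

Because the backup is copied over the replication protocol rather than hard linked from a snapshot, nothing in the backup shares on-disk files with the running cluster.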

-David


On Tue, May 28, 2013 at 7:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> Today we have seen
>
> 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
> 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
>
> while taking the backup of the primary.  We have been running for a few days
> like that and today is the first day where we see these problems again.  So
> it's not entirely deterministic / we don't know yet what we have to do to
> reproduce.
>
> So this makes Robert's theory more likely.  However we have also been using this
> method (LVM + rsync with hardlinks from primary) for years without these
> problems.  So the big question is what changed?

Well... I don't know.  But my guess is there's something wrong with
the way you're using hardlinks.  Remember, a hardlink means two
logical pointers to the same file on disk.  So if either file gets
modified after the fact, then the other pointer is going to see the
changes.  The xlog flush request not satisfied stuff could happen if,
for example, the backup is pointing to a file, and the primary is
pointing to the same file, and the primary modifies the file after the
backup is taken (thus modifying the backup after-the-fact).
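
The hardlink behaviour described above can be illustrated with a small, self-contained sketch. It is not the backup script from this thread; the file names are made up, and it assumes a filesystem that supports hard links. It shows that an in-place write through one name is visible through the other, whereas replacing the file by rename leaves the other name untouched.

```python
import os
import tempfile


def hardlink_demo():
    """Return (same_inode, backup_after_in_place_write, backup_after_rename)."""
    with tempfile.TemporaryDirectory() as d:
        primary = os.path.join(d, "relfile")         # hypothetical "live" file
        backup = os.path.join(d, "relfile.backup")   # hypothetical hardlinked copy

        with open(primary, "w") as f:
            f.write("old")
        os.link(primary, backup)  # second name for the same on-disk file
        same_inode = os.stat(primary).st_ino == os.stat(backup).st_ino

        # In-place modification through the "primary" name.
        with open(primary, "r+") as f:
            f.write("NEW")
        with open(backup) as f:
            after_in_place = f.read()  # the "backup" sees the change

        # Replacement by rename: the "primary" name now points at a new file.
        tmp = os.path.join(d, "relfile.tmp")
        with open(tmp, "w") as f:
            f.write("v3!")
        os.replace(tmp, primary)
        with open(backup) as f:
            after_rename = f.read()  # the "backup" still holds the old file

        return same_inode, after_in_place, after_rename


if __name__ == "__main__":
    print(hardlink_demo())
```

A hardlinked backup is therefore only safe if every file in it is never again modified in place through its other name.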

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
