Thread: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)


Today we have seen this on our testing database instance:

ERROR:  could not open file "base/16416/291498116.3" (target block 431006): No such file or directory

That database gets created by rsyncing the LVM snapshot of the standby, which is a read-only backup of proddb
using streaming replication.

We do not put the standby database into backup mode before rsyncing the LVM snapshot, trusting postgres's ability to recover
from a crash.  Now we are not sure anymore...  Thoughts?  It's worth noting that we have been using this method of taking a backup
for a long, long time, but only recently (after 9.2 came out) did we switch to running a streaming replica and taking the
"frozen snapshot" backup from the streaming replica.

Let me know if there is more information you need, or if this is a known problem.

Thanks,

Bene
On 14.05.2013 14:57, Benedikt Grundmann wrote:
> Today we have seen this on our testing database instance:
>
> ERROR:  could not open file "base/16416/291498116.3" (target block 431006):
> No such file or directory
>
> That database get's created by rsyncing the LVM snapshot of the standby,
> which is a readonly backup of proddb
> using streaming replication.
>
> We do not put the standby database into backup mode before rsyncing the LVM
> snapshot, trusting postgres ability to recover
> from crash.  Now we are not sure anymore...  Thoughts?  It's worth noting
> that we have been using this method of taking a backup
> for a long long time, but we only recently (after 9.2 came out we switched
> to doing a streaming replica and then to take the
> "frozen snapshot" backup method from the streaming replica).
>
> Let me know if there is more information you need, or if this is a known
> problem.

That certainly should work. I'd suggest that you narrow down the problem 
a bit more. Is the file present in the original system? Anything else 
that might be related in the logs?

- Heikki



It's on the production database and the streaming replica.  But not on the snapshot.

production
-rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3

streaming replica
-rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3

Is there a way to find out what the file contains?

We just got some more information.  All of the following was done / seen in the logs of the snapshot database.

After we saw this we ran a vacuum full on the table we suspect to be backed by this file.  This happened:


WARNING:  concurrent insert in progress within table "js_equity_daily_diff"

Over and over again.  So we killed the VACUUM full by hitting ctrl-c again and again.  After that the logs contained:

fgrep ERROR postgresql-2013-05-14.csv  | head
2013-05-14 00:23:16.028 EDT,"postgres_prod","proddb_testing",15698,"172.24.65.103:54536",5191bc31.3d52,3,"SELECT",2013-05-14 00:23:13 EDT,2/3330,0,ERROR,58P01,"could not open file ""base/16416/291498116.3"" (target block 431006): No such file or directory",,,,,"SQL statement ""select max(equity_date) from js_equity_daily_diff""

The above was the original error that made us investigate.

2013-05-14 09:21:47.121 EDT,"postgres_prod","proddb_testing",21002,"172.27.41.24:36815",51923a67.520a,38756,"VACUUM",2013-05-14 09:21:43 EDT,2/21611,645995272,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"
2013-05-14 09:22:04.700 EDT,"postgres_prod","proddb_testing",21008,"172.27.41.24:36831",51923a75.5210,115909,"VACUUM",2013-05-14 09:21:57 EDT,2/21618,645995273,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"

That was us cancelling the vacuum full.

2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""


And after that these started appearing in the logs (and they are repeated every second now):

[root@nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv  | tail -n 2
2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""

There are no earlier ERRORs in the logs.

2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""

The disk is not full nor are there any messages in the kernel logs.

Cheers,

Bene



On Tue, May 14, 2013 at 9:27 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 14:57, Benedikt Grundmann wrote:
Today we have seen this on our testing database instance:

ERROR:  could not open file "base/16416/291498116.3" (target block 431006):
No such file or directory

That database get's created by rsyncing the LVM snapshot of the standby,
which is a readonly backup of proddb
using streaming replication.

We do not put the standby database into backup mode before rsyncing the LVM
snapshot, trusting postgres ability to recover
from crash.  Now we are not sure anymore...  Thoughts?  It's worth noting
that we have been using this method of taking a backup
for a long long time, but we only recently (after 9.2 came out we switched
to doing a streaming replica and then to take the
"frozen snapshot" backup method from the streaming replica).

Let me know if there is more information you need, or if this is a known
problem.

That certainly should work. I'd suggest that you narrow down the problem a bit more. Is the file present in the original system? Anything else that might be related in the logs?

- Heikki

On 14.05.2013 16:48, Benedikt Grundmann wrote:
> It's on the production database and the streaming replica.  But not on the
> snapshot.

So, the LVM snapshot didn't work correctly?

- Heikki



That's one possible explanation.  It's worth noting that we haven't seen this before moving to streaming rep first and we have been using that method for a long time.

On Tue, May 14, 2013 at 11:34 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 16:48, Benedikt Grundmann wrote:
It's on the production database and the streaming replica.  But not on the snapshot.

So, the LVM snapshot didn't work correctly?

- Heikki
I think my previous message wasn't clear enough.  I do NOT think that the LVM snapshot is the culprit.

However, I cannot discount it as one of the possibilities.  But I have no evidence in either /var/log/messages or in dmesg that the LVM snapshot went into a bad state, AND we have been using this method for a long time.

The only thing that is new is that we took the snapshot from the streaming replica.  So again my best guess as of now is that if the database crashes while it is in streaming standby, an invalid disk state can result during the following startup (in rare and as of now unclear circumstances).

You seem to be quite convinced that it must be LVM.  Can you elaborate why?

Thanks,

Bene




On Tue, May 14, 2013 at 12:09 PM, Benedikt Grundmann <bgrundmann@janestreet.com> wrote:
That's one possible explanation.  It's worth noting that we haven't seen this before moving to streaming rep first and we have been using that method for a long time.


On Tue, May 14, 2013 at 11:34 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 16:48, Benedikt Grundmann wrote:
It's on the production database and the streaming replica.  But not on the
snapshot.

So, the LVM snapshot didn't work correctly?

- Heikki


On Tuesday, May 14, 2013 7:19 PM Benedikt Grundmann wrote:
> It's on the production database and the streaming replica.  But not on the snapshot.

> production
> -rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3
>
> streaming replica
> -rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3
>
> Is there a way to find out what the file contains?

You can try the pageinspect module in contrib.
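For example, something along these lines (run against the affected database; the table and relfilenode are the ones mentioned earlier in this thread) can show which relation the file belongs to and what a page header looks like:

psql -d proddb_testing <<'SQL'
-- pageinspect ships in contrib; install it into this database first
CREATE EXTENSION pageinspect;
-- which relation does relfilenode 291498116 belong to?
SELECT relname FROM pg_class WHERE relfilenode = 291498116;
-- inspect one page's header fields (block numbers are relation-wide)
SELECT * FROM page_header(get_raw_page('js_equity_daily_diff', 0));
SQL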

> We just got some more information.  All of the following was done / seen in the logs of the snapshot database.
>
> After we saw this we run a vacuum full on the table we suspect to be backed by this file.  This happened:
>
> WARNING:  concurrent insert in progress within table "js_equity_daily_diff"



> 2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
>
> And after that these started appearing in logs (and they get repeated every second now:
>
> [root@nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv  | tail -n 2
> 2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
>
> There are no earlier ERROR's in the logs.
>
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""

> The disk is not full nor are there any messages in the kernel logs.

The reason for this is that the system is not able to flush the XLOG up to the requested
point; most likely, the requested flush point is past the end of the XLOG.
This has been seen to occur when a disk page has a corrupted LSN.  (I am
quoting this from the comment in the code where the above error message occurs.)

So if the XLOG is not flushed, the checkpointer will not flush the data of file
291498116 either.

It seems to me that the database where these errors are observed is
corrupted.

With Regards,
Amit Kapila.





On 14.05.2013 23:47, Benedikt Grundmann wrote:
> The only thing that is *new* is that we took the snapshot from the
> streaming replica.  So again my best guess as of now is that if the
> database crashes while it is in streaming standby a invalid disk state can
> result during during the following startup (in rare and as of now unclear
> circumstances).

A bug is certainly possible. There isn't much detail here to debug with, 
I'm afraid. Can you share the full logs on all three systems? I'm 
particularly interested in the logs from when the snapshot system was first started up.

> You seem to be quite convinced that it must be LVM can you elaborate why?

Well, you said that there was a file in the original filesystem, but not 
in the snapshot. If you didn't do anything in between, then surely the 
snapshot is broken, if it skipped a file. Or was the file created in the 
original filesystem after the snapshot was taken? You probably left out 
some crucial details on how exactly the snapshot and rsync are 
performed. Can you share the scripts you're using?

Can you reproduce this problem with a new snapshot? Do you still have 
the failed snapshot unchanged?

- Heikki



First, thanks for the replies.  This sort of thing is frustrating and hard to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have 3 9.2.4 databases using the following setup:

- A primary box
- A standby box running as a hot streaming replica from the primary
- A testing box restored nightly from a static backup

As noted, the static backup is taken off of the standby by taking an LVM snapshot of the database filesystem and rsyncing.  I don't think it's a likely problem but the rsync leverages the previous backup (using --link-dest) to make the rsync faster and the resulting backup smaller.  Each database is ~1.5T, so this is necessary to keep static backup times reasonable.
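Roughly, each nightly run boils down to something like this (paths and host names are illustrative; the real invocation appears in the backup log later in this thread):

# rsync the snapshot mount into the new backup directory, hard-linking
# unchanged files to the previous backup (--link-dest is relative to the destination)
nice -n 19 rsync -a -q --delete --link-dest=../proddb.2013-05-26.20-00-00 \
    /tmp/snapshot-mount/proddb/ postgres@backuphost:/backups/proddb/proddb.in-progress/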

We've been using the same system for quite some time, but previously (~ 1 month ago) had been taking the backup off of the primary (still using the LVM snapshot).  The replication is a recent addition, and a very helpful one.  LVM snapshots aren't lightweight in the face of writes and in some circumstances a long running rsync would spike the IO load on the production box.

Results of some additional tests:

After the user noticed that the test restore showed the original problem we ran `vacuum analyze` on all three testing databases thinking that it had a good chance of quickly touching most of the underlying files.  That gave us errors on two of the testing restores similar to:

ERROR:  invalid page header in block 5427 of relation base/16417/199732075

In the meantime I modified the static backup procedure to shut the standby completely down before taking the LVM snapshot, and am trying a restore using that snapshot now.  I'll test that using the same vacuum analyze test, and if that passes, a full vacuum.
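In outline the modified procedure is something like this (a sketch only; device names, snapshot size, data directory and mount points are examples and may differ from our actual setup):

pg_ctl -D /database/postgres -m fast stop          # cleanly shut the standby down
lvcreate --size 500G --snapshot --name standby-snap /dev/dbc01/d1
pg_ctl -D /database/postgres start                 # standby resumes streaming
mount -o nouuid,ro /dev/dbc01/standby-snap /mnt/standby-snap
nice -n 19 rsync -a -q --delete --link-dest=../previous-backup \
    /mnt/standby-snap/proddb/ postgres@backuphost:/backups/proddb.in-progress/
umount /mnt/standby-snap
lvremove -f /dev/dbc01/standby-snap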

I'm also running the vacuum analyze on the production machines to double check that the base databases don't have a subtle corruption that simply hasn't been noticed.  They run with normal autovacuum settings, so I suspect they are fine and this won't show anything, because we should have seen this from the autovacuum daemon before.

I'm happy to share the scripts we use for the backup/restore process if the information above isn't enough, as well as the logs - though the postgres logs don't seem to contain much of interest (the database system doesn't really get involved).

I also have the rsyncs of the failed snapshots available and could restore them for testing purposes.  It's also easy to look in them (they are just saved as normal directories on a big SAN) if I know what to look for.

-David


On Wed, May 15, 2013 at 2:24 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 14.05.2013 23:47, Benedikt Grundmann wrote:
The only thing that is *new* is that we took the snapshot from the

streaming replica.  So again my best guess as of now is that if the
database crashes while it is in streaming standby a invalid disk state can
result during during the following startup (in rare and as of now unclear
circumstances).

A bug is certainly possible. There isn't much detail here to debug with, I'm afraid. Can you share the full logs on all three systems? I'm particularly interested in the logs from when the snapshot system was first started up.


You seem to be quite convinced that it must be LVM can you elaborate why?

Well, you said that there was a file in the original filesystem, but not in the snapshot. If you didn't do anything in between, then surely the snapshot is broken, if it skipped a file. Or was the file created in the original filesystem after the snapshot was taken? You probably left out some crucial details on how exactly the snapshot and rsync are performed. Can you share the scripts you're using?

Can you reproduce this problem with a new snapshot? Do you still have the failed snapshot unchanged?

- Heikki

On 15.05.2013 15:42, David Powers wrote:
> First, thanks for the replies.  This sort of thing is frustrating and hard
> to diagnose at a distance, and any help is appreciated.
>
> Here is some more background:
>
> We have 3 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the 
servers? There was a fix to a bug related to starting a standby server 
from a filesystem snapshot. I don't think it was quite the case you 
have, but pretty close.

> - A primary box
> - A standby box running as a hot streaming replica from the primary
> - A testing box restored nightly from a static backup
>
> As noted, the static backup is taken off of the standby by taking an LVM
> snapshot of the database filesystem and rsyncing.  I don't think it's a
> likely problem but the rsync leverages the previous backup (using
> --link-dest) to make the rsync faster and the resulting backup smaller.

I guess that should work, as long as you make a full copy when you 
restore the backup, and never start a postgres server directly against a 
backup that contains the hardlinks.

> We've been using the same system for quite some time, but previously (~ 1
> month ago) had been taking the backup off of the primary (still using the
> LVM snapshot).  The replication is a recent addition, and a very helpful
> one.  LVM snapshots aren't lightweight in the face of writes and in some
> circumstances a long running rsync would spike the IO load on the
> production box.
>
> Results of some additional tests:
>
> After the user noticed that the test restore showed the original problem we
> ran `vacuum analyze` on all three testing databases thinking that it had a
> good chance of quickly touching most of the underlying files.  That gave us
> errors on two of the testing restores similar to:
>
> ERROR:  invalid page header in block 5427 of relation base/16417/199732075

Huh, that's surprising. You have to be quite unlucky to end up with a 
corrupt page header, even if there's something wrong with the underlying 
storage or backup routines. Could you grab a copy of that block? 
Something like

dd skip=5427 bs=8192 count=1 if=base/16417/199732075 of=corrupt-block

I'm especially curious what the first 20 bytes or so look like. I'm 
guessing that it's all zeros, while some later part of the page contains 
data, i.e. the page was torn at some point in the process.
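To eyeball those bytes once the block has been dd'd out, something like this works:

# dump the first 32 bytes of the copied block in hex; in 9.2 the first 24 bytes
# are the page header (pd_lsn, pd_tli, pd_flags, pd_lower, pd_upper, ...)
od -A d -t x1 corrupt-block | head -n 2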

> I'm also running the vacuum analyze on the production machines to double
> check that the base databases don't have a subtle corruption that simply
> hasn't been noticed.  They run with normal autovacuum settings, so I
> suspect that they are fine/this won't show anything because we should have
> seen this from the autovacuum daemon before.

Vacuum analyze doesn't always scan the whole table with default 
settings, so that's no proof that there are no missing pages. "set 
vacuum_freeze_table_age=0; vacuum <table>" will force a full scan of the 
table.
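For example (sent statement by statement, since VACUUM can't run inside a transaction block):

psql -d proddb_testing <<'SQL'
SET vacuum_freeze_table_age = 0;   -- session-local; forces the next vacuum to scan every page
VACUUM VERBOSE js_equity_daily_diff;
SQL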

> I'm happy to share the scripts we use for the backup/restore process if the
> information above isn't enough, as well as the logs - though the postgres
> logs don't seem to contain much of interest (the database system doesn't
> really get involved).

Yeah, please share the scripts. What you've described seems correct to 
me, but maybe there's something I'm missing.

> I also have the rsyncs of the failed snapshots available and could restore
> them for testing purposes.  It's also easy to look in them (they are just
> saved as normal directories on a big SAN) if I know what to look for.

Great. Not sure what to look at right now, but good that you have kept 
the evidence.

- Heikki



On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 15.05.2013 15:42, David Powers wrote:
First, thanks for the replies.  This sort of thing is frustrating and hard
to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have 3 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the servers? There was a fix to a bug related to starting a standby server from a filesystem snapshot. I don't think it was quite the case you have, but pretty close.


So this is delightfully embarrassing.  I just went back to double check and:

- primary box is 9.2.3
- standby is 9.2.4
- testing is 9.2.4

I guess that alone could possibly explain it? 

Thanks,

Bene
On 15.05.2013 22:50, Benedikt Grundmann wrote:
> On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
>> servers? There was a fix to a bug related to starting a standby server from
>> a filesystem snapshot. I don't think it was quite the case you have, but
>> pretty close.
>
> So this is delightfully embarrassing I just went back to double check and
>
> - primary box is 9.2.3
> - standby is 9.2.4
> - testing is 9.2.4
>
> I guess that alone could possibly explain it?

Hmm, no, it should still work. There haven't been any changes in the WAL 
format. I do recommend upgrading the primary, of course, but I don't 
really see how that would explain what you're seeing.

- Heikki



I'll try to get the primary upgraded over the weekend when we can afford a restart.

In the meantime I have a single test showing that a shutdown, snapshot, restart produces a backup that passes the vacuum analyze test.  I'm going to run a full vacuum today.

-David


On Wed, May 15, 2013 at 3:53 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 15.05.2013 22:50, Benedikt Grundmann wrote:
On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas@vmware.com
The subject says 9.2.3. Are you sure you're running 9.2.4 on all the

servers? There was a fix to a bug related to starting a standby server from
a filesystem snapshot. I don't think it was quite the case you have, but
pretty close.

So this is delightfully embarrassing I just went back to double check and

- primary box is 9.2.3
- standby is 9.2.4
- testing is 9.2.4

I guess that alone could possibly explain it?

Hmm, no, it should still work. There haven't been any changes in the WAL format. I do recommend upgrading the primary, of course, but I don't really see how that would explain what you're seeing.

- Heikki

We are seeing these errors on a regular basis on the testing box now.  We have even changed the backup script to
shut down the hot standby, take the LVM snapshot, restart the hot standby, and rsync the LVM snapshot.  It still happens.

We have never seen this before we introduced the hot standby.  So we will now revert to taking the backups from LVM snapshots on the production database.  If you have ideas of what else we should try / what information we can give you to debug this, let us know and we will try to do so.

Until then we will sadly operate on the assumption that the combination of hot standby and "frozen snapshot" backup of it is not production ready.

Thanks,

Bene




On Thu, May 16, 2013 at 8:10 AM, David Powers <dpowers@janestreet.com> wrote:
I'll try to get the primary upgraded over the weekend when we can afford a restart.

In the meantime I have a single test showing that a shutdown, snapshot, restart produces a backup that passes the vacuum analyze test.  I'm going to run a full vacuum today.

-David


On Wed, May 15, 2013 at 3:53 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 15.05.2013 22:50, Benedikt Grundmann wrote:
On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas@vmware.com
The subject says 9.2.3. Are you sure you're running 9.2.4 on all the

servers? There was a fix to a bug related to starting a standby server from
a filesystem snapshot. I don't think it was quite the case you have, but
pretty close.

So this is delightfully embarrassing I just went back to double check and

- primary box is 9.2.3
- standby is 9.2.4
- testing is 9.2.4

I guess that alone could possibly explain it?

Hmm, no, it should still work. There haven't been any changes in the WAL format. I do recommend upgrading the primary, of course, but I don't really see how that would explain what you're seeing.

- Heikki


On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> We are seeing these errors on a regular basis on the testing box now.  We
> have even changed the backup script to
> shutdown the hot standby, take lvm snapshot, restart the hot standby, rsync
> the lvm snapshot.  It still happens.
>
> We have never seen this before we introduced the hot standby.  So we will
> now revert to taking the backups from lvm snapshots on the production
> database.  If you have ideas of what else we should try / what information
> we can give you to debug this let us know and we will try to so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way.  The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug.  A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early.  It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm not ruling it out).  We moved back to taking the backup off of the primary and all errors for all three clusters went away.  All of the hardware is the same, and the OS and postgres versions are largely the same (9.2.3 vs. 9.2.4 in some cases, various patch levels of CentOS 6.3 for the OS).  The backup code is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have viable static backups.  After that I'll redo one of the restores from a suspected backup and will post the logs.

-David


On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> We are seeing these errors on a regular basis on the testing box now.  We
> have even changed the backup script to
> shutdown the hot standby, take lvm snapshot, restart the hot standby, rsync
> the lvm snapshot.  It still happens.
>
> We have never seen this before we introduced the hot standby.  So we will
> now revert to taking the backups from lvm snapshots on the production
> database.  If you have ideas of what else we should try / what information
> we can give you to debug this let us know and we will try to so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way.  The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug.  A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early.  It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Today we have seen

2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation base/16416/293974676",,,,""
2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation base/16416/293974676",,,,""

while taking the backup of the primary.  We have been running for a few days like that and today is the first day where we see these problems again.  So it's not entirely deterministic / we don't know yet what we have to do to reproduce it.

So this makes Robert's theory more likely.  However, we have also been using this method (LVM + rsync with hardlinks from the primary) for years without these problems.  So the big question is: what changed?

One hypothesis is that it is related to the primary running with wal_level = hot_standby instead of minimal or archive (which we used before we switched to 9.2).
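If it matters, the current setting on the primary is easy to confirm:

psql -h tot-dbc-001 -U postgres_prod -d postgres -c "SHOW wal_level;"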

Here are the entries in the log related to the startup of the corrupt testing cluster:

2013-05-27 22:41:10.029 EDT,,,30598,,51a41946.7786,1,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"database system was interrupted; last known up at 2013-05-26 21:01:09 EDT",,,,,,,,,""
2013-05-27 22:41:10.029 EDT,,,30599,"",51a41946.7787,1,"",2013-05-27 22:41:10 EDT,,0,LOG,00000,"connection received: host=172.27.65.204 port=55279",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,,,30598,,51a41946.7786,2,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,"as-elephant","postgres",30599,"172.27.65.204:55279",51a41946.7787,2,"",2013-05-27 22:41:10 EDT,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2013-05-27 22:41:10.031 EDT,,,30598,,51a41946.7786,3,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"redo starts at 1E7E/2152B178",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,4,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"record with zero length at 1E7E/215AC6B8",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,5,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"redo done at 1E7E/215AC688",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,6,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"last completed transaction was at log time 2013-05-26 21:09:08.06351-04",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30595,,51a41945.7783,1,,2013-05-27 22:41:09 EDT,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30603,,51a41946.778b,1,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
2013-05-27 22:41:15.037 EDT,,,30608,"",51a4194b.7790,1,"",2013-05-27 22:41:15 EDT,,0,LOG,00000,"connection received: host=172.27.65.204 port=55283",,,,,,,,,""

This means we currently do NOT have a way to make backups that we trust.  We are very open to suggestions of alternative methods we should consider using.  The database is of non-trivial size by now:

proddb=> select pg_size_pretty(pg_database_size('proddb'));
 pg_size_pretty
----------------
 1294 GB
(1 row)

The backup script itself is by now a rather long OCaml program, so I doubt the value of posting it to this list.  But here is the log of what it did, which should be fairly self-explanatory:

proddb backup: starting
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" (enqueued)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" (running as pid: [23422])
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" ([23422] exited normally)
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as pid: [23433])
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([23433] exited normally)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup [23437]: 2013-05-27 20:00:02.232864-04:00 Debug backing up tot-dbc-001:proddb to ((username postgres) (hostname 127.0.0.1)
proddb backup [23437]:  (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
proddb backup [23437]:  (last_backup_dir (..//proddb.2013-05-26.20-00-00)))
proddb backup [23437]: 2013-05-27 20:00:02.232869-04:00 Debug running /usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c checkpoint
proddb backup [23437]: 2013-05-27 20:00:02.459035-04:00 Debug Running preliminary rsync outside of snapshot to minimize its lifetime.
proddb backup [23437]: 2013-05-27 20:00:02.459049-04:00 Debug running rsync from /database/ to postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 20:00:02.459053-04:00 Debug running nice -n 19 rsync --link-dest ..//proddb.2013-05-26.20-00-00 -a -q --delete /database/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:05.742036-04:00 Debug Taking snapshot of database
proddb backup [23437]: 2013-05-27 21:43:05.742056-04:00 Debug finding LVM device for /database
proddb backup [23437]: 2013-05-27 21:43:05.796354-04:00 Debug real path of /database is /mnt/dbc01-d1/proddb
proddb backup [23437]: 2013-05-27 21:43:05.797222-04:00 Debug /database is located on LVM device /dev/dbc01/d1 at relative path /proddb
proddb backup [23437]: 2013-05-27 21:43:05.797237-04:00 Debug running lvcreate --size 500G --snapshot --name proddb-snapshot /dev/dbc01/d1
proddb backup [23437]: 2013-05-27 21:43:06.487274-04:00 Debug running mount -o nouuid,ro /dev/dbc01/proddb-snapshot /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 21:43:06.682700-04:00 Debug running rsync from /tmp/snapfchjHv/proddb/ to postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:06.682710-04:00 Debug running nice -n 19 rsync --inplace --link-dest ..//proddb.2013-05-26.20-00-00 -a -q --delete /tmp/snapfchjHv/proddb/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 22:38:45.817805-04:00 Debug Deleting snapshot of database
proddb backup [23437]: 2013-05-27 22:38:45.817824-04:00 Debug running umount /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096200-04:00 Debug removing mount point /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096732-04:00 Debug running lvremove -f /dev/dbc01/proddb-snapshot
proddb backup [23437]: 2013-05-27 22:38:47.532788-04:00 Debug running lvdisplay /dev/dbc01/proddb-snapshot
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as pid: [14318])
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([14318] exited normally)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 (enqueued)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 (running as pid: [14319])
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 ([14319] exited normally)
proddb backup: finished; status = ok

Thanks in advance,

Bene




On Thu, May 23, 2013 at 5:30 PM, David Powers <dpowers@janestreet.com> wrote:
Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm not ruling it out).  We moved back to taking the backup off of the primary and all errors for all three clusters went away.  All of the hardware is the same, OS and postgres versions are largely the same (9.2.3 vs. 9.2.4 in some cases, various patch levels of Cent 6.3 for the OS).  The backup code is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have viable static backups.  After that I'll redo one of the restores from a suspected backup and will post the logs.

-David


On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> We are seeing these errors on a regular basis on the testing box now.  We
> have even changed the backup script to
> shutdown the hot standby, take lvm snapshot, restart the hot standby, rsync
> the lvm snapshot.  It still happens.
>
> We have never seen this before we introduced the hot standby.  So we will
> now revert to taking the backups from lvm snapshots on the production
> database.  If you have ideas of what else we should try / what information
> we can give you to debug this let us know and we will try to so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way.  The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug.  A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early.  It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> Today we have seen
>
> 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
> 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
>
> while taking the backup of the primary.  We have been running for a few days
> like that and today is the first day where we see these problems again.  So
> it's not entirely deterministic / we don't know yet what we have to do to
> reproduce.
>
> So this makes Robert's theory more likely.  However we have also using this
> method (LVM + rsync with hardlinks from primary) for years without these
> problems.  So the big question is what changed?

Well... I don't know.  But my guess is there's something wrong with
the way you're using hardlinks.  Remember, a hardlink means two
logical pointers to the same file on disk.  So if either file gets
modified after the fact, then the other pointer is going to see the
changes.  The xlog flush request not satisfied stuff could happen if,
for example, the backup is pointing to a file, and the primary is
pointing to the same file, and the primary modifies the file after the
backup is taken (thus modifying the backup after-the-fact).
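The aliasing is easy to demonstrate with a toy example (made-up file names):

echo "original" > f1
ln f1 f2              # f2 is a second directory entry for the same inode
echo "changed" >> f1  # write through the first name
cat f2                # prints both lines: there is only one underlying file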

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



It's another possibility, but I think it's still somewhat remote given how long we've been using this method with this code.  It's sadly hard to test because taking the full backup without the hard linking is fairly expensive (the databases comprise multiple terabytes).

As a possibly unsatisfying solution, I've spent the last day reworking the backups to use the low-level backup API and pg_basebackup, taking the LVM snapshots and the streaming replica out of the picture entirely.
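For the record, the two approaches look roughly like this (host, user and directory names are taken from elsewhere in this thread or are placeholders; the details of our wrapper are omitted):

# option 1: pg_basebackup pulls a consistent copy plus the WAL needed to start it
pg_basebackup -h tot-dbc-001 -U postgres_prod -D /backups/proddb.in-progress -X fetch -P

# option 2: the low-level API around an rsync of the primary's data directory
psql -h tot-dbc-001 -U postgres_prod -d postgres -c "SELECT pg_start_backup('nightly', true);"
nice -n 19 rsync -a -q --delete /database/ postgres@backuphost:/backups/proddb.in-progress/
psql -h tot-dbc-001 -U postgres_prod -d postgres -c "SELECT pg_stop_backup();"
# note: the WAL generated between start and stop must also be kept
# (via archive_command or by copying it separately) for the copy to be restorable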

-David


On Tue, May 28, 2013 at 7:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> Today we have seen
>
> 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
> 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
>
> while taking the backup of the primary.  We have been running for a few days
> like that and today is the first day where we see these problems again.  So
> it's not entirely deterministic / we don't know yet what we have to do to
> reproduce.
>
> So this makes Robert's theory more likely.  However we have also using this
> method (LVM + rsync with hardlinks from primary) for years without these
> problems.  So the big question is what changed?

Well... I don't know.  But my guess is there's something wrong with
the way you're using hardlinks.  Remember, a hardlink means two
logical pointers to the same file on disk.  So if either file gets
modified after the fact, then the other pointer is going to see the
changes.  The xlog flush request not satisfied stuff could happen if,
for example, the backup is pointing to a file, and the primary is
pointing to the same file, and the primary modifies the file after the
backup is taken (thus modifying the backup after-the-fact).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company