Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Msg-id 004f01ce5130$f9077a60$eb166f20$@kapila@huawei.com
In response to Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Benedikt Grundmann <bgrundmann@janestreet.com>)
List pgsql-hackers
On Tuesday, May 14, 2013 7:19 PM Benedikt Grundmann wrote:
> It's on the production database and the streaming replica.  But not on the snapshot.

> production
> -rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3

> streaming replica
> -rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3

> Is there a way to find out what the file contains?

You can try the pageinspect module in contrib.
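
pageinspect addresses pages by block number within the relation, not by segment file, so it may help to work out which blocks the ".3" file holds. A minimal sketch, assuming the default 8 kB block size and 1 GB segment size (both are build-time options, so check them on your server first); the function name is made up for illustration:

```python
# Assumed build defaults; verify with "SHOW block_size" and "SHOW segment_size".
BLOCK_SIZE = 8192            # bytes per page
SEGMENT_BYTES = 1024 ** 3    # bytes per relation segment file


def segment_block_range(segment_no, file_bytes):
    """Return (first, last) relation block numbers stored in one segment file.

    segment_no is the numeric suffix of the file name (3 for "291498116.3");
    file_bytes is the file size as reported by ls.
    """
    blocks_per_segment = SEGMENT_BYTES // BLOCK_SIZE
    blocks_in_file = file_bytes // BLOCK_SIZE
    if blocks_in_file == 0:
        raise ValueError("segment file holds no complete block")
    first = segment_no * blocks_per_segment
    return first, first + blocks_in_file - 1


if __name__ == "__main__":
    # Size taken from the ls output quoted above.
    first, last = segment_block_range(3, 312778752)
    print("blocks", first, "to", last)
```

Under those assumptions, any block number in that range can be passed to pageinspect's get_raw_page() to look at a page stored in this file.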

> We just got some more information.  All of the following was done / seen in the logs of the snapshot database.

> After we saw this we run a vacuum full on the table we suspect to be backed by this file.  This happened:

>WARNING:  concurrent insert in progress within table "js_equity_daily_diff"



> 2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> And after that these started appearing in logs (and they get repeated every second now):

> [root@nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv  | tail -n 2
> 2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> There are no earlier ERROR's in the logs.
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""

> The disk is not full nor are there any messages in the kernel logs.

The reason for this is that the system is not able to flush XLOG up to the
requested point; most likely, the requested flush point is past the end of XLOG.
This has been seen to occur when a disk page has a corrupted LSN. (I am
quoting this from the comment in the code where the above error message occurs.)
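
To make the "past the end of XLOG" point concrete, the two positions from the error message can be compared directly. This is only a sketch: the helper name is made up, and it relies on the textual form being two hexadecimal halves (log file id, then offset within it) that order lexicographically.

```python
def parse_lsn(text):
    """Split an LSN such as '1D08/9B57FCD0' into (log_id, offset) integers.

    Python compares tuples element by element, so comparing two results
    orders them by log file id first and by offset second.
    """
    log_id, offset = text.split("/")
    return int(log_id, 16), int(offset, 16)


if __name__ == "__main__":
    # Values copied from the ERROR lines quoted above.
    requested = parse_lsn("1D08/9B57FCD0")
    flushed = parse_lsn("1CEE/31266090")

    print("requested is ahead of flushed:", requested > flushed)
    print("log file ids apart:", requested[0] - flushed[0])
```

The requested position is not a few records ahead but many log file ids ahead of what has been flushed, which fits better with a bad LSN stored on the page than with a flush that is merely lagging.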

So if XLOG is not flushed, the checkpointer will not flush even the data of
file 291498116.

It seems to me that the database where these errors are observed is
corrupted.

With Regards,
Amit Kapila.




