Thread: Streaming replication bug in 9.3.2, "WAL contains references to invalid pages"

From
Christophe Pettus
Date:
Greetings,

We've had two clients experience a crash on the secondary of a streaming replication pair, running PostgreSQL 9.3.2.
In both cases, the messages were close to this example:

2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 PST,1/0,0,WARNING,01000,"page 45785 of
relation base/236971/365951 is uninitialized",,,,,"xlog redo vacuum: rel 1663/236971/365951; blk 45794,
lastBlockVacuumed 45784",,,,""
2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 PST,1/0,0,PANIC,XX000,"WAL contains
references to invalid pages",,,,,"xlog redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784",,,,""
2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 PST,,0,LOG,00000,"startup process (PID 23869)
was terminated by signal 6: Aborted",,,,,,,,,""

In both cases, the indicated relation was a primary key index.  In one case, rebuilding the primary key index caused
the problem to go away permanently (to date).  In the second case, the problem returned even after a full dump / restore
of the master database (that is, after a dump / restore of the master, and reimaging the secondary, the problem returned
at the same primary key index, although of course with a different OID value).
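
For anyone trying to work out which relation a message like the one above points at, here is a minimal sketch that pulls the identifiers out of the "xlog redo vacuum" context string. It assumes the three numbers after "rel" are tablespace OID / database OID / relfilenode, as the matching base/236971/365951 path in the WARNING suggests. The function and class names are hypothetical, made up for this illustration.

```python
import re
from typing import NamedTuple


class VacuumRedoContext(NamedTuple):
    """Fields of an 'xlog redo vacuum' context string (names are illustrative)."""
    tablespace_oid: int
    database_oid: int
    relfilenode: int
    block: int
    last_block_vacuumed: int


# Whitespace is matched loosely because mail clients wrap the log lines.
_CONTEXT_RE = re.compile(
    r"xlog\s+redo\s+vacuum:\s+rel\s+(\d+)/(\d+)/(\d+);\s+"
    r"blk\s+(\d+),\s*lastBlockVacuumed\s+(\d+)"
)


def parse_vacuum_redo_context(text: str) -> VacuumRedoContext:
    """Extract the relation and block numbers from a log line or context field."""
    match = _CONTEXT_RE.search(text)
    if match is None:
        raise ValueError("no 'xlog redo vacuum' context found")
    return VacuumRedoContext(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    ctx = parse_vacuum_redo_context(
        "xlog redo vacuum: rel 1663/236971/365951; blk 45794, "
        "lastBlockVacuumed 45784"
    )
    print(ctx)
```

The relfilenode can then be compared against pg_class.relfilenode in the database with the given OID to find the relation name. Note that in the example above the uninitialized page (45785) is exactly lastBlockVacuumed + 1.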

It looks like this has been experienced on 9.2.6, as well:
http://www.postgresql.org/message-id/flat/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jXe6V4R7Hxmg@mail.gmail.com

Let me know if there's any further information I can provide.

Best,
--
-- Christophe Pettus  xof@thebuild.com




From: "Christophe Pettus" <xof@thebuild.com>
We've had two clients experience a crash on the secondary of a streaming 
replication pair, running PostgreSQL 9.3.2.  In both cases, the messages 
were close to this example:

2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 
PST,1/0,0,WARNING,01000,"page 45785 of relation base/236971/365951 is 
uninitialized",,,,,"xlog redo vacuum: rel 1663/236971/365951; blk 45794, 
lastBlockVacuumed 45784",,,,""
2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 
PST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog 
redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 
45784",,,,""
2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 
PST,,0,LOG,00000,"startup process (PID 23869) was terminated by signal 6: 
Aborted",,,,,,,,,""

In both cases, the indicated relation was a primary key index.  In one case, 
rebuilding the primary key index caused the problem to go away permanently 
(to date).  In the second case, the problem returned even after a full dump 
/ restore of the master database (that is, after a dump / restore of the 
master, and reimaging the secondary, the problem returned at the same 
primary key index, although of course with a different OID value).

It looks like this has been experienced on 9.2.6, as well:


I experienced this problem with 9.2.4 once at the end of last year, too.
The messages were the same except for the relation and page numbers.  In
addition, I encountered a similar (possibly the same) problem with 9.1.6
about a year ago.  At that time, I found that several people had reported
similar problems on the pgsql-* mailing lists over the past several years,
but none of those had been resolved.  There seems to be a big, dangerous
bug hiding somewhere.

Regards
MauMau




Re: Streaming replication bug in 9.3.2, "WAL contains references to invalid pages"

From
Sergey Konoplev
Date:
On Thu, Jan 2, 2014 at 11:59 AM, Christophe Pettus <xof@thebuild.com> wrote:
> In both cases, the indicated relation was a primary key index.  In one case, rebuilding the primary key index caused
> the problem to go away permanently (to date).  In the second case, the problem returned even after a full dump / restore
> of the master database (that is, after a dump / restore of the master, and reimaging the secondary, the problem returned
> at the same primary key index, although of course with a different OID value).
>
> It looks like this has been experienced on 9.2.6, as well:
>
>         http://www.postgresql.org/message-id/flat/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jXe6V4R7Hxmg@mail.gmail.com

This problem worries me a lot too. If someone is interested I still
have a file system copy of the buggy cluster including WAL.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com



We had the same issues running 9.2.4:

[2013-10-15 00:23:01 GMT/0/15396] WARNING:  page 8789807 of relation
base/16429/2349631976 is uninitialized
[2013-10-15 00:23:01 GMT/0/15396] CONTEXT:  xlog redo vacuum: rel
1663/16429/2349631976; blk 8858544, lastBlockVacuumed 0
[2013-10-15 00:23:01 GMT/0/15396] PANIC:  WAL contains references to
invalid pages
[2013-10-15 00:23:01 GMT/0/15396] CONTEXT:  xlog redo vacuum: rel
1663/16429/2349631976; blk 8858544, lastBlockVacuumed 0
[2013-10-15 00:23:11 GMT/0/15393] LOG:  startup process (PID 15396)
was terminated by signal 6: Aborted
[2013-10-15 00:23:11 GMT/0/15393] LOG:  terminating any other active
server processes

Also on an index. I ended up manually patching the heap files at that
block location to "fix" the problem. It happened again about 2 weeks
after that, then never again. It hit all connected secondaries.
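
For reference, here is a rough sketch of how a block number from these messages maps to a file on disk. It assumes a default build (8 kB blocks, 1 GB segment files); a server compiled with other settings would need different constants. The helper name is hypothetical, and this only locates the block; it does not modify anything.

```python
from typing import Tuple

BLOCK_SIZE = 8192                 # assumed default block size (8 kB)
SEGMENT_SIZE = 1024 ** 3          # assumed default segment size (1 GB)
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE  # 131072 blocks per file


def locate_block(relpath: str, block: int) -> Tuple[str, int]:
    """Return (segment file name, byte offset within that file) for a block.

    relpath is the path printed in the log, e.g. 'base/16429/2349631976'.
    The first segment has no suffix; later ones are named '<relpath>.<n>'.
    """
    if block < 0:
        raise ValueError("block number must be non-negative")
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    filename = relpath if segment == 0 else f"{relpath}.{segment}"
    return filename, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    # Block number taken from the CONTEXT line in the log above.
    print(locate_block("base/16429/2349631976", 8858544))
```

With those assumptions, block 8858544 falls in the segment file with suffix .67, well past the first gigabyte of the relation.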

On Fri, Jan 3, 2014 at 12:50 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Thu, Jan 2, 2014 at 11:59 AM, Christophe Pettus <xof@thebuild.com> wrote:
>> In both cases, the indicated relation was a primary key index.  In one case, rebuilding the primary key index caused
>> the problem to go away permanently (to date).  In the second case, the problem returned even after a full dump / restore
>> of the master database (that is, after a dump / restore of the master, and reimaging the secondary, the problem returned
>> at the same primary key index, although of course with a different OID value).
>>
>> It looks like this has been experienced on 9.2.6, as well:
>>
>>         http://www.postgresql.org/message-id/flat/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jXe6V4R7Hxmg@mail.gmail.com
>
> This problem worries me a lot too. If someone is interested I still
> have a file system copy of the buggy cluster including WAL.
>
> --
> Kind regards,
> Sergey Konoplev
> PostgreSQL Consultant and DBA
>
> http://www.linkedin.com/in/grayhemp
> +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
> gray.ru@gmail.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers