BUG #13797: file "pg_multixact/members/79D4" doesn't exist, reading as zeroes - Mailing list pgsql-bugs

From plockaby@uw.edu
Subject BUG #13797: file "pg_multixact/members/79D4" doesn't exist, reading as zeroes
Date
Msg-id 20151204225806.2618.8469@wrigleys.postgresql.org
Whole thread Raw
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13797
Logged by:          Paul Lockaby
Email address:      plockaby@uw.edu
PostgreSQL version: 9.4.5
Operating system:   CentOS 6
Description:

We have three postgres servers configured with cascading replication:
1->2->3. The primary stayed up. The replicas both failed simultaneously. I
see this in the error logs:

Dec  4 04:05:12 mario03 postgres[26467]: [38-1] LOG:  file
"pg_multixact/members/79D4" doesn't exist, reading as zeroes
Dec  4 04:05:12 mario03 postgres[26467]: [38-2] CONTEXT:  xlog redo create
mxid 4185669 offset 1632733906 nmembers 26561: 41794592 (keysh) 41794593
(fornokeyupd) 41794594 (fornokeyupd) 41794595 (fornokeyupd) 41794596
(fornokeyupd) 41794597 (fornokeyupd) 41794598 (fornokeyupd) 41794599
(fornokeyupd) 41794600 (fornokeyupd) 41794601 (fornokeyupd) 41794602
(fornokeyupd) 41794603 (fornokeyupd) 41794604 (fornokeyupd) 41794605
(fornokeyupd) 41794606 (fornokeyupd) 41794607 (fornokeyupd) 41794608
(fornokeyupd) 41794609 (fornokeyupd) 41794610 (fornokeyupd) 41794611
(fornokeyupd) 41794612 (fornokeyupd) 41794613 (fornokeyupd) 41794614
(fornokeyupd) 41794615 (fornokeyupd) 41794616 (fornokeyupd) 41794617
(fornokeyupd) 41794618 (fornokeyupd) 41794619 (fornokeyupd) 41794620
(fornokeyupd) 41794621 (fornokeyupd) 41794622 (fornokeyupd) 41794623
(fornokeyupd) 41794624 (fornokeyupd) 41794625 (fornokeyupd) 41794626
(fornokeyupd) 41794627 (fornokeyupd)
Dec  4 04:05:12 mario03 postgres[26467]: [38-3]  41794628 (fornokeyupd)
41794629 (fornokeyupd) 41794630 (fornokeyupd) 41794631 (fornokeyupd)
41794632 (fornokeyupd) 41794633 (fornokeyupd) 41794634 (fornokeyupd)
41794635 (fornokeyupd) 41794636 (fornokeyupd) 41794637 (fornokeyupd)
41794638 (fornokeyupd) 41794639 (fornokeyupd) 41794640 (fornokeyupd)
41794641 (fornokeyupd) 41794642 (fornokeyupd) 41794643 (fornokeyupd)
41794644 (fornokeyupd) 41794645 (fornokeyupd) 41794646 (fornokeyupd)
41794647 (fornokeyupd) 41794648 (fornokeyupd) 41794649 (fornokeyupd)
41794650 (fornokeyupd) 41794651 (fornokeyupd) 41794652 (fornokeyupd)
41794653 (fornokeyupd) 41794654 (fornokeyupd) 41794655 (fornokeyupd)
41794656 (fornokeyupd) 41794657 (fornokeyupd) 41794658 (fornokeyupd)
41794659 (fornokeyupd) 41794660 (fornokeyupd) 41794661 (fornokeyupd)
41794662 (fornokeyupd) 41794663 (fornokeyupd) 41794664 (fornokeyupd)
41794665 (fornokeyupd) 41794666 (fornokeyupd)
....

The fornokeyupd error message repeated dozens of times for 2 seconds and
then appeared to recover. The same error about a different file in
pg_multixact appeared 52 minutes later only to recover. And again it
occurred about 30 minutes later. Both replicas did this throughout the day
until one of them ended this way:

Dec  4 13:05:30 mario03 postgres[3617]: [6-1] FATAL:  could not receive data
from WAL stream: SSL SYSCALL error: EOF detected
Dec  4 13:05:30 mario03 postgres[3617]: [6-2] #011
Dec  4 13:05:30 mario03 postgres[26466]: [18-1] LOG:  startup process (PID
26467) exited with exit code 1
Dec  4 13:05:30 mario03 postgres[26466]: [19-1] LOG:  terminating any other
active server processes
Dec  4 13:05:30 mario03 postgres[21981]: [10-1] WARNING:  terminating
connection because of crash of another server process
Dec  4 13:05:30 mario03 postgres[21981]: [10-2] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
Dec  4 13:05:30 mario03 postgres[21981]: [10-3] HINT:  In a moment you
should be able to reconnect to the database and repeat your command.


I attempted to restart the server and it attempted to start and appeared to
be processing wal files (as in the wal receiver counter kept incrementing.
But the server never left "the database system is starting up" state:

Dec  4 14:39:25 mario03 postgres[24794]: [1-1] LOG:  ending log output to
stderr
Dec  4 14:39:25 mario03 postgres[24794]: [1-2] HINT:  Future log output will
go to log destination "syslog".
Dec  4 14:39:25 mario03 postgres[24795]: [2-1] LOG:  database system was
interrupted while in recovery at log time 2015-12-04 13:02:36 PST
Dec  4 14:39:25 mario03 postgres[24795]: [2-2] HINT:  If this has occurred
more than once some data might be corrupted and you might need to choose an
earlier recovery target.
Dec  4 14:39:26 mario03 postgres[24795]: [3-1] LOG:  entering standby mode
Dec  4 14:39:26 mario03 postgres[24795]: [4-1] LOG:  redo starts at
219/A81D9EF8
Dec  4 14:39:27 mario03 postgres[24795]: [5-1] LOG:  consistent recovery
state reached at 219/A84A6FC0
Dec  4 14:39:27 mario03 postgres[24795]: [6-1] LOG:  invalid record length
at 219/A84E0FC8
Dec  4 14:39:27 mario03 postgres[24798]: [2-1] FATAL:  could not connect to
the primary server: FATAL:  the database system is starting up
Dec  4 14:39:27 mario03 postgres[24798]: [2-2] #011
Dec  4 14:39:35 mario03 postgres[24842]: [2-1] LOG:  connection received:
host=[local]
Dec  4 14:39:35 mario03 postgres[24842]: [3-1] FATAL:  the database system
is starting up
Dec  4 14:39:42 mario03 postgres[24888]: [2-1] FATAL:  could not connect to
the primary server: FATAL:  the database system is starting up
Dec  4 14:39:42 mario03 postgres[24888]: [2-2] #011

The contents of pg_multixact/members contains 34,533 files dating all the
way back to when we first turned the cluster on, October 26th. In fact the
file 79D4 does indeed exist:

-rw------- 1 postgres users 262144 Dec  4 04:05 79D4



Ultimately we had to rebuild the replicas with pg_basebackup. This same
exact configuration worked on Postgres 9.1 for literally years without ever
restarting or crashing so I'm a bit flummoxed about this.

pgsql-bugs by date:

Previous
From: peter@integrix.com.au
Date:
Subject: BUG #13794: pgAdmin Crash on launch
Next
From: Michael Paquier
Date:
Subject: Re: BUG #13794: pgAdmin Crash on launch