Thread: Failing replication

Failing replication

From
Scott Whitney
Date:
So, I have 3 servers:

1) Production.
2) "Production" training server. Replicates #1. At same data center. 2 postmasters on separate ports.
3) Internal "production" server. 3 postmasters on separate ports. Replicates #2 and #1 (both from #2). 

This has happened to me several times in the past few weeks.

From  #3:

LOG: entering standby mode
LOG: redo starts at 31B/FDFFC0D8
WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 4580) was terminated by signal 6: Aborted
LOG: terminating any other active server processes


===

here is me restarting...adding this since there are no timestamps in this log...

===


LOG: database system was interrupted while in recovery at log time 2014-06-02 17:31:05 CDT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 31B/FDFFC0D8
WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 4838) was terminated by signal 6: Aborted
LOG: terminating any other active server processes



There is no activity in #2's pg_log postgresql-Mon.log file consummate to this event.


I have wal segments keeping 1,000. I have pg_xlog files dated yesterday.


This happened earlier today as well. It happened twice last week.


The "warning page" stuff is different from the pg_controldata stuff earlier today.


Here is pg_controldata from server 3 wrt server 1 replication.


pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5868266298097360799
Database cluster state: in archive recovery
pg_control last modified: Mon Jun 2 18:42:18 2014
Latest checkpoint location: 31B/FE08DB50
Prior checkpoint location: 31B/FE08DB50
Latest checkpoint's REDO location: 31B/FDFFC0D8
Latest checkpoint's TimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/441709396
Latest checkpoint's NextOID: 33432042
Latest checkpoint's NextMultiXactId: 28077
Latest checkpoint's NextMultiOffset: 57507
Latest checkpoint's oldestXID: 241800232
Latest checkpoint's oldestXID's DB: 16573
Latest checkpoint's oldestActiveXID: 441709396
Time of latest checkpoint: Mon Jun 2 17:31:05 2014
Minimum recovery ending location: 31C/4E1C18
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 4000
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value


I have to redo the base backup and bring it back up again, so the rest of the stuff is about to be rsycnhed awak, but it has now failed twice today. It might well, again.


What more should I provide to try to find a solution?


Server 2 -> 1 replication has no issues.

Server 3 -> 2 (replicating server 1) is the one that's failing.

65Mbps line between them.


PG 9.28 across the board on Centos 6.5 64bit all across.




Re: Failing replication

From
Scott Whitney
Date:
rsync has completed. Cascading (#3 -> #1 via #2) is up and running again.

Log as of now:

LOG: consistent recovery state reached at 31C/1C979FD8
LOG: database system is ready to accept read only connections
LOG: record with zero length at 31C/1CCDF680
LOG: streaming replication successfully connected to primary

=== me....it's working 7:29 pm
=== scott again...restarting pg -> db01 at 6/2/14 7:33 just to see what happens.

LOG: database system was shut down in recovery at 2014-06-02 19:34:07 CDT

LOG: entering standby mode
LOG: redo starts at 31C/1F443578
LOG: consistent recovery state reached at 31C/20278980
LOG: record with incorrect prev-link 319/60278768 at 31C/20278980
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary




So, I have 3 servers:

1) Production.
2) "Production" training server. Replicates #1. At same data center. 2 postmasters on separate ports.
3) Internal "production" server. 3 postmasters on separate ports. Replicates #2 and #1 (both from #2). 

This has happened to me several times in the past few weeks.

From  #3:

LOG: entering standby mode
LOG: redo starts at 31B/FDFFC0D8
WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 4580) was terminated by signal 6: Aborted
LOG: terminating any other active server processes


===

here is me restarting...adding this since there are no timestamps in this log...

===


LOG: database system was interrupted while in recovery at log time 2014-06-02 17:31:05 CDT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 31B/FDFFC0D8
WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 4838) was terminated by signal 6: Aborted
LOG: terminating any other active server processes



There is no activity in #2's pg_log postgresql-Mon.log file consummate to this event.


I have wal segments keeping 1,000. I have pg_xlog files dated yesterday.


This happened earlier today as well. It happened twice last week.


The "warning page" stuff is different from the pg_controldata stuff earlier today.


Here is pg_controldata from server 3 wrt server 1 replication.


pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5868266298097360799
Database cluster state: in archive recovery
pg_control last modified: Mon Jun 2 18:42:18 2014
Latest checkpoint location: 31B/FE08DB50
Prior checkpoint location: 31B/FE08DB50
Latest checkpoint's REDO location: 31B/FDFFC0D8
Latest checkpoint's TimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/441709396
Latest checkpoint's NextOID: 33432042
Latest checkpoint's NextMultiXactId: 28077
Latest checkpoint's NextMultiOffset: 57507
Latest checkpoint's oldestXID: 241800232
Latest checkpoint's oldestXID's DB: 16573
Latest checkpoint's oldestActiveXID: 441709396
Time of latest checkpoint: Mon Jun 2 17:31:05 2014
Minimum recovery ending location: 31C/4E1C18
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 4000
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value


I have to redo the base backup and bring it back up again, so the rest of the stuff is about to be rsycnhed awak, but it has now failed twice today. It might well, again.


What more should I provide to try to find a solution?


Server 2 -> 1 replication has no issues.

Server 3 -> 2 (replicating server 1) is the one that's failing.

65Mbps line between them.


PG 9.28 across the board on Centos 6.5 64bit all across.