Thread: Streaming Replication Error

Streaming Replication Error

From: Andrew Hannon

Hello,

We were auditing our logs on one of our PG 9.0.6 standby servers that we use for nightly snapshotting. The high-level
process is:

1. Stop PG
2. Snapshot
3. Start PG

Where "Snapshot" includes several steps to ensure data/filesystem integrity. The archive command on the master
continues throughout this process, so the standby does have all of the log files. When we restart the cluster, we see
the typical startup message about restoring files from the archive. However, we have noticed that occasionally the
following occurs:

LOG:  restored log file "00000001000044560000007F" from archive
LOG:  restored log file "000000010000445600000080" from archive
cp: cannot stat `/ebs-raid0/archive/000000010000445600000081': No such file or directory
LOG:  unexpected pageaddr 4454/74000000 in log file 17494, segment 129, offset 0
cp: cannot stat `/ebs-raid0/archive/000000010000445600000081': No such file or directory
LOG:  streaming replication successfully connected to primary
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000445600000091 has already been removed

LOG:  restored log file "000000010000445600000091" from archive
LOG:  restored log file "000000010000445600000092" from archive
LOG:  restored log file "000000010000445600000093" from archive
…
LOG:  restored log file "000000010000445700000092" from archive
cp: cannot stat `/ebs-raid0/archive/000000010000445700000093': No such file or directory
LOG:  streaming replication successfully connected to primary

------

The concerning bit here is that we receive the FATAL message "requested WAL segment 000000010000445600000091 has
already been removed" after streaming replication connects successfully, which seems to trigger an additional sequence
of log restores.
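
For readers matching the file names against the "log file 17494, segment 129" message: a WAL segment file name is 24 hexadecimal digits, read as three 8-digit fields (timeline, log id, segment). The sketch below decodes a name from the excerpt. The function name parse_wal_segment is made up for illustration and is not part of PostgreSQL.

```python
def parse_wal_segment(name):
    """Split a 24-hex-digit WAL segment file name into its three fields.

    Hypothetical helper for reading logs; not a PostgreSQL API.
    """
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)   # first 8 digits: timeline id
    log_id = int(name[8:16], 16)    # middle 8 digits: log file id
    segment = int(name[16:24], 16)  # last 8 digits: segment within the log
    return timeline, log_id, segment


# The file that cp could not find in the archive:
print(parse_wal_segment("000000010000445600000081"))  # (1, 17494, 129)
```

So the "log file 17494, segment 129" in the "unexpected pageaddr" line refers to the same segment, 000000010000445600000081, that the cp command had just failed to find.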

The questions we have are:

1. Is our data intact? PG eventually starts up, and it seems like once the streaming suffers the FATAL error, it falls
back to performing log restores.
2. What triggers this error? Too much time between log recovery, streaming startup and a low wal_keep_segments value
(currently 128)?

Thank you very much,

Andrew Hannon

Re: Streaming Replication Error

From: Jeff Davis

On Mon, 2012-04-30 at 17:23 -0400, Andrew Hannon wrote:

> 1. Is our data intact? PG eventually starts up, and it seems like once
> the streaming suffers the FATAL error, it falls back to performing log
> restores.

I don't see anything alarming there. Postgres will not start up if it
thinks it's really missing data.

I'd advise using a restore command (restore_command on the standby) that
does not output anything unless it's something you really need to know. A
log file missing from the archive is normal operation for recovery mode,
so notices telling you that are just cluttering the log.
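
The "cp: cannot stat" lines in the excerpt are printed by the command the standby runs to fetch files from the archive (restore_command in recovery.conf). One way to quiet them, sketched here under assumptions, is a small wrapper that reports "file not there" only through its exit status and writes to stderr only for genuine failures. The function name restore_segment and the exit codes 1 and 2 are illustrative choices, not anything PostgreSQL prescribes beyond "non-zero means the file was not restored".

```python
import os
import shutil
import sys


def restore_segment(archive_dir, file_name, dest_path):
    """Copy one file out of the archive and return a shell-style exit status.

    Hypothetical helper: 0 = restored, 1 = not in the archive (silent),
    2 = a real error (reported on stderr).
    """
    src = os.path.join(archive_dir, file_name)
    if not os.path.isfile(src):
        # Reaching the end of the archive is normal during recovery,
        # so say nothing and let the exit status carry the information.
        return 1
    try:
        shutil.copyfile(src, dest_path)
    except OSError as exc:
        # Something worth knowing about, e.g. a permissions or disk problem.
        sys.stderr.write("restore of %s failed: %s\n" % (file_name, exc))
        return 2
    return 0
```

A wrapper script around this would end with something like sys.exit(restore_segment("/ebs-raid0/archive", sys.argv[1], sys.argv[2])) and be referenced from restore_command with the %f and %p placeholders as its two arguments. The script name and location are up to you.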

> 2. What triggers this error? Too much time between log recovery,
> streaming startup and a low wal_keep_segments value (currently 128)?

128 sounds like a high-enough number, so after it catches up fully, it
should be plenty.

It looks like, while trying to catch up, it falls within the 128
segments and begins streaming, and then momentarily falls back out and
needs to restore from the archive.
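
To make the "falls back out of the 128 segments" idea concrete, here is a rough sketch that measures how far behind a requested segment is and compares that to wal_keep_segments. It rests on assumptions: 255 segments per log id (my understanding of the layout in releases of this era), and the primary's current position, which the log excerpt does not show. The example call uses the last archived segment from the excerpt purely as a stand-in. The function names are hypothetical.

```python
SEGMENTS_PER_LOG = 255  # assumption: segments 00..FE per log id in this era


def segment_number(name, segments_per_log=SEGMENTS_PER_LOG):
    """Map a 24-hex-digit WAL file name to a running segment count."""
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return log_id * segments_per_log + segment


def segments_behind(standby_needs, primary_current):
    """How many segments the standby's request trails the primary by."""
    return segment_number(primary_current) - segment_number(standby_needs)


def within_keep_window(standby_needs, primary_current, wal_keep_segments=128):
    """True if the requested segment is inside the guaranteed retention window.

    The primary may keep more than wal_keep_segments, so False only means
    the segment is no longer guaranteed to be there.
    """
    return segments_behind(standby_needs, primary_current) <= wal_keep_segments


# Stand-in values: the segment from the FATAL message, and the last segment
# restored from the archive as a lower bound on where the primary had got to.
needed = "000000010000445600000091"
primary = "000000010000445700000092"
print(segments_behind(needed, primary))      # 256
print(within_keep_window(needed, primary))   # False
```

Under those assumptions the request is well outside a 128-segment window, which would fit a standby that is still catching up and has to go back to the archive.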

Unless you have steady-state replication lag, it should catch up fully
and then just be able to use streaming all the time. Do you see it
resume streaming later on in the logfile?

Disclaimer: I'm not 100% confident in my response, so please take it
with a grain of salt, but I hope it is helpful anyway.

Regards,
    Jeff Davis