Thread: Streaming Replication Error
Hello,

We were auditing our logs on one of our PG 9.0.6 standby servers that we use for nightly snapshotting. The high-level process is:

1. Stop PG
2. Snapshot
3. Start PG

Where "Snapshot" includes several steps to ensure data/filesystem integrity. The archive command on the master continues throughout this process, so the standby does have all of the log files. When we restart the cluster, we see the typical startup message about restoring files from the archive. However, we have noticed that occasionally the following occurs:

LOG: restored log file "00000001000044560000007F" from archive
LOG: restored log file "000000010000445600000080" from archive
cp: cannot stat `/ebs-raid0/archive/000000010000445600000081': No such file or directory
LOG: unexpected pageaddr 4454/74000000 in log file 17494, segment 129, offset 0
cp: cannot stat `/ebs-raid0/archive/000000010000445600000081': No such file or directory
LOG: streaming replication successfully connected to primary
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000445600000091 has already been removed
LOG: restored log file "000000010000445600000091" from archive
LOG: restored log file "000000010000445600000092" from archive
LOG: restored log file "000000010000445600000093" from archive
…
LOG: restored log file "000000010000445700000092" from archive
cp: cannot stat `/ebs-raid0/archive/000000010000445700000093': No such file or directory
LOG: streaming replication successfully connected to primary
------

The concerning bit here is that we receive the FATAL message "requested WAL segment 000000010000445600000091 has already been removed" after streaming replication connects successfully, which seems to trigger an additional sequence of log restores.

The questions we have are:

1. Is our data intact? PG eventually starts up, and it seems like once the streaming suffers the FATAL error, it falls back to performing log restores.
2. What triggers this error? Too much time between log recovery, streaming startup and a low wal_keep_segments value (currently 128)?

Thank you very much,
Andrew Hannon
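
[Editorial sketch] To compare the segment names in the log above against the wal_keep_segments value of 128, it can help to turn WAL file names into a linear segment count. The following is a minimal sketch, not part of the original message. It assumes the pre-9.3 file naming (8 hex digits each for timeline, log file, and segment) and the default 16 MB segment size, under which each log file holds 255 segments. The function names are hypothetical.

```python
# Assumption: PostgreSQL before 9.3 with 16 MB segments, where every
# "log file" contains 255 segments (segment 0xFF is never used).
SEGMENTS_PER_LOG = 255


def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_index(name):
    """Linear position of a segment, ignoring the timeline."""
    _, log, seg = parse_wal_name(name)
    return log * SEGMENTS_PER_LOG + seg


def segments_between(older, newer):
    """Number of segments from `older` up to `newer`."""
    return segment_index(newer) - segment_index(older)


# Segment the standby could not find in the archive vs. the one the
# primary reported as already removed.
print(segments_between("000000010000445600000081",
                       "000000010000445600000091"))  # 16

# First and last segments of the second round of archive restores.
print(segments_between("000000010000445600000091",
                       "000000010000445700000092"))  # 256
```

Parsing 000000010000445600000081 this way gives log file 17494 and segment 129, which matches the "log file 17494, segment 129" wording in the pageaddr message.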
On Mon, 2012-04-30 at 17:23 -0400, Andrew Hannon wrote:
> 1. Is our data intact? PG eventually starts up, and it seems like once
> the streaming suffers the FATAL error, it falls back to performing log
> restores.

I don't see anything alarming there. Postgres will not start up if it thinks it's really missing data.

I'd advise using an archive command that does not output anything unless it's something you really need to know. A log file missing from the archive is normal operation for recovery mode, so notices telling you that are just cluttering the log.

> 2. What triggers this error? Too much time between log recovery,
> streaming startup and a low wal_keep_segments value (currently 128)?

128 sounds like a high-enough number, so after it catches up fully, it should be plenty. It looks like, while trying to catch up, it falls within the 128 segments and begins streaming, and then momentarily falls back out and needs to restore from the archive.

Unless you have steady-state replication lag, it should catch up fully and then just be able to use streaming all the time. Do you see it resume streaming later on in the logfile?

Disclaimer: I'm not 100% confident in my response, so please take it with a grain of salt, but I hope it is helpful anyway.

Regards,
Jeff Davis
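
[Editorial sketch] The `cp: cannot stat` lines in the log come from the command the standby runs to fetch files from the archive (restore_command in recovery.conf on 9.0). One way to follow the advice about keeping that command quiet is a small wrapper that says nothing when a file is simply absent and only signals it through the exit status. The script below is a minimal sketch under that assumption; its name, its location, and the `restore_segment` function are hypothetical and not from the thread.

```python
#!/usr/bin/env python3
import os
import shutil
import sys


def restore_segment(archive_dir, file_name, dest_path):
    """Copy one archived file to dest_path.

    Returns 0 on success and 1 when the file is not in the archive.
    A missing file prints nothing, since recovery routinely asks for
    files that have not been archived yet.
    """
    src = os.path.join(archive_dir, file_name)
    if not os.path.isfile(src):
        return 1
    # Real I/O failures still raise and show up on stderr.
    shutil.copyfile(src, dest_path)
    return 0


# Acts as a command only when called with exactly three arguments:
# archive directory, file name (%f) and destination path (%p).
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(restore_segment(sys.argv[1], sys.argv[2], sys.argv[3]))
```

Assuming the script is saved at a hypothetical path such as /usr/local/bin/quiet_restore.py, the recovery.conf entry might look like:

restore_command = 'python3 /usr/local/bin/quiet_restore.py /ebs-raid0/archive %f %p'

The nonzero exit status is what tells PostgreSQL the file is unavailable, so the fall-through to streaming replication should behave as before, just without the extra log noise.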