Thread: Replication fell out of sync

Replication fell out of sync

From

David Kerr

Date:

02 March 2015, 23:25:16

Howdy,

I had an instance where a replica fell out of sync with the master.

Now it's in in a state where it's unable to catch up because the master has already removed the WAL segment.

(logs)
Mar  2 23:10:13 db13 postgres[11099]: [3-1] user=,db=,host= LOG:  streaming replication successfully connected to
primary
Mar  2 23:10:13 db13 postgres[11099]: [4-1] user=,db=,host= FATAL:  could not receive data from WAL stream: FATAL:
requestedWAL segment 000000060000047C0000001F has already been removed 


I was under the impression that when you setup streaming replication if you specify a restore command like :
restore_command= 'cp /arch/%f %p' 

Then even if the slave falls out of sync, and the master removes the WAL segment, as long as you can still retrieve the
WALfiles, then it can bring itself back into sync. 


But that doesn't seem to be happening.

The restore_command is working
# Slave's $PGDATA/pg_xlog/
-rw------- 1 postgres postgres 16777216 Mar  2 21:29 000000060000047C0000001F
-rwx------ 1 postgres postgres 16777216 Mar  2 23:13 RECOVERYXLOG

I'm on PG 9.2.7, which i know is old, but I'm upgrading shortly.

recovery.conf:
standby_mode      = 'on'
primary_conninfo  = 'host=pgmaster port=5432'
restore_command   = 'cp /arch/%f %p'

relevant info from postgresql.conf:
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode = on
hot_standby = on
hot_standby_feedback = true


I know that to avoid this entirely I need to set wal_keep_segments higher, although in this particular case it wouldn't
havemattered because a rogue program slammed the database and basically 32/64/128 WAL segments went by in a short span
oftime. 

However, I really thought that as long as PG could get the archived logs i'd be able to recover.

Was I wrong with that assertion or did i just run into a bug?

Thanks

Re: Replication fell out of sync

From

"Joshua D. Drake"

Date:

02 March 2015, 23:33:27

On 03/02/2015 03:25 PM, David Kerr wrote:
>
> Howdy,
>
> I had an instance where a replica fell out of sync with the master.
>
> Now it's in in a state where it's unable to catch up because the master has already removed the WAL segment.
>
> (logs)
> Mar  2 23:10:13 db13 postgres[11099]: [3-1] user=,db=,host= LOG:  streaming replication successfully connected to
primary
> Mar  2 23:10:13 db13 postgres[11099]: [4-1] user=,db=,host= FATAL:  could not receive data from WAL stream: FATAL:
requestedWAL segment 000000060000047C0000001F has already been removed 
>
>
> I was under the impression that when you setup streaming replication if you specify a restore command like :
restore_command= 'cp /arch/%f %p' 
>
> Then even if the slave falls out of sync, and the master removes the WAL segment, as long as you can still retrieve
theWAL files, then it can bring itself back into sync. 

If the archive command is also set so that the restore command has a
file to retrieve, then yes it will work that way.


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc

Now I get it: your service is designed for a customer
base that grew up with Facebook, watches Japanese seizure
robot anime, and has the attention span of a gnat.
I'm not that user., "Tyler Riddle"

Re: Replication fell out of sync

From

David Kerr

Date:

02 March 2015, 23:39:36

On Mon, Mar 02, 2015 at 03:33:22PM PDT, Joshua D. Drake wrote:
>
> On 03/02/2015 03:25 PM, David Kerr wrote:
> >
> >Howdy,
> >
> >I had an instance where a replica fell out of sync with the master.
> >
> >Now it's in in a state where it's unable to catch up because the master has already removed the WAL segment.
> >
> >(logs)
> >Mar  2 23:10:13 db13 postgres[11099]: [3-1] user=,db=,host= LOG:  streaming replication successfully connected to
primary
> >Mar  2 23:10:13 db13 postgres[11099]: [4-1] user=,db=,host= FATAL:  could not receive data from WAL stream: FATAL:
requestedWAL segment 000000060000047C0000001F has already been removed 
> >
> >
> >I was under the impression that when you setup streaming replication if you specify a restore command like :
restore_command= 'cp /arch/%f %p' 
> >
> >Then even if the slave falls out of sync, and the master removes the WAL segment, as long as you can still retrieve
theWAL files, then it can bring itself back into sync. 
>
> If the archive command is also set so that the restore command has a
> file to retrieve, then yes it will work that way.

Yeah it is, it's actually pulling the file down.

Glad that's how it's supposed to work. I'd rather be unlucky then crazy.  =)

Re: Replication fell out of sync

From

Adrian Klaver

Date:

03 March 2015, 00:06:10

On 03/02/2015 03:25 PM, David Kerr wrote:
> Howdy,
>
> I had an instance where a replica fell out of sync with the master.
>
> Now it's in in a state where it's unable to catch up because the master has already removed the WAL segment.
>
> (logs)
> Mar  2 23:10:13 db13 postgres[11099]: [3-1] user=,db=,host= LOG:  streaming replication successfully connected to
primary
> Mar  2 23:10:13 db13 postgres[11099]: [4-1] user=,db=,host= FATAL:  could not receive data from WAL stream: FATAL:
requestedWAL segment 000000060000047C0000001F has already been removed 
>
>
> I was under the impression that when you setup streaming replication if you specify a restore command like :
restore_command= 'cp /arch/%f %p' 
>
> Then even if the slave falls out of sync, and the master removes the WAL segment, as long as you can still retrieve
theWAL files, then it can bring itself back into sync. 
>
>
> But that doesn't seem to be happening.
>
> The restore_command is working
> # Slave's $PGDATA/pg_xlog/
> -rw------- 1 postgres postgres 16777216 Mar  2 21:29 000000060000047C0000001F
> -rwx------ 1 postgres postgres 16777216 Mar  2 23:13 RECOVERYXLOG

Trying to figure out why the error occurred at Mar  2 23:10:13 and the
file shows a time stamp of Mar  2 21:29, especially since you say the
WAL segments flew past?

Are there any other WAL files in the slave ~/pg_xlog?

>
> I'm on PG 9.2.7, which i know is old, but I'm upgrading shortly.
>
> recovery.conf:
> standby_mode      = 'on'
> primary_conninfo  = 'host=pgmaster port=5432'
> restore_command   = 'cp /arch/%f %p'
>
> relevant info from postgresql.conf:
> wal_level = hot_standby
> max_wal_senders = 5
> wal_keep_segments = 32
> archive_mode = on
> hot_standby = on
> hot_standby_feedback = true
>
>
> I know that to avoid this entirely I need to set wal_keep_segments higher, although in this particular case it
wouldn'thave mattered because a rogue program slammed the database and basically 32/64/128 WAL segments went by in a
shortspan of time. 
>
> However, I really thought that as long as PG could get the archived logs i'd be able to recover.
>
> Was I wrong with that assertion or did i just run into a bug?
>
> Thanks
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Replication fell out of sync

From

David Kerr

Date:

03 March 2015, 00:11:48

On Mon, Mar 02, 2015 at 04:06:02PM PDT, Adrian Klaver wrote:
> On 03/02/2015 03:25 PM, David Kerr wrote:
> >Howdy,
> >
> >I had an instance where a replica fell out of sync with the master.
> >
> >Now it's in in a state where it's unable to catch up because the master has already removed the WAL segment.
> >
> >(logs)
> >Mar  2 23:10:13 db13 postgres[11099]: [3-1] user=,db=,host= LOG:  streaming replication successfully connected to
primary
> >Mar  2 23:10:13 db13 postgres[11099]: [4-1] user=,db=,host= FATAL:  could not receive data from WAL stream: FATAL:
requestedWAL segment 000000060000047C0000001F has already been removed 
> >
> >
> >I was under the impression that when you setup streaming replication if you specify a restore command like :
restore_command= 'cp /arch/%f %p' 
> >
> >Then even if the slave falls out of sync, and the master removes the WAL segment, as long as you can still retrieve
theWAL files, then it can bring itself back into sync. 
> >
> >
> >But that doesn't seem to be happening.
> >
> >The restore_command is working
> ># Slave's $PGDATA/pg_xlog/
> >-rw------- 1 postgres postgres 16777216 Mar  2 21:29 000000060000047C0000001F
> >-rwx------ 1 postgres postgres 16777216 Mar  2 23:13 RECOVERYXLOG
>
> Trying to figure out why the error occurred at Mar  2 23:10:13 and
> the file shows a time stamp of Mar  2 21:29, especially since you
> say the WAL segments flew past?
>
> Are there any other WAL files in the slave ~/pg_xlog?

Turns out just that file had gotten corrupt on copy. When i re-pulled it from source life was good.

(phew!)

Thanks all for looking.