Trouble with replication - Mailing list pgsql-general

From David Greco
Subject Trouble with replication
Date
Msg-id 187F6C10D2931A4386EE8E58E13857F6303F96D3@BY2PRD0811MB415.namprd08.prod.outlook.com
Whole thread Raw
Responses Re: Trouble with replication  (John R Pierce <pierce@hogranch.com>)
Re: Trouble with replication  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general

I’ve setup two 9.2.4 servers to serve as master-slave in a streaming replication scenario. I started with a  fresh database on the master, setup the replication, then imported using pg_restore about 30GB of data. The master and slave are geographically separated, so replication of this amount of data can/should take hours.  I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, it eventually quit with the following errors in the log:

 

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000000000022 has already been removed

",,,,,,,,,""

 

Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command setup to ship the wal files to the slave, and the slave is setup with a recovery_command to read them from that directory. In fact, that WAL segment exists in the slave’s pg_xlog directory as well.

 

Now, from what I can tell, the master archived this wal file out of its xlog directory (based on the keep wal segments setting). Then, why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.

 

On Master:

wal_level  = hot_standby

archive_mode = on

archive_command = 'rsync -zaq %p postgres@pg-slave:/var/lib/pgsql/wal_restore/%f && test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp %p /var/lib/pgsql/backup/wal_archive/'

archive_timeout = 300

max_wal_senders = 5

wal_keep_segments = 0   # not sure why I’ve set it to this?

 

On Slave:

wal_level = hot_standby

archive_mode = on

archive_command = 'test ! -f /var/lib/pgsql/backup/wal_archive/%f && cp -i %p /var/lib/pgsql/backup/wal_archive/%f < /dev/null'

hot_standby = on

wal_keep_segments = 1

 

On slave – recovery.conf:

standby_mode = 'on'

primary_conninfo = 'host=pg-master port=5432 user=replicator'

restore_command = 'cp /var/lib/psql/wal_restore/%f %p'

archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_restore/ %r'

 

 

 

pgsql-general by date:

Previous
From: "Tambade, Kedar"
Date:
Subject: Re: [postgis-users] point_ops with GiST PostGIS Spatial Index
Next
From: John R Pierce
Date:
Subject: Re: Trouble with replication