Thread: 9.0 Streaming Replication Problem to two slaves
I have a master server and two slave servers, one in the same rack and one in another data center that has a normal latency of about 15ms.

Both master and slaves are running CentOS 5.6 x86_64 with postgresql90-server-9.0.4-1PGDG.rhel5.x86_64 from http://yum.pgrpms.org

The master server is using:

    wal_level = hot_standby
    checkpoint_segments = 64
    max_wal_senders = 10
    wal_keep_segments = 512

(512 segments is good for 8-12 hours of WAL, way more than required, but I have been trying to debug this.)

The slaves are using:

    hot_standby = on
    max_standby_streaming_delay = 60s

I have the servers configured and get the replication up and running. It runs for the better part of a day, and then the slaves appear to stop receiving or requesting updates. There doesn't appear to be anything in the logs other than:

    Jul 23 09:32:47 backupdb postgres[23010]: [2-1] FATAL: terminating connection due to conflict with recovery
    Jul 23 09:32:47 backupdb postgres[23010]: [2-2] DETAIL: User query might have needed to see row versions that must be removed.
    Jul 23 09:32:47 backupdb postgres[23010]: [2-3] HINT: In a moment you should be able to reconnect to the database and repeat your command.

I don't have any idea what might be causing the problem. I was considering that disk access on the slaves might not be fast enough: when other backup files are being copied to those servers, the competition for disk access may slow the disks enough to make recovery falter.

I am monitoring the master and slaves for synchronization using a script which selects some data from each server and compares the result.
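Another way to watch for a slave falling behind is to compare WAL positions instead of query results. The following is only a minimal sketch under stated assumptions, not the monitoring script mentioned above. It assumes the 9.0 functions pg_current_xlog_location() (on the master) and pg_last_xlog_replay_location() (on a slave), and the pre-9.3 WAL layout in which each log id spans 0xFF000000 bytes. The helper names and the host names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Convert a 9.0-style xlog location such as "2F/A1000000" to a byte position.
# Assumption: pre-9.3 layout, each log id covers 0xFF000000 bytes
# (255 segments of 16 MB).
xlog_to_bytes() {
    logid=${1%%/*}     # hex digits before the slash
    offset=${1#*/}     # hex digits after the slash
    echo $(( 0x$logid * 0xFF000000 + 0x$offset ))
}

# Print how many bytes the second location trails the first.
xlog_lag_bytes() {
    echo $(( $(xlog_to_bytes "$1") - $(xlog_to_bytes "$2") ))
}

# Usage (hypothetical host names):
#   master=$(psql -h master.example.com -Atc "SELECT pg_current_xlog_location()")
#   slave=$(psql -h server1.example.com -Atc "SELECT pg_last_xlog_replay_location()")
#   xlog_lag_bytes "$master" "$slave"
```

A replay lag that keeps growing while the receive position stays close to the master would point at replay speed on the slave rather than at the network.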
To construct the slaves I am using the following script:

    SERVERS="server1.example.com server2.example.com"

    if [ `whoami` == 'postgres' ]
    then
        psql -d postgres -c "checkpoint; select pg_switch_xlog();"
        psql -c "SELECT pg_start_backup('backup', true)";
    else
        su - postgres -c "psql -c \"checkpoint; select pg_switch_xlog();\";"
        su - postgres -c "psql -c \"SELECT pg_start_backup('backup', true)\";"
    fi

    for server in $SERVERS
    do
        ssh root@$server /etc/init.d/postgresql-9.0 stop
        rsync -zav --delete /var/lib/pgsql/9.0/data/ root@$server:/var/lib/pgsql/9.0/data/ --exclude postmaster.pid --exclude recovery.conf --exclude postgresql.conf --exclude pg_hba.conf
    done

    if [ `whoami` == 'postgres' ]
    then
        # the statement_timeout kills the command after 60 seconds
        # this is a hack, but otherwise it hangs indefinitely
        psql -c "SET statement_timeout = 60000; SELECT pg_stop_backup()"
    else
        su - postgres -c "psql -c \"SELECT pg_stop_backup()\""
    fi

    for server in $SERVERS
    do
        rsync -zav --delete /var/lib/pgsql/9.0/data/pg_xlog/ root@$server:/var/lib/pgsql/9.0/data/pg_xlog/
        ssh root@$server /etc/init.d/postgresql-9.0 start
    done
On 07/25/2011 11:38 AM, Michael Best wrote:
> I have the servers configured, and get the replication up and running,
> and then it will run for the better part of a day, and then the slaves
> appear to stop receiving or requesting updates, there doesn't appear to
> be anything in the logs other than

One of the problems I was experiencing was that the disk on one of my recovery databases was filling up. I solved this by using pg_archivecleanup.

The real cause appears to be that overnight my database produces something on the order of 1200 to 2500 WAL archives. These are being transmitted correctly to the replication databases, but the replicas are having trouble replaying the logs fast enough to ever get caught up.

archive_timeout is not set, but I believe the default is 0.

Is it likely that the disks are too slow on the replication servers, or is something else happening, such as the restoration of logs being considerably slower than on the primary?

-Mike
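To put the overnight figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the default WAL segment size of 16 MB (a server built with a different segment size would change the numbers), and the helper name is made up for illustration.

```shell
#!/bin/sh
# Assumption: default WAL segment size of 16 MB.
SEGMENT_MIB=16

# Convert a number of WAL segments to MiB.
wal_segments_to_mib() {
    echo $(( $1 * $SEGMENT_MIB ))
}

echo "overnight, low estimate:  $(wal_segments_to_mib 1200) MiB"
echo "overnight, high estimate: $(wal_segments_to_mib 2500) MiB"
echo "kept by wal_keep_segments=512: $(wal_segments_to_mib 512) MiB"
```

Under that assumption the overnight load is roughly 19 to 39 GiB of WAL that each slave has to replay, while wal_keep_segments = 512 keeps about 8 GiB on the master.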
On Tue, Aug 2, 2011 at 9:24 AM, Michael Best <mbest@pendragon.org> wrote:
> On 07/25/2011 11:38 AM, Michael Best wrote:
>>
>> I have the servers configured, and get the replication up and running,
>> and then it will run for the better part of a day, and then the slaves
>> appear to stop receiving or requesting updates, there doesn't appear to
>> be anything in the logs other than
> Is this likely that the disks are too slow on the replication servers, or is
> something else happening, such as the restoration of logs is considerably
> slower than on the primary?

Could be. Are the drives on the slaves much slower? I'd imagine a slave with the same drive setup would be able to keep up.
On Tue, 2011-08-02 at 11:55 -0600, Scott Marlowe wrote:
> > Is this likely that the disks are too slow on the replication
> > servers, or is something else happening, such as the restoration of
> > logs is considerably slower than on the primary?
>
> Could be. Are the drives on the slaves much slower? I'd imagine a
> slave with the same drive setup would be able to keep up.

We have a customer who generates 150+ xlogs per minute under daily load, and our first HS installation failed just because:

* Network was slow (10 Mbit), so it could not keep up with the WAL files.
* Disks were also slow on the slave.

So yeah, that could be it.

Regards,
--
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
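For a sense of scale on the network point, here is a rough sketch of the sustained bandwidth a given WAL rate needs. It assumes the default 16 MB segment size and ignores any compression or protocol overhead. The function name is hypothetical.

```shell
#!/bin/sh
# Approximate sustained bandwidth, in Mbit/s (10^6 bits), needed to ship
# WAL at a given rate. Assumption: 16 MB segments, no compression.
wal_mbit_per_sec() {
    # segments per minute * bytes per segment * 8 bits, divided by 60 seconds
    echo $(( $1 * 16 * 1024 * 1024 * 8 / 60 / 1000000 ))
}

wal_mbit_per_sec 150
```

With 150 segments per minute this comes to about 335 Mbit/s, so a 10 Mbit link is short by more than an order of magnitude even before disk speed on the slave comes into it.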