Thread: pg_standby replication problem
Please help me with this: my secondary server shows a replication problem. It stopped at the file called 0000000500004BAF000000AF, and from there the primary server kept on sending WAL files until they used up the disc space in the data directory. How do I fix this problem? It's PostgreSQL 9.1.2.
The Postgres log file Postgres-2014-06-08_000000.log has the following details:
2014-06-08 00:15:54 SAST LOG: restored log file "0000000500004BAF000000AF" from archive
Trigger file: /tmp/recovery.pgsql.trigger.5432
Waiting for WAL file: 0000000500004BAF000000B0
WAL file path: /pgsql2/walfiles/0000000500004BAF000000B0
Restoring to: pg_xlog/RECOVERYXLOG
Sleep interval: 2 seconds
Max wait interval: 0 forever
Command for restore: cp "/pgsql2/walfiles/0000000500004BAF000000B0" "pg_xlog/RECOVERYXLOG"
Keep archive history: 0000000500004BAE000000F7 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
CONFIDENTIALITY NOTICE The contents of and attachments to this e-mail are intended for the addressee only, and may contain the confidential information of Argility (Proprietary) Limited and/or its subsidiaries. Any review, use or dissemination thereof by anyone other than the intended addressee is prohibited.If you are not the intended addressee please notify the writer immediately and destroy the e-mail. Argility (Proprietary) Limited and its subsidiaries distance themselves from and accept no liability for unauthorised use of their e-mail facilities or e-mails sent other than strictly for business purposes.
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 57G 15G 39G 28% /
/dev/mapper/vg0-pgsql2
5.4T 5.3T 0 100% /pgsql2
/dev/sda1 99M 12M 83M 13% /boot
tmpfs 30G 0 30G 0% /dev/shm
Disc space Breakdown:
4.0K ./backup
12K ./copy
4.9T ./data
204K ./test
16K ./lost+found
361G ./walfiles
5.3T .
The big question we can't answer is this: when the replication reached this point (Command for restore: cp "/pgsql2/walfiles/0000000500004BAF000000B0" "pg_xlog/RECOVERYXLOG"), it started to say "WAL file not present yet". We can't find this 0000000500004BAF000000B0 file anywhere.
Command for restore: cp "/pgsql2/walfiles/0000000500004BAF000000B0" "pg_xlog/RECOVERYXLOG"
Keep archive history: 0000000500004BAE000000F7 and later
WAL file not present yet. Checking for trigger file...
-rw------- 1 postgres postgres 16M Jun 7 21:42 0000000500004BAE000000F7
Please please help
On 06/09/2014 07:28 AM, Khangelani Gama wrote:
> Please please help

Before anyone can help you will need to provide more information on what your archiving, replication setup is. To begin:

1) Are you doing both archiving and streaming replication?
2) What are the settings in the configuration files for those operations?
3) What is the layout for archiving, in other words do the archived files get copied remotely to a third site or some other arrangement?
4) What caused the trigger file to be set?

--
Adrian Klaver
adrian.klaver@aklaver.com
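One way to collect the answer to question 2 is to print only the active (uncommented) archiving and replication settings from postgresql.conf. The following is a minimal sketch, not something posted in the thread: the function name show_wal_settings is made up for illustration, and the list of parameters is limited to the ones that appear later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the uncommented WAL/archiving settings
# from the postgresql.conf file given as the first argument.
# Lines starting with '#' are skipped because the pattern is anchored
# at the start of the line and only allows leading whitespace.
show_wal_settings() {
    conf_file=$1
    grep -E '^[[:space:]]*(wal_level|archive_mode|archive_command|max_wal_senders)[[:space:]]*=' "$conf_file"
}

# Example call (the path is an assumption, adjust to the real data directory):
#   show_wal_settings /pgsql2/data/postgresql.conf
```

On a standby that uses pg_standby, the restore_command in recovery.conf would be the matching setting to report from the secondary side.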
On Monday, June 09, 2014 04:28:53 PM Khangelani Gama wrote:
> Please help me with this, my secondary server shows a replication problem.
> It stopped at the file called 0000000500004BAF000000AF …then from here
> primary server kept on sending walfiles, until the walfiles used up the
> disc space in the data directory. How do I fix this problem. It’s postgres
> 9.1.2.

It looks to me like your archive_command is probably failing on the primary server. If that fails, the logs will build up and fill up your disk as described. And they wouldn't be available to the slave to find.
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Alan Hodgson
Sent: Monday, June 09, 2014 4:51 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] pg_standby replication problem

> It looks to me like your archive_command is probably failing on the primary server. If that fails, the logs will build up and fill up your disk as described. And they wouldn't be available to the slave to find.

I am sorry, I am still trying to understand all the settings; the person who set up the servers left the company. On the primary server, postgresql.conf shows the following:

# WRITE AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
wal_level = archive
# - Checkpoints -
checkpoint_segments = 128
checkpoint_timeout = 15min
checkpoint_warning = 885s
# - Archiving -
archive_mode = on
#archive_mode = off # allows archiving to be done
archive_command = '/home/cdbs/bin/run_replication.sh %p %f'

# REPLICATION
#------------------------------------------------------------------------------
# - Master Server -
# These settings are ignored on a standby server
max_wal_senders = 3

The archive_command setting points to a script, which is run with the variables %p and %f passed to it.
The replication script running on the primary server contains the following:

while [ $test = "false" ]
do
    rsync -a /pgsql2/data/${src} postgres@10.58.101.10:/pgsql2/walfiles/${dest} >> /tmp/run_replication.sh.out 2>> /tmp/run_replication.sh.out
    test=`ssh AB_CDS3 "if [ -f /pgsql2/walfiles/${dest} ];then echo 'true' ;else echo 'false';fi"`
    if [ ${test} = "false" ]
    then
        echo "Test is false for CDS3, sleeping 10" >> /tmp/run_replication.sh.out
        sleep 10
        cnt=$(( $cnt + 1 ))
        if [ ${cnt} -ge 60 ]
        then
            message="Replication ERROR: Unable to send WAL file(${desc}) from CDS to CDS3"
            echo "`date` : ${message}" >> /tmp/run_replication.sh.out
            sendsms
        fi
    fi
done
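PostgreSQL treats a zero exit status from archive_command as "segment safely archived" and keeps retrying the same segment when the command returns non-zero, so the script must only return zero when the file really arrived. The fragment above does not show what exit status run_replication.sh returns, so the following is only a sketch of the general idea, not a replacement for that script. The function name archive_segment is hypothetical, and a local cp into a directory stands in for the rsync-over-ssh transfer used in the thread.

```shell
#!/bin/sh
# Hypothetical sketch: archive one WAL segment and report failure to the caller.
# Usage: archive_segment SRC_PATH DEST_DIR DEST_NAME
# Assumption: DEST_DIR is a locally reachable directory; in the thread the
# transfer is done with rsync over ssh instead of cp.
archive_segment() {
    src=$1
    dest_dir=$2
    dest_name=$3

    # Never overwrite a segment that is already in the archive.
    if [ -e "$dest_dir/$dest_name" ]; then
        echo "archive_segment: $dest_name already exists" >&2
        return 1
    fi

    # Copy under a temporary name so a reader never picks up a partial file.
    if ! cp "$src" "$dest_dir/$dest_name.tmp"; then
        echo "archive_segment: copy of $dest_name failed" >&2
        rm -f "$dest_dir/$dest_name.tmp"
        return 1
    fi

    # Verify the copy byte for byte before publishing it under its real name.
    if ! cmp -s "$src" "$dest_dir/$dest_name.tmp"; then
        echo "archive_segment: verification of $dest_name failed" >&2
        rm -f "$dest_dir/$dest_name.tmp"
        return 1
    fi

    mv "$dest_dir/$dest_name.tmp" "$dest_dir/$dest_name"
}
```

A wrapper used as archive_command would call this with "$1" (the %p path) and "$2" (the %f file name) and exit with the function's status, so that a failed transfer is retried rather than skipped.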
I just got this from the primary server (/tmp/run_replication.sh.out); the secondary server's IP is 10.58.101.10.

replication started: Sun Jun 8 00:05:26 SAST 2014
source: pg_xlog/0000000500004BAF000000AF, dest: 0000000500004BAF000000AF
replication finished: Sun Jun 8 00:05:33 SAST 2014
replication started: Sun Jun 8 00:05:33 SAST 2014
source: pg_xlog/0000000500004BAF000000B0, dest: 0000000500004BAF000000B0
ssh: connect to host 10.58.101.10 port 22: Connection timed out^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
replication finished: Sun Jun 8 00:07:41 SAST 2014
replication started: Sun Jun 8 00:07:41 SAST 2014
source: pg_xlog/0000000500004BAF000000B1, dest: 0000000500004BAF000000B1
replication finished: Sun Jun 8 00:07:53 SAST 2014
replication started: Sun Jun 8 00:07:53 SAST 2014
source: pg_xlog/0000000500004BAF000000B2, dest: 0000000500004BAF000000B2
replication finished: Sun Jun 8 00:07:57 SAST 2014
replication started: Sun Jun 8 00:07:58 SAST 2014
source: pg_xlog/0000000500004BAF000000B3, dest: 0000000500004BAF000000B3
replication finished: Sun Jun 8 00:08:06 SAST 2014
replication started: Sun Jun 8 00:08:06 SAST 2014
source: pg_xlog/0000000500004BAF000000B4, dest: 0000000500004BAF000000B4
replication finished: Sun Jun 8 00:08:11 SAST 2014
replication started: Sun Jun 8 00:08:11 SAST 2014
source: pg_xlog/0000000500004BAF000000B5, dest: 0000000500004BAF000000B5
replication finished: Sun Jun 8 00:08:16 SAST 2014
replication started: Sun Jun 8 00:08:16 SAST 2014
source: pg_xlog/0000000500004BAF000000B6, dest: 0000000500004BAF000000B6
replication finished: Sun Jun 8 00:08:22 SAST 2014
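The output above shows the transfer of 0000000500004BAF000000B0 failing while the segments after it went through, which matches the standby waiting forever for that one file. A gap like this can be confirmed by checking a range of segment names against the archive directory. The following is a minimal sketch under stated assumptions: the function name missing_segments is made up, and it only walks segment numbers inside a single log id (the first 16 hex digits stay fixed), so it does not handle the roll-over from one log id to the next.

```shell
#!/bin/sh
# Hypothetical helper: print the WAL segment names in a range that are
# absent from a directory.
# Usage: missing_segments DIR PREFIX FIRST LAST
#   PREFIX  first 16 hex digits of the file name (timeline + log id)
#   FIRST   first segment number to check, in hex (e.g. AF)
#   LAST    last segment number to check, in hex (e.g. B6)
missing_segments() {
    dir=$1
    prefix=$2
    seg=$(( 0x$3 ))
    last=$(( 0x$4 ))
    while [ "$seg" -le "$last" ]; do
        # Segment part is 8 upper-case hex digits, zero padded.
        name=$(printf '%s%08X' "$prefix" "$seg")
        [ -f "$dir/$name" ] || echo "$name"
        seg=$(( seg + 1 ))
    done
}

# Example call, using the directory and names from this thread:
#   missing_segments /pgsql2/walfiles 0000000500004BAF AF B6
```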
-----Original Message-----
From: Khangelani Gama [mailto:kgama@argility.com]
Sent: Monday, June 09, 2014 5:26 PM
To: 'Alan Hodgson'; 'pgsql-general@postgresql.org'
Subject: RE: [GENERAL] pg_standby replication problem

> I just saw got this from the primary server (/tmp/run_replication.sh.out), secondary server's IP 10.58.101.10.

Since there was a connection timeout problem on the primary server, how can I make disc space on the secondary server so that replication can continue from where it stopped? Do I remove WAL files from the secondary server?

Disc space Breakdown:
4.0K ./backup
12K ./copy
4.9T ./data
204K ./test
16K ./lost+found
361G ./walfiles
5.3T .
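Before removing anything, it may help to see which archived segments are older than the point pg_standby reported in its "Keep archive history: 0000000500004BAE000000F7 and later" line. The following sketch only lists candidates and deletes nothing; whether those files are actually safe to remove is not established anywhere in this thread. The function name list_segments_before is hypothetical, and the plain name comparison assumes all files belong to the same timeline, as they do in the output shown here.

```shell
#!/bin/sh
# Hypothetical helper: list WAL segment files in DIR whose names sort
# before KEEP_NAME. Nothing is deleted.
# Usage: list_segments_before DIR KEEP_NAME
list_segments_before() {
    dir=$1
    keep=$2
    for path in "$dir"/*; do
        name=${path##*/}
        # Skip anything that is not a 24-character upper-case hex name.
        case $name in
            *[!0-9A-F]*) continue ;;
        esac
        [ "${#name}" -eq 24 ] || continue
        # The "x" prefix forces expr to compare strings, not integers.
        if LC_ALL=C expr "x$name" \< "x$keep" >/dev/null; then
            echo "$name"
        fi
    done
}

# Example call, using the directory and keep point from this thread:
#   list_segments_before /pgsql2/walfiles 0000000500004BAE000000F7
```

PostgreSQL also ships a contrib tool, pg_archivecleanup, that is intended for pruning an archive directory up to a given segment.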