Thread: Wal files - Question | Postgres 9.2
Hi guys,
I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves.
I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.
1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered
However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directorycp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directorycp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directoryLOG: consistent recovery state reached at 1AF8/AB629F90LOG: database system is ready to accept read only connectionsLOG: streaming replication successfully connected to primary
still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;log_delay-----------0.386863
On master01:
select * from pg_current_xlog_location();pg_current_xlog_location--------------------------1AF8/D3F47A80
QUESTION:
So.. I just wanna understand what's the risk of those errors... what's happening?
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?
Cheers
Patrick
On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
Hi guys,I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves.I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.1 - Stop postgres on slave02(streaming replication + wal_files)2 - Wait for 5 minutes3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recoveredHowever, when doing step3 I get these messages:cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory LOG: consistent recovery state reached at 1AF8/AB629F90LOG: database system is ready to accept read only connectionsLOG: streaming replication successfully connected to primarystill on slave01: Sometimes the log_delay time is bigger.. sometimes is lowerSELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp( )) END AS log_delay; log_delay-----------0.386863On master01:select * from pg_current_xlog_location();pg_current_xlog_location--------------------------1AF8/D3F47A80QUESTION:So.. I just wanna understand what's the risk of those errors... what's happening?cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?
That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.
Regards,
Venkata B N
Database Consultant
Fujitsu Australia
2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:
On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:Hi guys,I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves.I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.1 - Stop postgres on slave02(streaming replication + wal_files)2 - Wait for 5 minutes3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recoveredHowever, when doing step3 I get these messages:cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory LOG: consistent recovery state reached at 1AF8/AB629F90LOG: database system is ready to accept read only connectionsLOG: streaming replication successfully connected to primarystill on slave01: Sometimes the log_delay time is bigger.. sometimes is lowerSELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp( )) END AS log_delay; log_delay-----------0.386863On master01:select * from pg_current_xlog_location();pg_current_xlog_location--------------------------1AF8/D3F47A80QUESTION:So.. I just wanna understand what's the risk of those errors... what's happening?cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that? That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.
Hi Venkata !
Yeah that's fine.. the streaming replication is already working fine.
But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date?
Otherwise what's the purpose of the wal_files if not be responsible to contain the essential data to the DB be up-to-date?
Thanks!
On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:Hi guys,I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves.I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.1 - Stop postgres on slave02(streaming replication + wal_files)2 - Wait for 5 minutes3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recoveredHowever, when doing step3 I get these messages:cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory LOG: consistent recovery state reached at 1AF8/AB629F90LOG: database system is ready to accept read only connectionsLOG: streaming replication successfully connected to primarystill on slave01: Sometimes the log_delay time is bigger.. sometimes is lowerSELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp( )) END AS log_delay; log_delay-----------0.386863On master01:select * from pg_current_xlog_location();pg_current_xlog_location--------------------------1AF8/D3F47A80QUESTION:So.. I just wanna understand what's the risk of those errors... what's happening?cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that? That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.Hi Venkata !Yeah that's fine.. the streaming replication is already working fine.But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date?
Not necessarily. Standby periodically checks if the WAL file it is looking for is available at restore_command location and generates that message if the file is not available. These messages are not of any harm.
Below link might help you :
Regards,
Venkata B N
Database Consultant
Fujitsu Australia
2016-11-23 16:18 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:
If I get messages like: On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com> wrote:2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:Hi guys,I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves.I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.1 - Stop postgres on slave02(streaming replication + wal_files)2 - Wait for 5 minutes3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recoveredHowever, when doing step3 I get these messages:cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory LOG: consistent recovery state reached at 1AF8/AB629F90LOG: database system is ready to accept read only connectionsLOG: streaming replication successfully connected to primarystill on slave01: Sometimes the log_delay time is bigger.. sometimes is lowerSELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp( )) END AS log_delay; log_delay-----------0.386863On master01:select * from pg_current_xlog_location();pg_current_xlog_location--------------------------1AF8/D3F47A80QUESTION:So.. I just wanna understand what's the risk of those errors... what's happening?cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that? That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.Hi Venkata !Yeah that's fine.. the streaming replication is already working fine.But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date?Not necessarily. Standby periodically checks if the WAL file it is looking for is available at restore_command location and generates that message if the file is not available. These messages are not of any harm.Below link might help you :
hmm.. I really don't get it.
cp: cannot stat '/walfiles/0000000200001AF8000 000A5': No such file or directory
In my head, it's saying that it was unable to recover that file and, because of that, there is missing data.
Even if the server is able to connect to tbe master via streaming replication, there might be missing data. There might be data that is into master but not into slave.
Am I wrong? If so, I don't understand why are the wal_files for, then.
Patrick
On Fri, Nov 25, 2016 at 10:05 AM, Patrick B <patrickbakerbr@gmail.com> wrote: > If I get messages like: > > cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or > directory > > In my head, it's saying that it was unable to recover that file and, because > of that, there is missing data. > Even if the server is able to connect to tbe master via streaming > replication, there might be missing data. There might be data that is into > master but not into slave. > > Am I wrong? If so, I don't understand why are the wal_files for, then. In order, a startup process will look for a WAL segments in its local pg_xlog, then using restore_command, and finally via streaming (XLogPageRead in xlog.c). If streaming fails, because for example the segment requested has already been recycled, then startup process switches back to a lookup in archive/pg_xlog. If you happen to have repetitive failures with restore_command, you are missing a segment and the standby would not be able to replay further. If the segment is indeed missing, a new base backup is the way to go. A standby without the WAL segments to replay is useless except if you plan to use PITR and restore to a position before the last segment available in its timeline history. -- Michael
On 11/24/16 8:05 PM, Patrick B wrote: > hmm.. I really don't get it. > > > > If I get messages like: > > *cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or > director*y > > In my head, it's saying that it was unable to recover that file and, > because of that, there is missing data. > Even if the server is able to connect to tbe master via streaming > replication, there might be missing data. There might be data that is > into master but not into slave. > > Am I wrong? If so, I don't understand why are the wal_files for, then. This is normal as far as it goes. Postgres keeps checking for the WAL file that it needs to become consistent and cp keeps throwing errors when it can't find the file (yet). This simply demonstrates how unsuitable cp is as a recovery command. I recommend you look at doing backup/archive with a professional tool such as pgBackRest (http://www.pgbackrest.org) or Barman (http://www.pgbarman.org/). -- -David david@pgmasters.net
2016-11-29 15:21 GMT+13:00 David Steele <david@pgmasters.net>:
7. Postgres started to recover the wal_files and then streaming replication worked:On 11/24/16 8:05 PM, Patrick B wrote:
> hmm.. I really don't get it.
>
>
>
> If I get messages like:
>
> *cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
> director*y
>
> In my head, it's saying that it was unable to recover that file and,
> because of that, there is missing data.
> Even if the server is able to connect to tbe master via streaming
> replication, there might be missing data. There might be data that is
> into master but not into slave.
>
> Am I wrong? If so, I don't understand why are the wal_files for, then.
This is normal as far as it goes. Postgres keeps checking for the WAL
file that it needs to become consistent and cp keeps throwing errors
when it can't find the file (yet).
This simply demonstrates how unsuitable cp is as a recovery command. I
recommend you look at doing backup/archive with a professional tool such
as pgBackRest (http://www.pgbackrest.org) or Barman
(http://www.pgbarman.org/).
Thanks for all the replies guys.
Related to the "cp" - That's fine. I'm just doing this in a test environment. On my prod servers I use wal-e(aws s3) and bash script(locally) to do this work.
I've done some tests, that only gave me more questions. I'll put everything I done here and hopefully you guys will be able to help me to understand it.
1. Set up a pg-test-01 machine on Ubuntu 16 Server
2. installed Postgres 9.2 + contribs + libs + etc
3. Configured postgres to be a streaming replication slave (postgresql.conf and recovery.conf)
4. Configured my masterdb01 server to ship wal_files to the pg-test-01 server
5. Made a pg_basebackup on my master DB server (DB is 1TB big)
6. Recovered the basebackup on the pg-test-01 machine
LOG: streaming replication successfully connected to primary
8. Stopped postgres on the pg-test-01 server for 10 minutes
9. Started postgres on the pg-test-01 to see if the wal_files would be recovered
LOG: restored log file "0000000200001B2F0000004F" from archiveLOG: restored log file "0000000200001B2F00000050" from archiveLOG: restored log file "0000000200001B2F00000051" from archiveLOG: restored log file "0000000200001B2F00000052" from archive
10. Removed all the wal_files from pg-test-01 (No wal_files are being shipped to this server anymore)
11. Stopped postgres for 2 hours
12. Started postgres as streaming replication
cp: cannot stat 'archive/0000000200002B36000000BB': No such file or directory
cp: cannot stat 'archive/0000000200002B36000000BC': No such file or directory
cp: cannot stat 'archive/0000000200002B36000000BD': No such file or directory
LOG: streaming replication successfully connected to primary
Postgres was stopped for 2h. How can postgres connect using streaming replication, if no wal_files were in there???
Hos is that even possible?? I don't understand!
Thanks
Patrick
Ho[w]is that even possible?? I don't understand!
"""
If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up.
"""
Basically you did just that when you destroyed the archive. Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.
David J.
2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:
Ho[w]is that even possible?? I don't understand!"""If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up."""Basically you did just that when you destroyed the archive. Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.David J.
That was really helpful! Thanks David!
Patrick
2016-11-29 23:59 GMT+13:00 Patrick B <patrickbakerbr@gmail.com>:
2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:Ho[w]is that even possible?? I don't understand!"""If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up."""Basically you did just that when you destroyed the archive. Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.David J.That was really helpful! Thanks David!Patrick
Hey guys,
wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?
wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?
[...] without using wal_files[sic] [from an external archive location]
David J.
2016-12-01 14:15 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:
wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?[...] without using wal_files[sic] [from an external archive location]David J.
Yep David! Thanks for correcting me.
Got it now. Thanks everyone.
one more question... I'm working with wal_keep_segments = 1000 - Would that in your opinion be a safe number?
Yes. I got plenty of free space and I'm using pg_xlogs in a different partition on the master.
Patrick