Thread: Wal files - Question | Postgres 9.2

Wal files - Question | Postgres 9.2

From
Patrick B
Date:
Hi guys,

I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves. 


I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered

However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory
LOG:  consistent recovery state reached at 1AF8/AB629F90
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary


still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
  0.386863


On master01:
select * from pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 1AF8/D3F47A80



QUESTION:

So.. I just wanna understand what's the risk of those errors... what's happening? 
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?

Cheers
Patrick

Re: Wal files - Question | Postgres 9.2

From
Venkata B Nagothi
Date:


On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
Hi guys,

I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves. 


I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered

However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory
LOG:  consistent recovery state reached at 1AF8/AB629F90
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary


still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
  0.386863


On master01:
select * from pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 1AF8/D3F47A80



QUESTION:

So.. I just wanna understand what's the risk of those errors... what's happening? 
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?

That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.

Regards,

Venkata B N
Database Consultant

Fujitsu Australia

Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:


On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
Hi guys,

I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves. 


I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered

However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory
LOG:  consistent recovery state reached at 1AF8/AB629F90
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary


still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
  0.386863


On master01:
select * from pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 1AF8/D3F47A80



QUESTION:

So.. I just wanna understand what's the risk of those errors... what's happening? 
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?

That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date? 
Otherwise what's the purpose of the wal_files if not be responsible to contain the essential data to the DB be up-to-date?

Thanks!
 

Re: Wal files - Question | Postgres 9.2

From
Venkata B Nagothi
Date:
 
On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com> wrote:


2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:


On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
Hi guys,

I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves. 


I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered

However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory
LOG:  consistent recovery state reached at 1AF8/AB629F90
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary


still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
  0.386863


On master01:
select * from pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 1AF8/D3F47A80



QUESTION:

So.. I just wanna understand what's the risk of those errors... what's happening? 
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?

That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date?

Not necessarily. Standby periodically checks if the WAL file it is looking for is available at restore_command location and generates that message if the file is not available. These messages are not of any harm.

Below link might help you :


Regards,
Venkata B N
Database Consultant

Fujitsu Australia

Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-11-23 16:18 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:
 
On Wed, Nov 23, 2016 at 1:59 PM, Patrick B <patrickbakerbr@gmail.com> wrote:


2016-11-23 15:55 GMT+13:00 Venkata B Nagothi <nag1010@gmail.com>:


On Wed, Nov 23, 2016 at 1:03 PM, Patrick B <patrickbakerbr@gmail.com> wrote:
Hi guys,

I currently have a slave02 server that is replicating from another slave01 via Cascading replication. The master01 server is shipping wal_files (via ssh) to both slaves. 


I'm doing some tests on slave02 to test the recovery via wal_files... The goal here is to stop postgres, wait few minutes, start postgres again, watch it recovering from wal_files, once it's done see the streaming replication start working again.

1 - Stop postgres on slave02(streaming replication + wal_files)
2 - Wait for 5 minutes
3 - Start postgres - The goal here is to tail the logs to see if the wal_files are being successfully recovered

However, when doing step3 I get these messages:
cp: cannot stat '/walfiles/0000000200001AF8000000A4': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory
cp: cannot stat '/walfiles/0000000200001AF8000000A6': No such file or directory
LOG:  consistent recovery state reached at 1AF8/AB629F90
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary


still on slave01: Sometimes the log_delay time is bigger.. sometimes is lower
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
  0.386863


On master01:
select * from pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 1AF8/D3F47A80



QUESTION:

So.. I just wanna understand what's the risk of those errors... what's happening? 
cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory - Means it didn't find the file. However, the file exists on the Master, but it didn't start shipping yet. What are the consequences of that?

That is just saying that the slave cannot find the WAL file. That should not be of big importance. Eventually, that will vanish when the log file gets shipped from the master. Also "cp: cannot stat." errors have been been fixed in 9.3 i believe.

Hi Venkata !

Yeah that's fine.. the streaming replication is already working fine.

But, as it didn't find/recover some of the wal_files, doesn't that mean that the DB isn't up-to-date?

Not necessarily. Standby periodically checks if the WAL file it is looking for is available at restore_command location and generates that message if the file is not available. These messages are not of any harm.

Below link might help you :




hmm.. I really don't get it.


 
If I get messages like: 

cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or directory

In my head, it's saying that it was unable to recover that file and, because of that, there is missing data.
Even if the server is able to connect to tbe master via streaming replication, there might be missing data. There might be data that is into master but not into slave.

Am I wrong? If so, I don't understand why are the wal_files for, then.


Patrick

Re: Wal files - Question | Postgres 9.2

From
Michael Paquier
Date:
On Fri, Nov 25, 2016 at 10:05 AM, Patrick B <patrickbakerbr@gmail.com> wrote:
> If I get messages like:
>
> cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
> directory
>
> In my head, it's saying that it was unable to recover that file and, because
> of that, there is missing data.
> Even if the server is able to connect to tbe master via streaming
> replication, there might be missing data. There might be data that is into
> master but not into slave.
>
> Am I wrong? If so, I don't understand why are the wal_files for, then.

In order, a startup process will look for a WAL segments in its local
pg_xlog, then using restore_command, and finally via streaming
(XLogPageRead in xlog.c). If streaming fails, because for example the
segment requested has already been recycled, then startup process
switches back to a lookup in archive/pg_xlog. If you happen to have
repetitive failures with restore_command, you are missing a segment
and the standby would not be able to replay further. If the segment is
indeed missing, a new base backup is the way to go. A standby without
the WAL segments to replay is useless except if you plan to use PITR
and restore to a position before the last segment available in its
timeline history.
--
Michael


Re: Wal files - Question | Postgres 9.2

From
David Steele
Date:
On 11/24/16 8:05 PM, Patrick B wrote:

> hmm.. I really don't get it.
>
>
>
> If I get messages like:
>
> *cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
> director*y
>
> In my head, it's saying that it was unable to recover that file and,
> because of that, there is missing data.
> Even if the server is able to connect to tbe master via streaming
> replication, there might be missing data. There might be data that is
> into master but not into slave.
>
> Am I wrong? If so, I don't understand why are the wal_files for, then.

This is normal as far as it goes.  Postgres keeps checking for the WAL
file that it needs to become consistent and cp keeps throwing errors
when it can't find the file (yet).

This simply demonstrates how unsuitable cp is as a recovery command.  I
recommend you look at doing backup/archive with a professional tool such
as pgBackRest (http://www.pgbackrest.org) or Barman
(http://www.pgbarman.org/).

--
-David
david@pgmasters.net


Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-11-29 15:21 GMT+13:00 David Steele <david@pgmasters.net>:
On 11/24/16 8:05 PM, Patrick B wrote:

> hmm.. I really don't get it.
>
>
>
> If I get messages like:
>
> *cp: cannot stat '/walfiles/0000000200001AF8000000A5': No such file or
> director*y
>
> In my head, it's saying that it was unable to recover that file and,
> because of that, there is missing data.
> Even if the server is able to connect to tbe master via streaming
> replication, there might be missing data. There might be data that is
> into master but not into slave.
>
> Am I wrong? If so, I don't understand why are the wal_files for, then.

This is normal as far as it goes.  Postgres keeps checking for the WAL
file that it needs to become consistent and cp keeps throwing errors
when it can't find the file (yet).

This simply demonstrates how unsuitable cp is as a recovery command.  I
recommend you look at doing backup/archive with a professional tool such
as pgBackRest (http://www.pgbackrest.org) or Barman
(http://www.pgbarman.org/).


Thanks for all the replies guys.

Related to the "cp" - That's fine. I'm just doing this in a test environment. On my prod servers I use wal-e(aws s3) and bash script(locally) to do this work.

I've done some tests, that only gave me more questions. I'll put everything I done here and hopefully you guys will be able to help me to understand it.

1. Set up a pg-test-01 machine on Ubuntu 16 Server
2. installed Postgres 9.2 + contribs + libs  + etc
3. Configured postgres to be a streaming replication slave (postgresql.conf and recovery.conf)
4. Configured my masterdb01 server to ship wal_files to the pg-test-01 server
5. Made a pg_basebackup on my master DB server (DB is 1TB big)
6. Recovered the basebackup on the pg-test-01 machine
7. Postgres started to recover the wal_files and then streaming replication worked:

LOG:  streaming replication successfully connected to primary


8. Stopped postgres on the pg-test-01 server for 10 minutes
9. Started postgres on the pg-test-01 to see if the wal_files would be recovered

LOG:  restored log file "0000000200001B2F0000004F" from archive
LOG:  restored log file "0000000200001B2F00000050" from archive
LOG:  restored log file "0000000200001B2F00000051" from archive
LOG:  restored log file "0000000200001B2F00000052" from archive

10. Removed all the wal_files from pg-test-01 (No wal_files are being shipped to this server anymore)
11. Stopped postgres for 2 hours
12. Started postgres as streaming replication

cp: cannot stat 'archive/0000000200002B36000000BB': No such file or directory
cp: cannot stat 'archive/0000000200002B36000000BC': No such file or directory
cp: cannot stat 'archive/0000000200002B36000000BD': No such file or directory
LOG:  streaming replication successfully connected to primary



Postgres was stopped for 2h. How can postgres connect using streaming replication, if no wal_files were in there???

Hos is that even possible?? I don't understand!

Thanks
Patrick

Re: Wal files - Question | Postgres 9.2

From
"David G. Johnston"
Date:
On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

Ho
​[w]
 is that even possible?? I don't understand!


"""​
 
If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up.
​"""

Basically you did just that when you destroyed the archive.  Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.

David J.

Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:
On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

Ho
​[w]
 is that even possible?? I don't understand!


"""​
 
If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up.
​"""

Basically you did just that when you destroyed the archive.  Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.

David J.



That was really helpful! Thanks David!

Patrick 

Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-11-29 23:59 GMT+13:00 Patrick B <patrickbakerbr@gmail.com>:


2016-11-29 16:36 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:
On Mon, Nov 28, 2016 at 8:22 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

Ho
​[w]
 is that even possible?? I don't understand!


"""​
 
If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up.
​"""

Basically you did just that when you destroyed the archive.  Apparently the master doesn't churn through WAL quickly enough to have had to discard the segments from the prior two hours.

David J.



That was really helpful! Thanks David!

Patrick 





Hey guys,


wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?

Re: Wal files - Question | Postgres 9.2

From
"David G. Johnston"
Date:
On Wed, Nov 30, 2016 at 6:05 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?

[...] ​without using wal_files[sic] [from an external archive location]

David J.

Re: Wal files - Question | Postgres 9.2

From
Patrick B
Date:


2016-12-01 14:15 GMT+13:00 David G. Johnston <david.g.johnston@gmail.com>:
On Wed, Nov 30, 2016 at 6:05 PM, Patrick B <patrickbakerbr@gmail.com> wrote:

wal_keep_segments is the parameter responsible for streaming replication be able to recover itself without using wal_files, is that right?

[...] ​without using wal_files[sic] [from an external archive location]

David J.



Yep David! Thanks for correcting me.

Got it now. Thanks everyone.

one more question...  I'm working with wal_keep_segments = 1000 - Would that in your opinion be a safe number?

Yes. I got plenty of free space and I'm using pg_xlogs in a different partition on the master.

Patrick