Thread: wal exist in slave but getting err requested WAL segment has already been removed
wal exist in slave but getting err requested WAL segment has already been removed
From
Mariel Cherkassky
Date:
Hi,
My cluster has 3 nodes (1 master and 2 slaves, all version 9.6.3). I configured repmgr v4.0.4 (with repmgrd active).
Suddenly today, after a few good weeks, I noticed lag on one of the slaves, and the error in its log indicated that the slave didn't get the WAL:
could not receive data from WAL stream: ERROR: requested WAL segment 0000000900002E61000000BD has already been removed
However, when I check whether the WAL was received:
postgres=# select pg_is_in_recovery(),pg_is_xlog_replay_paused(),pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
 pg_is_in_recovery | pg_is_xlog_replay_paused | pg_last_xlog_receive_location | pg_last_xlog_replay_location
-------------------+--------------------------+-------------------------------+------------------------------
 t                 | f                        | 2E61/BDF5C000                 | 2E61/BDF5B930
(1 row)
and I checked the pg_xlog directory:
ls -l ../pg_xlog/0000000900002E61000000BD
-rw------- 1 postgres postgres 16777216 Jul 11 11:13 ../pg_xlog/0000000900002E61000000BD
and the xlog file exists.
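As a cross-check of the output above, here is a minimal Python sketch (not part of the original post) that maps a timeline ID and an LSN to the name of the WAL segment file containing it. It assumes the default 16 MB segment size, which matches the 16777216-byte file listed above; the helper name wal_segment_name is hypothetical. For timeline 9 and the receive location 2E61/BDF5C000 it yields the same file name as in the error message, which suggests the standby had received data only part of the way into that segment.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def wal_segment_name(timeline: int, lsn: str) -> str:
    """Return the 24-hex-digit WAL segment file name containing `lsn`."""
    high_hex, low_hex = lsn.split("/")
    # An LSN is a 64-bit byte position printed as two 32-bit hex halves.
    position = (int(high_hex, 16) << 32) | int(low_hex, 16)
    segment_number = position // WAL_SEGMENT_SIZE
    segments_per_log_id = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log_id,
        segment_number % segments_per_log_id,
    )


print(wal_segment_name(9, "2E61/BDF5C000"))  # 0000000900002E61000000BD
```

The replay location 2E61/BDF5B930 maps to the same segment, so both positions lie inside the file shown by ls.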
Now my question is: why wasn't the WAL replayed?
In my repmgr.conf I don't have any parameters regarding recovery, just some basic things. The recovery.conf file in the data directory:
standby_mode = 'on'
primary_conninfo = 'host=xxxxxxx user=repmgr application_name=''psgsqldb2'' connect_timeout=2'
recovery_target_timeline = 'latest'
Any idea?
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Achilleas Mantzios
Date:
On 11/07/2018 16:09, Mariel Cherkassky wrote:
> ls -l ../pg_xlog/0000000900002E61000000BD
> -rw------- 1 postgres postgres 16777216 Jul 11 11:13 ../pg_xlog/0000000900002E61000000BD
In which node did you check for the file?
If the file on the primary is still available, try to compare their md5sum.
If you have a working WAL shipping method in place, then add the appropriate line to the recovery.conf of your standby:
restore_command = 'rsync somemachine:/somepath/pitr/%f "%p" '
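The checksum comparison suggested in this reply could be sketched in Python as follows. This is only an illustration, not something posted in the thread: the function names are hypothetical, and both copies of the segment are assumed to be readable from one machine. The optional limit argument hashes only the first bytes of each file, which may help when the standby's copy is still being streamed and its tail is not expected to match the primary's.

```python
import hashlib
from typing import Optional


def md5_of_file(path: str, limit: Optional[int] = None) -> str:
    """Return the hex MD5 digest of a file, or of its first `limit` bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = handle.read(size)
            if not chunk:
                break  # end of file reached before the limit
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str, limit: Optional[int] = None) -> bool:
    """True when both files (or their first `limit` bytes) hash identically."""
    return md5_of_file(path_a, limit) == md5_of_file(path_b, limit)
```

For example, calling same_content on the two copies with limit=0xF5C000 would compare only the part of the segment up to the receive location 2E61/BDF5C000 reported by the standby.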
-- Achilleas Mantzios IT DEV Lead IT DEPT Dynacom Tankers Mgmt
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Mariel Cherkassky
Date:
The wal is available on the standby, not on the primary. It is already in the pg_xlog directory of the slave...
2018-07-11 16:26 GMT+03:00 Achilleas Mantzios <achill@matrix.gatewaynet.com>:
> In which node did you check for the file?
> If the file in the primary is still available, try to compare their md5sum .
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Achilleas Mantzios
Date:
On 11/07/2018 16:32, Mariel Cherkassky wrote:
> The wal is available on the standby, not on the primary. It is already in the pg_xlog directory of the slave...
OK, but apparently the file is not complete. Can you see its contents with pg_waldump (or pg_xlogdump)?
Do you have any backup mechanism in place? Any WAL shipping / archiving mechanism?
-- Achilleas Mantzios IT DEV Lead IT DEPT Dynacom Tankers Mgmt
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Mariel Cherkassky
Date:
Yes, I can see its content. However, at the end of the output I get the following message:
pg_xlogdump: FATAL: error in WAL record at 2E61/BDF59950: invalid magic number 0000 in log segment 0000000000002E61000000BD, offset 16105472
Maybe this is the reason behind it?
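A quick arithmetic check in Python (an editorial sketch, assuming the default 16 MB segment size and 8 kB WAL page size; the helper name offset_to_lsn is hypothetical): converting the byte offset from the pg_xlogdump message back to an LSN gives 2E61/BDF5C000, which is the pg_last_xlog_receive_location reported earlier in the thread. In other words, the invalid page header sits exactly where the received data ends, which would be consistent with a segment that is only partially streamed rather than one damaged in the middle.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size
WAL_PAGE_SIZE = 8192                 # assumption: default WAL page size


def offset_to_lsn(segment_name: str, offset: int) -> str:
    """Convert a byte offset inside a WAL segment file to an LSN string."""
    log_id = int(segment_name[8:16], 16)    # middle 8 hex digits
    segment = int(segment_name[16:24], 16)  # last 8 hex digits
    low = segment * WAL_SEGMENT_SIZE + offset
    return "%X/%X" % (log_id, low)


lsn = offset_to_lsn("0000000000002E61000000BD", 16105472)
print(lsn)                       # 2E61/BDF5C000
print(16105472 % WAL_PAGE_SIZE)  # 0, so the offset is a page boundary
```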
2018-07-11 16:39 GMT+03:00 Achilleas Mantzios <achill@matrix.gatewaynet.com>:
> Ok but apparently this is not complete. Can you see its contents with pg_waldump (or pg_xlogdump) ?
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Kenneth Marshall
Date:
On Wed, Jul 11, 2018 at 04:44:24PM +0300, Mariel Cherkassky wrote:
> Yes i can see its content. However in the end of its content I'm getting
> the next msg :
> pg_xlogdump: FATAL: error in WAL record at 2E61/BDF59950: invalid magic
> number 0000 in log segment 0000000000002E61000000BD, offset 16105472
> Maybe this is the reason behind it ?
>
Hi Mariel,
I do not know if this applies to your case, but 9.6.9 has this
in the release notes:
Fix a corner case where a streaming standby gets stuck at a WAL continuation record (Kyotaro Horiguchi)
Regards,
Ken
Re: wal exist in slave but getting err requested WAL segment has already been removed
From
Mariel Cherkassky
Date:
How can I get more info regarding this bug? I would like to be sure that I hit a real bug.
2018-07-11 16:50 GMT+03:00 Kenneth Marshall <ktm@rice.edu>:
> I do not know if this applies to your case, but 9.6.9 has this
> in the release notes:
> Fix a corner case where a streaming standby gets stuck at a WAL continuation record (Kyotaro Horiguchi)