Thread: 2nd PostgreSQL server in WAL shipping cluster fails to start

2nd PostgreSQL server in WAL shipping cluster fails to start

From
Samuel VISCAPI
Date:

Dear all,

 

Some years ago I set up a two PostgreSQL 13 nodes cluster on Debian 11. This cluster uses the Write-Ahead Log Shipping method. I’ve just been told the second server (in standby mode) is down and refuses to start again with the following error message (roughly translated from French) :

 

2025-01-09 09:46:35.742 CET [3147382] FATAL:  could not receive data from WAL stream: ERROR:  segment requested from transaction journal, 000000010000000D000000D6, has already been removed

2025-01-09 09:46:40.745 CET [3147395] LOG:  journal flow started from primary at D/D6000000 on timeline 1

2025-01-09 09:46:40.745 CET [3147395] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from transaction journal, 000000010000000D000000D6, has already been removed

2025-01-09 09:46:45.749 CET [3147397] LOG: journal flow started from primary at D/D6000000 on timeline 1

2025-01-09 09:46:45.749 CET [3147397] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from transaction journal, 000000010000000D000000D6, has already been removed

2025-01-09 09:46:50.753 CET [3147424] LOG:  journal flow started from primary at D/D6000000 on timeline 1

2025-01-09 09:46:50.753 CET [3147424] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from transaction journal, 000000010000000D000000D6, has already been removed

 

Would you have any idea how to fix this issue ?

 

Best regards,

 

Samuel

Re: 2nd PostgreSQL server in WAL shipping cluster fails to start

From
Alvaro Herrera
Date:
On 2025-Jan-09, Samuel VISCAPI wrote:

> Dear all,
> 
> Some years ago I set up a two PostgreSQL 13 nodes cluster on Debian 11. This cluster uses the Write-Ahead Log
Shippingmethod. I've just been told the second server (in standby mode) is down and refuses to start again with the
followingerror message (roughly translated from French) :
 
> 
> 2025-01-09 09:46:35.742 CET [3147382] FATAL:  could not receive data from WAL stream: ERROR:  segment requested from
transactionjournal, 000000010000000D000000D6, has already been removed
 
> 2025-01-09 09:46:40.745 CET [3147395] LOG:  journal flow started from primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:40.745 CET [3147395] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from
transactionjournal, 000000010000000D000000D6, has already been removed
 
> 2025-01-09 09:46:45.749 CET [3147397] LOG: journal flow started from primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:45.749 CET [3147397] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from
transactionjournal, 000000010000000D000000D6, has already been removed
 
> 2025-01-09 09:46:50.753 CET [3147424] LOG:  journal flow started from primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:50.753 CET [3147424] FATAL:  could not receive data from WAL stream : ERROR:  segment requested from
transactionjournal, 000000010000000D000000D6, has already been removed
 

This means the standby is requesting a segment that was already removed.
You may be able to find those files in a WAL archive, if you have
archive_command set in the primary.  If you do, then it would work to
copy those to the standby's pg_wal/ subdirectory.  If you don't have
them, then the replica must be rebuilt.


Note that it's not a good idea to translate the error messages when
posting -- that is, it's better to post exactly what the log file has.
For helpfulness you could post a translation for the lines separately.
But the developers can find the translated messages in the source code,
if they need them ... but if you translate them yourself, there's no way
to know exactly which ones they are.  Sometimes subtle differences are
important cues.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



RE: 2nd PostgreSQL server in WAL shipping cluster fails to start

From
Samuel VISCAPI
Date:
Dear Alvaro and Anton,

Thank you for your quick replies, really appreciated. I ran some more tests today and the results are as follows:

Available disk space on primary server : 101GB free in the data directory. I believe some cleaning up has been done, I
needto check with my colleagues about that. 
Wal_sender_timeout : #wal_sender_timeout = 60s
Wal_keep_segments : not present in postgresql.conf
Networking issues : none at the moment, the primary server can ping the server in standby mode by its IP address. Name
resolutiondoesn't work though. 
Monitor replication lag between the two nodes:
- Primary node :

SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 18/B6C404B8

- Standby node : TODO when the node is up again

I'm tempted to give the pgbasebackup command a go: https://www.postgresql.org/docs/13/app-pgbasebackup.html

pg_basebackup -h primary_host -D /some_app/data -U postgres -v -P

@Alvaro Herrera : the archive_command is indeed set in the primary server, as follows:

archive_command = 'test ! -f /some_app/data/mnt/server/archivedir/%f && cp %p /some_app/data/mnt/server/archivedir/%f'

There are some files in the pg_wal directory but they only amount to 289MB. If I need to rebuild the replica from
scratch,could I use the pg_basebackup command above for that purpose ? 

And you are obviously right about the error message in French. Please find it below:

2025-01-09 09:46:35.742 CET [3147382] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:40.745 CET [3147395] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:40.745 CET [3147395] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:45.749 CET [3147397] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:45.749 CET [3147397] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:50.753 CET [3147424] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:50.753 CET [3147424] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 

Thanks again for your help.

Best regards,

Samuel

-----Message d'origine-----
De : Alvaro Herrera <alvherre@alvh.no-ip.org>
Envoyé : jeudi 9 janvier 2025 12:11
À : Samuel VISCAPI <samuel.viscapi@xelians.fr>
Cc : pgsql-admin@lists.postgresql.org
Objet : Re: 2nd PostgreSQL server in WAL shipping cluster fails to start

On 2025-Jan-09, Samuel VISCAPI wrote:

> Dear all,
>
> Some years ago I set up a two PostgreSQL 13 nodes cluster on Debian 11. This cluster uses the Write-Ahead Log
Shippingmethod. I've just been told the second server (in standby mode) is down and refuses to start again with the
followingerror message (roughly translated from French) : 
>
> 2025-01-09 09:46:35.742 CET [3147382] FATAL:  could not receive data
> from WAL stream: ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:40.745 CET [3147395] LOG:  journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:40.745 CET [3147395] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:45.749 CET [3147397] LOG: journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:45.749 CET [3147397] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:50.753 CET [3147424] LOG:  journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:50.753 CET [3147424] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed

This means the standby is requesting a segment that was already removed.
You may be able to find those files in a WAL archive, if you have archive_command set in the primary.  If you do, then
itwould work to copy those to the standby's pg_wal/ subdirectory.  If you don't have them, then the replica must be
rebuilt.


Note that it's not a good idea to translate the error messages when posting -- that is, it's better to post exactly
whatthe log file has. 
For helpfulness you could post a translation for the lines separately.
But the developers can find the translated messages in the source code, if they need them ... but if you translate them
yourself,there's no way to know exactly which ones they are.  Sometimes subtle differences are important cues. 

--
Álvaro Herrera               48°01'N 7°57'E  -  https://www.EnterpriseDB.com/



RE: 2nd PostgreSQL server in WAL shipping cluster fails to start

From
Samuel VISCAPI
Date:
Dear all,

A quick follow-up message :

Files were indeed removed from the primary server in order to make some room. The directory
/some_app/data/mnt/server/archivedir/used in the "archive_command" eventually almost filled up the whole partition
(118GB)where the application data is kept. So it's been decided to remove those files in order to keep the application
running.That directory is growing again on the primary server though, with 506 files now (7.9GB). For comparison, on
thestand-by server that same directory contains only 1 file of 17MB. That file is from ~2 years ago (February 2023). I
thinkthis is when the stand-by server stopped running. As of today, that server is still down. 

Regarding the "pg_wal" directories on both nodes:

- Stand-by server : 5 files + "archive_status" directory. Among the files there is one 000000010000000D000000D6, as per
theoriginal error message. 
- Primary server : 8 files + "archive_status" directory. The file 000000010000000D000000D6 is missing there.

I still believe "pg_basebackup" to be the most promising lead to solve the issue at hand and recreate the replica from
scratch.

What do you think ? Finally, since storage size can be an issue, I understand one can configure PostgreSQL to compress
thearchive logs, as per : https://www.postgresql.org/docs/13/continuous-archiving.html#BACKUP-TIPS . Would you have any
priorexperience with that setting ? 

Many thanks again for your help.

Best regards,

Samuel

-----Message d'origine-----
De : Samuel VISCAPI
Envoyé : jeudi 30 janvier 2025 10:54
À : Alvaro Herrera <alvherre@alvh.no-ip.org>
Cc : pgsql-admin@lists.postgresql.org
Objet : RE: 2nd PostgreSQL server in WAL shipping cluster fails to start

Dear Alvaro and Anton,

Thank you for your quick replies, really appreciated. I ran some more tests today and the results are as follows:

Available disk space on primary server : 101GB free in the data directory. I believe some cleaning up has been done, I
needto check with my colleagues about that. 
Wal_sender_timeout : #wal_sender_timeout = 60s Wal_keep_segments : not present in postgresql.conf Networking issues :
noneat the moment, the primary server can ping the server in standby mode by its IP address. Name resolution doesn't
workthough. 
Monitor replication lag between the two nodes:
- Primary node :

SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 18/B6C404B8

- Standby node : TODO when the node is up again

I'm tempted to give the pgbasebackup command a go: https://www.postgresql.org/docs/13/app-pgbasebackup.html

pg_basebackup -h primary_host -D /some_app/data -U postgres -v -P

@Alvaro Herrera : the archive_command is indeed set in the primary server, as follows:

archive_command = 'test ! -f /some_app/data/mnt/server/archivedir/%f && cp %p /some_app/data/mnt/server/archivedir/%f'

There are some files in the pg_wal directory but they only amount to 289MB. If I need to rebuild the replica from
scratch,could I use the pg_basebackup command above for that purpose ? 

And you are obviously right about the error message in French. Please find it below:

2025-01-09 09:46:35.742 CET [3147382] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:40.745 CET [3147395] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:40.745 CET [3147395] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:45.749 CET [3147397] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:45.749 CET [3147397] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 
2025-01-09 09:46:50.753 CET [3147424] LOG:  démarré le flux des journaux depuis le principal à D/D6000000 sur la
timeline1 
2025-01-09 09:46:50.753 CET [3147424] FATAL:  n'a pas pu recevoir des données du flux de WAL : ERREUR:  le segment
demandédu journal de transaction, 000000010000000D000000D6, a déjà été supprimé 

Thanks again for your help.

Best regards,

Samuel

-----Message d'origine-----
De : Alvaro Herrera <alvherre@alvh.no-ip.org> Envoyé : jeudi 9 janvier 2025 12:11 À : Samuel VISCAPI
<samuel.viscapi@xelians.fr>Cc : pgsql-admin@lists.postgresql.org Objet : Re: 2nd PostgreSQL server in WAL shipping
clusterfails to start 

On 2025-Jan-09, Samuel VISCAPI wrote:

> Dear all,
>
> Some years ago I set up a two PostgreSQL 13 nodes cluster on Debian 11. This cluster uses the Write-Ahead Log
Shippingmethod. I've just been told the second server (in standby mode) is down and refuses to start again with the
followingerror message (roughly translated from French) : 
>
> 2025-01-09 09:46:35.742 CET [3147382] FATAL:  could not receive data
> from WAL stream: ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:40.745 CET [3147395] LOG:  journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:40.745 CET [3147395] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:45.749 CET [3147397] LOG: journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:45.749 CET [3147397] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed
> 2025-01-09 09:46:50.753 CET [3147424] LOG:  journal flow started from
> primary at D/D6000000 on timeline 1
> 2025-01-09 09:46:50.753 CET [3147424] FATAL:  could not receive data
> from WAL stream : ERROR:  segment requested from transaction journal,
> 000000010000000D000000D6, has already been removed

This means the standby is requesting a segment that was already removed.
You may be able to find those files in a WAL archive, if you have archive_command set in the primary.  If you do, then
itwould work to copy those to the standby's pg_wal/ subdirectory.  If you don't have them, then the replica must be
rebuilt.


Note that it's not a good idea to translate the error messages when posting -- that is, it's better to post exactly
whatthe log file has. 
For helpfulness you could post a translation for the lines separately.
But the developers can find the translated messages in the source code, if they need them ... but if you translate them
yourself,there's no way to know exactly which ones they are.  Sometimes subtle differences are important cues. 

--
Álvaro Herrera               48°01'N 7°57'E  -  https://www.EnterpriseDB.com/