Thread: WAL File Recovery on Standby Server Stops Before End of WAL Files

WAL File Recovery on Standby Server Stops Before End of WAL Files

From

"Ryan, Les"

Date:

27 October 2021, 16:42:52

Hello,

I’m hoping to get some suggestions on what to do here. I am running PostgreSQL version 13.2 and am shipping the WAL files to a standby server. Once a day I restart the standby server and it recovers the new WAL files that have been shipped to it. Everything was working great until yesterday. My problem is that the WAL recovery is now stopping before it recovers all of the available WAL files. This happened once before and the only way I could get the WAL recovery to go past that file was to create a fresh back and restore that. I’m hoping to avoid that as it takes about a week to create the backup.

Here are the specifics:

PostgreSQL version 13.2
The primary server creates the WAL files and a scheduled process copies them to a folder on the standby server.
Once a day, the standby server is restared using the following command:
- "C:\Program Files\PostgreSQL\13\bin\pg_ctl" restart -D .\
The log contains the following:

2021-10-27 10:26:30.508 MDT [6204] LOG: starting PostgreSQL 13.2, compiled by Visual C++ build 1914, 64-bit

2021-10-27 10:26:30.509 MDT [6204] LOG: listening on IPv6 address "::", port 5432

2021-10-27 10:26:30.510 MDT [6204] LOG: listening on IPv4 address "0.0.0.0", port 5432

2021-10-27 10:26:30.585 MDT [2012] LOG: database system was shut down in recovery at 2021-10-27 10:26:29 MDT

2021-10-27 10:26:30.701 MDT [2012] LOG: entering standby mode

2021-10-27 10:26:30.821 MDT [2012] LOG: restored log file "000000010000041900000056" from archive

2021-10-27 10:26:31.158 MDT [2012] LOG: restored log file "000000010000041900000052" from archive

2021-10-27 10:26:31.467 MDT [2012] LOG: redo starts at 419/5229A858

2021-10-27 10:26:31.561 MDT [2012] LOG: restored log file "000000010000041900000053" from archive

2021-10-27 10:26:32.108 MDT [2012] LOG: restored log file "000000010000041900000054" from archive

2021-10-27 10:26:32.849 MDT [2012] LOG: restored log file "000000010000041900000055" from archive

2021-10-27 10:26:33.612 MDT [2012] LOG: restored log file "000000010000041900000056" from archive

2021-10-27 10:26:34.342 MDT [2012] LOG: restored log file "000000010000041900000057" from archive

2021-10-27 10:26:35.146 MDT [2012] LOG: restored log file "000000010000041900000058" from archive

2021-10-27 10:26:35.718 MDT [2012] LOG: restored log file "000000010000041900000059" from archive

2021-10-27 10:26:36.188 MDT [2012] LOG: restored log file "00000001000004190000005A" from archive

2021-10-27 10:26:36.750 MDT [2012] LOG: consistent recovery state reached at 419/5ABFFFF8

2021-10-27 10:26:36.752 MDT [6204] LOG: database system is ready to accept read only connections

2021-10-27 10:26:36.823 MDT [6040] LOG: started streaming WAL from primary at 419/5A000000 on timeline 1

There are many more WAL files available starting with 00000001000004190000005B but the restore process always stops at 00000001000004190000005A.

I have two questions:

Why does the WAL file recovery process now stop after it reads 00000001000004190000005A?
What do I need to do to get PostgreSQL to recover the rest of the available WAL files.

Thanks in advance for any suggestions.

Sincerely,

-Les

Les Ryan, P.Eng | WSP

SCADA Engineer

Energy, Resources & Industry

T +1 403-813-6327

E les.ryan@wsp.com

O 405 18 St SE. Calgary, Alberta T2E 6J5

NOTICE: This communication and any attachments ("this message") may contain information which is privileged, confidential, proprietary or otherwise subject to restricted disclosure under applicable law. This message is for the sole use of the intended recipient(s). Any unauthorized use, disclosure, viewing, copying, alteration, dissemination or distribution of, or reliance on, this message is strictly prohibited. If you have received this message in error, or you are not an authorized or intended recipient, please notify the sender immediately by replying to this message, delete this message and all copies from your e-mail system and destroy any printed copies. You are receiving this communication because you are listed as a current WSP contact. Should you have any questions regarding WSP's electronic communications policy, please consult our Anti-Spam Commitment at www.wsp.com/casl. For any concern or if you believe you should not be receiving this message, please forward this message to caslcompliance@wsp.com so that we can promptly address your request. Note that not all messages sent by WSP qualify as commercial electronic messages.

AVIS : Ce message, incluant tout fichier l'accompagnant (« le message »), peut contenir des renseignements ou de l'information privilégiés, confidentiels, propriétaires ou à divulgation restreinte en vertu de la loi. Ce message est destiné à l'usage exclusif du/des destinataire(s) voulu(s). Toute utilisation non permise, divulgation, lecture, reproduction, modification, diffusion ou distribution est interdite. Si vous avez reçu ce message par erreur, ou que vous n'êtes pas un destinataire autorisé ou voulu, veuillez en aviser l'expéditeur immédiatement et détruire le message et toute copie électronique ou imprimée. Vous recevez cette communication car vous faites partie des contacts de WSP. Si vous avez des questions concernant la politique de communications électroniques de WSP, veuillez consulter notre Engagement anti-pourriel au www.wsp.com/lcap. Pour toute question ou si vous croyez que vous ne devriez pas recevoir ce message, prière de le transférer au conformitelcap@wsp.com afin que nous puissions rapidement traiter votre demande. Notez que ce ne sont pas tous les messages transmis par WSP qui constituent des messages electroniques commerciaux.

-LAEmHhHzdJzBlTWfa4Hgs7pbKl

Re: WAL File Recovery on Standby Server Stops Before End of WAL Files

From

Kyotaro Horiguchi

Date:

28 October 2021, 01:58:31

At Wed, 27 Oct 2021 16:42:52 +0000, "Ryan, Les" <Les.Ryan@wsp.com> wrote in 
> 2021-10-27 10:26:31.467 MDT [2012] LOG:  redo starts at 419/5229A858
...
> 2021-10-27 10:26:36.188 MDT [2012] LOG:  restored log file "00000001000004190000005A" from archive
> 2021-10-27 10:26:36.750 MDT [2012] LOG:  consistent recovery state reached at 419/5ABFFFF8
> 2021-10-27 10:26:36.752 MDT [6204] LOG:  database system is ready to accept read only connections
> 2021-10-27 10:26:36.823 MDT [6040] LOG:  started streaming WAL from primary at 419/5A000000 on timeline 1
> 
>   *   There are many more WAL files available starting with 00000001000004190000005B but the restore process always
stopsat 00000001000004190000005A.
 
> 
> I have two questions:
> 
>   *   Why does the WAL file recovery process now stop after it reads 00000001000004190000005A?
>   *   What do I need to do to get PostgreSQL to recover the rest of the available WAL files.

The info above alone donesn't clearly suggest anything about the
reason. Could you show the result of "pg_waldump
00000001000004190000005A 2>&1 | tail -5"?  What I'm expecting to see
is an error message from pg_waldump before the end of the file. It
would be the immediate cause of the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: WAL File Recovery on Standby Server Stops Before End of WAL Files

From

Dilip Kumar

Date:

28 October 2021, 04:29:11

On Thu, Oct 28, 2021 at 7:28 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 27 Oct 2021 16:42:52 +0000, "Ryan, Les" <Les.Ryan@wsp.com> wrote in
> > 2021-10-27 10:26:31.467 MDT [2012] LOG:  redo starts at 419/5229A858
> ...
> > 2021-10-27 10:26:36.188 MDT [2012] LOG:  restored log file "00000001000004190000005A" from archive
> > 2021-10-27 10:26:36.750 MDT [2012] LOG:  consistent recovery state reached at 419/5ABFFFF8
> > 2021-10-27 10:26:36.752 MDT [6204] LOG:  database system is ready to accept read only connections
> > 2021-10-27 10:26:36.823 MDT [6040] LOG:  started streaming WAL from primary at 419/5A000000 on timeline 1
> >
> >   *   There are many more WAL files available starting with 00000001000004190000005B but the restore process always
stopsat 00000001000004190000005A.
 
> >
> > I have two questions:
> >
> >   *   Why does the WAL file recovery process now stop after it reads 00000001000004190000005A?
> >   *   What do I need to do to get PostgreSQL to recover the rest of the available WAL files.
>
> The info above alone donesn't clearly suggest anything about the
> reason. Could you show the result of "pg_waldump
> 00000001000004190000005A 2>&1 | tail -5"?  What I'm expecting to see
> is an error message from pg_waldump before the end of the file. It
> would be the immediate cause of the problem.

+1, that will be the best place to start with, additionally, you can
enable DEBUG2 message so that from logs we can identify why it could
not continue recovery from the archive.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

RE: WAL File Recovery on Standby Server Stops Before End of WAL Files

From

"Ryan, Les"

Date:

28 October 2021, 13:52:36

Hi Kyotaro and Dilip,

Thank you for getting back to me.

Kyotaro: I ran pg_dump and the output was "pg_waldump: fatal: could not read file "00000001000004190000005A": read 50
of8192".  I'm guessing that it means that wal file 00000001000004190000005A is corrupted and that is why the recovery
processstops there.  Is there any way to fix the file?

Dilip:   setting the log level to debug2 did not provide any additional information.  Here are the log entries:

2021-10-28 06:51:06.166 MDT [7556] LOG:  restored log file "000000010000041900000059" from archive
2021-10-28 06:51:06.464 MDT [7556] DEBUG:  got WAL segment from archive
2021-10-28 06:51:06.579 MDT [7556] LOG:  restored log file "00000001000004190000005A" from archive
2021-10-28 06:51:06.854 MDT [7556] DEBUG:  got WAL segment from archive
2021-10-28 06:51:07.107 MDT [7556] LOG:  consistent recovery state reached at 419/5ABFFFF8
2021-10-28 06:51:07.107 MDT [7556] DEBUG:  switched WAL source from archive to stream after failure
2021-10-28 06:51:07.109 MDT [7844] LOG:  database system is ready to accept read only connections
2021-10-28 06:51:07.152 MDT [7844] DEBUG:  forked new backend, pid=6900 socket=6068

I set the log level to debug5 and here is what I got:

2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFF60 for Btree/INSERT_LEAF: off 130
2021-10-28 06:25:41.262 MDT [6288] DEBUG:  record known xact 33776257 latestObservedXid 33776257
2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFFA0 for Heap/INSERT: off 95 flags 0x00
2021-10-28 06:25:41.262 MDT [6288] LOG:  consistent recovery state reached at 419/5ABFFFF8
2021-10-28 06:25:41.263 MDT [6288] DEBUG:  switched WAL source from archive to stream after failure
2021-10-28 06:25:41.264 MDT [5512] LOG:  database system is ready to accept read only connections

Does the "switched WAL source from archive to stream after failure" indicate a problem with the WAL file?

Anyway, it looks like I need to restore the standby server from a new backup.  Thank you both for your help.

Sincerely,
-Les

-----Original Message-----
From: Dilip Kumar <dilipbalaut@gmail.com>
Sent: October 27, 2021 10:29 PM
To: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Cc: Ryan, Les <Les.Ryan@wsp.com>; pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: WAL File Recovery on Standby Server Stops Before End of WAL Files

On Thu, Oct 28, 2021 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 27 Oct 2021 16:42:52 +0000, "Ryan, Les" <Les.Ryan@wsp.com>
> wrote in
> > 2021-10-27 10:26:31.467 MDT [2012] LOG:  redo starts at 419/5229A858
> ...
> > 2021-10-27 10:26:36.188 MDT [2012] LOG:  restored log file
> > "00000001000004190000005A" from archive
> > 2021-10-27 10:26:36.750 MDT [2012] LOG:  consistent recovery state
> > reached at 419/5ABFFFF8
> > 2021-10-27 10:26:36.752 MDT [6204] LOG:  database system is ready to
> > accept read only connections
> > 2021-10-27 10:26:36.823 MDT [6040] LOG:  started streaming WAL from
> > primary at 419/5A000000 on timeline 1
> >
> >   *   There are many more WAL files available starting with 00000001000004190000005B but the restore process always
stopsat 00000001000004190000005A.

> >
> > I have two questions:
> >
> >   *   Why does the WAL file recovery process now stop after it reads 00000001000004190000005A?
> >   *   What do I need to do to get PostgreSQL to recover the rest of the available WAL files.
>
> The info above alone donesn't clearly suggest anything about the
> reason. Could you show the result of "pg_waldump
> 00000001000004190000005A 2>&1 | tail -5"?  What I'm expecting to see
> is an error message from pg_waldump before the end of the file. It
> would be the immediate cause of the problem.

+1, that will be the best place to start with, additionally, you can
enable DEBUG2 message so that from logs we can identify why it could not continue recovery from the archive.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

________________________________

NOTICE: This communication and any attachments ("this message") may contain information which is privileged,
confidential,proprietary or otherwise subject to restricted disclosure under applicable law. This message is for the
soleuse of the intended recipient(s). Any unauthorized use, disclosure, viewing, copying, alteration, dissemination or
distributionof, or reliance on, this message is strictly prohibited. If you have received this message in error, or you
arenot an authorized or intended recipient, please notify the sender immediately by replying to this message, delete
thismessage and all copies from your e-mail system and destroy any printed copies. You are receiving this communication
becauseyou are listed as a current WSP contact. Should you have any questions regarding WSP's electronic communications
policy,please consult our Anti-Spam Commitment at www.wsp.com/casl<http://www.wsp.com/casl>. For any concern or if you
believeyou should not be receiving this message, please forward this message to
caslcompliance@wsp.com<mailto:caslcompliance@wsp.com>so that we can promptly address your request. Note that not all
messagessent by WSP qualify as commercial electronic messages.

AVIS : Ce message, incluant tout fichier l'accompagnant (« le message »), peut contenir des renseignements ou de
l'informationprivilégiés, confidentiels, propriétaires ou à divulgation restreinte en vertu de la loi. Ce message est
destinéà l'usage exclusif du/des destinataire(s) voulu(s). Toute utilisation non permise, divulgation, lecture,
reproduction,modification, diffusion ou distribution est interdite. Si vous avez reçu ce message par erreur, ou que
vousn'êtes pas un destinataire autorisé ou voulu, veuillez en aviser l'expéditeur immédiatement et détruire le message
ettoute copie électronique ou imprimée. Vous recevez cette communication car vous faites partie des contacts de WSP. Si
vousavez des questions concernant la politique de communications électroniques de WSP, veuillez consulter notre
Engagementanti-pourriel au www.wsp.com/lcap<http://www.wsp.com/lcap>. Pour toute question ou si vous croyez que vous ne
devriezpas recevoir ce message, prière de le transférer au conformitelcap@wsp.com<mailto:conformitelcap@wsp.com> afin
quenous puissions rapidement traiter votre demande. Notez que ce ne sont pas tous les messages transmis par WSP qui
constituentdes messages electroniques commerciaux.

-LAEmHhHzdJzBlTWfa4Hgs7pbKl

Re: WAL File Recovery on Standby Server Stops Before End of WAL Files

From

Kyotaro Horiguchi

Date:

29 October 2021, 01:44:46

At Thu, 28 Oct 2021 13:52:36 +0000, "Ryan, Les" <Les.Ryan@wsp.com> wrote in 
> Hi Kyotaro and Dilip,
> 
> Thank you for getting back to me.
> 
> Kyotaro: I ran pg_dump and the output was "pg_waldump: fatal: could not read file "00000001000004190000005A": read 50
of8192".  I'm guessing that it means that wal file 00000001000004190000005A is corrupted and that is why the recovery
processstops there.  Is there any way to fix the file?
 

No way, unless the segment is still living in the primary's
pg_wal. Your archive storage is unstable or archive_command is not
performing its task reliably enough.

> Dilip:   setting the log level to debug2 did not provide any additional information.  Here are the log entries:
> 
> 2021-10-28 06:51:06.166 MDT [7556] LOG:  restored log file "000000010000041900000059" from archive
> 2021-10-28 06:51:06.464 MDT [7556] DEBUG:  got WAL segment from archive
> 2021-10-28 06:51:06.579 MDT [7556] LOG:  restored log file "00000001000004190000005A" from archive
> 2021-10-28 06:51:06.854 MDT [7556] DEBUG:  got WAL segment from archive
> 2021-10-28 06:51:07.107 MDT [7556] LOG:  consistent recovery state reached at 419/5ABFFFF8
> 2021-10-28 06:51:07.107 MDT [7556] DEBUG:  switched WAL source from archive to stream after failure
> 2021-10-28 06:51:07.109 MDT [7844] LOG:  database system is ready to accept read only connections
> 2021-10-28 06:51:07.152 MDT [7844] DEBUG:  forked new backend, pid=6900 socket=6068
> 
> I set the log level to debug5 and here is what I got:
> 
> 2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFF60 for Btree/INSERT_LEAF: off 130
> 2021-10-28 06:25:41.262 MDT [6288] DEBUG:  record known xact 33776257 latestObservedXid 33776257
> 2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFFA0 for Heap/INSERT: off 95 flags 0x00
> 2021-10-28 06:25:41.262 MDT [6288] LOG:  consistent recovery state reached at 419/5ABFFFF8
> 2021-10-28 06:25:41.263 MDT [6288] DEBUG:  switched WAL source from archive to stream after failure
> 2021-10-28 06:25:41.264 MDT [5512] LOG:  database system is ready to accept read only connections
> 
> Does the "switched WAL source from archive to stream after failure" indicate a problem with the WAL file?

Actually the message says that some trouble happend while reading file
but that happens always at the end of wal.  We might need a bit more
detailed message about the trouble in higher debug level messages.

> Anyway, it looks like I need to restore the standby server from a new backup.  Thank you both for your help.

Unfortunately I think so, too.

> Sincerely,
> -Les

-- 
Kyotaro Horiguchi
NTT Open Source Software Center