Thread: May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?

Hi experts,

we have a cluster that crashed because of a full 'pg_wal' disk.
When it automatically tried to restart after failure, it went into recovery,
crashed again - this time in recovery - and stopped (see logs below [1]).

The original problem is solved and the cluster now successfully started an
finished recovery.
But when I restarted the cluster it HINTed:

2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
data is corrupted and you will have to use the last backup for recovery.


My question is now, if there is still a chance that the data is corrupted,
although the interrupted recovery was successfully redone later?
Should I recover the whole cluster from the last PITR backup to be safe? Or
can I be sure that the data is in a solid state, since recovery succeeded on
second try?

Regards,
Benjamin 


----------------------
[1] logs:

2025-02-20 03:38:18.278 CET [3767151] PANIC:  could not write to file
"pg_wal/xlogtemp.3767151": No space left on device
2025-02-20 03:38:18.315 CET [1909] LOG:  server process (PID 3767151) was
terminated by signal 6: Aborted
2025-02-20 03:38:18.315 CET [1909] LOG:  terminating any other active server
processes
2025-02-20 03:38:21.938 CET [1909] LOG:  all server processes terminated;
reinitializing
2025-02-20 03:38:23.867 CET [3863518] LOG:  database system was interrupted;
last known up at 2025-02-20 03:34:49 CET
2025-02-20 03:38:24.089 CET [3863518] LOG:  database system was not properly
shut down; automatic recovery in progress
2025-02-20 03:38:24.102 CET [3863518] LOG:  redo starts at E8A/EB648078
2025-02-20 03:38:24.119 CET [3863518] LOG:  redo done at E8A/EC19B6B0 system
usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s
2025-02-20 03:38:24.136 CET [3863518] FATAL:  could not write to file
"pg_wal/xlogtemp.3863518": No space left on device
2025-02-20 03:38:24.139 CET [1909] LOG:  startup process (PID 3863518)
exited with exit code 1
2025-02-20 03:38:24.139 CET [1909] LOG:  terminating any other active server
processes
2025-02-20 03:38:24.140 CET [1909] LOG:  shutting down due to startup
process failure
2025-02-20 03:38:24.213 CET [1909] LOG:  database system is shut down
2025-02-20 09:26:57.597 CET [3982483] LOG:  starting PostgreSQL 15.10
(Debian 15.10-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian
12.2.0-14) 12.2.0, 64-bit
2025-02-20 09:26:57.597 CET [3982483] LOG:  listening on IPv4 address
"0.0.0.0", port 5432
2025-02-20 09:26:57.597 CET [3982483] LOG:  listening on IPv6 address "::",
port 5432
2025-02-20 09:26:57.598 CET [3982483] LOG:  listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
data is corrupted and you will have to use the last backup for recovery.
2025-02-20 09:26:57.886 CET [3982486] LOG:  database system was not properly
shut down; automatic recovery in progress
2025-02-20 09:26:57.890 CET [3982486] LOG:  redo starts at E8A/EB648078
2025-02-20 09:26:57.932 CET [3982486] LOG:  redo done at E8A/EC19B6B0 system
usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
2025-02-20 09:26:57.966 CET [3982484] LOG:  checkpoint starting:
end-of-recovery immediate wait
2025-02-20 09:26:58.061 CET [3982484] LOG:  checkpoint complete: wrote 390
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s,
sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001
s; distance=26335 kB, estimate=26335 kB
2025-02-20 09:26:58.068 CET [3982483] LOG:  database system is ready to
accept connections

Attachment
On Thu, 2025-02-20 at 15:00 +0000, Dietrich, Benjamin wrote:
> we have a cluster that crashed because of a full 'pg_wal' disk.
> When it automatically tried to restart after failure, it went into recovery,
> crashed again - this time in recovery - and stopped (see logs below [1]).
> 
> The original problem is solved and the cluster now successfully started an
> finished recovery.
> But when I restarted the cluster it HINTed:
> 
> 2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
> while in recovery at 2025-02-20 03:38:24 CET
> 2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
> data is corrupted and you will have to use the last backup for recovery.
> 
> 
> My question is now, if there is still a chance that the data is corrupted,
> although the interrupted recovery was successfully redone later?

No, if you didn't take other steps to break the database, like remove files
or run "pg_resetwal", everything should be in perfet order.

Perhaps that hint should be defused to say "This could mean that..."

> ----------------------
> [1] logs:
> 
> 2025-02-20 03:38:18.278 CET [3767151] PANIC:  could not write to file
> "pg_wal/xlogtemp.3767151": No space left on device
> [...]
> 2025-02-20 03:38:23.867 CET [3863518] LOG:  database system was interrupted;
> last known up at 2025-02-20 03:34:49 CET
> [...]
> 2025-02-20 03:38:24.136 CET [3863518] FATAL:  could not write to file
> "pg_wal/xlogtemp.3863518": No space left on device
> [...]
> 2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
> while in recovery at 2025-02-20 03:38:24 CET
> 2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
> data is corrupted and you will have to use the last backup for recovery.
> 2025-02-20 09:26:57.886 CET [3982486] LOG:  database system was not properly
> shut down; automatic recovery in progress
> 2025-02-20 09:26:57.890 CET [3982486] LOG:  redo starts at E8A/EB648078
> 2025-02-20 09:26:57.932 CET [3982486] LOG:  redo done at E8A/EC19B6B0 system
> usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
> 2025-02-20 09:26:57.966 CET [3982484] LOG:  checkpoint starting:
> end-of-recovery immediate wait
> 2025-02-20 09:26:58.061 CET [3982484] LOG:  checkpoint complete: wrote 390
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s,
> sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001
> s; distance=26335 kB, estimate=26335 kB
> 2025-02-20 09:26:58.068 CET [3982483] LOG:  database system is ready to
> accept connections

That confirms that everything worked the way it should, and your database
will be fine.

Yours,
Laurenz Albe

-- 

*E-Mail Disclaimer*
Der Inhalt dieser E-Mail ist ausschliesslich fuer den 
bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat 
dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte, 
dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder 
Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich 
in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen.

*CONFIDENTIALITY NOTICE & DISCLAIMER
*This message and any attachment are 
confidential and may be privileged or otherwise protected from disclosure 
and solely for the use of the person(s) or entity to whom it is intended. 
If you have received this message in error and are not the intended 
recipient, please notify the sender immediately and delete this message and 
any attachment from your system. If you are not the intended recipient, be 
advised that any use of this message is prohibited and may be unlawful, and 
you must not copy this message or attachment or disclose the contents to 
any other person.