Thread: May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?
May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?
From
"Dietrich, Benjamin"
Date:
Hi experts, we have a cluster that crashed because of a full 'pg_wal' disk. When it automatically tried to restart after failure, it went into recovery, crashed again - this time in recovery - and stopped (see logs below [1]). The original problem is solved and the cluster now successfully started an finished recovery. But when I restarted the cluster it HINTed: 2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted while in recovery at 2025-02-20 03:38:24 CET 2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. My question is now, if there is still a chance that the data is corrupted, although the interrupted recovery was successfully redone later? Should I recover the whole cluster from the last PITR backup to be safe? Or can I be sure that the data is in a solid state, since recovery succeeded on second try? Regards, Benjamin ---------------------- [1] logs: 2025-02-20 03:38:18.278 CET [3767151] PANIC: could not write to file "pg_wal/xlogtemp.3767151": No space left on device 2025-02-20 03:38:18.315 CET [1909] LOG: server process (PID 3767151) was terminated by signal 6: Aborted 2025-02-20 03:38:18.315 CET [1909] LOG: terminating any other active server processes 2025-02-20 03:38:21.938 CET [1909] LOG: all server processes terminated; reinitializing 2025-02-20 03:38:23.867 CET [3863518] LOG: database system was interrupted; last known up at 2025-02-20 03:34:49 CET 2025-02-20 03:38:24.089 CET [3863518] LOG: database system was not properly shut down; automatic recovery in progress 2025-02-20 03:38:24.102 CET [3863518] LOG: redo starts at E8A/EB648078 2025-02-20 03:38:24.119 CET [3863518] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s 2025-02-20 03:38:24.136 CET [3863518] FATAL: could not write to file "pg_wal/xlogtemp.3863518": No space left on device 2025-02-20 03:38:24.139 CET [1909] LOG: startup process (PID 3863518) exited with exit code 1 2025-02-20 03:38:24.139 CET [1909] LOG: terminating any other active server processes 2025-02-20 03:38:24.140 CET [1909] LOG: shutting down due to startup process failure 2025-02-20 03:38:24.213 CET [1909] LOG: database system is shut down 2025-02-20 09:26:57.597 CET [3982483] LOG: starting PostgreSQL 15.10 (Debian 15.10-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit 2025-02-20 09:26:57.597 CET [3982483] LOG: listening on IPv4 address "0.0.0.0", port 5432 2025-02-20 09:26:57.597 CET [3982483] LOG: listening on IPv6 address "::", port 5432 2025-02-20 09:26:57.598 CET [3982483] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted while in recovery at 2025-02-20 03:38:24 CET 2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. 2025-02-20 09:26:57.886 CET [3982486] LOG: database system was not properly shut down; automatic recovery in progress 2025-02-20 09:26:57.890 CET [3982486] LOG: redo starts at E8A/EB648078 2025-02-20 09:26:57.932 CET [3982486] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s 2025-02-20 09:26:57.966 CET [3982484] LOG: checkpoint starting: end-of-recovery immediate wait 2025-02-20 09:26:58.061 CET [3982484] LOG: checkpoint complete: wrote 390 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s, sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001 s; distance=26335 kB, estimate=26335 kB 2025-02-20 09:26:58.068 CET [3982483] LOG: database system is ready to accept connections
Attachment
Re: May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?
From
Laurenz Albe
Date:
On Thu, 2025-02-20 at 15:00 +0000, Dietrich, Benjamin wrote: > we have a cluster that crashed because of a full 'pg_wal' disk. > When it automatically tried to restart after failure, it went into recovery, > crashed again - this time in recovery - and stopped (see logs below [1]). > > The original problem is solved and the cluster now successfully started an > finished recovery. > But when I restarted the cluster it HINTed: > > 2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted > while in recovery at 2025-02-20 03:38:24 CET > 2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some > data is corrupted and you will have to use the last backup for recovery. > > > My question is now, if there is still a chance that the data is corrupted, > although the interrupted recovery was successfully redone later? No, if you didn't take other steps to break the database, like remove files or run "pg_resetwal", everything should be in perfet order. Perhaps that hint should be defused to say "This could mean that..." > ---------------------- > [1] logs: > > 2025-02-20 03:38:18.278 CET [3767151] PANIC: could not write to file > "pg_wal/xlogtemp.3767151": No space left on device > [...] > 2025-02-20 03:38:23.867 CET [3863518] LOG: database system was interrupted; > last known up at 2025-02-20 03:34:49 CET > [...] > 2025-02-20 03:38:24.136 CET [3863518] FATAL: could not write to file > "pg_wal/xlogtemp.3863518": No space left on device > [...] > 2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted > while in recovery at 2025-02-20 03:38:24 CET > 2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some > data is corrupted and you will have to use the last backup for recovery. > 2025-02-20 09:26:57.886 CET [3982486] LOG: database system was not properly > shut down; automatic recovery in progress > 2025-02-20 09:26:57.890 CET [3982486] LOG: redo starts at E8A/EB648078 > 2025-02-20 09:26:57.932 CET [3982486] LOG: redo done at E8A/EC19B6B0 system > usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s > 2025-02-20 09:26:57.966 CET [3982484] LOG: checkpoint starting: > end-of-recovery immediate wait > 2025-02-20 09:26:58.061 CET [3982484] LOG: checkpoint complete: wrote 390 > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s, > sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001 > s; distance=26335 kB, estimate=26335 kB > 2025-02-20 09:26:58.068 CET [3982483] LOG: database system is ready to > accept connections That confirms that everything worked the way it should, and your database will be fine. Yours, Laurenz Albe -- *E-Mail Disclaimer* Der Inhalt dieser E-Mail ist ausschliesslich fuer den bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte, dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen. *CONFIDENTIALITY NOTICE & DISCLAIMER *This message and any attachment are confidential and may be privileged or otherwise protected from disclosure and solely for the use of the person(s) or entity to whom it is intended. If you have received this message in error and are not the intended recipient, please notify the sender immediately and delete this message and any attachment from your system. If you are not the intended recipient, be advised that any use of this message is prohibited and may be unlawful, and you must not copy this message or attachment or disclose the contents to any other person.