
From Agarwal, Pragati - Dell Team
Subject RE: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record
Date
Msg-id CO1PR19MB4967E7E513BFE852732A2A5E9E309@CO1PR19MB4967.namprd19.prod.outlook.com
In response to Re: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-bugs
Hi Merlin, Julien,

Thanks for your response. 

Besides this, there is a public schema in Postgres which we are not using in our Datastore service.

As a workaround for the upgrade process, we plan to drop the public schema. We would like to know your thoughts on this and whether it would have any detrimental impact on the data or service.
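Roughly what we have in mind is sketched below (the database name "dpa" is a placeholder, the port is the one from our server log, and we understand DROP SCHEMA ... CASCADE removes every object inside the schema):

    :: inspect the schema and whatever it still contains ("dpa" is a placeholder db name)
    psql -p 9003 -d dpa -c "\dn+ public"
    psql -p 9003 -d dpa -c "SELECT c.relname, c.relkind FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'public';"
    :: only once it is confirmed unused:
    psql -p 9003 -d dpa -c "DROP SCHEMA public CASCADE;"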

Thanks,
Pragati


Internal Use - Confidential

-----Original Message-----
From: Merlin Moncure <mmoncure@gmail.com> 
Sent: Thursday, February 10, 2022 9:12 PM
To: Julien Rouhaud
Cc: Silaparasetti, Ramesh; pgsql-bugs@lists.postgresql.org; Kishore, Nanda - Dell Team; Mahendrakar, Prabhakar - Dell Team; Agarwal, Pragati - Dell Team
Subject: Re: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record


[EXTERNAL EMAIL] 

On Thu, Feb 10, 2022 at 8:45 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Feb 10, 2022 at 01:13:27PM +0000, Silaparasetti, Ramesh wrote:
> >
> > 1. Below is the output of the command: "<DPA_INSTALL_DIRECTORY>\services\datastore\engine\bin\pg_controldata.exe" -D "<DPA_INSTALL_DIRECTORY>\services\datastore\data"
> > C:\Program Files\EMC\DPA\services\datastore\engine\bin>pg_controldata.exe -D "F:\datastore\data\data"
> > pg_control version number:                   1300
> > Catalog version number:                      202007201
> > Database system identifier:                  7054941472659574120
> > Database cluster state:                      shut down
> > pg_control last modified:                    07.02.2022 14:57:30
> > Latest checkpoint location:                  9/C80000D8
> > Latest checkpoint's REDO location:           9/C80000D8
> > Latest checkpoint's REDO WAL file:           0000000100000009000000C8
> > [...]
> > 2. As you suggested, we verified the value of Latest checkpoint's REDO WAL file: 0000000100000009000000C8.
> >
> > This WAL file does not exist in the pg_wal directory.
> > We have enabled debug logging and below is the logging information from Postgres.
> >
> > 2022-02-10 11:38:05.675 CET [7916] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1900, 64-bit
> > 2022-02-10 11:38:05.679 CET [7916] LOG:  listening on IPv4 address "127.0.0.1", port 9003
> > 2022-02-10 11:38:05.681 CET [7916] LOG:  listening on IPv4 address "10.91.198.36", port 9003
> > 2022-02-10 11:38:06.756 CET [348] LOG:  database system was shut down at 2022-02-07 14:57:30 CET
> > 2022-02-10 11:38:06.756 CET [348] DEBUG:  mapped win32 error code 2 to 2
> > 2022-02-10 11:38:06.757 CET [348] DEBUG:  mapped win32 error code 2 to 2
> > 2022-02-10 11:38:06.757 CET [348] DEBUG:  could not open file "pg_wal/0000000100000009000000C8": No such file or directory
>
> So, unless you can find that 0000000100000009000000C8 file (and all 
> the files after that), your instance is corrupted and you lost data.  
> If you have WAL archiving or streaming to another location you should 
> be able to recover from that, assuming that no other files are 
> damaged.  Otherwise your best shot is restoring from a backup.
>
> > 4. Is it ok to execute "pg_resetwal" to recover from this situation? Does it pose any risk of data loss?
>
> pg_resetwal will make the situation worse.  The server will start but 
> in a totally inconsistent state.  This should be your last choice, and 
> understand that it will irremediably corrupt your system even more.
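(If you do have WAL archiving set up somewhere, archive recovery on 13.x is roughly the following; the archive path here is purely an assumption, and restore_command must match wherever your archive actually lives:)

    :: 1) in postgresql.conf, point restore_command at the archive, e.g.
    ::      restore_command = 'copy "F:\\walarchive\\%f" "%p"'
    :: 2) create an empty recovery.signal in the data directory to request archive recovery
    type nul > "F:\datastore\data\data\recovery.signal"
    :: 3) start the server; it will replay WAL fetched via restore_command
    pg_ctl start -D "F:\datastore\data\data"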

If it was me, I'd take a full filesystem-level backup, and then run
pg_resetwal on the copy (or maybe a copy of the copy).  The damage may or
may not be very significant, and getting a baseline to amend is important
(or perhaps something to pull data from, to check against a backup you
hopefully have).  If you go that route, immediately take a pg_dump
and restore it, and you should have something workable, albeit with
missing or incompletely applied transactions.
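A rough sketch of that sequence on Windows (paths and the spare port are illustrative, based on the directories earlier in the thread; everything runs against the copy, never the original):

    :: take the filesystem-level copy first
    robocopy "F:\datastore\data\data" "F:\recovery\data" /E /COPYALL
    :: reset the WAL on the COPY only (destructive; last resort, as noted above)
    pg_resetwal -D "F:\recovery\data"
    :: start the copy on a spare port, then dump everything out and restore it elsewhere
    pg_ctl start -D "F:\recovery\data" -o "-p 9004"
    pg_dumpall -p 9004 -U postgres > F:\recovery\full_dump.sql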

A pristine copy is important; there may be data sitting in there that you can recover if it comes to that.

As to *why* this occurred, there are four major things to look at:
*) Storage issues (consider enabling checksums immediately if not already done; see the sketch after this list)
*) O/S Issues
*) 3rd party code running in postgres (may not apply to you. What non-core extensions are you running?)
*) postgres bugs.  You are running 13.1; the current release is 13.5.
This is not good, and raises your risk significantly.  ALWAYS apply bugfixes quickly, doubly so on a .2 or earlier minor release: based on historical context, the severity and number of bugs in .0, .1, and .2 releases tend to be higher.
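(On enabling checksums: they cannot be turned on while a 13.x cluster is running, so this is done offline with pg_checksums, and preferably on a cluster you already trust, e.g. the freshly restored copy; a minimal sketch:)

    :: server must be cleanly shut down; --enable rewrites every data file (PG 12+)
    pg_checksums --enable -D "F:\recovery\data"
    :: afterwards, offline verification is also possible
    pg_checksums --check -D "F:\recovery\data"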

merlin
