Thread: RE: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record

From: "Silaparasetti, Ramesh"
Hi Team,

 

Below is the information you suggested we gather; kindly let us know your response:

 

------------------------------------

1. Below is the output of the command: <DPA_INSTALL_DIRECTORY>\services\datastore\engine\bin\pg_controldata.exe -D "<DPA_INSTALL_DIRECTORY>\services\datastore\data"

 

Our customer's locale is German; below is the output (translated here to the standard English pg_controldata labels):

C:\Program Files\EMC\DPA\services\datastore\engine\bin>pg_controldata.exe -D "F:\datastore\data\data"

pg_control version number:            1300
Catalog version number:               202007201
Database system identifier:           7054941472659574120
Database cluster state:               shut down
pg_control last modified:             07.02.2022 14:57:30
Latest checkpoint location:           9/C80000D8
Latest checkpoint's REDO location:    9/C80000D8
Latest checkpoint's REDO WAL file:    0000000100000009000000C8
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0:230890
Latest checkpoint's NextOID:          38960
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        549
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:            07.02.2022 14:57:30
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              250
max_worker_processes setting:         8
max_wal_senders setting:              1
max_prepared_xacts setting:           250
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            7b9b893df8e583b7cf43c1647aa0433dc3dabf9eb00bde88bfaf879695cadf78

 

C:\Program Files\EMC\DPA\services\datastore\engine\bin>
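For context on why that segment name matters: a WAL segment file name encodes the timeline and the LSN at which the segment starts. A minimal sketch of the arithmetic (not PostgreSQL's actual C code), assuming the 16MB segment size reported in the output above:

```python
# Decode a WAL segment file name (TTTTTTTTXXXXXXXXYYYYYYYY) into its
# timeline and the LSN at which the segment starts, assuming 16MB segments.
WAL_SEG_SIZE = 16 * 1024 * 1024  # "Bytes per WAL segment" above

def wal_segment_start(name: str):
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)   # high 32 bits of the LSN
    seg = int(name[16:24], 16)  # segment number within that log file
    start = (log << 32) | (seg * WAL_SEG_SIZE)
    return timeline, f"{start >> 32:X}/{start & 0xFFFFFFFF:X}"

# The segment named above starts at 9/C8000000 on timeline 1, so it is
# precisely the one that should contain the REDO location 9/C80000D8.
print(wal_segment_start("0000000100000009000000C8"))
```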

 

--------------------------------

 

2. As you suggested, we verified the value of Latest checkpoint's REDO WAL file: 0000000100000009000000C8.

This WAL file does not exist in the pg_wal directory.

 

We have enabled debug logging and below is the logging information from Postgres.

 

2022-02-10 11:38:05.675 CET [7916] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1900, 64-bit
2022-02-10 11:38:05.679 CET [7916] LOG:  listening on IPv4 address "127.0.0.1", port 9003
2022-02-10 11:38:05.681 CET [7916] LOG:  listening on IPv4 address "10.91.198.36", port 9003
2022-02-10 11:38:06.756 CET [348] LOG:  database system was shut down at 2022-02-07 14:57:30 CET
2022-02-10 11:38:06.756 CET [348] DEBUG:  mapped win32 error code 2 to 2
2022-02-10 11:38:06.757 CET [348] DEBUG:  mapped win32 error code 2 to 2
2022-02-10 11:38:06.757 CET [348] DEBUG:  could not open file "pg_wal/0000000100000009000000C8": No such file or directory
2022-02-10 11:38:06.760 CET [7916] DEBUG:  reaping dead processes
2022-02-10 11:38:06.760 CET [7916] LOG:  startup process (PID 348) was terminated by exception 0xC0000005
2022-02-10 11:38:06.760 CET [7916] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
2022-02-10 11:38:06.760 CET [7916] LOG:  aborting startup due to startup process failure
2022-02-10 11:38:06.760 CET [7916] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2022-02-10 11:38:06.760 CET [7916] DEBUG:  shmem_exit(1): 3 on_shmem_exit callbacks to make
2022-02-10 11:38:06.760 CET [7916] DEBUG:  cleaning up dynamic shared memory control segment with ID 1993677498
2022-02-10 11:38:06.774 CET [7916] DEBUG:  proc_exit(1): 2 callbacks to make
2022-02-10 11:38:06.774 CET [7916] LOG:  database system is shut down
2022-02-10 11:38:06.774 CET [7916] DEBUG:  exit(1)
2022-02-10 11:38:06.774 CET [7916] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-02-10 11:38:06.774 CET [7916] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-02-10 11:38:06.774 CET [7916] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-02-10 11:38:06.778 CET [904] DEBUG:  logger shutting down
2022-02-10 11:38:06.779 CET [904] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2022-02-10 11:38:06.779 CET [904] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2022-02-10 11:38:06.779 CET [904] DEBUG:  proc_exit(0): 0 callbacks to make
2022-02-10 11:38:06.779 CET [904] DEBUG:  exit(0)
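As a side note on the HINT above: the two exception codes seen in this thread are standard NTSTATUS values. A tiny lookup for reference, with names as defined in the Windows SDK's ntstatus.h:

```python
# The two NTSTATUS codes appearing in this thread, per ntstatus.h.
NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",      # this run's startup crash
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",  # the crash in the first report
}
print(NTSTATUS[0xC0000005])
```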

 

-----------------------------------

3. Note that we have not found any crash dump for the Postgres service at the location below:

"C:\Users\<USER>\AppData\Local\CrashDumps", where <USER> is the login starting the DB.

 

-----------------------------------

 

4. Is it OK to execute "pg_resetwal" to recover from this situation? Does it pose any risk of data loss?

 

Best Regards,

Ramesh

 

-----Original Message-----

From: Julien Rouhaud <rjuju123@gmail.com>

Sent: Thursday, February 3, 2022 9:45 AM

To: Pragati Agarwal

Cc: pgsql-bugs@lists.postgresql.org; Silaparasetti, Ramesh; Kishore, Nanda - Dell Team

Subject: Re: Postgresql error : PANIC: could not locate a valid checkpoint record

 

 


 

Hi,

 

On Wed, Feb 02, 2022 at 11:29:45PM +0530, Pragati Agarwal wrote:

>
> > The following error "could not locate a valid checkpoint record"
> > occurs on a datastore server running on PostgreSQL 13.1.
> > Datastore is hosted on Windows 2016 and the system has McAfee virus
> > scan installed.
> > This error occurs when server start is attempted.
> >
> > [7804] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1914, 64-bit
> > [8812] LOG:  invalid primary checkpoint record
> > [8812] PANIC:  could not locate a valid checkpoint record
> > [7804] LOG:  startup process (PID 8812) was terminated by exception 0xC0000409
> > [7804] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
> > [7804] LOG:  aborting startup due to startup process failure
> > [7804] LOG:  database system is shut down

 

Did you experience some hardware failure, or do the previous logs show any other problems?

 

Can you send the output of

 

pg_controldata.exe -D $PGDATA

 

where $PGDATA is the directory where the instance is stored.

 

The needed command line is likely to be:

 

"C:\Program Files\PostgreSQL\13\bin\pg_controldata.exe" -D "C:\Program Files\PostgreSQL\13\data"

 

Also, there will be a line of the form:

 

Latest checkpoint's REDO WAL file:    XXXXXXXXXXXXXXXXXXXXXXXX

 

Can you check whether the reported file exists in your pg_wal directory (it's a subdirectory of the $PGDATA directory), and whether its size is 16MB?

 

If yes, please check that the permissions on the file allow access by the user starting the postgres service.  You could also temporarily disable your antivirus to check whether it's blocking access.

 

On Thu, Feb 10, 2022 at 01:13:27PM +0000, Silaparasetti, Ramesh wrote:
>
> 1. Below is the output of the command: <DPA_INSTALL_DIRECTORY>\services\datastore\engine\bin\pg_controldata.exe -D "<DPA_INSTALL_DIRECTORY>\services\datastore\data"
> C:\Program Files\EMC\DPA\services\datastore\engine\bin>pg_controldata.exe -D "F:\datastore\data\data"
> pg_control version number:            1300
> Catalog version number:               202007201
> Database system identifier:           7054941472659574120
> Database cluster state:               shut down
> pg_control last modified:             07.02.2022 14:57:30
> Latest checkpoint location:           9/C80000D8
> Latest checkpoint's REDO location:    9/C80000D8
> Latest checkpoint's REDO WAL file:    0000000100000009000000C8
> [...]
> 2. As you suggested, we verified the value of Latest checkpoint's REDO WAL file: 0000000100000009000000C8.
>
> This WAL file does not exist in the pg_wal directory.
> We have enabled debug logging and below is the logging information from Postgres.
>
> 2022-02-10 11:38:05.675 CET [7916] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1900, 64-bit
> 2022-02-10 11:38:05.679 CET [7916] LOG:  listening on IPv4 address "127.0.0.1", port 9003
> 2022-02-10 11:38:05.681 CET [7916] LOG:  listening on IPv4 address "10.91.198.36", port 9003
> 2022-02-10 11:38:06.756 CET [348] LOG:  database system was shut down at 2022-02-07 14:57:30 CET
> 2022-02-10 11:38:06.756 CET [348] DEBUG:  mapped win32 error code 2 to 2
> 2022-02-10 11:38:06.757 CET [348] DEBUG:  mapped win32 error code 2 to 2
> 2022-02-10 11:38:06.757 CET [348] DEBUG:  could not open file "pg_wal/0000000100000009000000C8": No such file or directory

So, unless you can find that 0000000100000009000000C8 file (and all the files
after that), your instance is corrupted and you lost data.  If you have WAL
archiving or streaming to another location you should be able to recover from
that, assuming that no other files are damaged.  Otherwise your best shot is
restoring from a backup.
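For readers who do have WAL archiving in place, the recovery route mentioned above is roughly: restore a base backup, then point recovery at the archive. A hedged sketch of the relevant configuration (PostgreSQL 12+; the archive path is a placeholder):

```
# postgresql.conf on the restored base backup (archive path is illustrative)
restore_command = 'copy "X:\\wal_archive\\%f" "%p"'   # Windows; use cp on Unix
# plus an empty recovery.signal file in the data directory to trigger recovery
```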

> 4. Is it ok to execute "pg_resetwal" to recover from this situation? Does it pose any data loss ?

pg_resetwal will make the situation worse.  The server will start, but in a
totally inconsistent state.  This should be your last resort; understand
that it will irremediably corrupt your system even more.

At that point you should probably consider hiring some company with postgres
expertise to:

- try to recover some data if possible (it depends on what was the problem,
  what other wals you have and the state of the rest of the files)
- understand what happened
- fix the root problem
- help you setup monitoring, alerting, archiving, backup, HA and other things
  you might need



On Thu, Feb 10, 2022 at 8:45 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Feb 10, 2022 at 01:13:27PM +0000, Silaparasetti, Ramesh wrote:
> >
> > 1. Below is the output of the command: <DPA_INSTALL_DIRECTORY>\services\datastore\engine\bin\pg_controldata.exe -D "<DPA_INSTALL_DIRECTORY>\services\datastore\data"
> > C:\Program Files\EMC\DPA\services\datastore\engine\bin>pg_controldata.exe -D "F:\datastore\data\data"
> > pg_control version number:            1300
> > Catalog version number:               202007201
> > Database system identifier:           7054941472659574120
> > Database cluster state:               shut down
> > pg_control last modified:             07.02.2022 14:57:30
> > Latest checkpoint location:           9/C80000D8
> > Latest checkpoint's REDO location:    9/C80000D8
> > Latest checkpoint's REDO WAL file:    0000000100000009000000C8
> > [...]
> > 2. As you suggested, we verified the value of Latest checkpoint's REDO WAL file: 0000000100000009000000C8.
> >
> > This WAL file does not exist in the pg_wal directory.
> > We have enabled debug logging and below is the logging information from Postgres.
> >
> > 2022-02-10 11:38:05.675 CET [7916] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1900, 64-bit
> > 2022-02-10 11:38:05.679 CET [7916] LOG:  listening on IPv4 address "127.0.0.1", port 9003
> > 2022-02-10 11:38:05.681 CET [7916] LOG:  listening on IPv4 address "10.91.198.36", port 9003
> > 2022-02-10 11:38:06.756 CET [348] LOG:  database system was shut down at 2022-02-07 14:57:30 CET
> > 2022-02-10 11:38:06.756 CET [348] DEBUG:  mapped win32 error code 2 to 2
> > 2022-02-10 11:38:06.757 CET [348] DEBUG:  mapped win32 error code 2 to 2
> > 2022-02-10 11:38:06.757 CET [348] DEBUG:  could not open file "pg_wal/0000000100000009000000C8": No such file or directory
>
> So, unless you can find that 0000000100000009000000C8 file (and all the files
> after that), your instance is corrupted and you lost data.  If you have WAL
> archiving or streaming to another location you should be able to recover from
> that, assuming that no other files are damaged.  Otherwise your best shot is
> restoring from a backup.
>
> > 4. Is it ok to execute "pg_resetwal" to recover from this situation? Does it pose any data loss ?
>
> pg_resetwal will make the situation worse.  The server will start but in a
> totally inconsistent state.  This should be your last choice, and understand
> that it will irremediably corrupt your system even more.

If it were me, I'd take a full filesystem-level backup, and then run
pg_resetwal on the copy (or maybe a copy of the copy).  The damage may or
may not be very significant, and getting a baseline to amend to is
important (or perhaps to pull data from to check against a backup you
hopefully have).  If you go that route, immediately take a pg_dump
and restore it, and you should have something workable, albeit with
missing or incompletely applied transactions.

A pristine copy is important; there may be data sitting in there that
you can recover if it comes to that.

As to *why* this occurred, there are four major things to look at:
*) Storage issues (consider enabling checksums immediately if not already done)
*) O/S issues
*) 3rd-party code running in postgres (may not apply to you. What
non-core extensions are you running?)
*) Postgres bugs.  You are running 13.1; the current release is 13.5.
This is not good, and raises your risk significantly.  ALWAYS apply
bugfixes quickly, doubly so if the minor version is .2 or less: based on
historical context, the severity and number of bugs in x.0, x.1, and x.2
releases tend to be higher.

merlin



RE: Very URGENT REQUEST - Postgresql error : PANIC: could not locate a valid checkpoint record

From: "Agarwal, Pragati - Dell Team"

Hi Merlin, Julien,

Thanks for your response. 

Besides this, there is a public schema in postgres which we are not using in our Datastore service.

As a workaround of an upgrade process, we plan to drop the public schema. We would like to know your thoughts on this
and any detrimental impact on the data or service.
 

Thanks,
Pragati




On Fri, Feb 11, 2022 at 11:19:54AM +0000, Agarwal, Pragati - Dell Team wrote:
> Hi Merlin, Julien
>
> Thanks for your response.
>
> Besides this, there is public schema in postgres which we are not using in
> our Datastore service.
>
> As a workaround of an upgrade process, we plan to drop the public schema. We
> would like to know your thoughts on this and any detrimental impact on the
> data or service.

I'm not sure how that's a workaround, but that schema isn't necessary for
postgres; it's just a schema that's created by default.  You can remove it
without impact as long as you don't rely on it in any of your applications.