Thread: PostgreSQL crash and recovery failure

PostgreSQL crash and recovery failure

From:  "xiebin (F)"
Date:

Hi,

 

My database server (PostgreSQL 12.5) uses remote storage for its data directory.

 

I powered off the database server and started PostgreSQL (same version) on another server against the same data directory. However, it failed to start. I checked the WAL and found that the content around the latest checkpoint is empty (about 1700 zero bytes).

 

# pg_controldata -D .

pg_control version number:            1201

Catalog version number:               201909212

Database system identifier:           6967594878781603854

Database cluster state:               in production

pg_control last modified:             Wed 09 Jun 2021 04:23:24 PM CST

Latest checkpoint location:           0/9893978

Latest checkpoint's REDO location:    0/9893940

Latest checkpoint's REDO WAL file:    000000010000000000000009

Latest checkpoint's TimeLineID:       1

...

 

# pg_waldump -p ./pg_wal/ -s 0/9893900

first record is after 0/9893900, at 0/9893908, skipping over 8 bytes

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/09893908, prev 0/098938E0, desc: RUNNING_XACTS nextXid 30176 latestCompletedXid 30175 oldestRunningXid 30176

pg_waldump: fatal: error in WAL record at 0/9893908: invalid record length at 0/9893940: wanted 24, got 0

 

# hexdump -C ./pg_wal/000000010000000000000009 -s 0x89393a -n 10000

0089393a  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

*

00893ffa  00 00 00 00 00 00 01 d1  05 00 01 00 00 00 00 40  |...............@|

0089400a  89 07 00 00 00 00 4b 00  00 00 00 00 00 00 00 00  |......K.........|

...

 

It seems the checkpoint record sitting in the WAL buffers was never flushed to disk, which causes the recovery failure: pg_control says the latest checkpoint is at 0/9893978 with REDO at 0/9893940, but the WAL segment is zeroed from just before 0/9893940 onward, so startup cannot read a valid checkpoint record.

 

The remote storage supports POSIX I/O, meaning the data is durably written to disk as long as PostgreSQL calls fdatasync() (or a similar system call). Please note that the storage server itself was never shut down; only the database server lost power.
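
To make that assumption concrete, here is a minimal C sketch (not PostgreSQL code; the file name is made up) of the POSIX contract I am relying on: once fdatasync() returns 0, the written data should survive a power loss of the client machine, given that the storage server itself stays up.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "some WAL bytes";                   /* placeholder payload */
    int fd = open("walfile", O_WRONLY | O_CREAT, 0600);    /* hypothetical file */

    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) { perror("write"); return 1; }

    /*
     * POSIX: when fdatasync() returns successfully, the file's data has
     * reached stable storage (file metadata may lag behind).  This is the
     * durability guarantee I expect the remote storage to honor.
     */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}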

 

In addition, I traced the system calls made during a PostgreSQL checkpoint (triggered by "psql -c checkpoint").

 

# strace -fvy -s 64 pg_ctl -D /tmp/data start 2>&1 | grep -E 'open|sync'

...

 

[pid 109705] openat(AT_FDCWD, "pg_xact/0000", O_RDWR|O_CREAT, 0600 <unfinished ...>

[pid 109705] <... openat resumed> )     = 3</tmp/data/pg_xact/0000>

[pid 109705] fsync(3</tmp/data/pg_xact/0000> <unfinished ...>

[pid 109705] <... fsync resumed> )      = 0

[pid 109705] openat(AT_FDCWD, "pg_xact", O_RDONLY) = 3</tmp/data/pg_xact>

[pid 109705] fsync(3</tmp/data/pg_xact>) = 0

[pid 109705] openat(AT_FDCWD, "pg_commit_ts", O_RDONLY) = 3</tmp/data/pg_commit_ts>

[pid 109705] fsync(3</tmp/data/pg_commit_ts>) = 0

[pid 109705] openat(AT_FDCWD, "pg_subtrans/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_subtrans/0000>

[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_multixact/offsets/0000>

[pid 109705] fsync(3</tmp/data/pg_multixact/offsets/0000>) = 0

[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets", O_RDONLY) = 3</tmp/data/pg_multixact/offsets>

[pid 109705] fsync(3</tmp/data/pg_multixact/offsets>) = 0

[pid 109705] openat(AT_FDCWD, "pg_multixact/members", O_RDONLY) = 3</tmp/data/pg_multixact/members>

[pid 109705] fsync(3</tmp/data/pg_multixact/members>) = 0

[pid 109705] openat(AT_FDCWD, "pg_logical/snapshots", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/snapshots>

[pid 109705] openat(AT_FDCWD, "pg_logical/mappings", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/mappings>

[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_WRONLY|O_CREAT|O_EXCL, 0600) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>

[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>

[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint.tmp>) = 0

[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>

[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0

[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>

[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0

[pid 109705] openat(AT_FDCWD, "pg_logical", O_RDONLY) = 3</tmp/data/pg_logical>

[pid 109705] fsync(3</tmp/data/pg_logical>) = 0

[pid 109705] openat(AT_FDCWD, "pg_wal/00000001000000000000000B", O_RDWR <unfinished ...>

[pid 109705] <... openat resumed> )     = 3</tmp/data/pg_wal/00000001000000000000000B>

[pid 109705] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B> <unfinished ...>          (call this A)

[pid 109705] <... fdatasync resumed> )  = 0

[pid 109705] openat(AT_FDCWD, "/tmp/data/global/pg_control", O_RDWR <unfinished ...>

[pid 109705] <... openat resumed> )     = 4</tmp/data/global/pg_control>

[pid 109705] fsync(4</tmp/data/global/pg_control> <unfinished ...>                               (call this B)

[pid 109705] <... fsync resumed> )      = 0

[pid 109705] openat(AT_FDCWD, "pg_wal", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_wal>

[pid 109705] openat(AT_FDCWD, "pg_subtrans", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_subtrans>

[pid 109707] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B>) = 0                        (call this C)

 

...

 

From this output I can see that the data files, the pg_control file, and the WAL are all flushed to disk. I also noticed that pg_control is fsync'd between two WAL flushes (the calls labeled A, B, and C above).
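
To restate that ordering (which question 1 below hinges on) as a sketch — this is just my reconstruction from the trace, not PostgreSQL source code, and the file names are copied from the trace for illustration:

#include <fcntl.h>
#include <unistd.h>

/* Reconstruction of the checkpoint flush ordering observed above. */
static void observed_checkpoint_sequence(void)
{
    int wal = open("pg_wal/00000001000000000000000B", O_RDWR);
    int ctl = open("global/pg_control", O_RDWR);

    fdatasync(wal);   /* A: the WAL segment is flushed; if the checkpoint
                       *    record was already written into it, it should be
                       *    on disk after this call. */

    fsync(ctl);       /* B: pg_control, which now points at the new
                       *    checkpoint location, is flushed. */

    fdatasync(wal);   /* C: the WAL segment is flushed again later (by a
                       *    different process in the trace).  A power-off
                       *    between B and C only loses the checkpoint record
                       *    if that record had not yet reached disk by A. */

    close(wal);
    close(ctl);
}

int main(void) { observed_checkpoint_sequence(); return 0; }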

 

Here are my questions:

1.  If the power-off happened between B and C, could the checkpoint have been recorded in pg_control while its record never reached the WAL file on disk, so that PostgreSQL recovery fails?

2.  Is there any other possible cause of this problem?

3.  Is there any way to recover the data? I know pg_resetwal can reset the WAL so that PostgreSQL starts again, but that can undoubtedly cause some data loss.
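
For question 3, this is roughly how I would approach pg_resetwal if nothing better turns up (shown for illustration only; run from the data directory like the pg_controldata example above, and only after taking a full file-level copy of the cluster):

# pg_resetwal -n -D .        (dry run: print the values that would be used, change nothing)

# pg_resetwal -f -D .        (force the reset; any transactions whose WAL is lost are gone)

My understanding is that after a forced reset the cluster may be internally inconsistent, so the usual advice is to dump and reload the databases afterwards.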

 

Thanks,

Xie Bin

Re: PostgreSQL crash and recovery failure

From:  Ron
Date:
On 6/10/21 10:30 PM, xiebin (F) wrote:

Hi,

 

My database server (PostgreSQL 12.5) uses remote storage for its data directory.

 

I powered off the database server and started PostgreSQL (same version) on another server against the same data directory. However, it failed to start. I checked the WAL and found that the content around the latest checkpoint is empty (about 1700 zero bytes).

[snip]
 

From this output I can see that the data files, the pg_control file, and the WAL are all flushed to disk. I also noticed that pg_control is fsync'd between two WAL flushes (the calls labeled A, B, and C above).

 

Here are my questions:

1.  If the power-off happened between B and C, could the checkpoint have been recorded in pg_control while its record never reached the WAL file on disk, so that PostgreSQL recovery fails?

2.  Is there any other possible cause of this problem?


Is your PostgreSQL instance configured to use fsync?
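
For example, you can check it with:

# psql -c "SHOW fsync;"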

3.  Is there any way to recover the data? I know pg_resetwal can reset the WAL so that PostgreSQL starts again, but that can undoubtedly cause some data loss.


--
Angular momentum makes the world go 'round.