This is a follow-up to
https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com; I am at the same company as the original poster.
Our architecture is similar, but all of the servers are now on ZFS, running Postgres 13.8 on Ubuntu 18.04+, and still doing streaming replication. All have ECC memory and 26-64 cores with 192 GB+ of RAM, on top of a zpool made of NVMe PCIe SSDs.
A101 (primary) -> A201 (replica) -> B101 (primary) -> B201 (replica).
We are seeing this error occur about once per week (across all postgres clusters/chains). It is the same pattern we have been seeing for a number of years now.
Possibly relevant configuration options:
wal_init_zero=on
The last time this occurred, I grabbed the good WAL file from the parent and the corrupted WAL file from the descendant. Comparing them showed no differences until 12 MB in, where the "good" WAL file continued and the "bad" WAL file was zeros through to the end of the file.
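For reference, the divergence offset can be found with cmp(1). The file names below are hypothetical stand-ins that just reproduce the pattern described above (identical for the first 12 MB, then the bad copy is zeros to the end of a 16 MB segment), not the actual WAL segments:

```shell
# Build two 16 MB stand-in files: identical 12 MB prefix, then the "bad"
# copy is zero-filled to the end while the "good" copy has real data.
head -c $((12 * 1024 * 1024)) /dev/urandom > prefix.bin
{ cat prefix.bin; head -c $((4 * 1024 * 1024)) /dev/urandom; } > good.wal
{ cat prefix.bin; head -c $((4 * 1024 * 1024)) /dev/zero; }   > bad.wal

# cmp reports the 1-based offset of the first differing byte; here it falls
# just past the 12 MB mark (byte 12582912 is the last identical one).
cmp good.wal bad.wal || true
```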
When the "good" WAL file is copied from the parent to the descendant, replication resumes and the descendant becomes a healthy replica once again.
I am almost done with a tool that tails the postgresql journal, walks up the chain to the primary, derives the WAL file name from the LSN in the error message, and then syncs the good WAL file from the parent over the bad one. But I hope I can track down the actual bug here, rather than relying on a process that fixes this after the fact.
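For the LSN-to-filename step, the mapping that the server-side pg_walfile_name() performs can be sketched in Python. This assumes the default 16 MB wal_segment_size and a known timeline ID, and it ignores pg_walfile_name()'s quirk of returning the previous segment for an LSN that sits exactly on a segment boundary:

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def walfile_name(timeline: int, lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Map an LSN string like '1A/2B000028' to its WAL segment file name,
    mirroring pg_walfile_name() (boundary quirk aside)."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    byte_pos = (hi << 32) | lo
    segno = byte_pos // seg_size
    # Number of segments per 4 GB "xlogid" unit; 256 for 16 MB segments.
    segs_per_id = 0x100000000 // seg_size
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)

print(walfile_name(1, "1A/2B000028"))  # -> 000000010000001A0000002B
```

This matches SELECT pg_walfile_name('1A/2B000028') on timeline 1.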
--
John Bolliger