Re: Logging corruption error codes - Mailing list pgsql-bugs

From Andrey Borodin
Subject Re: Logging corruption error codes
Date
Msg-id FB0BEAE7-F856-44D6-9130-C8EFD964D1D0@yandex-team.ru
In response to Re: Logging corruption error codes  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: Logging corruption error codes  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs

> On 22 July 2019, at 16:16, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2019-06-20 11:57, Andrey Borodin wrote:
>> We are fine-tuning our data corruption monitoring and found out that many corruption cases do not report proper errorcode.
>> This makes automatic log analyzer way too smart program.
>> We think that corruption error codes should be given in cases when B-tree or TOAST do not know how to interpret data.
>> PFA patch with cases that we have found in logs and consider evidence of corruption.
>>
>> Best regards, Andrey Borodin.
>
> Should we use errmsg_internal() in the adjusted calls, so that the error
> messages are not picked up for translation?  I could go either way, but
> it's something that should be considered.

Thanks for looking into this.

From my POV these messages provide meaningful information for coping with corruption, but they are definitely internal.
Translations already cover some information on TOAST chunks, mention btree many times, and refer to many other internal things.
So I'm confused about the status of these messages.
Such messages should be rare enough, and those to whom they are addressed should be familiar with English.

We've encountered a few more cases of messages that potentially follow data corruption. In our test environment, we were experimenting with a custom Linux kernel that had a page cache bug. The bug manifested itself in reappearing stale page versions. This causes various data corruptions, always undetected by data checksums (do we want a Merkle tree?).
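
To illustrate why a stale page version slips past a per-page checksum, here is a minimal Python sketch. It is not PostgreSQL's checksum algorithm: `page_checksum` uses CRC32 as a stand-in, `merkle_root` is a hypothetical helper, and the page contents are made up. The only point is that an old page version is self-consistent with its own checksum, while a root hash over all pages would change.

```python
import hashlib
import zlib


def page_checksum(page: bytes) -> int:
    # Stand-in for a per-page checksum (assumption: CRC32, not PostgreSQL's algorithm).
    return zlib.crc32(page)


def merkle_root(pages: list[bytes]) -> bytes:
    # Hash each page, then hash pairs upward until a single root remains.
    level = [hashlib.sha256(p).digest() for p in pages]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


# Made-up page images: two versions of one page plus an unrelated page.
old_page = b"tuple v1".ljust(64, b"\0")
new_page = b"tuple v2".ljust(64, b"\0")
other_page = b"other".ljust(64, b"\0")

# Each version is stored together with the checksum computed when it was written.
stored_old = (old_page, page_checksum(old_page))

# Root recorded after the newer version was written.
trusted_root = merkle_root([new_page, other_page])

# The buggy page cache hands back the stale version.
page, checksum = stored_old
per_page_ok = page_checksum(page) == checksum                 # stale page still verifies
root_ok = merkle_root([page, other_page]) == trusted_root     # root no longer matches
```

In this sketch `per_page_ok` is true and `root_ok` is false, which is the gap the Merkle tree question above is about. Maintaining such a root on every page write has a cost that this sketch ignores.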

Besides the messages in this patch, we also had:
could not read block 1751 in file "base/16452/358336": Bad address  // Probably not only data corruption, but mostly a hardware fault
t_xmin is uncommitted in tuple to be updated // Probably on-disk corruption
failed to re-find parent key in index // Probably index corruption
left link changed unexpectedly in block // Probably on-disk data corruption
right sibling 45056 of block * is not next child * of block * in index // Definitely index corruption
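
For context, this is roughly what a log analyzer has to do today. The Python sketch below first trusts the SQLSTATE (XX001 data_corrupted and XX002 index_corrupted are the existing corruption codes) and only then falls back to matching message text from the list above. The line layout (`<SQLSTATE> ERROR:  <message>`, as could be produced by putting `%e` in log_line_prefix), the `classify` function, and the sample lines are assumptions for illustration.

```python
import re

# SQLSTATE values PostgreSQL defines for corruption (class XX).
CORRUPTION_SQLSTATES = {"XX001": "data_corrupted", "XX002": "index_corrupted"}

# Message fragments from this thread that arrive without a corruption code.
SUSPECT_PATTERNS = [
    re.compile(r"t_xmin is uncommitted in tuple to be updated"),
    re.compile(r"failed to re-find parent key in index"),
    re.compile(r"left link changed unexpectedly in block"),
    re.compile(r"right sibling \d+ of block \d+ is not next child"),
]

# Assumed line layout: "<SQLSTATE> <LEVEL>:  <message>"
LINE_RE = re.compile(r"^(?P<state>[0-9A-Z]{5}) (?P<level>[A-Z]+):\s+(?P<msg>.*)$")


def classify(line: str) -> str:
    """Return a coarse label for one log line."""
    m = LINE_RE.match(line)
    if not m:
        return "unparsed"
    if m["state"] in CORRUPTION_SQLSTATES:
        return CORRUPTION_SQLSTATES[m["state"]]  # the error code alone is enough
    if any(p.search(m["msg"]) for p in SUSPECT_PATTERNS):
        return "suspect"  # needs text matching, which is the fragile part
    return "other"


# Made-up sample lines in the assumed layout.
samples = [
    'XX002 ERROR:  index "idx" contains corrupted page at block 7',
    'XX000 ERROR:  failed to re-find parent key in index "idx"',
    'XX000 ERROR:  left link changed unexpectedly in block 12 of index "idx"',
    '23505 ERROR:  duplicate key value violates unique constraint "pk"',
]
labels = [classify(s) for s in samples]
```

If the messages above carried a corruption error code, the `SUSPECT_PATTERNS` fallback would not be needed for them.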

Should I add corruption codes for these messages in the patch? Or make a separate discussion about these?

Thanks!

Best regards, Andrey Borodin.

