Re: Online verification of checksums - Mailing list pgsql-hackers

From: Michael Paquier
Subject: Re: Online verification of checksums
Date: 2020-11-21 01:30:03
Msg-id: 20201121013003.GC6052@paquier.xyz
In response to: Re: Online verification of checksums (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Online verification of checksums (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
           Re: Online verification of checksums (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (david@pgmasters.net) wrote:
>> Our current plan for pgBackRest:
>>
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
>
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrows the possibility by a lot.
>
> *don't :)

Have you considered the possibility of only using pd_checksum for the
validation?  This is the only source of truth in the page header we
can rely on to validate the full contents of the page, so if the logic
relies on anything other than the checksum, it risks either reporting
pages as corrupted when they were merely torn, or missing pages that
are actually corrupted.  Both are bad, and both are what we should
avoid for such things.
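
To make clear what I mean by relying only on the checksum, here is a
rough sketch of a frontend-style check.  verify_page() is a name I am
making up here; pg_checksum_page() and the includes come from the
PostgreSQL tree, the way pg_checksums does it:

#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

/* Returns true if the page passes, assuming data checksums are enabled. */
bool
verify_page(char *page, BlockNumber blkno)
{
	/* New pages have no checksum yet, nothing to verify. */
	if (PageIsNew((Page) page))
		return true;

	/*
	 * Recompute the checksum over the full BLCKSZ bytes and compare it
	 * with pd_checksum; no other header field is trusted.
	 */
	return pg_checksum_page(page, blkno) == ((PageHeader) page)->pd_checksum;
}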

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
>
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
>
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
>
> +1.
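
To make sure we are talking about the same thing, here is how I read
the recheck suggested above, only as a sketch: read_block(),
verify_page(), max_lsn and the result values are all made up, with
max_lsn standing for the maximum LSN reported by PostgreSQL.

#include "postgres_fe.h"

#include "storage/bufpage.h"

extern bool read_block(int fd, BlockNumber blkno, char *page);	/* pread() wrapper */
extern bool verify_page(char *page, BlockNumber blkno);		/* checksum check */

typedef enum
{
	PAGE_OK,
	PAGE_SKIPPED_CONCURRENT,	/* "skipped due to concurrent modifications" */
	PAGE_CORRUPTED
} page_status;

static page_status
recheck_page(int fd, BlockNumber blkno, XLogRecPtr max_lsn)
{
	char		page[BLCKSZ];
	XLogRecPtr	first_lsn;

	if (!read_block(fd, blkno, page))
		return PAGE_CORRUPTED;		/* read error, simplified away here */
	if (verify_page(page, blkno))
		return PAGE_OK;

	first_lsn = PageGetLSN((Page) page);

	/* Retry: the failure may come from a write in progress. */
	if (!read_block(fd, blkno, page))
		return PAGE_CORRUPTED;
	if (verify_page(page, blkno))
		return PAGE_OK;

	/*
	 * An ascending LSN that still looks plausible (not past the maximum
	 * LSN reported by PostgreSQL) is taken as a hot page, not corruption.
	 */
	if (PageGetLSN((Page) page) > first_lsn &&
		PageGetLSN((Page) page) <= max_lsn)
		return PAGE_SKIPPED_CONCURRENT;

	return PAGE_CORRUPTED;
}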

I don't quite understand how you can make sure that the page is not
corrupted here.  It is possible that the last 4kB of an 8kB page got
corrupted while the header still holds valid data, so the checksum
verification fails even though the header looks sane.  If you are not
careful, you could end up with a corrupted page being discarded
because it failed the retry multiple times in a row.  The only method
I can think of as being really reliable is based on two points:
- Do the check only on pd_checksum, as that validates the full
contents of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity on the page in shared buffers.  This requires an API we don't
have yet.
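
Put together, and purely as an illustration, that would look like the
sketch below, where wait_for_no_buffer_io() stands in for the API we
don't have yet, and read_block()/verify_page() are the same made-up
helpers as above:

#include "postgres_fe.h"

#include "storage/bufpage.h"

extern bool read_block(int fd, BlockNumber blkno, char *page);	/* pread() wrapper */
extern bool verify_page(char *page, BlockNumber blkno);		/* checksum-only check */
extern void wait_for_no_buffer_io(BlockNumber blkno);		/* missing API, placeholder */

static bool
check_block(int fd, BlockNumber blkno)
{
	char		page[BLCKSZ];

	if (!read_block(fd, blkno, page))
		return false;

	/* First point: judge the page on pd_checksum alone. */
	if (verify_page(page, blkno))
		return true;

	/*
	 * Second point: before retrying, wait until no concurrent I/O is in
	 * progress on this block, so that a second failure cannot be a torn
	 * read anymore.
	 */
	wait_for_no_buffer_io(blkno);

	if (!read_block(fd, blkno, page))
		return false;

	return verify_page(page, blkno);	/* false here means corruption */
}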
--
Michael
