Re: Online verification of checksums - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: Online verification of checksums
Date
Msg-id alpine.DEB.2.21.1810301754020.9086@lancre
Whole thread Raw
In response to Re: Online verification of checksums  (Michael Banck <michael.banck@credativ.de>)
Responses Re: Online verification of checksums  (Michael Banck <michael.banck@credativ.de>)
List pgsql-hackers
Hallo Michael,

Patch v6 applies cleanly, compiles, local make check is ok.

>> My current opinion is that when offline some errors are not admissible,
>> whereas the same errors are admissible when online because they may be due
>> to the ongoing database processing, so the behavior should not be strictly
>> the same.
>
> Indeed, the recently-added pg_verify_checksums testsuite

A welcome addition!

> adds a few files with just 'foo' in them and with V5 of the patch, 
> pg_verify_checksums no longer bails out with an error on those.

> I have now re-added the retry logic for partially-read pages, so that it
> bails out if it reads a page partially twice. This makes the testsuite
> work again.
>
> I am not convinced we need to differentiate further between online and
> offline operation, can you explain in more detail which other
> differences are ok in online mode and why?

For instance the "file/directory was removed" do not look okay at all when 
offline, even if unlikely. Moreover, the checks hides the error message 
and is fully silent in this case, while it was not beforehand on the 
same error when offline.

The "check if page was modified since checkpoint" does not look useful 
when offline. Maybe it lacks a comment to say that this cannot (should not 
?) happen when offline, but even then I would not like it to be true: ISTM 
that no page should be allowed to be skipped on the checkpoint condition 
when offline, but it is probably ok to skip with the new page test, which 
make me still think that they should be counted and reported separately, 
or at least the checkpoint skip test should not be run when offline.

When offline, the retry logic does not make much sense, it should complain 
directly on the first error? Also, I'm unsure of the read & checksum retry 
logic *without any delay*.

>> This might suggest some option to tell the command that it should work in
>> online or offline mode, so that it may be stricter in some cases. The
>> default may be one of the option, eg the stricter offline mode, or maybe
>> guessed at startup.
>
> If we believe the operation should be different, the patch removes the
> "is cluster online?" check (as it is no longer necessary), so we could
> just replace the current error message with a global variable with the
> result of that check and use it where needed (if any).

That could let open the issue of someone starting the check offline, and 
then starting the database while it is not finished. Maybe it is not worth 
sweating about such a narrow use case.

If operations are to be different, and it seems to me they should be, I'd 
suggest (1) auto detect default based one the existing "is cluster online" 
code, (2) force options, eg --online vs --offline, which would complain 
and exit if the cluster is not in the right state on startup.

I'd suggest to add a failing checksum online test, if possible. At least a 
"foo" file? It would also be nice if the test could apply on an active 
database, eg with a low-rate pgbench running in parallel to the 
verification, but I'm not sure how easy it is to add such a thing.

-- 
Fabien.


pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Constraint documentation
Next
From: Andres Freund
Date:
Subject: Re: replication_slots usability issue