Thread: Online verification of checksums

Online verification of checksums

From
Michael Banck
Date:
Hi,

v10 almost added online activation of checksums, but all we've got is
pg_verify_checksums, i.e. offline verification of checksums.

However, we also got (online) checksum verification during base backups,
and I have ported/adapted David Steele's recheck code to my personal
fork of pg_checksums[1], removed the check that prevents verification
from running against an online cluster, and that seems to work fine.

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

I've tested this in a tight loop (while true; do pg_verify_checksums -D
data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
done", which I already used to develop the original code in the fork and
which brought up a few bugs.

I got one checksum verification failure this way; all others were
caught by the recheck (I've introduced a 500ms delay for the first ten
failures), like this:

|pg_verify_checksums: checksum verification failed on first attempt in
|file "data1/base/16837/16850", block 7770: calculated checksum 785 but
|expected 5063
|pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
|verified ok on recheck
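
In rough outline, the recheck amounts to something like the following
(a simplified sketch, not the actual fork/patch code; check_block() and
the fixed 500ms delay are illustrative, and the "first ten failures"
bookkeeping is omitted):

    /*
     * Sketch only -- assumes the usual pg_verify_checksums includes
     * (postgres_fe.h, storage/bufpage.h, storage/checksum_impl.h).
     */
    static void
    check_block(int fd, char *buf, BlockNumber blockno, int segmentno)
    {
        PageHeader  header = (PageHeader) buf;
        uint16      csum;

        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
        if (csum == header->pd_checksum)
            return;                         /* ok on the first attempt */

        /* Possibly a torn read: wait briefly, then re-read the block once. */
        pg_usleep(500 * 1000L);             /* the 500ms delay mentioned above */
        if (pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ) != BLCKSZ)
            return;                         /* short reads are handled separately */

        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
        if (csum != header->pd_checksum)
            fprintf(stderr, "checksum verification failed in block %u\n", blockno);
        else
            fprintf(stderr, "block %u verified ok on recheck\n", blockno);
    }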

However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
failures like this:

|pg_verify_checksums: short read of block 2644 in file
|"data1/base/16637/16650", got only 4096 bytes

This is not strictly a verification failure; should we do anything about
it? In my fork, I am also rechecking on short reads[3] (and I am happy to
extend the patch that way), but that makes the code and the patch more
complicated, and I wanted to check the general opinion on this case
first.
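
One way to handle it, roughly what the fork does per [3] (sketch only,
not the actual code), is to treat a short read like a possibly torn or
concurrently extended block and retry once before complaining:

    ssize_t     r;

    r = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);
    if (r >= 0 && r < BLCKSZ)
    {
        /* The relation may be in the middle of being extended; retry once. */
        pg_usleep(500 * 1000L);
        r = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);
        if (r >= 0 && r < BLCKSZ)
            fprintf(stderr, "short read of block %u, got only %zd bytes\n",
                    blockno, r);
    }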


Michael

[1] https://github.com/credativ/pg_checksums/commit/dc052f0d6f1282d3c8215b0eb28b8e7c4e74f9e5
[2] while patching out the somewhat unhelpful (in regular operation,
anyway) debug message for every successful checksum verification
[3] https://github.com/credativ/pg_checksums/blob/master/pg_checksums.c#L160
-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Re: Online verification of checksums

From
Peter Eisentraut
Date:
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command?
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Magnus Hagander
Date:
On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 26/07/2018 13:59, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
>
> Why not provide this functionality as a server function or command?
> Then you can access blocks with proper locks and don't have to do this
> rather ad hoc retry logic on concurrent access.

I think it would make sense to provide this functionality in the "checksum worker" infrastructure suggested in the online checksum enabling patch. But I think being able to run it from the outside would also be useful, particularly when it's this simple.

But why do we need a sleep in it? AFAICT this is basically the same code that we have in basebackup.c, and that one does not need the sleep? Certainly 500ms would be very long since we're just protecting against a torn page, but the comment is wrong I think, and we're actually sleeping 0.5ms?
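
(For the archives: pg_usleep() takes microseconds, so if the patch does
something like the first line below it sleeps for 0.5ms; an actual 500ms
delay would be the second line.)

    pg_usleep(500);         /* 0.5ms -- the argument is in microseconds */
    pg_usleep(500 * 1000L); /* 500ms */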


Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Patch does not seem to apply anymore, could you rebase it?

-- 
Fabien.


Re: Online verification of checksums

From
Tomas Vondra
Date:
Hi,

The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

    startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

   if (!PageIsNew(page) && PageGetLSN(page) < startptr)
   {
       ... verify the page checksum
   }

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance. But the startptr
detail is pretty important because it supports this retry reasoning:

    /*
     * Retry the block on the first failure.  It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us.  We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass.  If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe both [A1,B2] or [B1,A2], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

    write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

The basebackup relies on this kinda implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).


FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
> 
> The very first thing basebackup does is this:
> 
>     startptr = do_pg_start_backup(...);
> 
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
> 
>    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>    {
>        ... verify the page checksum
>    }
> 
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance. 

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?
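
For context, something like the following would make the last checkpoint
LSN available essentially for free, since pg_verify_checksums already
reads the control file at startup (sketch; whether checkPoint or
checkPointCopy.redo is the more appropriate field is part of the
question):

    /* Sketch -- needs common/controldata_utils.h and catalog/pg_control.h. */
    bool            crc_ok;
    ControlFileData *ControlFile;
    XLogRecPtr      checkpointLSN;

    ControlFile = get_controlfile(DataDir, progname, &crc_ok);
    if (!crc_ok)
    {
        fprintf(stderr, "%s: pg_control CRC value is incorrect\n", progname);
        exit(1);
    }
    checkpointLSN = ControlFile->checkPoint;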

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

> But the startptr detail is pretty important because it supports this
> retry reasoning:
> 
>     /*
>      * Retry the block on the first failure.  It's
>      * possible that we read the first 4K page of the
>      * block just before postgres updated the entire block
>      * so it ends up looking torn to us.  We only need to
>      * retry once because the LSN should be updated to
>      * something we can ignore on the next pass.  If the
>      * error happens again then it is a true validation
>      * failure.
>      */
> 
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
> 
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe both [A1,B2] or [B1,A2], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
> 
> Pretty much the only thing you can rely on is that when one process does
> 
>     write([B1,B2])
> 
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
> 
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

Well, ok.
 
> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
> 
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?
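
Concretely, something like this (just a sketch using the surrounding
scan loop's variables, and assuming checkpointLSN was taken from
pg_control as mentioned above; whether this is actually sufficient is
exactly what is being discussed):

    /*
     * Verify every block; on a retry failure only report it if the page's
     * LSN still precedes the last checkpoint, otherwise count it as skipped.
     */
    csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
    if (csum != header->pd_checksum)
    {
        /* Re-read once, in case we caught a concurrent write in the middle. */
        if (pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ) == BLCKSZ)
            csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);

        if (csum != header->pd_checksum)
        {
            if (PageGetLSN((Page) buf) > checkpointLSN)
                skippedblocks++;    /* page is being rewritten, don't report */
            else
                badblocks++;        /* genuine verification failure */
        }
    }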

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well.  Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?
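
For the pg_internal.init.606-style temporary files, a minimal approach
would be to extend skipfile() to also match temporary variants of the
skip[] entries (sketch; whether such a blanket prefix match is actually
safe is the question above):

    static const char *const skip[] = {
        "pg_control",
        "pg_filenode.map",
        "pg_internal.init",
        "PG_VERSION",
        NULL,
    };

    static bool
    skipfile(const char *fn)
    {
        const char *const *f;

        for (f = skip; *f; f++)
        {
            if (strcmp(*f, fn) == 0)
                return true;
            /* also skip temporary variants such as "pg_internal.init.606" */
            if (strncmp(fn, *f, strlen(*f)) == 0 && fn[strlen(*f)] == '.')
                return true;
        }
        return false;
    }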


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> > Obviously, pg_verify_checksums can't do that easily because it's
> > supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?

Right, basebackup doesn't need to look at other pages.

> > The pg_verify_checksum apparently ignores this skip logic, because on
> > the retry it simply re-reads the page again, verifies the checksum and
> > reports an error. Which is broken, because the newly read page might be
> > torn again due to a concurrent write.
>
> Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

> > So IMHO this should do something similar to basebackup - check the page
> > LSN, and if it changed then skip the page.
> >
> > I'm afraid this requires using the last checkpoint LSN, the way startptr
> > is used in basebackup. In particular we can't simply remember LSN from
> > the first read, because we might actually read [B1,A2] on the first try,
> > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> > torn in various other ways, not necessarily at the 4kB boundary - it
> > might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here.  Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.

Not sure what the point of this is.  If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there.  That's something
we've talked about doing with pgBackRest, but don't currently.

> > FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> > to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

> Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
> or some such file which pg_verify_checksums gets confused on, I guess we
> should skip that as well.  Can we assume that all files that start with
> the ones in skip[] are safe to skip or should we have an exception for
> files starting with pg_internal.init?

Everything listed in skip is safe to skip on a restore..  I've not
really thought too much about if they're all safe to skip when checking
checksums for an online system, but I would generally think so..

Thanks!

Stephen


Re: Online verification of checksums

From
Tomas Vondra
Date:
On 09/17/2018 04:46 PM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Banck (michael.banck@credativ.de) wrote:
>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>> Obviously, pg_verify_checksums can't do that easily because it's
>>> supposed to run from outside the database instance. 
>>
>> It reads pg_control anyway, so couldn't we just take
>> ControlFile->checkPoint?
>>
>> Other than that, basebackup.c seems to only look at pages which haven't
>> been changed since the backup starting checkpoint (see above if
>> statement). That's reasonable for backups, but is it just as reasonable
>> for online verification?
> 
> Right, basebackup doesn't need to look at other pages.
> 
>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>> the retry it simply re-reads the page again, verifies the checksum and
>>> reports an error. Which is broken, because the newly read page might be
>>> torn again due to a concurrent write.
>>
>> Well, ok.
> 
> The newly read page will have an updated LSN though then on the re-read,
> in which case basebackup can know that what happened was a rewrite of
> the page and it no longer has to care about the page and can skip it.
> 
> I haven't looked, but if basebackup isn't checking the LSN again for the
> newly read page then that'd be broken, but I believe it does (at least,
> that's the algorithm we came up with for pgBackRest, and I know David
> shared that when the basebackup code was being written).
> 

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

>>> So IMHO this should do something similar to basebackup - check the page
>>> LSN, and if it changed then skip the page.
>>>
>>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>>> is used in basebackup. In particular we can't simply remember LSN from
>>> the first read, because we might actually read [B1,A2] on the first try,
>>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>>> torn in various other ways, not necessarily at the 4kB boundary - it
>>> might be torn right after the LSN, for example).
>>
>> I'd prefer to come up with a plan where we don't just give up once we
>> see a new LSN, if possible. If I run a modified pg_verify_checksums
>> which skips on newer pages in a tight benchmark, basically everything
>> gets skipped as checkpoints don't happen often enough.
> 
> I'm really not sure how you expect to be able to do something different
> here.  Even if we started poking into shared buffers, all you'd be able
> to see is that there's a bunch of dirty pages- and we don't maintain the
> checksums in shared buffers, so it's not like you could verify them
> there.
> 
> You could possibly have an option that says "force a checkpoint" but,
> honestly, that's really not all that interesting either- all you'd be
> doing is forcing all the pages to be written out from shared buffers
> into the kernel cache and then reading them from there instead, it's not
> like you'd actually be able to tell if there was a disk/storage error
> because you'll only be looking at the kernel cache.
> 

Yeah.

>> So how about we do check every page, but if one fails on retry, and the
>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> 
> I thought that's what basebackup did- if it doesn't do that today, then
> it really should.
> 

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

>> In any case, if we decide we really should skip the page if it is newer
>> than the checkpoint, I think it makes sense to track those skipped pages
>> and print their number out at the end, if there are any.
> 
> Not sure what the point of this is.  If we wanted to really do something
> to cross-check here, we'd track the pages that were skipped and then
> look through the WAL to make sure that they're there.  That's something
> we've talked about doing with pgBackRest, but don't currently.
> 

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>> to protect against anything, really.
>>
>> Well, I've noticed that without it I get sporadic checksum failures on
>> reread, so I've added it to make them go away. It was certainly a
>> phenomenological decision that I am happy to trade for a better one.
> 
> That then sounds like we really aren't re-checking the LSN, and we
> really should be, to avoid getting these sporadic checksum failures on
> reread..
> 

Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Tomas Vondra
Date:

On 09/17/2018 04:04 PM, Michael Banck wrote:
> Hi,
> 
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>> The patch is mostly copying the verification / retry logic from
>> basebackup.c, but I think it omitted a rather important detail that
>> makes it incorrect in the presence of concurrent writes.
>>
>> The very first thing basebackup does is this:
>>
>>     startptr = do_pg_start_backup(...);
>>
>> i.e. it waits for a checkpoint, remembering the LSN. And then when
>> checking a page it does this:
>>
>>    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>>    {
>>        ... verify the page checksum
>>    }
>>
>> Obviously, pg_verify_checksums can't do that easily because it's
>> supposed to run from outside the database instance. 
> 
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
> 
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?
> 

I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking a
larger part of the database.
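
That would be cheap to do from the outside as well, e.g. by re-reading
pg_control every so often during the scan (sketch, reusing the
get_controlfile() call pg_verify_checksums already makes at startup):

    /* Sketch: refresh the checkpoint LSN, e.g. once per scanned file. */
    static XLogRecPtr
    refresh_checkpoint_lsn(const char *datadir, const char *progname)
    {
        bool            crc_ok;
        ControlFileData *cf = get_controlfile(datadir, progname, &crc_ok);
        XLogRecPtr      lsn = crc_ok ? cf->checkPoint : InvalidXLogRecPtr;

        pfree(cf);              /* get_controlfile() returns a palloc'd copy */
        return lsn;             /* caller decides how to treat a bad CRC */
    }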

>> But the startptr detail is pretty important because it supports this
>> retry reasoning:
>>
>>     /*
>>      * Retry the block on the first failure.  It's
>>      * possible that we read the first 4K page of the
>>      * block just before postgres updated the entire block
>>      * so it ends up looking torn to us.  We only need to
>>      * retry once because the LSN should be updated to
>>      * something we can ignore on the next pass.  If the
>>      * error happens again then it is a true validation
>>      * failure.
>>      */
>>
>> Imagine the 8kB page as two 4kB pages, with the initial state being
>> [A1,A2] and another process over-writing it with [B1,B2]. If you read
>> the 8kB page, what states can you see?
>>
>> I don't think POSIX provides any guarantees about atomicity of the write
>> calls (and even if it does, the filesystems on Linux don't seem to). So
>> you may observe both [A1,B2] or [B1,A2], or various inconsistent mixes
>> of the two versions, depending on timing. Well, torn pages ...
>>
>> Pretty much the only thing you can rely on is that when one process does
>>
>>     write([B1,B2])
>>
>> the other process may first read [A1,B2], but the next read will return
>> [B1,B2] (or possibly newer data, if there was another write). It will
>> not read the "stale" A1 again.
>>
>> The basebackup relies on this kinda implicitly - on the retry it'll
>> notice the LSN changed (thanks to the startptr check), and the page will
>> be skipped entirely. This is pretty important, because the new page
>> might be torn in some other way.
>>
>> The pg_verify_checksum apparently ignores this skip logic, because on
>> the retry it simply re-reads the page again, verifies the checksum and
>> reports an error. Which is broken, because the newly read page might be
>> torn again due to a concurrent write.
> 
> Well, ok.
>  
>> So IMHO this should do something similar to basebackup - check the page
>> LSN, and if it changed then skip the page.
>>
>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>> is used in basebackup. In particular we can't simply remember LSN from
>> the first read, because we might actually read [B1,A2] on the first try,
>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>> torn in various other ways, not necessarily at the 4kB boundary - it
>> might be torn right after the LSN, for example).
> 
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.
> 

But in that case the checksums are verified when reading the buffer into
shared buffers; it's not like we don't notice the checksum error at all.
We are interested in the pages that have not been read/written for an
extended period of time. So I think this is not a problem.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> 

Hmmm, maybe.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.
> 

I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.

>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>> to protect against anything, really.
> 
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.
> 

My guess is this happened because both the read and re-read completed
during the same write.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/17/2018 04:46 PM, Stephen Frost wrote:
> > * Michael Banck (michael.banck@credativ.de) wrote:
> >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> >>> Obviously, pg_verify_checksums can't do that easily because it's
> >>> supposed to run from outside the database instance.
> >>
> >> It reads pg_control anyway, so couldn't we just take
> >> ControlFile->checkPoint?
> >>
> >> Other than that, basebackup.c seems to only look at pages which haven't
> >> been changed since the backup starting checkpoint (see above if
> >> statement). That's reasonable for backups, but is it just as reasonable
> >> for online verification?
> >
> > Right, basebackup doesn't need to look at other pages.
> >
> >>> The pg_verify_checksum apparently ignores this skip logic, because on
> >>> the retry it simply re-reads the page again, verifies the checksum and
> >>> reports an error. Which is broken, because the newly read page might be
> >>> torn again due to a concurrent write.
> >>
> >> Well, ok.
> >
> > The newly read page will have an updated LSN though then on the re-read,
> > in which case basebackup can know that what happened was a rewrite of
> > the page and it no longer has to care about the page and can skip it.
> >
> > I haven't looked, but if basebackup isn't checking the LSN again for the
> > newly read page then that'd be broken, but I believe it does (at least,
> > that's the algorithm we came up with for pgBackRest, and I know David
> > shared that when the basebackup code was being written).
>
> Yes, basebackup does check the LSN on re-read, and skips the page if it
> changed on re-read (because it eliminates the consistency guarantees
> provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'.  The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

> >> So how about we do check every page, but if one fails on retry, and the
> >> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> >
> > I thought that's what basebackup did- if it doesn't do that today, then
> > it really should.
>
> The crucial distinction here is that the trick is not in comparing LSNs
> from the two page reads, but comparing it to the checkpoint LSN. If it's
> greater, the page may be torn or broken, and there's no way to know
> which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it.  IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

> >> In any case, if we decide we really should skip the page if it is newer
> >> than the checkpoint, I think it makes sense to track those skipped pages
> >> and print their number out at the end, if there are any.
> >
> > Not sure what the point of this is.  If we wanted to really do something
> > to cross-check here, we'd track the pages that were skipped and then
> > look through the WAL to make sure that they're there.  That's something
> > we've talked about doing with pgBackRest, but don't currently.
>
> I agree simply printing the page numbers seems rather useless. What we
> could do is remember which pages we skipped and then try checking them
> after another checkpoint. Or something like that.

I'm still not sure I'm seeing the point of that.  They're still going to
almost certainly be in the kernel cache.  The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point.  I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it.  I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

> >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> >>> to protect against anything, really.
> >>
> >> Well, I've noticed that without it I get sporadic checksum failures on
> >> reread, so I've added it to make them go away. It was certainly a
> >> phenomenological decision that I am happy to trade for a better one.
> >
> > That then sounds like we really aren't re-checking the LSN, and we
> > really should be, to avoid getting these sporadic checksum failures on
> > reread..
>
> Again, it's not enough to check the LSN against the preceding read. We
> need a checkpoint LSN or something like that.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway).  The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that..  I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

Thanks!

Stephen


Re: Online verification of checksums

From
Tomas Vondra
Date:
Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck (michael.banck@credativ.de) wrote:
>>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>>>> Obviously, pg_verify_checksums can't do that easily because it's
>>>>> supposed to run from outside the database instance. 
>>>>
>>>> It reads pg_control anyway, so couldn't we just take
>>>> ControlFile->checkPoint?
>>>>
>>>> Other than that, basebackup.c seems to only look at pages which haven't
>>>> been changed since the backup starting checkpoint (see above if
>>>> statement). That's reasonable for backups, but is it just as reasonable
>>>> for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
>>>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>>>> the retry it simply re-reads the page again, verifies the checksum and
>>>>> reports an error. Which is broken, because the newly read page might be
>>>>> torn again due to a concurrent write.
>>>>
>>>> Well, ok.
>>>
>>> The newly read page will have an updated LSN though then on the re-read,
>>> in which case basebackup can know that what happened was a rewrite of
>>> the page and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
> 
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'.  The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
> 

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails


>>>> So how about we do check every page, but if one fails on retry, and the
>>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but comparing it to the checkpoint LSN. If it's
>> greater, the page may be torn or broken, and there's no way to know
>> which case it is - so basebackup simply skips it.
> 
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it.  IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
> 

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

>>>> In any case, if we decide we really should skip the page if it is newer
>>>> than the checkpoint, I think it makes sense to track those skipped pages
>>>> and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is.  If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there.  That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
> 
> I'm still not sure I'm seeing the point of that.  They're still going to
> almost certainly be in the kernel cache.  The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
> 
> Maybe to put it another way- there's very little point in checking the
> checksum of a page which we know must be re-written during recovery to
> get to a consistent point.  I don't think it hurts in the general case,
> but I wouldn't write a lot of code which then needs to be tested to
> handle it.  I also don't think that we really need to make
> pg_verify_checksum spend lots of extra cycles trying to verify that
> *every* page had its checksum validated when we know that lots of pages
> are going to be in memory marked dirty and our checking of them will be
> ultimately pointless as they'll either be written out by the
> checkpointer or some other process, or we'll replay them from the WAL if
> we crash.
> 

Yeah, I agree.

>>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>>>> to protect against anything, really.
>>>>
>>>> Well, I've noticed that without it I get sporadic checksum failures on
>>>> reread, so I've added it to make them go away. It was certainly a
>>>> phenomenological decision that I am happy to trade for a better one.
>>>
>>> That then sounds like we really aren't re-checking the LSN, and we
>>> really should be, to avoid getting these sporadic checksum failures on
>>> reread..
>>
>> Again, it's not enough to check the LSN against the preceding read. We
>> need a checkpoint LSN or something like that.
> 
> I actually tend to disagree with you that, for this purpose, it's
> actually necessary to check against the checkpoint LSN- if the LSN
> changed and everything is operating correctly then the new LSN must be
> more recent than the last checkpoint location or things are broken
> badly.
> 

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

> Now, that said, I do think it's a good *idea* to check against the
> checkpoint LSN (presuming this is for online checking of checksums- for
> basebackup, we could just check against the backup-start LSN as anything
> after that point will be rewritten by WAL anyway).  The reason that I
> think it's a good idea to check against the checkpoint LSN is that we'd
> want to throw a big warning if the kernel is just feeding us random
> garbage on reads and only finding a difference between two reads isn't
> really doing any kind of validation, whereas checking against the
> checkpoint-LSN would at least give us some idea that the value being
> read isn't completely ridiculous.
> 
> When it comes to if the pg_sleep() is necessary or not, I have to admit
> to being unsure about that..  I could see how it might be but it seems a
> bit surprising- I'd probably want to see exactly what the page was at
> the time of the failure and at the time of the second (no-sleep) re-read
> and then after a delay and convince myself that it was just an unlucky
> case of being scheduled in twice to read that page before the process
> writing it out got a chance to finish the write.
> 

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > Ok, good, though I'm not sure what you mean by 'eliminates the
> > consistency guarantees provided by the checkpoint'.  The point is that
> > the page will be in the WAL and the WAL will be replayed during the
> > restore of the backup.
>
> The checkpoint guarantees that the whole page was written and flushed to
> disk with an LSN before the checkpoint LSN. So when you read a page with
> that LSN, you know the whole write already completed and a read won't
> return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

> Without the checkpoint that's not guaranteed, and simply re-reading the
> page and rechecking it vs. the first read does not help:
>
> 1) write the first 512B of the page (sector), which includes the LSN
>
> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>
> 3) the checksum verification fails
>
> 4) read the page again (possibly reading a bit more new data)
>
> 5) the LSN did not change compared to the first read, yet the checksum
> still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely.  Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

> > Sure, because we don't care about it any longer- that page isn't
> > interesting because the WAL will replay over it.  IIRC it actually goes
> > something like: check the checksum, if it failed then check if the LSN
> > is greater than the checkpoint (of the backup start..), if not, then
> > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > LSN is the same then throw an error.
>
> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

> > I actually tend to disagree with you that, for this purpose, it's
> > actually necessary to check against the checkpoint LSN- if the LSN
> > changed and everything is operating correctly then the new LSN must be
> > more recent than the last checkpoint location or things are broken
> > badly.
>
> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>
> I'm pretty sure that's not the case. The thing is - the LSN may not
> change between the two reads, but that's not a guarantee the page was
> not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

> > Now, that said, I do think it's a good *idea* to check against the
> > checkpoint LSN (presuming this is for online checking of checksums- for
> > basebackup, we could just check against the backup-start LSN as anything
> > after that point will be rewritten by WAL anyway).  The reason that I
> > think it's a good idea to check against the checkpoint LSN is that we'd
> > want to throw a big warning if the kernel is just feeding us random
> > garbage on reads and only finding a difference between two reads isn't
> > really doing any kind of validation, whereas checking against the
> > checkpoint-LSN would at least give us some idea that the value being
> > read isn't completely ridiculous.
> >
> > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > to being unsure about that..  I could see how it might be but it seems a
> > bit surprising- I'd probably want to see exactly what the page was at
> > the time of the failure and at the time of the second (no-sleep) re-read
> > and then after a delay and convince myself that it was just an unlucky
> > case of being scheduled in twice to read that page before the process
> > writing it out got a chance to finish the write.
>
> I think the pg_sleep() is a pretty strong sign there's something broken.
> At the very least, it's likely to misbehave on machines with different
> timings, machines under memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level.  It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it.  This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Thanks!

Stephen


Re: Online verification of checksums

From
Tomas Vondra
Date:
On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'.  The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
> 
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
> 

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
> 
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely.  Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
> 

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> 
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
> 

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
> 
> I agree that there's some risk there, but it's certainly much less
> likely.
> 

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory pressure, etc.
> 
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level.  It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
> 
> I'll chat w/ David shortly about this again though and get his thoughts
> on it.  This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.
> 

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> >>> Ok, good, though I'm not sure what you mean by 'eliminates the
> >>> consistency guarantees provided by the checkpoint'.  The point is that
> >>> the page will be in the WAL and the WAL will be replayed during the
> >>> restore of the backup.
> >>
> >> The checkpoint guarantees that the whole page was written and flushed to
> >> disk with an LSN before the checkpoint LSN. So when you read a page with
> >> that LSN, you know the whole write already completed and a read won't
> >> return data from before the LSN.
> >
> > Well, you know that the first part was written out at some prior point,
> > but you could end up reading the first part of a page with an older LSN
> > while also reading the second part with new data.
>
>
> Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated before the first half (where the LSN lives), or it isn’t. If it’s possible then that LSN could be ancient and it wouldn’t matter. 

> >> Without the checkpoint that's not guaranteed, and simply re-reading the
> >> page and rechecking it vs. the first read does not help:
> >>
> >> 1) write the first 512B of the page (sector), which includes the LSN
> >>
> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> >>
> >> 3) the checksum verification fails
> >>
> >> 4) read the page again (possibly reading a bit more new data)
> >>
> >> 5) the LSN did not change compared to the first read, yet the checksum
> >> still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).

The question is if it’s possible to catch a torn page where the second half is updated *before* the first half of the page in a read (and then in subsequent reads having that state be maintained).  I have some skepticism that it’s really possible to happen in the first place but having an interrupted system call be stalled across two more system calls just seems terribly unlikely, and this is all based on the assumption that the kernel might write the second half of a write before the first to the kernel cache in the first place. 

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if it's LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
>

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
>
> I agree that there's some risk there, but it's certainly much less
> likely.
>

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

Yes, I definitely agree that we don’t want to mis-report checksum failures if we can avoid it. 

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory and/or memory pressure, etc.
>
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level.  It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
>
> I'll chat w/ David shortly about this again though and get his thoughts
> on it.  This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.

Why not to simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Use that to compare to what?  The LSN in the first half of the page could be from well before the checkpoint or even the backup started.

Thanks!

Stephen

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below). A sketch of
reading the checkpoint LSN at startup follows this list.

4. The pg_sleep should go.

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.
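
To illustrate point 3, the checkpoint LSN could be grabbed once from
pg_control at startup, e.g. via get_controlfile() from
src/common/controldata_utils.c. This is only a sketch (the exact signature
of get_controlfile() differs between branches, and DataDir / progname are
placeholders here), not the code from my branch:

    /* Sketch only: read the last checkpoint LSN once at program start. */
    ControlFileData *ControlFile;
    XLogRecPtr       checkpointLSN;
    bool             crc_ok;

    ControlFile = get_controlfile(DataDir, progname, &crc_ok);
    if (!crc_ok)
    {
        fprintf(stderr, "%s: pg_control CRC value is incorrect\n", progname);
        exit(1);
    }
    checkpointLSN = ControlFile->checkPoint;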

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > > 
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > > 
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > > 
> > > 3) the checksum verification fails
> > > 
> > > 4) read the page again (possibly reading a bit more new data)
> > > 
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> > 
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
> 
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
> 
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > > 
> > > Nope, we only verify the checksum if it's LSN precedes the checkpoint:
> > > 
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> > 
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
> 
> Yep.

Right, see point 1 above.

> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > > 
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > > 
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> > 
> > I agree that there's some risk there, but it's certainly much less
> > likely.
> 
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.

I agree.

> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway).  The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than it is worth, so I would prefer not to do
this.

> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that..  I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > > 
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under memory and/or memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just fine, at least
in the tests I ran.

> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level.  It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> > 
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it.  This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
> 
> Why not to simply look at the last checkpoint LSN and use that the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Mon, Sep 17, 2018 at 13:38 Michael Banck <michael.banck@credativ.de> wrote:
so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

Right. I’m tending towards the idea that this also be adopted for pg_verify_checksums. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

Maybe.  I’m not entirely convinced that it’s all that useful. 

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

I’m not sure that we really need to but I’m not against it either- but in that case you’re definitely going to see checksum failures on torn pages.

4. The pg_sleep should go.

I know that pgbackrest does not have a sleep currently and we’ve not yet seen or been able to reproduce this case where, on a reread, we still see an older LSN, but we check the LSN first also.  If it’s possible that the LSN still hasn’t changed on the reread then maybe we do need to have a sleep to force ourselves off CPU to allow the other process to finish writing, or maybe finish the file and come back around to these pages later, but we have yet to see this behavior in the wild anywhere, nor have we been able to reproduce it. 

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.

I agree with printing the number of skipped pages, that does seem like a nice to have.  I don’t know that actually printing the pages themselves is all that useful though. 

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > >
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > >
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > >
> > > 3) the checksum verification fails
> > >
> > > 4) read the page again (possibly reading a bit more new data)
> > >
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > >
> > > Nope, we only verify the checksum if it's LSN precedes the checkpoint:
> > >
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
>
> Yep.

Right, see point 1 above.

> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > >
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > >
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.

I agree.

> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway).  The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than it is worth, so I would prefer not to do
this.

I don’t follow why rereading pg_control would be error-prone.  That said, I don’t have a particularly strong opinion either way on this. 

> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that..  I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > >
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under memory and/or memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just fine, at least
in the tests I ran.

Ok, this sounds like you were probably seeing normal forward torn pages, and we have certainly seen that before.  

> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level.  It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it.  This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
> Why not to simply look at the last checkpoint LSN and use that the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.

This is fine if you know the kernel will always write the first part of the page first, or you accept that a reread of a page which isn’t valid will always result in seeing a completely updated page.

We’ve made the assumption that a reread on a failure where the LSN on the first read was older than the backup-start LSN will give us an updated first half of the page, whose LSN we then check, but we have yet to prove that this is actually guaranteed.

Thanks!

Stephen

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 09/17/2018 07:35 PM, Stephen Frost wrote:
> Greetings,
> 
> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     On 09/17/2018 07:11 PM, Stephen Frost wrote:
>     > Greetings,
>     >
>     > * Tomas Vondra (tomas.vondra@2ndquadrant.com
>     <mailto:tomas.vondra@2ndquadrant.com>) wrote:
>     >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>     >>> Ok, good, though I'm not sure what you mean by 'eliminates the
>     >>> consistency guarantees provided by the checkpoint'.  The point
>     is that
>     >>> the page will be in the WAL and the WAL will be replayed during the
>     >>> restore of the backup.
>     >>
>     >> The checkpoint guarantees that the whole page was written and
>     flushed to
>     >> disk with an LSN before the ckeckpoint LSN. So when you read a
>     page with
>     >> that LSN, you know the whole write already completed and a read won't
>     >> return data from before the LSN.
>     >
>     > Well, you know that the first part was written out at some prior
>     point,
>     > but you could end up reading the first part of a page with an
>     older LSN
>     > while also reading the second part with new data.
> 
> 
> 
>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> 
> 
> How? Either it’s possible for the latter half of a page to be updated
> before the first half (where the LSN lives), or it isn’t. If it’s
> possible then that LSN could be ancient and it wouldn’t matter. 
> 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the first half. I
don't think that's quite possible, because write() into page cache does
in fact write the data sequentially.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.


>     >> Without the checkpoint that's not guaranteed, and simply
>     re-reading the
>     >> page and rechecking it vs. the first read does not help:
>     >>
>     >> 1) write the first 512B of the page (sector), which includes the LSN
>     >>
>     >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>     >>
>     >> 3) the checksum verification fails
>     >>
>     >> 4) read the page again (possibly reading a bit more new data)
>     >>
>     >> 5) the LSN did not change compared to the first read, yet the
>     checksum
>     >> still fails
>     >
>     > So, I agree with all of the above though I've found it to be extremely
>     > rare to get a single read which you've managed to catch part-way
>     through
>     > a write, getting multiple of them over a period of time strikes me as
>     > even more unlikely.  Still, if we can come up with a solution to solve
>     > all of this, great, but I'm not sure that I'm hearing one.
> 
> 
>     I don't recall claiming catching many such torn pages - I'm sure it's
>     not very common in most workloads. But I suspect constructing workloads
>     hitting them regularly is not very difficult either (something with a
>     lot of churn in shared buffers should do the trick).
> 
> 
> The question is if it’s possible to catch a torn page where the second
> half is updated *before* the first half of the page in a read (and then
> in subsequent reads having that state be maintained).  I have some
> skepticism that it’s really possible to happen in the first place but
> having an interrupted system call be stalled across two more system
> calls just seems terribly unlikely, and this is all based on the
> assumption that the kernel might write the second half of a write before
> the first to the kernel cache in the first place.
> 

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

    [old,new,old]

but then the re-read you should be guaranteed to see new data up until
the last "new" chunk:

    [new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I've done was writing a simple stress test that writes a
single 8kB page in a loop, reads it concurrently and checks the
behavior. And it seems consistent with my understanding.
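
For what it's worth, a self-contained sketch of such a stress test might
look like the program below (not the actual program I used, just an
illustration of the invariant being checked): the writer keeps overwriting
one 8kB page with an increasing counter value, the reader reads the page
twice in a row and verifies that whenever the first read saw new data at
some offset, the second read does not see older data at any earlier offset.
Run it until you are satisfied (or until it complains):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGESZ 8192
    #define NINTS  (PAGESZ / sizeof(uint32_t))

    int
    main(void)
    {
        int         fd = open("testpage", O_RDWR | O_CREAT, 0600);
        uint32_t    page[NINTS];

        if (fd < 0)
            return 1;
        memset(page, 0, PAGESZ);
        if (pwrite(fd, page, PAGESZ, 0) != PAGESZ)
            return 1;

        if (fork() == 0)
        {
            /* writer: overwrite the whole page with values 1, 2, 3, ... */
            for (uint32_t v = 1;; v++)
            {
                for (size_t i = 0; i < NINTS; i++)
                    page[i] = v;
                if (pwrite(fd, page, PAGESZ, 0) != PAGESZ)
                    _exit(1);
            }
        }

        for (;;)
        {
            uint32_t    r1[NINTS], r2[NINTS];
            uint32_t    newest = 0;
            size_t      last = 0;

            if (pread(fd, r1, PAGESZ, 0) != PAGESZ ||
                pread(fd, r2, PAGESZ, 0) != PAGESZ)
                return 1;

            /* last offset where the first read saw its newest data */
            for (size_t i = 0; i < NINTS; i++)
                if (r1[i] >= newest)
                {
                    newest = r1[i];
                    last = i;
                }

            /* everything up to that offset must be at least as new now */
            for (size_t i = 0; i <= last; i++)
                if (r2[i] < newest)
                    printf("offset %zu went backwards on re-read: %u < %u\n",
                           i, (unsigned) r2[i], (unsigned) newest);
        }
    }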

> 
>     >>> Now, that said, I do think it's a good *idea* to check against the
>     >>> checkpoint LSN (presuming this is for online checking of
>     checksums- for
>     >>> basebackup, we could just check against the backup-start LSN as
>     anything
>     >>> after that point will be rewritten by WAL anyway).  The reason
>     that I
>     >>> think it's a good idea to check against the checkpoint LSN is
>     that we'd
>     >>> want to throw a big warning if the kernel is just feeding us random
>     >>> garbage on reads and only finding a difference between two reads
>     isn't
>     >>> really doing any kind of validation, whereas checking against the
>     >>> checkpoint-LSN would at least give us some idea that the value being
>     >>> read isn't completely ridiculous.
>     >>>
>     >>> When it comes to if the pg_sleep() is necessary or not, I have
>     to admit
>     >>> to being unsure about that..  I could see how it might be but it
>     seems a
>     >>> bit surprising- I'd probably want to see exactly what the page
>     was at
>     >>> the time of the failure and at the time of the second (no-sleep)
>     re-read
>     >>> and then after a delay and convince myself that it was just an
>     unlucky
>     >>> case of being scheduled in twice to read that page before the
>     process
>     >>> writing it out got a chance to finish the write.
>     >>
>     >> I think the pg_sleep() is a pretty strong sign there's something
>     broken.
>     >> At the very least, it's likely to misbehave on machines with
>     different
>     >> timings, machines under memory and/or memory pressure, etc.
>     >
>     > If we assume that what you've outlined above is a serious enough issue
>     > that we have to address it, and do so without a pg_sleep(), then I
>     think
>     > we have to bake into this a way for the process to check with PG as to
>     > what the page's current LSN is, in shared buffers, because that's the
>     > only place where we've got the locking required to ensure that we
>     don't
>     > end up with a read of a partially written page, and I'm really not
>     > entirely convinced that we need to go to that level.  It'd
>     certainly add
>     > a huge amount of additional complexity for what appears to be a quite
>     > unlikely gain.
>     >
>     > I'll chat w/ David shortly about this again though and get his
>     thoughts
>     > on it.  This is certainly an area we've spent time thinking about but
>     > are obviously also open to finding a better solution.
> 
> 
>     Why not to simply look at the last checkpoint LSN and use that the same
>     way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
> 
> 
> Use that to compare to what?  The LSN in the first half of the page
> could be from well before the checkpoint or even the backup started.
> 

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
> > <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> >     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> >
> > How? Either it’s possible for the latter half of a page to be updated
> > before the first half (where the LSN lives), or it isn’t. If it’s
> > possible then that LSN could be ancient and it wouldn’t matter. 
>
> I'm not sure I understand what you're saying here.
>
> It is not about the latter half of the page being updated before the first half. I
> don't think that's quite possible, because write() into page cache does
> in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

> The problem is that the write is not atomic, and AFAIK it happens in
> sectors (which are either 512B or 4K these days). And it may arbitrarily
> interleave with reads.

Yes, of course the write isn't atomic, that's clear.

> So you may do write(8k), but it actually happens in 512B chunks and a
> concurrent read may observe some mix of those.

Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.

> But the trick is that if the read sees the effect of the write somewhere
> in the middle of the page, the next read is guaranteed to see all the
> preceding new data.

If that's guaranteed then we can just check the LSN and be done.

> Without the checkpoint we risk seeing the same write() both in read and
> re-read, just in a different stage - so the LSN would not change, making
> the check futile.

This is the part that isn't making much sense to me.  If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and a 8K write
happened on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half.  That's a problem that I haven't got a solution to today.  I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

> But by waiting for the checkpoint we know that the original write is no
> longer in progress, so if we saw a partial write we're guaranteed to see
> a new LSN on re-read.
>
> This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).

> > The question is if it’s possible to catch a torn page where the second
> > half is updated *before* the first half of the page in a read (and then
> > in subsequent reads having that state be maintained).  I have some
> > skepticism that it’s really possible to happen in the first place but
> > having an interrupted system call be stalled across two more system
> > calls just seems terribly unlikely, and this is all based on the
> > assumption that the kernel might write the second half of a write before
> > the first to the kernel cache in the first place.
>
> Yes, if that was possible, the explanation about the checkpoint fsync
> guarantee would be bogus, obviously.
>
> I've spent quite a bit of time looking into how write() is handled, and
> I believe seeing only the second half is not possible. You may observe a
> page torn in various ways (not necessarily in half), e.g.
>
>     [old,new,old]
>
> but then the re-read you should be guaranteed to see new data up until
> the last "new" chunk:
>
>     [new,new,old]
>
> At least that's my understanding. I failed to deduce what POSIX says
> about this, or how it behaves on various OS/filesystems.
>
> The one thing I've done was writing a simple stress test that writes a
> single 8kB page in a loop, reads it concurrently and checks the
> behavior. And it seems consistent with my understanding.

Good.

> > Use that to compare to what?  The LSN in the first half of the page
> > could be from well before the checkpoint or even the backup started.
>
> Not sure I follow. If the LSN in the page header is old, and the
> checksum check failed, then on re-read we either find a new LSN (in
> which case we skip the page) or consider this to be a checksum failure.

Right, I'm in agreement with doing that and it's what is done in
pgbasebackup and pgBackRest.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 09/18/2018 12:01 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 07:35 PM, Stephen Frost wrote:
>>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
>>> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>>>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>>>
>>> How? Either it’s possible for the latter half of a page to be updated
>>> before the first half (where the LSN lives), or it isn’t. If it’s
>>> possible then that LSN could be ancient and it wouldn’t matter. 
>>
>> I'm not sure I understand what you're saying here.
>>
>> It is not about the latter half of the page being updated before the first half. I
>> don't think that's quite possible, because write() into page cache does
>> in fact write the data sequentially.
> 
> Well, maybe 'updated before' wasn't quite the right way to talk about
> it, but consider if a read(8K) gets only half-way through the copy
> before having to go do something else and by the time it gets back, a
> write has come in and rewritten the page, such that the read(8K)
> returns half-old and half-new data.
> 
>> The problem is that the write is not atomic, and AFAIK it happens in
>> sectors (which are either 512B or 4K these days). And it may arbitrarily
>> interleave with reads.
> 
> Yes, of course the write isn't atomic, that's clear.
> 
>> So you may do write(8k), but it actually happens in 512B chunks and a
>> concurrent read may observe some mix of those.
> 
> Right, I'm not sure that we really need to worry about sub-4K writes
> though I suppose they're technically possible, but it doesn't much
> matter in this case since the LSN is early on in the page, of course.
> 
>> But the trick is that if the read sees the effect of the write somewhere
>> in the middle of the page, the next read is guaranteed to see all the
>> preceding new data.
> 
> If that's guaranteed then we can just check the LSN and be done.
> 

What do you mean by "check the LSN"? Compare it to LSN from the first
read? You don't know if the first read already saw the new LSN or not
(see the next example).

>> Without the checkpoint we risk seeing the same write() both in read and
>> re-read, just in a different stage - so the LSN would not change, making
>> the check futile.
> 
> This is the part that isn't making much sense to me.  If we are
> guaranteed that writes into the kernel cache are always in order and
> always at least 512B in size, then if we check the LSN first and
> discover it's "old", and then read the rest of the page and calculate
> the checksum, discover it's a bad checksum, and then go back and re-read
> the page then we *must* see that the LSN has changed OR conclude that
> the checksum is invalidated.
> 

Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.

Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.

So let's say a read sees the first 2kB of data like this:

[new, new, new, old, new, old, new, old]

OK, the page is obviously torn, checksum fails, and we try reading it
again. We should see new data at least until the last 'new' chunk in the
first read, so let's say we got this:

[new, new, new, new, new, new, new, old]

Obviously, this page is also torn (there are old data at the end), but
we've read the new data in both cases, which includes the LSN. So the
LSN is the same in both cases, and your detection fails.

Comparing the page LSN to the last checkpoint LSN solves this, because
if the LSN is older than the checkpoint LSN, that write must have been
completed by now, and so we're not in danger of seeing only incomplete
effects of it. And a newer write will update the LSN.
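
Expressed as code, the rule could look roughly like this (just a sketch:
pg_checksum_page() and PageGetLSN() are the existing helpers, while buf,
blockno, checkpointLSN, skippedblocks and badblocks are made-up names):

    if (pg_checksum_page(buf, blockno) != ((PageHeader) buf)->pd_checksum)
    {
        /* might be a torn read: fetch the block again */
        if (pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ) != BLCKSZ)
            skippedblocks++;    /* file changed underneath us, skip */
        else if (PageGetLSN(buf) > checkpointLSN)
            skippedblocks++;    /* newer than the last checkpoint: skip */
        else if (pg_checksum_page(buf, blockno) !=
                 ((PageHeader) buf)->pd_checksum)
            badblocks++;        /* old LSN and still broken: report it */
    }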

> The reason this can happen in the first place is that our 8K read might
> only get half-way done before getting scheduled off and a 8K write
> happened on the page before our read(8K) gets back to finishing the
> read, but if what you're saying is true, then we can't ever have a case
> where such a thing would happen and a re-read would still see the "old"
> LSN.
> 
> If we check the LSN first and discover it's "new" (as in, more recent
> than our last checkpoint, or the checkpoint where the backup started)
> then, sure, there's going to be a risk that the page is currently being
> written right that moment and isn't yet completely valid.
> 

Right.

> The problem that we aren't solving for is if, somehow, we do a read(8K)
> and get the first half/second half mixup and then on a subsequent
> read(8K) we see that *again*, implying that somehow the kernel's copy
> has the latter-half of the page updated consistently but not the first
> half.  That's a problem that I haven't got a solution to today.  I'd
> love to have a guarantee that it's not possible- we've certainly never
> seen it but it's been a concern and I thought Michael was suggesting
> he'd seen that, but it sounds like there wasn't a check on the LSN in
> the first read, in which case it could have just been a 'regular' torn
> page case.
> 

Well, yeah. If that would be possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes and
I have not observed such behavior. Of course, that's hardly a proof it
can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...

>> But by waiting for the checkpoint we know that the original write is no
>> longer in progress, so if we saw a partial write we're guaranteed to see
>> a new LSN on re-read.
>>
>> This is what I mean by the checkpoint / fsync guarantee.
> 
> I don't think any of this really has anything to do with either fsync
> being called or with the actual checkpointing process (except to the
> extent that the checkpointer is the thing doing the writing, and that we
> should be checking the LSN against the LSN of the last checkpoint when
> we started, or against the start of the backup LSN if we're talking
> about doing a backup).
> 

You're right it's not about the fsync, sorry for the confusion. My point
is that using the checkpoint LSN gives us a guarantee that write is no
longer in progress, and so we can't see a page torn because of it. And
if we see a partial write due to a new write, it's guaranteed to update
the page LSN (and we'll notice it).

>>> The question is if it’s possible to catch a torn page where the second
>>> half is updated *before* the first half of the page in a read (and then
>>> in subsequent reads having that state be maintained).  I have some
>>> skepticism that it’s really possible to happen in the first place but
>>> having an interrupted system call be stalled across two more system
>>> calls just seems terribly unlikely, and this is all based on the
>>> assumption that the kernel might write the second half of a write before
>>> the first to the kernel cache in the first place.
>>
>> Yes, if that was possible, the explanation about the checkpoint fsync
>> guarantee would be bogus, obviously.
>>
>> I've spent quite a bit of time looking into how write() is handled, and
>> I believe seeing only the second half is not possible. You may observe a
>> page torn in various ways (not necessarily in half), e.g.
>>
>>     [old,new,old]
>>
>> but then the re-read you should be guaranteed to see new data up until
>> the last "new" chunk:
>>
>>     [new,new,old]
>>
>> At least that's my understanding. I failed to deduce what POSIX says
>> about this, or how it behaves on various OS/filesystems.
>>
>> The one thing I've done was writing a simple stress test that writes a
>> single 8kB page in a loop, reads it concurrently and checks the
>> behavior. And it seems consistent with my understanding.
> 
> Good.
> 
>>> Use that to compare to what?  The LSN in the first half of the page
>>> could be from well before the checkpoint or even the backup started.
>>
>> Not sure I follow. If the LSN in the page header is old, and the
>> checksum check failed, then on re-read we either find a new LSN (in
>> which case we skip the page) or consider this to be a checksum failure.
> 
> Right, I'm in agreement with doing that and it's what is done in
> pgbasebackup and pgBackRest.
> 

OK. All I'm saying is pg_verify_checksums should probably do the same
thing, i.e. grab checkpoint LSN and roll with that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/18/2018 12:01 AM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> >> But the trick is that if the read sees the effect of the write somewhere
> >> in the middle of the page, the next read is guaranteed to see all the
> >> preceding new data.
> >
> > If that's guaranteed then we can just check the LSN and be done.
>
> What do you mean by "check the LSN"? Compare it to LSN from the first
> read? You don't know if the first read already saw the new LSN or not
> (see the next example).

Hmm, ok, I can see your point there.  I've been going back and forth
between checking against what the prior LSN was on the page and checking
it against an independent source (like the last checkpoint's LSN), but..

[...]

> Comparing the page LSN to the last checkpoint LSN solves this, because
> if the LSN is older than the checkpoint LSN, that write must have been
> completed by now, and so we're not in danger of seeing only incomplete
> effects of it. And a newer write will update the LSN.

Yeah, that makes sense- we need to be looking at something which only
gets updated once the write has actually completed, and the last
checkpoint's LSN gives us that guarantee.

> > The problem that we aren't solving for is if, somehow, we do a read(8K)
> > and get the first half/second half mixup and then on a subsequent
> > read(8K) we see that *again*, implying that somehow the kernel's copy
> > has the latter-half of the page updated consistently but not the first
> > half.  That's a problem that I haven't got a solution to today.  I'd
> > love to have a guarantee that it's not possible- we've certainly never
> > seen it but it's been a concern and I thought Michael was suggesting
> > he'd seen that, but it sounds like there wasn't a check on the LSN in
> > the first read, in which case it could have just been a 'regular' torn
> > page case.
>
> Well, yeah. If that would be possible, we'd be in serious trouble. I've
> done quite a bit of experimentation with concurrent reads and writes and
> I have not observed such behavior. Of course, that's hardly a proof it
> can't happen, and it wouldn't be the first surprise with respect to
> kernel I/O this year ...

I'm glad to hear that you've done a lot of experimentation in this area
and haven't seen such strange behavior happen- we've got quite a few
people running pgBackRest with checksum-checking and haven't seen it
either, but it's always been a bit of a concern.

> You're right it's not about the fsync, sorry for the confusion. My point
> is that using the checkpoint LSN gives us a guarantee that write is no
> longer in progress, and so we can't see a page torn because of it. And
> if we see a partial write due to a new write, it's guaranteed to update
> the page LSN (and we'll notice it).

Right, no worries about the confusion, I hadn't been fully thinking
through the LSN bit either, and what we really need is some external
confirmation of a write having *completed* (not just started) and that
makes a definite difference.

> > Right, I'm in agreement with doing that and it's what is done in
> > pgbasebackup and pgBackRest.
>
> OK. All I'm saying is pg_verify_checksums should probably do the same
> thing, i.e. grab checkpoint LSN and roll with that.

Agreed.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Montag, den 17.09.2018, 14:09 -0400 schrieb Stephen Frost:
> > 5. There seems to be no consensus on whether the number of skipped pages
> > should be summarized at the end.
> 
> I agree with printing the number of skipped pages, that does seem like
> a nice to have.  I don’t know that actually printing the pages
> themselves is all that useful though. 

Oh ok - I never intended to print out the block numbers themselves, just
the final number of skipped blocks in the summary. So I guess that's
fine and I will add that in my branch.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Michael Banck
Date:
Hi.

Am Montag, den 17.09.2018, 20:45 -0400 schrieb Stephen Frost:
> > You're right it's not about the fsync, sorry for the confusion. My point
> > is that using the checkpoint LSN gives us a guarantee that write is no
> > longer in progress, and so we can't see a page torn because of it. And
> > if we see a partial write due to a new write, it's guaranteed to update
> > the page LSN (and we'll notice it).
> 
> Right, no worries about the confusion, I hadn't been fully thinking
> through the LSN bit either, and what we really need is some external
> confirmation of a write having *completed* (not just started) and that
> makes a definite difference.
> 
> > > Right, I'm in agreement with doing that and it's what is done in
> > > pgbasebackup and pgBackRest.
> > 
> > OK. All I'm saying is pg_verify_checksums should probably do the same
> > thing, i.e. grab checkpoint LSN and roll with that.
> 
> Agreed.

I've attached the patch I added to my branch to swap out the pg_sleep()
for a check against the checkpoint LSN on a recheck verification
failure.

Let me know if there are still issues with it. I'll send a new patch for
the whole online verification feature in a bit.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

please find attached version 2 of the patch.

Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
> 
> I've tested this in a tight loop (while true; do pg_verify_checksums -D
> data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> done", which I already used to develop the original code in the fork and
> which brought up a few bugs.
> 
> I got one checksums verification failure this way, all others were
> caught by the recheck (I've introduced a 500ms delay for the first ten
> failures) like this:
> 
> > pg_verify_checksums: checksum verification failed on first attempt in
> > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > expected 5063
> > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > verified ok on recheck

I have now changed this from the pg_sleep() to a check against the
checkpoint LSN as discussed upthread.

> However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> failures like this:
> 
> > pg_verify_checksums: short read of block 2644 in file
> > "data1/base/16637/16650", got only 4096 bytes
> 
> This is not strictly a verification failure, should we do anything about
> this? In my fork, I am also rechecking on this[3] (and I am happy to
> extend the patch that way), but that makes the code and the patch more
> complicated and I wanted to check the general opinion on this case
> first.

I have added a retry for this as well now, without a pg_sleep() as well.
This catches around 80% of the half-reads, but a few slip through. At
that point we bail out with exit(1), and the user can try again, which I
think is fine? 

Alternatively, we could just skip to the next file then and don't make
it count as a checksum failure.

Other changes from V1:

1. Rebased to 422952ee
2. Ignore ENOENT failure during file open and skip to next file
3. Mention total number of skipped blocks during the summary at the end
of the run
4. Skip files starting with pg_internal.init* (see the sketch below for
2. and 4.)
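
For 2. and 4., the checks in the per-file loop look roughly like this (a
sketch with made-up variable names, not a verbatim excerpt from the patch;
de is the directory entry, fn the full path):

    if (strncmp(de->d_name, "pg_internal.init",
                strlen("pg_internal.init")) == 0)
        continue;               /* relcache init files carry no checksums */

    fd = open(fn, O_RDONLY, 0);
    if (fd < 0)
    {
        if (errno == ENOENT)
            continue;           /* file vanished (e.g. dropped), fine online */
        fprintf(stderr, "%s: could not open file \"%s\": %s\n",
                progname, fn, strerror(errno));
        exit(1);
    }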


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> please find attached version 2 of the patch.
>
> Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, without a pg_sleep() as well.

> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine? 

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it.  If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.
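
Something along these lines, perhaps (just a sketch, with buf, blockno and
skippedblocks as placeholders): a short read at the very end of the file is
expected while the cluster is running and shouldn't be reported, let alone
counted as a checksum failure.

    ssize_t     r = read(fd, buf, BLCKSZ);

    if (r == 0)
        break;                  /* clean EOF, we are done with this file */
    if (r > 0 && r < BLCKSZ)
    {
        struct stat st;

        /* at EOF? then the last block just isn't fully written out yet */
        if (fstat(fd, &st) == 0 && lseek(fd, 0, SEEK_CUR) == st.st_size)
            break;

        /* otherwise skip just this block and realign to the next one */
        skippedblocks++;
        lseek(fd, (off_t) (blockno + 1) * BLCKSZ, SEEK_SET);
        continue;
    }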

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure.  We could possibly count
it towards the skipped pages, though I'm even on the fence about that.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
David Steele
Date:
On 9/18/18 11:45 AM, Stephen Frost wrote:
> * Michael Banck (michael.banck@credativ.de) wrote:

>> I have added a retry for this as well now, without a pg_sleep() as well.
>
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine? 
>
> No, this is perfectly normal behavior, as is having completely blank
> pages, now that I think about it.  If we get a short read then I'd say
> we simply check that we got an EOF and, in that case, we just move on.
>
>> Alternatively, we could just skip to the next file then and don't make
>> it count as a checksum failure.
>
> No, I wouldn't count it as a checksum failure.  We could possibly count
> it towards the skipped pages, though I'm even on the fence about that.

+1 for it not being a failure.  Personally I'd count it as a skipped
page, since we know the page exists but it can't be verified.

The other option is to wait for the page to stabilize, which doesn't
seem like it would take very long in most cases -- unless you are doing
this test from another host with shared storage.  Then I would expect to
see all kinds of interesting torn pages after the last checkpoint.

Regards,
--
-David
david@pgmasters.net


Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Dienstag, den 18.09.2018, 13:52 -0400 schrieb David Steele:
> On 9/18/18 11:45 AM, Stephen Frost wrote:
> > * Michael Banck (michael.banck@credativ.de) wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine? 
> > 
> > No, this is perfectly normal behavior, as is having completely blank
> > pages, now that I think about it.  If we get a short read then I'd say
> > we simply check that we got an EOF and, in that case, we just move on.
> > 
> > > Alternatively, we could just skip to the next file then and don't make
> > > it count as a checksum failure.
> > 
> > No, I wouldn't count it as a checksum failure.  We could possibly count
> > it towards the skipped pages, though I'm even on the fence about that.
> 
> +1 for it not being a failure.  Personally I'd count it as a skipped
> page, since we know the page exists but it can't be verified.
> 
> The other option is to wait for the page to stabilize, which doesn't
> seem like it would take very long in most cases -- unless you are doing
> this test from another host with shared storage.  Then I would expect to
> see all kinds of interesting torn pages after the last checkpoint.

OK, I'm skipping the block now on first try, as this (i) makes sense and
(ii) simplifies the code (again).

Version 3 is attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

Patch v3 applies cleanly, code compiles and make check is ok, but the 
command is probably not tested anywhere, as already mentioned on other 
threads.

The patch is missing a documentation update.

There are debatable changes of behavior:

    if (errno == ENOENT) return / continue...

For instance, a file disappearing is ok online, but not so if offline. On 
the other hand, the probability that a file suddenly disappears while the 
server is offline looks remote, so reporting such issues does not seem 
useful.

However, I'm more wary of the other continues/skips added. ISTM that 
skipping a block because of a read error, because it is new, or for some 
other reason is not the same thing, so these should be counted & reported 
differently?

   + if (block_retry == false)

Why not trust boolean operations?

   if (!block_retry)

-- 
Fabien.


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO:
> Patch v3 applies cleanly, code compiles and make check is ok, but the 
> command is probably not tested anywhere, as already mentioned on other 
> threads.

Right.

> The patch is missing a documentation update.

I've added that now. I think the only change needed was removing the
"server needs to be offline" part?

> There are debatable changes of behavior:
> 
>     if (errno == ENOENT) return / continue...
> 
> For instance, a file disappearing is ok online, but not so if offline. On 
> the other hand, the probability that a file suddenly disappears while the 
> server offline looks remote, so reporting such issues does not seem 
> useful.
> 
> However I'm more wary with other continues/skips added. ISTM that skipping 
> a block because of a read error, or because it is new, or some other 
> reasons, is not the same thing, so should be counted & reported 
> differently?

I think that would complicate things further without a lot of benefit.

After all, we are interested in checksum failures, not necessarily read
failures etc., so exiting on them (and skipping checks of possibly large
parts of PGDATA) looks undesirable to me.

So I have done no changes in this part so far, what do others think
about this?

>    + if (block_retry == false)
> 
> Why not trust boolean operations?
> 
>    if (!block_retry)

I've changed that as well.

Version 4 is attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO:
> > There are debatable changes of behavior:
> >
> >     if (errno == ENOENT) return / continue...
> >
> > For instance, a file disappearing is ok online, but not so if offline. On
> > the other hand, the probability that a file suddenly disappears while the
> > server offline looks remote, so reporting such issues does not seem
> > useful.
> >
> > However I'm more wary with other continues/skips added. ISTM that skipping
> > a block because of a read error, or because it is new, or some other
> > reasons, is not the same thing, so should be counted & reported
> > differently?
>
> I think that would complicate things further without a lot of benefit.
>
> After all, we are interested in checksum failures, not necessarily read
> failures etc. so exiting on them (and skip checking possibly large parts
> of PGDATA) looks undesirable to me.
>
> So I have done no changes in this part so far, what do others think
> about this?

I certainly don't see a lot of point in doing much more than what was
discussed previously for 'new' blocks (counting them as skipped and
moving on).

An actual read() error (that is, a failure on a read() call such as
getting back EIO), on the other hand, is something which I'd probably
report back to the user immediately and then move on, and perhaps
report again at the end.

Note that a short read isn't an error and falls under the 'new' blocks
discussion above.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
>> The patch is missing a documentation update.
>
> I've added that now. I think the only change needed was removing the
> "server needs to be offline" part?

Yes, and also checking that the described behavior corresponds to the new 
version.

>> There are debatable changes of behavior:
>>
>>     if (errno == ENOENT) return / continue...
>>
>> For instance, a file disappearing is ok online, but not so if offline. On
>> the other hand, the probability that a file suddenly disappears while the
>> server offline looks remote, so reporting such issues does not seem
>> useful.
>>
>> However I'm more wary with other continues/skips added. ISTM that skipping
>> a block because of a read error, or because it is new, or some other
>> reasons, is not the same thing, so should be counted & reported
>> differently?
>
> I think that would complicate things further without a lot of benefit.
>
> After all, we are interested in checksum failures, not necessarily read
> failures etc. so exiting on them (and skip checking possibly large parts
> of PGDATA) looks undesirable to me.

Hmmm.

I'm really saying that it is debatable, so here is some fuel to the 
debate:

If I run the check command and it cannot do its job, there is a problem 
which is as bad as a failing checksum. The only safe assumption on a 
cannot-read block is that the checksum is bad... So ISTM that on some of 
the "skipped" errors there should be an appropriate report (exit code, 
final output) that something is amiss.

-- 
Fabien.


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Mittwoch, den 26.09.2018, 10:54 -0400 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO:
> > > There are debatable changes of behavior:
> > > 
> > >     if (errno == ENOENT) return / continue...
> > > 
> > > For instance, a file disappearing is ok online, but not so if offline. On 
> > > the other hand, the probability that a file suddenly disappears while the 
> > > server offline looks remote, so reporting such issues does not seem 
> > > useful.
> > > 
> > > However I'm more wary with other continues/skips added. ISTM that skipping 
> > > a block because of a read error, or because it is new, or some other 
> > > reasons, is not the same thing, so should be counted & reported 
> > > differently?
> > 
> > I think that would complicate things further without a lot of benefit.
> > 
> > After all, we are interested in checksum failures, not necessarily read
> > failures etc. so exiting on them (and skip checking possibly large parts
> > of PGDATA) looks undesirable to me.
> > 
> > So I have done no changes in this part so far, what do others think
> > about this?
> 
> I certainly don't see a lot of point in doing much more than what was
> discussed previously for 'new' blocks (counting them as skipped and
> moving on).
> 
> An actual read() error (that is, a failure on a read() call such as
> getting back EIO), on the other hand, is something which I'd probably
> report back to the user immediately and then move on, and perhaps
> report again at the end.
> 
> Note that a short read isn't an error and falls under the 'new' blocks
> discussion above.

So I've added ENOENT checks when opening or statting files, i.e. EIO
would still be reported.

The current code in master exits on reads which do not return BLCKSZ,
which I've changed to a skip. That also meant we would no longer check
for read failures (return code < 0), so I have now added a check for
that which emits an error message and returns.
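
For reference, the per-block read handling now looks roughly like this (a
condensed sketch rather than the actual hunk; f, fn and buf come from the
surrounding scan_file(), and "skippedblocks" is the counter that ends up
in the final summary):

    BlockNumber blockno;

    for (blockno = 0;; blockno++)
    {
        ssize_t     r = read(f, buf, BLCKSZ);

        if (r == 0)
            break;              /* EOF, all complete blocks were read */

        if (r < 0)
        {
            /* a real read failure (e.g. EIO) is still reported */
            fprintf(stderr, "%s: could not read block %u in file \"%s\": %s\n",
                    progname, blockno, fn, strerror(errno));
            break;              /* give up on this file, keep scanning others */
        }

        if (r != BLCKSZ)
        {
            /* short read, most likely a block that is just being extended */
            skippedblocks++;
            break;
        }

        /* ... LSN check and checksum verification of the block go here ... */
    }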

New version 5 attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Hello Stephen,

> I certainly don't see a lot of point in doing much more than what was
> discussed previously for 'new' blocks (counting them as skipped and
> moving on).

Sure.

> An actual read() error (that is, a failure on a read() call such as
> getting back EIO), on the other hand, is something which I'd probably
> report back to the user immediately and then move on, and perhaps
> report again at the end.

Yep.

> Note that a short read isn't an error and falls under the 'new' blocks
> discussion above.

I'm really unsure that a short read should be coldly skipped:

If the check is offline, then one file is in a very bad state; this is 
really a panic situation.

If the check is online, given that both postgres and the verify command 
interact with the same OS (?) and at the pg page level, I'm not sure in 
which situation there could be a partial block, because pg would only 
send full pages to the OS.

-- 
Fabien.


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Fabien COELHO (coelho@cri.ensmp.fr) wrote:
> >Note that a short read isn't an error and falls under the 'new' blocks
> >discussion above.
>
> I'm really unsure that a short read should really be coldly skipped:
>
> If the check is offline, then one file is in a very bad state, this is
> really a panic situation.

Why?  Are we sure that's really something which can't ever happen, even
if the database was shutdown with 'immediate'?  I don't think it can but
that's something to consider.  In any case, my comments were
specifically thinking about it from an 'online' perspective.

> If the check is online, given that both postgres and the verify command
> interact with the same OS (?) and at the pg page level, I'm not sure in
> which situation there could be a partial block, because pg would only send
> full pages to the OS.

The OS doesn't operate at the same level that PG does- a single write in
PG could get blocked and scheduled off after having only copied half of
the 8k that PG sends.  This isn't really debatable- we've seen it happen
and everything is operating perfectly correctly, it just happens that
you were able to get a read() at the same time a write() was happening
and that only part of the page had been updated at that point.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
Hi,

On 09/26/2018 05:15 PM, Michael Banck wrote:
> ...
> 
> New version 5 attached.
> 

I've looked at v5, and the retry/recheck logic seems OK to me - I'd
still vote to keep it consistent with what pg_basebackup does (i.e.
doing the LSN check first, before looking at the checksum), but I don't
think it's a bug.

I'm not sure about the other issues brought up (ENOENT, short reads). I
haven't given it much thought.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Tomas Vondra
Date:
Hi,

One more thought - when running similar tools on a live system, it's
usually a good idea to limit the impact by throttling the throughput. As
the verification runs in an independent process it can't reuse the
vacuum-like cost limit directly, but perhaps it could do something
similar? Like, limit the number of blocks read/second, or so?
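
Just to sketch what I mean (nothing of this is in the patch; the option
name and the helper are made up), a crude per-block throttle could be as
simple as:

    /* hypothetical --max-rate value, in bytes per second (0 = no limit) */
    static long     max_rate = 0;
    static int64    bytes_since_sleep = 0;

    static void
    maybe_throttle(void)
    {
        if (max_rate <= 0)
            return;

        bytes_since_sleep += BLCKSZ;

        /* after ~100ms worth of data at the target rate, sleep for 100ms */
        if (bytes_since_sleep >= max_rate / 10)
        {
            pg_usleep(100 * 1000L);
            bytes_since_sleep = 0;
        }
    }

Called once per block read, this caps the average rate at roughly
max_rate, erring on the slow side because the time spent reading is not
credited back. The backend-side throttling used for pg_basebackup's
--max-rate does the bookkeeping more precisely, but the principle is the
same.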

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Michael Paquier
Date:
On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
> One more thought - when running similar tools on a live system, it's
> usually a good idea to limit the impact by throttling the throughput. As
> the verification runs in an independent process it can't reuse the
> vacuum-like cost limit directly, but perhaps it could do something
> similar? Like, limit the number of blocks read/second, or so?

When it comes to such parameters, not using a number of blocks but
throttling with a value in bytes (kB or MB of course) speaks more to the
user.  The past experience with checkpoint_segments is one example of
that.  Converting that to a number of blocks internally would definitely
make the most sense.  +1 for this idea.
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
> > One more thought - when running similar tools on a live system, it's
> > usually a good idea to limit the impact by throttling the throughput. As
> > the verification runs in an independent process it can't reuse the
> > vacuum-like cost limit directly, but perhaps it could do something
> > similar? Like, limit the number of blocks read/second, or so?
>
> When it comes to such parameters, not using a number of blocks but
> throttling with a value in bytes (kB or MB of course) speaks more to the
> user.  The past experience with checkpoint_segments is one example of
> that.  Converting that to a number of blocks internally would definitely
> make sense the most sense.  +1 for this idea.

While I agree this would be a nice additional feature to have, it seems
like something which could certainly be added later and doesn't
necessarily have to be included in the initial patch.  If Michael has
time to add that, great, if not, I'd rather have this as-is than not.

I do tend to agree with Michael that having the parameter be specified
as (or at least able to accept) a byte-based value is a good idea.  As
another feature idea, having this able to work in parallel across
tablespaces would be nice too.  I can certainly imagine some point where
this is a default process which scans the database at a slow pace across
all the tablespaces more-or-less all the time checking for corruption.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:

On 09/29/2018 02:14 PM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Paquier (michael@paquier.xyz) wrote:
>> On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
>>> One more thought - when running similar tools on a live system, it's
>>> usually a good idea to limit the impact by throttling the throughput. As
>>> the verification runs in an independent process it can't reuse the
>>> vacuum-like cost limit directly, but perhaps it could do something
>>> similar? Like, limit the number of blocks read/second, or so?
>>
>> When it comes to such parameters, not using a number of blocks but
>> throttling with a value in bytes (kB or MB of course) speaks more to the
>> user.  The past experience with checkpoint_segments is one example of
>> that.  Converting that to a number of blocks internally would definitely
>> make sense the most sense.  +1 for this idea.
> 
> While I agree this would be a nice additional feature to have, it seems
> like something which could certainly be added later and doesn't
> necessairly have to be included in the initial patch.  If Michael has
> time to add that, great, if not, I'd rather have this as-is than not.
> 

True, although I don't think it'd be particularly difficult.

> I do tend to agree with Michael that having the parameter be specified
> as (or at least able to accept) a byte-based value is a good idea.

Sure, I was not really expecting it to be exposed as a raw block count. I
agree it should be a byte-based value (i.e. just like --max-rate in
pg_basebackup).

> As another feature idea, having this able to work in parallel across
> tablespaces would be nice too.  I can certainly imagine some point where
> this is a default process which scans the database at a slow pace across
> all the tablespaces more-or-less all the time checking for corruption.
> 

Maybe, but that's certainly a non-trivial feature.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

> New version 5 attached.

Patch does not seem to apply anymore.

Moreover, ISTM that some discussions about behavioral changes are not 
fully settled.

My current opinion is that when offline some errors are not admissible, 
whereas the same errors are admissible when online because they may be due 
to the ongoing database processing, so the behavior should not be strictly 
the same.

This might suggest some option to tell the command that it should work in 
online or offline mode, so that it may be stricter in some cases. The 
default may be one of the options, eg the stricter offline mode, or maybe 
guessed at startup.

I put the patch in "waiting on author" state.

-- 
Fabien.


Re: Online verification of checksums

From
Michael Banck
Date:
Hi Fabien,

On Thu, Oct 25, 2018 at 10:16:03AM +0200, Fabien COELHO wrote:
> >New version 5 attached.
> 
> Patch does not seem to apply anymore.

Thanks, rebased version attached.

> Moreover, ISTM that some discussions about behavioral changes are not fully
> settled.
> 
> My current opinion is that when offline some errors are not admissible,
> whereas the same errors are admissible when online because they may be due
> to the ongoing database processing, so the behavior should not be strictly
> the same.

Indeed, the recently-added pg_verify_checksums testsuite adds a few
files with just 'foo' in them and with V5 of the patch,
pg_verify_checksums no longer bails out with an error on those.

I have now re-added the retry logic for partially-read pages, so that it
bails out if it reads a page partially twice. This makes the testsuite
work again.

I am not convinced we need to differentiate further between online and
offline operation; can you explain in more detail which other
differences are ok in online mode and why?
 
> This might suggest some option to tell the command that it should work in
> online or offline mode, so that it may be stricter in some cases. The
> default may be one of the option, eg the stricter offline mode, or maybe
> guessed at startup.

If we believe the operation should be different, the patch removes the
"is cluster online?" check (as it is no longer necessary), so we could
just replace the current error message with a global variable with the
result of that check and use it where needed (if any).
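
Something as simple as this, i.e. (a sketch only; the state check is the
one pg_verify_checksums already performs today, and "online" would be a
new global bool):

    /* was: "cluster must be shut down to verify checksums" + exit(1) */
    online = (ControlFile->state != DB_SHUTDOWNED &&
              ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY);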


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

Patch v6 applies cleanly, compiles, local make check is ok.

>> My current opinion is that when offline some errors are not admissible,
>> whereas the same errors are admissible when online because they may be due
>> to the ongoing database processing, so the behavior should not be strictly
>> the same.
>
> Indeed, the recently-added pg_verify_checksums testsuite

A welcome addition!

> adds a few files with just 'foo' in them and with V5 of the patch, 
> pg_verify_checksums no longer bails out with an error on those.

> I have now re-added the retry logic for partially-read pages, so that it
> bails out if it reads a page partially twice. This makes the testsuite
> work again.
>
> I am not convinced we need to differentiate further between online and
> offline operation, can you explain in more detail which other
> differences are ok in online mode and why?

For instance, the "file/directory was removed" cases do not look okay at 
all when offline, even if unlikely. Moreover, the check hides the error 
message and is fully silent in this case, whereas beforehand the same 
error was reported when offline.

The "check if page was modified since checkpoint" does not look useful 
when offline. Maybe it lacks a comment to say that this cannot (should not 
?) happen when offline, but even then I would not like it to be true: ISTM 
that no page should be allowed to be skipped on the checkpoint condition 
when offline, but it is probably ok to skip with the new page test, which 
makes me still think that they should be counted and reported separately, 
or at least the checkpoint skip test should not be run when offline.

When offline, the retry logic does not make much sense; it should complain 
directly on the first error? Also, I'm unsure of the read & checksum retry 
logic *without any delay*.

>> This might suggest some option to tell the command that it should work in
>> online or offline mode, so that it may be stricter in some cases. The
>> default may be one of the option, eg the stricter offline mode, or maybe
>> guessed at startup.
>
> If we believe the operation should be different, the patch removes the
> "is cluster online?" check (as it is no longer necessary), so we could
> just replace the current error message with a global variable with the
> result of that check and use it where needed (if any).

That could leave open the issue of someone starting the check offline, and 
then starting the database while it is not finished. Maybe it is not worth 
sweating about such a narrow use case.

If operations are to be different, and it seems to me they should be, I'd 
suggest (1) an auto-detected default based on the existing "is cluster 
online" code, and (2) force options, eg --online vs --offline, which would 
complain and exit if the cluster is not in the right state on startup.

I'd suggest to add a failing checksum online test, if possible. At least a 
"foo" file? It would also be nice if the test could apply on an active 
database, eg with a low-rate pgbench running in parallel to the 
verification, but I'm not sure how easy it is to add such a thing.

-- 
Fabien.


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> >I am not convinced we need to differentiate further between online and
> >offline operation, can you explain in more detail which other
> >differences are ok in online mode and why?
> 
> For instance the "file/directory was removed" do not look okay at all when
> offline, even if unlikely. Moreover, the checks hides the error message and
> is fully silent in this case, while it was not beforehand on the same error
> when offline.

OK, I kinda see the point here and added that.
 
> The "check if page was modified since checkpoint" does not look useful when
> offline. Maybe it lacks a comment to say that this cannot (should not ?)
> happen when offline, but even then I would not like it to be true: ISTM that
> no page should be allowed to be skipped on the checkpoint condition when
> offline, but it is probably ok to skip with the new page test, which make me
> still think that they should be counted and reported separately, or at least
> the checkpoint skip test should not be run when offline.

What is the rationale for not skipping on the checkpoint condition when
the instance is offline?  If it was shut down cleanly, this should not
happen; if the instance crashed, those would be spurious errors that
would get repaired on recovery. 

I have not changed that for now.

> When offline, the retry logic does not make much sense, it should complain
> directly on the first error? Also, I'm unsure of the read & checksum retry
> logic *without any delay*.

I think the small overhead of retrying in offline mode, even if useless
there, is worth it to avoid making the code more complicated in order to
cater for both modes.

Initially there was a delay, but this was removed after analysis and
requests by several other reviewers.

> >>This might suggest some option to tell the command that it should work in
> >>online or offline mode, so that it may be stricter in some cases. The
> >>default may be one of the option, eg the stricter offline mode, or maybe
> >>guessed at startup.
> >
> >If we believe the operation should be different, the patch removes the
> >"is cluster online?" check (as it is no longer necessary), so we could
> >just replace the current error message with a global variable with the
> >result of that check and use it where needed (if any).
> 
> That could let open the issue of someone starting the check offline, and
> then starting the database while it is not finished. Maybe it is not worth
> sweating about such a narrow use case.

I don't think we need to cater for that, yeah.
 
> If operations are to be different, and it seems to me they should be, I'd
> suggest (1) auto detect default based one the existing "is cluster online"
> code, (2) force options, eg --online vs --offline, which would complain and
> exit if the cluster is not in the right state on startup.

The current code bails out if it thinks the cluster is online. What is
wrong with just setting a flag now in case it is?
 
> I'd suggest to add a failing checksum online test, if possible. At least a
> "foo" file? 

Ok, done so.

> It would also be nice if the test could apply on an active database,
> eg with a low-rate pgbench running in parallel to the verification,
> but I'm not sure how easy it is to add such a thing.

That sounds much more complicated so I have not tackled that yet.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> > The "check if page was modified since checkpoint" does not look useful when
> > offline. Maybe it lacks a comment to say that this cannot (should not ?)
> > happen when offline, but even then I would not like it to be true: ISTM that
> > no page should be allowed to be skipped on the checkpoint condition when
> > offline, but it is probably ok to skip with the new page test, which make me
> > still think that they should be counted and reported separately, or at least
> > the checkpoint skip test should not be run when offline.
>
> What is the rationale to not skip on the checkpoint condition when the
> instance is offline?  If it was shutdown cleanly, this should not
> happen, if the instance crashed, those would be spurious errors that
> would get repaired on recovery.
>
> I have not changed that for now.

Agreed- this is an important check even in offline mode.

> > When offline, the retry logic does not make much sense, it should complain
> > directly on the first error? Also, I'm unsure of the read & checksum retry
> > logic *without any delay*.

The race condition being considered here is where an 8k read somehow
gets the first 4k, then is scheduled off-cpu, and the full 8k page is
then written by some other process, and then this process is woken up
to read the second 4k.  I agree that this is unnecessary when the
database is offline, but it's also pretty cheap.  When the database is
online, it's an extremely unlikely case to hit (just try to reproduce
it...) but if it does get hit then it's easy enough to recheck by doing
a reread, which should show that the LSN has been updated in the first
4k and we can then know that this page is in the WAL.  We have not yet
seen a case where such a re-read returns an old LSN and an invalid
checksum; based on discussion with other hackers, that shouldn't be
possible as every kernel seems to consistently write in-order, meaning
that the first 4k will be updated before the second, so a single re-read
should be sufficient.

Remember- this is all in-memory activity also, we aren't talking about
what might happen on disk here.
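
To spell the sequence out, the retry amounts to roughly this (a condensed
sketch, not the patch text; checkpoint_lsn would be the redo LSN read
from pg_control at startup, and the counters are illustrative):

    PageHeader  header = (PageHeader) buf;
    uint16      csum;

    csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
    if (csum != header->pd_checksum)
    {
        /* reread the block once; a torn read will now see the new header */
        if (lseek(f, (off_t) blockno * BLCKSZ, SEEK_SET) < 0 ||
            read(f, buf, BLCKSZ) != BLCKSZ)
        {
            skippedblocks++;    /* vanished or truncated underneath us */
        }
        else if (PageGetLSN(buf) > checkpoint_lsn)
        {
            /* written after the last checkpoint, WAL replay covers it */
            skippedblocks++;
        }
        else if (pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE)
                 != header->pd_checksum)
        {
            /* still wrong on a consistent reread: genuine corruption */
            badblocks++;
        }
    }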

> I think the small overhead of retrying in offline mode even if useless
> is worth avoiding making the code more complicated in order to cater for
> both modes.

Agreed.

> Initially there was a delay, but this was removed after analysis and
> requests by several other reviewers.

Agreed, there's no need for or point to having such a delay.

> > >>This might suggest some option to tell the command that it should work in
> > >>online or offline mode, so that it may be stricter in some cases. The
> > >>default may be one of the option, eg the stricter offline mode, or maybe
> > >>guessed at startup.
> > >
> > >If we believe the operation should be different, the patch removes the
> > >"is cluster online?" check (as it is no longer necessary), so we could
> > >just replace the current error message with a global variable with the
> > >result of that check and use it where needed (if any).
> >
> > That could let open the issue of someone starting the check offline, and
> > then starting the database while it is not finished. Maybe it is not worth
> > sweating about such a narrow use case.
>
> I don't think we need to cater for that, yeah.

Agreed.

> > It would also be nice if the test could apply on an active database,
> > eg with a low-rate pgbench running in parallel to the verification,
> > but I'm not sure how easy it is to add such a thing.
>
> That sounds much more complicated so I have not tackled that yet.

I agree that this would be nice, but I don't want the regression tests
to become much longer...

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:

On 11/22/18 2:12 AM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Banck (michael.banck@credativ.de) wrote:
>> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
>>> The "check if page was modified since checkpoint" does not look useful when
>>> offline. Maybe it lacks a comment to say that this cannot (should not ?)
>>> happen when offline, but even then I would not like it to be true: ISTM that
>>> no page should be allowed to be skipped on the checkpoint condition when
>>> offline, but it is probably ok to skip with the new page test, which make me
>>> still think that they should be counted and reported separately, or at least
>>> the checkpoint skip test should not be run when offline.
>>
>> What is the rationale to not skip on the checkpoint condition when the
>> instance is offline?  If it was shutdown cleanly, this should not
>> happen, if the instance crashed, those would be spurious errors that
>> would get repaired on recovery.
>>
>> I have not changed that for now.
> 
> Agreed- this is an important check even in offline mode.
> 

Yeah. I suppose we could detect if the shutdown was clean (like 
pg_rewind does), and then skip the check. Or perhaps we should still do 
the check (without a retry), and report it as an issue when we find a 
page with an LSN newer than the last checkpoint.

In any case, the check is pretty cheap (comparing two 64-bit values), 
and I don't see how skipping it would optimize anything. It would make 
the code a tad simpler, but we still need the check for the online mode.

>>> When offline, the retry logic does not make much sense, it should complain
>>> directly on the first error? Also, I'm unsure of the read & checksum retry
>>> logic *without any delay*.
> 
> The race condition being considered here is where an 8k read somehow
> gets the first 4k, then is scheduled off-cpu, and the full 8k page is
> then written by some other process, and then this process is woken up
> to read the second 4k.  I agree that this is unnecessary when the
> database is offline, but it's also pretty cheap.  When the database is
> online, it's an extremely unlikely case to hit (just try to reproduce
> it...) but if it does get hit then it's easy enough to recheck by doing
> a reread, which should show that the LSN has been updated in the first
> 4k and we can then know that this page is in the WAL.  We have not yet
> seen a case where such a re-read returns an old LSN and an invalid
> checksum; based on discussion with other hackers, that shouldn't be
> possible as every kernel seems to consistently write in-order, meaning
> that the first 4k will be updated before the second, so a single re-read
> should be sufficient.
> 

Right.

A minor detail is that the reads/writes should be atomic at the sector 
level, which used to be 512B, so it's not just about pages torn in a 
4kB/4kB manner, but possibly an arbitrary mix of 512B chunks from the old 
and new versions.

This also explains why we don't need any delay - the reread happens 
after the write must have already written the page header, so the new 
LSN must be already visible.

So no delay is necessary. And if one were, how long should the delay be? 
The processes might end up off-CPU for an arbitrary amount of time, so 
picking a good value would be pretty tricky.

> Remember- this is all in-memory activity also, we aren't talking about
> what might happen on disk here.
> 
>> I think the small overhead of retrying in offline mode even if useless
>> is worth avoiding making the code more complicated in order to cater for
>> both modes.
> 
> Agreed.
> 
>> Initially there was a delay, but this was removed after analysis and
>> requests by several other reviewers.
> 
> Agreed, there's no need for or point to having such a delay.
> 

Yep.

>>>>> This might suggest some option to tell the command that it should work in
>>>>> online or offline mode, so that it may be stricter in some cases. The
>>>>> default may be one of the option, eg the stricter offline mode, or maybe
>>>>> guessed at startup.
>>>>
>>>> If we believe the operation should be different, the patch removes the
>>>> "is cluster online?" check (as it is no longer necessary), so we could
>>>> just replace the current error message with a global variable with the
>>>> result of that check and use it where needed (if any).
>>>
>>> That could let open the issue of someone starting the check offline, and
>>> then starting the database while it is not finished. Maybe it is not worth
>>> sweating about such a narrow use case.
>>
>> I don't think we need to cater for that, yeah.
> 
> Agreed.
> 

Yep. I don't think other tools protect against that either. And 
pg_rewind does actually modify the cluster state, unlike checksum 
verification.

>>> It would also be nice if the test could apply on an active database,
>>> eg with a low-rate pgbench running in parallel to the verification,
>>> but I'm not sure how easy it is to add such a thing.
>>
>> That sounds much more complicated so I have not tackled that yet.
> 
> I agree that this would be nice, but I don't want the regression tests
> to become much longer...
> 

I have to admit I find this thread rather confusing, because the subject 
is "online verification of checksums" yet we're discussing verification 
on offline instances.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 11/22/18 2:12 AM, Stephen Frost wrote:
> >* Michael Banck (michael.banck@credativ.de) wrote:
> >>On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> >>>The "check if page was modified since checkpoint" does not look useful when
> >>>offline. Maybe it lacks a comment to say that this cannot (should not ?)
> >>>happen when offline, but even then I would not like it to be true: ISTM that
> >>>no page should be allowed to be skipped on the checkpoint condition when
> >>>offline, but it is probably ok to skip with the new page test, which make me
> >>>still think that they should be counted and reported separately, or at least
> >>>the checkpoint skip test should not be run when offline.
> >>
> >>What is the rationale to not skip on the checkpoint condition when the
> >>instance is offline?  If it was shutdown cleanly, this should not
> >>happen, if the instance crashed, those would be spurious errors that
> >>would get repaired on recovery.
> >>
> >>I have not changed that for now.
> >
> >Agreed- this is an important check even in offline mode.
>
> Yeah. I suppose we could detect if the shutdown was clean (like pg_rewind
> does), and then skip the check. Or perhaps we should still do the check
> (without a retry), and report it as issue when we find a page with LSN newer
> than the last checkpoint.

I agree that it'd be nice to report an issue if it's a clean shutdown
but there's an LSN newer than the last checkpoint, though I suspect that
would be more useful in debugging and such and not so useful for users.

> In any case, the check is pretty cheap (comparing two 64-bit values), and I
> don't see how skipping it would optimize anything. It would make the code a
> tad simpler, but we still need the check for the online mode.

Yeah, I'd just keep the check.

> A minor detail is that the reads/writes should be atomic at the sector
> level, which used to be 512B, so it's not just about pages torn in 4kB/4kB
> manner, but possibly an arbitrary mix of 512B chunks from old and new
> version.

Sure.

> This also explains why we don't need any delay - the reread happens after
> the write must have already written the page header, so the new LSN must be
> already visible.

Agreed.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Dmitry Dolgov
Date:
> On Wed, Nov 21, 2018 at 1:38 PM Michael Banck <michael.banck@credativ.de> wrote:
>
> Hi,
>
> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> > >I am not convinced we need to differentiate further between online and
> > >offline operation, can you explain in more detail which other
> > >differences are ok in online mode and why?
> >
> > For instance the "file/directory was removed" do not look okay at all when
> > offline, even if unlikely. Moreover, the checks hides the error message and
> > is fully silent in this case, while it was not beforehand on the same error
> > when offline.
>
> OK, I kinda see the point here and added that.

Hi,

Just for information: it looks like part of this patch (or at least some
similar code), related to the tests in 002_actions.pl, was recently
committed in 5c99513975, so there are minor conflicts with master.


Re: Online verification of checksums

From
Michael Paquier
Date:
On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote:
> Just for the information, looks like part of this patch (or at least some
> similar code), related to the tests in 002_actions.pl, was committed recently
> in 5c99513975, so there are minor conflicts with the master.

From what I can see in v7 of the patch as posted in [1], all the changes
to 002_actions.pl could just be removed because there are already
equivalents.

[1]: https://postgr.es/m/20181121123535.GD23740@nighthawk.caipicrew.dd-dns.de
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Mon, Dec 03, 2018 at 09:48:43AM +0900, Michael Paquier wrote:
> On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote:
> > Just for the information, looks like part of this patch (or at least some
> > similar code), related to the tests in 002_actions.pl, was committed recently
> > in 5c99513975, so there are minor conflicts with the master.
> 
> What what I can see in v7 of the patch as posted in [1], all the changes
> to 002_actions.pl could just be removed because there are already
> equivalents.

Yeah, new rebased version attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Thu, Dec 20, 2018 at 04:19:11PM +0100, Michael Banck wrote:
> Yeah, new rebased version attached.

By the way, one thing that this patch also fixes is checksum
verification on basebackups (as pointed out the other day by my
colleague Bernd Helmle):

postgres@kohn:~$ initdb -k data
postgres@kohn:~$ pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
postgres@kohn:~$ pg_verify_checksums -D data
pg_verify_checksums: cluster must be shut down to verify checksums
postgres@kohn:~$ pg_basebackup -h /tmp -D backup1
postgres@kohn:~$ pg_verify_checksums -D backup1
pg_verify_checksums: cluster must be shut down to verify checksums
postgres@kohn:~$ pg_checksums -c -D backup1
Checksum scan completed
Files scanned:  1094
Blocks scanned: 2867
Bad checksums:  0
Data checksum version: 1

Where pg_checksums has the online verification patch applied.

As I don't think many people will take down their production servers in
order to verify checksums, verifying them on basebackups looks like a
useful use-case that is currently broken with pg_verify_checksums.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

> Yeah, new rebased version attached.

Patch v8 applies cleanly, compiles, global & local make check are ok.

A few comments:

About added tests: the node is left running at the end of the script, 
which is not very clean. I'd suggest to either move the added checks 
before stopping, or to stop again at the end of the script, depending on 
the intention.

I'm wondering (possibly again) about the existing early exit if one block 
cannot be read on retry: the command should count this as a kind of bad 
block, proceed on checking other files, and obviously fail in the end, but 
having checked everything else and generated a report. I do not think that 
this condition warrants a full stop. ISTM that under rare race conditions 
(eg, an unlucky concurrent "drop database" or "drop table") this could 
happen when online, although I could not trigger one despite heavy 
testing, so I'm possibly mistaken.

-- 
Fabien.


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
> Hallo Michael,
> 
> > Yeah, new rebased version attached.
> 
> Patch v8 applies cleanly, compiles, global & local make check are ok.
> 
> A few comments:
> 
> About added tests: the node is left running at the end of the script, which
> is not very clean. I'd suggest to either move the added checks before
> stopping, or to stop again at the end of the script, depending on the
> intention.

Michael?


> I'm wondering (possibly again) about the existing early exit if one block
> cannot be read on retry: the command should count this as a kind of bad
> block, proceed on checking other files, and obviously fail in the end, but
> having checked everything else and generated a report. I do not think that
> this condition warrants a full stop. ISTM that under rare race conditions
> (eg, an unlucky concurrent "drop database" or "drop table") this could
> happen when online, although I could not trigger one despite heavy testing,
> so I'm possibly mistaken.

This seems like a defensible judgement call either way.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Michael Paquier
Date:
On Sun, Feb 03, 2019 at 02:06:45AM -0800, Andres Freund wrote:
> On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
>> About added tests: the node is left running at the end of the script, which
>> is not very clean. I'd suggest to either move the added checks before
>> stopping, or to stop again at the end of the script, depending on the
>> intention.
>
> Michael?

Unlikely P., and most likely B.

I have marked the patch as returned with feedback as it has been a
couple of weeks already.
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Sonntag, den 03.02.2019, 02:06 -0800 schrieb Andres Freund:
> Hi,
> 
> On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
> > Hallo Michael,
> > 
> > > Yeah, new rebased version attached.
> > 
> > Patch v8 applies cleanly, compiles, global & local make check are ok.
> > 
> > A few comments:
> > 
> > About added tests: the node is left running at the end of the script, which
> > is not very clean. I'd suggest to either move the added checks before
> > stopping, or to stop again at the end of the script, depending on the
> > intention.
> 
> Michael?

Uh, I kinda forgot about this; I've made the tests stop the node now.

> > I'm wondering (possibly again) about the existing early exit if one block
> > cannot be read on retry: the command should count this as a kind of bad
> > block, proceed on checking other files, and obviously fail in the end, but
> > having checked everything else and generated a report. I do not think that
> > this condition warrants a full stop. ISTM that under rare race conditions
> > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > happen when online, although I could not trigger one despite heavy testing,
> > so I'm possibly mistaken.
> 
> This seems like a defensible judgement call either way.

Right now we have a few tests that explicitly check that
pg_verify_checksums fails on broken data ("foo" in the file).  Those
would then just get skipped AFAICT, which I think is the worse
behaviour, but if everybody thinks that should be the way to go, we can
drop/adjust those tests and make pg_verify_checksums skip them.

Thoughts?

In the meanwhile, v9 is attached with the above change and rebased
(without changes) to master.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Michael,

>>> I'm wondering (possibly again) about the existing early exit if one block
>>> cannot be read on retry: the command should count this as a kind of bad
>>> block, proceed on checking other files, and obviously fail in the end, but
>>> having checked everything else and generated a report. I do not think that
>>> this condition warrants a full stop. ISTM that under rare race conditions
>>> (eg, an unlucky concurrent "drop database" or "drop table") this could
>>> happen when online, although I could not trigger one despite heavy testing,
>>> so I'm possibly mistaken.
>>
>> This seems like a defensible judgement call either way.
>
> Right now we have a few tests that explicitly check that
> pg_verify_checksums fail on broken data ("foo" in the file).  Those
> would then just get skipped AFAICT, which I think is the worse behaviour
> , but if everybody thinks that should be the way to go, we can
> drop/adjust those tests and make pg_verify_checksums skip them.
>
> Thoughts?

My point is that it should fail as it does, only not immediately (early 
exit), but after having checked everything else. This means avoiding 
calling "exit(1)" here and there (lseek, fopen...), instead taking note 
that something bad happened and calling exit only at the end.
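
To make this concrete (a sketch only, "ioerrors" being a hypothetical
counter next to badblocks): instead of

    if (lseek(f, (off_t) blockno * BLCKSZ, SEEK_SET) < 0)
    {
        fprintf(stderr, "%s: could not seek in file \"%s\": %s\n",
                progname, fn, strerror(errno));
        exit(1);
    }

the command would note the failure and keep going:

    if (lseek(f, (off_t) blockno * BLCKSZ, SEEK_SET) < 0)
    {
        fprintf(stderr, "%s: could not seek in file \"%s\": %s\n",
                progname, fn, strerror(errno));
        ioerrors++;
        close(f);
        return;                 /* give up on this file only */
    }

and only decide the exit status once the final report has been printed:

    if (badblocks > 0 || ioerrors > 0)
        exit(1);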

-- 
Fabien.


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > cannot be read on retry: the command should count this as a kind of bad
> > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > having checked everything else and generated a report. I do not think that
> > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > happen when online, although I could not trigger one despite heavy testing,
> > > > so I'm possibly mistaken.
> > > 
> > > This seems like a defensible judgement call either way.
> > 
> > Right now we have a few tests that explicitly check that
> > pg_verify_checksums fail on broken data ("foo" in the file).  Those
> > would then just get skipped AFAICT, which I think is the worse behaviour
> > , but if everybody thinks that should be the way to go, we can
> > drop/adjust those tests and make pg_verify_checksums skip them.
> > 
> > Thoughts?
> 
> My point is that it should fail as it does, only not immediately (early
> exit), but after having checked everything else. This mean avoiding calling
> "exit(1)" here and there (lseek, fopen...), but taking note that something
> bad happened, and call exit only in the end.

I can see both as being valuable (one gives you a more complete picture,
the other a quicker answer in scripts). For me that's the point where
it's the prerogative of the author to make that choice.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Tomas Vondra
Date:

On 2/5/19 8:01 AM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
>>>>> I'm wondering (possibly again) about the existing early exit if one block
>>>>> cannot be read on retry: the command should count this as a kind of bad
>>>>> block, proceed on checking other files, and obviously fail in the end, but
>>>>> having checked everything else and generated a report. I do not think that
>>>>> this condition warrants a full stop. ISTM that under rare race conditions
>>>>> (eg, an unlucky concurrent "drop database" or "drop table") this could
>>>>> happen when online, although I could not trigger one despite heavy testing,
>>>>> so I'm possibly mistaken.
>>>>
>>>> This seems like a defensible judgement call either way.
>>>
>>> Right now we have a few tests that explicitly check that
>>> pg_verify_checksums fail on broken data ("foo" in the file).  Those
>>> would then just get skipped AFAICT, which I think is the worse behaviour
>>> , but if everybody thinks that should be the way to go, we can
>>> drop/adjust those tests and make pg_verify_checksums skip them.
>>>
>>> Thoughts?
>>
>> My point is that it should fail as it does, only not immediately (early
>> exit), but after having checked everything else. This mean avoiding calling
>> "exit(1)" here and there (lseek, fopen...), but taking note that something
>> bad happened, and call exit only in the end.
> 
> I can see both as being valuable (one gives you a more complete picture,
> the other a quicker answer in scripts). For me that's the point where
> it's the prerogative of the author to make that choice.
> 

Why not make this configurable, using a command-line option?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> On 2/5/19 8:01 AM, Andres Freund wrote:
> > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > having checked everything else and generated a report. I do not think that
> > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > so I'm possibly mistaken.
> > > > > 
> > > > > This seems like a defensible judgement call either way.
> > > > 
> > > > Right now we have a few tests that explicitly check that
> > > > pg_verify_checksums fail on broken data ("foo" in the file).  Those
> > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > , but if everybody thinks that should be the way to go, we can
> > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > 
> > > > Thoughts?
> > > 
> > > My point is that it should fail as it does, only not immediately (early
> > > exit), but after having checked everything else. This mean avoiding calling
> > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > bad happened, and call exit only in the end.
> > 
> > I can see both as being valuable (one gives you a more complete picture,
> > the other a quicker answer in scripts). For me that's the point where
> > it's the prerogative of the author to make that choice.

Personally, I would prefer to keep it as simple as possible for now and
get this patch committed; in my opinion the behaviour is already like
this (early exit on corrupt files) so I don't think the online
verification patch should change this.

If we see complaints about this, then I'd be happy to change it
afterwards.

> Why not make this configurable, using a command-line option?

I like this even less - this tool is about verifying checksums, so
adding options on what to do when it encounters broken pages looks out-
of-scope to me.  Unless we want to say it should generally abort on the
first issue (i.e. on wrong checksums as well).


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > > having checked everything else and generated a report. I do not think that
> > > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > > so I'm possibly mistaken.
> > > > > >
> > > > > > This seems like a defensible judgement call either way.
> > > > >
> > > > > Right now we have a few tests that explicitly check that
> > > > > pg_verify_checksums fail on broken data ("foo" in the file).  Those
> > > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > > , but if everybody thinks that should be the way to go, we can
> > > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > >
> > > > > Thoughts?
> > > >
> > > > My point is that it should fail as it does, only not immediately (early
> > > > exit), but after having checked everything else. This mean avoiding calling
> > > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > > bad happened, and call exit only in the end.
> > >
> > > I can see both as being valuable (one gives you a more complete picture,
> > > the other a quicker answer in scripts). For me that's the point where
> > > it's the prerogative of the author to make that choice.

... unless people here object or prefer other options, and then it's up
to discussion and hopefully some consensus comes out of it.

Also, I have to say that I really don't think the 'quicker answer'
argument holds any weight, making me question if that's a valid
use-case.  If there *isn't* an issue, which we would likely all agree is
the case the vast majority of the time that this is going to be run,
then it's going to take quite a while and anyone calling it should
expect and be prepared for that.  In the extremely rare cases, what does
exiting early actually do for us?

> Personally, I would prefer to keep it as simple as possible for now and
> get this patch committed; in my opinion the behaviour is already like
> this (early exit on corrupt files) so I don't think the online
> verification patch should change this.

I'm also in the camp of "would rather it not exit immediately, so the
extent of the issue is clear".

> If we see complaints about this, then I'd be happy to change it
> afterwards.

I really don't think this is something we should change later on in a
future release..  If the consensus is that there's really two different
but valid use-cases then we should make it configurable, but I'm not
convinced there is.

> > Why not make this configurable, using a command-line option?
>
> I like this even less - this tool is about verifying checksums, so
> adding options on what to do when it encounters broken pages looks out-
> of-scope to me.  Unless we want to say it should generally abort on the
> first issue (i.e. on wrong checksums as well).

I definitely disagree that it's somehow 'out of scope' for this tool to
skip broken pages, when we can tell that they're broken.  There is a
question here about how to handle a short read since that can happen
under normal conditions if we're unlucky.  The same is also true for
files disappearing entirely.

So, let's talk/think through a few cases:

A file with just 'foo\n' in it- could that be a page starting with
an LSN around 666F6F0A that we somehow only read the first few bytes of?
If not, why not?  I could possibly see an argument that we expect to
always get at least 512 bytes in a read, or 4K, but it seems like we
could possibly run into edge cases on odd filesystems or such.  In the
end, I'm leaning towards categorizing different things, well,
differently- a short read would be reported as a NOTICE or equivalent,
perhaps, meaning that the test case needs to do something more than just
have a file with 'foo' in it, but that is likely a good thing anyway-
the test cases would be better if they were closer to real world.  Other
read failures would be reported in a more serious category assuming they
are "this really shouldn't happen" cases.  A file disappearing isn't a
"can't happen" case, and might be reported at the same 'NOTICE' level
(or maybe with a 'verbose' option).

A file that's 8k in size and has a checksum but it's not right seems
pretty clear to me.  Might as well include a count of pages which have a
valid checksum, I would think, though perhaps only in a 'verbose' mode
would that get reported.

A completely zero'd page could also be reported at a NOTICE level or
with a count, or perhaps only with verbose.
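
To make those buckets concrete, here is a rough sketch of the categories
I have in mind (the names are invented purely for illustration, nothing
like this exists in the tool today):

/* Illustrative result classes only - not actual pg_verify_checksums code. */
typedef enum
{
    BLOCK_OK,               /* full read, checksum verified */
    BLOCK_NOTICE,           /* short read, vanished file, all-zero page */
    BLOCK_CHECKSUM_FAILURE, /* full 8kB read, checksum does not match */
    BLOCK_READ_ERROR        /* "this really shouldn't happen" failures */
} block_result;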

Other thoughts about use-cases and what should happen..?

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Mittwoch, den 06.02.2019, 11:39 -0500 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> > > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > > > having checked everything else and generated a report. I do not think that
> > > > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > > > so I'm possibly mistaken.
> > > > > > > 
> > > > > > > This seems like a defensible judgement call either way.
> > > > > > 
> > > > > > Right now we have a few tests that explicitly check that
> > > > > > pg_verify_checksums fail on broken data ("foo" in the file).  Those
> > > > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > > > , but if everybody thinks that should be the way to go, we can
> > > > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > My point is that it should fail as it does, only not immediately (early
> > > > > exit), but after having checked everything else. This mean avoiding calling
> > > > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > > > bad happened, and call exit only in the end.
> > > > 
> > > > I can see both as being valuable (one gives you a more complete picture,
> > > > the other a quicker answer in scripts). For me that's the point where
> > > > it's the prerogative of the author to make that choice.
> 
> ... unless people here object or prefer other options, and then it's up
> to discussion and hopefully some consensus comes out of it.
> 
> Also, I have to say that I really don't think the 'quicker answer'
> argument holds any weight, making me question if that's a valid
> use-case.  If there *isn't* an issue, which we would likely all agree is
> the case the vast majority of the time that this is going to be run,
> then it's going to take quite a while and anyone calling it should
> expect and be prepared for that.  In the extremely rare cases, what does
> exiting early actually do for us?
> 
> > Personally, I would prefer to keep it as simple as possible for now and
> > get this patch committed; in my opinion the behaviour is already like
> > this (early exit on corrupt files) so I don't think the online
> > verification patch should change this.
> 
> I'm also in the camp of "would rather it not exit immediately, so the
> extent of the issue is clear".
> 
> > If we see complaints about this, then I'd be happy to change it
> > afterwards.
> 
> I really don't think this is something we should change later on in a
> future release..  If the consensus is that there's really two different
> but valid use-cases then we should make it configurable, but I'm not
> convinced there is.

OK, fair enough.

> > > Why not make this configurable, using a command-line option?
> > 
> > I like this even less - this tool is about verifying checksums, so
> > adding options on what to do when it encounters broken pages looks out-
> > of-scope to me.  Unless we want to say it should generally abort on the
> > first issue (i.e. on wrong checksums as well).
> 
> I definitely disagree that it's somehow 'out of scope' for this tool to
> skip broken pages, when we can tell that they're broken.  

I didn't mean that it's out-of-scope for pg_verify_checksums, I meant it
is out-of-scope for this patch, which adds online checking.

> There is a question here about how to handle a short read since that
> can happen under normal conditions if we're unlucky.  The same is also
> true for files disappearing entirely.
> 
> So, let's talk/think through a few cases:
> 
> A file with just 'foo\n' in it- could that be a page starting with
> an LSN around 666F6F0A that we somehow only read the first few bytes of?
> If not, why not?  I could possibly see an argument that we expect to
> always get at least 512 bytes in a read, or 4K, but it seems like we
> could possibly run into edge cases on odd filesystems or such.  In the
> end, I'm leaning towards categorizing different things, well,
> differently- a short read would be reported as a NOTICE or equivilant,
> perhaps, meaning that the test case needs to do something more than just
> have a file with 'foo' in it, but that is likely a good things anyway-
> the test cases would be better if they were closer to real world.  Other
> read failures would be reported in a more serious category assuming they
> are "this really shouldn't happen" cases.  A file disappearing isn't a
> "can't happen" case, and might be reported at the same 'NOTICE' level
> (or maybe with a 'verbose' ption).

In the context of this patch, we should also discern whether a
particular case is merely a notice (or warning) when the cluster is
offline as well - I guess you think it should be?

So I've changed it such that a short read emits a "warning" message,
increments a new skippedfiles variable (as it is not just a skipped
block) and reports its number at the end - should it then exit with >
0 even if there were no wrong checksums?

> A file that's 8k in size and has a checksum but it's not right seems
> pretty clear to me.  Might as well include a count of pages which have a
> valid checksum, I would think, though perhaps only in a 'verbose' mode
> would that get reported.

What's the use for that? It already reports the number of scanned blocks
at the end, so that number is pretty easy to figure out from it and the
number of bad checksums. 
 
> A completely zero'd page could also be reported at a NOTICE level or
> with a count, or perhaps only with verbose.

It is counted as a skipped block right now (well, every block that
qualifies for PageIsNew() is), but skipped blocks are not mentioned right
now. I guess the rationale is that it might lead to excessive screen
output (but then, verbose originally logged /every/ block), but you'd
have to check with the original authors.

So I have now changed behaviour so that short writes count as skipped
files and pg_verify_checksums no longer bails out on them. When this
occurs, a warning is written to stderr and their overall count is also
reported at the end. However, unless there are other blocks with bad
checksums, the exit status is kept at zero.
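
In other words, the final report now works roughly like this (just a
sketch with illustrative variable names, not the literal patch code):

#include "postgres_fe.h"

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: skipped files are reported, but only genuine checksum
 * failures make the exit status non-zero.
 */
static void
report_and_exit(int64 badblocks, int64 skippedfiles)
{
    printf("Bad checksums:  " INT64_FORMAT "\n", badblocks);
    printf("Files skipped:  " INT64_FORMAT "\n", skippedfiles);

    exit(badblocks > 0 ? 1 : 0);
}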

New patch attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Hallo Mickael,

> So I have now changed behaviour so that short writes count as skipped
> files and pg_verify_checksums no longer bails out on them. When this
> occors a warning is written to stderr and their overall count is also
> reported at the end. However, unless there are other blocks with bad
> checksums, the exit status is kept at zero.

This seems fair when online, however I'm wondering whether it is when 
offline. I'd say that the whole retry logic should be skipped in this 
case? i.e. "if (block_retry || !online) { error message and continue }"
on both short read & checksum failure retries.
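
Something along these lines, i.e. a single place where the retry
decision is made (only a rough sketch of the structure I have in mind,
with illustrative names, not actual patch code):

#include "postgres_fe.h"
#include "storage/block.h"

#include <unistd.h>

/*
 * Sketch: read one block, retrying a short read only when the cluster is
 * online; offline runs give up immediately and let the caller report it.
 */
static bool
read_block(int fd, BlockNumber blockno, char *buf, bool online)
{
    int     attempts = online ? 2 : 1;      /* no retry at all when offline */

    while (attempts-- > 0)
    {
        ssize_t r = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);

        if (r == BLCKSZ)
            return true;
    }
    return false;           /* caller counts it as bad/skipped and continues */
}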

> New patch attached.

Patch applies cleanly, compiles, global & local make check ok.

I'm wondering whether it should exit(1) on "lseek" failures. Would it make 
sense to skip the file and report it as such? Should it be counted as a 
skippedfile?

WRT the final status, ISTM that skippedblocks & files could warrant an 
error when offline, although they might be ok when online?

-- 
Fabien.


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Donnerstag, den 28.02.2019, 14:29 +0100 schrieb Fabien COELHO:
> > So I have now changed behaviour so that short writes count as skipped
> > files and pg_verify_checksums no longer bails out on them. When this
> > occors a warning is written to stderr and their overall count is also
> > reported at the end. However, unless there are other blocks with bad
> > checksums, the exit status is kept at zero.
> 
> This seems fair when online, however I'm wondering whether it is when 
> offline. I'd say that the whole retry logic should be skipped in this 
> case? i.e. "if (block_retry || !online) { error message and continue }"
> on both short read & checksum failure retries.

Ok, the stand-alone pg_checksums program also got a PR about the LSN
skip logic not being helpful when the instance is offline and somebody
just writes /dev/urandom over the heap files: 

https://github.com/credativ/pg_checksums/pull/6

So I now tried to change the patch so that it only retries blocks when
online.

> Patch applies cleanly, compiles, global & local make check ok.
> 
> I'm wondering whether it should exit(1) on "lseek" failures. Would it make 
> sense to skip the file and report it as such? Should it be counted as a 
> skippedfile?

Ok, I think it makes sense to march on and I changed it that way.

> WRT the final status, ISTM that slippedblocks & files could warrant an 
> error when offline, although they might be ok when online?

Ok, also changed it that way.

New patch attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Robert Haas
Date:
On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
<michael.banck@credativ.de> wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

Maybe I'm confused here, but catching 80% of torn pages doesn't sound
robust at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> <michael.banck@credativ.de> wrote:
> > I have added a retry for this as well now, without a pg_sleep() as well.
> > This catches around 80% of the half-reads, but a few slip through. At
> > that point we bail out with exit(1), and the user can try again, which I
> > think is fine?
> 
> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> robust at all.

The chance that pg_verify_checksums hits a torn page (at least in my
tests, see below) is already pretty low, a couple of times per 1000
runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
on. Otherwise, we now just issue a warning and skip the file (or so was
the idea, see below), do you think that is not acceptable?

I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
pg_verify_checksums in tight loops) with the current patch version, and
I am seeing short reads very, very rarely (maybe every 1000th run) with
a warning like:

|1174
|pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
|pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
|Files skipped: 2

The 1174 is the sequence number, the first 1173 runs of
pg_verify_checksums only skipped blocks.

However, the fact it shows two warnings for the same file means there is
something wrong here. It was continuing to the next block while I think
it should just skip to the next file on read failures. So I have changed
that now, new patch attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael.banck@credativ.de> wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?
>
> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> pg_verify_checksums in tight loops) with the current patch version, and
> I am seeing short reads very, very rarely (maybe every 1000th run) with
> a warning like:
>
> |1174
> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> |Files skipped: 2
>
> The 1174 is the sequence number, the first 1173 runs of
> pg_verify_checksums only skipped blocks.
>
> However, the fact it shows two warnings for the same file means there is
> something wrong here. It was continueing to the next block while I think
> it should just skip to the next file on read failures. So I have changed
> that now, new patch attached.

I'm confused- if previously it was continuing to the next block instead
of doing the re-read on the same block, why don't we just change it to
do the re-read on the same block properly and see if that fixes the
retry, instead of just giving up and skipping..?  I'm not necessarily
against skipping to the next file, to be clear, but I think I'd be
happier if we kept reading the file until we actually get EOF.

(I've not looked at the actual patch, just read what you wrote..)

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/2/19 12:03 AM, Robert Haas wrote:
> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> <michael.banck@credativ.de> wrote:
>> I have added a retry for this as well now, without a pg_sleep() as well.
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine?
> 
> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> robust at all.
> 

FWIW I don't think this qualifies as torn page - i.e. it's not a full
read with a mix of old and new data. This is partial write, most likely
because we read the blocks one by one, and when we hit the last page
while the table is being extended, we may only see the fist 4kB. And if
we retry very fast, we may still see only the first 4kB.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Tomas Vondra
Date:

On 3/2/19 5:08 PM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Banck (michael.banck@credativ.de) wrote:
>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>> <michael.banck@credativ.de> wrote:
>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>> that point we bail out with exit(1), and the user can try again, which I
>>>> think is fine?
>>>
>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>> robust at all.
>>
>> The chance that pg_verify_checksums hits a torn page (at least in my
>> tests, see below) is already pretty low, a couple of times per 1000
>> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
>> on. Otherwise, we now just issue a warning and skip the file (or so was
>> the idea, see below), do you think that is not acceptable?
>>
>> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
>> pg_verify_checksums in tight loops) with the current patch version, and
>> I am seeing short reads very, very rarely (maybe every 1000th run) with
>> a warning like:
>>
>> |1174
>> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
>> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
>> |Files skipped: 2
>>
>> The 1174 is the sequence number, the first 1173 runs of
>> pg_verify_checksums only skipped blocks.
>>
>> However, the fact it shows two warnings for the same file means there is
>> something wrong here. It was continueing to the next block while I think
>> it should just skip to the next file on read failures. So I have changed
>> that now, new patch attached.
> 
> I'm confused- if previously it was continueing to the next block instead
> of doing the re-read on the same block, why don't we just change it to
> do the re-read on the same block properly and see if that fixes the
> retry, instead of just giving up and skipping..?  I'm not necessairly
> against skipping to the next file, to be clear, but I think I'd be
> happier if we kept reading the file until we actually get EOF.
> 
> (I've not looked at the actual patch, just read what you wrote..)
> 

Notice that those two errors are actually for two consecutive blocks in
the same file. So what probably happened is that postgres started to
extend the page, and the verification tried to read the last page after
the kernel added just the first 4kB filesystem page. Then it probably
succeeded on a retry, and then the same thing happened on the next page.

I don't think EOF addresses this, though - the partial read happens
before we actually reach the end of the file.

And re-reads are not a solution either, because the second read may
still see only the first half, and then what - is it a permanent issue
(in which case it's a data corruption), or an extension in progress?

I wonder if we can simply ignore those errors entirely, if it's the last
page in the segment? We can't really check the file is "complete"
anyway, e.g. if you have multiple segments for a table, and the "middle"
one is a page shorter, we'll happily ignore that during verification.
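
For instance something like this (just a sketch under the assumption
that we can fstat() the open segment; not meant as actual patch code):

#include "postgres_fe.h"
#include "storage/block.h"

#include <sys/stat.h>

/*
 * Sketch: a short read of a block that reaches past the segment's current
 * end is most likely concurrent extension (or truncation), not corruption,
 * so it can be ignored.
 */
static bool
short_read_is_benign(int fd, BlockNumber blockno)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return false;           /* can't tell, let the caller complain */

    return (off_t) (blockno + 1) * BLCKSZ >= st.st_size;
}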

Also, what if we're reading a file and it gets truncated (e.g. after
vacuum notices the last few pages are empty)? Doesn't that have the same
issue?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,


On 2019-03-02 22:49:33 +0100, Tomas Vondra wrote:
> 
> 
> On 3/2/19 5:08 PM, Stephen Frost wrote:
> > Greetings,
> > 
> > * Michael Banck (michael.banck@credativ.de) wrote:
> >> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> >>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >>> <michael.banck@credativ.de> wrote:
> >>>> I have added a retry for this as well now, without a pg_sleep() as well.
> >>>> This catches around 80% of the half-reads, but a few slip through. At
> >>>> that point we bail out with exit(1), and the user can try again, which I
> >>>> think is fine?
> >>>
> >>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> >>> robust at all.
> >>
> >> The chance that pg_verify_checksums hits a torn page (at least in my
> >> tests, see below) is already pretty low, a couple of times per 1000
> >> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> >> on. Otherwise, we now just issue a warning and skip the file (or so was
> >> the idea, see below), do you think that is not acceptable?
> >>
> >> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> >> pg_verify_checksums in tight loops) with the current patch version, and
> >> I am seeing short reads very, very rarely (maybe every 1000th run) with
> >> a warning like:
> >>
> >> |1174
> >> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> >> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> >> |Files skipped: 2
> >>
> >> The 1174 is the sequence number, the first 1173 runs of
> >> pg_verify_checksums only skipped blocks.
> >>
> >> However, the fact it shows two warnings for the same file means there is
> >> something wrong here. It was continueing to the next block while I think
> >> it should just skip to the next file on read failures. So I have changed
> >> that now, new patch attached.
> > 
> > I'm confused- if previously it was continueing to the next block instead
> > of doing the re-read on the same block, why don't we just change it to
> > do the re-read on the same block properly and see if that fixes the
> > retry, instead of just giving up and skipping..?  I'm not necessairly
> > against skipping to the next file, to be clear, but I think I'd be
> > happier if we kept reading the file until we actually get EOF.
> > 
> > (I've not looked at the actual patch, just read what you wrote..)
> > 
> 
> Notice that those two errors are actually for two consecutive blocks in
> the same file. So what probably happened is that postgres started to
> extend the page, and the verification tried to read the last page after
> the kernel added just the first 4kB filesystem page. Then it probably
> succeeded on a retry, and then the same thing happened on the next page.
> 
> I don't think EOF addresses this, though - the partial read happens
> before we actually reach the end of the file.
> 
> And re-reads are not a solution either, because the second read may
> still see only the first half, and then what - is it a permanent issue
> (in which case it's a data corruption), or an extension in progress?
> 
> I wonder if we can simply ignore those errors entirely, if it's the last
> page in the segment? We can't really check the file is "complete"
> anyway, e.g. if you have multiple segments for a table, and the "middle"
> one is a page shorter, we'll happily ignore that during verification.
> 
> Also, what if we're reading a file and it gets truncated (e.g. after
> vacuum notices the last few pages are empty)? Doesn't that have the same
> issue?

I gotta say, my conclusion from this debate is that it's simply a
mistake to do this without involvement of the server that can use
locking to prevent these kind of issues.  It seems pretty absurd to me
to have hacky workarounds around partial writes of a live server, around
truncation, etc, even though the server has ways to deal with that.

- Andres


Re: Online verification of checksums

From
Michael Paquier
Date:
On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote:
> I gotta say, my conclusion from this debate is that it's simply a
> mistake to do this without involvement of the server that can use
> locking to prevent these kind of issues.  It seems pretty absurd to me
> to have hacky workarounds around partial writes of a live server, around
> truncation, etc, even though the server has ways to deal with that.

I agree with Andres on this one.  We are never going to make this
stuff safe if we don't handle page reads with the proper locks because
of torn pages.  What I think we should do is provide a SQL function
which reads a page in shared mode, and then checks its checksum if its
LSN is older than the previous redo point.  This discards cases with
rather hot pages, but if the page is hot enough then the backend
re-reading the page would just do the same by verifying the page
checksum by itself.
--
Michael

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/3/19 12:48 AM, Michael Paquier wrote:
> On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote:
>> I gotta say, my conclusion from this debate is that it's simply a
>> mistake to do this without involvement of the server that can use
>> locking to prevent these kind of issues.  It seems pretty absurd to me
>> to have hacky workarounds around partial writes of a live server, around
>> truncation, etc, even though the server has ways to deal with that.
> 
> I agree with Andres on this one.  We are never going to make this
> stuff safe if we don't handle page reads with the proper locks because
> of torn pages.  What I think we should do is provide a SQL function
> which reads a page in shared mode, and then checks its checksum if its
> LSN is older than the previous redo point.  This discards cases with
> rather hot pages, but if the page is hot enough then the backend
> re-reading the page would just do the same by verifying the page
> checksum by itself.

Handling torn pages is not difficult, and the patch already does that
(it reads LSN of the last checkpoint LSN from the control file, and uses
it the same way basebackup does). That's working since (at least)
September, so I don't see how the SQL function would help with this?
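
For context, the gist of that per-page logic is something like this
(simplified sketch, the real code is in the patch and in basebackup.c):

#include "postgres_fe.h"

#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

/*
 * Sketch: pages with an LSN newer than the last checkpoint may have been
 * torn by a concurrent write, so their checksum is simply not verified.
 */
static bool
page_checksum_ok(char *buf, BlockNumber blkno, XLogRecPtr checkpointLSN)
{
    PageHeader  header = (PageHeader) buf;

    if (PageIsNew(buf))
        return true;                /* new page, no checksum set yet */

    if (PageGetLSN(buf) >= checkpointLSN)
        return true;                /* modified after the checkpoint, skip */

    return pg_checksum_page(buf, blkno) == header->pd_checksum;
}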

The other issue (raised recently) is partial reads, where we read only a
fraction of the page. Basebackup simply ignores such pages, likely on
the assumption that it's either concurrent extension or truncation (in
which case it's newer than the last checkpoint LSN anyway). So maybe we
should do the same thing here. As I mentioned before, we can't reliably
detect incomplete segments anyway (at least I believe that's the case).

You and Andres may be right that trying to verify checksums online
without close interaction with the server is ultimately futile (or at
least overly complex). But I'm not sure those issues (torn pages and
partial reads) are very good arguments, considering basebackup has to
deal with them too. Not sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Fabien COELHO
Date:
Bonjour Michaël,

>> I gotta say, my conclusion from this debate is that it's simply a
>> mistake to do this without involvement of the server that can use
>> locking to prevent these kind of issues.  It seems pretty absurd to me
>> to have hacky workarounds around partial writes of a live server, around
>> truncation, etc, even though the server has ways to deal with that.
>
> I agree with Andres on this one.  We are never going to make this stuff 
> safe if we don't handle page reads with the proper locks because of torn 
> pages. What I think we should do is provide a SQL function which reads a 
> page in shared mode, and then checks its checksum if its LSN is older 
> than the previous redo point.  This discards cases with rather hot 
> pages, but if the page is hot enough then the backend re-reading the 
> page would just do the same by verifying the page checksum by itself. -- 
> Michael

My 0.02€ about that, as one of the reviewers of the patch:

I agree that having a server function (extension?) to do a full checksum 
verification, possibly bandwidth-controlled, would be a good thing. 
However it would have side effects, such as interfering deeply with the 
server page cache, which may or may not be desirable.

On the other hand I also see value in an independent system-level external 
tool capable of a best effort checksum verification: the current check 
that the cluster is offline to prevent pg_verify_checksum from running is 
kind of artificial, and when online simply counting 
online-database-related checksum issues looks like a reasonable 
compromise.

So basically I think that allowing pg_verify_checksum to run on an online 
cluster is still a good thing, provided that expected errors are correctly 
handled.

-- 
Fabien.

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael.banck@credativ.de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > > 
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> > 
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> > 
> > I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> > pg_verify_checksums in tight loops) with the current patch version, and
> > I am seeing short reads very, very rarely (maybe every 1000th run) with
> > a warning like:
> > 
> > > 1174
> > > pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> > > pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> > > Files skipped: 2
> > 
> > The 1174 is the sequence number, the first 1173 runs of
> > pg_verify_checksums only skipped blocks.
> > 
> > However, the fact it shows two warnings for the same file means there is
> > something wrong here. It was continueing to the next block while I think
> > it should just skip to the next file on read failures. So I have changed
> > that now, new patch attached.
> 
> I'm confused- if previously it was continueing to the next block instead
> of doing the re-read on the same block, why don't we just change it to
> do the re-read on the same block properly and see if that fixes the
> retry, instead of just giving up and skipping..?  

It was re-reading the block and continuing to read the file after it
got a short read even on re-read.

> I'm not necessairly against skipping to the next file, to be clear,
> but I think I'd be happier if we kept reading the file until we
> actually get EOF.

So if we read half a block twice we should seek() to the next block and
continue till EOF, ok. I think in most cases those pages will be new
anyway and there will be no checksum check, but it sounds like a cleaner
approach. I've seen one or two examples where we did successfully verify
the checksum of a page after a half-read, so it might be worth it.
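
Roughly like this, I suppose (only a sketch with illustrative names and
without the checksum/LSN logic, not the actual patch):

#include "postgres_fe.h"
#include "storage/block.h"

#include <stdio.h>
#include <unistd.h>

/*
 * Sketch: a short read is retried once; if it is still short we warn,
 * skip to the next block boundary and keep scanning until EOF.
 */
static void
scan_file(int fd, const char *fn)
{
    char        buf[BLCKSZ];
    BlockNumber blockno = 0;
    bool        retried = false;

    for (;;)
    {
        ssize_t r = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);

        if (r == 0)
            break;                      /* real EOF, we are done */

        if (r != BLCKSZ)
        {
            if (!retried)
            {
                retried = true;         /* re-read the same block once */
                continue;
            }
            fprintf(stderr, "warning: skipping block %u in file \"%s\"\n",
                    blockno, fn);
        }
        else
        {
            /* full block read: verify its checksum here */
        }

        retried = false;
        blockno++;                      /* move on to the next block */
    }
}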

The alternative would be to just bail out early and skip the file on the
first short read and (possibly) log a skipped file.

I still think that an external checksum verification tool has some
merit, given that basebackup does it and the current offline requirement
is really not useful in practice.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Michael Paquier
Date:
On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
> You and Andres may be right that trying to verify checksums online
> without close interaction with the server is ultimately futile (or at
> least overly complex). But I'm not sure those issues (torn pages and
> partial reads) are very good arguments, considering basebackup has to
> deal with them too. Not sure.

FWIW, I don't think that the backend's current way of checking
checksums is right either, with warnings and only a limited set of
failures generated.  I unfortunately raised concerns about that only
after 11 had been GA'ed, which was too late, so this time, for this
patch, I prefer raising them before the fact and I'd rather not spread
this kind of methodology around the core code more and more.  I work a
lot with virtualization, and I have seen ESX hanging around I/O
requests from time to time depending on the environment used (which is
actually wrong, anyway, but a lot of tests happen on a daily basis on
the stuff I work on).  What's presented on this thread is *never*
going to be 100% safe, and would generate false positives which can be
confusing for the user.  This is not a good sign.
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Sun, Mar 03, 2019 at 11:51:48AM +0100, Michael Banck wrote:
> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
> is really not useful in practise.

I am not going to argue again about the way checksum verification is
done in a base backup..  :)

Being able to do an online verification of checksums has a lot of
value, do not take me wrong, and an SQL interface to do that does not
prevent having a frontend wrapper using it.
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> I agree that having a server function (extension?) to do a full checksum
> verification, possibly bandwidth-controlled, would be a good thing. However
> it would have side effects, such as interfering deeply with the server page
> cache, which may or may not be desirable.

In what is that different from VACUUM or a sequential scan?  It is
possible to use buffer ring replacement strategies in such cases using
the normal clock-sweep algorithm, so that scanning a range of pages
does not really impact Postgres shared buffer cache.
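
For instance, something like this on the server side (sketch only, the
relation is assumed to be already opened and locked by the caller):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/relcache.h"

/*
 * Sketch: scan a relation using a ring-buffer strategy so the pages read
 * for verification do not evict the regular shared buffer contents.
 */
static void
scan_with_ring_buffer(Relation rel, BlockNumber nblocks)
{
    BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                             RBM_NORMAL, strategy);

        /* ... inspect the page here ... */
        ReleaseBuffer(buf);
    }

    FreeAccessStrategy(strategy);
}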
--
Michael

Attachment

Re: Online verification of checksums

From
Fabien COELHO
Date:
Bonjour Michaël,

>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
>
> In what is that different from VACUUM or a sequential scan?

Scrubbing would read all files, not only relation data? I'm unsure about 
what VACUUM does, but it is probably pretty similar.

> It is possible to use buffer ring replacement strategies in such cases 
> using the normal clock-sweep algorithm, so that scanning a range of 
> pages does not really impact Postgres shared buffer cache.

Good! I did not know that there was an existing strategy to avoid filling 
the cache.

-- 
Fabien.

Re: Online verification of checksums

From
Magnus Hagander
Date:
On Mon, Mar 4, 2019, 04:10 Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> I agree that having a server function (extension?) to do a full checksum
> verification, possibly bandwidth-controlled, would be a good thing. However
> it would have side effects, such as interfering deeply with the server page
> cache, which may or may not be desirable.

In what is that different from VACUUM or a sequential scan?  It is
possible to use buffer ring replacement strategies in such cases using
the normal clock-sweep algorithm, so that scanning a range of pages
does not really impact Postgres shared buffer cache.


Yeah, I wouldn't worry too much about the effect on the postgres cache when that is done. It could of course have a much worse impact on the os cache or on the "smart" (aka dumb) storage system cache. But that effect will be there just as much with a separate tool. 

/Magnus 

Re: Online verification of checksums

From
Tomas Vondra
Date:

On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
> 
> In what is that different from VACUUM or a sequential scan?  It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact Postgres shared buffer cache.
> --

But Fabien was talking about the page cache, not shared buffers. And we
can't use a custom ring buffer there. OTOH I don't see why accessing the
file through a SQL function would behave any differently than direct
access (i.e. what the tool does now).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Tomas Vondra
Date:


On 3/4/19 2:00 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
>> You and Andres may be right that trying to verify checksums online
>> without close interaction with the server is ultimately futile (or at
>> least overly complex). But I'm not sure those issues (torn pages and
>> partial reads) are very good arguments, considering basebackup has to
>> deal with them too. Not sure.
> 
> FWIW, I don't think that the backend is right in its way of checking
> checksums the way it does currently either with warnings and a limited
> set of failures generated.  I raised concerns about that unfortunately
> after 11 has been GA'ed, which was too late, so this time, for this
> patch, I prefer raising them before the fact and I'd rather not spread
> this kind of methodology around the core code more and more.

I still don't understand what issue you see in how basebackup verifies
checksums. Can you point me to the explanation you've sent after 11 was
released?

> I work a lot with virtualization, and I have seen ESX hanging around
> I/O requests from time to time depending on the environment used
> (which is actually wrong, anyway, but a lot of tests happen on a
> daily basis on the stuff I work on).  What's presented on this thread
> is *never* going to be 100% safe, and would generate false positives
> which can be confusing for the user.  This is not a good sign.

So you have a workload/configuration that actually results in data
corruption yet we fail to detect that? Or we generate false positives?
Or what do you mean by "100% safe" here?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Magnus Hagander
Date:


On Mon, Mar 4, 2019 at 3:02 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:


On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
>
> In what is that different from VACUUM or a sequential scan?  It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact Postgres shared buffer cache.
> --

But Fabien was talking about page cache, not shared buffers. And we
can't use custom ring buffer there. OTOH I don't see why accessing the
file through SQL function would behave any differently than direct
access (i.e. what the tool does now).

It shouldn't.

One other thought that I had around this though, which if it's been covered before and I missed it, please disregard :)

The *online* version of the tool is very similar to running pg_basebackup to /dev/null, is it not? Except it doesn't set the cluster to backup mode. Perhaps what we really want is a simpler way to do *that*. That wouldn't necessarily make it a SQL callable function, but it would be a CLI tool that would call a command on a walsender for example.

(We'd of course still need the standalone tool for offline checks)
 
--

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
> I still don't understand what issue you see in how basebackup verifies
> checksums. Can you point me to the explanation you've sent after 11 was
> released?

The history is mostly on this thread:
https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz

> So you have a workload/configuration that actually results in data
> corruption yet we fail to detect that? Or we generate false positives?
> Or what do you mean by "100% safe" here?

What's proposed on this thread could generate false positives.  Checks
which have deterministic properties and clean failure handling are
reliable when it comes to reports.
--
Michael

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/5/19 4:12 AM, Michael Paquier wrote:
> On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
>> I still don't understand what issue you see in how basebackup verifies
>> checksums. Can you point me to the explanation you've sent after 11 was
>> released?
> 
> The history is mostly on this thread:
> https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz
> 

Thanks, will look.

Based on quickly skimming that thread the main issue seems to be
deciding which files in the data directory are expected to have
checksums. Which is a valid issue, of course, but I was expecting
something about partial read/writes etc.

>> So you have a workload/configuration that actually results in data
>> corruption yet we fail to detect that? Or we generate false positives?
>> Or what do you mean by "100% safe" here?
> 
> What's proposed on this thread could generate false positives.  Checks
> which have deterministic properties and clean failure handling are
> reliable when it comes to reports.

My understanding is that:

(a) The checksum verification should not generate false positives (same
as for basebackup).

(b) The partial reads do emit warnings, which might be considered false
positives I guess. Which is why I'm arguing for changing it to do the
same thing basebackup does, i.e. ignore this.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Michael Paquier
Date:
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.

I remember complaining about partial write handling as well for the
base backup checks...  There should be an email about it on the list,
cannot find it now ;p

> My understanding is that:
>
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
>
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.

Well, at least that's consistent...  Argh, I really think that we
ought to report such failures as hard errors rather than warnings,
because that's easier to detect within a tool, and some deployments set
log_min_messages > WARNING so checksum failures would just be lost.  For
base backups we
don't care much about that as files are just blindly copied so they
could have torn pages, which is fine as that's fixed at replay.  Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Tue, Mar 5, 2019 at 18:36 Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.

I remember complaining about partial write handling as well for the
base backup checks...  There should be an email about it on the list,
cannot find it now ;p

> My understanding is that:
>
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
>
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.

Well, at least that's consistent...  Argh, I really think that we
ought to make the failures reported harder because that's easier to
detect within a tool and some deployments set log_min_messages >
WARNING so checksum failures would just be lost.  For base backups we
don't care much about that as files are just blindly copied so they
could have torn pages, which is fine as that's fixed at replay.  Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.

I’m traveling but will try to comment more in the coming days, but in general I agree with Tomas on these items. Also, pg_basebackup has to handle torn pages when it comes to checksums just like the verify tool does, and having them be consistent (along with external tools) would really be for the best, imv.  I still feel like a retry of a short read (try reading more to get the whole page..) would be alright, and reading until we hit EOF and then moving on. I’m not sure it’s possible but I do worry a bit that we might get a short read from a network file system or something that isn’t actually at EOF and then we would skip a significant remaining portion of the file...   another thought might be to stat the file after we have opened it to see its length...

Just a few thoughts since I’m on my phone.  Will try to write up something more in a day or two. 

Thanks!

Stephen

Re: Online verification of checksums

From
Robert Haas
Date:
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the fist 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I see the distinction you're making, and you're right.  The problem
is, whether in this case or whether for a real torn page, we don't
seem to have a way to distinguish between a state that occurs
transiently due to lack of synchronization and a situation that is
permanent and means that we have corruption.  And that worries me,
because it means we'll either report bogus complaints that will scare
easily-panicked users (and anybody who is running this tool has a good
chance of being in the "easily-panicked" category ...), or else we'll
skip reporting real problems.  Neither is good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Robert Haas
Date:
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael.banck@credativ.de> wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?

Yeah.  Consider a paranoid customer with 100 clusters who runs this
every day on every cluster.  They're going to see failures every day
or three and go ballistic.

I suspect that better retry logic might help here.  I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems.  But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.
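
To make that concrete, a minimal sketch of such a retry policy (the retry count, delay, and the read_and_verify_block() helper are assumed names, not part of any posted patch):

  #include <stdbool.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define MAX_RETRIES    10
  #define RETRY_DELAY_US 1000000      /* one second between attempts */

  /* assumed helper: rereads the block and returns true if it verifies */
  extern bool read_and_verify_block(int fd, off_t offset, unsigned int blockno);

  static bool
  verify_block_with_retries(int fd, off_t offset, unsigned int blockno)
  {
      /* one initial attempt plus up to MAX_RETRIES rereads */
      for (int attempt = 0; attempt <= MAX_RETRIES; attempt++)
      {
          if (read_and_verify_block(fd, offset, blockno))
              return true;            /* checksum ok, or block legitimately skipped */
          if (attempt < MAX_RETRIES)
              usleep(RETRY_DELAY_US); /* wait before rereading */
      }
      return false;                   /* persistent mismatch: report it */
  }

Only a mismatch that persists across all attempts would then be reported as corruption.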

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Andres Freund
Date:
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
> > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael.banck@credativ.de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> 
> Yeah.  Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster.  They're going to see failures every day
> or three and go ballistic.

+1


> I suspect that better retry logic might help here.  I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems.  But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/6/19 6:26 PM, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> FWIW I don't think this qualifies as torn page - i.e. it's not a full
>> read with a mix of old and new data. This is partial write, most likely
>> because we read the blocks one by one, and when we hit the last page
>> while the table is being extended, we may only see the first 4kB. And if
>> we retry very fast, we may still see only the first 4kB.
> 
> I see the distinction you're making, and you're right.  The problem
> is, whether in this case or whether for a real torn page, we don't
> seem to have a way to distinguish between a state that occurs
> transiently due to lack of synchronization and a situation that is
> permanent and means that we have corruption.  And that worries me,
> because it means we'll either report bogus complaints that will scare
> easily-panicked users (and anybody who is running this tool has a good
> chance of being in the "easily-panicked" category ...), or else we'll
> skip reporting real problems.  Neither is good.
> 

Sure, I'd also prefer having a tool that reliably detects all cases of
data corruption, and I certainly do share your concerns about false
positives and false negatives.

But maybe we shouldn't expect a tool meant to verify checksums to detect
various other issues.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Tomas Vondra
Date:

On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
>>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
>>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>>> <michael.banck@credativ.de> wrote:
>>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>>> that point we bail out with exit(1), and the user can try again, which I
>>>>> think is fine?
>>>>
>>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>>> robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
>>> on. Otherwise, we now just issue a warning and skip the file (or so was
>>> the idea, see below), do you think that is not acceptable?
>>
>> Yeah.  Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster.  They're going to see failures every day
>> or three and go ballistic.
> 
> +1
> 
> 
>> I suspect that better retry logic might help here.  I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems.  But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
> 
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
> 

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work I think - both for partial reads and torn pages.

Not sure how to integrate it into the CLI tool, though. Perhaps it
could require connection info so that it can execute a function, when
executed in online mode?

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
> Not sure how to integrate it into the CLI tool, though. Perhaps it
> could require connection info so that it can execute a function, when
> executed in online mode?

To me the right fix would be to simply have this run as part of the
cluster / in a function. I don't see much point in running this outside
of the cluster.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/6/19 8:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
>> Not sure how to integrate it into the CLI tool, though. Perhaps it
>> could require connection info so that it can execute a function, when
>> executed in online mode?
> 
> To me the right fix would be to simply have this run as part of the
> cluster / in a function. I don't see much point in running this outside
> of the cluster.
> 

Not sure. AFAICS that would require a single transaction, and if we
happen to add some sort of throttling (which is a feature request I'd
expect pretty soon to make it usable on live clusters) that might be
quite long-running. So, not great.

If we want to run it from the server itself, then I guess a background
worker would be a better solution. Incidentally, that's something I've
been toying with some time ago, see [1].


[1] https://github.com/tvondra/scrub

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Michael Paquier
Date:
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote:
> Not sure. AFAICS that would require a single transaction, and if we
> happen to add some sort of throttling (which is a feature request I'd
> expect pretty soon to make it usable on live clusters) that might be
> quite long-running. So, not great.
>
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

It does not prevent having a SQL function which acts as a wrapper on
top of the whole routine logic, does it?  I think that it would be
nice to have the possibility to target a specific relation and a
specific page, as well as being able to check fully a relation at
once.  It gets easier to check for page ranges this way, and the
throttling can be part of the function doing a full-relation check.
--
Michael

Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/6/19 6:42 PM, Andres Freund wrote:
 >
> ...
 >
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
> 

Actually, what do you mean by "IO lock the page"? Just waiting for the 
current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially 
acquiring a lock and holding it for the duration of the check?

The former does not really help, because there might be another I/O 
request initiated right after, interfering with the retry.

The latter might work, assuming the check is fast (which it probably 
is). I wonder if this might cause issues due to loading possibly 
corrupted data (with invalid checksums) into shared buffers. But then 
again, we could just hack a special version of ReadBuffer_common() which 
would just

(a) check if a page is in shared buffers, and if it is then consider the 
checksum correct (because in memory it may be stale, and it was read 
successfully so it was OK at that moment)

(b) if it's not in shared buffers already, try reading it and verify the 
checksum, and then just evict it right away (not to spoil sb)

Or did you have something else in mind?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> On 3/6/19 6:42 PM, Andres Freund wrote:
> >
> > ...
> >
> > To me the right way seems to be to IO lock the page via PG after such a
> > failure, and then retry. Which should be relatively easily doable for
> > the basebackup case, but obviously harder for the pg_verify_checksums
> > case.
> > 
> 
> Actually, what do you mean by "IO lock the page"? Just waiting for the
> current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially
> acquiring a lock and holding it for the duration of the check?

The latter. And with IO lock I meant BufferDescriptorGetIOLock(), in
contrast to a buffer's content lock. That way we wouldn't block
modifications to the in-memory page.


> The former does not really help, because there might be another I/O request
> initiated right after, interfering with the retry.
> 
> The latter might work, assuming the check is fast (which it probably is). I
> wonder if this might cause issues due to loading possibly corrupted data
> (with invalid checksums) into shared buffers.

Oh, I was basically thinking that we'd just reread from disk outside of
postgres in that case, while preventing postgres related IO by holding
the IO lock.

But:

> But then again, we could just
> hack a special version of ReadBuffer_common() which would just

> (a) check if a page is in shared buffers, and if it is then consider the
> checksum correct (because in memory it may be stale, and it was read
> successfully so it was OK at that moment)
> 
> (b) if it's not in shared buffers already, try reading it and verify the
> checksum, and then just evict it right away (not to spoil sb)

This'd also make sense and make the whole process more efficient. OTOH,
it might actually be worthwhile to check the on-disk page even if
there's in-memory state. Unless IO is in progress the on-disk page
always should be valid.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Sonntag, den 03.03.2019, 11:51 +0100 schrieb Michael Banck:
> Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost:
> > I'm not necessairly against skipping to the next file, to be clear,
> > but I think I'd be happier if we kept reading the file until we
> > actually get EOF.
> 
> So if we read half a block twice we should seek() to the next block and
> continue till EOF, ok. I think in most cases those pages will be new
> anyway and there will be no checksum check, but it sounds like a cleaner
> approach. I've seen one or two examples where we did successfully verify
> the checksum of a page after a half-read, so it might be worth it.

I've done that now, i.e. it seeks to the next block and continues to
read there (possibly getting an EOF).

I don't issue a warning for this skipped block anymore as it is somewhat
to be expected that we see some half-reads. If the seek fails for some
reason, that is still a warning.

> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
> is really not useful in practice.

I've read the rest of the thread, and it seems several people prefer a
solution that interacts with the server. I won't be able to work on that
for v12 and I guess it would be too late in the cycle anyway.

I thought about I/O throttling in online mode, but it seems to be most
easily tied in with the progress reporting (that already keeps track of
everything or most of what we'd need), so I will work on it in that
context.



Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Julien Rouhaud
Date:
On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >
> > But then again, we could just
> > hack a special version of ReadBuffer_common() which would just
>
> > (a) check if a page is in shared buffers, and if it is then consider the
> > checksum correct (because in memory it may be stale, and it was read
> > successfully so it was OK at that moment)
> >
> > (b) if it's not in shared buffers already, try reading it and verify the
> > checksum, and then just evict it right away (not to spoil sb)
>
> This'd also make sense and make the whole process more efficient. OTOH,
> it might actually be worthwhile to check the on-disk page even if
> there's in-memory state. Unless IO is in progress the on-disk page
> always should be valid.

Definitely.  I already saw servers with all-frozen-read-only blocks
popular enough to never get evicted in months, and then a minor
upgrade / restart having catastrophic consequences.


Re: Online verification of checksums

From
Tomas Vondra
Date:
On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
>>>
>>> But then again, we could just
>>> hack a special version of ReadBuffer_common() which would just
>>
>>> (a) check if a page is in shared buffers, and if it is then consider the
>>> checksum correct (because in memory it may be stale, and it was read
>>> successfully so it was OK at that moment)
>>>
>>> (b) if it's not in shared buffers already, try reading it and verify the
>>> checksum, and then just evict it right away (not to spoil sb)
>>
>> This'd also make sense and make the whole process more efficient. OTOH,
>> it might actually be worthwhile to check the on-disk page even if
>> there's in-memory state. Unless IO is in progress the on-disk page
>> always should be valid.
> 
> Definitely.  I already saw servers with all-frozen-read-only blocks
> popular enough to never get evicted in months, and then a minor
> upgrade / restart having catastrophic consequences.
> 

Do I understand correctly the "catastrophic consequences" here are due
to data corruption / broken checksums on those on-disk pages?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

From
Julien Rouhaud
Date:
On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely.  I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?

Ah, yes, sorry, I should have been clearer.  Indeed, there were silent
data corruptions (no checksums though) that were revealed by the
restart.  So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass the checksum check
for pages that are already in shared_buffers.


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 3/2/19 12:03 AM, Robert Haas wrote:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael.banck@credativ.de> wrote:
> >> I have added a retry for this as well now, without a pg_sleep() as well.
> >> This catches around 80% of the half-reads, but a few slip through. At
> >> that point we bail out with exit(1), and the user can try again, which I
> >> think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> > while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I really still am not following why this is such an issue- we do a read,
get back 4KB, do another read, check if it's zero, and if so then we
should be able to conclude that we're at the end of the file, no?  If
we're at the end of the file and we don't have a final complete block to
run a checksum check on then it seems clear to me that the file was
being extended and it's ok to skip that block.  We could also stat the
file and keep track of where we are, to detect such an extension of the
file happening, if we wanted an additional cross-check, couldn't we?  If
we do a read and get 4KB back and then do another and get 4KB back, then
we just treat it like we would an 8KB block.  Really, as long as a
subsequent read is returning bytes then we keep going, and if it returns
zero then it's EOF.  I could maybe see a "one final read" option, but I
don't think it makes sense to have some kind of time-based delay around
this where we keep trying to read.

All of this about hacking up a way to connect to PG and lock pages in
shared buffers so that we can perform a checksum check seems really
rather ridiculous for either the extension case or the regular mid-file
torn-page case.

To be clear, I agree completely that we don't want to be reporting false
positives or "this might mean corruption!" to users running the tool,
but I haven't seen a good explanation of why this needs to involve the
server to avoid that happening.  If someone would like to point that out
to me, I'd be happy to go read about it and try to understand.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

So, I'm a big fan of this idea of having a background worker that's
running and (slowly, maybe configurably) scanning through the data
directory checking for corrupted pages.  I'd certainly prefer it if that
background worker didn't fault those pages into shared buffers though,
and I don't really think it should need to even check if a given page is
currently being written out or is presently in shared buffers.
Basically, I'd think it would work just fine to have it essentially do
what I am imagining pg_checksums to do, but as a background worker.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> To be clear, I agree completely that we don't want to be reporting false
> positives or "this might mean corruption!" to users running the tool,
> but I haven't seen a good explanation of why this needs to involve the
> server to avoid that happening.  If someone would like to point that out
> to me, I'd be happy to go read about it and try to understand.

The mentions on this thread that the server has all the facility in
place to properly lock a buffer and make sure that a partial read
*never* happens and that we *never* have any kind of false positives,
directly preventing the set of issues we are trying to implement
workarounds for in a frontend tool are rather good arguments in my
opinion (you can grep for BufferDescriptorGetIOLock() on this thread
for example).
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > To be clear, I agree completely that we don't want to be reporting false
> > positives or "this might mean corruption!" to users running the tool,
> > but I haven't seen a good explanation of why this needs to involve the
> > server to avoid that happening.  If someone would like to point that out
> > to me, I'd be happy to go read about it and try to understand.
>
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,

Uh, we are, of course, going to have partial reads- we just need to
handle them appropriately, and that's not hard to do in a way that we
never have false positives.

I do not understand, at all, the whole sub-thread argument that we have
to avoid partial reads.  We certainly don't worry about that when doing
backups, and I don't see why we need to avoid it here.  We are going to
have partial reads- and that's ok, as long as it's because we're at the
end of the file, and that's easy enough to check by just doing another
read to see if we get back zero bytes, which indicates we're at the end
of the file, and then we move on, no need to coordinate anything with
the backend for this.

> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).

Sure the backend has those facilities since it needs to, but these
frontend tools *don't* need that to *never* have any false positives, so
why are we complicating things by saying that this frontend tool and the
backend have to coordinate?

If there's an explanation of why we can't avoid having false positives
in the frontend tool, I've yet to see it.  I definitely understand that
we can get partial reads, but a partial read isn't a failure, and
shouldn't be reported as such.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> * Michael Paquier (michael@paquier.xyz) wrote:
> > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > To be clear, I agree completely that we don't want to be reporting false
> > > positives or "this might mean corruption!" to users running the tool,
> > > but I haven't seen a good explanation of why this needs to involve the
> > > server to avoid that happening.  If someone would like to point that out
> > > to me, I'd be happy to go read about it and try to understand.
> > 
> > The mentions on this thread that the server has all the facility in
> > place to properly lock a buffer and make sure that a partial read
> > *never* happens and that we *never* have any kind of false positives,
> 
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

I think the current patch (V13 from https://www.postgresql.org/message-i
d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs.

> I do not understand, at all, the whole sub-thread argument that we have
> to avoid partial reads.  We certainly don't worry about that when doing
> backups, and I don't see why we need to avoid it here.  We are going to
> have partial reads- and that's ok, as long as it's because we're at the
> end of the file, and that's easy enough to check by just doing another
> read to see if we get back zero bytes, which indicates we're at the end
> of the file, and then we move on, no need to coordinate anything with
> the backend for this.

Well, I agree with you, but we don't seem to have consensus on that.

> > directly preventing the set of issues we are trying to implement
> > workarounds for in a frontend tool are rather good arguments in my
> > opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> > for example).
> 
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
> 
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

It is not reported as a failure in the current patch; it should just get
reported as a skipped block in the end.  That is, if the cluster is
online; if it is offline, we do consider it a failure.

I have now rebased that patch on top of the pg_verify_checksums ->
pg_checksums renaming, see attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > * Michael Paquier (michael@paquier.xyz) wrote:
> > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > To be clear, I agree completely that we don't want to be reporting false
> > > > positives or "this might mean corruption!" to users running the tool,
> > > > but I haven't seen a good explanation of why this needs to involve the
> > > > server to avoid that happening.  If someone would like to point that out
> > > > to me, I'd be happy to go read about it and try to understand.
> > >
> > > The mentions on this thread that the server has all the facility in
> > > place to properly lock a buffer and make sure that a partial read
> > > *never* happens and that we *never* have any kind of false positives,
> >
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
>
> I think the current patch (V13 from https://www.postgresql.org/message-i
> d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs.

I think the question here is- do you ever see false positives with this
latest version..?  If you are, then that's an issue and we should
discuss and try to figure out what's happening.  If you aren't seeing
false positives, then it seems like we're done here, right?

> > I do not understand, at all, the whole sub-thread argument that we have
> > to avoid partial reads.  We certainly don't worry about that when doing
> > backups, and I don't see why we need to avoid it here.  We are going to
> > have partial reads- and that's ok, as long as it's because we're at the
> > end of the file, and that's easy enough to check by just doing another
> > read to see if we get back zero bytes, which indicates we're at the end
> > of the file, and then we move on, no need to coordinate anything with
> > the backend for this.
>
> Well, I agree with you, but we don't seem to have consensus on that.

I feel like everyone is concerned that we'd report an acceptable partial
read as a failure, hence it would be a false positive, and I agree
entirely that we don't want false positives, but the answer to that
seems to be that we shouldn't report partial reads as failures, solving
the problem in a simple way that doesn't involve the server and doesn't
materially reduce the check that's being performed.

> > > directly preventing the set of issues we are trying to implement
> > > workarounds for in a frontend tool are rather good arguments in my
> > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> > > for example).
> >
> > Sure the backend has those facilities since it needs to, but these
> > frontend tools *don't* need that to *never* have any false positives, so
> > why are we complicating things by saying that this frontend tool and the
> > backend have to coordinate?
> >
> > If there's an explanation of why we can't avoid having false positives
> > in the frontend tool, I've yet to see it.  I definitely understand that
> > we can get partial reads, but a partial read isn't a failure, and
> > shouldn't be reported as such.
>
> It is not reported as a failure in the current patch; it should just get
> reported as a skipped block in the end.  That is, if the cluster is
> online; if it is offline, we do consider it a failure.

Ok, that sounds fine- and do we ever see false positives now?

> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Thanks for that.  Reading through the code though, I don't entirely
understand why we're making things complicated for ourselves by trying
to seek and re-read the entire block, specifically this:

>          if (r != BLCKSZ)
>          {
> -            fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
> -                    progname, blockno, fn, r, BLCKSZ);
> -            exit(1);
> +            if (online)
> +            {
> +                if (block_retry)
> +                {
> +                    /* We already tried once to reread the block, skip to the next block */
> +                    skippedblocks++;
> +                    if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
> +                    {
> +                        skippedfiles++;
> +                        fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
> +                                progname, fn);
> +                        return;
> +                    }
> +                    continue;
> +                }
> +
> +                /*
> +                 * Retry the block. It's possible that we read the block while it
> +                 * was extended or shrunk, so it ends up looking torn to us.
> +                 */
> +
> +                /*
> +                 * Seek back by the amount of bytes we read to the beginning of
> +                 * the failed block.
> +                 */
> +                if (lseek(f, -r, SEEK_CUR) == -1)
> +                {
> +                    skippedfiles++;
> +                    fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
> +                            progname, fn);
> +                    return;
> +                }
> +
> +                /* Set flag so we know a retry was attempted */
> +                block_retry = true;
> +
> +                /* Reset loop to validate the block again */
> +                blockno--;
> +
> +                continue;
> +            }

I would think that we could just do:

  insert_location = 0;
  r = read(BLCKSIZE - insert_location);
  if (r < 0) error();
  if (r == 0) EOF detected, move to next
  if (r < (BLCKSIZE - insert_location)) {
    insert_location += r;
    continue;
  }

At this point, we should have a full block, do our checks...

Have you seen cases where the kernel will actually return a partial read
for something that isn't at the end of the file, and where you could
actually lseek past that point and read the next block?  I'd be really
curious to see that if you can reproduce it...  I've definitely seen
empty pages come back with a claim that the full amount was read, but
that's a very different thing.

Obviously the same goes for anywhere else we're trying to handle a
partial read return from...
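
As for the empty (all-zero) pages mentioned above: those can simply be skipped rather than checksummed, since a freshly extended page has no checksum yet. A minimal sketch of that test, assuming an 8kB BLCKSZ:

  #include <stdbool.h>

  #define BLCKSZ 8192

  /*
   * Sketch: a page that reads back as all zeroes has just been allocated
   * and carries no checksum yet, so a verification tool can skip it instead
   * of flagging it as a failure.
   */
  static bool
  page_is_all_zero(const char *page)
  {
      for (int i = 0; i < BLCKSZ; i++)
          if (page[i] != 0)
              return false;
      return true;
  }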

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

Ere, my apologies here.  I meant the read of a torn page, not a
partial read (when extending the relation file we have locks
preventing a partial read as well, by the way).
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
>
> Ere, my apologies here.  I meant the read of a torn page, not a

In the case of a torn page, we should be able to check the LSN, as
discussed extensively previously, and if the LSN is from after the
checkpoint we started at then we should be fine to skip the page.

> partial read (when extending the relation file we have locks
> preventing a partial read as well, by the way).

Yes, we do, in the backend...  We don't have (nor do we need) to get
involved in those locks for these tools though..

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Montag, den 18.03.2019, 08:18 +0100 schrieb Michael Banck:
> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Sorry, I had missed some hunks in the TAP tests, fixed-up patch
attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi.

Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > * Michael Paquier (michael@paquier.xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting false
> > > > > positives or "this might mean corruption!" to users running the tool,
> > > > > but I haven't seen a good explanation of why this needs to involve the
> > > > > server to avoid that happening.  If someone would like to point that out
> > > > > to me, I'd be happy to go read about it and try to understand.
> > > > 
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > > 
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> > 
> > I think the current patch (V13 from https://www.postgresql.org/message-i
> > d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs.
> 
> I think the question here is- do you ever see false positives with this
> latest version..?  If you are, then that's an issue and we should
> discuss and try to figure out what's happening.  If you aren't seeing
> false positives, then it seems like we're done here, right?

What do you mean with false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
cause it only read half of a block or a torn page.

I do see sporadic partial reads and they get treated by the re-check
logic and (if that is not enough) get tallied up as a skipped block in
the end.  Is that a false positive in your book?

[...]

> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
> 
> Thanks for that.  Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:

[...]

> I would think that we could just do:
> 
>   insert_location = 0;
>   r = read(BLCKSIZE - insert_location);
>   if (r < 0) error();
>   if (r == 0) EOF detected, move to next
>   if (r < (BLCKSIZE - insert_location)) {
>     insert_location += r;
>     continue;
>   }
> 
> At this point, we should have a full block, do our checks...

Well, we need to read() into some buffer which you have omitted.

So if we had a short read, and then read the rest of the block via
(BLCKSIZE - insert_location) wouldn't we have to read that in a second
buffer and then join the two in order to compute the checksum?  That
does not sound simpler to me than just re-reading the block entirely.

> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block?  I'd be really
> curious to see that if you can reproduce it...  I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.

Well, I've seen partial reads and I have seen very rarely that it will
continue to read another block afterwards.  If the relation is being
extended while we check it, it sounds plausible that another block could
be written before we get to read EOF on the next read() after a partial
read() so that does not sound like a bug to me either.

I might be misunderstanding your question though?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote:
Hi.

Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > * Michael Paquier (michael@paquier.xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting false
> > > > > positives or "this might mean corruption!" to users running the tool,
> > > > > but I haven't seen a good explanation of why this needs to involve the
> > > > > server to avoid that happening.  If someone would like to point that out
> > > > > to me, I'd be happy to go read about it and try to understand.
> > > >
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > >
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> >
> > I think the current patch (V13 from https://www.postgresql.org/message-i
> > d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs.
>
> I think the question here is- do you ever see false positives with this
> latest version..?  If you are, then that's an issue and we should
> discuss and try to figure out what's happening.  If you aren't seeing
> false positives, then it seems like we're done here, right?

What do you mean with false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
cause it only read half of a block or a torn page.

I do see sporadic partial reads and they get treated by the re-check
logic and (if that is not enough) get tallied up as a skipped block in
the end.  Is that a false positive in your book?

No, that’s clearly not a false positive.

[...]

> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
>
> Thanks for that.  Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:

[...]

> I would think that we could just do:
>
>   insert_location = 0;
>   r = read(BLCKSIZE - insert_location);
>   if (r < 0) error();
>   if (r == 0) EOF detected, move to next
>   if (r < (BLCKSIZE - insert_location)) {
>     insert_location += r;
>     continue;
>   }
>
> At this point, we should have a full block, do our checks...

Well, we need to read() into some buffer which you have omitted.

Surely there’s a buffer that the read in the existing code is passing in; you just need to offset by the current pointer. Sorry for not being clear.

In other words the read would look more like:

read(fd,buf + insert_ptr, BUFSZ - insert_ptr)

And then you have to reset insert_ptr once you have a full block.

So if we had a short read, and then read the rest of the block via
(BLCKSIZE - insert_location) wouldn't we have to read that in a second
buffer and then join the two in order to compute the checksum?  That
does not sound simpler to me than just re-reading the block entirely.

No, just read into your existing buffer at the point where the prior partial read left off...
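
Spelled out, that loop would look roughly like the sketch below (the read_one_block() name is illustrative, not from the patch; BLCKSZ is the usual 8kB page size):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define BLCKSZ 8192

  /*
   * Sketch: read one block at the current file position, continuing into the
   * same buffer after a partial read.  Returns 1 if a full block was
   * assembled, 0 on a clean EOF (a trailing partial block, e.g. while the
   * relation is still being extended), and exits on a real read error.
   */
  static int
  read_one_block(int fd, char *buf, unsigned int blockno, const char *fn)
  {
      int         insert_ptr = 0;

      while (insert_ptr < BLCKSZ)
      {
          ssize_t     r = read(fd, buf + insert_ptr, BLCKSZ - insert_ptr);

          if (r < 0)
          {
              fprintf(stderr, "could not read block %u in file \"%s\"\n",
                      blockno, fn);
              exit(1);
          }
          if (r == 0)
              break;              /* EOF reached before the block completed */
          insert_ptr += r;
      }
      return insert_ptr == BLCKSZ;
  }

A return of 1 means the caller has a complete block in buf and can verify its checksum; a return of 0 means the trailing partial block gets skipped.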

> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block?  I'd be really
> curious to see that if you can reproduce it...  I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.

Well, I've seen partial reads and I have seen very rarely that it will
continue to read another block afterwards.  If the relation is being
extended while we check it, it sounds plausible that another block could
be written before we get to read EOF on the next read() after a partial
read() so that does not sound like a bug to me either.

Right, absolutely you can have a partial read during a relation extension and then come back around and do another read and discover more data, that’s entirely reasonable and I’ve seen it happen too.

I might be misunderstanding your question though?

Yes, the question was more like this: have you ever seen a read return a partial result when you know you’re in the middle somewhere of an existing file and the length of the file hasn’t been changed by something else..?  I can’t say that I have, when reading from regular files, even in kernel-error type of conditions due to hardware issues, but I’m open to being told I’m wrong...  in such a case though I would still expect an error on a subsequent read, which would work just fine for our case. If the kernel just decides to return a zero in that case then I don’t know that there’s really anything we can do about that because that seems like it would be pretty clearly broken results from the kernel and that’s out of scope for this.

Apologies if this isn’t clear, on my phone now. 

Thanks!

Stephen

Re: Online verification of checksums

From
Robert Haas
Date:
On Mon, Mar 18, 2019 at 2:06 AM Michael Paquier <michael@paquier.xyz> wrote:
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,
> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).

Yeah, exactly.  It may be that there is a good way to avoid those
issues without interacting with the server and that would be nice, but
... as far as I can see, nobody's figured out a way that's reliable
yet, and all of the solutions proposed so far basically amount to
"let's ignore things that might be serious problems because they might
be transient" and/or "let's retry and see if the problem goes away."
I'm more sanguine about a retry-based solution than an
ignore-possible-problems solution, but what's been proposed so far
seems quite prone to retrying so fast that it makes no difference, and
it's not clear how much code complexity we'd have to add to do better
or how reliable it would be even then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that.  Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> > 
> > [...]
> > 
> > > I would think that we could just do:
> > > 
> > >   insert_location = 0;
> > >   r = read(BLCKSIZE - insert_location);
> > >   if (r < 0) error();
> > >   if (r == 0) EOF detected, move to next
> > >   if (r < (BLCKSIZE - insert_location)) {
> > >     insert_location += r;
> > >     continue;
> > >   }
> > > 
> > > At this point, we should have a full block, do our checks...
> > 
> > Well, we need to read() into some buffer which you have omitted.
> 
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
> 
> In other words the read would look more like:
> 
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> 
> And then you have to reset insert_ptr once you have a full block.

Ok, thanks for clearing that up.

I've tried to do that now in the attached, does that suit you?

> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?

I don't think I've seen that, but that wouldn't turn up in regular
testing anyway, I guess, only in pathological cases?  I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.

I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:

+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.

Wording improvements welcome.


Michael


-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Tue, Mar 19, 2019 at 04:15 Michael Banck <michael.banck@credativ.de> wrote:
Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that.  Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> >
> > [...]
> >
> > > I would think that we could just do:
> > >
> > >   insert_location = 0;
> > >   r = read(BLCKSIZE - insert_location);
> > >   if (r < 0) error();
> > >   if (r == 0) EOF detected, move to next
> > >   if (r < (BLCKSIZE - insert_location)) {
> > >     insert_location += r;
> > >     continue;
> > >   }
> > >
> > > At this point, we should have a full block, do our checks...
> >
> > Well, we need to read() into some buffer which you have ommitted.
>
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
>
> In other words the read would look more like:
>
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
>
> And then you have to reset insert_ptr once you have a full block.

Ok, thanks for clearing that up.

I've tried to do that now in the attached, does that suit you?

Yes, that’s what I was thinking.  I’m honestly not entirely convinced that the lseek() efforts still need to be put in; I would have thought it’d be fine to simply check the LSN on a checksum failure and mark it as skipped if the LSN is past the current checkpoint.  That seems like it would make things much simpler, but I’m also not against keeping that logic now that it’s in, provided it doesn’t cause issues.

> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?

I don't think I've seen that, but that wouldn't turn up in regular
testing anyway, I guess, only in pathological cases?  I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.

Yeah, a lot of things in this area are unfortunately difficult to test.  I’m glad to hear that it doesn’t sound like you’ve seen it though. 

I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:

+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.

I would flip this around:

——-
In an online cluster, pages are being concurrently written to the files while the check is being run, leading to possible torn pages or partial reads.  When the tool detects a concurrently written page, indicated by the page’s LSN being beyond the checkpoint the tool started at, that page will be reported as skipped.  Note that in a crash scenario, any pages written since the last checkpoint will be replayed from the WAL.
——-

Now here’s the $64 question- have you tested this latest version under load..?  If not, could you?  And when you do, can you report back what the results are?  Do you still see any actual checksum failures?  Do the number of skipped pages seem reasonable in your tests or is there a concern there?

If you still see actual checksum failures which aren’t because the LSN is higher than the checkpoint, or because of a short read, then we need to investigate further but hopefully that isn’t happening now.  I think a lot of the concerns raised on this thread about wanting to avoid false positives are because the torn page (with higher LSN than current checkpoint) and short read cases were previously reported as failures when they really are expected.  Let’s test this as much as we can and make sure we aren’t seeing false positives anymore.

Thanks!

Stephen

Re: Online verification of checksums

From
Robert Haas
Date:
On Mon, Mar 18, 2019 at 2:38 AM Stephen Frost <sfrost@snowman.net> wrote:
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

I think there's some confusion between 'partial read' and 'torn page',
as Michael also said.

It's torn pages that I am concerned about - the server is writing and
we are reading, and we get a mix of old and new content.  We have been
quite diligent about protecting ourselves from such risks elsewhere,
and checksum verification should not be held to any lesser standard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> It's torn pages that I am concerned about - the server is writing and
> we are reading, and we get a mix of old and new content.  We have been
> quite diligent about protecting ourselves from such risks elsewhere,
> and checksum verification should not be held to any lesser standard.

If we see a checksum failure on an otherwise correctly read block in
online mode, we retry the block on the theory that we might have read a
torn page.  If the checksum verification still fails, we compare its LSN
to the LSN of the current checkpoint and don't mind if it's newer.  This
way, a torn page should not cause a false positive either way, I think.
 If it is a genuine storage failure we will see it in the next
pg_checksums run as its LSN will be older than the checkpoint.  The
basebackup checksum verification works in the same way.
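
To spell out the flow, roughly (a minimal sketch, not the patch itself;
reread_block() and checkpoint_lsn are placeholders for the tool's actual
file re-read and the value taken from pg_control):

static bool
block_ok_or_skippable(char *page, BlockNumber blkno,
                      XLogRecPtr checkpoint_lsn, int fd)
{
    PageHeader  phdr = (PageHeader) page;

    if (PageIsNew(page))
        return true;        /* new pages carry no checksum, handled separately */

    if (pg_checksum_page(page, blkno) == phdr->pd_checksum)
        return true;        /* fine on the first read */

    /* Possibly a torn read while the server was writing: re-read once. */
    reread_block(fd, blkno, page);

    if (PageIsNew(page) ||
        pg_checksum_page(page, blkno) == phdr->pd_checksum)
        return true;        /* fine on recheck */

    /*
     * Still failing.  If the page was (re-)written after the checkpoint we
     * started from, skip it: WAL replay would reinstate it after a crash.
     * Anything older is a genuine checksum failure.
     */
    return PageGetLSN(page) > checkpoint_lsn;
}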

I am happy to look into further options for how to make things better,
but I am not sure what the actual problem might be that you mention
above. I will see whether I can stress-test the patch a bit more, but
I've already taxed the SSD on my company notebook quite a bit during the
development of this, so I will see whether I can get some real server
hardware somewhere.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content.  We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
> 
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page.  If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if it's newer.  This
> way, a torn page should not cause a false positive either way I
> think.

False positives, no. But there's plenty potential for false
negatives. In plenty clusters a large fraction of the pages is going to
be touched in most checkpoints.


>  If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.

Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...


> The basebackup checksum verification works in the same way.

Shouldn't have been merged that way.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content.  We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
>
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page.  If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if it's newer.  This
> way, a torn page should not cause a false positive either way I
> think.

False positives, no. But there's plenty potential for false
negatives. In plenty clusters a large fraction of the pages is going to
be touched in most checkpoints.

How is it a false negative?  The page was in the middle of being written: if we crash, the page won't be used because it'll get replaced by WAL replay from the last checkpoint, and if we don't crash, then it also won't be used until it's been written out completely.  I don't agree that this is in any way a false negative; it's simply a page that happens to be in the middle of a write that we can skip, because it isn't going to be used in that state. It's not like there's going to be a checksum failure if the backend reads it.

Not only that, but checksum failures and the like are much more likely to happen on long-dormant data, not on data that's actively being written out and therefore is still in the Linux FS cache and hasn't even hit actual storage yet anyway.

>  If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.

Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...

If it's a backup you just made, then that page is going to be in the WAL and the torn page on disk isn't going to be used, so how is this an issue?  This is why we have WAL: to deal with torn pages.

> The basebackup checksum verification works in the same way.

Shouldn't have been merged that way.

I have a hard time not finding this offensive.  These issues were considered, discussed, and well thought out, with the result being committed after agreement.

Do you have any example cases where the code in pg_basebackup has resulted in either a false positive or a false negative?  Any case which can be shown to result in either?

If not then I think we need to stop this, because if we can’t trust that a torn page won’t be actually used in that torn state then it seems likely that our entire WAL system is broken and we can’t trust the way we do backups either and have to rewrite all of that to take precautions to lock pages while doing a backup.

Thanks!

Stephen

Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote:
> > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > > It's torn pages that I am concerned about - the server is writing and
> > > > we are reading, and we get a mix of old and new content.  We have been
> > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > and checksum verification should not be held to any lesser standard.
> > >
> > > If we see a checksum failure on an otherwise correctly read block in
> > > online mode, we retry the block on the theory that we might have read a
> > > torn page.  If the checksum verification still fails, we compare its LSN
> > > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > > way, a torn page should not cause a false positive either way I
> > > think.
> >
> > False positives, no. But there's plenty potential for false
> > negatives. In plenty clusters a large fraction of the pages is going to
> > be touched in most checkpoints.
> 
> 
> How is it a false negative?  The page was in the middle of being
> written,

You don't actually know that. It could just be random gunk in the LSN,
and this type of logic just ignores such failures as long as the random
gunk is above the system's LSN.

And the basebackup logic doesn't only skip verification when the checksum
failed and the lsn is between startptr and the current insertion pointer -
it skips it for *any* page that has a pd_upper != 0 and a pd_lsn >
startptr. Given typical startlsn values (skewing heavily towards lower
int64s), that means that random data is more likely than not to pass
this test.

As it stands, the logic seems to give more false confidence than
anything else.


> > The basebackup checksum verification works in the same way.
> >
> > Shouldn't have been merged that way.
> 
> 
> I have a hard time not finding this offensive.  These issues were
> considered, discussed, and well thought out, with the result being
> committed after agreement.

Well, I don't know what to tell you. But:

                /*
                 * Only check pages which have not been modified since the
                 * start of the base backup. Otherwise, they might have been
                 * written only halfway and the checksum would not be valid.
                 * However, replaying WAL would reinstate the correct page in
                 * this case. We also skip completely new pages, since they
                 * don't have a checksum yet.
                 */
                if (!PageIsNew(page) && PageGetLSN(page) < startptr)
                {

doesn't consider plenty of scenarios, as pointed out above.  It'd be one
thing if the concerns I pointed out above had actually been commented upon
and weighed as not substantial enough (not that I know how). But...




> Do you have any example cases where the code in pg_basebackup has resulted
> in either a false positive or a false negative?  Any case which can be
> shown to result in either?

CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
SELECT pg_relation_size('corruptme');
postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
┌─────────────────────────────────────┐
│              ?column?               │
├─────────────────────────────────────┤
│ /srv/dev/pgdev-dev/base/13390/16384 │
└─────────────────────────────────────┘
(1 row)
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc

Try a basebackup and see how many times it'll detect the corrupt
data. In the vast majority of cases you're going to see checksum
failures when reading the data for normal operation, but not when using
basebackup (or this new tool).

At the very very least this would need to do

a) checks that the page is all zeroes if PageIsNew() (like
   PageIsVerified() does for the backend). That avoids missing cases
   where corruption just zeroed out the header, but not the whole page.
b) Check that pd_lsn is between startlsn and the insertion pointer. That
   avoids accepting just about all random data.

And that'd *still* be less strenuous than what normal backends
check. And that's already not great (due to not noticing zeroed out
data).
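
In code, the per-page test would then look very roughly like this
(illustration only, not the committed basebackup.c logic; insert_ptr
would come from GetXLogInsertRecPtr(), page_is_all_zero() is a
hypothetical helper doing the same loop as PageIsVerified(), and the
other variables are the ones already in sendFile()):

if (PageIsNew(page))
{
    /* (a) a "new" page must be entirely zero, not merely have pd_upper == 0 */
    if (!page_is_all_zero(page))
        checksum_failures++;
}
else if (PageGetLSN(page) >= startptr && PageGetLSN(page) <= insert_ptr)
{
    /* plausibly still being written while we read it: skip this page */
}
else
{
    /* (b) anything else, random pd_lsn included, must carry a valid checksum */
    if (pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE) !=
        ((PageHeader) page)->pd_checksum)
        checksum_failures++;
}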

I fail to see how it's offensive to describe this as "shouldn't have
been merged that way".

Greetings,

Andres Freund


Re: Online verification of checksums

From
Andres Freund
Date:
On 2019-03-19 13:00:50 -0700, Andres Freund wrote:
> As it stands, the logic seems to give more false confidence than
> anything else.

To demonstrate that I ran a loop that verified that a) a normal backend
query using the table detects the corruption b) pg_basebackup doesn't.

i=0;
while true; do
    i=$(($i+1));
    echo attempt $i;
    dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
    psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
    ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
done

(excuse the crappy one-off sh)

had, during ~12k iterations, always detected the corruption in the
backend, and never via pg_basebackup. Given the likely LSNs in a
cluster, that's not too surprising.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Robert Haas
Date:
On Tue, Mar 19, 2019 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
> To demonstrate that I ran a loop that verified that a) a normal backend
> query using the table detects the corruption b) pg_basebackup doesn't.
>
> i=0;
> while true; do
>     i=$(($i+1));
>     echo attempt $i;
>     dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
>     psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
>     ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.

Wow.  So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages.  That's pretty
awful.  Your chances get better the more WAL you've ever generated,
but you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.
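
(To spell out the arithmetic behind that figure: the checksum is only
verified when pd_lsn is below the backup's start LSN, so for a page of
garbage with an effectively uniform random 64-bit pd_lsn the chance of
detection is roughly startptr / 2^64.  Getting that up to 1% requires
startptr to be about 0.01 * 2^64, i.e. about 1.8 * 10^17 bytes of WAL
ever generated, which is the 163 (binary) petabytes quoted above.)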

It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page.  Still,
that leaves a lot of kinds of corruption that this will not catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Dienstag, den 19.03.2019, 13:00 -0700 schrieb Andres Freund:
> On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> > On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > > > It's torn pages that I am concerned about - the server is writing and
> > > > > we are reading, and we get a mix of old and new content.  We have been
> > > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > > and checksum verification should not be held to any lesser standard.
> > > > 
> > > > If we see a checksum failure on an otherwise correctly read block in
> > > > online mode, we retry the block on the theory that we might have read a
> > > > torn page.  If the checksum verification still fails, we compare its LSN
> > > > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > > > way, a torn page should not cause a false positive either way I
> > > > think.
> > > 
> > > False positives, no. But there's plenty potential for false
> > > negatives. In plenty clusters a large fraction of the pages is going to
> > > be touched in most checkpoints.
> > 
> > 
> > How is it a false negative?  The page was in the middle of being
> > written,
> 
> You don't actually know that. It could just be random gunk in the LSN,
> and this type of logic just ignores such failures as long as the random
> gunk is above the system's LSN.

Right, I think this needs to be taken into account. For pg_basebackup,
that'd be an additional check for GetRedoRecPtr() or something 
in the below check:

[...]

> Well, I don't know what to tell you. But:
> 
>                 /*
>                  * Only check pages which have not been modified since the
>                  * start of the base backup. Otherwise, they might have been
>                  * written only halfway and the checksum would not be valid.
>                  * However, replaying WAL would reinstate the correct page in
>                  * this case. We also skip completely new pages, since they
>                  * don't have a checksum yet.
>                  */
>                 if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>                 {
> 
> doesn't consider plenty scenarios, as pointed out above.  It'd be one
> thing if the concerns I point out above were actually commented upon and
> weighed not substantial enough (not that I know how). But...
> 

> > Do you have any example cases where the code in pg_basebackup has resulted
> > in either a false positive or a false negative?  Any case which can be
> > shown to result in either?
> 
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
> ┌─────────────────────────────────────┐
> │              ?column?               │
> ├─────────────────────────────────────┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─────────────────────────────────────┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
> 
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).

Right, see above.

> At the very very least this would need to do
> 
> a) checks that the page is all zeroes if PageIsNew() (like
>    PageIsVerified() does for the backend). That avoids missing cases
>    where corruption just zeroed out the header, but not the whole page.

We can't run pg_checksum_page() on those afterwards though as it would
fire an assertion:

|pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
|pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
|== 0)' failed.

But we should count it as a checksum error and generate an appropriate
error message in that case.

> b) Check that pd_lsn is between startlsn and the insertion pointer. That
>    avoids accepting just about all random data.

However, as pg_checksums is a stand-alone application, it can't just
access the insertion pointer, can it? We could maybe set a threshold
from the last checkpoint after which we consider the pd_lsn bogus. But
what's a good threshold here?
 
And/or we could port the other sanity checks from PageIsVerified:

|                if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
|                       p->pd_lower <= p->pd_upper &&
|                       p->pd_upper <= p->pd_special &&
|                       p->pd_special <= BLCKSZ &&
|                       p->pd_special == MAXALIGN(p->pd_special))
|                       header_sane = true

That should catch large-scale random corruption like you showed above. 



Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-19 22:39:16 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 13:00 -0700 schrieb Andres Freund:
> > a) checks that the page is all zeroes if PageIsNew() (like
> >    PageIsVerified() does for the backend). That avoids missing cases
> >    where corruption just zeroed out the header, but not the whole page.
> 
> We can't run pg_checksum_page() on those afterwards though as it would
> fire an assertion:
> 
> |pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
> |pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
> |== 0)' failed.
> 
> But we should count it as a checksum error and generate an appropriate
> error message in that case.

All I'm saying is that if PageIsNew() you need to run the same checks
that PageIsVerified() runs in that case. Namely verifying that the page
is all-zeroes, rather than just the pd_upper field.  That's separate
from running pg_checksum_page().


> > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> >    avoids accepting just about all random data.
> 
> However, for pg_checksums being a stand-alone application it can't just
> access the insertion pointer, can it? We could maybe set a threshold
> from the last checkpoint after which we consider the pd_lsn bogus. But
> what's a good threshold here?

That's *PRECISELY* my point. I think it's a bad idea to do online
checksumming from outside the backend. It needs to be inside the
backend, and if there's any verification failures on a block, it needs
to acquire the IO lock on the page, and reread from disk.

Greetings,

Andres Freund


Re: Online verification of checksums

From
Michael Paquier
Date:
On Tue, Mar 19, 2019 at 02:44:52PM -0700, Andres Freund wrote:
> That's *PRECISELY* my point. I think it's a bad idea to do online
> checksumming from outside the backend. It needs to be inside the
> backend, and if there's any verification failures on a block, it needs
> to acquire the IO lock on the page, and reread from disk.

Yeah, FWIW, Julien Rouhaud was mentioning me that we could use
mdread() and loop over the blocks so as we don't finish loading
corrupted blocks into shared buffers, checking on the way if the block
is already in shared buffers or not.
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

I have rebased this patch now.

I also fixed the two issues Andres reported, namely a zeroed-out
pageheader and a random LSN. The first is caught by checking for an all-
zero-page in the way PageIsVerified() does. The second is caught by
comparing the upper 32 bits of the LSN as well and demanding that they
are equal. If the LSN is corrupted, the upper 32 bits should be wildly
different to the current checkpoint LSN.

Well, at least that is a stab at a fix; there is a window where the
upper 32 bits could legitimately be different. In order to make that as
small as possible, I update the checkpoint LSN every once in a while.
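
For illustration, the skip condition then becomes something like the
following (a simplified sketch, not the patch itself; checkpoint_lsn is
the value read from the control file and the two counters are
placeholders):

XLogRecPtr  page_lsn = PageGetLSN(page);

if (page_lsn > checkpoint_lsn &&
    (uint32) (page_lsn >> 32) == (uint32) (checkpoint_lsn >> 32))
{
    /* plausibly (re-)written after the checkpoint: skip, WAL covers it */
    skipped_blocks++;
}
else
{
    /* random garbage in pd_lsn will rarely match the upper 32 bits */
    checksum_failures++;
}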

Am Montag, den 18.03.2019, 21:15 +0100 schrieb Michael Banck:
> I have also added a paragraph to the documentation about possibly
> skipping new or recently updated pages:
> 
> +   If the cluster is online, pages that have been (re-)written since the last
> +   checkpoint will not count as checksum failures if they cannot be read or
> +   verified correctly.

I have removed that for now as it seems to be more confusing than
helpful.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Tomas Vondra
Date:
On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:
>Hi,
>
>I have rebased this patch now.
>
>I also fixed the two issues Andres reported, namely a zeroed-out
>pageheader and a random LSN. The first is caught by checking for an all-
>zero-page in the way PageIsVerified() does. The second is caught by
>comparing the upper 32 bits of the LSN as well and demanding that they
>are equal. If the LSN is corrupted, the upper 32 bits should be wildly
>different to the current checkpoint LSN.
>
>Well, at least that is a stab at a fix; there is a window where the
>upper 32 bits could legitimately be different. In order to make that as
>small as possible, I update the checkpoint LSN every once in a while.
>

Doesn't that mean we'll report a false positive?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Donnerstag, den 28.03.2019, 18:19 +0100 schrieb Tomas Vondra:
> On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:
> > I also fixed the two issues Andres reported, namely a zeroed-out
> > pageheader and a random LSN. The first is caught by checking for an all-
> > zero-page in the way PageIsVerified() does. The second is caught by
> > comparing the upper 32 bits of the LSN as well and demanding that they
> > are equal. If the LSN is corrupted, the upper 32 bits should be wildly
> > different to the current checkpoint LSN.
> > 
> > Well, at least that is a stab at a fix; there is a window where the
> > upper 32 bits could legitimately be different. In order to make that as
> > small as possible, I update the checkpoint LSN every once in a while.

I decided it makes more sense to just re-read the checkpoint LSN from
the control file when we encounter a wrong checksum on re-read of a page
as that is when it counts, instead of doing it only every once in a
while.

> Doesn't that mean we'll report a false positive?

A false positive would be pg_checksums claiming a block has a wrong
checksum while in fact it does not (after it is correctly written out
and synced to disk), right?

If pg_checksums reads a current first part and a stale second part twice
in a row (we re-read the block), then the LSN of the first part would
presumably(?) be higher than the latest checkpoint LSN. If there was a
wraparound in the lower part of the LSN so that the upper part is now
different to the latest checkpoint LSN, then pg_checksums would report
this as a false positive I believe. 

We could add some additional heuristics like checking the upper part of
the LSN has advanced by at most one but that does not seem to make it
100% certified robust either, does it?

If pg_checksums reads a current second part and a stale first part
twice, then the pageheader LSN would presumably be lower than the
checkpoint LSN and again a false positive would be reported.

At least in my testing I haven't seen the second case, and the first
(disregarding the wraparound issue for now) only extremely rarely, if at
all (usually the torn page is gone on re-read). The first case requiring
a wraparound since the latest checkpoint LSN update also seems quite
narrow compared to the issue of random data being written due to
corruption. So I think it is more important to make sure random data
won't be a false negative than to worry about this being a false
positive.

Maybe we can just issue a warning in online mode that some checksum
failures could be false positives and advise the user to recheck those
files (using the -r switch) again? I have added this in the attached new
version:

+  printf(_("%s ran against an online cluster and found some bad checksums.\n"), progname);
+  printf(_("It could be that those are false positives due to concurrently updated blocks,\n"));
+  printf(_("checking the offending files again with the -r option is advised.\n"));

It was not mentioned on this thread, but I want to stress again that you
cannot run the current pg_checksums on a basebackup due to the control
file claiming it is still online. This makes the current program pretty
useless for production setups right now in my opinion, as few people
have the luxury of regular maintenance downtimes during which
pg_checksums could run, and running it against base backups is quite
cumbersome.

Maybe we can improve things by checking for the postmaster.pid as well
and going ahead (only for --check of course) if it is missing, but that
hasn't been implemented yet.

I agree that the current patch might have some corner-cases where it
does not guarantee 100% accuracy in online mode, but I hope the current
version at least has no more false negatives.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> I agree that the current patch might have some corner-cases where it
> does not guarantee 100% accuracy in online mode, but I hope the current
> version at least has no more false negatives.

False positives are *bad*. We shouldn't integrate code that has them.

Greetings,

Andres Freund



Re: Online verification of checksums

From
Tomas Vondra
Date:
On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
>> I agree that the current patch might have some corner-cases where it
>> does not guarantee 100% accuracy in online mode, but I hope the current
>> version at least has no more false negatives.
>
>False positives are *bad*. We shouldn't integrate code that has them.
>

Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
communicate with the server, which would presumably address these issues.
Can someone explain why not to do that?

FWIW I've initially argued against that, believing that we can address
those issues in some other way, and I'd love if that was possible. But
considering we're still trying to make that work reliably I think the
reasonable conclusion is that Andres was right communicating with the
server is necessary.

Of course, I definitely appreciate people are working on this, otherwise
we wouldn't be having this discussion ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online verification of checksums

From
Magnus Hagander
Date:


On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
>> I agree that the current patch might have some corner-cases where it
>> does not guarantee 100% accuracy in online mode, but I hope the current
>> version at least has no more false negatives.
>
>False positives are *bad*. We shouldn't integrate code that has them.
>

Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
communicate with the server, which would presumably address these issues.
Can someone explain why not to do that?

I agree that this effort seems better spent on fixing those issues there (of which many are the same), and then re-use that.


FWIW I've initially argued against that, believing that we can address
those issues in some other way, and I'd love if that was possible. But
considering we're still trying to make that work reliably I think the
reasonable conclusion is that Andres was right communicating with the
server is necessary.

Of course, I definitely appreciate people are working on this, otherwise
we wouldn't be having this discussion ...

+1.
 
--

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
>
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.

This really seems like it depends on which of the options we're talking
about..   Connecting to the server and asking what the current insert
point is, so we can check that the LSN isn't completely insane, seems
reasonable, but at least one option being discussed was to have
pg_basebackup actually *lock the page* (even if just for I/O..) and then
re-read it, and having an external tool doing that instead of the
backend seems like a whole different level to me.  That would involve
having an SQL function for "lock this page against I/O" and then another
for "unlock this page", wouldn't it?

> > FWIW I've initially argued against that, believing that we can address
> > those issues in some other way, and I'd love if that was possible. But
> > considering we're still trying to make that work reliably I think the
> > reasonable conclusion is that Andres was right communicating with the
> > server is necessary.

As part of a backup, you could check against the pages written out into
the WAL as a cross-check and be able to be confident that at least
everything which was backed up had been checked.  That doesn't cover
things like unlogged tables though.

For my part, at least, adding additional checks around the LSN seems
like a good solution (though we can't allow those checks to turn into
false positives...) and would seriously reduce the risk that we have
false negatives (we can *not* completely eliminate false negatives
entirely..  we could possibly get to a point where at least we don't
have any more false negatives than PG itself has but it looks like an
awful lot of work and ends up adding its own risks...).

As I've said before, I'd certainly support a background worker which
performs ongoing checksum validation of pages and that would be able to
use the same approach as what we do with pg_basebackup, but having an
external tool locking pages seems really unlikely to be reasonable.

Thanks!

Stephen

Attachment

Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> * Magnus Hagander (magnus@hagander.net) wrote:
> > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > wrote:
> > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > >Hi,
> > > >
> > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > >> I agree that the current patch might have some corner-cases where it
> > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > >> version at least has no more false negatives.
> > > >
> > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > >
> > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > communicate with the server, which would presumably address these issues.
> > > Can someone explain why not to do that?
> > 
> > I agree that this effort seems better spent on fixing those issues there
> > (of which many are the same), and then re-use that.
> 
> This really seems like it depends on which of the options we're talking
> about..   Connecting to the server and asking what the current insert
> point is, so we can check that the LSN isn't completely insane, seems
> reasonable, but at least one option being discussed was to have
> pg_basebackup actually *lock the page* (even if just for I/O..) and then
> re-read it, and having an external tool doing that instead of the
> backend seems like a whole different level to me.  That would involve
> having an SQL function for "lock this page against I/O" and then another
> for "unlock this page", wouldn't it?

No, I don't think so. And we obviously couldn't have a SQL level
function hold an LWLock after it has finished; that'd make undetected
deadlocks triggerable by users.  The way I'd imagine that being done is
to just perform the checksum test in the commandline tool, and whenever
there's a checksum failure that could plausibly be a torn read, call a
server side function that re-tests the page after locking it. Which then
would just return the error message in a string.
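
As a very rough sketch of that idea (hypothetical, nothing like this
exists in core; lock_page_for_io()/unlock_page_for_io() are placeholders
for whatever interlock against concurrent writes the real implementation
would use):

static const char *
recheck_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno)
{
    PGAlignedBlock buf;

    lock_page_for_io(reln, forknum, blkno);     /* placeholder */
    smgrread(reln, forknum, blkno, buf.data);
    unlock_page_for_io(reln, forknum, blkno);   /* placeholder */

    if (PageIsNew((Page) buf.data))
        return NULL;            /* new pages carry no checksum */

    if (pg_checksum_page(buf.data, blkno) == ((PageHeader) buf.data)->pd_checksum)
        return NULL;            /* verified fine under the lock */

    return "checksum mismatch even after locked re-read";
}

A SQL-callable wrapper around something like that is what the command
line tool would call for a suspect block.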

Greetings,

Andres Freund



Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >Hi,
> > > > >
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > > >
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > >
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> >
> > This really seems like it depends on which of the options we're talking
> > about..   Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me.  That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
>
> No, I don't think so. And we obviously couldn't have a SQL level
> function hold an LWLock after it has finished, that'd make undetected
> deadlocks triggerable by users.  The way I'd imagine that being done is
> to just perform the checksum test in the commandline tool, and whenever
> there's a checksum failure that could plausibly be a torn read, call a
> server side function that re-tests the page after locking it. Which then
> would just return the error message in a string.

The server-side function would essentially lock the page against i/o,
re-read it off disk into an independent location, unlock the page, then
calculate the checksum and report back?

That seems like it would be reasonable to me.  Wouldn't it make sense to
then have pg_basebackup use that same function..?

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-29 11:38:02 -0400, Stephen Frost wrote:
> The server-side function would essentially lock the page against i/o,
> re-read it off disk into an independent location, unlock the page, then
> calculate the checksum and report back?

Right. I think there's a few minor variations of how this could be done,
but that'd be the basic approach.


> That seems like it would be reasonable to me.  Wouldn't it make sense to
> then have pg_basebackup use that same function..?

Yea, probably. Or at least reuse the majority of it; I can imagine the
error reporting would be a bit different (sqlstates et al. are needed for
the basebackup.c case, but not the pg_checksums case).

Greetings,

Andres Freund



Re: Online verification of checksums

From
Magnus Hagander
Date:


On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
>
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.

This really seems like it depends on which of the options we're talking
about..   Connecting to the server and asking what the current insert
point is, so we can check that the LSN isn't completely insane, seems
reasonable, but at least one option being discussed was to have
pg_basebackup actually *lock the page* (even if just for I/O..) and then
re-read it, and having an external tool doing that instead of the
backend seems like a whole different level to me.  That would involve
having an SQL function for "lock this page against I/O" and then another
for "unlock this page", wouldn't it?

Right.

But what if we just added a flag to the BASE_BACKUP command in the replication protocol that said "meh, I really just want to verify the checksums, so please send the data to devnull and only feed me regular status updates on this connection"?

--

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > > 
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> > 
> > This really seems like it depends on which of the options we're talking
> > about..   Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me.  That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
> 
> Right.
> 
> But what if we just added a flag to the BASE_BACKUP command in the
> replication protocol that said "meh, I really just want to verify the
> checksums, so please send the data to devnull and only feed me regular
> status updates on this connection"?

I don't know whether BASE_BACKUP is the best interface for that (at
least right now) - backend/replication/basebackup.c's sendFile() gets
only an absolute filename to send, which is not adequate for more in-
depth server-based things like locking a particular page in a particular
relation of some particular tablespace.

ISTM that the fact that we had to teach it about different segment files
for checksum verification by splitting up the filename at "." implies
that it is not the correct level of abstraction (but maybe it could get
schooled some more about Postgres internals, e.g. by passing it a
RelFileNode struct and not a filename).


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Online verification of checksums

From
Magnus Hagander
Date:


On Fri, Mar 29, 2019 at 10:08 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,

Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > >
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> >
> > This really seems like it depends on which of the options we're talking
> > about..   Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me.  That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
>
> Right.
>
> But what if we just added a flag to the BASE_BACKUP command in the
> replication protocol that said "meh, I really just want to verify the
> checksums, so please send the data to devnull and only feed me regular
> status updates on this connection"?

I don't know whether BASE_BACKUP is the best interface for that (at
least right now) - backend/replication/basebackup.c's sendFile() gets
only an absolute filename to send, which is not adequate for more in-
depth server-based things like locking a particular page in a particular
relation of some particular tablespace. 

ISTM that the fact that we had to teach it about different segment files
for checksum verification by splitting up the filename at "." implies
that it is not the correct level of abstraction (but maybe it could get
schooled some more about Postgres internals, e.g. by passing it a
RelFileNode struct and not a filename).

But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if we fix it there, we only have to fix it once...


//Magnus

Re: Online verification of checksums

From
Andres Freund
Date:
Hi,

On 2019-03-30 12:56:21 +0100, Magnus Hagander wrote:
> > ISTM that the fact that we had to teach it about different segment files
> > for checksum verification by splitting up the filename at "." implies
> > that it is not the correct level of abstraction (but maybe it could get
> > schooled some more about Postgres internals, e.g. by passing it a
> > RelFileNode struct and not a filename).
> >
> 
> But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
> we fix it there, we only have to fix it once...

I'm not understanding the problem here. We already need to know all of
this? sendFile() determines whether the file is checksummed, and
computes the segment number:

        if (is_checksummed_file(readfilename, filename))
        {
            verify_checksum = true;
...
                    checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
                    phdr = (PageHeader) page;

I agree that the way checksumming works is a bit of a layering
violation. In my opinion it belongs in the smgr level, not bufmgr.c etc,
so different storage methods can store it differently. But that seems
fairly independent of this problem.

Greetings,

Andres Freund



Re: [Patch] Base backups and random or zero pageheaders

From
Michael Banck
Date:
Hi,

Am Mittwoch, den 27.03.2019, 11:37 +0100 schrieb Michael Banck:
> Am Dienstag, den 26.03.2019, 19:23 +0100 schrieb Michael Banck:
> > Am Dienstag, den 26.03.2019, 10:30 -0700 schrieb Andres Freund:
> > > On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > > >                  /*
> > > > -                 * Only check pages which have not been modified since the
> > > > -                 * start of the base backup. Otherwise, they might have been
> > > > -                 * written only halfway and the checksum would not be valid.
> > > > -                 * However, replaying WAL would reinstate the correct page in
> > > > -                 * this case. We also skip completely new pages, since they
> > > > -                 * don't have a checksum yet.
> > > > +                 * We skip completely new pages after checking they are
> > > > +                 * all-zero, since they don't have a checksum yet.
> > > >                   */
> > > > -                if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> > > > +                if (PageIsNew(page))
> > > >                  {
> > > > -                    checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> > > > -                    phdr = (PageHeader) page;
> > > > -                    if (phdr->pd_checksum != checksum)
> > > > +                    all_zeroes = true;
> > > > +                    pagebytes = (size_t *) page;
> > > > +                    for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> > > 
> > > Can we please abstract the zeroeness check into a separate function to
> > > be used both by PageIsVerified() and this?
> > 
> > Ok, done so as PageIsZero further down in bufpage.c.
> 
> It turns out that pg_checksums (current master and back branches, not
> just the online version) needs this treatment as well as it won't catch
> zeroed-out pageheader corruption, see attached patch to its TAP tests
> which trigger it (I also added a random data check similar to
> pg_basebackup as well which is not a problem for the current codebase).
> 
> Any suggestion on how to handle this? Should I duplicate the
> PageIsZero() code in pg_checksums? Should I move PageIsZero into
> something like bufpage_impl.h for use by external programs, similar to
> pg_checksum_page()?
> 
> I've done the latter as a POC in the second attached patch.

This is still an open item for the back branches I guess, i.e. zero page
header for pg_verify_checksums and additionally random page header for
pg_basebackup's base backup.

Do you plan to work on the patch you have outlined?  What would I need to
change in the patches I submitted, or is another approach warranted
entirely?  Should I add my patches to the next commitfest in order to
track them?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: [Patch] Base backups and random or zero pageheaders

From
Michael Paquier
Date:
On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> This is still an open item for the back branches I guess, i.e. zero page
> header for pg_verify_checksums and additionally random page header for
> pg_basebackup's base backup.

I may be missing something, but could you add an entry in the future
commit fest about the stuff discussed here?  I have not looked at your
patch closely..  Sorry.
--
Michael

Attachment

Re: [Patch] Base backups and random or zero pageheaders

From
Michael Banck
Date:
Hi,

Am Samstag, den 04.05.2019, 21:50 +0900 schrieb Michael Paquier:
> On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> > This is still an open item for the back branches I guess, i.e. zero page
> > header for pg_verify_checksums and additionally random page header for
> > pg_basebackup's base backup.
> 
> I may be missing something, but could you add an entry in the future
> commit fest about the stuff discussed here?  I have not looked at your
> patch closely..  Sorry.

Here is finally a rebased patch for the (IMO) more important issue in
pg_basebackup. I've added a commitfest entry for this now: 
https://commitfest.postgresql.org/25/2308/


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment

Re: [Patch] Base backups and random or zero pageheaders

From
Asif Rehman
Date:


On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,

Am Samstag, den 04.05.2019, 21:50 +0900 schrieb Michael Paquier:
> On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> > This is still an open item for the back branches I guess, i.e. zero page
> > header for pg_verify_checksums and additionally random page header for
> > pg_basebackup's base backup.
>
> I may be missing something, but could you add an entry in the future
> commit fest about the stuff discussed here?  I have not looked at your
> patch closely..  Sorry.

Here is finally a rebased patch for the (IMO) more important issue in
pg_basebackup. I've added a commitfest entry for this now: 
https://commitfest.postgresql.org/25/2308/



Hi Michael,

The patch does not seem to apply anymore, can you rebase it?

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

Re: [Patch] Base backups and random or zero pageheaders

From
Michael Banck
Date:
Hi,

Am Dienstag, den 25.02.2020, 19:34 +0500 schrieb Asif Rehman:
> On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael.banck@credativ.de> wrote:
> > Here is finally a rebased patch for the (IMO) more important issue in
> > pg_basebackup. I've added a commitfest entry for this now: 
> > https://commitfest.postgresql.org/25/2308/
> 
> The patch does not seem to apply anymore, can you rebase it?

Thanks for letting me know; please find attached a rebased version. I
hope the StaticAssertDecl() is still correct in bufpage.h.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

Re: Online verification of checksums

From
Asif Rehman
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

The patch applies cleanly and works as expected. Just a few minor observations:

- I would suggest refactoring the PageIsZero function by getting rid of the all_zeroes variable
and simply returning false as soon as a non-zero byte is found, rather than setting the all_zeroes
variable to false and breaking out of the for loop. The function should simply return true at the
end otherwise (see the sketch after these comments).

- Remove the empty line:
+                        * would throw an assertion failure.  Consider this a
+                        * checksum failure.
+                        */
+
+                       checksum_failures++;


- Code needs to run through pgindent.

Also, I'd suggest making "5" a define within the current file/function, perhaps
something like "MAX_CHECKSUM_FAILURES". You could also move the second
warning outside the conditional statement, as it appears in both "if" and "else" blocks.
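
For the first point, the suggested shape of PageIsZero would be roughly
the following (a sketch of the review comment only, not the submitted
patch):

bool
PageIsZero(Page page)
{
    size_t     *pagebytes = (size_t *) page;

    for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
    {
        if (pagebytes[i] != 0)
            return false;
    }

    return true;
}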


Regards,
--Asif

The new status of this patch is: Waiting on Author

Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

thanks for reviewing this patch!

Am Donnerstag, den 27.02.2020, 10:57 +0000 schrieb Asif Rehman:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            not tested
> 
> The patch applies cleanly and works as expected. Just a few minor observations:
> 
> - I would suggest refactoring PageIsZero function by getting rid of all_zeroes variable
> and simply returning false when a non-zero byte is found, rather than setting all_zeroes
> variable to false and breaking the for loop. The function should simply return true at the
> end otherwise.


Good point, I have done so.

> - Remove the empty line:
> +                        * would throw an assertion failure.  Consider this a
> +                        * checksum failure.
> +                        */
> +
> +                       checksum_failures++;

Done

> - Code needs to run through pgindent.

Done.

> Also, I'd suggest to make "5" a define within the current file/function, perhaps 
> something like "MAX_CHECKSUM_FAILURES". You could move the second 
> warning outside the conditional statement as it appears in both "if" and "else" blocks.

Well, I think you have a valid point, but that would be a different
(non-bug-fix) patch, as this part is not changed by this patch; the
code is at most moved around, isn't it?

New version attached.


Best regards,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

Re: Online verification of checksums

From
Tom Lane
Date:
Michael Banck <michael.banck@credativ.de> writes:
> [ 0001-Fix-checksum-verification-in-base-backups-for-random_V3.patch ]

I noticed that the cfbot wasn't testing this because of a minor merge
conflict.  I rebased it over that, and also readjusted things a little bit
to avoid unnecessarily reindenting existing code, in hopes of making the
patch easier to review.  Doing that reveals that the patch actually
removes a chunk of code, namely a special case for EOF.  Was that
intentional, or a result of a faulty merge earlier?  It certainly isn't
mentioned in your proposed commit message.

Another thing that's bothering me is that the patch compares page LSN
against GetInsertRecPtr(); but that function says

 * NOTE: The value *actually* returned is the position of the last full
 * xlog page. It lags behind the real insert position by at most 1 page.
 * For that, we don't need to scan through WAL insertion locks, and an
 * approximation is enough for the current usage of this function.

I'm not convinced that an approximation is good enough here.  It seems
like a page that's just now been updated could have an LSN beyond the
current XLOG page start, potentially leading to a false checksum
complaint.  Maybe we could address that by adding one xlog page to
the GetInsertRecPtr result?  Kind of a hack, but ...
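
(For illustration, that hack might look roughly like the sketch below
against the patch's upper-bound test; it is only an illustration, not
part of the submitted patch.)

    /*
     * Sketch: allow for GetInsertRecPtr() lagging the real insert position
     * by up to one xlog page (XLOG_BLCKSZ bytes), per its header comment.
     */
    else if (PageGetLSN(page) < startptr ||
             PageGetLSN(page) > GetInsertRecPtr() + XLOG_BLCKSZ)
    {
        /* ... verify the checksum as the patch already does ... */
    }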

            regards, tom lane

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 5d94b9c..c7ff9a8 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -2028,15 +2028,47 @@ sendFile(const char *readfilename, const char *tarfilename,
                 page = buf + BLCKSZ * i;

                 /*
-                 * Only check pages which have not been modified since the
-                 * start of the base backup. Otherwise, they might have been
-                 * written only halfway and the checksum would not be valid.
-                 * However, replaying WAL would reinstate the correct page in
-                 * this case. We also skip completely new pages, since they
-                 * don't have a checksum yet.
+                 * We skip completely new pages after checking they are
+                 * all-zero, since they don't have a checksum yet.
                  */
-                if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+                if (PageIsNew(page))
                 {
+                    if (!PageIsZero(page))
+                    {
+                        /*
+                         * pd_upper is zero, but the page is not all zero.  We
+                         * cannot run pg_checksum_page() on the page as it
+                         * would throw an assertion failure.  Consider this a
+                         * checksum failure.
+                         */
+                        checksum_failures++;
+
+                        if (checksum_failures <= 5)
+                            ereport(WARNING,
+                                    (errmsg("checksum verification failed in "
+                                            "file \"%s\", block %d: pd_upper "
+                                            "is zero but page is not all-zero",
+                                            readfilename, blkno)));
+                        if (checksum_failures == 5)
+                            ereport(WARNING,
+                                    (errmsg("further checksum verification "
+                                            "failures in file \"%s\" will not "
+                                            "be reported", readfilename)));
+                    }
+                }
+                else if (PageGetLSN(page) < startptr ||
+                         PageGetLSN(page) > GetInsertRecPtr())
+                {
+                    /*
+                     * Only check pages which have not been modified since the
+                     * start of the base backup. Otherwise, they might have
+                     * been written only halfway and the checksum would not be
+                     * valid. However, replaying WAL would reinstate the
+                     * correct page in this case. If the page LSN is larger
+                     * than the current insert pointer then we assume a bogus
+                     * LSN due to random page header corruption and do verify
+                     * the checksum.
+                     */
                     checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
                     phdr = (PageHeader) page;
                     if (phdr->pd_checksum != checksum)
@@ -2064,20 +2096,6 @@ sendFile(const char *readfilename, const char *tarfilename,

                             if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
                             {
-                                /*
-                                 * If we hit end-of-file, a concurrent
-                                 * truncation must have occurred, so break out
-                                 * of this loop just as if the initial fread()
-                                 * returned 0. We'll drop through to the same
-                                 * code that handles that case. (We must fix
-                                 * up cnt first, though.)
-                                 */
-                                if (feof(fp))
-                                {
-                                    cnt = BLCKSZ * i;
-                                    break;
-                                }
-
                                 ereport(ERROR,
                                         (errcode_for_file_access(),
                                          errmsg("could not reread block %d of file \"%s\": %m",
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index d708117..2dc8322 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -82,11 +82,8 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
     PageHeader    p = (PageHeader) page;
-    size_t       *pagebytes;
-    int            i;
     bool        checksum_failure = false;
     bool        header_sane = false;
-    bool        all_zeroes = false;
     uint16        checksum = 0;

     /*
@@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno)
     }

     /* Check all-zeroes case */
-    all_zeroes = true;
-    pagebytes = (size_t *) page;
-    for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-    {
-        if (pagebytes[i] != 0)
-        {
-            all_zeroes = false;
-            break;
-        }
-    }
-
-    if (all_zeroes)
+    if (PageIsZero(page))
         return true;

     /*
@@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno)
     return false;
 }

+/*
+ * PageIsZero
+ *        Check that the page consists only of zero bytes.
+ *
+ */
+bool
+PageIsZero(Page page)
+{
+    int            i;
+    size_t       *pagebytes = (size_t *) page;
+
+    for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
+    {
+        if (pagebytes[i] != 0)
+            return false;
+    }
+
+    return true;
+}

 /*
  *    PageAddItemExtended
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 6338176..598453e 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 112;

 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -497,21 +497,37 @@ my $file_corrupt2 = $node->safe_psql('postgres',
 my $pageheader_size = 24;
 my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');

-# induce corruption
+# induce corruption in the pageheader by writing random data into it
 system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
 open $file, '+<', "$pgdata/$file_corrupt1";
-seek($file, $pageheader_size, 0);
-syswrite($file, "\0\0\0\0\0\0\0\0\0");
+my $random_data = join '', map { ("a".."z")[rand 26] } 1 .. $pageheader_size;
+syswrite($file, $random_data);
+close $file;
+system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
+
+$node->command_checks_all(
+    [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt1" ],
+    1,
+    [qr{^$}],
+    [qr/^WARNING.*checksum verification failed/s],
+    "pg_basebackup reports checksum mismatch for random pageheader data");
+rmtree("$tempdir/backup_corrupt1");
+
+# zero out the pageheader completely
+open $file, '+<', "$pgdata/$file_corrupt1";
+system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
+my $zero_data = "\0"x$pageheader_size;
+syswrite($file, $zero_data);
 close $file;
 system_or_bail 'pg_ctl', '-D', $pgdata, 'start';

 $node->command_checks_all(
-    [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ],
+    [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt1a" ],
     1,
     [qr{^$}],
     [qr/^WARNING.*checksum verification failed/s],
-    'pg_basebackup reports checksum mismatch');
-rmtree("$tempdir/backup_corrupt");
+    "pg_basebackup reports checksum mismatch for zeroed pageheader");
+rmtree("$tempdir/backup_corrupt1a");

 # induce further corruption in 5 more blocks
 system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683..a1fcb21 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -419,17 +419,18 @@ do { \
                         ((is_heap) ? PAI_IS_HEAP : 0))

 /*
- * Check that BLCKSZ is a multiple of sizeof(size_t).  In PageIsVerified(),
- * it is much faster to check if a page is full of zeroes using the native
- * word size.  Note that this assertion is kept within a header to make
- * sure that StaticAssertDecl() works across various combinations of
- * platforms and compilers.
+ * Check that BLCKSZ is a multiple of sizeof(size_t).  In PageIsZero(), it is
+ * much faster to check if a page is full of zeroes using the native word size.
+ * Note that this assertion is kept within a header to make sure that
+ * StaticAssertDecl() works across various combinations of platforms and
+ * compilers.
  */
 StaticAssertDecl(BLCKSZ == ((BLCKSZ / sizeof(size_t)) * sizeof(size_t)),
                  "BLCKSZ has to be a multiple of sizeof(size_t)");

 extern void PageInit(Page page, Size pageSize, Size specialSize);
 extern bool PageIsVerified(Page page, BlockNumber blkno);
+extern bool PageIsZero(Page page);
 extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size,
                                         OffsetNumber offsetNumber, int flags);
 extern Page PageGetTempPage(Page page);

Re: Online verification of checksums

From
Tom Lane
Date:
I wrote:
> Another thing that's bothering me is that the patch compares page LSN
> against GetInsertRecPtr(); but that function says
> ...
> I'm not convinced that an approximation is good enough here.  It seems
> like a page that's just now been updated could have an LSN beyond the
> current XLOG page start, potentially leading to a false checksum
> complaint.  Maybe we could address that by adding one xlog page to
> the GetInsertRecPtr result?  Kind of a hack, but ...

Actually, after thinking about that a bit more: why is there an LSN-based
special condition at all?  It seems like it'd be far more useful to
checksum everything, and on failure try to re-read and re-verify the page
once or twice, so as to handle the corner case where we examine a page
that's in process of being overwritten.
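
(A rough sketch of that idea against the basebackup code, with the
re-read step left as a hypothetical helper rather than real code:)

    int         attempt;

    /*
     * Sketch only: verify the checksum unconditionally and, on a mismatch,
     * re-read the block once or twice before counting it as a failure, to
     * cover a page that was being overwritten while we read it.  New
     * (all-zero) pages would still need separate handling, as in the patch.
     * reread_block() is a made-up name standing in for the fseek()/fread()
     * that the existing retry code already performs.
     */
    for (attempt = 0; attempt < 3; attempt++)
    {
        checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
        if (((PageHeader) page)->pd_checksum == checksum)
            break;              /* verified ok, possibly after a re-read */
        if (attempt < 2)
            reread_block(fp, buf, i, &page);    /* hypothetical helper */
        else
            checksum_failures++;    /* still failing after two re-reads */
    }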

            regards, tom lane



Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Monday, 06.04.2020 at 16:45 -0400, Tom Lane wrote:
> I wrote:
> > Another thing that's bothering me is that the patch compares page LSN
> > against GetInsertRecPtr(); but that function says
> > ...
> > I'm not convinced that an approximation is good enough here.  It seems
> > like a page that's just now been updated could have an LSN beyond the
> > current XLOG page start, potentially leading to a false checksum
> > complaint.  Maybe we could address that by adding one xlog page to
> > the GetInsertRecPtr result?  Kind of a hack, but ...

I was about to write that it sounds like a pragmatic solution to me,
but...

> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all?  It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.

Andres outlined something about a year ago which, on re-reading, sounds
similar to what you suggest above, in
20190326170820.6sylklg7eh6uhabd@alap3.anarazel.de, but he never posted a
full patch. He seems to have had a few additional checks from PageIsVerified() in mind, though.

The original check against the checkpoint LSN wasn't suggested by me;
I've submitted this patch with the InsertRecPtr as an upper bound as a
(presumably) minimally-invasive patch which could be back-patched (when
nothing came of the above thread for a while), but the issue seems to be
quite nuanced.

Probably we need to take a step back; the question is whether something
like what Andres suggested should/could be coded up for v13 still
(before the feature freeze) and if so, by whom (I won't have the time),
or whether it would still qualify as a back-patchable bug-fix and/or
whether your suggestion above would.


Michael
    
-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: Online verification of checksums

From
Daniel Gustafsson
Date:
> On 6 Apr 2020, at 23:15, Michael Banck <michael.banck@credativ.de> wrote:

> Probably we need to take a step back;

This patch has been Waiting on Author since the last commitfest (and no longer
applies as well), and by the sounds of the thread there are some open issues
with it.  Should it be Returned with Feedback to be re-opened with a fresh take
on it?

cheers ./daniel


Re: Online verification of checksums

From
Daniel Gustafsson
Date:
> On 5 Jul 2020, at 13:52, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 6 Apr 2020, at 23:15, Michael Banck <michael.banck@credativ.de> wrote:
>
>> Probably we need to take a step back;
>
> This patch has been Waiting on Author since the last commitfest (and no longer
> applies as well), and by the sounds of the thread there are some open issues
> with it.  Should it be Returned with Feedback to be re-opened with a fresh take
> on it?

Marked as Returned with Feedback, please open a new entry in case there is a
renewed interest with a new patch.

cheers ./daniel


Re: Online verification of checksums

From
Michael Banck
Date:
Hi,

On Tuesday, 20.10.2020 at 18:11 +0900, Michael Paquier wrote:
> On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> > Actually, after thinking about that a bit more: why is there an LSN-based
> > special condition at all?  It seems like it'd be far more useful to
> > checksum everything, and on failure try to re-read and re-verify the page
> > once or twice, so as to handle the corner case where we examine a page
> > that's in process of being overwritten.
> 
> I was reviewing this area today, and that actually matches my
> impression.  Why do we need a LSN-based check at all?  As said
> upthread, that's of course weak with random data as we would miss most
> of the real checksum failures, with odds getting better depending on
> the current LSN of the cluster moving on.  However, it seems to me
> that we would have an extra advantage in removing this check
> all together: it would be possible to check for pages even if these
> are more recent than the start LSN of the backup, and that could be a
> lot of pages that could be checked on a large cluster.  So by keeping
> this check we also delay the detection of real problems.

The check was ported (or the concept of it adapted) from pgBackRest if I
remember correctly.

> As things stand, I'd like to think that it would be much more useful
> to remove this check and to have one or two extra retries (the current
> code only has one).  I don't like much the possibility of false
> positives for such critical checks, but as we need to live with what
> has been released, that looks like a good move for stable branches.

Sounds good to me. I think some were advocating for locking the page
before re-reading. When I looked at it, the level of abstraction that
pg_basebackup has (just a list of files chopped up into blocks, no
notion of relations I think) made that non-trivial, but maybe still
possible for v14 and beyond.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: Online verification of checksums

From
Magnus Hagander
Date:


On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>> I was referring to the patch I sent on this thread that fixes the
>> detection of a corruption for the zero-only case and where pd_lsn
>> and/or pg_upper are trashed by a corruption of the page header.  Both
>> cases allow a base backup to complete on HEAD, while sending pages
>> that could be corrupted, which is wrong.  Once you make the page
>> verification rely only on pd_checksum, as the patch does because the
>> checksum is the only source of truth in the page header, corrupted
>> pages are correctly detected, causing pg_basebackup to complain as it
>> should.  However, it has also the risk to cause pg_basebackup to fail
>> *and* to report as broken pages that are in the process of being
>> written, depending on how slow a disk is able to finish a 8kB write.
>> That's a different kind of wrongness, and users have two more reasons
>> to be pissed.  Note that if a page is found as torn we have a
>> consistent page header, meaning that on HEAD the PageIsNew() and
>> PageGetLSN() would pass, but the checksum verification would fail as
>> the contents at the end of the page does not match the checksum.
>
> Magnus, as the original committer of 4eb77d5, do you have an opinion
> to share?

I admit that I at some point lost track of the overlapping threads around this, and just figured there was enough different checksum-involved-people on those threads to handle it :) Meaning the short answer is "no, I don't really have one at this point".

Slightly longer comment is that it does seem reasonable, but I have not read in on all the different issues discussed over the whole thread, so take that as a weak-certainty comment.

--

Re: Online verification of checksums

From
Michael Paquier
Date:
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pg_upper are trashed by a corruption of the page header.  Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong.  Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slow a disk is able to finish a 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed.  Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() would pass, but the checksum verification would fail as
>>> the contents at the end of the page does not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
>
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
>
> Slightly longer comment is that it does seem reasonable, but I have not
> read in on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.

Which part are you considering as reasonable?  The removal-feature
part on a stable branch or perhaps something else?
--
Michael

Attachment

Re: Online verification of checksums

From
Magnus Hagander
Date:


On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
>> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>>> I was referring to the patch I sent on this thread that fixes the
>>>> detection of a corruption for the zero-only case and where pd_lsn
>>>> and/or pg_upper are trashed by a corruption of the page header.  Both
>>>> cases allow a base backup to complete on HEAD, while sending pages
>>>> that could be corrupted, which is wrong.  Once you make the page
>>>> verification rely only on pd_checksum, as the patch does because the
>>>> checksum is the only source of truth in the page header, corrupted
>>>> pages are correctly detected, causing pg_basebackup to complain as it
>>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>>> *and* to report as broken pages that are in the process of being
>>>> written, depending on how slow a disk is able to finish a 8kB write.
>>>> That's a different kind of wrongness, and users have two more reasons
>>>> to be pissed.  Note that if a page is found as torn we have a
>>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>>> PageGetLSN() would pass, but the checksum verification would fail as
>>>> the contents at the end of the page does not match the checksum.
>>>
>>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>>> to share?
>>>
>>
>> I admit that I at some point lost track of the overlapping threads around
>> this, and just figured there was enough different checksum-involved-people
>> on those threads to handle it :) Meaning the short answer is "no, I don't
>> really have one at this point".
>>
>> Slightly longer comment is that it does seem reasonable, but I have not
>> read in on all the different issues discussed over the whole thread, so
>> take that as a weak-certainty comment.
>
> Which part are you considering as reasonable?  The removal-feature
> part on a stable branch or perhaps something else?

I was referring to the latest patch on the thread. But as I said, I have not read up on all the different issues raised in the thread, so take it with a big grain of salt.

And I would also echo the previous comment that this code was adapted from what the pgbackrest folks do. As such, it would be good to get a comment from for example David on that -- I don't see any of them having commented after that was mentioned?

--

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> I was referring to the latest patch on the thread. But as I said, I have
> not read up on all the different issues raised in the thread, so take it
> with a big grain os salt.
>
> And I would also echo the previous comment that this code was adapted from
> what the pgbackrest folks do. As such, it would be good to get a comment
> from for example David on that -- I don't see any of them having commented
> after that was mentioned?

Agreed.  I am adding Stephen as well in CC.  From the code of
backrest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum.  So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered as correct if the random data
passes the pd_upper and pd_lsn checks :/
--
Michael

Attachment

Re: Online verification of checksums

From
David Steele
Date:
Hi Michael,

On 11/20/20 2:28 AM, Michael Paquier wrote:
> On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
>> I was referring to the latest patch on the thread. But as I said, I have
>> not read up on all the different issues raised in the thread, so take it
>> with a big grain os salt.
>>
>> And I would also echo the previous comment that this code was adapted from
>> what the pgbackrest folks do. As such, it would be good to get a comment
>> from for example David on that -- I don't see any of them having commented
>> after that was mentioned?
> 
> Agreed.  I am adding Stephen as well in CC.  From the code of
> backrest, the same logic happens in src/command/backup/pageChecksum.c
> (see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> happen before verifying the checksum.  So, if the page header finishes
> with random junk because of some kind of corruption, even corrupted
> pages would be incorrectly considered as correct if the random data
> passes the pd_upper and pg_lsn checks :/

Indeed, this is not good, as Andres pointed out some time ago. My 
apologies for not getting to this sooner.

Our current plan for pgBackRest:

1) Remove the LSN check as you have done in your patch and when 
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure 
it is valid.

These do completely rule out any type of corruption, but they certainly 
narrows the possibility by a lot.

In the future we would also like to scan the WAL to verify that the page 
is definitely being written to.

As for your patch, it mostly looks good but my objection is that a page 
may be reported as invalid after 5 retries when in fact it may just be 
very hot.

Maybe checking for an ascending LSN is a good idea there as well? At 
least in that case we could issue a different warning, instead of 
"checksum verification failed" perhaps "checksum verification skipped 
due to concurrent modifications".

Regards,
-- 
-David
david@pgmasters.net



Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* David Steele (david@pgmasters.net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain os salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed.  I am adding Stephen as well in CC.  From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum.  So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pg_lsn checks :/
>
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.

Yeah, it's been on our backlog to improve this.

> Our current plan for pgBackRest:
>
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.

Yup, that's my recollection also as to our plans for how to improve
things here.

> These do completely rule out any type of corruption, but they certainly
> narrows the possibility by a lot.

*don't :)

> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.

Yeah, that'd certainly be nice to do too.

> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.

Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.

> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".

+1.

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (david@pgmasters.net) wrote:
>> Our current plan for pgBackRest:
>>
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
>
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrows the possibility by a lot.
>
> *don't :)

Have you considered the possibility of only using pd_checksums for the
validation?  This is the only source of truth in the page header we
can rely on to validate the full contents of the page, so if the logic
relies on anything but the checksum then you expose the logic to risks
of reporting pages as corrupted while they were just torn, or just
miss corrupted pages, which is what we should avoid for such things.
Both are bad.

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
>
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
>
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
>
> +1.

I don't quite understand how you can make sure that the page is not
corrupted here?  It could be possible that the last 4kB of an 8kB page
got corrupted, where the header had valid data but the checksum
verification fails.  So if you are not careful you could have at hand
a corrupted page discarded because it failed the retry multiple times
in a row.  The only method I can think as being really
reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.
--
Michael

Attachment

Re: Online verification of checksums

From
Anastasia Lubennikova
Date:
On 21.11.2020 04:30, Michael Paquier wrote:
> The only method I can think as being really
> reliable is based on two facts:
> - Do a check only on pd_checksums, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

It seems reasonable to me to rely on checksums only.

As for retry, I think that API for concurrent I/O will be complicated. 
Instead, we can introduce a function to read the page directly from 
shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a 
bullet-proof solution to me. Do you see any possible problems with it?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele (david@pgmasters.net) wrote:
> >> Our current plan for pgBackRest:
> >>
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> >
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrows the possibility by a lot.
> >
> > *don't :)
>
> Have you considered the possibility of only using pd_checksums for the
> validation?  This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose the logic to risks
> of reporting pages as corrupted while they were just torn, or just
> miss corrupted pages, which is what we should avoid for such things.
> Both are bad.

There's no doubt that you'll get checksum failures from time to time,
and that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what.

It's not at all clear what you're suggesting here as to how you can use
'only' the checksum.

> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very hot.
> >
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> >
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> >
> > +1.
>
> I don't quite understand how you can make sure that the page is not
> corrupted here?  It could be possible that the last 4kB of a 8kB page
> got corrupted, where the header had valid data but failing the
> checksum verification.

Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:

- read(), save the LSN seen
- calculate checksum, get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum

If checksum fails again AND the LSN has changed and increased (and
perhaps otherwise seems reasonable) then we have at least a bit more
confidence that the failing checksum is due to the page being rewritten
concurrently and not due to latent storage corruption, which is the
specific distinction that we're trying to discern here.
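
(Very roughly, and with every helper name invented purely for
illustration, that per-page logic could look like:)

    /*
     * Illustration only; reread_page(), report_corruption() and the local
     * variables are invented names.  A checksum mismatch is only treated
     * as corruption if a second read still fails *and* the page LSN has
     * not moved forward, i.e. there is no sign of a concurrent rewrite.
     * (New, all-zero pages would need separate handling.)
     */
    XLogRecPtr  lsn_before = PageGetLSN(page);

    if (pg_checksum_page((char *) page, blkno) != ((PageHeader) page)->pd_checksum)
    {
        bool        ok;

        reread_page(file, blkno, page);         /* hypothetical re-read */
        ok = (pg_checksum_page((char *) page, blkno) ==
              ((PageHeader) page)->pd_checksum);

        if (!ok && PageGetLSN(page) <= lsn_before)
            report_corruption(blkno);           /* hypothetical reporting */
        /* otherwise: valid now, or being rewritten concurrently */
    }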

> So if you are not careful you could have at
> hand a corrupted page discarded because of it failed the retry
> multiple times in a row.

The point of checking for an ascending LSN is to see if the page is
being concurrently modified.  If it is, then we actually don't care if
the page is corrupted because it's going to be rewritten during WAL
replay as part of the restore process.

> The only method I can think as being really
> reliable is based on two facts:
> - Do a check only on pd_checksums, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

I don't think we actually want the backup process to start locking
pages, which it seems like is what you're suggesting here..?  Trying to
do a check without a lock and without having PG end up reading the page
back in if it had been evicted due to pressure seems likely to be hard
to do reliably and without race conditions complicating things.

The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures
for pages that we know are in the WAL with FPIs.  Unfortunately, reading
WAL for all different versions of PG is a fair bit of work and we
haven't quite gotten to biting that off yet (though it's on the
roadmap), and the core code certainly doesn't help us in that regard
since any given version only supports the current major version WAL (an
issue pg_basebackup would also have to deal with, were it to be
modified to use such an approach and to continue working with older
versions of PG..).  In a similar vein to what we do (in pgbackrest) with
pg_control, we expect to develop our own library basically vendorizing
WAL reading code from all the major versions of PG which we support in
order to track FPIs, restore points, all the kinds of potential recovery
targets, and other useful information.

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
> On 21.11.2020 04:30, Michael Paquier wrote:
> >The only method I can think as being really
> >reliable is based on two facts:
> >- Do a check only on pd_checksums, as that validates the full contents
> >of the page.
> >- When doing a retry, make sure that there is no concurrent I/O
> >activity in the shared buffers.  This requires an API we don't have
> >yet.
>
> It seems reasonable to me to rely on checksums only.
>
> As for retry, I think that API for concurrent I/O will be complicated.
> Instead, we can introduce a function to read the page directly from shared
> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> solution to me. Do you see any possible problems with it?

We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great, and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.

Also- what is the point of reading the page from shared buffers
anyway..?  All we need to do is prove that the page will be rewritten
during WAL replay.  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Anastasia Lubennikova
Date:
On 23.11.2020 18:35, Stephen Frost wrote:
> Greetings,
>
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> On 21.11.2020 04:30, Michael Paquier wrote:
>>> The only method I can think as being really
>>> reliable is based on two facts:
>>> - Do a check only on pd_checksums, as that validates the full contents
>>> of the page.
>>> - When doing a retry, make sure that there is no concurrent I/O
>>> activity in the shared buffers.  This requires an API we don't have
>>> yet.
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a
>> bullet-proof solution to me. Do you see any possible problems with it?
> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great,

TBH, I think it is highly unlikely that the page that was just updated will be evicted.

> and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

Have I missed something? Now pg_basebackup has only one process + one child process for streaming. Anyway, I totally agree with your argument. The need to maintain connection(s) to PG is the most unpleasant part of the proposed approach.

> Also- what is the point of reading the page from shared buffers
> anyway..?

Well... Reading a page from shared buffers is a reliable way to get a correct page from postgres under any concurrent load. So it just seems natural to me.

> All we need to do is prove that the page will be rewritten
> during WAL replay.

Yes and this is a tricky part. Until you have explained it in your latest message, I wasn't sure how we can distinct concurrent update from a page header corruption. Now I agree that if page LSN updated and increased between rereads, it is safe enough to conclude that we have some concurrent load.

> If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

Good point. I was thinking that we can recalculate checksum. Or even save a page without it, as we have checked LSN and know for sure that it will be rewritten by WAL replay.


To sum up, I agree with your proposal to reread the page and rely on ascending LSNs. Can you submit a patch?
You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch from this message https://www.postgresql.org/message-id/20201030023028.GC1693@paquier.xyz

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksums, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers.  This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.

Is it though..?  Consider that the page was being written out
specifically to free a buffer for use by another backend; while perhaps
that doesn't happen all the time, it certainly happens enough on very
busy systems.

> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.

I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".

> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.

Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.

> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes and this is a tricky part. Until you have explained it in your latest
> message, I wasn't sure how we can distinct concurrent update from a page
> header corruption. Now I agree that if page LSN updated and increased
> between rereads, it is safe enough to conclude that we have some concurrent
> load.

Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.
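
(In other words, something along the lines of this sketch, where startptr
is the backup start LSN as in the existing basebackup code, prior_lsn is
the LSN seen on the first read, and the counter names are invented:)

    XLogRecPtr  pagelsn = PageGetLSN(page);

    /*
     * Sketch: after a failed checksum and a failed re-read, only treat the
     * page as "concurrently rewritten" if its LSN both advanced and falls
     * in the plausible window between the backup start LSN and the current
     * insert position.
     */
    if (pagelsn > prior_lsn &&
        pagelsn >= startptr &&
        pagelsn <= GetInsertRecPtr())
        skipped_hot_pages++;        /* invented counter: not a corruption */
    else
        checksum_failures++;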

A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads?  For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side.  To try and clarify,
let me illustrate:

pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.

PG writes that same 8K page, and is able to successfully write the
entire block.

pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.

At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again?  I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.

I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.

> >  If we can prove that, we don't actually care what
> >the contents of the page are.  We certainly can't calculate the
> >checksum on a page we plucked out of shared buffers since we only
> >calculate the checksum when we go to write the page out.
> Good point. I was thinking that we can recalculate checksum. Or even save a
> page without it, as we have checked LSN and know for sure that it will be
> rewritten by WAL replay.

At the point that we know the page is in the WAL which must be replayed
to make this backup consistent, we could theoretically zero the page out
of the actual backup (or if we're doing some kind of incremental magic,
skip it entirely, as long as we zero-fill it on restore).

> To sum up, I agree with your proposal to reread the page and rely on
> ascending LSNs. Can you submit a patch?

Probably would make sense to give Michael an opportunity to comment and
get his thoughts on this, and for him to update the patch if he agrees.

As it relates to pgbackrest, we're currently contemplating having a
higher level loop which, upon detecting any page with an invalid
checksum, continues to scan to the end of that file and perform the
compression, encryption, et al, but then loops back after we've
completed that file and skips through the file again, re-reading those
pages which didn't have a valid checksum the first time to see if their
LSN has changed and is within the range of the backup.  This will
certainly give more opportunity for the kernel to 'catch up', if needed,
and give us an updated page without a random 100ms delay, and will also
make it easier for us to, eventually, check and make sure the page was
in the WAL that has been produced as part of the backup, to give us a
complete guarantee that the contents of this page don't matter and that
the failed checksum isn't a sign of latent storage corruption.
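
(A pseudo-code sketch of that two-pass idea; every identifier below is
made up for illustration and this is not pgBackRest's actual code:)

    /*
     * Pass 1: copy the file, remembering the blocks whose checksum failed.
     * Pass 2: re-read only those blocks and accept the ones whose LSN has
     * moved into the backup's LSN range; anything else gets reported.
     */
    failed_blocks = copy_file_recording_checksum_failures(file);

    for (i = 0; i < failed_blocks->count; i++)
    {
        BlockNumber blkno = failed_blocks->blkno[i];

        reread_block(file, blkno, page);
        if (PageGetLSN(page) >= backup_start_lsn &&
            PageGetLSN(page) <= backup_stop_lsn)
            continue;                   /* rewritten during the backup */
        report_corruption(file, blkno);
    }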

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?

It seems to me that you are missing the point here.  It is not
necessary to read a page from shared buffers.  What is necessary is to
make sure that there is zero concurrent I/O activity in shared buffers
while a page is getting checked on disk, giving the assurance that
there is zero risk of having a torn page for anything working with
shared buffers.  You could do that only on a retry if we found a page
where there was a checksum mismatch, meaning that the page was either
torn or corrupted, but needs an extra verification anyway.

> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

I don't quite see a problem in checking pages that have just been
evicted if we are able to detect faster that a page is corrupted:
the initial check may fail because a page was torn, meaning that it
was in the middle of an eviction, but the page could also be
corrupted, meaning that it was *not* torn, and would fail a retry
where we should make sure that there is no concurrent s_b activity.
So in the worst case you make the detection of a corrupted page
faster.

Please note that Andres also mentioned the potential need to
worry about table AMs that call smgrwrite() directly, bypassing shared
buffers.  The only cases in-core where it is used are related to init
forks when an unlogged relation gets created, where it would not
matter if you are doing a page check while holding a database
transaction as the newly-created relation would not be visible yet,
but it would matter in the case of base backups doing direct page
lookups.  Fun.

> Also- what is the point of reading the page from shared buffers
> anyway..?  All we need to do is prove that the page will be rewritten
> during WAL replay.  If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

An LSN-based check makes the thing tricky.  How do you make sure that
pd_lsn is not itself broken?  It could be perfectly possible that a
random on-disk corruption makes pd_lsn appear to have a correct value
while the rest of the page is borked.
--
Michael

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> Yes and this is a tricky part. Until you have explained it in your latest
>> message, I wasn't sure how we can distinct concurrent update from a page
>> header corruption. Now I agree that if page LSN updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two.  While that doesn't entirely eliminite the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.

FWIW, I am not much a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, still it does not
discard the possibility of incorrect results, confusing users as well
as people looking at such reports.

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that a LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.   Still, what's the point in doing a LSN check anyway if we
know that the checksum is valid?  Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was been produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.

That would reduce the likelihood of facing torn pages, still you
cannot fully discard the problem either as the same page may get changed
again once you loop over, no?  And what if a corruption has updated
pd_lsn on-disk?  Unlikely, but still possible.
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> Yes and this is a tricky part. Until you have explained it in your latest
>> message, I wasn't sure how we can distinguish a concurrent update from a page
>> header corruption. Now I agree that if page LSN updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.

FWIW, I am not much a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, still it does not
discard the possibility of incorrect results, confusing users as well
as people looking at such reports.

Let’s be clear about this- our checksums are, themselves, far from bulletproof, regardless of all of our other efforts.  They are not foolproof against any corruption, and certainly not even close to being sufficient for guarantees you’d expect in, say, encryption integrity.  We cannot say with certainty that a page which passes checksum validation isn’t corrupted in some way.  A page which doesn’t pass checksum validation may be corrupted or may be torn, and we aren’t 100% sure of that either, but we can work to try and make a sensible call about which it is.

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that a LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.   Still, what's the point in doing a LSN check anyway if we
know that the checksum is valid?  Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

I agree that the LSN check isn’t interesting if the page passes the checksum validation.  I do think we can look at the LSN and make reasonable inferences based off of it even if the checksum doesn’t validate- in particular, in my experience at least, the result of a read, without any intervening write, is very likely to be the same if performed multiple times quickly even if there is latent storage corruption- due to caching, if nothing else.  What’s interesting about the LSN check is that we are specifically looking to see if it *changed* in a reasonable and predictable manner, and that it was replaced with a new yet reasonable value. The chances of that happening due to latent storage corruption are vanishingly small.

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.

That would reduce the likelihood of facing torn pages, still you
cannot fully discard the problem either, as the same page may get changed
again once you loop over, no?  And what if a corruption has updated
pd_lsn on-disk?  Unlikely, but still possible.

We surely don’t care about a page which has been changed multiple times by PG during the backup, since all those changes will be, by definition, in the WAL, no?  Therefore, one loop to see that the value of the LSN *changed*, meaning something wrote something new there, with a cross-check to see that the LSN was in the expected range, goes an awfully long way toward assuring that this isn’t a case of latent storage corruption. If there is an attacker who is not the PG process but who is modifying files then, yes, that’s a risk, and won’t be picked up by this, but why would they create an invalid checksum in the first place...?

We aren’t attempting to protect against a sophisticated attack, we are trying to detect latent storage corruption.

I would also ask for a clarification as to if you feel that checking the WAL for the page to be insufficient somehow, since I mentioned that as also being on the roadmap.  If there’s some reason that checking the WAL for the page wouldn’t be sufficient, I am anxious to understand that reasoning.

Thanks,

Stephen

Re: Online verification of checksums

From
David Steele
Date:
Hi Michael,

On 11/23/20 8:10 PM, Michael Paquier wrote:
> On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> 
>> Also- what is the point of reading the page from shared buffers
>> anyway..?  All we need to do is prove that the page will be rewritten
>> during WAL replay.  If we can prove that, we don't actually care what
>> the contents of the page are.  We certainly can't calculate the
>> checksum on a page we plucked out of shared buffers since we only
>> calculate the checksum when we go to write the page out.
> 
> A LSN-based check makes the thing tricky.  How do you make sure that
> pd_lsn is not itself broken?  It could be perfectly possible that a
> random on-disk corruption makes pd_lsn seen as having a correct value,
> still the rest of the page is borked.

We are not just looking at one LSN value. Here are the steps we are 
proposing (I'll skip checks for zero pages here):

1) Test the page checksum. If it passes the page is OK.
2) If the checksum does not pass then record the page offset and LSN and 
continue.
3) After the file is copied, reopen and reread the file, seeking to 
offsets where possible invalid pages were recorded in the first pass.
     a) If the page is now valid then it is OK.
     b) If the page is not valid but the LSN has increased from the LSN 
recorded in the previous pass then it is OK. We can infer this because 
the LSN has been updated in a way that is not consistent with storage 
corruption.

This is what we are planning for the first round of improving our page 
checksum validation. We believe that doing the retry in a second pass 
will be faster and more reliable because some time will have passed 
since the first read without having to build in a delay for each page error.

A further improvement is to check the ascending LSNs found in 3b against 
PostgreSQL to be completely sure they are valid. We are planning this 
for our second round of improvements.

Reopening the file for the second pass does require some additional logic:

1) The file may have been deleted by PG since the first pass and in that 
case we won't report any page errors.
2) The file may have been truncated by PG since the first pass so we 
won't report any errors past the point of truncation.
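
For anyone who finds pseudo-code easier to follow, here is a minimal
standalone sketch of the two-pass logic described above. This is not the
actual pgbackrest or pg_verify_checksums code: page_checksum_ok() is just
a stub standing in for pg_checksum_page() (from PostgreSQL's
storage/checksum_impl.h), and the fixed-size suspect list is only for
illustration.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192
#define MAX_SUSPECTS 1024

/* pd_lsn is the first 8 bytes of every page: {xlogid, xrecoff}, two uint32s */
static uint64_t
page_lsn(const unsigned char *page)
{
    uint32_t xlogid, xrecoff;

    memcpy(&xlogid, page, 4);
    memcpy(&xrecoff, page + 4, 4);
    return ((uint64_t) xlogid << 32) | xrecoff;
}

/* Stub: a real tool would call pg_checksum_page() and compare pd_checksum */
static bool
page_checksum_ok(unsigned char *page, uint32_t blkno)
{
    (void) page;
    (void) blkno;
    return true;
}

typedef struct
{
    uint32_t    blkno;
    uint64_t    lsn;
} Suspect;

/* Returns the number of pages still considered corrupt after the recheck. */
static int
verify_file(const char *path, uint64_t backup_start_lsn, uint64_t backup_end_lsn)
{
    unsigned char page[BLCKSZ];
    Suspect     suspects[MAX_SUSPECTS];
    int         nsuspects = 0, failures = 0;
    uint32_t    blkno = 0;
    FILE       *fp = fopen(path, "rb");

    if (fp == NULL)
        return 0;                       /* file gone: relation was dropped */

    /* Pass 1: checksum every page, remember offset and LSN of failures */
    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (!page_checksum_ok(page, blkno) && nsuspects < MAX_SUSPECTS)
        {
            suspects[nsuspects].blkno = blkno;
            suspects[nsuspects].lsn = page_lsn(page);
            nsuspects++;
        }
        blkno++;
    }
    fclose(fp);

    /* Pass 2: reopen and reread only the suspect pages */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;                       /* deleted since pass 1: nothing to report */

    for (int i = 0; i < nsuspects; i++)
    {
        uint64_t    lsn;

        /* segments are at most 1GB by default, so a long offset is enough here */
        if (fseek(fp, (long) suspects[i].blkno * BLCKSZ, SEEK_SET) != 0 ||
            fread(page, 1, BLCKSZ, fp) != BLCKSZ)
            continue;                   /* truncated since pass 1: not an error */

        if (page_checksum_ok(page, suspects[i].blkno))
            continue;                   /* 3a: torn page was completed in the meantime */

        /* 3b: LSN moved forward, and stayed within the backup's LSN range */
        lsn = page_lsn(page);
        if (lsn > suspects[i].lsn &&
            lsn >= backup_start_lsn && lsn <= backup_end_lsn)
            continue;

        fprintf(stderr, "possible corruption in \"%s\", block %u\n",
                path, (unsigned) suspects[i].blkno);
        failures++;
    }
    fclose(fp);

    return failures;
}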

A malicious attacker could easily trick these checks, but as Stephen 
pointed out elsewhere they would likely make the checksums valid which 
would escape detection anyway.

We believe that the chances of random storage corruption passing all 
these checks are incredibly small, but eventually we'll also check 
against the WAL to be completely sure.

Regards,
-- 
-David
david@pgmasters.net



Re: Online verification of checksums

From
Michael Paquier
Date:
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
>
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.

But here the checksum is broken, so while the offset is something we
can rely on how do you make sure that the LSN is fine?  A broken
checksum could perfectly mean that the LSN is actually *not* fine if
the page header got corrupted.

> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possible invalid pages were recorded in the first pass.
>     a) If the page is now valid then it is OK.
>     b) If the page is not valid but the LSN has increased from the LSN

As per the previous point, the LSN is a value that we cannot rely on.

> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
>
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.

The lack of check for any concurrent I/O on the follow-up retries is
disturbing.  How do you guarantee that on the second retry what you
have is a torn page and not something corrupted?  Init forks for
example are made of up to 2 blocks, so the window would get short for
at least those.  There are many instances with tables that have few
pages as well.
--
Michael

Attachment

Re: Online verification of checksums

From
Magnus Hagander
Date:
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on how do you make sure that the LSN is fine?  A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> >     a) If the page is now valid then it is OK.
> >     b) If the page is not valid but the LSN has increased from the LSN
>
> As per the previous point, the LSN is a value that we cannot rely on.

We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
*doesn't change* we know it's corrupt. But if the LSN *increases* AND
the new page now has a correct checksum, it's most likely to be
correct. You could perhaps even put a cap on it saying "if the LSN
increased, but less than <n>", where <n> is a sufficiently high number
that it's entirely unreasonable to advance that far between the
reading of two blocks. But it has to have a very high margin in that
case.
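
Expressed as code, the rule would look roughly like the following. This
is just a sketch: the function name, the max_advance cap <n> and the idea
of passing in the reread checksum result are made up for illustration,
not anything that exists in the tree today.

#include <stdbool.h>
#include <stdint.h>

/*
 * first_lsn:   pd_lsn seen on the read that failed the checksum
 * reread_lsn:  pd_lsn seen on the later re-read
 * reread_ok:   whether the re-read page now passes its checksum
 * max_advance: the cap <n> on how far the LSN may plausibly move
 */
static bool
reread_looks_like_concurrent_write(uint64_t first_lsn, uint64_t reread_lsn,
                                   bool reread_ok, uint64_t max_advance)
{
    if (reread_lsn <= first_lsn)    /* decreased or unchanged: corrupt */
        return false;
    if (!reread_ok)                 /* still fails its checksum: corrupt */
        return false;
    /* increased and now valid: accept, unless it jumped implausibly far */
    return (reread_lsn - first_lsn) <= max_advance;
}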


> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing.  How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted?  Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those.  There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables
are large :)

The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly do a short delay before trying again if it gets that far.
That could happen if the process is too quick. It could also be that
you are unlucky and that you hit a *new* write, and you were so
unlucky that both times it happened to hit exactly when you were
reading the page the next time. I'm not sure the chance of that
happening is even big enough we have to care about it, though?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on how do you make sure that the LSN is fine?  A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.

Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.

> > > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > > where possible invalid pages were recorded in the first pass.
> > >     a) If the page is now valid then it is OK.
> > >     b) If the page is not valid but the LSN has increased from the LSN
> >
> > As per the previous point, the LSN is a value that we cannot rely on.
>
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's most likely to be
> correct. You could perhaps even put a cap on it saying "if the LSN
> increased, but less than <n>", where <n> is a sufficiently high number
> that it's entirely unreasonable to advance that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.

This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup".  I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no
corruption.

> > > A malicious attacker could easily trick these checks, but as Stephen pointed
> > > out elsewhere they would likely make the checksums valid which would escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks is incredibly small, but eventually we'll also check against the WAL
> > > to be completely sure.
> >
> > The lack of check for any concurrent I/O on the follow-up retries is
> > disturbing.  How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted?  Init forks for
> > example are made of up to 2 blocks, so the window would get short for
> > at least those.  There are many instances with tables that have few
> > pages as well.

If there's an easy and cheap way to see if there was concurrent i/o
happening for the page, then let's hear it.  One idea that has occurred
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started.  In that case, I would think it'd
be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal
corruption.
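
As a rough sketch (the helper name and parameters are made up for
illustration, and mtime granularity and trustworthiness will vary by
filesystem):

#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Decide whether a page that failed its checksum should be reported,
 * based on whether the file has been written since the backup started.
 */
static bool
report_as_possible_corruption(const char *path, bool checksum_ok,
                              bool lsn_in_range, time_t backup_start_time)
{
    struct stat st;

    if (checksum_ok)
        return false;            /* page verified fine */
    if (!lsn_in_range)
        return true;             /* bad checksum and implausible LSN */

    /*
     * Bad checksum but plausible LSN: if the file has not been modified
     * since the backup started, a concurrent write cannot explain it.
     */
    if (stat(path, &st) != 0)
        return false;            /* file gone: relation dropped, skip it */

    return st.st_mtime < backup_start_time;
}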

In general, however, I don't like the idea of reaching into PG and
asking PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that
they'll work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new?  At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.

In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seems extremely slim.  I
do still like the idea of doing better by actually scanning the WAL but
at least for now, this is far better than what we have today while not
introducing a huge amount of additional code or complexity.

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
Michael Paquier
Date:
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> * Magnus Hagander (magnus@hagander.net) wrote:
>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>>> But here the checksum is broken, so while the offset is something we
>>> can rely on how do you make sure that the LSN is fine?  A broken
>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>> the page header got corrupted.
>
> Of course that could be the case, but it gets to be a smaller and
> smaller chance by checking that the LSN read falls within reasonable
> bounds.

FWIW, I find that scary.

>> We cannot rely on the LSN itself. But it's a lot more likely that we
>> can rely on the LSN changing, and on the LSN changing in a "correct
>> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
>> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
>> the new page now has a correct checksum, it's most likely to be
>> correct. You could perhaps even put a cap on it saying "if the LSN
>> increased, but less than <n>", where <n> is a sufficiently high number
>> that it's entirely unreasonable to advance that far between the
>> reading of two blocks. But it has to have a very high margin in that
>> case.
>
> This is, in fact, included in what was proposed- the "max increase"
> would be "the ending LSN of the backup".  I don't think we can make it
> any tighter than that though without risking false positives, which is
> surely worse than a false negative in this particular case- we already
> risk false negatives due to the fact that our checksum isn't perfect, so
> even a perfect check to make sure that the page will, in fact, be
> replayed over during crash recovery doesn't guarantee that there's no
> corruption.
>
> If there's an easy and cheap way to see if there was concurrent i/o
> happening for the page, then let's hear it.

This has been discussed for a couple of months now.  I would recommend
to go through this thread:
https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

And this bit is interesting, because that would give the guarantees
you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
on the thread):
https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de

> One idea that has occurred
> to me which hasn't been discussed is checking the file's mtime to see if
> it's changed since the backup started.  In that case, I would think it'd
> be something like:
>
> - Checksum is invalid
> - LSN is within range
> - Close file
> - Stat file
> - If mtime is from before the backup then signal possible corruption

I suspect that relying on mtime may cause problems.  One case coming
to my mind is NFS.

> In general, however, I don't like the idea of reaching into PG and
> asking PG for this page.

It seems to me that if we don't ask PG what it thinks about a page,
we will never have a fully bullet-proof design either.
--
Michael

Attachment

Re: Online verification of checksums

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> >> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >>> But here the checksum is broken, so while the offset is something we
> >>> can rely on how do you make sure that the LSN is fine?  A broken
> >>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>> the page header got corrupted.
> >
> > Of course that could be the case, but it gets to be a smaller and
> > smaller chance by checking that the LSN read falls within reasonable
> > bounds.
>
> FWIW, I find that scary.

There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less.  Both cases would result in a false
negative, which is surely bad, though that strikes me as better than a
false positive, where we say there's corruption when there isn't.

> And this bit is interesting, because that would give the guarantees
> you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
> on the thread):
> https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de

There's no guarantee that the page is still in shared buffers or that we
have a buffer descriptor still for it by the time we're doing this, as I
said up-thread.  This approach requires that we reach into PG, acquire
at least a buffer descriptor and set BM_IO_IN_PROGRESS on it and then
read the page again and checksum it again before finally looking at the
(now 'trusted' LSN, even though it might have had some bits flipped in
it and we wouldn't know..) and see if it's higher than the start of the
backup, and maybe less than the current LSN.  Maybe we can avoid
actually pulling the page into shared buffers (reading it into our own
memory instead) and just have the buffer descriptor but none of this
seems like it's going to be very unobtrusive in either code or the
running system, and it isn't going to give us an actual guarantee that
there's been no corruption.  The amount that it improves on the checks
that I outline above seems to be exceedingly small and the question is
if it's worth it for, most likely, exclusively pg_basebackup (unless
we're going to figure out a way to expose this via SQL, which seems
unlikely).

> > One idea that has occured
> > to me which hasn't been discussed is checking the file's mtime to see if
> > it's changed since the backup started.  In that case, I would think it'd
> > be something like:
> >
> > - Checksum is invalid
> > - LSN is within range
> > - Close file
> > - Stat file
> > - If mtime is from before the backup then signal possible corruption
>
> I suspect that relying on mtime may cause problems.  One case coming
> to my mind is NFS.

I agree that it might not be perfect but it also seems like something
which could be reasonably cheaply checked and the window (between when
the backup started and the time we hit this torn page) is very likely to
be large enough that the mtime will have been updated and be different
(and forward, if it was modified) from what it was at the time the backup
started.  It's also something that incremental backups may be looking
at, so if there's serious problems with it then there's a good chance
you've got bigger issues.

> > In general, however, I don't like the idea of reaching into PG and
> > asking PG for this page.
>
>> It seems to me that if we don't ask PG what it thinks about a page,
> we will never have a fully bullet-proof design either.

None of this is bullet-proof, it's all trade-offs.

Thanks,

Stephen

Attachment

Re: Online verification of checksums

From
David Steele
Date:
On 11/30/20 9:27 AM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Paquier (michael@paquier.xyz) wrote:
>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>> * Magnus Hagander (magnus@hagander.net) wrote:
>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>>>>> But here the checksum is broken, so while the offset is something we
>>>>> can rely on how do you make sure that the LSN is fine?  A broken
>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>> the page header got corrupted.
>>>
>>> Of course that could be the case, but it gets to be a smaller and
>>> smaller chance by checking that the LSN read falls within reasonable
>>> bounds.
>>
>> FWIW, I find that scary.
> 
> There's ultimately different levels of 'scary' and the risk here that
> something is actually wrong following these checks strikes me as being
> on the same order as random bits being flipped in the page and still
> getting a valid checksum (which is entirely possible with our current
> checksum...), or maybe even less.  

I would say a lot less. First you'd need to corrupt one of the eight 
bytes that make up the LSN (pretty likely since corruption will probably 
affect the entire block) and then it would need to be updated to a value 
that falls within the current backup range, a 1 in 16 million chance if 
a terabyte of WAL is generated during the backup. Plus, the corruption 
needs to happen during the backup since we are going to check for that, 
and the corrupted LSN needs to be ascending, and the LSN originally read 
needs to be within the backup range (another 1 in 16 million chance) 
since pages written before the start backup checkpoint should not be torn.
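
(For reference, that figure is simple arithmetic, assuming a corrupted
pd_lsn behaves like a random 64-bit value and the backup generates about
a terabyte, i.e. 2^40 bytes, of WAL:

    2^40 / 2^64  =  2^-24  ~=  1 in 16.8 million

which is where the 1-in-16-million number above comes from.)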

So as far as I can see there are more likely to be false negatives from 
the checksum itself.

It would also be easy to add a few rounds of checks, i.e. test if the 
LSN ascends but stays in the backup LSN range N times.

Honestly, I'm much more worried about corruption zeroing the entire 
page. I don't know how likely that is, but I know none of our proposed 
solutions would catch it.

Andres, since you brought this issue up originally perhaps you'd like to 
weigh in?

Regards,
-- 
-David
david@pgmasters.net



Re: Online verification of checksums

From
Ibrar Ahmed
Date:


On Tue, Mar 9, 2021 at 10:43 PM David Steele <david@pgmasters.net> wrote:
On 11/30/20 6:38 PM, David Steele wrote:
> On 11/30/20 9:27 AM, Stephen Frost wrote:
>> * Michael Paquier (michael@paquier.xyz) wrote:
>>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>>> * Magnus Hagander (magnus@hagander.net) wrote:
>>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
>>>>> <michael@paquier.xyz> wrote:
>>>>>> But here the checksum is broken, so while the offset is something we
>>>>>> can rely on how do you make sure that the LSN is fine?  A broken
>>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>>> the page header got corrupted.
>>>>
>>>> Of course that could be the case, but it gets to be a smaller and
>>>> smaller chance by checking that the LSN read falls within reasonable
>>>> bounds.
>>>
>>> FWIW, I find that scary.
>>
>> There's ultimately different levels of 'scary' and the risk here that
>> something is actually wrong following these checks strikes me as being
>> on the same order as random bits being flipped in the page and still
>> getting a valid checksum (which is entirely possible with our current
>> checksum...), or maybe even less.
>
> I would say a lot less. First you'd need to corrupt one of the eight
> bytes that make up the LSN (pretty likely since corruption will probably
> affect the entire block) and then it would need to be updated to a value
> that falls within the current backup range, a 1 in 16 million chance if
> a terabyte of WAL is generated during the backup. Plus, the corruption
> needs to happen during the backup since we are going to check for that,
> and the corrupted LSN needs to be ascending, and the LSN originally read
> needs to be within the backup range (another 1 in 16 million chance)
> since pages written before the start backup checkpoint should not be torn.
>
> So as far as I can see there are more likely to be false negatives from
> the checksum itself.
>
> It would also be easy to add a few rounds of checks, i.e. test if the
> LSN ascends but stays in the backup LSN range N times.
>
> Honestly, I'm much more worried about corruption zeroing the entire
> page. I don't know how likely that is, but I know none of our proposed
> solutions would catch it.
>
> Andres, since you brought this issue up originally perhaps you'd like to
> weigh in?

I had another look at this patch and though I think my suggestions above
would improve the patch, I have no objections to going forward as is (if
that is the consensus) since this seems an improvement over what we have
now.

It comes down to whether you prefer false negatives or false positives.
With the LSN checking Stephen and I advocate it is theoretically
possible to have a false negative but the chances of the LSN ascending N
times but staying within the backup LSN range due to corruption seem to
be approaching zero.

I think Michael's method is unlikely to throw false positives, but it
seems at least possible that a block would be hot enough to appear
torn N times in a row. Torn pages themselves are really easy to reproduce.

If we do go forward with this method I would likely propose the
LSN-based approach as a future improvement.

Regards,
--
-David
david@pgmasters.net


 
I am changing the status to "Waiting on Author" based on the latest comments of @David Steele 
and secondly because the patch does not apply cleanly.

Re: Online verification of checksums

From
Daniel Gustafsson
Date:
> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele
> and secondly because the patch does not apply cleanly.

This patch hasn’t moved since it was marked as WoA in the last CF and still doesn’t
apply; unless there is a new version brewing, it seems apt to close this as RwF
and await a new entry in a future CF.

--
Daniel Gustafsson        https://vmware.com/




Re: Online verification of checksums

From
Daniel Gustafsson
Date:
> On 2 Sep 2021, at 13:18, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>
>> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele
>> and secondly because the patch does not apply cleanly.
>
> This patch hasn’t moved since it was marked as WoA in the last CF and still doesn’t
> apply; unless there is a new version brewing, it seems apt to close this as RwF
> and await a new entry in a future CF.

As there has been no movement, I've marked this patch as RwF.

--
Daniel Gustafsson        https://vmware.com/