Re: Online verification of checksums - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Online verification of checksums
Date
Msg-id dc63116a-d88b-0f7f-1a3d-621b6e002329@2ndquadrant.com
Whole thread Raw
In response to Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Online verification of checksums
Re: Online verification of checksums
List pgsql-hackers
On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'.  The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the ckeckpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
> 
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
> 

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
> 
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely.  Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
> 

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if it's LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> 
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
> 

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
> 
> I agree that there's some risk there, but it's certainly much less
> likely.
> 

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory and/or memory pressure, etc.
> 
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level.  It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
> 
> I'll chat w/ David shortly about this again though and get his thoughts
> on it.  This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.
> 

Why not to simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Online verification of checksums
Next
From: Stephen Frost
Date:
Subject: Re: Online verification of checksums