Re: Online verification of checksums - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: Online verification of checksums |
Date | |
Msg-id | 20180917171106.GA4184@tamriel.snowman.net Whole thread Raw |
In response to | Re: Online verification of checksums (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: Online verification of checksums
|
List | pgsql-hackers |
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/17/2018 06:42 PM, Stephen Frost wrote: > > Ok, good, though I'm not sure what you mean by 'eliminates the > > consistency guarantees provided by the checkpoint'. The point is that > > the page will be in the WAL and the WAL will be replayed during the > > restore of the backup. > > The checkpoint guarantees that the whole page was written and flushed to > disk with an LSN before the ckeckpoint LSN. So when you read a page with > that LSN, you know the whole write already completed and a read won't > return data from before the LSN. Well, you know that the first part was written out at some prior point, but you could end up reading the first part of a page with an older LSN while also reading the second part with new data. > Without the checkpoint that's not guaranteed, and simply re-reading the > page and rechecking it vs. the first read does not help: > > 1) write the first 512B of the page (sector), which includes the LSN > > 2) read the whole page, which will be a mix [new 512B, ... old ... ] > > 3) the checksum verification fails > > 4) read the page again (possibly reading a bit more new data) > > 5) the LSN did not change compared to the first read, yet the checksum > still fails So, I agree with all of the above though I've found it to be extremely rare to get a single read which you've managed to catch part-way through a write, getting multiple of them over a period of time strikes me as even more unlikely. Still, if we can come up with a solution to solve all of this, great, but I'm not sure that I'm hearing one. > > Sure, because we don't care about it any longer- that page isn't > > interesting because the WAL will replay over it. IIRC it actually goes > > something like: check the checksum, if it failed then check if the LSN > > is greater than the checkpoint (of the backup start..), if not, then > > re-read, if the LSN is now newer than the checkpoint then skip, if the > > LSN is the same then throw an error. > > Nope, we only verify the checksum if it's LSN precedes the checkpoint: > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 That seems like it's leaving something on the table, but, to be fair, we know that all of those pages should be rewritten by WAL anyway so they aren't all that interesting to us, particularly in the basebackup case. > > I actually tend to disagree with you that, for this purpose, it's > > actually necessary to check against the checkpoint LSN- if the LSN > > changed and everything is operating correctly then the new LSN must be > > more recent than the last checkpoint location or things are broken > > badly. > > I don't follow. Are you suggesting we don't need the checkpoint LSN? > > I'm pretty sure that's not the case. The thing is - the LSN may not > change between the two reads, but that's not a guarantee the page was > not torn. The example I posted earlier in this message illustrates that. I agree that there's some risk there, but it's certainly much less likely. > > Now, that said, I do think it's a good *idea* to check against the > > checkpoint LSN (presuming this is for online checking of checksums- for > > basebackup, we could just check against the backup-start LSN as anything > > after that point will be rewritten by WAL anyway). The reason that I > > think it's a good idea to check against the checkpoint LSN is that we'd > > want to throw a big warning if the kernel is just feeding us random > > garbage on reads and only finding a difference between two reads isn't > > really doing any kind of validation, whereas checking against the > > checkpoint-LSN would at least give us some idea that the value being > > read isn't completely ridiculous. > > > > When it comes to if the pg_sleep() is necessary or not, I have to admit > > to being unsure about that.. I could see how it might be but it seems a > > bit surprising- I'd probably want to see exactly what the page was at > > the time of the failure and at the time of the second (no-sleep) re-read > > and then after a delay and convince myself that it was just an unlucky > > case of being scheduled in twice to read that page before the process > > writing it out got a chance to finish the write. > > I think the pg_sleep() is a pretty strong sign there's something broken. > At the very least, it's likely to misbehave on machines with different > timings, machines under memory and/or memory pressure, etc. If we assume that what you've outlined above is a serious enough issue that we have to address it, and do so without a pg_sleep(), then I think we have to bake into this a way for the process to check with PG as to what the page's current LSN is, in shared buffers, because that's the only place where we've got the locking required to ensure that we don't end up with a read of a partially written page, and I'm really not entirely convinced that we need to go to that level. It'd certainly add a huge amount of additional complexity for what appears to be a quite unlikely gain. I'll chat w/ David shortly about this again though and get his thoughts on it. This is certainly an area we've spent time thinking about but are obviously also open to finding a better solution. Thanks! Stephen
Attachment
pgsql-hackers by date: