Re: Online verification of checksums - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: Online verification of checksums |
Date | |
Msg-id | 20180917164246.GZ4184@tamriel.snowman.net Whole thread Raw |
In response to | Re: Online verification of checksums (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: Online verification of checksums
|
List | pgsql-hackers |
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/17/2018 04:46 PM, Stephen Frost wrote: > > * Michael Banck (michael.banck@credativ.de) wrote: > >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: > >>> Obviously, pg_verify_checksums can't do that easily because it's > >>> supposed to run from outside the database instance. > >> > >> It reads pg_control anyway, so couldn't we just take > >> ControlFile->checkPoint? > >> > >> Other than that, basebackup.c seems to only look at pages which haven't > >> been changed since the backup starting checkpoint (see above if > >> statement). That's reasonable for backups, but is it just as reasonable > >> for online verification? > > > > Right, basebackup doesn't need to look at other pages. > > > >>> The pg_verify_checksum apparently ignores this skip logic, because on > >>> the retry it simply re-reads the page again, verifies the checksum and > >>> reports an error. Which is broken, because the newly read page might be > >>> torn again due to a concurrent write. > >> > >> Well, ok. > > > > The newly read page will have an updated LSN though then on the re-read, > > in which case basebackup can know that what happened was a rewrite of > > the page and it no longer has to care about the page and can skip it. > > > > I haven't looked, but if basebackup isn't checking the LSN again for the > > newly read page then that'd be broken, but I believe it does (at least, > > that's the algorithm we came up with for pgBackRest, and I know David > > shared that when the basebackup code was being written). > > Yes, basebackup does check the LSN on re-read, and skips the page if it > changed on re-read (because it eliminates the consistency guarantees > provided by the checkpoint). Ok, good, though I'm not sure what you mean by 'eliminates the consistency guarantees provided by the checkpoint'. The point is that the page will be in the WAL and the WAL will be replayed during the restore of the backup. > >> So how about we do check every page, but if one fails on retry, and the > >> LSN is newer than the checkpoint, we then skip it? Is that logic sound? > > > > I thought that's what basebackup did- if it doesn't do that today, then > > it really should. > > The crucial distinction here is that the trick is not in comparing LSNs > from the two page reads, but comparing it to the checkpoint LSN. If it's > greater, the page may be torn or broken, and there's no way to know > which case it is - so basebackup simply skips it. Sure, because we don't care about it any longer- that page isn't interesting because the WAL will replay over it. IIRC it actually goes something like: check the checksum, if it failed then check if the LSN is greater than the checkpoint (of the backup start..), if not, then re-read, if the LSN is now newer than the checkpoint then skip, if the LSN is the same then throw an error. > >> In any case, if we decide we really should skip the page if it is newer > >> than the checkpoint, I think it makes sense to track those skipped pages > >> and print their number out at the end, if there are any. > > > > Not sure what the point of this is. If we wanted to really do something > > to cross-check here, we'd track the pages that were skipped and then > > look through the WAL to make sure that they're there. That's something > > we've talked about doing with pgBackRest, but don't currently. > > I agree simply printing the page numbers seems rather useless. What we > could do is remember which pages we skipped and then try checking them > after another checkpoint. Or something like that. I'm still not sure I'm seeing the point of that. They're still going to almost certainly be in the kernel cache. The reason for checking against the WAL would be to detect errors in PG where we aren't putting a page into the WAL when it really should be, or something similar, which seems like it at least could be useful. Maybe to put it another way- there's very little point in checking the checksum of a page which we know must be re-written during recovery to get to a consistent point. I don't think it hurts in the general case, but I wouldn't write a lot of code which then needs to be tested to handle it. I also don't think that we really need to make pg_verify_checksum spend lots of extra cycles trying to verify that *every* page had its checksum validated when we know that lots of pages are going to be in memory marked dirty and our checking of them will be ultimately pointless as they'll either be written out by the checkpointer or some other process, or we'll replay them from the WAL if we crash. > >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem > >>> to protect against anything, really. > >> > >> Well, I've noticed that without it I get sporadic checksum failures on > >> reread, so I've added it to make them go away. It was certainly a > >> phenomenological decision that I am happy to trade for a better one. > > > > That then sounds like we really aren't re-checking the LSN, and we > > really should be, to avoid getting these sporadic checksum failures on > > reread.. > > Again, it's not enough to check the LSN against the preceding read. We > need a checkpoint LSN or something like that. I actually tend to disagree with you that, for this purpose, it's actually necessary to check against the checkpoint LSN- if the LSN changed and everything is operating correctly then the new LSN must be more recent than the last checkpoint location or things are broken badly. Now, that said, I do think it's a good *idea* to check against the checkpoint LSN (presuming this is for online checking of checksums- for basebackup, we could just check against the backup-start LSN as anything after that point will be rewritten by WAL anyway). The reason that I think it's a good idea to check against the checkpoint LSN is that we'd want to throw a big warning if the kernel is just feeding us random garbage on reads and only finding a difference between two reads isn't really doing any kind of validation, whereas checking against the checkpoint-LSN would at least give us some idea that the value being read isn't completely ridiculous. When it comes to if the pg_sleep() is necessary or not, I have to admit to being unsure about that.. I could see how it might be but it seems a bit surprising- I'd probably want to see exactly what the page was at the time of the failure and at the time of the second (no-sleep) re-read and then after a delay and convince myself that it was just an unlucky case of being scheduled in twice to read that page before the process writing it out got a chance to finish the write. Thanks! Stephen
Attachment
pgsql-hackers by date: