Re: Online verification of checksums - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Online verification of checksums |
Date | |
Msg-id | dc63116a-d88b-0f7f-1a3d-621b6e002329@2ndquadrant.com Whole thread Raw |
In response to | Re: Online verification of checksums (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Online verification of checksums
Re: Online verification of checksums |
List | pgsql-hackers |
On 09/17/2018 07:11 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 09/17/2018 06:42 PM, Stephen Frost wrote: >>> Ok, good, though I'm not sure what you mean by 'eliminates the >>> consistency guarantees provided by the checkpoint'. The point is that >>> the page will be in the WAL and the WAL will be replayed during the >>> restore of the backup. >> >> The checkpoint guarantees that the whole page was written and flushed to >> disk with an LSN before the ckeckpoint LSN. So when you read a page with >> that LSN, you know the whole write already completed and a read won't >> return data from before the LSN. > > Well, you know that the first part was written out at some prior point, > but you could end up reading the first part of a page with an older LSN > while also reading the second part with new data. > Doesn't the checkpoint fsync pretty much guarantee this can't happen? >> Without the checkpoint that's not guaranteed, and simply re-reading the >> page and rechecking it vs. the first read does not help: >> >> 1) write the first 512B of the page (sector), which includes the LSN >> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ] >> >> 3) the checksum verification fails >> >> 4) read the page again (possibly reading a bit more new data) >> >> 5) the LSN did not change compared to the first read, yet the checksum >> still fails > > So, I agree with all of the above though I've found it to be extremely > rare to get a single read which you've managed to catch part-way through > a write, getting multiple of them over a period of time strikes me as > even more unlikely. Still, if we can come up with a solution to solve > all of this, great, but I'm not sure that I'm hearing one. > I don't recall claiming catching many such torn pages - I'm sure it's not very common in most workloads. But I suspect constructing workloads hitting them regularly is not very difficult either (something with a lot of churn in shared buffers should do the trick). >>> Sure, because we don't care about it any longer- that page isn't >>> interesting because the WAL will replay over it. IIRC it actually goes >>> something like: check the checksum, if it failed then check if the LSN >>> is greater than the checkpoint (of the backup start..), if not, then >>> re-read, if the LSN is now newer than the checkpoint then skip, if the >>> LSN is the same then throw an error. >> >> Nope, we only verify the checksum if it's LSN precedes the checkpoint: >> >> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 > > That seems like it's leaving something on the table, but, to be fair, we > know that all of those pages should be rewritten by WAL anyway so they > aren't all that interesting to us, particularly in the basebackup case. > Yep. >>> I actually tend to disagree with you that, for this purpose, it's >>> actually necessary to check against the checkpoint LSN- if the LSN >>> changed and everything is operating correctly then the new LSN must be >>> more recent than the last checkpoint location or things are broken >>> badly. >> >> I don't follow. Are you suggesting we don't need the checkpoint LSN? >> >> I'm pretty sure that's not the case. The thing is - the LSN may not >> change between the two reads, but that's not a guarantee the page was >> not torn. The example I posted earlier in this message illustrates that. > > I agree that there's some risk there, but it's certainly much less > likely. > Well. If we're going to report a checksum failure, we better be sure it actually is a broken page. I don't want users to start chasing bogus data corruption issues. >>> Now, that said, I do think it's a good *idea* to check against the >>> checkpoint LSN (presuming this is for online checking of checksums- for >>> basebackup, we could just check against the backup-start LSN as anything >>> after that point will be rewritten by WAL anyway). The reason that I >>> think it's a good idea to check against the checkpoint LSN is that we'd >>> want to throw a big warning if the kernel is just feeding us random >>> garbage on reads and only finding a difference between two reads isn't >>> really doing any kind of validation, whereas checking against the >>> checkpoint-LSN would at least give us some idea that the value being >>> read isn't completely ridiculous. >>> >>> When it comes to if the pg_sleep() is necessary or not, I have to admit >>> to being unsure about that.. I could see how it might be but it seems a >>> bit surprising- I'd probably want to see exactly what the page was at >>> the time of the failure and at the time of the second (no-sleep) re-read >>> and then after a delay and convince myself that it was just an unlucky >>> case of being scheduled in twice to read that page before the process >>> writing it out got a chance to finish the write. >> >> I think the pg_sleep() is a pretty strong sign there's something broken. >> At the very least, it's likely to misbehave on machines with different >> timings, machines under memory and/or memory pressure, etc. > > If we assume that what you've outlined above is a serious enough issue > that we have to address it, and do so without a pg_sleep(), then I think > we have to bake into this a way for the process to check with PG as to > what the page's current LSN is, in shared buffers, because that's the > only place where we've got the locking required to ensure that we don't > end up with a read of a partially written page, and I'm really not > entirely convinced that we need to go to that level. It'd certainly add > a huge amount of additional complexity for what appears to be a quite > unlikely gain. > > I'll chat w/ David shortly about this again though and get his thoughts > on it. This is certainly an area we've spent time thinking about but > are obviously also open to finding a better solution. > Why not to simply look at the last checkpoint LSN and use that the same way basebackup does? AFAICS that should make the pg_sleep() unnecessary. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: