Thread: Online verification of checksums
Hi,

v10 almost added online activation of checksums, but all we've got is pg_verify_checksums, i.e. offline verification of checksums. However, we also got (online) checksum verification during base backups, and I have ported/adapted David Steele's recheck code to my personal fork of pg_checksums[1], removed the online check (for verification) and that seems to work fine.

I've now forward-ported this change to pg_verify_checksums, in order to make this application useful for online clusters, see attached patch.

I've tested this in a tight loop (while true; do pg_verify_checksums -D data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench; done", which I already used to develop the original code in the fork and which brought up a few bugs.

I got one checksum verification failure this way; all others were caught by the recheck (I've introduced a 500ms delay for the first ten failures), like this:

|pg_verify_checksums: checksum verification failed on first attempt in
|file "data1/base/16837/16850", block 7770: calculated checksum 785 but
|expected 5063
|pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
|verified ok on recheck

However, I am also seeing sporadic (maybe 0.5 times per pgbench run) failures like this:

|pg_verify_checksums: short read of block 2644 in file
|"data1/base/16637/16650", got only 4096 bytes

This is not strictly a verification failure; should we do anything about it? In my fork, I am also rechecking on this[3] (and I am happy to extend the patch that way), but that makes the code and the patch more complicated and I wanted to check the general opinion on this case first.


Michael

[1] https://github.com/credativ/pg_checksums/commit/dc052f0d6f1282d3c8215b0eb28b8e7c4e74f9e5
[2] while patching out the somewhat unhelpful (in regular operation, anyway) debug message for every successful checksum verification
[3] https://github.com/credativ/pg_checksums/blob/master/pg_checksums.c#L160

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Attachment
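To make the recheck approach described above concrete, here is a rough sketch of the idea. This is not the posted patch: the helper name, the lseek()-based re-read and the fixed 500ms delay are illustrative assumptions; pg_checksum_page(), pg_usleep() and the page macros are the existing PostgreSQL primitives.

    /*
     * Sketch only -- assumes postgres_fe.h, storage/bufpage.h,
     * storage/checksum.h (plus checksum_impl.h in a frontend build)
     * and <unistd.h>.  Verify one block, re-reading it once after a
     * short delay if the first read fails verification.
     */
    static bool
    verify_block_with_recheck(int fd, BlockNumber blockno, int segmentno)
    {
        char        buf[BLCKSZ];
        PageHeader  header = (PageHeader) buf;
        int         attempt;

        for (attempt = 0; attempt < 2; attempt++)
        {
            if (lseek(fd, (off_t) blockno * BLCKSZ, SEEK_SET) < 0 ||
                read(fd, buf, BLCKSZ) != BLCKSZ)
                return false;       /* short read, as discussed above */

            if (PageIsNew(buf))
                return true;        /* all-zero pages carry no checksum */

            if (pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE) ==
                header->pd_checksum)
                return true;        /* ok, possibly only on the recheck */

            if (attempt == 0)
                pg_usleep(500 * 1000L); /* 500ms pause before the re-read */
        }

        return false;               /* failed on both reads */
    }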
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command?
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
Why not provide this functionality as a server function or command.
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.
I think it would make sense to provide this functionality in the "checksum worker" infrastructure suggested in the online checksum enabling patch. But I think being able to run it from the outside would also be useful, particularly when it's this simple.
But why do we need a sleep in it? AFAICT this is basically the same code that we have in basebackup.c, and that one does not need the sleep? Certainly 500ms would be very long since we're just protecting against a torn page, but the comment is wrong I think, and we're actually sleeping 0.5ms?
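(For reference, pg_usleep() takes its argument in microseconds, which is what the 500ms-vs-0.5ms remark above is about. A trivial standalone illustration, not code from the patch:)

    #include "postgres_fe.h"

    int
    main(void)
    {
        pg_usleep(500);             /* 500 microseconds = 0.5 ms */
        pg_usleep(500 * 1000L);     /* 500,000 microseconds = 500 ms */
        return 0;
    }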
Hallo Michael,

> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Patch does not seem to apply anymore, could you rebase it?

-- 
Fabien.
Hi,

The patch is mostly copying the verification / retry logic from basebackup.c, but I think it omitted a rather important detail that makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

    startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when checking a page it does this:

    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {
        ... verify the page checksum
    }

Obviously, pg_verify_checksums can't do that easily because it's supposed to run from outside the database instance.

But the startptr detail is pretty important because it supports this retry reasoning:

    /*
     * Retry the block on the first failure.  It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us.  We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass.  If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being [A1,A2] and another process over-writing it with [B1,B2]. If you read the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write calls (and even if it does, the filesystems on Linux don't seem to). So you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

    write([B1,B2])

the other process may first read [A1,B2], but the next read will return [B1,B2] (or possibly newer data, if there was another write). It will not read the "stale" A1 again.

The basebackup relies on this kinda implicitly - on the retry it'll notice the LSN changed (thanks to the startptr check), and the page will be skipped entirely. This is pretty important, because the new page might be torn in some other way.

The pg_verify_checksum apparently ignores this skip logic, because on the retry it simply re-reads the page again, verifies the checksum and reports an error. Which is broken, because the newly read page might be torn again due to a concurrent write.

So IMHO this should do something similar to basebackup - check the page LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr is used in basebackup. In particular we can't simply remember LSN from the first read, because we might actually read [B1,A2] on the first try, and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be torn in various other ways, not necessarily at the 4kB boundary - it might be torn right after the LSN, for example).

FWIW I also don't understand the purpose of pg_sleep(), it does not seem to protect against anything, really.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
>
> The very first thing basebackup does is this:
>
>     startptr = do_pg_start_backup(...);
>
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
>
>     if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>     {
>         ... verify the page checksum
>     }
>
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't been changed since the backup starting checkpoint (see above if statement). That's reasonable for backups, but is it just as reasonable for online verification?

> But the startptr detail is pretty important because it supports this
> retry reasoning:
>
>     /*
>      * Retry the block on the first failure.  It's
>      * possible that we read the first 4K page of the
>      * block just before postgres updated the entire block
>      * so it ends up looking torn to us.  We only need to
>      * retry once because the LSN should be updated to
>      * something we can ignore on the next pass.  If the
>      * error happens again then it is a true validation
>      * failure.
>      */
>
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
>
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
>
> Pretty much the only thing you can rely on is that when one process does
>
>     write([B1,B2])
>
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
>
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

Well, ok.

> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
>
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we see a new LSN, if possible. If I run a modified pg_verify_checksums which skips on newer pages in a tight benchmark, basically everything gets skipped as checkpoints don't happen often enough.

So how about we do check every page, but if one fails on retry, and the LSN is newer than the checkpoint, we then skip it? Is that logic sound?

In any case, if we decide we really should skip the page if it is newer than the checkpoint, I think it makes sense to track those skipped pages and print their number out at the end, if there are any.

> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on reread, so I've added it to make them go away. It was certainly a phenomenological decision that I am happy to trade for a better one.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606' or some such file which pg_verify_checksums gets confused on, I guess we should skip that as well. Can we assume that all files that start with the ones in skip[] are safe to skip or should we have an exception for files starting with pg_internal.init?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
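A minimal sketch of what "just take ControlFile->checkPoint" could look like on the frontend side, reusing the pg_control reading that pg_verify_checksums already does. The get_checkpoint_lsn() helper is hypothetical, the error handling is illustrative, and the three-argument get_controlfile() form is an assumption matching the PG 11-era API (the signature has changed between releases):

    #include "postgres_fe.h"

    #include "access/xlogdefs.h"
    #include "catalog/pg_control.h"
    #include "common/controldata_utils.h"

    static XLogRecPtr
    get_checkpoint_lsn(const char *DataDir, const char *progname)
    {
        ControlFileData *ControlFile;
        bool        crc_ok;

        ControlFile = get_controlfile(DataDir, progname, &crc_ok);
        if (!crc_ok)
        {
            fprintf(stderr, "%s: pg_control CRC value is incorrect\n",
                    progname);
            exit(1);
        }

        /* LSN of the latest checkpoint record, as suggested above */
        return ControlFile->checkPoint;
    }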
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: > > Obviously, pg_verify_checksums can't do that easily because it's > > supposed to run from outside the database instance. > > It reads pg_control anyway, so couldn't we just take > ControlFile->checkPoint? > > Other than that, basebackup.c seems to only look at pages which haven't > been changed since the backup starting checkpoint (see above if > statement). That's reasonable for backups, but is it just as reasonable > for online verification? Right, basebackup doesn't need to look at other pages. > > The pg_verify_checksum apparently ignores this skip logic, because on > > the retry it simply re-reads the page again, verifies the checksum and > > reports an error. Which is broken, because the newly read page might be > > torn again due to a concurrent write. > > Well, ok. The newly read page will have an updated LSN though then on the re-read, in which case basebackup can know that what happened was a rewrite of the page and it no longer has to care about the page and can skip it. I haven't looked, but if basebackup isn't checking the LSN again for the newly read page then that'd be broken, but I believe it does (at least, that's the algorithm we came up with for pgBackRest, and I know David shared that when the basebackup code was being written). > > So IMHO this should do something similar to basebackup - check the page > > LSN, and if it changed then skip the page. > > > > I'm afraid this requires using the last checkpoint LSN, the way startptr > > is used in basebackup. In particular we can't simply remember LSN from > > the first read, because we might actually read [B1,A2] on the first try, > > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be > > torn in various other ways, not necessarily at the 4kB boundary - it > > might be torn right after the LSN, for example). > > I'd prefer to come up with a plan where we don't just give up once we > see a new LSN, if possible. If I run a modified pg_verify_checksums > which skips on newer pages in a tight benchmark, basically everything > gets skipped as checkpoints don't happen often enough. I'm really not sure how you expect to be able to do something different here. Even if we started poking into shared buffers, all you'd be able to see is that there's a bunch of dirty pages- and we don't maintain the checksums in shared buffers, so it's not like you could verify them there. You could possibly have an option that says "force a checkpoint" but, honestly, that's really not all that interesting either- all you'd be doing is forcing all the pages to be written out from shared buffers into the kernel cache and then reading them from there instead, it's not like you'd actually be able to tell if there was a disk/storage error because you'll only be looking at the kernel cache. > So how about we do check every page, but if one fails on retry, and the > LSN is newer than the checkpoint, we then skip it? Is that logic sound? I thought that's what basebackup did- if it doesn't do that today, then it really should. > In any case, if we decide we really should skip the page if it is newer > than the checkpoint, I think it makes sense to track those skipped pages > and print their number out at the end, if there are any. Not sure what the point of this is. 
If we wanted to really do something to cross-check here, we'd track the pages that were skipped and then look through the WAL to make sure that they're there. That's something we've talked about doing with pgBackRest, but don't currently. > > FWIW I also don't understand the purpose of pg_sleep(), it does not seem > > to protect against anything, really. > > Well, I've noticed that without it I get sporadic checksum failures on > reread, so I've added it to make them go away. It was certainly a > phenomenological decision that I am happy to trade for a better one. That then sounds like we really aren't re-checking the LSN, and we really should be, to avoid getting these sporadic checksum failures on reread.. > Also, I noticed there's sometimes a 'data/global/pg_internal.init.606' > or some such file which pg_verify_checksums gets confused on, I guess we > should skip that as well. Can we assume that all files that start with > the ones in skip[] are safe to skip or should we have an exception for > files starting with pg_internal.init? Everything listed in skip is safe to skip on a restore.. I've not really thought too much about if they're all safe to skip when checking checksums for an online system, but I would generally think so.. Thanks! Stephen
Attachment
On 09/17/2018 04:46 PM, Stephen Frost wrote: > Greetings, > > * Michael Banck (michael.banck@credativ.de) wrote: >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: >>> Obviously, pg_verify_checksums can't do that easily because it's >>> supposed to run from outside the database instance. >> >> It reads pg_control anyway, so couldn't we just take >> ControlFile->checkPoint? >> >> Other than that, basebackup.c seems to only look at pages which haven't >> been changed since the backup starting checkpoint (see above if >> statement). That's reasonable for backups, but is it just as reasonable >> for online verification? > > Right, basebackup doesn't need to look at other pages. > >>> The pg_verify_checksum apparently ignores this skip logic, because on >>> the retry it simply re-reads the page again, verifies the checksum and >>> reports an error. Which is broken, because the newly read page might be >>> torn again due to a concurrent write. >> >> Well, ok. > > The newly read page will have an updated LSN though then on the re-read, > in which case basebackup can know that what happened was a rewrite of > the page and it no longer has to care about the page and can skip it. > > I haven't looked, but if basebackup isn't checking the LSN again for the > newly read page then that'd be broken, but I believe it does (at least, > that's the algorithm we came up with for pgBackRest, and I know David > shared that when the basebackup code was being written). > Yes, basebackup does check the LSN on re-read, and skips the page if it changed on re-read (because it eliminates the consistency guarantees provided by the checkpoint). >>> So IMHO this should do something similar to basebackup - check the page >>> LSN, and if it changed then skip the page. >>> >>> I'm afraid this requires using the last checkpoint LSN, the way startptr >>> is used in basebackup. In particular we can't simply remember LSN from >>> the first read, because we might actually read [B1,A2] on the first try, >>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be >>> torn in various other ways, not necessarily at the 4kB boundary - it >>> might be torn right after the LSN, for example). >> >> I'd prefer to come up with a plan where we don't just give up once we >> see a new LSN, if possible. If I run a modified pg_verify_checksums >> which skips on newer pages in a tight benchmark, basically everything >> gets skipped as checkpoints don't happen often enough. > > I'm really not sure how you expect to be able to do something different > here. Even if we started poking into shared buffers, all you'd be able > to see is that there's a bunch of dirty pages- and we don't maintain the > checksums in shared buffers, so it's not like you could verify them > there. > > You could possibly have an option that says "force a checkpoint" but, > honestly, that's really not all that interesting either- all you'd be > doing is forcing all the pages to be written out from shared buffers > into the kernel cache and then reading them from there instead, it's not > like you'd actually be able to tell if there was a disk/storage error > because you'll only be looking at the kernel cache. > Yeah. >> So how about we do check every page, but if one fails on retry, and the >> LSN is newer than the checkpoint, we then skip it? Is that logic sound? > > I thought that's what basebackup did- if it doesn't do that today, then > it really should. 
> The crucial distinction here is that the trick is not in comparing LSNs from the two page reads, but comparing it to the checkpoint LSN. If it's greater, the page may be torn or broken, and there's no way to know which case it is - so basebackup simply skips it. >> In any case, if we decide we really should skip the page if it is newer >> than the checkpoint, I think it makes sense to track those skipped pages >> and print their number out at the end, if there are any. > > Not sure what the point of this is. If we wanted to really do something > to cross-check here, we'd track the pages that were skipped and then > look through the WAL to make sure that they're there. That's something > we've talked about doing with pgBackRest, but don't currently. > I agree simply printing the page numbers seems rather useless. What we could do is remember which pages we skipped and then try checking them after another checkpoint. Or something like that. >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem >>> to protect against anything, really. >> >> Well, I've noticed that without it I get sporadic checksum failures on >> reread, so I've added it to make them go away. It was certainly a >> phenomenological decision that I am happy to trade for a better one. > > That then sounds like we really aren't re-checking the LSN, and we > really should be, to avoid getting these sporadic checksum failures on > reread.. > Again, it's not enough to check the LSN against the preceding read. We need a checkpoint LSN or something like that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/17/2018 04:04 PM, Michael Banck wrote: > Hi, > > On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: >> The patch is mostly copying the verification / retry logic from >> basebackup.c, but I think it omitted a rather important detail that >> makes it incorrect in the presence of concurrent writes. >> >> The very first thing basebackup does is this: >> >> startptr = do_pg_start_backup(...); >> >> i.e. it waits for a checkpoint, remembering the LSN. And then when >> checking a page it does this: >> >> if (!PageIsNew(page) && PageGetLSN(page) < startptr) >> { >> ... verify the page checksum >> } >> >> Obviously, pg_verify_checksums can't do that easily because it's >> supposed to run from outside the database instance. > > It reads pg_control anyway, so couldn't we just take > ControlFile->checkPoint? > > Other than that, basebackup.c seems to only look at pages which haven't > been changed since the backup starting checkpoint (see above if > statement). That's reasonable for backups, but is it just as reasonable > for online verification? > I suppose we might refresh the checkpoint LSN regularly, and use the most recent one. On large/busy databases that would allow checking larger part of the database. >> But the startptr detail is pretty important because it supports this >> retry reasoning: >> >> /* >> * Retry the block on the first failure. It's >> * possible that we read the first 4K page of the >> * block just before postgres updated the entire block >> * so it ends up looking torn to us. We only need to >> * retry once because the LSN should be updated to >> * something we can ignore on the next pass. If the >> * error happens again then it is a true validation >> * failure. >> */ >> >> Imagine the 8kB page as two 4kB pages, with the initial state being >> [A1,A2] and another process over-writing it with [B1,B2]. If you read >> the 8kB page, what states can you see? >> >> I don't think POSIX provides any guarantees about atomicity of the write >> calls (and even if it does, the filesystems on Linux don't seem to). So >> you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes >> of the two versions, depending on timing. Well, torn pages ... >> >> Pretty much the only thing you can rely on is that when one process does >> >> write([B1,B2]) >> >> the other process may first read [A1,B2], but the next read will return >> [B1,B2] (or possibly newer data, if there was another write). It will >> not read the "stale" A1 again. >> >> The basebackup relies on this kinda implicitly - on the retry it'll >> notice the LSN changed (thanks to the startptr check), and the page will >> be skipped entirely. This is pretty important, because the new page >> might be torn in some other way. >> >> The pg_verify_checksum apparently ignores this skip logic, because on >> the retry it simply re-reads the page again, verifies the checksum and >> reports an error. Which is broken, because the newly read page might be >> torn again due to a concurrent write. > > Well, ok. > >> So IMHO this should do something similar to basebackup - check the page >> LSN, and if it changed then skip the page. >> >> I'm afraid this requires using the last checkpoint LSN, the way startptr >> is used in basebackup. In particular we can't simply remember LSN from >> the first read, because we might actually read [B1,A2] on the first try, >> and then [B1,B2] or [B1,C2] on the retry. 
(Actually, the page may be >> torn in various other ways, not necessarily at the 4kB boundary - it >> might be torn right after the LSN, for example). > > I'd prefer to come up with a plan where we don't just give up once we > see a new LSN, if possible. If I run a modified pg_verify_checksums > which skips on newer pages in a tight benchmark, basically everything > gets skipped as checkpoints don't happen often enough. > But in that case the checksums are verified when reading the buffer into shared buffers, it's not like we don't notice the checksum error at all. We are interested in the pages that have not been read/written for an extended period time. So I think this is not a problem. > So how about we do check every page, but if one fails on retry, and the > LSN is newer than the checkpoint, we then skip it? Is that logic sound? > Hmmm, maybe. > In any case, if we decide we really should skip the page if it is newer > than the checkpoint, I think it makes sense to track those skipped pages > and print their number out at the end, if there are any. > I agree it might be useful to know how many pages were skipped, and how many actually passed the checksum check. >> FWIW I also don't understand the purpose of pg_sleep(), it does not seem >> to protect against anything, really. > > Well, I've noticed that without it I get sporadic checksum failures on > reread, so I've added it to make them go away. It was certainly a > phenomenological decision that I am happy to trade for a better one. > My guess is this happened because both the read and re-read completed during the same write. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/17/2018 04:46 PM, Stephen Frost wrote: > > * Michael Banck (michael.banck@credativ.de) wrote: > >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: > >>> Obviously, pg_verify_checksums can't do that easily because it's > >>> supposed to run from outside the database instance. > >> > >> It reads pg_control anyway, so couldn't we just take > >> ControlFile->checkPoint? > >> > >> Other than that, basebackup.c seems to only look at pages which haven't > >> been changed since the backup starting checkpoint (see above if > >> statement). That's reasonable for backups, but is it just as reasonable > >> for online verification? > > > > Right, basebackup doesn't need to look at other pages. > > > >>> The pg_verify_checksum apparently ignores this skip logic, because on > >>> the retry it simply re-reads the page again, verifies the checksum and > >>> reports an error. Which is broken, because the newly read page might be > >>> torn again due to a concurrent write. > >> > >> Well, ok. > > > > The newly read page will have an updated LSN though then on the re-read, > > in which case basebackup can know that what happened was a rewrite of > > the page and it no longer has to care about the page and can skip it. > > > > I haven't looked, but if basebackup isn't checking the LSN again for the > > newly read page then that'd be broken, but I believe it does (at least, > > that's the algorithm we came up with for pgBackRest, and I know David > > shared that when the basebackup code was being written). > > Yes, basebackup does check the LSN on re-read, and skips the page if it > changed on re-read (because it eliminates the consistency guarantees > provided by the checkpoint). Ok, good, though I'm not sure what you mean by 'eliminates the consistency guarantees provided by the checkpoint'. The point is that the page will be in the WAL and the WAL will be replayed during the restore of the backup. > >> So how about we do check every page, but if one fails on retry, and the > >> LSN is newer than the checkpoint, we then skip it? Is that logic sound? > > > > I thought that's what basebackup did- if it doesn't do that today, then > > it really should. > > The crucial distinction here is that the trick is not in comparing LSNs > from the two page reads, but comparing it to the checkpoint LSN. If it's > greater, the page may be torn or broken, and there's no way to know > which case it is - so basebackup simply skips it. Sure, because we don't care about it any longer- that page isn't interesting because the WAL will replay over it. IIRC it actually goes something like: check the checksum, if it failed then check if the LSN is greater than the checkpoint (of the backup start..), if not, then re-read, if the LSN is now newer than the checkpoint then skip, if the LSN is the same then throw an error. > >> In any case, if we decide we really should skip the page if it is newer > >> than the checkpoint, I think it makes sense to track those skipped pages > >> and print their number out at the end, if there are any. > > > > Not sure what the point of this is. If we wanted to really do something > > to cross-check here, we'd track the pages that were skipped and then > > look through the WAL to make sure that they're there. That's something > > we've talked about doing with pgBackRest, but don't currently. > > I agree simply printing the page numbers seems rather useless. 
What we > could do is remember which pages we skipped and then try checking them > after another checkpoint. Or something like that. I'm still not sure I'm seeing the point of that. They're still going to almost certainly be in the kernel cache. The reason for checking against the WAL would be to detect errors in PG where we aren't putting a page into the WAL when it really should be, or something similar, which seems like it at least could be useful. Maybe to put it another way- there's very little point in checking the checksum of a page which we know must be re-written during recovery to get to a consistent point. I don't think it hurts in the general case, but I wouldn't write a lot of code which then needs to be tested to handle it. I also don't think that we really need to make pg_verify_checksum spend lots of extra cycles trying to verify that *every* page had its checksum validated when we know that lots of pages are going to be in memory marked dirty and our checking of them will be ultimately pointless as they'll either be written out by the checkpointer or some other process, or we'll replay them from the WAL if we crash. > >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem > >>> to protect against anything, really. > >> > >> Well, I've noticed that without it I get sporadic checksum failures on > >> reread, so I've added it to make them go away. It was certainly a > >> phenomenological decision that I am happy to trade for a better one. > > > > That then sounds like we really aren't re-checking the LSN, and we > > really should be, to avoid getting these sporadic checksum failures on > > reread.. > > Again, it's not enough to check the LSN against the preceding read. We > need a checkpoint LSN or something like that. I actually tend to disagree with you that, for this purpose, it's actually necessary to check against the checkpoint LSN- if the LSN changed and everything is operating correctly then the new LSN must be more recent than the last checkpoint location or things are broken badly. Now, that said, I do think it's a good *idea* to check against the checkpoint LSN (presuming this is for online checking of checksums- for basebackup, we could just check against the backup-start LSN as anything after that point will be rewritten by WAL anyway). The reason that I think it's a good idea to check against the checkpoint LSN is that we'd want to throw a big warning if the kernel is just feeding us random garbage on reads and only finding a difference between two reads isn't really doing any kind of validation, whereas checking against the checkpoint-LSN would at least give us some idea that the value being read isn't completely ridiculous. When it comes to if the pg_sleep() is necessary or not, I have to admit to being unsure about that.. I could see how it might be but it seems a bit surprising- I'd probably want to see exactly what the page was at the time of the failure and at the time of the second (no-sleep) re-read and then after a delay and convince myself that it was just an unlucky case of being scheduled in twice to read that page before the process writing it out got a chance to finish the write. Thanks! Stephen
Attachment
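Putting the pieces of this subthread together, the retry/skip decision being discussed looks roughly like the sketch below. It is illustrative only: read_block() is a hypothetical helper, checkpoint_lsn would come from pg_control as suggested upthread, and whether pages newer than the checkpoint should merely be counted as skipped is exactly the open question here.

    /*
     * Sketch only -- assumes the same headers as the earlier sketch plus
     * access/xlogdefs.h; read_block() is a hypothetical helper that
     * (re)reads one block from disk into buf.
     */
    typedef enum
    {
        BLOCK_OK,               /* checksum verified */
        BLOCK_SKIPPED,          /* page newer than checkpoint, skipped */
        BLOCK_FAILED            /* genuine checksum failure */
    } BlockResult;

    static BlockResult
    check_block(int fd, BlockNumber blockno, int segmentno,
                XLogRecPtr checkpoint_lsn)
    {
        char        buf[BLCKSZ];
        PageHeader  header = (PageHeader) buf;
        int         attempt;

        for (attempt = 0; attempt < 2; attempt++)
        {
            if (!read_block(fd, blockno, buf))  /* hypothetical helper */
                return BLOCK_FAILED;

            if (PageIsNew(buf))
                return BLOCK_OK;    /* all-zero pages carry no checksum */

            if (pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE) ==
                header->pd_checksum)
                return BLOCK_OK;

            /*
             * Checksum mismatch.  If the page LSN is newer than the latest
             * checkpoint, the page has been (or is being) rewritten since
             * then, so a torn read cannot be told apart from corruption
             * here; skip it, as basebackup does.  Otherwise re-read once.
             */
            if (PageGetLSN(buf) > checkpoint_lsn)
                return BLOCK_SKIPPED;
        }

        return BLOCK_FAILED;
    }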
Hi, On 09/17/2018 06:42 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 09/17/2018 04:46 PM, Stephen Frost wrote: >>> * Michael Banck (michael.banck@credativ.de) wrote: >>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote: >>>>> Obviously, pg_verify_checksums can't do that easily because it's >>>>> supposed to run from outside the database instance. >>>> >>>> It reads pg_control anyway, so couldn't we just take >>>> ControlFile->checkPoint? >>>> >>>> Other than that, basebackup.c seems to only look at pages which haven't >>>> been changed since the backup starting checkpoint (see above if >>>> statement). That's reasonable for backups, but is it just as reasonable >>>> for online verification? >>> >>> Right, basebackup doesn't need to look at other pages. >>> >>>>> The pg_verify_checksum apparently ignores this skip logic, because on >>>>> the retry it simply re-reads the page again, verifies the checksum and >>>>> reports an error. Which is broken, because the newly read page might be >>>>> torn again due to a concurrent write. >>>> >>>> Well, ok. >>> >>> The newly read page will have an updated LSN though then on the re-read, >>> in which case basebackup can know that what happened was a rewrite of >>> the page and it no longer has to care about the page and can skip it. >>> >>> I haven't looked, but if basebackup isn't checking the LSN again for the >>> newly read page then that'd be broken, but I believe it does (at least, >>> that's the algorithm we came up with for pgBackRest, and I know David >>> shared that when the basebackup code was being written). >> >> Yes, basebackup does check the LSN on re-read, and skips the page if it >> changed on re-read (because it eliminates the consistency guarantees >> provided by the checkpoint). > > Ok, good, though I'm not sure what you mean by 'eliminates the > consistency guarantees provided by the checkpoint'. The point is that > the page will be in the WAL and the WAL will be replayed during the > restore of the backup. > The checkpoint guarantees that the whole page was written and flushed to disk with an LSN before the ckeckpoint LSN. So when you read a page with that LSN, you know the whole write already completed and a read won't return data from before the LSN. Without the checkpoint that's not guaranteed, and simply re-reading the page and rechecking it vs. the first read does not help: 1) write the first 512B of the page (sector), which includes the LSN 2) read the whole page, which will be a mix [new 512B, ... old ... ] 3) the checksum verification fails 4) read the page again (possibly reading a bit more new data) 5) the LSN did not change compared to the first read, yet the checksum still fails >>>> So how about we do check every page, but if one fails on retry, and the >>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound? >>> >>> I thought that's what basebackup did- if it doesn't do that today, then >>> it really should. >> >> The crucial distinction here is that the trick is not in comparing LSNs >> from the two page reads, but comparing it to the checkpoint LSN. If it's >> greater, the page may be torn or broken, and there's no way to know >> which case it is - so basebackup simply skips it. > > Sure, because we don't care about it any longer- that page isn't > interesting because the WAL will replay over it. 
IIRC it actually goes > something like: check the checksum, if it failed then check if the LSN > is greater than the checkpoint (of the backup start..), if not, then > re-read, if the LSN is now newer than the checkpoint then skip, if the > LSN is the same then throw an error. > Nope, we only verify the checksum if it's LSN precedes the checkpoint: https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 >>>> In any case, if we decide we really should skip the page if it is newer >>>> than the checkpoint, I think it makes sense to track those skipped pages >>>> and print their number out at the end, if there are any. >>> >>> Not sure what the point of this is. If we wanted to really do something >>> to cross-check here, we'd track the pages that were skipped and then >>> look through the WAL to make sure that they're there. That's something >>> we've talked about doing with pgBackRest, but don't currently. >> >> I agree simply printing the page numbers seems rather useless. What we >> could do is remember which pages we skipped and then try checking them >> after another checkpoint. Or something like that. > > I'm still not sure I'm seeing the point of that. They're still going to > almost certainly be in the kernel cache. The reason for checking > against the WAL would be to detect errors in PG where we aren't putting > a page into the WAL when it really should be, or something similar, > which seems like it at least could be useful. > > Maybe to put it another way- there's very little point in checking the > checksum of a page which we know must be re-written during recovery to > get to a consistent point. I don't think it hurts in the general case, > but I wouldn't write a lot of code which then needs to be tested to > handle it. I also don't think that we really need to make > pg_verify_checksum spend lots of extra cycles trying to verify that > *every* page had its checksum validated when we know that lots of pages > are going to be in memory marked dirty and our checking of them will be > ultimately pointless as they'll either be written out by the > checkpointer or some other process, or we'll replay them from the WAL if > we crash. > Yeah, I agree. >>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem >>>>> to protect against anything, really. >>>> >>>> Well, I've noticed that without it I get sporadic checksum failures on >>>> reread, so I've added it to make them go away. It was certainly a >>>> phenomenological decision that I am happy to trade for a better one. >>> >>> That then sounds like we really aren't re-checking the LSN, and we >>> really should be, to avoid getting these sporadic checksum failures on >>> reread.. >> >> Again, it's not enough to check the LSN against the preceding read. We >> need a checkpoint LSN or something like that. > > I actually tend to disagree with you that, for this purpose, it's > actually necessary to check against the checkpoint LSN- if the LSN > changed and everything is operating correctly then the new LSN must be > more recent than the last checkpoint location or things are broken > badly. > I don't follow. Are you suggesting we don't need the checkpoint LSN? I'm pretty sure that's not the case. The thing is - the LSN may not change between the two reads, but that's not a guarantee the page was not torn. The example I posted earlier in this message illustrates that. 
> Now, that said, I do think it's a good *idea* to check against the > checkpoint LSN (presuming this is for online checking of checksums- for > basebackup, we could just check against the backup-start LSN as anything > after that point will be rewritten by WAL anyway). The reason that I > think it's a good idea to check against the checkpoint LSN is that we'd > want to throw a big warning if the kernel is just feeding us random > garbage on reads and only finding a difference between two reads isn't > really doing any kind of validation, whereas checking against the > checkpoint-LSN would at least give us some idea that the value being > read isn't completely ridiculous. > > When it comes to if the pg_sleep() is necessary or not, I have to admit > to being unsure about that.. I could see how it might be but it seems a > bit surprising- I'd probably want to see exactly what the page was at > the time of the failure and at the time of the second (no-sleep) re-read > and then after a delay and convince myself that it was just an unlucky > case of being scheduled in twice to read that page before the process > writing it out got a chance to finish the write. > I think the pg_sleep() is a pretty strong sign there's something broken. At the very least, it's likely to misbehave on machines with different timings, machines under memory and/or memory pressure, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/17/2018 06:42 PM, Stephen Frost wrote: > > Ok, good, though I'm not sure what you mean by 'eliminates the > > consistency guarantees provided by the checkpoint'. The point is that > > the page will be in the WAL and the WAL will be replayed during the > > restore of the backup. > > The checkpoint guarantees that the whole page was written and flushed to > disk with an LSN before the ckeckpoint LSN. So when you read a page with > that LSN, you know the whole write already completed and a read won't > return data from before the LSN. Well, you know that the first part was written out at some prior point, but you could end up reading the first part of a page with an older LSN while also reading the second part with new data. > Without the checkpoint that's not guaranteed, and simply re-reading the > page and rechecking it vs. the first read does not help: > > 1) write the first 512B of the page (sector), which includes the LSN > > 2) read the whole page, which will be a mix [new 512B, ... old ... ] > > 3) the checksum verification fails > > 4) read the page again (possibly reading a bit more new data) > > 5) the LSN did not change compared to the first read, yet the checksum > still fails So, I agree with all of the above though I've found it to be extremely rare to get a single read which you've managed to catch part-way through a write, getting multiple of them over a period of time strikes me as even more unlikely. Still, if we can come up with a solution to solve all of this, great, but I'm not sure that I'm hearing one. > > Sure, because we don't care about it any longer- that page isn't > > interesting because the WAL will replay over it. IIRC it actually goes > > something like: check the checksum, if it failed then check if the LSN > > is greater than the checkpoint (of the backup start..), if not, then > > re-read, if the LSN is now newer than the checkpoint then skip, if the > > LSN is the same then throw an error. > > Nope, we only verify the checksum if it's LSN precedes the checkpoint: > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 That seems like it's leaving something on the table, but, to be fair, we know that all of those pages should be rewritten by WAL anyway so they aren't all that interesting to us, particularly in the basebackup case. > > I actually tend to disagree with you that, for this purpose, it's > > actually necessary to check against the checkpoint LSN- if the LSN > > changed and everything is operating correctly then the new LSN must be > > more recent than the last checkpoint location or things are broken > > badly. > > I don't follow. Are you suggesting we don't need the checkpoint LSN? > > I'm pretty sure that's not the case. The thing is - the LSN may not > change between the two reads, but that's not a guarantee the page was > not torn. The example I posted earlier in this message illustrates that. I agree that there's some risk there, but it's certainly much less likely. > > Now, that said, I do think it's a good *idea* to check against the > > checkpoint LSN (presuming this is for online checking of checksums- for > > basebackup, we could just check against the backup-start LSN as anything > > after that point will be rewritten by WAL anyway). 
The reason that I > > think it's a good idea to check against the checkpoint LSN is that we'd > > want to throw a big warning if the kernel is just feeding us random > > garbage on reads and only finding a difference between two reads isn't > > really doing any kind of validation, whereas checking against the > > checkpoint-LSN would at least give us some idea that the value being > > read isn't completely ridiculous. > > > > When it comes to if the pg_sleep() is necessary or not, I have to admit > > to being unsure about that.. I could see how it might be but it seems a > > bit surprising- I'd probably want to see exactly what the page was at > > the time of the failure and at the time of the second (no-sleep) re-read > > and then after a delay and convince myself that it was just an unlucky > > case of being scheduled in twice to read that page before the process > > writing it out got a chance to finish the write. > > I think the pg_sleep() is a pretty strong sign there's something broken. > At the very least, it's likely to misbehave on machines with different > timings, machines under memory and/or memory pressure, etc. If we assume that what you've outlined above is a serious enough issue that we have to address it, and do so without a pg_sleep(), then I think we have to bake into this a way for the process to check with PG as to what the page's current LSN is, in shared buffers, because that's the only place where we've got the locking required to ensure that we don't end up with a read of a partially written page, and I'm really not entirely convinced that we need to go to that level. It'd certainly add a huge amount of additional complexity for what appears to be a quite unlikely gain. I'll chat w/ David shortly about this again though and get his thoughts on it. This is certainly an area we've spent time thinking about but are obviously also open to finding a better solution. Thanks! Stephen
Attachment
On 09/17/2018 07:11 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 09/17/2018 06:42 PM, Stephen Frost wrote: >>> Ok, good, though I'm not sure what you mean by 'eliminates the >>> consistency guarantees provided by the checkpoint'. The point is that >>> the page will be in the WAL and the WAL will be replayed during the >>> restore of the backup. >> >> The checkpoint guarantees that the whole page was written and flushed to >> disk with an LSN before the ckeckpoint LSN. So when you read a page with >> that LSN, you know the whole write already completed and a read won't >> return data from before the LSN. > > Well, you know that the first part was written out at some prior point, > but you could end up reading the first part of a page with an older LSN > while also reading the second part with new data. > Doesn't the checkpoint fsync pretty much guarantee this can't happen? >> Without the checkpoint that's not guaranteed, and simply re-reading the >> page and rechecking it vs. the first read does not help: >> >> 1) write the first 512B of the page (sector), which includes the LSN >> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ] >> >> 3) the checksum verification fails >> >> 4) read the page again (possibly reading a bit more new data) >> >> 5) the LSN did not change compared to the first read, yet the checksum >> still fails > > So, I agree with all of the above though I've found it to be extremely > rare to get a single read which you've managed to catch part-way through > a write, getting multiple of them over a period of time strikes me as > even more unlikely. Still, if we can come up with a solution to solve > all of this, great, but I'm not sure that I'm hearing one. > I don't recall claiming catching many such torn pages - I'm sure it's not very common in most workloads. But I suspect constructing workloads hitting them regularly is not very difficult either (something with a lot of churn in shared buffers should do the trick). >>> Sure, because we don't care about it any longer- that page isn't >>> interesting because the WAL will replay over it. IIRC it actually goes >>> something like: check the checksum, if it failed then check if the LSN >>> is greater than the checkpoint (of the backup start..), if not, then >>> re-read, if the LSN is now newer than the checkpoint then skip, if the >>> LSN is the same then throw an error. >> >> Nope, we only verify the checksum if it's LSN precedes the checkpoint: >> >> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 > > That seems like it's leaving something on the table, but, to be fair, we > know that all of those pages should be rewritten by WAL anyway so they > aren't all that interesting to us, particularly in the basebackup case. > Yep. >>> I actually tend to disagree with you that, for this purpose, it's >>> actually necessary to check against the checkpoint LSN- if the LSN >>> changed and everything is operating correctly then the new LSN must be >>> more recent than the last checkpoint location or things are broken >>> badly. >> >> I don't follow. Are you suggesting we don't need the checkpoint LSN? >> >> I'm pretty sure that's not the case. The thing is - the LSN may not >> change between the two reads, but that's not a guarantee the page was >> not torn. The example I posted earlier in this message illustrates that. > > I agree that there's some risk there, but it's certainly much less > likely. > Well. 
If we're going to report a checksum failure, we better be sure it actually is a broken page. I don't want users to start chasing bogus data corruption issues. >>> Now, that said, I do think it's a good *idea* to check against the >>> checkpoint LSN (presuming this is for online checking of checksums- for >>> basebackup, we could just check against the backup-start LSN as anything >>> after that point will be rewritten by WAL anyway). The reason that I >>> think it's a good idea to check against the checkpoint LSN is that we'd >>> want to throw a big warning if the kernel is just feeding us random >>> garbage on reads and only finding a difference between two reads isn't >>> really doing any kind of validation, whereas checking against the >>> checkpoint-LSN would at least give us some idea that the value being >>> read isn't completely ridiculous. >>> >>> When it comes to if the pg_sleep() is necessary or not, I have to admit >>> to being unsure about that.. I could see how it might be but it seems a >>> bit surprising- I'd probably want to see exactly what the page was at >>> the time of the failure and at the time of the second (no-sleep) re-read >>> and then after a delay and convince myself that it was just an unlucky >>> case of being scheduled in twice to read that page before the process >>> writing it out got a chance to finish the write. >> >> I think the pg_sleep() is a pretty strong sign there's something broken. >> At the very least, it's likely to misbehave on machines with different >> timings, machines under memory and/or memory pressure, etc. > > If we assume that what you've outlined above is a serious enough issue > that we have to address it, and do so without a pg_sleep(), then I think > we have to bake into this a way for the process to check with PG as to > what the page's current LSN is, in shared buffers, because that's the > only place where we've got the locking required to ensure that we don't > end up with a read of a partially written page, and I'm really not > entirely convinced that we need to go to that level. It'd certainly add > a huge amount of additional complexity for what appears to be a quite > unlikely gain. > > I'll chat w/ David shortly about this again though and get his thoughts > on it. This is certainly an area we've spent time thinking about but > are obviously also open to finding a better solution. > Why not to simply look at the last checkpoint LSN and use that the same way basebackup does? AFAICS that should make the pg_sleep() unnecessary. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'. The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
>
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
Doesn't the checkpoint fsync pretty much guarantee this can't happen?
How? Either it’s possible for the latter half of a page to be updated before the first half (where the LSN lives), or it isn’t. If it’s possible then that LSN could be ancient and it wouldn’t matter.
>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
>
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely. Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).
The question is whether it's possible to catch a torn page where the second half is updated *before* the first half of the page in a read (and then to have that state maintained across subsequent reads). I have some skepticism that this is really possible in the first place, and having an interrupted system call stalled across two more system calls seems terribly unlikely; all of this rests on the assumption that the kernel might write the second half of a write to the kernel cache before the first half.
>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it. IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
>
Yep.
>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
>
> I agree that there's some risk there, but it's certainly much less
> likely.
>
Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.
Yes, I definitely agree that we don’t want to mis-report checksum failures if we can avoid it.
>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway). The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that.. I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory and/or memory pressure, etc.
>
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level. It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
>
> I'll chat w/ David shortly about this again though and get his thoughts
> on it. This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.
Why not to simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
Use that to compare to what? The LSN in the first half of the page could be from well before the checkpoint or even the backup started.
Thanks!
Stephen
Hi, so, trying some intermediate summary here, sorry for (also) top-posting: 1. the basebackup checksum verification logic only checks pages not changed since the checkpoint, which makes sense for the basebackup. 2. However, it would be desirable to go further for pg_verify_checksums and (re-)check all pages. 3. pg_verify_checksums should read the checkpoint LSN on startup and compare the page LSN against it on re-read, and discard pages which have checksum failures but are new. (Maybe it should read new checkpoint LSNs as they come in during its runtime as well? See below). 4. The pg_sleep should go. 5. There seems to be no consensus on whether the number of skipped pages should be summarized at the end. Further comments: Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra: > On 09/17/2018 07:11 PM, Stephen Frost wrote: > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > > > On 09/17/2018 06:42 PM, Stephen Frost wrote: > > > Without the checkpoint that's not guaranteed, and simply re-reading the > > > page and rechecking it vs. the first read does not help: > > > > > > 1) write the first 512B of the page (sector), which includes the LSN > > > > > > 2) read the whole page, which will be a mix [new 512B, ... old ... ] > > > > > > 3) the checksum verification fails > > > > > > 4) read the page again (possibly reading a bit more new data) > > > > > > 5) the LSN did not change compared to the first read, yet the checksum > > > still fails > > > > So, I agree with all of the above though I've found it to be extremely > > rare to get a single read which you've managed to catch part-way through > > a write, getting multiple of them over a period of time strikes me as > > even more unlikely. Still, if we can come up with a solution to solve > > all of this, great, but I'm not sure that I'm hearing one. > > I don't recall claiming catching many such torn pages - I'm sure it's > not very common in most workloads. But I suspect constructing workloads > hitting them regularly is not very difficult either (something with a > lot of churn in shared buffers should do the trick). > > > > > Sure, because we don't care about it any longer- that page isn't > > > > interesting because the WAL will replay over it. IIRC it actually goes > > > > something like: check the checksum, if it failed then check if the LSN > > > > is greater than the checkpoint (of the backup start..), if not, then > > > > re-read, if the LSN is now newer than the checkpoint then skip, if the > > > > LSN is the same then throw an error. > > > > > > Nope, we only verify the checksum if it's LSN precedes the checkpoint: > > > > > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454 > > > > That seems like it's leaving something on the table, but, to be fair, we > > know that all of those pages should be rewritten by WAL anyway so they > > aren't all that interesting to us, particularly in the basebackup case. > > Yep. Right, see point 1 above. > > > > I actually tend to disagree with you that, for this purpose, it's > > > > actually necessary to check against the checkpoint LSN- if the LSN > > > > changed and everything is operating correctly then the new LSN must be > > > > more recent than the last checkpoint location or things are broken > > > > badly. > > > > > > I don't follow. Are you suggesting we don't need the checkpoint LSN? > > > > > > I'm pretty sure that's not the case. 
The thing is - the LSN may not > > > change between the two reads, but that's not a guarantee the page was > > > not torn. The example I posted earlier in this message illustrates that. > > > > I agree that there's some risk there, but it's certainly much less > > likely. > > Well. If we're going to report a checksum failure, we better be sure it > actually is a broken page. I don't want users to start chasing bogus > data corruption issues. I agree. > > > > Now, that said, I do think it's a good *idea* to check against the > > > > checkpoint LSN (presuming this is for online checking of checksums- for > > > > basebackup, we could just check against the backup-start LSN as anything > > > > after that point will be rewritten by WAL anyway). The reason that I > > > > think it's a good idea to check against the checkpoint LSN is that we'd > > > > want to throw a big warning if the kernel is just feeding us random > > > > garbage on reads and only finding a difference between two reads isn't > > > > really doing any kind of validation, whereas checking against the > > > > checkpoint-LSN would at least give us some idea that the value being > > > > read isn't completely ridiculous. Are you suggesting here that we always check against the current checkpoint, or is checking against the checkpoint that we saw at startup enough? I think re-reading pg_control all the time might be more errorprone that what we could get from this, so I would prefer not to do this. > > > > When it comes to if the pg_sleep() is necessary or not, I have to admit > > > > to being unsure about that.. I could see how it might be but it seems a > > > > bit surprising- I'd probably want to see exactly what the page was at > > > > the time of the failure and at the time of the second (no-sleep) re-read > > > > and then after a delay and convince myself that it was just an unlucky > > > > case of being scheduled in twice to read that page before the process > > > > writing it out got a chance to finish the write. > > > > > > I think the pg_sleep() is a pretty strong sign there's something broken. > > > At the very least, it's likely to misbehave on machines with different > > > timings, machines under memory and/or memory pressure, etc. I swapped out the pg_sleep earlier today for the check-against- checkpoint-LSN-on-reread, and that seems to work just as fine, at least in the tests I ran. > > If we assume that what you've outlined above is a serious enough issue > > that we have to address it, and do so without a pg_sleep(), then I think > > we have to bake into this a way for the process to check with PG as to > > what the page's current LSN is, in shared buffers, because that's the > > only place where we've got the locking required to ensure that we don't > > end up with a read of a partially written page, and I'm really not > > entirely convinced that we need to go to that level. It'd certainly add > > a huge amount of additional complexity for what appears to be a quite > > unlikely gain. > > > > I'll chat w/ David shortly about this again though and get his thoughts > > on it. This is certainly an area we've spent time thinking about but > > are obviously also open to finding a better solution. > > Why not to simply look at the last checkpoint LSN and use that the same > way basebackup does? AFAICS that should make the pg_sleep() unnecessary. Right. 
Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden
Bestimmungen: https://www.credativ.de/datenschutz
Greetings,
On Mon, Sep 17, 2018 at 13:38 Michael Banck <michael.banck@credativ.de> wrote:
so, trying some intermediate summary here, sorry for (also) top-posting:
1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup.
Right. I’m tending towards the idea that this also be adopted for pg_verify_checksums.
2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.
Maybe. I’m not entirely convinced that it’s all that useful.
3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).
I’m not sure that we really need to but I’m not against it either- but in that case you’re definitely going to see checksum failures on torn pages.
4. The pg_sleep should go.
I know that pgbackrest does not have a sleep currently and we’ve not yet seen or been able to reproduce this case where, on a reread, we still see an older LSN, but we check the LSN first also. If it’s possible that the LSN still hasn’t changed on the reread then maybe we do need to have a sleep to force ourselves off CPU to allow the other process to finish writing, or maybe finish the file and come back around to these pages later, but we have yet to see this behavior in the wild anywhere, nor have we been able to reproduce it.
5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.
I agree with printing the number of skipped pages, that does seem like a nice to have. I don’t know that actually printing the pages themselves is all that useful though.
Further comments:
Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > >
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > >
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > >
> > > 3) the checksum verification fails
> > >
> > > 4) read the page again (possibly reading a bit more new data)
> > >
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely. Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it. IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > >
> > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > >
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
>
> Yep.
Right, see point 1 above.
> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > >
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > >
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.
I agree.
> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway). The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.
Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we could get from this, so I would prefer not to do
this.
I don’t follow why rereading pg_control would be error-prone. That said, I don’t have a particularly strong opinion either way on this.
> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that.. I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > >
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under memory and/or memory pressure, etc.
I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as fine, at least
in the tests I ran.
Ok, this sounds like you were probably seeing normal forward torn pages, and we have certainly seen that before.
> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level. It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it. This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
> Why not to simply look at the last checkpoint LSN and use that the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
Right.
This is fine if you know the kernel will always write the first half of the page first, or you accept that a reread of a page which isn’t valid will always result in seeing a completely updated page.
We’ve made the assumption that a reread on a failure where the LSN on the first read was older than the backup-start LSN will give us an updated first half of the page which we then check the LSN of, but we have yet to prove that this is actually possible.
Thanks!
Stephen
On 09/17/2018 07:35 PM, Stephen Frost wrote: > Greetings, > > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > On 09/17/2018 07:11 PM, Stephen Frost wrote: > > Greetings, > > > > * Tomas Vondra (tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com>) wrote: > >> On 09/17/2018 06:42 PM, Stephen Frost wrote: > >>> Ok, good, though I'm not sure what you mean by 'eliminates the > >>> consistency guarantees provided by the checkpoint'. The point > is that > >>> the page will be in the WAL and the WAL will be replayed during the > >>> restore of the backup. > >> > >> The checkpoint guarantees that the whole page was written and > flushed to > >> disk with an LSN before the ckeckpoint LSN. So when you read a > page with > >> that LSN, you know the whole write already completed and a read won't > >> return data from before the LSN. > > > > Well, you know that the first part was written out at some prior > point, > > but you could end up reading the first part of a page with an > older LSN > > while also reading the second part with new data. > > > > Doesn't the checkpoint fsync pretty much guarantee this can't happen? > > > How? Either it’s possible for the latter half of a page to be updated > before the first half (where the LSN lives), or it isn’t. If it’s > possible then that LSN could be ancient and it wouldn’t matter. > I'm not sure I understand what you're saying here. It is not about the latter page to be updated before the first half. I don't think that's quite possible, because write() into page cache does in fact write the data sequentially. The problem is that the write is not atomic, and AFAIK it happens in sectors (which are either 512B or 4K these days). And it may arbitrarily interleave with reads. So you may do write(8k), but it actually happens in 512B chunks and a concurrent read may observe some mix of those. But the trick is that if the read sees the effect of the write somewhere in the middle of the page, the next read is guaranteed to see all the preceding new data. Without the checkpoint we risk seeing the same write() both in read and re-read, just in a different stage - so the LSN would not change, making the check futile. But by waiting for the checkpoint we know that the original write is no longer in progress, so if we saw a partial write we're guaranteed to see a new LSN on re-read. This is what I mean by the checkpoint / fsync guarantee. > >> Without the checkpoint that's not guaranteed, and simply > re-reading the > >> page and rechecking it vs. the first read does not help: > >> > >> 1) write the first 512B of the page (sector), which includes the LSN > >> > >> 2) read the whole page, which will be a mix [new 512B, ... old ... ] > >> > >> 3) the checksum verification fails > >> > >> 4) read the page again (possibly reading a bit more new data) > >> > >> 5) the LSN did not change compared to the first read, yet the > checksum > >> still fails > > > > So, I agree with all of the above though I've found it to be extremely > > rare to get a single read which you've managed to catch part-way > through > > a write, getting multiple of them over a period of time strikes me as > > even more unlikely. Still, if we can come up with a solution to solve > > all of this, great, but I'm not sure that I'm hearing one. > > > I don't recall claiming catching many such torn pages - I'm sure it's > not very common in most workloads. 
But I suspect constructing workloads > hitting them regularly is not very difficult either (something with a > lot of churn in shared buffers should do the trick). > > > The question is if it’s possible to catch a torn page where the second > half is updated *before* the first half of the page in a read (and then > in subsequent reads having that state be maintained). I have some > skepticism that it’s really possible to happen in the first place but > having an interrupted system call be stalled across two more system > calls just seems terribly unlikely, and this is all based on the > assumption that the kernel might write the second half of a write before > the first to the kernel cache in the first place. > Yes, if that was possible, the explanation about the checkpoint fsync guarantee would be bogus, obviously. I've spent quite a bit of time looking into how write() is handled, and I believe seeing only the second half is not possible. You may observe a page torn in various ways (not necessarily in half), e.g. [old,new,old] but then the re-read you should be guaranteed to see new data up until the last "new" chunk: [new,new,old] At least that's my understanding. I failed to deduce what POSIX says about this, or how it behaves on various OS/filesystems. The one thing I've done was writing a simple stress test that writes a single 8kB page in a loop, reads it concurrently and checks the behavior. And it seems consistent with my understanding. > > >>> Now, that said, I do think it's a good *idea* to check against the > >>> checkpoint LSN (presuming this is for online checking of > checksums- for > >>> basebackup, we could just check against the backup-start LSN as > anything > >>> after that point will be rewritten by WAL anyway). The reason > that I > >>> think it's a good idea to check against the checkpoint LSN is > that we'd > >>> want to throw a big warning if the kernel is just feeding us random > >>> garbage on reads and only finding a difference between two reads > isn't > >>> really doing any kind of validation, whereas checking against the > >>> checkpoint-LSN would at least give us some idea that the value being > >>> read isn't completely ridiculous. > >>> > >>> When it comes to if the pg_sleep() is necessary or not, I have > to admit > >>> to being unsure about that.. I could see how it might be but it > seems a > >>> bit surprising- I'd probably want to see exactly what the page > was at > >>> the time of the failure and at the time of the second (no-sleep) > re-read > >>> and then after a delay and convince myself that it was just an > unlucky > >>> case of being scheduled in twice to read that page before the > process > >>> writing it out got a chance to finish the write. > >> > >> I think the pg_sleep() is a pretty strong sign there's something > broken. > >> At the very least, it's likely to misbehave on machines with > different > >> timings, machines under memory and/or memory pressure, etc. > > > > If we assume that what you've outlined above is a serious enough issue > > that we have to address it, and do so without a pg_sleep(), then I > think > > we have to bake into this a way for the process to check with PG as to > > what the page's current LSN is, in shared buffers, because that's the > > only place where we've got the locking required to ensure that we > don't > > end up with a read of a partially written page, and I'm really not > > entirely convinced that we need to go to that level. 
It'd > certainly add > > a huge amount of additional complexity for what appears to be a quite > > unlikely gain. > > > > I'll chat w/ David shortly about this again though and get his > thoughts > > on it. This is certainly an area we've spent time thinking about but > > are obviously also open to finding a better solution. > > > Why not to simply look at the last checkpoint LSN and use that the same > way basebackup does? AFAICS that should make the pg_sleep() unnecessary. > > > Use that to compare to what? The LSN in the first half of the page > could be from well before the checkpoint or even the backup started. > Not sure I follow. If the LSN in the page header is old, and the checksum check failed, then on re-read we either find a new LSN (in which case we skip the page) or consider this to be a checksum failure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/17/2018 07:35 PM, Stephen Frost wrote: > > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com > > <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > Doesn't the checkpoint fsync pretty much guarantee this can't happen? > > > > How? Either it’s possible for the latter half of a page to be updated > > before the first half (where the LSN lives), or it isn’t. If it’s > > possible then that LSN could be ancient and it wouldn’t matter. > > I'm not sure I understand what you're saying here. > > It is not about the latter page to be updated before the first half. I > don't think that's quite possible, because write() into page cache does > in fact write the data sequentially. Well, maybe 'updated before' wasn't quite the right way to talk about it, but consider if a read(8K) gets only half-way through the copy before having to go do something else and by the time it gets back, a write has come in and rewritten the page, such that the read(8K) returns half-old and half-new data. > The problem is that the write is not atomic, and AFAIK it happens in > sectors (which are either 512B or 4K these days). And it may arbitrarily > interleave with reads. Yes, of course the write isn't atomic, that's clear. > So you may do write(8k), but it actually happens in 512B chunks and a > concurrent read may observe some mix of those. Right, I'm not sure that we really need to worry about sub-4K writes though I suppose they're technically possible, but it doesn't much matter in this case since the LSN is early on in the page, of course. > But the trick is that if the read sees the effect of the write somewhere > in the middle of the page, the next read is guaranteed to see all the > preceding new data. If that's guaranteed then we can just check the LSN and be done. > Without the checkpoint we risk seeing the same write() both in read and > re-read, just in a different stage - so the LSN would not change, making > the check futile. This is the part that isn't making much sense to me. If we are guaranteed that writes into the kernel cache are always in order and always at least 512B in size, then if we check the LSN first and discover it's "old", and then read the rest of the page and calculate the checksum, discover it's a bad checksum, and then go back and re-read the page then we *must* see that the LSN has changed OR conclude that the checksum is invalidated. The reason this can happen in the first place is that our 8K read might only get half-way done before getting scheduled off and a 8K write happened on the page before our read(8K) gets back to finishing the read, but if what you're saying is true, then we can't ever have a case where such a thing would happen and a re-read would still see the "old" LSN. If we check the LSN first and discover it's "new" (as in, more recent than our last checkpoint, or the checkpoint where the backup started) then, sure, there's going to be a risk that the page is currently being written right that moment and isn't yet completely valid. The problem that we aren't solving for is if, somehow, we do a read(8K) and get the first half/second half mixup and then on a subsequent read(8K) we see that *again*, implying that somehow the kernel's copy has the latter-half of the page updated consistently but not the first half. That's a problem that I haven't got a solution to today. 
I'd love to have a guarantee that it's not possible- we've certainly never seen it but it's been a concern and I thought Michael was suggesting he'd seen that, but it sounds like there wasn't a check on the LSN in the first read, in which case it could have just been a 'regular' torn page case. > But by waiting for the checkpoint we know that the original write is no > longer in progress, so if we saw a partial write we're guaranteed to see > a new LSN on re-read. > > This is what I mean by the checkpoint / fsync guarantee. I don't think any of this really has anythign to do with either fsync being called or with the actual checkpointing process (except to the extent that the checkpointer is the thing doing the writing, and that we should be checking the LSN against the LSN of the last checkpoint when we started, or against the start of the backup LSN if we're talking about doing a backup). > > The question is if it’s possible to catch a torn page where the second > > half is updated *before* the first half of the page in a read (and then > > in subsequent reads having that state be maintained). I have some > > skepticism that it’s really possible to happen in the first place but > > having an interrupted system call be stalled across two more system > > calls just seems terribly unlikely, and this is all based on the > > assumption that the kernel might write the second half of a write before > > the first to the kernel cache in the first place. > > Yes, if that was possible, the explanation about the checkpoint fsync > guarantee would be bogus, obviously. > > I've spent quite a bit of time looking into how write() is handled, and > I believe seeing only the second half is not possible. You may observe a > page torn in various ways (not necessarily in half), e.g. > > [old,new,old] > > but then the re-read you should be guaranteed to see new data up until > the last "new" chunk: > > [new,new,old] > > At least that's my understanding. I failed to deduce what POSIX says > about this, or how it behaves on various OS/filesystems. > > The one thing I've done was writing a simple stress test that writes a > single 8kB page in a loop, reads it concurrently and checks the > behavior. And it seems consistent with my understanding. Good. > > Use that to compare to what? The LSN in the first half of the page > > could be from well before the checkpoint or even the backup started. > > Not sure I follow. If the LSN in the page header is old, and the > checksum check failed, then on re-read we either find a new LSN (in > which case we skip the page) or consider this to be a checksum failure. Right, I'm in agreement with doing that and it's what is done in pgbasebackup and pgBackRest. Thanks! Stephen
Attachment
On 09/18/2018 12:01 AM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 09/17/2018 07:35 PM, Stephen Frost wrote: >>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com >>> <mailto:tomas.vondra@2ndquadrant.com>> wrote: >>> Doesn't the checkpoint fsync pretty much guarantee this can't happen? >>> >>> How? Either it’s possible for the latter half of a page to be updated >>> before the first half (where the LSN lives), or it isn’t. If it’s >>> possible then that LSN could be ancient and it wouldn’t matter. >> >> I'm not sure I understand what you're saying here. >> >> It is not about the latter page to be updated before the first half. I >> don't think that's quite possible, because write() into page cache does >> in fact write the data sequentially. > > Well, maybe 'updated before' wasn't quite the right way to talk about > it, but consider if a read(8K) gets only half-way through the copy > before having to go do something else and by the time it gets back, a > write has come in and rewritten the page, such that the read(8K) > returns half-old and half-new data. > >> The problem is that the write is not atomic, and AFAIK it happens in >> sectors (which are either 512B or 4K these days). And it may arbitrarily >> interleave with reads. > > Yes, of course the write isn't atomic, that's clear. > >> So you may do write(8k), but it actually happens in 512B chunks and a >> concurrent read may observe some mix of those. > > Right, I'm not sure that we really need to worry about sub-4K writes > though I suppose they're technically possible, but it doesn't much > matter in this case since the LSN is early on in the page, of course. > >> But the trick is that if the read sees the effect of the write somewhere >> in the middle of the page, the next read is guaranteed to see all the >> preceding new data. > > If that's guaranteed then we can just check the LSN and be done. > What do you mean by "check the LSN"? Compare it to LSN from the first read? You don't know if the first read already saw the new LSN or not (see the next example). >> Without the checkpoint we risk seeing the same write() both in read and >> re-read, just in a different stage - so the LSN would not change, making >> the check futile. > > This is the part that isn't making much sense to me. If we are > guaranteed that writes into the kernel cache are always in order and > always at least 512B in size, then if we check the LSN first and > discover it's "old", and then read the rest of the page and calculate > the checksum, discover it's a bad checksum, and then go back and re-read > the page then we *must* see that the LSN has changed OR conclude that > the checksum is invalidated. > Even if the writes are in order and in 512B chunks, you don't know how they are interleaved with the reads. Let's assume we're doing a write(), which splits the 8kB page into 512B chunks. A concurrent read may observe a random mix of old and new data, depending on timing. So let's say a read sees the first 2kB of data like this: [new, new, new, old, new, old, new, old] OK, the page is obviously torn, checksum fails, and we try reading it again. We should see new data at least until the last 'new' chunk in the first read, so let's say we got this: [new, new, new, new, new, new, new, old] Obviously, this page is also torn (there are old data at the end), but we've read the new data in both cases, which includes the LSN. So the LSN is the same in both cases, and your detection fails. 
Comparing the page LSN to the last checkpoint LSN solves this, because if the LSN is older than the checkpoint LSN, that write must have been completed by now, and so we're not in danger of seeing only incomplete effects of it. And newer write will update the LSN. > The reason this can happen in the first place is that our 8K read might > only get half-way done before getting scheduled off and a 8K write > happened on the page before our read(8K) gets back to finishing the > read, but if what you're saying is true, then we can't ever have a case > where such a thing would happen and a re-read would still see the "old" > LSN. > > If we check the LSN first and discover it's "new" (as in, more recent > than our last checkpoint, or the checkpoint where the backup started) > then, sure, there's going to be a risk that the page is currently being > written right that moment and isn't yet completely valid. > Right. > The problem that we aren't solving for is if, somehow, we do a read(8K) > and get the first half/second half mixup and then on a subsequent > read(8K) we see that *again*, implying that somehow the kernel's copy > has the latter-half of the page updated consistently but not the first > half. That's a problem that I haven't got a solution to today. I'd > love to have a guarantee that it's not possible- we've certainly never > seen it but it's been a concern and I thought Michael was suggesting > he'd seen that, but it sounds like there wasn't a check on the LSN in > the first read, in which case it could have just been a 'regular' torn > page case. > Well, yeah. If that would be possible, we'd be in serious trouble. I've done quite a bit of experimentation with concurrent reads and writes and I have not observed such behavior. Of course, that's hardly a proof it can't happen, and it wouldn't be the first surprise with respect to kernel I/O this year ... >> But by waiting for the checkpoint we know that the original write is no >> longer in progress, so if we saw a partial write we're guaranteed to see >> a new LSN on re-read. >> >> This is what I mean by the checkpoint / fsync guarantee. > > I don't think any of this really has anythign to do with either fsync > being called or with the actual checkpointing process (except to the > extent that the checkpointer is the thing doing the writing, and that we > should be checking the LSN against the LSN of the last checkpoint when > we started, or against the start of the backup LSN if we're talking > about doing a backup). > You're right it's not about the fsync, sorry for the confusion. My point is that using the checkpoint LSN gives us a guarantee that write is no longer in progress, and so we can't see a page torn because of it. And if we see a partial write due to a new write, it's guaranteed to update the page LSN (and we'll notice it). >>> The question is if it’s possible to catch a torn page where the second >>> half is updated *before* the first half of the page in a read (and then >>> in subsequent reads having that state be maintained). I have some >>> skepticism that it’s really possible to happen in the first place but >>> having an interrupted system call be stalled across two more system >>> calls just seems terribly unlikely, and this is all based on the >>> assumption that the kernel might write the second half of a write before >>> the first to the kernel cache in the first place. >> >> Yes, if that was possible, the explanation about the checkpoint fsync >> guarantee would be bogus, obviously. 
>> >> I've spent quite a bit of time looking into how write() is handled, and >> I believe seeing only the second half is not possible. You may observe a >> page torn in various ways (not necessarily in half), e.g. >> >> [old,new,old] >> >> but then the re-read you should be guaranteed to see new data up until >> the last "new" chunk: >> >> [new,new,old] >> >> At least that's my understanding. I failed to deduce what POSIX says >> about this, or how it behaves on various OS/filesystems. >> >> The one thing I've done was writing a simple stress test that writes a >> single 8kB page in a loop, reads it concurrently and checks the >> behavior. And it seems consistent with my understanding. > > Good. > >>> Use that to compare to what? The LSN in the first half of the page >>> could be from well before the checkpoint or even the backup started. >> >> Not sure I follow. If the LSN in the page header is old, and the >> checksum check failed, then on re-read we either find a new LSN (in >> which case we skip the page) or consider this to be a checksum failure. > > Right, I'm in agreement with doing that and it's what is done in > pgbasebackup and pgBackRest. > OK. All I'm saying is pg_verify_checksums should probably do the same thing, i.e. grab checkpoint LSN and roll with that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 09/18/2018 12:01 AM, Stephen Frost wrote: > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > >> On 09/17/2018 07:35 PM, Stephen Frost wrote: > >> But the trick is that if the read sees the effect of the write somewhere > >> in the middle of the page, the next read is guaranteed to see all the > >> preceding new data. > > > > If that's guaranteed then we can just check the LSN and be done. > > What do you mean by "check the LSN"? Compare it to LSN from the first > read? You don't know if the first read already saw the new LSN or not > (see the next example). Hmm, ok, I can see your point there. I've been going back and forth between checking against what the prior LSN was on the page and checking it against an independent source (like the last checkpoint's LSN), but.. [...] > Comparing the page LSN to the last checkpoint LSN solves this, because > if the LSN is older than the checkpoint LSN, that write must have been > completed by now, and so we're not in danger of seeing only incomplete > effects of it. And newer write will update the LSN. Yeah, that makes sense- we need to be looking at something which only gets updated once the write has actually completed, and the last checkpoint's LSN gives us that guarantee. > > The problem that we aren't solving for is if, somehow, we do a read(8K) > > and get the first half/second half mixup and then on a subsequent > > read(8K) we see that *again*, implying that somehow the kernel's copy > > has the latter-half of the page updated consistently but not the first > > half. That's a problem that I haven't got a solution to today. I'd > > love to have a guarantee that it's not possible- we've certainly never > > seen it but it's been a concern and I thought Michael was suggesting > > he'd seen that, but it sounds like there wasn't a check on the LSN in > > the first read, in which case it could have just been a 'regular' torn > > page case. > > Well, yeah. If that would be possible, we'd be in serious trouble. I've > done quite a bit of experimentation with concurrent reads and writes and > I have not observed such behavior. Of course, that's hardly a proof it > can't happen, and it wouldn't be the first surprise with respect to > kernel I/O this year ... I'm glad to hear that you've done a lot of experimentation in this area and haven't seen such strange behavior happen- we've got quite a few people running pgBackRest with checksum-checking and haven't seen it either, but it's always been a bit of a concern. > You're right it's not about the fsync, sorry for the confusion. My point > is that using the checkpoint LSN gives us a guarantee that write is no > longer in progress, and so we can't see a page torn because of it. And > if we see a partial write due to a new write, it's guaranteed to update > the page LSN (and we'll notice it). Right, no worries about the confusion, I hadn't been fully thinking through the LSN bit either and that what we really need is some external confirmation of a write having *completed* (not just started) and that makes a definite difference. > > Right, I'm in agreement with doing that and it's what is done in > > pgbasebackup and pgBackRest. > > OK. All I'm saying is pg_verify_checksums should probably do the same > thing, i.e. grab checkpoint LSN and roll with that. Agreed. Thanks! Stephen
Attachment
Hi,

Am Montag, den 17.09.2018, 14:09 -0400 schrieb Stephen Frost:
> > 5. There seems to be no consensus on whether the number of skipped pages
> > should be summarized at the end.
>
> I agree with printing the number of skipped pages, that does seem like
> a nice to have. I don’t know that actually printing the pages
> themselves is all that useful though.

Oh ok - I never intended to print out the block numbers themselves, just
the final number of skipped blocks in the summary. So I guess that's
fine and I will add that in my branch.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden
Bestimmungen: https://www.credativ.de/datenschutz
Hi.

Am Montag, den 17.09.2018, 20:45 -0400 schrieb Stephen Frost:
> > You're right it's not about the fsync, sorry for the confusion. My point
> > is that using the checkpoint LSN gives us a guarantee that write is no
> > longer in progress, and so we can't see a page torn because of it. And
> > if we see a partial write due to a new write, it's guaranteed to update
> > the page LSN (and we'll notice it).
>
> Right, no worries about the confusion, I hadn't been fully thinking
> through the LSN bit either and that what we really need is some external
> confirmation of a write having *completed* (not just started) and that
> makes a definite difference.
>
> > > Right, I'm in agreement with doing that and it's what is done in
> > > pgbasebackup and pgBackRest.
> >
> > OK. All I'm saying is pg_verify_checksums should probably do the same
> > thing, i.e. grab checkpoint LSN and roll with that.
>
> Agreed.

I've attached the patch I added to my branch to swap out the pg_sleep()
with a check against the checkpoint LSN on a recheck verification
failure. Let me know if there are still issues with it.

I'll send a new patch for the whole online verification feature in a
bit.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden
Bestimmungen: https://www.credativ.de/datenschutz
Attachment
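For readers who do not want to open the attachment: the recheck it
implements boils down to roughly the following shape. This is a
simplified sketch only, not the literal diff; it assumes the existing
per-block scan loop of pg_verify_checksums.c (fd, buf, header, blockno,
segmentno), and checkpoint_lsn, skippedblocks and badblocks are
illustrative names rather than what the patch necessarily uses.

/*
 * Sketch only: on a checksum mismatch, re-read the block once.  If its
 * LSN is newer than the checkpoint LSN read from pg_control at startup,
 * the block was (re)written concurrently and is skipped; otherwise a
 * second mismatch is reported as a real checksum failure.
 */
csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
if (csum != header->pd_checksum)
{
    if (lseek(fd, (off_t) blockno * BLCKSZ, SEEK_SET) < 0 ||
        read(fd, buf, BLCKSZ) != BLCKSZ)
    {
        /* block vanished or is still being extended: skip it */
        skippedblocks++;
    }
    else if (PageGetLSN(buf) > checkpoint_lsn)
    {
        /* rewritten after the checkpoint, WAL replay covers it: skip */
        skippedblocks++;
    }
    else if (pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE) !=
             header->pd_checksum)
    {
        /* old LSN and the checksum still fails: report corruption */
        fprintf(stderr, "%s: checksum verification failed in file \"%s\", block %u\n",
                progname, fn, blockno);
        badblocks++;
    }
}

The point of this shape is that the only case reported as corruption is
a block whose LSN still precedes the checkpoint LSN and whose checksum
fails on the re-read as well.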
Hi,

please find attached version 2 of the patch.

Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
>
> I've tested this in a tight loop (while true; do pg_verify_checksums -D
> data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> done", which I already used to develop the original code in the fork and
> which brought up a few bugs.
>
> I got one checksums verification failure this way, all others were
> caught by the recheck (I've introduced a 500ms delay for the first ten
> failures) like this:
>
> > pg_verify_checksums: checksum verification failed on first attempt in
> > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > expected 5063
> > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > verified ok on recheck

I have now changed this from the pg_sleep() to a check against the
checkpoint LSN as discussed upthread.

> However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> failures like this:
>
> > pg_verify_checksums: short read of block 2644 in file
> > "data1/base/16637/16650", got only 4096 bytes
>
> This is not strictly a verification failure, should we do anything about
> this? In my fork, I am also rechecking on this[3] (and I am happy to
> extend the patch that way), but that makes the code and the patch more
> complicated and I wanted to check the general opinion on this case
> first.

I have added a retry for this as well now, without a pg_sleep() as well.
This catches around 80% of the half-reads, but a few slip through. At
that point we bail out with exit(1), and the user can try again, which I
think is fine?

Alternatively, we could just skip to the next file then and don't make
it count as a checksum failure.

Other changes from V1:

1. Rebased to 422952ee

2. Ignore ENOENT failure during file open and skip to next file

3. Mention total number of skipped blocks during the summary at the end
   of the run

4. Skip files starting with pg_internal.init*


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden
Bestimmungen: https://www.credativ.de/datenschutz
Attachment
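As context for the checkpoint LSN comparison: pg_verify_checksums
already reads pg_control at startup to check that checksums are enabled,
so obtaining the checkpoint LSN is mostly a matter of keeping it around.
A minimal sketch, assuming the current three-argument get_controlfile()
signature and illustrative variable names:

#include "common/controldata_utils.h"

ControlFileData *controlfile;
bool             crc_ok;
XLogRecPtr       checkpoint_lsn;

/* read pg_control once and remember the last checkpoint's LSN */
controlfile = get_controlfile(DataDir, progname, &crc_ok);
if (!crc_ok)
{
    fprintf(stderr, "%s: pg_control CRC value is incorrect\n", progname);
    exit(1);
}
checkpoint_lsn = controlfile->checkPoint;

Blocks whose page LSN is newer than that value can then be skipped on
recheck instead of being reported as failures.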
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> please find attached version 2 of the patch.
>
> Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it. If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure. We could possibly count
it towards the skipped pages, though I'm even on the fence about that.

Thanks!

Stephen
Attachment
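Concretely, the short-read handling being suggested here could look
something like this in the per-file scan loop (a sketch only; fd, buf
and skippedblocks are illustrative names, not necessarily what the
patch uses):

ssize_t r = read(fd, buf, BLCKSZ);

if (r == 0)
    break;              /* real end of file, move on to the next one */
if (r != BLCKSZ)
{
    /*
     * A short read only happens at the current end of a file that is
     * still being extended, so count it as skipped and move on rather
     * than reporting an error.
     */
    skippedblocks++;
    break;
}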
On 9/18/18 11:45 AM, Stephen Frost wrote:
> * Michael Banck (michael.banck@credativ.de) wrote:
>> I have added a retry for this as well now, without a pg_sleep() as well.
>
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine?
>
> No, this is perfectly normal behavior, as is having completely blank
> pages, now that I think about it. If we get a short read then I'd say
> we simply check that we got an EOF and, in that case, we just move on.
>
>> Alternatively, we could just skip to the next file then and don't make
>> it count as a checksum failure.
>
> No, I wouldn't count it as a checksum failure. We could possibly count
> it towards the skipped pages, though I'm even on the fence about that.

+1 for it not being a failure. Personally I'd count it as a skipped
page, since we know the page exists but it can't be verified.

The other option is to wait for the page to stabilize, which doesn't
seem like it would take very long in most cases -- unless you are doing
this test from another host with shared storage. Then I would expect to
see all kinds of interesting torn pages after the last checkpoint.

Regards,
-- 
-David
david@pgmasters.net
Attachment
Hi,

Am Dienstag, den 18.09.2018, 13:52 -0400 schrieb David Steele:
> On 9/18/18 11:45 AM, Stephen Frost wrote:
> > * Michael Banck (michael.banck@credativ.de) wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > No, this is perfectly normal behavior, as is having completely blank
> > pages, now that I think about it. If we get a short read then I'd say
> > we simply check that we got an EOF and, in that case, we just move on.
> >
> > > Alternatively, we could just skip to the next file then and don't make
> > > it count as a checksum failure.
> >
> > No, I wouldn't count it as a checksum failure. We could possibly count
> > it towards the skipped pages, though I'm even on the fence about that.
>
> +1 for it not being a failure. Personally I'd count it as a skipped
> page, since we know the page exists but it can't be verified.
>
> The other option is to wait for the page to stabilize, which doesn't
> seem like it would take very long in most cases -- unless you are doing
> this test from another host with shared storage. Then I would expect to
> see all kinds of interesting torn pages after the last checkpoint.

OK, I'm skipping the block now on first try, as this makes (i) sense and
(ii) simplifies the code (again).

Version 3 is attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden
Bestimmungen: https://www.credativ.de/datenschutz
Attachment
Hallo Michael,

Patch v3 applies cleanly, code compiles and make check is ok, but the
command is probably not tested anywhere, as already mentioned on other
threads.

The patch is missing a documentation update.

There are debatable changes of behavior:

if (errno == ENOENT) return / continue...

For instance, a file disappearing is ok online, but not so if offline.
On the other hand, the probability that a file suddenly disappears while
the server offline looks remote, so reporting such issues does not seem
useful.

However I'm more wary with other continues/skips added. ISTM that
skipping a block because of a read error, or because it is new, or some
other reasons, is not the same thing, so should be counted & reported
differently?

+ if (block_retry == false)

Why not trust boolean operations?

if (!block_retry)

-- 
Fabien.
Hi, Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO: > Patch v3 applies cleanly, code compiles and make check is ok, but the > command is probably not tested anywhere, as already mentioned on other > threads. Right. > The patch is missing a documentation update. I've added that now. I think the only change needed was removing the "server needs to be offline" part? > There are debatable changes of behavior: > > if (errno == ENOENT) return / continue... > > For instance, a file disappearing is ok online, but not so if offline. On > the other hand, the probability that a file suddenly disappears while the > server offline looks remote, so reporting such issues does not seem > useful. > > However I'm more wary with other continues/skips added. ISTM that skipping > a block because of a read error, or because it is new, or some other > reasons, is not the same thing, so should be counted & reported > differently? I think that would complicate things further without a lot of benefit. After all, we are interested in checksum failures, not necessarily read failures etc. so exiting on them (and skip checking possibly large parts of PGDATA) looks undesirable to me. So I have done no changes in this part so far, what do others think about this? > + if (block_retry == false) > > Why not trust boolean operations? > > if (!block_retry) I've changed that as well. Version 4 is attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO: > > There are debatable changes of behavior: > > > > if (errno == ENOENT) return / continue... > > > > For instance, a file disappearing is ok online, but not so if offline. On > > the other hand, the probability that a file suddenly disappears while the > > server offline looks remote, so reporting such issues does not seem > > useful. > > > > However I'm more wary with other continues/skips added. ISTM that skipping > > a block because of a read error, or because it is new, or some other > > reasons, is not the same thing, so should be counted & reported > > differently? > > I think that would complicate things further without a lot of benefit. > > After all, we are interested in checksum failures, not necessarily read > failures etc. so exiting on them (and skip checking possibly large parts > of PGDATA) looks undesirable to me. > > So I have done no changes in this part so far, what do others think > about this? I certainly don't see a lot of point in doing much more than what was discussed previously for 'new' blocks (counting them as skipped and moving on). An actual read() error (that is, a failure on a read() call such as getting back EIO), on the other hand, is something which I'd probably report back to the user immediately and then move on, and perhaps report again at the end. Note that a short read isn't an error and falls under the 'new' blocks discussion above. Thanks! Stephen
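For illustration, the classification Stephen describes could look roughly like the sketch below. None of this is taken from the attached patch: BLCKSZ is hard-coded and the ReadStatus and scan_block_read names are invented for this example. The point is only that a failing read() call is reported as an error, while EOF and partial pages are merely skipped.

#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef enum
{
    READ_FULL_BLOCK,    /* got BLCKSZ bytes: go on to verify the checksum */
    READ_SHORT,         /* EOF or partial page: count as skipped, move on */
    READ_FAILED         /* read() itself failed (e.g. EIO): report, move on */
} ReadStatus;

/* hypothetical helper, not the patch's actual function */
static ReadStatus
scan_block_read(int fd, unsigned int blockno, char *buf)
{
    ssize_t     r = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);

    if (r == BLCKSZ)
        return READ_FULL_BLOCK;
    if (r >= 0)
        return READ_SHORT;

    fprintf(stderr, "could not read block %u: %s\n",
            blockno, strerror(errno));
    return READ_FAILED;
}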
>> The patch is missing a documentation update. > > I've added that now. I think the only change needed was removing the > "server needs to be offline" part? Yes, and also checking that the described behavior correspond to the new version. >> There are debatable changes of behavior: >> >> if (errno == ENOENT) return / continue... >> >> For instance, a file disappearing is ok online, but not so if offline. On >> the other hand, the probability that a file suddenly disappears while the >> server offline looks remote, so reporting such issues does not seem >> useful. >> >> However I'm more wary with other continues/skips added. ISTM that skipping >> a block because of a read error, or because it is new, or some other >> reasons, is not the same thing, so should be counted & reported >> differently? > > I think that would complicate things further without a lot of benefit. > > After all, we are interested in checksum failures, not necessarily read > failures etc. so exiting on them (and skip checking possibly large parts > of PGDATA) looks undesirable to me. Hmmm. I'm really saying that it is debatable, so here is some fuel to the debate: If I run the check command and it cannot do its job, there is a problem which is as bad as a failing checksum. The only safe assumption on a cannot-read block is that the checksum is bad... So ISTM that on on some of the "skipped" errors there should be appropriate report (exit code, final output) that something is amiss. -- Fabien.
Hi, Am Mittwoch, den 26.09.2018, 10:54 -0400 schrieb Stephen Frost: > * Michael Banck (michael.banck@credativ.de) wrote: > > Am Mittwoch, den 26.09.2018, 13:23 +0200 schrieb Fabien COELHO: > > > There are debatable changes of behavior: > > > > > > if (errno == ENOENT) return / continue... > > > > > > For instance, a file disappearing is ok online, but not so if offline. On > > > the other hand, the probability that a file suddenly disappears while the > > > server offline looks remote, so reporting such issues does not seem > > > useful. > > > > > > However I'm more wary with other continues/skips added. ISTM that skipping > > > a block because of a read error, or because it is new, or some other > > > reasons, is not the same thing, so should be counted & reported > > > differently? > > > > I think that would complicate things further without a lot of benefit. > > > > After all, we are interested in checksum failures, not necessarily read > > failures etc. so exiting on them (and skip checking possibly large parts > > of PGDATA) looks undesirable to me. > > > > So I have done no changes in this part so far, what do others think > > about this? > > I certainly don't see a lot of point in doing much more than what was > discussed previously for 'new' blocks (counting them as skipped and > moving on). > > An actual read() error (that is, a failure on a read() call such as > getting back EIO), on the other hand, is something which I'd probably > report back to the user immediately and then move on, and perhaps > report again at the end. > > Note that a short read isn't an error and falls under the 'new' blocks > discussion above. So I've added ENOENT checks when opening or statting files, i.e. EIO would still be reported. The current code in master exits on reads which do not return BLCKSZ, which I've changed to a skip. So that means we now no longer check for read failures (return code < 0) so I have now added a check for that and emit an error message and return. New version 5 attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
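As an illustration of the ENOENT handling described here (the function name is invented and the error handling simplified; as the mail notes, the same tolerance is applied when statting files):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Open a relation segment for scanning.  Under a running server the file
 * can legitimately disappear between readdir() and open() (DROP TABLE,
 * DROP DATABASE), so ENOENT is skipped silently; any other failure, such
 * as EACCES or EIO, is still reported.
 */
static int
open_segment(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0 && errno != ENOENT)
        fprintf(stderr, "could not open file \"%s\": %s\n",
                path, strerror(errno));

    return fd;                  /* caller skips the file when fd < 0 */
}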
Hello Stephen, > I certainly don't see a lot of point in doing much more than what was > discussed previously for 'new' blocks (counting them as skipped and > moving on). Sure. > An actual read() error (that is, a failure on a read() call such as > getting back EIO), on the other hand, is something which I'd probably > report back to the user immediately and then move on, and perhaps > report again at the end. Yep. > Note that a short read isn't an error and falls under the 'new' blocks > discussion above. I'm really unsure that a short read should really be coldly skipped: If the check is offline, then one file is in a very bad state, this is really a panic situation. If the check is online, given that both postgres and the verify command interact with the same OS (?) and at the pg page level, I'm not sure in which situation there could be a partial block, because pg would only send full pages to the OS. -- Fabien.
Greetings, * Fabien COELHO (coelho@cri.ensmp.fr) wrote: > >Note that a short read isn't an error and falls under the 'new' blocks > >discussion above. > > I'm really unsure that a short read should really be coldly skipped: > > If the check is offline, then one file is in a very bad state, this is > really a panic situation. Why? Are we sure that's really something which can't ever happen, even if the database was shutdown with 'immediate'? I don't think it can but that's something to consider. In any case, my comments were specifically thinking about it from an 'online' perspective. > If the check is online, given that both postgres and the verify command > interact with the same OS (?) and at the pg page level, I'm not sure in > which situation there could be a partial block, because pg would only send > full pages to the OS. The OS doesn't operate at the same level that PG does- a single write in PG could get blocked and scheduled off after having only copied half of the 8k that PG sends. This isn't really debatable- we've seen it happen and everything is operating perfectly correctly, it just happens that you were able to get a read() at the same time a write() was happening and that only part of the page had been updated at that point. Thanks! Stephen
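A compile-only sketch of the recheck being discussed, not the patch itself: read_page() and compute_checksum() stand in for the patch's block I/O and pg_checksum_page(), and PageStub stands in for PageHeaderData. Only the control flow is meant to illustrate the idea.

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t    page_lsn;           /* stand-in for pd_lsn */
    uint16_t    stored_checksum;    /* stand-in for pd_checksum */
} PageStub;

extern bool read_page(int fd, unsigned int blockno, PageStub *page);
extern uint16_t compute_checksum(const PageStub *page, unsigned int blockno);

static bool
block_verifies(int fd, unsigned int blockno, uint64_t start_lsn)
{
    PageStub    page;

    if (!read_page(fd, blockno, &page))
        return true;                /* short read: skip, not a failure */

    if (compute_checksum(&page, blockno) == page.stored_checksum)
        return true;

    /* Mismatch: re-read once in case we raced with an in-flight write. */
    if (!read_page(fd, blockno, &page))
        return true;

    if (compute_checksum(&page, blockno) == page.stored_checksum)
        return true;                /* second read saw a consistent page */

    if (page.page_lsn > start_lsn)
        return true;                /* page rewritten since the checkpoint
                                     * the scan started from; WAL replay
                                     * covers it */

    return false;                   /* genuine checksum failure */
}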
Hi, On 09/26/2018 05:15 PM, Michael Banck wrote: > ... > > New version 5 attached. > I've looked at v5, and the retry/recheck logic seems OK to me - I'd still vote to keep it consistent with what pg_basebackup does (i.e. doing the LSN check first, before looking at the checksum), but I don't think it's a bug. I'm not sure about the other issues brought up (ENOENT, short reads). I haven't given it much thought. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, One more thought - when running similar tools on a live system, it's usually a good idea to limit the impact by throttling the throughput. As the verification runs in an independent process it can't reuse the vacuum-like cost limit directly, but perhaps it could do something similar? Like, limit the number of blocks read/second, or so? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote: > One more thought - when running similar tools on a live system, it's > usually a good idea to limit the impact by throttling the throughput. As > the verification runs in an independent process it can't reuse the > vacuum-like cost limit directly, but perhaps it could do something > similar? Like, limit the number of blocks read/second, or so? When it comes to such parameters, not using a number of blocks but throttling with a value in bytes (kB or MB of course) speaks more to the user. The past experience with checkpoint_segments is one example of that. Converting that to a number of blocks internally would definitely make the most sense. +1 for this idea. -- Michael
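If such a limit were added, it could take roughly the shape below, similar in spirit to pg_basebackup's --max-rate: after reading some bytes, sleep just long enough to stay under the configured byte rate. The function and parameter names are made up here; nothing like this is in the current patch.

#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <time.h>

static void
throttle(uint64_t bytes_done, uint64_t max_bytes_per_sec,
         const struct timespec *scan_start)
{
    struct timespec now;
    double      elapsed;
    double      target;

    if (max_bytes_per_sec == 0)
        return;                 /* throttling disabled */

    clock_gettime(CLOCK_MONOTONIC, &now);
    elapsed = (now.tv_sec - scan_start->tv_sec) +
              (now.tv_nsec - scan_start->tv_nsec) / 1e9;
    target = (double) bytes_done / (double) max_bytes_per_sec;

    if (target > elapsed)
    {
        struct timespec pause;

        /* sleep until the elapsed time matches the allowed byte rate */
        pause.tv_sec = (time_t) (target - elapsed);
        pause.tv_nsec = (long) (((target - elapsed) - pause.tv_sec) * 1e9);
        nanosleep(&pause, NULL);
    }
}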
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote: > > One more thought - when running similar tools on a live system, it's > > usually a good idea to limit the impact by throttling the throughput. As > > the verification runs in an independent process it can't reuse the > > vacuum-like cost limit directly, but perhaps it could do something > > similar? Like, limit the number of blocks read/second, or so? > > When it comes to such parameters, not using a number of blocks but > throttling with a value in bytes (kB or MB of course) speaks more to the > user. The past experience with checkpoint_segments is one example of > that. Converting that to a number of blocks internally would definitely > make sense the most sense. +1 for this idea. While I agree this would be a nice additional feature to have, it seems like something which could certainly be added later and doesn't necessairly have to be included in the initial patch. If Michael has time to add that, great, if not, I'd rather have this as-is than not. I do tend to agree with Michael that having the parameter be specified as (or at least able to accept) a byte-based value is a good idea. As another feature idea, having this able to work in parallel across tablespaces would be nice too. I can certainly imagine some point where this is a default process which scans the database at a slow pace across all the tablespaces more-or-less all the time checking for corruption. Thanks! Stephen
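A byte-based option would also need a small parser for values such as "512kB" or "10MB". The helper below is purely hypothetical (neither the option nor parse_max_rate exists in the patch) and only sketches the idea:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the rate in bytes per second, or 0 on invalid input. */
static uint64_t
parse_max_rate(const char *arg)
{
    char       *end;
    uint64_t    value = strtoull(arg, &end, 10);

    while (isspace((unsigned char) *end))
        end++;

    if (*end == '\0')
        return value;                   /* plain bytes */
    if (strcmp(end, "kB") == 0)
        return value * 1024;
    if (strcmp(end, "MB") == 0)
        return value * 1024 * 1024;

    fprintf(stderr, "invalid rate \"%s\"\n", arg);
    return 0;
}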
On 09/29/2018 02:14 PM, Stephen Frost wrote: > Greetings, > > * Michael Paquier (michael@paquier.xyz) wrote: >> On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote: >>> One more thought - when running similar tools on a live system, it's >>> usually a good idea to limit the impact by throttling the throughput. As >>> the verification runs in an independent process it can't reuse the >>> vacuum-like cost limit directly, but perhaps it could do something >>> similar? Like, limit the number of blocks read/second, or so? >> >> When it comes to such parameters, not using a number of blocks but >> throttling with a value in bytes (kB or MB of course) speaks more to the >> user. The past experience with checkpoint_segments is one example of >> that. Converting that to a number of blocks internally would definitely >> make sense the most sense. +1 for this idea. > > While I agree this would be a nice additional feature to have, it seems > like something which could certainly be added later and doesn't > necessairly have to be included in the initial patch. If Michael has > time to add that, great, if not, I'd rather have this as-is than not. > True, although I don't think it'd be particularly difficult. > I do tend to agree with Michael that having the parameter be specified > as (or at least able to accept) a byte-based value is a good idea. Sure, I was not really expecting it to be exposed as raw block count. I agree it should be in byte-based values (i.e. just like --max-rate in pg_basebackup). > As another feature idea, having this able to work in parallel across > tablespaces would be nice too. I can certainly imagine some point where > this is a default process which scans the database at a slow pace across > all the tablespaces more-or-less all the time checking for corruption. > Maybe, but that's certainly a non-trivial feature. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hallo Michael, > New version 5 attached. Patch does not seem to apply anymore. Moreover, ISTM that some discussions about behavioral changes are not fully settled. My current opinion is that when offline some errors are not admissible, whereas the same errors are admissible when online because they may be due to the ongoing database processing, so the behavior should not be strictly the same. This might suggest some option to tell the command that it should work in online or offline mode, so that it may be stricter in some cases. The default may be one of the option, eg the stricter offline mode, or maybe guessed at startup. I put the patch in "waiting on author" state. -- Fabien.
Hi Fabien, On Thu, Oct 25, 2018 at 10:16:03AM +0200, Fabien COELHO wrote: > >New version 5 attached. > > Patch does not seem to apply anymore. Thanks, rebased version attached. > Moreover, ISTM that some discussions about behavioral changes are not fully > settled. > > My current opinion is that when offline some errors are not admissible, > whereas the same errors are admissible when online because they may be due > to the ongoing database processing, so the behavior should not be strictly > the same. Indeed, the recently-added pg_verify_checksums testsuite adds a few files with just 'foo' in them and with V5 of the patch, pg_verify_checksums no longer bails out with an error on those. I have now re-added the retry logic for partially-read pages, so that it bails out if it reads a page partially twice. This makes the testsuite work again. I am not convinced we need to differentiate further between online and offline operation, can you explain in more detail which other differences are ok in online mode and why? > This might suggest some option to tell the command that it should work in > online or offline mode, so that it may be stricter in some cases. The > default may be one of the option, eg the stricter offline mode, or maybe > guessed at startup. If we believe the operation should be different, the patch removes the "is cluster online?" check (as it is no longer necessary), so we could just replace the current error message with a global variable with the result of that check and use it where needed (if any). Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hallo Michael, Patch v6 applies cleanly, compiles, local make check is ok. >> My current opinion is that when offline some errors are not admissible, >> whereas the same errors are admissible when online because they may be due >> to the ongoing database processing, so the behavior should not be strictly >> the same. > > Indeed, the recently-added pg_verify_checksums testsuite A welcome addition! > adds a few files with just 'foo' in them and with V5 of the patch, > pg_verify_checksums no longer bails out with an error on those. > I have now re-added the retry logic for partially-read pages, so that it > bails out if it reads a page partially twice. This makes the testsuite > work again. > > I am not convinced we need to differentiate further between online and > offline operation, can you explain in more detail which other > differences are ok in online mode and why? For instance the "file/directory was removed" do not look okay at all when offline, even if unlikely. Moreover, the checks hides the error message and is fully silent in this case, while it was not beforehand on the same error when offline. The "check if page was modified since checkpoint" does not look useful when offline. Maybe it lacks a comment to say that this cannot (should not ?) happen when offline, but even then I would not like it to be true: ISTM that no page should be allowed to be skipped on the checkpoint condition when offline, but it is probably ok to skip with the new page test, which make me still think that they should be counted and reported separately, or at least the checkpoint skip test should not be run when offline. When offline, the retry logic does not make much sense, it should complain directly on the first error? Also, I'm unsure of the read & checksum retry logic *without any delay*. >> This might suggest some option to tell the command that it should work in >> online or offline mode, so that it may be stricter in some cases. The >> default may be one of the option, eg the stricter offline mode, or maybe >> guessed at startup. > > If we believe the operation should be different, the patch removes the > "is cluster online?" check (as it is no longer necessary), so we could > just replace the current error message with a global variable with the > result of that check and use it where needed (if any). That could let open the issue of someone starting the check offline, and then starting the database while it is not finished. Maybe it is not worth sweating about such a narrow use case. If operations are to be different, and it seems to me they should be, I'd suggest (1) auto detect default based one the existing "is cluster online" code, (2) force options, eg --online vs --offline, which would complain and exit if the cluster is not in the right state on startup. I'd suggest to add a failing checksum online test, if possible. At least a "foo" file? It would also be nice if the test could apply on an active database, eg with a low-rate pgbench running in parallel to the verification, but I'm not sure how easy it is to add such a thing. -- Fabien.
Hi, On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote: > >I am not convinced we need to differentiate further between online and > >offline operation, can you explain in more detail which other > >differences are ok in online mode and why? > > For instance the "file/directory was removed" do not look okay at all when > offline, even if unlikely. Moreover, the checks hides the error message and > is fully silent in this case, while it was not beforehand on the same error > when offline. OK, I kinda see the point here and added that. > The "check if page was modified since checkpoint" does not look useful when > offline. Maybe it lacks a comment to say that this cannot (should not ?) > happen when offline, but even then I would not like it to be true: ISTM that > no page should be allowed to be skipped on the checkpoint condition when > offline, but it is probably ok to skip with the new page test, which make me > still think that they should be counted and reported separately, or at least > the checkpoint skip test should not be run when offline. What is the rationale to not skip on the checkpoint condition when the instance is offline? If it was shutdown cleanly, this should not happen, if the instance crashed, those would be spurious errors that would get repaired on recovery. I have not changed that for now. > When offline, the retry logic does not make much sense, it should complain > directly on the first error? Also, I'm unsure of the read & checksum retry > logic *without any delay*. I think the small overhead of retrying in offline mode even if useless is worth avoiding making the code more complicated in order to cater for both modes. Initially there was a delay, but this was removed after analysis and requests by several other reviewers. > >>This might suggest some option to tell the command that it should work in > >>online or offline mode, so that it may be stricter in some cases. The > >>default may be one of the option, eg the stricter offline mode, or maybe > >>guessed at startup. > > > >If we believe the operation should be different, the patch removes the > >"is cluster online?" check (as it is no longer necessary), so we could > >just replace the current error message with a global variable with the > >result of that check and use it where needed (if any). > > That could let open the issue of someone starting the check offline, and > then starting the database while it is not finished. Maybe it is not worth > sweating about such a narrow use case. I don't think we need to cater for that, yeah. > If operations are to be different, and it seems to me they should be, I'd > suggest (1) auto detect default based one the existing "is cluster online" > code, (2) force options, eg --online vs --offline, which would complain and > exit if the cluster is not in the right state on startup. The current code bails out if it thinks the cluster is online. What is wrong with just setting a flag now in case it is? > I'd suggest to add a failing checksum online test, if possible. At least a > "foo" file? Ok, done so. > It would also be nice if the test could apply on an active database, > eg with a low-rate pgbench running in parallel to the verification, > but I'm not sure how easy it is to add such a thing. That sounds much more complicated so I have not tackled that yet. 
Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote: > > The "check if page was modified since checkpoint" does not look useful when > > offline. Maybe it lacks a comment to say that this cannot (should not ?) > > happen when offline, but even then I would not like it to be true: ISTM that > > no page should be allowed to be skipped on the checkpoint condition when > > offline, but it is probably ok to skip with the new page test, which make me > > still think that they should be counted and reported separately, or at least > > the checkpoint skip test should not be run when offline. > > What is the rationale to not skip on the checkpoint condition when the > instance is offline? If it was shutdown cleanly, this should not > happen, if the instance crashed, those would be spurious errors that > would get repaired on recovery. > > I have not changed that for now. Agreed- this is an important check even in offline mode. > > When offline, the retry logic does not make much sense, it should complain > > directly on the first error? Also, I'm unsure of the read & checksum retry > > logic *without any delay*. The race condition being considered here is where an 8k read somehow gets the first 4k, then is scheduled off-cpu, and the full 8k page is then written by some other process, and then this process is woken up to read the second 4k. I agree that this is unnecessary when the database is offline, but it's also pretty cheap. When the database is online, it's an extremely unlikely case to hit (just try to reproduce it...) but if it does get hit then it's easy enough to recheck by doing a reread, which should show that the LSN has been updated in the first 4k and we can then know that this page is in the WAL. We have not yet seen a case where such a re-read returns an old LSN and an invalid checksum; based on discussion with other hackers, that shouldn't be possible as every kernel seems to consistently write in-order, meaning that the first 4k will be updated before the second, so a single re-read should be sufficient. Remember- this is all in-memory activity also, we aren't talking about what might happen on disk here. > I think the small overhead of retrying in offline mode even if useless > is worth avoiding making the code more complicated in order to cater for > both modes. Agreed. > Initially there was a delay, but this was removed after analysis and > requests by several other reviewers. Agreed, there's no need for or point to having such a delay. > > >>This might suggest some option to tell the command that it should work in > > >>online or offline mode, so that it may be stricter in some cases. The > > >>default may be one of the option, eg the stricter offline mode, or maybe > > >>guessed at startup. > > > > > >If we believe the operation should be different, the patch removes the > > >"is cluster online?" check (as it is no longer necessary), so we could > > >just replace the current error message with a global variable with the > > >result of that check and use it where needed (if any). > > > > That could let open the issue of someone starting the check offline, and > > then starting the database while it is not finished. Maybe it is not worth > > sweating about such a narrow use case. > > I don't think we need to cater for that, yeah. Agreed. 
> > It would also be nice if the test could apply on an active database, > > eg with a low-rate pgbench running in parallel to the verification, > > but I'm not sure how easy it is to add such a thing. > > That sounds much more complicated so I have not tackled that yet. I agree that this would be nice, but I don't want the regression tests to become much longer... Thanks! Stephen
On 11/22/18 2:12 AM, Stephen Frost wrote: > Greetings, > > * Michael Banck (michael.banck@credativ.de) wrote: >> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote: >>> The "check if page was modified since checkpoint" does not look useful when >>> offline. Maybe it lacks a comment to say that this cannot (should not ?) >>> happen when offline, but even then I would not like it to be true: ISTM that >>> no page should be allowed to be skipped on the checkpoint condition when >>> offline, but it is probably ok to skip with the new page test, which make me >>> still think that they should be counted and reported separately, or at least >>> the checkpoint skip test should not be run when offline. >> >> What is the rationale to not skip on the checkpoint condition when the >> instance is offline? If it was shutdown cleanly, this should not >> happen, if the instance crashed, those would be spurious errors that >> would get repaired on recovery. >> >> I have not changed that for now. > > Agreed- this is an important check even in offline mode. > Yeah. I suppose we could detect if the shutdown was clean (like pg_rewind does), and then skip the check. Or perhaps we should still do the check (without a retry), and report it as issue when we find a page with LSN newer than the last checkpoint. In any case, the check is pretty cheap (comparing two 64-bit values), and I don't see how skipping it would optimize anything. It would make the code a tad simpler, but we still need the check for the online mode. >>> When offline, the retry logic does not make much sense, it should complain >>> directly on the first error? Also, I'm unsure of the read & checksum retry >>> logic *without any delay*. > > The race condition being considered here is where an 8k read somehow > gets the first 4k, then is scheduled off-cpu, and the full 8k page is > then written by some other process, and then this process is woken up > to read the second 4k. I agree that this is unnecessary when the > database is offline, but it's also pretty cheap. When the database is > online, it's an extremely unlikely case to hit (just try to reproduce > it...) but if it does get hit then it's easy enough to recheck by doing > a reread, which should show that the LSN has been updated in the first > 4k and we can then know that this page is in the WAL. We have not yet > seen a case where such a re-read returns an old LSN and an invalid > checksum; based on discussion with other hackers, that shouldn't be > possible as every kernel seems to consistently write in-order, meaning > that the first 4k will be updated before the second, so a single re-read > should be sufficient. > Right. A minor detail is that the reads/writes should be atomic at the sector level, which used to be 512B, so it's not just about pages torn in 4kB/4kB manner, but possibly an arbitrary mix of 512B chunks from old and new version. This also explains why we don't need any delay - the reread happens after the write must have already written the page header, so the new LSN must be already visible. So no delay is necessary. And if it was, how long should the delay be? The processes might end up off-CPU for arbitrary amount of time, so picking a good value would be pretty tricky. > Remember- this is all in-memory activity also, we aren't talking about > what might happen on disk here. > >> I think the small overhead of retrying in offline mode even if useless >> is worth avoiding making the code more complicated in order to cater for >> both modes. > > Agreed. 
> >> Initially there was a delay, but this was removed after analysis and >> requests by several other reviewers. > > Agreed, there's no need for or point to having such a delay. > Yep. >>>>> This might suggest some option to tell the command that it should work in >>>>> online or offline mode, so that it may be stricter in some cases. The >>>>> default may be one of the option, eg the stricter offline mode, or maybe >>>>> guessed at startup. >>>> >>>> If we believe the operation should be different, the patch removes the >>>> "is cluster online?" check (as it is no longer necessary), so we could >>>> just replace the current error message with a global variable with the >>>> result of that check and use it where needed (if any). >>> >>> That could let open the issue of someone starting the check offline, and >>> then starting the database while it is not finished. Maybe it is not worth >>> sweating about such a narrow use case. >> >> I don't think we need to cater for that, yeah. > > Agreed. > Yep. I don't think other tools protect against that either. And pg_rewind does actually modify the cluster state, unlike checksum verification. >>> It would also be nice if the test could apply on an active database, >>> eg with a low-rate pgbench running in parallel to the verification, >>> but I'm not sure how easy it is to add such a thing. >> >> That sounds much more complicated so I have not tackled that yet. > > I agree that this would be nice, but I don't want the regression tests > to become much longer... > I have to admit I find this thread rather confusing, because the subject is "online verification of checksums" yet we're discussing verification on offline instances. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 11/22/18 2:12 AM, Stephen Frost wrote: > >* Michael Banck (michael.banck@credativ.de) wrote: > >>On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote: > >>>The "check if page was modified since checkpoint" does not look useful when > >>>offline. Maybe it lacks a comment to say that this cannot (should not ?) > >>>happen when offline, but even then I would not like it to be true: ISTM that > >>>no page should be allowed to be skipped on the checkpoint condition when > >>>offline, but it is probably ok to skip with the new page test, which make me > >>>still think that they should be counted and reported separately, or at least > >>>the checkpoint skip test should not be run when offline. > >> > >>What is the rationale to not skip on the checkpoint condition when the > >>instance is offline? If it was shutdown cleanly, this should not > >>happen, if the instance crashed, those would be spurious errors that > >>would get repaired on recovery. > >> > >>I have not changed that for now. > > > >Agreed- this is an important check even in offline mode. > > Yeah. I suppose we could detect if the shutdown was clean (like pg_rewind > does), and then skip the check. Or perhaps we should still do the check > (without a retry), and report it as issue when we find a page with LSN newer > than the last checkpoint. I agree that it'd be nice to report an issue if it's a clean shutdown but there's an LSN newer than the last checkpoint, though I suspect that would be more useful in debugging and such and not so useful for users. > In any case, the check is pretty cheap (comparing two 64-bit values), and I > don't see how skipping it would optimize anything. It would make the code a > tad simpler, but we still need the check for the online mode. Yeah, I'd just keep the check. > A minor detail is that the reads/writes should be atomic at the sector > level, which used to be 512B, so it's not just about pages torn in 4kB/4kB > manner, but possibly an arbitrary mix of 512B chunks from old and new > version. Sure. > This also explains why we don't need any delay - the reread happens after > the write must have already written the page header, so the new LSN must be > already visible. Agreed. Thanks! Stephen
> On Wed, Nov 21, 2018 at 1:38 PM Michael Banck <michael.banck@credativ.de> wrote: > > Hi, > > On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote: > > >I am not convinced we need to differentiate further between online and > > >offline operation, can you explain in more detail which other > > >differences are ok in online mode and why? > > > > For instance the "file/directory was removed" do not look okay at all when > > offline, even if unlikely. Moreover, the checks hides the error message and > > is fully silent in this case, while it was not beforehand on the same error > > when offline. > > OK, I kinda see the point here and added that. Hi, Just for the information, looks like part of this patch (or at least some similar code), related to the tests in 002_actions.pl, was committed recently in 5c99513975, so there are minor conflicts with the master.
On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote: > Just for the information, looks like part of this patch (or at least some > similar code), related to the tests in 002_actions.pl, was committed recently > in 5c99513975, so there are minor conflicts with the master. From what I can see in v7 of the patch as posted in [1], all the changes to 002_actions.pl could just be removed because there are already equivalents. [1]: https://postgr.es/m/20181121123535.GD23740@nighthawk.caipicrew.dd-dns.de -- Michael
Hi, On Mon, Dec 03, 2018 at 09:48:43AM +0900, Michael Paquier wrote: > On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote: > > Just for the information, looks like part of this patch (or at least some > > similar code), related to the tests in 002_actions.pl, was committed recently > > in 5c99513975, so there are minor conflicts with the master. > > What what I can see in v7 of the patch as posted in [1], all the changes > to 002_actions.pl could just be removed because there are already > equivalents. Yeah, new rebased version attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, On Thu, Dec 20, 2018 at 04:19:11PM +0100, Michael Banck wrote: > Yeah, new rebased version attached. By the way, one thing that this patch also fixes is checksum verification on basebackups (as pointed out the other day by my colleague Bernd Helmele): postgres@kohn:~$ initdb -k data postgres@kohn:~$ pg_ctl -D data -l logfile start waiting for server to start.... done server started postgres@kohn:~$ pg_verify_checksums -D data pg_verify_checksums: cluster must be shut down to verify checksums postgres@kohn:~$ pg_basebackup -h /tmp -D backup1 postgres@kohn:~$ pg_verify_checksums -D backup1 pg_verify_checksums: cluster must be shut down to verify checksums postgres@kohn:~$ pg_checksums -c -D backup1 Checksum scan completed Files scanned: 1094 Blocks scanned: 2867 Bad checksums: 0 Data checksum version: 1 Where pg_checksums has the online verification patch applied. As I don't think many people will take down their production servers in order to verify checksums, verifying them on basebackups looks like a useful use-case that is currently broken with pg_verify_checksums. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hallo Michael, > Yeah, new rebased version attached. Patch v8 applies cleanly, compiles, global & local make check are ok. A few comments: About added tests: the node is left running at the end of the script, which is not very clean. I'd suggest to either move the added checks before stopping, or to stop again at the end of the script, depending on the intention. I'm wondering (possibly again) about the existing early exit if one block cannot be read on retry: the command should count this as a kind of bad block, proceed on checking other files, and obviously fail in the end, but having checked everything else and generated a report. I do not think that this condition warrants a full stop. ISTM that under rare race conditions (eg, an unlucky concurrent "drop database" or "drop table") this could happen when online, although I could not trigger one despite heavy testing, so I'm possibly mistaken. -- Fabien.
Hi, On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote: > Hallo Michael, > > > Yeah, new rebased version attached. > > Patch v8 applies cleanly, compiles, global & local make check are ok. > > A few comments: > > About added tests: the node is left running at the end of the script, which > is not very clean. I'd suggest to either move the added checks before > stopping, or to stop again at the end of the script, depending on the > intention. Michael? > I'm wondering (possibly again) about the existing early exit if one block > cannot be read on retry: the command should count this as a kind of bad > block, proceed on checking other files, and obviously fail in the end, but > having checked everything else and generated a report. I do not think that > this condition warrants a full stop. ISTM that under rare race conditions > (eg, an unlucky concurrent "drop database" or "drop table") this could > happen when online, although I could not trigger one despite heavy testing, > so I'm possibly mistaken. This seems like a defensible judgement call either way. Greetings, Andres Freund
On Sun, Feb 03, 2019 at 02:06:45AM -0800, Andres Freund wrote: > On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote: >> About added tests: the node is left running at the end of the script, which >> is not very clean. I'd suggest to either move the added checks before >> stopping, or to stop again at the end of the script, depending on the >> intention. > > Michael? Unlikely P., and most likely B. I have marked the patch as returned with feedback as it has been a couple of weeks already. -- Michael
Hi, Am Sonntag, den 03.02.2019, 02:06 -0800 schrieb Andres Freund: > Hi, > > On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote: > > Hallo Michael, > > > > > Yeah, new rebased version attached. > > > > Patch v8 applies cleanly, compiles, global & local make check are ok. > > > > A few comments: > > > > About added tests: the node is left running at the end of the script, which > > is not very clean. I'd suggest to either move the added checks before > > stopping, or to stop again at the end of the script, depending on the > > intention. > > Michael? Uh, I kinda forgot about this, I've made the tests stop the node now. > > I'm wondering (possibly again) about the existing early exit if one block > > cannot be read on retry: the command should count this as a kind of bad > > block, proceed on checking other files, and obviously fail in the end, but > > having checked everything else and generated a report. I do not think that > > this condition warrants a full stop. ISTM that under rare race conditions > > (eg, an unlucky concurrent "drop database" or "drop table") this could > > happen when online, although I could not trigger one despite heavy testing, > > so I'm possibly mistaken. > > This seems like a defensible judgement call either way. Right now we have a few tests that explicitly check that pg_verify_checksums fail on broken data ("foo" in the file). Those would then just get skipped AFAICT, which I think is the worse behaviour , but if everybody thinks that should be the way to go, we can drop/adjust those tests and make pg_verify_checksums skip them. Thoughts? In the meanwhile, v9 is attached with the above change and rebased (without changes) to master. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hallo Michael, >>> I'm wondering (possibly again) about the existing early exit if one block >>> cannot be read on retry: the command should count this as a kind of bad >>> block, proceed on checking other files, and obviously fail in the end, but >>> having checked everything else and generated a report. I do not think that >>> this condition warrants a full stop. ISTM that under rare race conditions >>> (eg, an unlucky concurrent "drop database" or "drop table") this could >>> happen when online, although I could not trigger one despite heavy testing, >>> so I'm possibly mistaken. >> >> This seems like a defensible judgement call either way. > > Right now we have a few tests that explicitly check that > pg_verify_checksums fail on broken data ("foo" in the file). Those > would then just get skipped AFAICT, which I think is the worse behaviour > , but if everybody thinks that should be the way to go, we can > drop/adjust those tests and make pg_verify_checksums skip them. > > Thoughts? My point is that it should fail as it does, only not immediately (early exit), but after having checked everything else. This mean avoiding calling "exit(1)" here and there (lseek, fopen...), but taking note that something bad happened, and call exit only in the end. -- Fabien.
Hi, On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote: > > > > I'm wondering (possibly again) about the existing early exit if one block > > > > cannot be read on retry: the command should count this as a kind of bad > > > > block, proceed on checking other files, and obviously fail in the end, but > > > > having checked everything else and generated a report. I do not think that > > > > this condition warrants a full stop. ISTM that under rare race conditions > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could > > > > happen when online, although I could not trigger one despite heavy testing, > > > > so I'm possibly mistaken. > > > > > > This seems like a defensible judgement call either way. > > > > Right now we have a few tests that explicitly check that > > pg_verify_checksums fail on broken data ("foo" in the file). Those > > would then just get skipped AFAICT, which I think is the worse behaviour > > , but if everybody thinks that should be the way to go, we can > > drop/adjust those tests and make pg_verify_checksums skip them. > > > > Thoughts? > > My point is that it should fail as it does, only not immediately (early > exit), but after having checked everything else. This mean avoiding calling > "exit(1)" here and there (lseek, fopen...), but taking note that something > bad happened, and call exit only in the end. I can see both as being valuable (one gives you a more complete picture, the other a quicker answer in scripts). For me that's the point where it's the prerogative of the author to make that choice. Greetings, Andres Freund
On 2/5/19 8:01 AM, Andres Freund wrote: > Hi, > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote: >>>>> I'm wondering (possibly again) about the existing early exit if one block >>>>> cannot be read on retry: the command should count this as a kind of bad >>>>> block, proceed on checking other files, and obviously fail in the end, but >>>>> having checked everything else and generated a report. I do not think that >>>>> this condition warrants a full stop. ISTM that under rare race conditions >>>>> (eg, an unlucky concurrent "drop database" or "drop table") this could >>>>> happen when online, although I could not trigger one despite heavy testing, >>>>> so I'm possibly mistaken. >>>> >>>> This seems like a defensible judgement call either way. >>> >>> Right now we have a few tests that explicitly check that >>> pg_verify_checksums fail on broken data ("foo" in the file). Those >>> would then just get skipped AFAICT, which I think is the worse behaviour >>> , but if everybody thinks that should be the way to go, we can >>> drop/adjust those tests and make pg_verify_checksums skip them. >>> >>> Thoughts? >> >> My point is that it should fail as it does, only not immediately (early >> exit), but after having checked everything else. This mean avoiding calling >> "exit(1)" here and there (lseek, fopen...), but taking note that something >> bad happened, and call exit only in the end. > > I can see both as being valuable (one gives you a more complete picture, > the other a quicker answer in scripts). For me that's the point where > it's the prerogative of the author to make that choice. > Why not make this configurable, using a command-line option? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra: > On 2/5/19 8:01 AM, Andres Freund wrote: > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote: > > > > > > I'm wondering (possibly again) about the existing early exit if one block > > > > > > cannot be read on retry: the command should count this as a kind of bad > > > > > > block, proceed on checking other files, and obviously fail in the end, but > > > > > > having checked everything else and generated a report. I do not think that > > > > > > this condition warrants a full stop. ISTM that under rare race conditions > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could > > > > > > happen when online, although I could not trigger one despite heavy testing, > > > > > > so I'm possibly mistaken. > > > > > > > > > > This seems like a defensible judgement call either way. > > > > > > > > Right now we have a few tests that explicitly check that > > > > pg_verify_checksums fail on broken data ("foo" in the file). Those > > > > would then just get skipped AFAICT, which I think is the worse behaviour > > > > , but if everybody thinks that should be the way to go, we can > > > > drop/adjust those tests and make pg_verify_checksums skip them. > > > > > > > > Thoughts? > > > > > > My point is that it should fail as it does, only not immediately (early > > > exit), but after having checked everything else. This mean avoiding calling > > > "exit(1)" here and there (lseek, fopen...), but taking note that something > > > bad happened, and call exit only in the end. > > > > I can see both as being valuable (one gives you a more complete picture, > > the other a quicker answer in scripts). For me that's the point where > > it's the prerogative of the author to make that choice. Personally, I would prefer to keep it as simple as possible for now and get this patch committed; in my opinion the behaviour is already like this (early exit on corrupt files) so I don't think the online verification patch should change this. If we see complaints about this, then I'd be happy to change it afterwards. > Why not make this configurable, using a command-line option? I like this even less - this tool is about verifying checksums, so adding options on what to do when it encounters broken pages looks out- of-scope to me. Unless we want to say it should generally abort on the first issue (i.e. on wrong checksums as well). Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra: > > On 2/5/19 8:01 AM, Andres Freund wrote: > > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote: > > > > > > > I'm wondering (possibly again) about the existing early exit if one block > > > > > > > cannot be read on retry: the command should count this as a kind of bad > > > > > > > block, proceed on checking other files, and obviously fail in the end, but > > > > > > > having checked everything else and generated a report. I do not think that > > > > > > > this condition warrants a full stop. ISTM that under rare race conditions > > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could > > > > > > > happen when online, although I could not trigger one despite heavy testing, > > > > > > > so I'm possibly mistaken. > > > > > > > > > > > > This seems like a defensible judgement call either way. > > > > > > > > > > Right now we have a few tests that explicitly check that > > > > > pg_verify_checksums fail on broken data ("foo" in the file). Those > > > > > would then just get skipped AFAICT, which I think is the worse behaviour > > > > > , but if everybody thinks that should be the way to go, we can > > > > > drop/adjust those tests and make pg_verify_checksums skip them. > > > > > > > > > > Thoughts? > > > > > > > > My point is that it should fail as it does, only not immediately (early > > > > exit), but after having checked everything else. This mean avoiding calling > > > > "exit(1)" here and there (lseek, fopen...), but taking note that something > > > > bad happened, and call exit only in the end. > > > > > > I can see both as being valuable (one gives you a more complete picture, > > > the other a quicker answer in scripts). For me that's the point where > > > it's the prerogative of the author to make that choice. ... unless people here object or prefer other options, and then it's up to discussion and hopefully some consensus comes out of it. Also, I have to say that I really don't think the 'quicker answer' argument holds any weight, making me question if that's a valid use-case. If there *isn't* an issue, which we would likely all agree is the case the vast majority of the time that this is going to be run, then it's going to take quite a while and anyone calling it should expect and be prepared for that. In the extremely rare cases, what does exiting early actually do for us? > Personally, I would prefer to keep it as simple as possible for now and > get this patch committed; in my opinion the behaviour is already like > this (early exit on corrupt files) so I don't think the online > verification patch should change this. I'm also in the camp of "would rather it not exit immediately, so the extent of the issue is clear". > If we see complaints about this, then I'd be happy to change it > afterwards. I really don't think this is something we should change later on in a future release.. If the consensus is that there's really two different but valid use-cases then we should make it configurable, but I'm not convinced there is. > > Why not make this configurable, using a command-line option? > > I like this even less - this tool is about verifying checksums, so > adding options on what to do when it encounters broken pages looks out- > of-scope to me. Unless we want to say it should generally abort on the > first issue (i.e. on wrong checksums as well). 
I definitely disagree that it's somehow 'out of scope' for this tool to skip broken pages, when we can tell that they're broken. There is a question here about how to handle a short read since that can happen under normal conditions if we're unlucky. The same is also true for files disappearing entirely. So, let's talk/think through a few cases: A file with just 'foo\n' in it- could that be a page starting with an LSN around 666F6F0A that we somehow only read the first few bytes of? If not, why not? I could possibly see an argument that we expect to always get at least 512 bytes in a read, or 4K, but it seems like we could possibly run into edge cases on odd filesystems or such. In the end, I'm leaning towards categorizing different things, well, differently- a short read would be reported as a NOTICE or equivalent, perhaps, meaning that the test case needs to do something more than just have a file with 'foo' in it, but that is likely a good thing anyway- the test cases would be better if they were closer to real world. Other read failures would be reported in a more serious category assuming they are "this really shouldn't happen" cases. A file disappearing isn't a "can't happen" case, and might be reported at the same 'NOTICE' level (or maybe with a 'verbose' option). A file that's 8k in size and has a checksum but it's not right seems pretty clear to me. Might as well include a count of pages which have a valid checksum, I would think, though perhaps only in a 'verbose' mode would that get reported. A completely zero'd page could also be reported at a NOTICE level or with a count, or perhaps only with verbose. Other thoughts about use-cases and what should happen..? Thanks! Stephen
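One way to structure that kind of reporting is sketched below: each condition maps to a severity, notices are printed but only errors (failing read() calls, bad checksums) would make the run as a whole fail. This is an illustration of the categorization under discussion, not code from the patch.

#include <stdarg.h>
#include <stdio.h>

typedef enum
{
    REPORT_NOTICE,      /* short read, vanished file, zeroed page */
    REPORT_ERROR        /* failing read() call, wrong checksum */
} ReportLevel;

static long notices = 0;
static long errors = 0;

static void
report(ReportLevel level, const char *fmt, ...)
{
    va_list     ap;

    fputs(level == REPORT_ERROR ? "error: " : "notice: ", stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);

    if (level == REPORT_ERROR)
        errors++;
    else
        notices++;
}

A short read could then be reported as report(REPORT_NOTICE, "short read of block %u in file \"%s\"", blkno, fn), with blkno and fn being whatever the scanning loop tracks, and only the error counter would feed into the exit status.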
Attachment
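To make the reporting categories sketched in the message above a little more concrete, here is a minimal illustration in C of how per-block outcomes could be classified and tallied. None of these names exist in the patch; they are purely hypothetical, and the split between a hard failure and the notice-level counters follows the discussion above.

    #include "postgres_fe.h"

    /*
     * Illustrative classification of per-block outcomes.  A bad checksum is
     * a hard failure and drives a non-zero exit status; the other cases are
     * only counted and summarized at the end (or shown with a verbose flag).
     */
    typedef enum
    {
        BLOCK_OK,               /* checksum verified */
        BLOCK_NEW,              /* PageIsNew(), nothing to verify */
        BLOCK_BAD_CHECKSUM,     /* hard failure */
        BLOCK_SHORT_READ,       /* notice level: possibly concurrent extension */
        BLOCK_FILE_VANISHED     /* notice level: e.g. concurrent DROP TABLE */
    } BlockOutcome;

    static int64 badblocks = 0;
    static int64 shortreads = 0;
    static int64 vanishedfiles = 0;

    static void
    count_outcome(BlockOutcome outcome)
    {
        if (outcome == BLOCK_BAD_CHECKSUM)
            badblocks++;        /* always reported */
        else if (outcome == BLOCK_SHORT_READ)
            shortreads++;       /* reported as a notice, or only with -v */
        else if (outcome == BLOCK_FILE_VANISHED)
            vanishedfiles++;    /* likewise */
    }

Whether the notice-level counters should influence the exit status is exactly the question debated in the following messages.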
Hi, Am Mittwoch, den 06.02.2019, 11:39 -0500 schrieb Stephen Frost: > * Michael Banck (michael.banck@credativ.de) wrote: > > Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra: > > > On 2/5/19 8:01 AM, Andres Freund wrote: > > > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote: > > > > > > > > I'm wondering (possibly again) about the existing early exit if one block > > > > > > > > cannot be read on retry: the command should count this as a kind of bad > > > > > > > > block, proceed on checking other files, and obviously fail in the end, but > > > > > > > > having checked everything else and generated a report. I do not think that > > > > > > > > this condition warrants a full stop. ISTM that under rare race conditions > > > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could > > > > > > > > happen when online, although I could not trigger one despite heavy testing, > > > > > > > > so I'm possibly mistaken. > > > > > > > > > > > > > > This seems like a defensible judgement call either way. > > > > > > > > > > > > Right now we have a few tests that explicitly check that > > > > > > pg_verify_checksums fail on broken data ("foo" in the file). Those > > > > > > would then just get skipped AFAICT, which I think is the worse behaviour > > > > > > , but if everybody thinks that should be the way to go, we can > > > > > > drop/adjust those tests and make pg_verify_checksums skip them. > > > > > > > > > > > > Thoughts? > > > > > > > > > > My point is that it should fail as it does, only not immediately (early > > > > > exit), but after having checked everything else. This mean avoiding calling > > > > > "exit(1)" here and there (lseek, fopen...), but taking note that something > > > > > bad happened, and call exit only in the end. > > > > > > > > I can see both as being valuable (one gives you a more complete picture, > > > > the other a quicker answer in scripts). For me that's the point where > > > > it's the prerogative of the author to make that choice. > > ... unless people here object or prefer other options, and then it's up > to discussion and hopefully some consensus comes out of it. > > Also, I have to say that I really don't think the 'quicker answer' > argument holds any weight, making me question if that's a valid > use-case. If there *isn't* an issue, which we would likely all agree is > the case the vast majority of the time that this is going to be run, > then it's going to take quite a while and anyone calling it should > expect and be prepared for that. In the extremely rare cases, what does > exiting early actually do for us? > > > Personally, I would prefer to keep it as simple as possible for now and > > get this patch committed; in my opinion the behaviour is already like > > this (early exit on corrupt files) so I don't think the online > > verification patch should change this. > > I'm also in the camp of "would rather it not exit immediately, so the > extent of the issue is clear". > > > If we see complaints about this, then I'd be happy to change it > > afterwards. > > I really don't think this is something we should change later on in a > future release.. If the consensus is that there's really two different > but valid use-cases then we should make it configurable, but I'm not > convinced there is. OK, fair enough. > > > Why not make this configurable, using a command-line option? 
> > > > I like this even less - this tool is about verifying checksums, so > > adding options on what to do when it encounters broken pages looks out- > > of-scope to me. Unless we want to say it should generally abort on the > > first issue (i.e. on wrong checksums as well). > > I definitely disagree that it's somehow 'out of scope' for this tool to > skip broken pages, when we can tell that they're broken. I didn't mean that it's out-of-scope for pg_verify_checksums, I meant it is out-of-scope for this patch, which adds online checking. > There is a question here about how to handle a short read since that > can happen under normal conditions if we're unlucky. The same is also > true for files disappearing entirely. > > So, let's talk/think through a few cases: > > A file with just 'foo\n' in it- could that be a page starting with > an LSN around 666F6F0A that we somehow only read the first few bytes of? > If not, why not? I could possibly see an argument that we expect to > always get at least 512 bytes in a read, or 4K, but it seems like we > could possibly run into edge cases on odd filesystems or such. In the > end, I'm leaning towards categorizing different things, well, > differently- a short read would be reported as a NOTICE or equivilant, > perhaps, meaning that the test case needs to do something more than just > have a file with 'foo' in it, but that is likely a good things anyway- > the test cases would be better if they were closer to real world. Other > read failures would be reported in a more serious category assuming they > are "this really shouldn't happen" cases. A file disappearing isn't a > "can't happen" case, and might be reported at the same 'NOTICE' level > (or maybe with a 'verbose' ption). In the context of this patch, we should also discern whether a particular case is merely a notice (or warning) on an offline cluster - I guess you think it should be? So I've changed it such that a short read emits a "warning" message, increments a new skippedfiles (as it is not just a skipped block) variable and reports its number at the end - should it then exit with > 0 even if there were no wrong checksums? > A file that's 8k in size and has a checksum but it's not right seems > pretty clear to me. Might as well include a count of pages which have a > valid checksum, I would think, though perhaps only in a 'verbose' mode > would that get reported. What's the use for that? It already reports the number of scanned blocks at the end, so that number is pretty easy to figure out from it and the number of bad checksums. > A completely zero'd page could also be reported at a NOTICE level or > with a count, or perhaps only with verbose. It is counted as a skipped block right now (well, every block that qualifies for PageIsNew() is), but skipped blocks are not mentioned right now. I guess the rationale is that it might lead to excessive screen output (but then, verbose originally logged /every/ block), but you'd have to check with the original authors. So I have now changed behaviour so that short writes count as skipped files and pg_verify_checksums no longer bails out on them. When this occurs, a warning is written to stderr and their overall count is also reported at the end. However, unless there are other blocks with bad checksums, the exit status is kept at zero. New patch attached.
Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
Hallo Michael, > So I have now changed behaviour so that short writes count as skipped > files and pg_verify_checksums no longer bails out on them. When this > occors a warning is written to stderr and their overall count is also > reported at the end. However, unless there are other blocks with bad > checksums, the exit status is kept at zero. This seems fair when online, however I'm wondering whether it is when offline. I'd say that the whole retry logic should be skipped in this case? i.e. "if (block_retry || !online) { error message and continue }" on both short read & checksum failure retries. > New patch attached. Patch applies cleanly, compiles, global & local make check ok. I'm wondering whether it should exit(1) on "lseek" failures. Would it make sense to skip the file and report it as such? Should it be counted as a skippedfile? WRT the final status, ISTM that skipped blocks & files could warrant an error when offline, although they might be ok when online? -- Fabien.
Hi, Am Donnerstag, den 28.02.2019, 14:29 +0100 schrieb Fabien COELHO: > > So I have now changed behaviour so that short writes count as skipped > > files and pg_verify_checksums no longer bails out on them. When this > > occors a warning is written to stderr and their overall count is also > > reported at the end. However, unless there are other blocks with bad > > checksums, the exit status is kept at zero. > > This seems fair when online, however I'm wondering whether it is when > offline. I'd say that the whole retry logic should be skipped in this > case? i.e. "if (block_retry || !online) { error message and continue }" > on both short read & checksum failure retries. Ok, the stand-alone pg_checksums program also got a PR about the LSN skip logic not being helpful when the instance is offline and somebody just writes /dev/urandom over the heap files: https://github.com/credativ/pg_checksums/pull/6 So I now tried to change the patch so that it only retries blocks when online. > Patch applies cleanly, compiles, global & local make check ok. > > I'm wondering whether it should exit(1) on "lseek" failures. Would it make > sense to skip the file and report it as such? Should it be counted as a > skippedfile? Ok, I think it makes sense to march on and I changed it that way. > WRT the final status, ISTM that slippedblocks & files could warrant an > error when offline, although they might be ok when online? Ok, also changed it that way. New patch attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
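For reference, the "retry only when online" gate that is being discussed here boils down to something like the sketch below. The names online, block_retry and skippedfiles mirror the variables mentioned in this thread; the helper itself and its exact shape are made up for illustration.

    #include "postgres_fe.h"

    #include <stdio.h>

    /*
     * Sketch of the retry gate: "r" is what read() returned for a block
     * where BLCKSZ bytes were expected.  Returns true if the caller should
     * re-read the same block; otherwise the file is counted as skipped.
     * Offline there is no concurrent writer, so no retry is attempted.
     */
    static bool
    handle_short_read(const char *fn, uint32 blockno, int r,
                      bool online, bool *block_retry, int64 *skippedfiles)
    {
        if (online && !*block_retry)
        {
            /* possibly a concurrent extension: try this block once more */
            *block_retry = true;
            return true;
        }

        fprintf(stderr, "warning: could not read block %u in file \"%s\": read %d of %d\n",
                blockno, fn, r, BLCKSZ);
        (*skippedfiles)++;
        return false;
    }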
On Tue, Sep 18, 2018 at 10:37 AM Michael Banck <michael.banck@credativ.de> wrote: > I have added a retry for this as well now, without a pg_sleep() as well. > This catches around 80% of the half-reads, but a few slip through. At > that point we bail out with exit(1), and the user can try again, which I > think is fine? Maybe I'm confused here, but catching 80% of torn pages doesn't sound robust at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > <michael.banck@credativ.de> wrote: > > I have added a retry for this as well now, without a pg_sleep() as well. > > This catches around 80% of the half-reads, but a few slip through. At > > that point we bail out with exit(1), and the user can try again, which I > > think is fine? > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > robust at all. The chance that pg_verify_checksums hits a torn page (at least in my tests, see below) is already pretty low, a couple of times per 1000 runs. Maybe 4 out of 5 times, the page is read fine on retry and we march on. Otherwise, we now just issue a warning and skip the file (or so was the idea, see below), do you think that is not acceptable? I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and pg_verify_checksums in tight loops) with the current patch version, and I am seeing short reads very, very rarely (maybe every 1000th run) with a warning like: |1174 |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192 |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192 |Files skipped: 2 The 1174 is the sequence number, the first 1173 runs of pg_verify_checksums only skipped blocks. However, the fact it shows two warnings for the same file means there is something wrong here. It was continuing to the next block while I think it should just skip to the next file on read failures. So I have changed that now, new patch attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > > <michael.banck@credativ.de> wrote: > > > I have added a retry for this as well now, without a pg_sleep() as well. > > > This catches around 80% of the half-reads, but a few slip through. At > > > that point we bail out with exit(1), and the user can try again, which I > > > think is fine? > > > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > > robust at all. > > The chance that pg_verify_checksums hits a torn page (at least in my > tests, see below) is already pretty low, a couple of times per 1000 > runs. Maybe 4 out 5 times, the page is read fine on retry and we march > on. Otherwise, we now just issue a warning and skip the file (or so was > the idea, see below), do you think that is not acceptable? > > I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and > pg_verify_checksums in tight loops) with the current patch version, and > I am seeing short reads very, very rarely (maybe every 1000th run) with > a warning like: > > |1174 > |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192 > |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192 > |Files skipped: 2 > > The 1174 is the sequence number, the first 1173 runs of > pg_verify_checksums only skipped blocks. > > However, the fact it shows two warnings for the same file means there is > something wrong here. It was continueing to the next block while I think > it should just skip to the next file on read failures. So I have changed > that now, new patch attached. I'm confused- if previously it was continuing to the next block instead of doing the re-read on the same block, why don't we just change it to do the re-read on the same block properly and see if that fixes the retry, instead of just giving up and skipping..? I'm not necessarily against skipping to the next file, to be clear, but I think I'd be happier if we kept reading the file until we actually get EOF. (I've not looked at the actual patch, just read what you wrote..) Thanks! Stephen
Attachment
On 3/2/19 12:03 AM, Robert Haas wrote: > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > <michael.banck@credativ.de> wrote: >> I have added a retry for this as well now, without a pg_sleep() as well. >> This catches around 80% of the half-reads, but a few slip through. At >> that point we bail out with exit(1), and the user can try again, which I >> think is fine? > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > robust at all. > FWIW I don't think this qualifies as a torn page - i.e. it's not a full read with a mix of old and new data. This is a partial write, most likely because we read the blocks one by one, and when we hit the last page while the table is being extended, we may only see the first 4kB. And if we retry very fast, we may still see only the first 4kB. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/2/19 5:08 PM, Stephen Frost wrote: > Greetings, > > * Michael Banck (michael.banck@credativ.de) wrote: >> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: >>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck >>> <michael.banck@credativ.de> wrote: >>>> I have added a retry for this as well now, without a pg_sleep() as well. >>>> This catches around 80% of the half-reads, but a few slip through. At >>>> that point we bail out with exit(1), and the user can try again, which I >>>> think is fine? >>> >>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound >>> robust at all. >> >> The chance that pg_verify_checksums hits a torn page (at least in my >> tests, see below) is already pretty low, a couple of times per 1000 >> runs. Maybe 4 out 5 times, the page is read fine on retry and we march >> on. Otherwise, we now just issue a warning and skip the file (or so was >> the idea, see below), do you think that is not acceptable? >> >> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and >> pg_verify_checksums in tight loops) with the current patch version, and >> I am seeing short reads very, very rarely (maybe every 1000th run) with >> a warning like: >> >> |1174 >> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192 >> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192 >> |Files skipped: 2 >> >> The 1174 is the sequence number, the first 1173 runs of >> pg_verify_checksums only skipped blocks. >> >> However, the fact it shows two warnings for the same file means there is >> something wrong here. It was continueing to the next block while I think >> it should just skip to the next file on read failures. So I have changed >> that now, new patch attached. > > I'm confused- if previously it was continueing to the next block instead > of doing the re-read on the same block, why don't we just change it to > do the re-read on the same block properly and see if that fixes the > retry, instead of just giving up and skipping..? I'm not necessairly > against skipping to the next file, to be clear, but I think I'd be > happier if we kept reading the file until we actually get EOF. > > (I've not looked at the actual patch, just read what you wrote..) > Notice that those two errors are actually for two consecutive blocks in the same file. So what probably happened is that postgres started to extend the page, and the verification tried to read the last page after the kernel added just the first 4kB filesystem page. Then it probably succeeded on a retry, and then the same thing happened on the next page. I don't think EOF addresses this, though - the partial read happens before we actually reach the end of the file. And re-reads are not a solution either, because the second read may still see only the first half, and then what - is it a permanent issue (in which case it's a data corruption), or an extension in progress? I wonder if we can simply ignore those errors entirely, if it's the last page in the segment? We can't really check the file is "complete" anyway, e.g. if you have multiple segments for a table, and the "middle" one is a page shorter, we'll happily ignore that during verification. Also, what if we're reading a file and it gets truncated (e.g. after vacuum notices the last few pages are empty)? Doesn't that have the same issue? 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
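A minimal sketch of the "ignore it if it is the last page" idea from the message above, with purely illustrative names; blocks_at_open would come from the segment's size as measured when the file was opened.

    #include "postgres_fe.h"

    /*
     * A short read is treated as benign when the file has shrunk under us
     * (plain EOF after a concurrent truncation), or when it hits the last
     * block the segment had when it was opened (likely a concurrent
     * extension).  Everything else is reported as a real read problem.
     */
    static bool
    short_read_is_benign(int64 blockno, int64 blocks_at_open, int bytes_read)
    {
        if (bytes_read == 0)
            return true;        /* truncated: ordinary end of file */
        if (blockno >= blocks_at_open - 1)
            return true;        /* racing against relation extension */
        return false;
    }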
Hi, On 2019-03-02 22:49:33 +0100, Tomas Vondra wrote: > > > On 3/2/19 5:08 PM, Stephen Frost wrote: > > Greetings, > > > > * Michael Banck (michael.banck@credativ.de) wrote: > >> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > >>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > >>> <michael.banck@credativ.de> wrote: > >>>> I have added a retry for this as well now, without a pg_sleep() as well. > >>>> This catches around 80% of the half-reads, but a few slip through. At > >>>> that point we bail out with exit(1), and the user can try again, which I > >>>> think is fine? > >>> > >>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound > >>> robust at all. > >> > >> The chance that pg_verify_checksums hits a torn page (at least in my > >> tests, see below) is already pretty low, a couple of times per 1000 > >> runs. Maybe 4 out 5 times, the page is read fine on retry and we march > >> on. Otherwise, we now just issue a warning and skip the file (or so was > >> the idea, see below), do you think that is not acceptable? > >> > >> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and > >> pg_verify_checksums in tight loops) with the current patch version, and > >> I am seeing short reads very, very rarely (maybe every 1000th run) with > >> a warning like: > >> > >> |1174 > >> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192 > >> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192 > >> |Files skipped: 2 > >> > >> The 1174 is the sequence number, the first 1173 runs of > >> pg_verify_checksums only skipped blocks. > >> > >> However, the fact it shows two warnings for the same file means there is > >> something wrong here. It was continueing to the next block while I think > >> it should just skip to the next file on read failures. So I have changed > >> that now, new patch attached. > > > > I'm confused- if previously it was continueing to the next block instead > > of doing the re-read on the same block, why don't we just change it to > > do the re-read on the same block properly and see if that fixes the > > retry, instead of just giving up and skipping..? I'm not necessairly > > against skipping to the next file, to be clear, but I think I'd be > > happier if we kept reading the file until we actually get EOF. > > > > (I've not looked at the actual patch, just read what you wrote..) > > > > Notice that those two errors are actually for two consecutive blocks in > the same file. So what probably happened is that postgres started to > extend the page, and the verification tried to read the last page after > the kernel added just the first 4kB filesystem page. Then it probably > succeeded on a retry, and then the same thing happened on the next page. > > I don't think EOF addresses this, though - the partial read happens > before we actually reach the end of the file. > > And re-reads are not a solution either, because the second read may > still see only the first half, and then what - is it a permanent issue > (in which case it's a data corruption), or an extension in progress? > > I wonder if we can simply ignore those errors entirely, if it's the last > page in the segment? We can't really check the file is "complete" > anyway, e.g. if you have multiple segments for a table, and the "middle" > one is a page shorter, we'll happily ignore that during verification. > > Also, what if we're reading a file and it gets truncated (e.g. 
after > vacuum notices the last few pages are empty)? Doesn't that have the same > issue? I gotta say, my conclusion from this debate is that it's simply a mistake to do this without involvement of the server that can use locking to prevent these kind of issues. It seems pretty absurd to me to have hacky workarounds around partial writes of a live server, around truncation, etc, even though the server has ways to deal with that. - Andres
On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote: > I gotta say, my conclusion from this debate is that it's simply a > mistake to do this without involvement of the server that can use > locking to prevent these kind of issues. It seems pretty absurd to me > to have hacky workarounds around partial writes of a live server, around > truncation, etc, even though the server has ways to deal with that. I agree with Andres on this one. We are never going to make this stuff safe if we don't handle page reads with the proper locks because of torn pages. What I think we should do is provide a SQL function which reads a page in shared mode, and then checks its checksum if its LSN is older than the previous redo point. This discards cases with rather hot pages, but if the page is hot enough then the backend re-reading the page would just do the same by verifying the page checksum by itself. -- Michael
Attachment
On 3/3/19 12:48 AM, Michael Paquier wrote: > On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote: >> I gotta say, my conclusion from this debate is that it's simply a >> mistake to do this without involvement of the server that can use >> locking to prevent these kind of issues. It seems pretty absurd to me >> to have hacky workarounds around partial writes of a live server, around >> truncation, etc, even though the server has ways to deal with that. > > I agree with Andres on this one. We are never going to make this > stuff safe if we don't handle page reads with the proper locks because > of torn pages. What I think we should do is provide a SQL function > which reads a page in shared mode, and then checks its checksum if its > LSN is older than the previous redo point. This discards cases with > rather hot pages, but if the page is hot enough then the backend > re-reading the page would just do the same by verifying the page > checksum by itself. Handling torn pages is not difficult, and the patch already does that (it reads the LSN of the last checkpoint from the control file, and uses it the same way basebackup does). That's working since (at least) September, so I don't see how the SQL function would help with this? The other issue (raised recently) is partial reads, where we read only a fraction of the page. Basebackup simply ignores such pages, likely on the assumption that it's either concurrent extension or truncation (in which case it's newer than the last checkpoint LSN anyway). So maybe we should do the same thing here. As I mentioned before, we can't reliably detect incomplete segments anyway (at least I believe that's the case). You and Andres may be right that trying to verify checksums online without close interaction with the server is ultimately futile (or at least overly complex). But I'm not sure those issues (torn pages and partial reads) are very good arguments, considering basebackup has to deal with them too. Not sure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
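To spell out the torn-page rule referred to above, the check used by the patch and by basebackup amounts to roughly the following sketch. The function name is made up, checkpointLSN is assumed to have been read from pg_control before the scan started, and blkno must include the segment offset because that is what pg_checksum_page() expects.

    #include "postgres_fe.h"

    #include "storage/bufpage.h"
    #include "storage/checksum_impl.h"

    /*
     * A checksum mismatch on a page whose LSN is newer than the start of the
     * last checkpoint is not reported: the page is being rewritten
     * concurrently, and if it really were torn on disk it would be restored
     * from a full-page image during WAL replay anyway.
     */
    static bool
    checksum_ok_or_in_flux(char *page, BlockNumber blkno, XLogRecPtr checkpointLSN)
    {
        PageHeader  phdr = (PageHeader) page;

        if (PageIsNew(page))
            return true;        /* no checksum to verify yet */

        if (pg_checksum_page(page, blkno) == phdr->pd_checksum)
            return true;

        return PageGetLSN(page) > checkpointLSN;
    }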
Bonjour Michaël, >> I gotta say, my conclusion from this debate is that it's simply a >> mistake to do this without involvement of the server that can use >> locking to prevent these kind of issues. It seems pretty absurd to me >> to have hacky workarounds around partial writes of a live server, around >> truncation, etc, even though the server has ways to deal with that. > > I agree with Andres on this one. We are never going to make this stuff > safe if we don't handle page reads with the proper locks because of torn > pages. What I think we should do is provide a SQL function which reads a > page in shared mode, and then checks its checksum if its LSN is older > than the previous redo point. This discards cases with rather hot > pages, but if the page is hot enough then the backend re-reading the > page would just do the same by verifying the page checksum by itself. -- > Michael My 0.02€ about that, as one of the reviewers of the patch: I agree that having a server function (extension?) to do a full checksum verification, possibly bandwidth-controlled, would be a good thing. However it would have side effects, such as interfering deeply with the server page cache, which may or may not be desirable. On the other hand I also see value in an independent system-level external tool capable of a best effort checksum verification: the current check that the cluster is offline to prevent pg_verify_checksums from running is kind of artificial, and when online, simply counting online-database-related checksum issues looks like a reasonable compromise. So basically I think that allowing pg_verify_checksums to run on an online cluster is still a good thing, provided that expected errors are correctly handled. -- Fabien.
Hi, Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost: > * Michael Banck (michael.banck@credativ.de) wrote: > > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > > > <michael.banck@credativ.de> wrote: > > > > I have added a retry for this as well now, without a pg_sleep() as well. > > > > This catches around 80% of the half-reads, but a few slip through. At > > > > that point we bail out with exit(1), and the user can try again, which I > > > > think is fine? > > > > > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > > > robust at all. > > > > The chance that pg_verify_checksums hits a torn page (at least in my > > tests, see below) is already pretty low, a couple of times per 1000 > > runs. Maybe 4 out 5 times, the page is read fine on retry and we march > > on. Otherwise, we now just issue a warning and skip the file (or so was > > the idea, see below), do you think that is not acceptable? > > > > I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and > > pg_verify_checksums in tight loops) with the current patch version, and > > I am seeing short reads very, very rarely (maybe every 1000th run) with > > a warning like: > > > > > 1174 > > > pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192 > > > pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192 > > > Files skipped: 2 > > > > The 1174 is the sequence number, the first 1173 runs of > > pg_verify_checksums only skipped blocks. > > > > However, the fact it shows two warnings for the same file means there is > > something wrong here. It was continueing to the next block while I think > > it should just skip to the next file on read failures. So I have changed > > that now, new patch attached. > > I'm confused- if previously it was continueing to the next block instead > of doing the re-read on the same block, why don't we just change it to > do the re-read on the same block properly and see if that fixes the > retry, instead of just giving up and skipping..? It was re-reading the block and continuing to read the file after it got a short read even on re-read. > I'm not necessairly against skipping to the next file, to be clear, > but I think I'd be happier if we kept reading the file until we > actually get EOF. So if we read half a block twice we should seek() to the next block and continue till EOF, ok. I think in most cases those pages will be new anyway and there will be no checksum check, but it sounds like a cleaner approach. I've seen one or two examples where we did successfully verify the checksum of a page after a half-read, so it might be worth it. The alternative would be to just bail out early and skip the file on the first short read and (possibly) log a skipped file. I still think that an external checksum verification tool has some merit, given that basebackup does it and the current offline requirement is really not useful in practice. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote: > You and Andres may be right that trying to verify checksums online > without close interaction with the server is ultimately futile (or at > least overly complex). But I'm not sure those issues (torn pages and > partial reads) are very good arguments, considering basebackup has to > deal with them too. Not sure. FWIW, I don't think that the backend is right either in the way it currently checks checksums, with warnings and a limited set of failures generated. I raised concerns about that unfortunately after 11 had been GA'ed, which was too late, so this time, for this patch, I prefer raising them before the fact and I'd rather not spread this kind of methodology around the core code more and more. I work a lot with virtualization, and I have seen ESX hanging around I/O requests from time to time depending on the environment used (which is actually wrong, anyway, but a lot of tests happen on a daily basis on the stuff I work on). What's presented on this thread is *never* going to be 100% safe, and would generate false positives which can be confusing for the user. This is not a good sign. -- Michael
Attachment
On Sun, Mar 03, 2019 at 11:51:48AM +0100, Michael Banck wrote: > I still think that an external checksum verification tool has some > merit, given that basebackup does it and the current offline requirement > is really not useful in practise. I am not going to argue again about the way checksum verification is done in a base backup.. :) Being able to do an online verification of checksums has a lot of value, do not take me wrong, and an SQL interface to do that does not prevent having a frontend wrapper using it. -- Michael
Attachment
On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote: > I agree that having a server function (extension?) to do a full checksum > verification, possibly bandwidth-controlled, would be a good thing. However > it would have side effects, such as interfering deeply with the server page > cache, which may or may not be desirable. In what is that different from VACUUM or a sequential scan? It is possible to use buffer ring replacement strategies in such cases using the normal clock-sweep algorithm, so that scanning a range of pages does not really impact Postgres shared buffer cache. -- Michael
Attachment
Bonjour Michaël, >> I agree that having a server function (extension?) to do a full checksum >> verification, possibly bandwidth-controlled, would be a good thing. However >> it would have side effects, such as interfering deeply with the server page >> cache, which may or may not be desirable. > > In what is that different from VACUUM or a sequential scan? Scrubbing would read all files, not only relation data? I'm unsure about what VACUUM does, but it is probably pretty similar. > It is possible to use buffer ring replacement strategies in such cases > using the normal clock-sweep algorithm, so that scanning a range of > pages does not really impact Postgres shared buffer cache. Good! I did not know that there was an existing strategy to avoid filling the cache. -- Fabien.
On Mon, Mar 4, 2019, 04:10 Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> I agree that having a server function (extension?) to do a full checksum
> verification, possibly bandwidth-controlled, would be a good thing. However
> it would have side effects, such as interfering deeply with the server page
> cache, which may or may not be desirable.
In what is that different from VACUUM or a sequential scan? It is
possible to use buffer ring replacement strategies in such cases using
the normal clock-sweep algorithm, so that scanning a range of pages
does not really impact Postgres shared buffer cache.
Yeah, I wouldn't worry too much about the effect on the postgres cache when that is done. It could of course have a much worse impact on the os cache or on the "smart" (aka dumb) storage system cache. But that effect will be there just as much with a separate tool.
/Magnus
On 3/4/19 4:09 AM, Michael Paquier wrote: > On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote: >> I agree that having a server function (extension?) to do a full checksum >> verification, possibly bandwidth-controlled, would be a good thing. However >> it would have side effects, such as interfering deeply with the server page >> cache, which may or may not be desirable. > > In what is that different from VACUUM or a sequential scan? It is > possible to use buffer ring replacement strategies in such cases using > the normal clock-sweep algorithm, so that scanning a range of pages > does not really impact Postgres shared buffer cache. > -- But Fabien was talking about page cache, not shared buffers. And we can't use custom ring buffer there. OTOH I don't see why accessing the file through SQL function would behave any differently than direct access (i.e. what the tool does now). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/4/19 2:00 AM, Michael Paquier wrote: > On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote: >> You and Andres may be right that trying to verify checksums online >> without close interaction with the server is ultimately futile (or at >> least overly complex). But I'm not sure those issues (torn pages and >> partial reads) are very good arguments, considering basebackup has to >> deal with them too. Not sure. > > FWIW, I don't think that the backend is right in its way of checking > checksums the way it does currently either with warnings and a limited > set of failures generated. I raised concerns about that unfortunately > after 11 has been GA'ed, which was too late, so this time, for this > patch, I prefer raising them before the fact and I'd rather not spread > this kind of methodology around the core code more and more. I still don't understand what issue you see in how basebackup verifies checksums. Can you point me to the explanation you've sent after 11 was released? > I work a lot with virtualization, and I have seen ESX hanging around > I/O requests from time to time depending on the environment used > (which is actually wrong, anyway, but a lot of tests happen on a > daily basis on the stuff I work on). What's presented on this thread > is *never* going to be 100% safe, and would generate false positives > which can be confusing for the user. This is not a good sign. So you have a workload/configuration that actually results in data corruption yet we fail to detect that? Or we generate false positives? Or what do you mean by "100% safe" here? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 4, 2019 at 3:02 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
>
> In what is that different from VACUUM or a sequential scan? It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact Postgres shared buffer cache.
> --
But Fabien was talking about page cache, not shared buffers. And we
can't use custom ring buffer there. OTOH I don't see why accessing the
file through SQL function would behave any differently than direct
access (i.e. what the tool does now).
It shouldn't.
One other thought that I had around this though, which if it's been covered before and I missed it, please disregard :)
The *online* version of the tool is very similar to running pg_basebackup to /dev/null, is it not? Except it doesn't set the cluster to backup mode. Perhaps what we really want is a simpler way to do *that*. That wouldn't necessarily make it a SQL callable function, but it would be a CLI tool that would call a command on a walsender for example.
(We'd of course still need the standalone tool for offline checks)
On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote: > I still don't understand what issue you see in how basebackup verifies > checksums. Can you point me to the explanation you've sent after 11 was > released? The history is mostly on this thread: https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz > So you have a workload/configuration that actually results in data > corruption yet we fail to detect that? Or we generate false positives? > Or what do you mean by "100% safe" here? What's proposed on this thread could generate false positives. Checks which have deterministic properties and clean failure handling are reliable when it comes to reports. -- Michael
Attachment
On 3/5/19 4:12 AM, Michael Paquier wrote: > On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote: >> I still don't understand what issue you see in how basebackup verifies >> checksums. Can you point me to the explanation you've sent after 11 was >> released? > > The history is mostly on this thread: > https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz > Thanks, will look. Based on quickly skimming that thread the main issue seems to be deciding which files in the data directory are expected to have checksums. Which is a valid issue, of course, but I was expecting something about partial read/writes etc. >> So you have a workload/configuration that actually results in data >> corruption yet we fail to detect that? Or we generate false positives? >> Or what do you mean by "100% safe" here? > > What's proposed on this thread could generate false positives. Checks > which have deterministic properties and clean failure handling are > reliable when it comes to reports. My understanding is that: (a) The checksum verification should not generate false positives (same as for basebackup). (b) The partial reads do emit warnings, which might be considered false positives I guess. Which is why I'm arguing for changing it to do the same thing basebackup does, i.e. ignore this. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote: > Based on quickly skimming that thread the main issue seems to be > deciding which files in the data directory are expected to have > checksums. Which is a valid issue, of course, but I was expecting > something about partial read/writes etc. I remember complaining about partial write handling as well for the base backup checks... There should be an email about it on the list, cannot find it now ;p > My understanding is that: > > (a) The checksum verification should not generate false positives (same > as for basebackup). > > (b) The partial reads do emit warnings, which might be considered false > positives I guess. Which is why I'm arguing for changing it to do the > same thing basebackup does, i.e. ignore this. Well, at least that's consistent... Argh, I really think that we ought to make the reported failures harder (i.e. errors rather than warnings) because that's easier to detect within a tool, and some deployments set log_min_messages > WARNING so checksum failures would just be lost. For base backups we don't care much about that as files are just blindly copied so they could have torn pages, which is fine as that's fixed at replay. Now we are talking about a set of tools which could have reliable detection mechanisms for those problems. -- Michael
Attachment
Greetings,
On Tue, Mar 5, 2019 at 18:36 Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.
I remember complaining about partial write handling as well for the
base backup checks... There should be an email about it on the list,
cannot find it now ;p
> My understanding is that:
>
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
>
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.
Well, at least that's consistent... Argh, I really think that we
ought to make the failures reported harder because that's easier to
detect within a tool and some deployments set log_min_messages >
WARNING so checksum failures would just be lost. For base backups we
don't care much about that as files are just blindly copied so they
could have torn pages, which is fine as that's fixed at replay. Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.
I’m traveling but will try to comment more in the coming days; in general I agree with Tomas on these items. Also, pg_basebackup has to handle torn pages when it comes to checksums just like the verify tool does, and having them be consistent (along with external tools) would really be for the best, imv. I still feel like a retry of a short read (try reading more to get the whole page..) would be alright and reading until we hit eof and then moving on. I’m not sure it’s possible but I do worry a bit that we might get a short read from a network file system or something that isn’t actually at eof and then we would skip a significant remaining portion of the file... another thought might be to stat the file after we have opened it to see its length...
Just a few thoughts since I’m on my phone. Will try to write up something more in a day or two.
Thanks!
Stephen
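A small sketch of the "stat the file after opening it" idea from the message above; the helper name is made up and POSIX fstat() is assumed.

    #include "postgres_fe.h"

    #include <sys/stat.h>

    /*
     * Remember how many blocks the segment had at open time, so a later
     * short read can be told apart from a genuine end of file.
     */
    static int64
    blocks_at_open(int fd)
    {
        struct stat st;

        if (fstat(fd, &st) < 0)
            return -1;          /* let the caller decide how to report it */
        return (int64) (st.st_size / BLCKSZ);
    }

The caller could then treat a short read well before that length as a real problem, while a short read at or beyond the last block known at open time is more likely a concurrent extension or truncation.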
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > FWIW I don't think this qualifies as torn page - i.e. it's not a full > read with a mix of old and new data. This is partial write, most likely > because we read the blocks one by one, and when we hit the last page > while the table is being extended, we may only see the fist 4kB. And if > we retry very fast, we may still see only the first 4kB. I see the distinction you're making, and you're right. The problem is, whether in this case or whether for a real torn page, we don't seem to have a way to distinguish between a state that occurs transiently due to lack of synchronization and a situation that is permanent and means that we have corruption. And that worries me, because it means we'll either report bogus complaints that will scare easily-panicked users (and anybody who is running this tool has a good chance of being in the "easily-panicked" category ...), or else we'll skip reporting real problems. Neither is good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote: > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > > <michael.banck@credativ.de> wrote: > > > I have added a retry for this as well now, without a pg_sleep() as well. > > > This catches around 80% of the half-reads, but a few slip through. At > > > that point we bail out with exit(1), and the user can try again, which I > > > think is fine? > > > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > > robust at all. > > The chance that pg_verify_checksums hits a torn page (at least in my > tests, see below) is already pretty low, a couple of times per 1000 > runs. Maybe 4 out 5 times, the page is read fine on retry and we march > on. Otherwise, we now just issue a warning and skip the file (or so was > the idea, see below), do you think that is not acceptable? Yeah. Consider a paranoid customer with 100 clusters who runs this every day on every cluster. They're going to see failures every day or three and go ballistic. I suspect that better retry logic might help here. I mean, I would guess that 10 retries at 1 second intervals or something of that sort would be enough to virtually eliminate false positives while still allowing us to report persistent -- and thus real -- problems. But if even that is going to produce false positives with any measurable probability different from zero, then I think we have a problem, because I neither like a verification tool that ignores possible signs of trouble nor one that "cries wolf" when things are fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
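A sketch of that retry policy follows; the constants and the callback are hypothetical, and whether spending up to ten seconds per suspicious block is acceptable is exactly the trade-off being discussed here.

    #include "postgres_fe.h"

    #include <unistd.h>

    /*
     * read_block() stands in for whatever lseek()/read()/checksum logic the
     * tool runs for one block, returning true once the block verifies
     * cleanly.  Only a persistent failure is reported to the user.
     */
    #define MAX_ATTEMPTS    10
    #define RETRY_DELAY_S   1

    static bool
    verify_with_retries(bool (*read_block) (void *arg), void *arg)
    {
        int         attempt;

        for (attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)
        {
            if (read_block(arg))
                return true;            /* clean read and clean checksum */
            if (attempt < MAX_ATTEMPTS)
                sleep(RETRY_DELAY_S);   /* POSIX sleep() for simplicity */
        }
        return false;                   /* persistent failure: report it */
    }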
On 2019-03-06 12:33:49 -0500, Robert Haas wrote: > On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote: > > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: > > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > > > <michael.banck@credativ.de> wrote: > > > > I have added a retry for this as well now, without a pg_sleep() as well. > > > > This catches around 80% of the half-reads, but a few slip through. At > > > > that point we bail out with exit(1), and the user can try again, which I > > > > think is fine? > > > > > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > > > robust at all. > > > > The chance that pg_verify_checksums hits a torn page (at least in my > > tests, see below) is already pretty low, a couple of times per 1000 > > runs. Maybe 4 out 5 times, the page is read fine on retry and we march > > on. Otherwise, we now just issue a warning and skip the file (or so was > > the idea, see below), do you think that is not acceptable? > > Yeah. Consider a paranoid customer with 100 clusters who runs this > every day on every cluster. They're going to see failures every day > or three and go ballistic. +1 > I suspect that better retry logic might help here. I mean, I would > guess that 10 retries at 1 second intervals or something of that sort > would be enough to virtually eliminate false positives while still > allowing us to report persistent -- and thus real -- problems. But if > even that is going to produce false positives with any measurable > probability different from zero, then I think we have a problem, > because I neither like a verification tool that ignores possible signs > of trouble nor one that "cries wolf" when things are fine. To me the right way seems to be to IO lock the page via PG after such a failure, and then retry. Which should be relatively easily doable for the basebackup case, but obviously harder for the pg_verify_checksums case. Greetings, Andres Freund
On 3/6/19 6:26 PM, Robert Haas wrote: > On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> FWIW I don't think this qualifies as torn page - i.e. it's not a full >> read with a mix of old and new data. This is partial write, most likely >> because we read the blocks one by one, and when we hit the last page >> while the table is being extended, we may only see the fist 4kB. And if >> we retry very fast, we may still see only the first 4kB. > > I see the distinction you're making, and you're right. The problem > is, whether in this case or whether for a real torn page, we don't > seem to have a way to distinguish between a state that occurs > transiently due to lack of synchronization and a situation that is > permanent and means that we have corruption. And that worries me, > because it means we'll either report bogus complaints that will scare > easily-panicked users (and anybody who is running this tool has a good > chance of being in the "easily-panicked" category ...), or else we'll > skip reporting real problems. Neither is good. > Sure, I'd also prefer having a tool that reliably detects all cases of data corruption, and I certainly do share your concerns about false positives and false negatives. But maybe we shouldn't expect a tool meant to verify checksums to detect various other issues. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/6/19 6:42 PM, Andres Freund wrote: > On 2019-03-06 12:33:49 -0500, Robert Haas wrote: >> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote: >>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas: >>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck >>>> <michael.banck@credativ.de> wrote: >>>>> I have added a retry for this as well now, without a pg_sleep() as well. >>>>> This catches around 80% of the half-reads, but a few slip through. At >>>>> that point we bail out with exit(1), and the user can try again, which I >>>>> think is fine? >>>> >>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound >>>> robust at all. >>> >>> The chance that pg_verify_checksums hits a torn page (at least in my >>> tests, see below) is already pretty low, a couple of times per 1000 >>> runs. Maybe 4 out 5 times, the page is read fine on retry and we march >>> on. Otherwise, we now just issue a warning and skip the file (or so was >>> the idea, see below), do you think that is not acceptable? >> >> Yeah. Consider a paranoid customer with 100 clusters who runs this >> every day on every cluster. They're going to see failures every day >> or three and go ballistic. > > +1 > > >> I suspect that better retry logic might help here. I mean, I would >> guess that 10 retries at 1 second intervals or something of that sort >> would be enough to virtually eliminate false positives while still >> allowing us to report persistent -- and thus real -- problems. But if >> even that is going to produce false positives with any measurable >> probability different from zero, then I think we have a problem, >> because I neither like a verification tool that ignores possible signs >> of trouble nor one that "cries wolf" when things are fine. > > To me the right way seems to be to IO lock the page via PG after such a > failure, and then retry. Which should be relatively easily doable for > the basebackup case, but obviously harder for the pg_verify_checksums > case. > Yes, if we could ensure the retry happens after completing the current I/O on the page (without actually initiating a read into shared buffers) that would work I think - both for partial reads and torn pages. Not sure how to integrate it into the CLI tool, though. Perhaps it could require connection info so that it can execute a function, when executed in online mode? cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote: > Not sure how to integrate it into the CLI tool, though. Perhaps we it > could require connection info so that it can execute a function, when > executed in online mode? To me the right fix would be to simply have this run as part of the cluster / in a function. I don't see much point in running this outside of the cluster. Greetings, Andres Freund
On 3/6/19 8:41 PM, Andres Freund wrote: > Hi, > > On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote: >> Not sure how to integrate it into the CLI tool, though. Perhaps we it >> could require connection info so that it can execute a function, when >> executed in online mode? > > To me the right fix would be to simply have this run as part of the > cluster / in a function. I don't see much point in running this outside > of the cluster. > Not sure. AFAICS that would require a single transaction, and if we happen to add some sort of throttling (which is a feature request I'd expect pretty soon to make it usable on live clusters) that might be quite long-running. So, not great. If we want to run it from the server itself, then I guess a background worker would be a better solution. Incidentally, that's something I was toying with some time ago, see [1]. [1] https://github.com/tvondra/scrub -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote: > Not sure. AFAICS that would to require a single transaction, and if we > happen to add some sort of throttling (which is a feature request I'd > expect pretty soon to make it usable on live clusters) that might be > quite long-running. So, not great. > > If we want to run it from the server itself, then I guess a background > worker would be a better solution. Incidentally, that's something I've > been toying with some time ago, see [1]. It does not prevent having a SQL function which acts as a wrapper on top of the whole routine logic, does it? I think that it would be nice to have the possibility to target a specific relation and a specific page, as well as being able to check fully a relation at once. It gets easier to check for page ranges this way, and the throttling can be part of the function doing a full-relation check. -- Michael
On 3/6/19 6:42 PM, Andres Freund wrote: > > ... > > To me the right way seems to be to IO lock the page via PG after such a > failure, and then retry. Which should be relatively easily doable for > the basebackup case, but obviously harder for the pg_verify_checksums > case. > Actually, what do you mean by "IO lock the page"? Just waiting for the current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially acquiring a lock and holding it for the duration of the check? The former does not really help, because there might be another I/O request initiated right after, interfering with the retry. The latter might work, assuming the check is fast (which it probably is). I wonder if this might cause issues due to loading possibly corrupted data (with invalid checksums) into shared buffers. But then again, we could just hack a special version of ReadBuffer_common() which would just (a) check if a page is in shared buffers, and if it is then consider the checksum correct (because in memory it may be stale, and it was read successfully so it was OK at that moment) (b) if it's not in shared buffers already, try reading it and verify the checksum, and then just evict it right away (not to spoil sb) Or did you have something else in mind? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote: > On 3/6/19 6:42 PM, Andres Freund wrote: > > > > ... > > > > To me the right way seems to be to IO lock the page via PG after such a > > failure, and then retry. Which should be relatively easily doable for > > the basebackup case, but obviously harder for the pg_verify_checksums > > case. > > > > Actually, what do you mean by "IO lock the page"? Just waiting for the > current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially > acquiring a lock and holding it for the duration of the check? The latter. And with IO lock I meant BufferDescriptorGetIOLock(), in contrast to a buffer's content lock. That way we wouldn't block modifications to the in-memory page. > The former does not really help, because there might be another I/O request > initiated right after, interfering with the retry. > > The latter might work, assuming the check is fast (which it probably is). I > wonder if this might cause issues due to loading possibly corrupted data > (with invalid checksums) into shared buffers. Oh, I was basically thinking that we'd just reread from disk outside of postgres in that case, while preventing postgres related IO by holding the IO lock. But: > But then again, we could just > hack a special version of ReadBuffer_common() which would just > (a) check if a page is in shared buffers, and if it is then consider the > checksum correct (because in memory it may be stale, and it was read > successfully so it was OK at that moment) > > (b) if it's not in shared buffers already, try reading it and verify the > checksum, and then just evict it right away (not to spoil sb) This'd also make sense and make the whole process more efficient. OTOH, it might actually be worthwhile to check the on-disk page even if there's in-memory state. Unless IO is in progress the on-disk page always should be valid. Greetings, Andres Freund
Hi, Am Sonntag, den 03.03.2019, 11:51 +0100 schrieb Michael Banck: > Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost: > > I'm not necessairly against skipping to the next file, to be clear, > > but I think I'd be happier if we kept reading the file until we > > actually get EOF. > > So if we read half a block twice we should seek() to the next block and > continue till EOF, ok. I think in most cases those pages will be new > anyway and there will be no checksum check, but it sounds like a cleaner > approach. I've seen one or two examples where we did successfully verify > the checksum of a page after a half-read, so it might be worth it. I've done that now, i.e. it seeks to the next block and continues to read there (possibly getting an EOF). I don't issue a warning for this skipped block anymore as it is somewhat to be expected that we see some half-reads. If the seek fails for some reason, that is still a warning. > I still think that an external checksum verification tool has some > merit, given that basebackup does it and the current offline requirement > is really not useful in practise. I've read the rest of the thread, and it seems several people prefer a solution that interacts with the server. I won't be able to work on that for v12 and I guess it would be too late in the cycle anyway. I thought about I/O throttling in online mode, but it seems to be most easily tied in with the progress reporting (that already keeps track of everything or most of what we'd need), so I will work on it in that context. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote: > > > > But then again, we could just > > hack a special version of ReadBuffer_common() which would just > > > (a) check if a page is in shared buffers, and if it is then consider the > > checksum correct (because in memory it may be stale, and it was read > > successfully so it was OK at that moment) > > > > (b) if it's not in shared buffers already, try reading it and verify the > > checksum, and then just evict it right away (not to spoil sb) > > This'd also make sense and make the whole process more efficient. OTOH, > it might actually be worthwhile to check the on-disk page even if > there's in-memory state. Unless IO is in progress the on-disk page > always should be valid. Definitely. I already saw servers with all-frozen-read-only blocks popular enough to never get evicted in months, and then a minor upgrade / restart having catastrophic consequences.
On 3/8/19 4:19 PM, Julien Rouhaud wrote: > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote: >> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote: >>> >>> But then again, we could just >>> hack a special version of ReadBuffer_common() which would just >> >>> (a) check if a page is in shared buffers, and if it is then consider the >>> checksum correct (because in memory it may be stale, and it was read >>> successfully so it was OK at that moment) >>> >>> (b) if it's not in shared buffers already, try reading it and verify the >>> checksum, and then just evict it right away (not to spoil sb) >> >> This'd also make sense and make the whole process more efficient. OTOH, >> it might actually be worthwhile to check the on-disk page even if >> there's in-memory state. Unless IO is in progress the on-disk page >> always should be valid. > > Definitely. I already saw servers with all-frozen-read-only blocks > popular enough to never get evicted in months, and then a minor > upgrade / restart having catastrophic consequences. > Do I understand correctly the "catastrophic consequences" here are due to data corruption / broken checksums on those on-disk pages? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On 3/8/19 4:19 PM, Julien Rouhaud wrote: > > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres@anarazel.de> wrote: > >> > >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote: > >>> > >>> But then again, we could just > >>> hack a special version of ReadBuffer_common() which would just > >> > >>> (a) check if a page is in shared buffers, and if it is then consider the > >>> checksum correct (because in memory it may be stale, and it was read > >>> successfully so it was OK at that moment) > >>> > >>> (b) if it's not in shared buffers already, try reading it and verify the > >>> checksum, and then just evict it right away (not to spoil sb) > >> > >> This'd also make sense and make the whole process more efficient. OTOH, > >> it might actually be worthwhile to check the on-disk page even if > >> there's in-memory state. Unless IO is in progress the on-disk page > >> always should be valid. > > > > Definitely. I already saw servers with all-frozen-read-only blocks > > popular enough to never get evicted in months, and then a minor > > upgrade / restart having catastrophic consequences. > > > > Do I understand correctly the "catastrophic consequences" here are due > to data corruption / broken checksums on those on-disk pages? Ah, yes sorry I should have been clearer. Indeed, there was silent data corruptions (no ckecksum though) that was revealed by the restart. So a routine minor update resulted in a massive outage. Such a scenario can't be avoided if we always bypass checksum check for alreay in shared_buffers pages.
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 3/2/19 12:03 AM, Robert Haas wrote: > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck > > <michael.banck@credativ.de> wrote: > >> I have added a retry for this as well now, without a pg_sleep() as well. > >> This catches around 80% of the half-reads, but a few slip through. At > >> that point we bail out with exit(1), and the user can try again, which I > >> think is fine? > > > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound > > robust at all. > > FWIW I don't think this qualifies as torn page - i.e. it's not a full > read with a mix of old and new data. This is partial write, most likely > because we read the blocks one by one, and when we hit the last page > while the table is being extended, we may only see the fist 4kB. And if > we retry very fast, we may still see only the first 4kB. I really still am not following why this is such an issue- we do a read, get back 4KB, do another read, check if it's zero, and if so then we should be able to conclude that we're at the end of the file, no? If we're at the end of the file and we don't have a final complete block to run a checksum check on then it seems clear to me that the file was being extended and it's ok to skip that block. We could also stat the file and keep track of where we are, to detect such an extension of the file happening, if we wanted an additional cross-check, couldn't we? If we do a read and get 4KB back and then do another and get 4KB back, then we just treat it like we would an 8KB block. Really, as long as a subsequent read is returning bytes then we keep going, and if it returns zero then it's EOF. I could maybe see a "one final read" option, but I don't think it makes sense to have some kind of time-based delay around this where we keep trying to read. All of this about hacking up a way to connect to PG and lock pages in shared buffers so that we can perform a checksum check seems really rather ridiculous for either the extension case or the regular mid-file torn-page case. To be clear, I agree completely that we don't want to be reporting false positives or "this might mean corruption!" to users running the tool, but I haven't seen a good explaination of why this needs to involve the server to avoid that happening. If someone would like to point that out to me, I'd be happy to go read about it and try to understand. Thanks! Stephen
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > If we want to run it from the server itself, then I guess a background > worker would be a better solution. Incidentally, that's something I've > been toying with some time ago, see [1]. So, I'm a big fan of this idea of having a background worker that's running and (slowly, maybe configurably) scanning through the data directory checking for corrupted pages. I'd certainly prefer it if that background worker didn't fault those pages into shared buffers though, and I don't really think it should need to even check if a given page is currently being written out or is presently in shared buffers. Basically, I'd think it would work just fine to have it essentially do what I am imagining pg_checksums to do, but as a background worker. Thanks! Stephen
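For what it is worth, the registration side of such a worker is small. Below is a minimal sketch (module, library and function names are invented here; Tomas's scrub prototype linked above is the actual reference), assuming the worker's main loop walks the data directory much like pg_checksums does and sleeps between files for throttling:

#include "postgres.h"

#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* hypothetical worker entry point, defined elsewhere in the module */
PGDLLEXPORT void checksum_scrub_main(Datum main_arg);

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "checksum scrub worker");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "checksum_scrub");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "checksum_scrub_main");
	RegisterBackgroundWorker(&worker);
}

Since throttling and progress reporting would then live entirely inside the worker's main loop, this also sidesteps the single-long-transaction concern raised earlier for a plain SQL function.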
On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote: > To be clear, I agree completely that we don't want to be reporting false > positives or "this might mean corruption!" to users running the tool, > but I haven't seen a good explaination of why this needs to involve the > server to avoid that happening. If someone would like to point that out > to me, I'd be happy to go read about it and try to understand. The mentions on this thread that the server has all the facility in place to properly lock a buffer and make sure that a partial read *never* happens and that we *never* have any kind of false positives, directly preventing the set of issues we are trying to implement workarounds for in a frontend tool are rather good arguments in my opinion (you can grep for BufferDescriptorGetIOLock() on this thread for example). -- Michael
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote: > > To be clear, I agree completely that we don't want to be reporting false > > positives or "this might mean corruption!" to users running the tool, > > but I haven't seen a good explaination of why this needs to involve the > > server to avoid that happening. If someone would like to point that out > > to me, I'd be happy to go read about it and try to understand. > > The mentions on this thread that the server has all the facility in > place to properly lock a buffer and make sure that a partial read > *never* happens and that we *never* have any kind of false positives, Uh, we are, of course, going to have partial reads- we just need to handle them appropriately, and that's not hard to do in a way that we never have false positives. I do not understand, at all, the whole sub-thread argument that we have to avoid partial reads. We certainly don't worry about that when doing backups, and I don't see why we need to avoid it here. We are going to have partial reads- and that's ok, as long as it's because we're at the end of the file, and that's easy enough to check by just doing another read to see if we get back zero bytes, which indicates we're at the end of the file, and then we move on, no need to coordinate anything with the backend for this. > directly preventing the set of issues we are trying to implement > workarounds for in a frontend tool are rather good arguments in my > opinion (you can grep for BufferDescriptorGetIOLock() on this thread > for example). Sure the backend has those facilities since it needs to, but these frontend tools *don't* need that to *never* have any false positives, so why are we complicating things by saying that this frontend tool and the backend have to coordinate? If there's an explanation of why we can't avoid having false positives in the frontend tool, I've yet to see it. I definitely understand that we can get partial reads, but a partial read isn't a failure, and shouldn't be reported as such. Thanks! Stephen
Hi, Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost: > * Michael Paquier (michael@paquier.xyz) wrote: > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote: > > > To be clear, I agree completely that we don't want to be reporting false > > > positives or "this might mean corruption!" to users running the tool, > > > but I haven't seen a good explaination of why this needs to involve the > > > server to avoid that happening. If someone would like to point that out > > > to me, I'd be happy to go read about it and try to understand. > > > > The mentions on this thread that the server has all the facility in > > place to properly lock a buffer and make sure that a partial read > > *never* happens and that we *never* have any kind of false positives, > > Uh, we are, of course, going to have partial reads- we just need to > handle them appropriately, and that's not hard to do in a way that we > never have false positives. I think the current patch (V13 from https://www.postgresql.org/message-i d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs. > I do not understand, at all, the whole sub-thread argument that we have > to avoid partial reads. We certainly don't worry about that when doing > backups, and I don't see why we need to avoid it here. We are going to > have partial reads- and that's ok, as long as it's because we're at the > end of the file, and that's easy enough to check by just doing another > read to see if we get back zero bytes, which indicates we're at the end > of the file, and then we move on, no need to coordinate anything with > the backend for this. Well, I agree with you, but we don't seem to have consensus on that. > > directly preventing the set of issues we are trying to implement > > workarounds for in a frontend tool are rather good arguments in my > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread > > for example). > > Sure the backend has those facilities since it needs to, but these > frontend tools *don't* need that to *never* have any false positives, so > why are we complicating things by saying that this frontend tool and the > backend have to coordinate? > > If there's an explanation of why we can't avoid having false positives > in the frontend tool, I've yet to see it. I definitely understand that > we can get partial reads, but a partial read isn't a failure, and > shouldn't be reported as such. It is not in the current patch, it should just get reported as a skipped block in the end. If the cluster is online that is, if it is offline, we do consider it a failure. I have now rebased that patch on top of the pg_verify_checksums -> pg_checksums renaming, see attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings, * Michael Banck (michael.banck@credativ.de) wrote: > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost: > > * Michael Paquier (michael@paquier.xyz) wrote: > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote: > > > > To be clear, I agree completely that we don't want to be reporting false > > > > positives or "this might mean corruption!" to users running the tool, > > > > but I haven't seen a good explaination of why this needs to involve the > > > > server to avoid that happening. If someone would like to point that out > > > > to me, I'd be happy to go read about it and try to understand. > > > > > > The mentions on this thread that the server has all the facility in > > > place to properly lock a buffer and make sure that a partial read > > > *never* happens and that we *never* have any kind of false positives, > > > > Uh, we are, of course, going to have partial reads- we just need to > > handle them appropriately, and that's not hard to do in a way that we > > never have false positives. > > I think the current patch (V13 from https://www.postgresql.org/message-i > d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs. I think the question here is- do you ever see false positives with this latest version..? If you are, then that's an issue and we should discuss and try to figure out what's happening. If you aren't seeing false positives, then it seems like we're done here, right? > > I do not understand, at all, the whole sub-thread argument that we have > > to avoid partial reads. We certainly don't worry about that when doing > > backups, and I don't see why we need to avoid it here. We are going to > > have partial reads- and that's ok, as long as it's because we're at the > > end of the file, and that's easy enough to check by just doing another > > read to see if we get back zero bytes, which indicates we're at the end > > of the file, and then we move on, no need to coordinate anything with > > the backend for this. > > Well, I agree with you, but we don't seem to have consensus on that. I feel like everyone is concerned that we'd report an acceptable partial read as a failure, hence it would be a false positive, and I agree entirely that we don't want false positives, but the answer to that seems to be that we shouldn't report partial reads as failures, solving the problem in a simple way that doesn't involve the server and doesn't materially reduce the check that's being performed. > > > directly preventing the set of issues we are trying to implement > > > workarounds for in a frontend tool are rather good arguments in my > > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread > > > for example). > > > > Sure the backend has those facilities since it needs to, but these > > frontend tools *don't* need that to *never* have any false positives, so > > why are we complicating things by saying that this frontend tool and the > > backend have to coordinate? > > > > If there's an explanation of why we can't avoid having false positives > > in the frontend tool, I've yet to see it. I definitely understand that > > we can get partial reads, but a partial read isn't a failure, and > > shouldn't be reported as such. > > It is not in the current patch, it should just get reported as a skipped > block in the end. If the cluster is online that is, if it is offline, > we do consider it a failure. Ok, that sounds fine- and do we ever see false positives now? 
> I have now rebased that patch on top of the pg_verify_checksums -> > pg_checksums renaming, see attached. Thanks for that. Reading through the code though, I don't entirely understand why we're making things complicated for ourselves by trying to seek and re-read the entire block, specifically this: > if (r != BLCKSZ) > { > - fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"), > - progname, blockno, fn, r, BLCKSZ); > - exit(1); > + if (online) > + { > + if (block_retry) > + { > + /* We already tried once to reread the block, skip to the next block */ > + skippedblocks++; > + if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1) > + { > + skippedfiles++; > + fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"), > + progname, fn); > + return; > + } > + continue; > + } > + > + /* > + * Retry the block. It's possible that we read the block while it > + * was extended or shrinked, so it it ends up looking torn to us. > + */ > + > + /* > + * Seek back by the amount of bytes we read to the beginning of > + * the failed block. > + */ > + if (lseek(f, -r, SEEK_CUR) == -1) > + { > + skippedfiles++; > + fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"), > + progname, fn); > + return; > + } > + > + /* Set flag so we know a retry was attempted */ > + block_retry = true; > + > + /* Reset loop to validate the block again */ > + blockno--; > + > + continue; > + } I would think that we could just do: insert_location = 0; r = read(BLCKSIZE - insert_location); if (r < 0) error(); if (r == 0) EOF detected, move to next if (r < (BLCKSIZE - insert_location)) { insert_location += r; continue; } At this point, we should have a full block, do our checks... Have you seen cases where the kernel will actually return a partial read for something that isn't at the end of the file, and where you could actually lseek past that point and read the next block? I'd be really curious to see that if you can reproduce it... I've definitely seen empty pages come back with a claim that the full amount was read, but that's a very different thing. Obviously the same goes for anywhere else we're trying to handle a partial read return from.. Thanks! Stephen
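To make the suggestion above concrete, here is a rough standalone sketch of that loop (not the code from the attached patch; the helper name and return convention are invented). It resumes after partial reads at the current offset, treats a zero-byte read at a block boundary as EOF, and flags a partial block followed by EOF so the caller can count it as skipped rather than corrupt:

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

#define BLCKSZ 8192				/* normally comes from pg_config.h */

/*
 * Read one block from fd into buf, resuming after partial reads.
 * Returns 1 when a full block was read, 0 at EOF, -1 on a read error.
 * A partial block immediately followed by EOF (relation being extended
 * under us) sets *partial so the caller can count it as skipped.
 */
static int
read_block(int fd, char *buf, bool *partial)
{
	int			off = 0;

	*partial = false;
	while (off < BLCKSZ)
	{
		ssize_t		r = read(fd, buf + off, BLCKSZ - off);

		if (r < 0)
		{
			if (errno == EINTR)
				continue;
			return -1;			/* genuine I/O error, report it */
		}
		if (r == 0)
		{
			if (off > 0)
				*partial = true;	/* short block at EOF, skip it */
			return 0;
		}
		off += r;				/* resume the next read at buf + off */
	}
	return 1;					/* full block, go verify its checksum */
}

The caller would then verify the checksum only on a return value of 1, bump a skipped-blocks counter when the partial flag is set, and move on to the next segment file on plain EOF.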
On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote: > Uh, we are, of course, going to have partial reads- we just need to > handle them appropriately, and that's not hard to do in a way that we > never have false positives. Ere, my apologies here. I meant the read of a torn page, not a partial read (when extending the relation file we have locks preventing from a partial read as well by the way). -- Michael
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote: > > Uh, we are, of course, going to have partial reads- we just need to > > handle them appropriately, and that's not hard to do in a way that we > > never have false positives. > > Ere, my apologies here. I meant the read of a torn page, not a In the case of a torn page, we should be able to check the LSN, as discussed extensively previously, and if the LSN is from after the checkpoint we started at then we should be fine to skip the page. > partial read (when extending the relation file we have locks > preventing from a partial read as well by the way). Yes, we do, in the backend... We don't have (nor do we need) to get involved in those locks for these tools though.. Thanks! Stephen
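As a sketch of what that LSN test amounts to in code (illustrative only; the helper name is invented, though pg_checksum_page() and the bufpage.h macros are the real ones the tool already uses), assuming the tool records the redo LSN of the checkpoint that was current when the scan started:

#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Decide whether a fully read block should be reported as corrupt.
 * "start_lsn" is the redo LSN of the checkpoint that was current when
 * the scan started.  New pages and pages whose LSN is newer than
 * start_lsn (i.e. (re)written while we were scanning) are not flagged.
 */
static bool
block_is_suspect(char *buf, BlockNumber blkno, XLogRecPtr start_lsn)
{
	PageHeader	header = (PageHeader) buf;

	if (PageIsNew(buf))
		return false;			/* new page, no checksum to verify yet */

	if (pg_checksum_page(buf, blkno) == header->pd_checksum)
		return false;			/* checksum is fine */

	/*
	 * Checksum mismatch: only complain if the page has not been written
	 * since the checkpoint we started from; otherwise it is presumably a
	 * torn page that WAL replay would reinstate anyway.
	 */
	return PageGetLSN(buf) <= start_lsn;
}

For segment files beyond the first, the caller would pass blockno plus the segment offset as the block number, the same way the existing tool computes the checksum input.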
Hi, Am Montag, den 18.03.2019, 08:18 +0100 schrieb Michael Banck: > I have now rebased that patch on top of the pg_verify_checksums -> > pg_checksums renaming, see attached. Sorry, I had missed some hunks in the TAP tests, fixed-up patch attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi. Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost: > * Michael Banck (michael.banck@credativ.de) wrote: > > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost: > > > * Michael Paquier (michael@paquier.xyz) wrote: > > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote: > > > > > To be clear, I agree completely that we don't want to be reporting false > > > > > positives or "this might mean corruption!" to users running the tool, > > > > > but I haven't seen a good explaination of why this needs to involve the > > > > > server to avoid that happening. If someone would like to point that out > > > > > to me, I'd be happy to go read about it and try to understand. > > > > > > > > The mentions on this thread that the server has all the facility in > > > > place to properly lock a buffer and make sure that a partial read > > > > *never* happens and that we *never* have any kind of false positives, > > > > > > Uh, we are, of course, going to have partial reads- we just need to > > > handle them appropriately, and that's not hard to do in a way that we > > > never have false positives. > > > > I think the current patch (V13 from https://www.postgresql.org/message-i > > d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs. > > I think the question here is- do you ever see false positives with this > latest version..? If you are, then that's an issue and we should > discuss and try to figure out what's happening. If you aren't seeing > false positives, then it seems like we're done here, right? What do you mean with false positives here? I've never seen a bogus checksum failure, i.e. pg_checksums claiming some checksum is wrong cause it only read half of a block or a torn page. I do see sporadic partial reads and they get treated by the re-check logic and (if that is not enough) get tallied up as a skipped block in the end. Is that a false positive in your book? [...] > > I have now rebased that patch on top of the pg_verify_checksums -> > > pg_checksums renaming, see attached. > > Thanks for that. Reading through the code though, I don't entirely > understand why we're making things complicated for ourselves by trying > to seek and re-read the entire block, specifically this: [...] > I would think that we could just do: > > insert_location = 0; > r = read(BLCKSIZE - insert_location); > if (r < 0) error(); > if (r == 0) EOF detected, move to next > if (r < (BLCKSIZE - insert_location)) { > insert_location += r; > continue; > } > > At this point, we should have a full block, do our checks... Well, we need to read() into some buffer which you have ommitted. So if we had a short read, and then read the rest of the block via (BLCKSIZE - insert_location) wouldn't we have to read that in a second buffer and then join the two in order to compute the checksum? That does not sounds simpler to me than just re-reading the block entirely. > Have you seen cases where the kernel will actually return a partial read > for something that isn't at the end of the file, and where you could > actually lseek past that point and read the next block? I'd be really > curious to see that if you can reproduce it... I've definitely seen > empty pages come back with a claim that the full amount was read, but > that's a very different thing. Well, I've seen partial reads and I have seen very rarely that it will continue to read another block afterwards. 
If the relation is being extended while we check it, it sounds plausible that another block could be written before we get to read EOF on the next read() after a partial read() so that does not sounds like a bug to me either. I might be misunderstanding your question though? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings,
On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote:
Hi.
Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> * Michael Banck (michael.banck@credativ.de) wrote:
> > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > * Michael Paquier (michael@paquier.xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting false
> > > > > positives or "this might mean corruption!" to users running the tool,
> > > > > but I haven't seen a good explaination of why this needs to involve the
> > > > > server to avoid that happening. If someone would like to point that out
> > > > > to me, I'd be happy to go read about it and try to understand.
> > > >
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > >
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> >
> > I think the current patch (V13 from https://www.postgresql.org/message-i
> > d/1552045881.4947.43.camel@credativ.de) does that, modulo possible bugs.
>
> I think the question here is- do you ever see false positives with this
> latest version..? If you are, then that's an issue and we should
> discuss and try to figure out what's happening. If you aren't seeing
> false positives, then it seems like we're done here, right?
What do you mean with false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
cause it only read half of a block or a torn page.
I do see sporadic partial reads and they get treated by the re-check
logic and (if that is not enough) get tallied up as a skipped block in
the end. Is that a false positive in your book?
No, that’s clearly not a false positive.
[...]
> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
>
> Thanks for that. Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:
[...]
> I would think that we could just do:
>
> insert_location = 0;
> r = read(BLCKSIZE - insert_location);
> if (r < 0) error();
> if (r == 0) EOF detected, move to next
> if (r < (BLCKSIZE - insert_location)) {
> insert_location += r;
> continue;
> }
>
> At this point, we should have a full block, do our checks...
Well, we need to read() into some buffer which you have omitted.
Surely there’s a buffer that the read in the existing code is passing in; you just need to offset by the current pointer. Sorry for not being clear.
In other words the read would look more like:
read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
And then you have to reset insert_ptr once you have a full block.
So if we had a short read, and then read the rest of the block via
(BLCKSIZE - insert_location) wouldn't we have to read that in a second
buffer and then join the two in order to compute the checksum? That
does not sounds simpler to me than just re-reading the block entirely.
No, just read into your existing buffer at the point where the prior partial read left off...
> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block? I'd be really
> curious to see that if you can reproduce it... I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.
Well, I've seen partial reads and I have seen very rarely that it will
continue to read another block afterwards. If the relation is being
extended while we check it, it sounds plausible that another block could
be written before we get to read EOF on the next read() after a partial
read() so that does not sound like a bug to me either.
Right, absolutely you can have a partial read during a relation extension and then come back around and do another read and discover more data; that’s entirely reasonable, and I’ve seen it happen too.
I might be misunderstanding your question though?
Yes, the question was more like this: have you ever seen a read return a partial result when you know you’re in the middle somewhere of an existing file and the length of the file hasn’t been changed by something else..? I can’t say that I have, when reading from regular files, even in kernel-error type of conditions due to hardware issues, but I’m open to being told I’m wrong... in such a case though I would still expect an error on a subsequent read, which would work just fine for our case. If the kernel just decides to return a zero in that case then I don’t know that there’s really anything we can do about that because that seems like it would be pretty clearly broken results from the kernel and that’s out of scope for this.
Apologies if this isn’t clear, on my phone now.
Thanks!
Stephen
On Mon, Mar 18, 2019 at 2:06 AM Michael Paquier <michael@paquier.xyz> wrote: > The mentions on this thread that the server has all the facility in > place to properly lock a buffer and make sure that a partial read > *never* happens and that we *never* have any kind of false positives, > directly preventing the set of issues we are trying to implement > workarounds for in a frontend tool are rather good arguments in my > opinion (you can grep for BufferDescriptorGetIOLock() on this thread > for example). Yeah, exactly. It may be that there is a good way to avoid those issues without interacting with the server and that would be nice, but ... as far as I can see, nobody's figured out a way that's reliable yet, and all of the solutions proposed so far basically amount to "let's ignore things that might be serious problems because they might be transient" and/or "let's retry and see if the problem goes away." I'm more sanguine about a retry-based solution than an ignore-possible-problems solution, but what's been proposed so far seems quite prone to retrying so fast that it makes no difference, and it's not clear how much code complexity we'd have to add to do better or how reliable it would be even then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost: > On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote: > > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost: > > > Thanks for that. Reading through the code though, I don't entirely > > > understand why we're making things complicated for ourselves by trying > > > to seek and re-read the entire block, specifically this: > > > > [...] > > > > > I would think that we could just do: > > > > > > insert_location = 0; > > > r = read(BLCKSIZE - insert_location); > > > if (r < 0) error(); > > > if (r == 0) EOF detected, move to next > > > if (r < (BLCKSIZE - insert_location)) { > > > insert_location += r; > > > continue; > > > } > > > > > > At this point, we should have a full block, do our checks... > > > > Well, we need to read() into some buffer which you have ommitted. > > Surely there’s a buffer the read in the existing code is passing in, > you just need to offset by the current pointer, sorry for not being > clear. > > In other words the read would look more like: > > read(fd,buf + insert_ptr, BUFSZ - insert_ptr) > > And then you have to reset insert_ptr once you have a full block. Ok, thanks for clearing that up. I've tried to do that now in the attached, does that suit you? > Yes, the question was more like this: have you ever seen a read return > a partial result when you know you’re in the middle somewhere of an > existing file and the length of the file hasn’t been changed by > something else..? I don't think I've seen that, but that wouldn't turn up in regular testing anyway I guess but only in pathological cases? I guess we are probably dealing with this in the current version of the patch, but I can't say for certain as it sounds pretty difficult to test. I have also added a paragraph to the documentation about possilby skipping new or recently updated pages: + If the cluster is online, pages that have been (re-)written since the last + checkpoint will not count as checksum failures if they cannot be read or + verified correctly. Wording improvements welcome. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Greetings,
On Tue, Mar 19, 2019 at 04:15 Michael Banck <michael.banck@credativ.de> wrote:
Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.banck@credativ.de> wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that. Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> >
> > [...]
> >
> > > I would think that we could just do:
> > >
> > > insert_location = 0;
> > > r = read(BLCKSIZE - insert_location);
> > > if (r < 0) error();
> > > if (r == 0) EOF detected, move to next
> > > if (r < (BLCKSIZE - insert_location)) {
> > > insert_location += r;
> > > continue;
> > > }
> > >
> > > At this point, we should have a full block, do our checks...
> >
> > Well, we need to read() into some buffer which you have ommitted.
>
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
>
> In other words the read would look more like:
>
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
>
> And then you have to reset insert_ptr once you have a full block.
Ok, thanks for clearing that up.
I've tried to do that now in the attached, does that suit you?
Yes, that’s what I was thinking. I’m honestly not entirely convinced that the lseek() efforts still need to be put in; I would have thought it’d be fine to simply check the LSN on a checksum failure and mark it as skipped if the LSN is past the current checkpoint. That seems like it would make things much simpler, but I’m also not against keeping that logic now that it’s in, provided it doesn’t cause issues.
> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?
I don't think I've seen that, but that wouldn't turn up in regular
testing anyway I guess but only in pathological cases? I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.
Yeah, a lot of things in this area are unfortunately difficult to test. I’m glad to hear that it doesn’t sound like you’ve seen it though.
I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:
+ If the cluster is online, pages that have been (re-)written since the last
+ checkpoint will not count as checksum failures if they cannot be read or
+ verified correctly.
I would flip this around:
——-
In an online cluster, pages are being concurrently written to the files while the check is being run, leading to possible torn pages or partial reads. When the tool detects a concurrently written page, indicated by the page’s LSN being beyond the checkpoint the tool started at, that page will be reported as skipped. Note that in a crash scenario, any pages written since the last checkpoint will be replayed from the WAL.
——-
Now here’s the $64 question: have you tested this latest version under load? If not, could you? And when you do, can you report back what the results are? Do you still see any actual checksum failures? Does the number of skipped pages seem reasonable in your tests, or is there a concern there?
If you still see actual checksum failures which aren’t because the LSN is higher than the checkpoint, or because of a short read, then we need to investigate further but hopefully that isn’t happening now. I think a lot of the concerns raised on this thread about wanting to avoid false positives are because the torn page (with higher LSN than current checkpoint) and short read cases were previously reported as failures when they really are expected. Let’s test this as much as we can and make sure we aren’t seeing false positives anymore.
Thanks!
Stephen
On Mon, Mar 18, 2019 at 2:38 AM Stephen Frost <sfrost@snowman.net> wrote: > Sure the backend has those facilities since it needs to, but these > frontend tools *don't* need that to *never* have any false positives, so > why are we complicating things by saying that this frontend tool and the > backend have to coordinate? > > If there's an explanation of why we can't avoid having false positives > in the frontend tool, I've yet to see it. I definitely understand that > we can get partial reads, but a partial read isn't a failure, and > shouldn't be reported as such. I think there's some confusion between 'partial read' and 'torn page', as Michael also said. It's torn pages that I am concerned about - the server is writing and we are reading, and we get a mix of old and new content. We have been quite diligent about protecting ourselves from such risks elsewhere, and checksum verification should not be held to any lesser standard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas: > It's torn pages that I am concerned about - the server is writing and > we are reading, and we get a mix of old and new content. We have been > quite diligent about protecting ourselves from such risks elsewhere, > and checksum verification should not be held to any lesser standard. If we see a checksum failure on an otherwise correctly read block in online mode, we retry the block on the theory that we might have read a torn page. If the checksum verification still fails, we compare its LSN to the LSN of the current checkpoint and don't mind if its newer. This way, a torn page should not cause a false positive either way I think?. If it is a genuine storage failure we will see it in the next pg_checksums run as its LSN will be older than the checkpoint. The basebackup checksum verification works in the same way. I am happy to look into further option about how to make things better, but I am not sure what the actual problem might be that you mention above. I will see whether I can stress-test the patch a bit more but I've already taxed the SSD on my company notebook quite a bit during the development of this so will see whether I can get some real server hardware somewhere. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, On 2019-03-19 16:52:08 +0100, Michael Banck wrote: > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas: > > It's torn pages that I am concerned about - the server is writing and > > we are reading, and we get a mix of old and new content. We have been > > quite diligent about protecting ourselves from such risks elsewhere, > > and checksum verification should not be held to any lesser standard. > > If we see a checksum failure on an otherwise correctly read block in > online mode, we retry the block on the theory that we might have read a > torn page. If the checksum verification still fails, we compare its LSN > to the LSN of the current checkpoint and don't mind if its newer. This > way, a torn page should not cause a false positive either way I > think?. False positives, no. But there's plenty potential for false negatives. In plenty clusters a large fraction of the pages is going to be touched in most checkpoints. > If it is a genuine storage failure we will see it in the next > pg_checksums run as its LSN will be older than the checkpoint. Well, but also, by that time it might be too late to recover things. Or it might be a backup that you just made, that you later want to recover from, ... > The basebackup checksum verification works in the same way. Shouldn't have been merged that way. Greetings, Andres Freund
Greetings,
On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content. We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
>
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page. If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if its newer. This
> way, a torn page should not cause a false positive either way I
> think?.
False positives, no. But there's plenty potential for false
negatives. In plenty clusters a large fraction of the pages is going to
be touched in most checkpoints.
How is it a false negative? The page was in the middle of being written: if we crash, the page won’t be used because it’ll get overwritten by WAL replay from the last checkpoint; if we don’t crash, then it also won’t be used until it’s been written out completely. I don’t agree that this is in any way a false negative; it’s simply a page that happens to be in the middle of a file that we can skip because it isn’t going to be used. It’s not like there’s going to be a checksum failure if the backend reads it.
Not only that, but checksum failures and the like are much more likely to happen on long-dormant data, not on data that’s actively being written out and therefore is still in the Linux FS cache and hasn’t even hit actual storage yet anyway.
> If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.
Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...
If it’s a backup you just made, then that page is going to be in the WAL and the torn page on disk isn’t going to be used, so how is this an issue? This is why we have WAL: to deal with torn pages.
> The basebackup checksum verification works in the same way.
Shouldn't have been merged that way.
I have a hard time not finding this offensive. These issues were considered, discussed, and well thought out, with the result being committed after agreement.
Do you have any example cases where the code in pg_basebackup has resulted in either a false positive or a false negative? Any case which can be shown to result in either?
If not then I think we need to stop this, because if we can’t trust that a torn page won’t be actually used in that torn state then it seems likely that our entire WAL system is broken and we can’t trust the way we do backups either and have to rewrite all of that to take precautions to lock pages while doing a backup.
Thanks!
Stephen
Hi, On 2019-03-20 03:27:55 +0800, Stephen Frost wrote: > On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote: > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote: > > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas: > > > > It's torn pages that I am concerned about - the server is writing and > > > > we are reading, and we get a mix of old and new content. We have been > > > > quite diligent about protecting ourselves from such risks elsewhere, > > > > and checksum verification should not be held to any lesser standard. > > > > > > If we see a checksum failure on an otherwise correctly read block in > > > online mode, we retry the block on the theory that we might have read a > > > torn page. If the checksum verification still fails, we compare its LSN > > > to the LSN of the current checkpoint and don't mind if its newer. This > > > way, a torn page should not cause a false positive either way I > > > think?. > > > > False positives, no. But there's plenty potential for false > > negatives. In plenty clusters a large fraction of the pages is going to > > be touched in most checkpoints. > > > How is it a false negative? The page was in the middle of being > written, You don't actually know that. It could just be random gunk in the LSN, and this type of logic just ignores such failures as long as the random gunk is above the system's LSN. And the basebackup logic doesn't just ignore if both the checksum failed, and the lsn is between startptr and current insertion pointer - it just does it with *any* page that has a pd_upper != 0 and a pd_lsn > startptr. Given typical startlsn values (skewing heavily towards lower int64s), that means that random data is more likely than not to pass this test. As it stands, the logic seems to give more false confidence than anything else. > > The basebackup checksum verification works in the same way. > > > > Shouldn't have been merged that way. > > > I have a hard time not finding this offensive. These issues were > considered, discussed, and well thought out, with the result being > committed after agreement. Well, I don't know what to tell you. But: /* * Only check pages which have not been modified since the * start of the base backup. Otherwise, they might have been * written only halfway and the checksum would not be valid. * However, replaying WAL would reinstate the correct page in * this case. We also skip completely new pages, since they * don't have a checksum yet. */ if (!PageIsNew(page) && PageGetLSN(page) < startptr) { doesn't consider plenty scenarios, as pointed out above. It'd be one thing if the concerns I point out above were actually commented upon and weighed not substantial enough (not that I know how). But... > Do you have any example cases where the code in pg_basebackup has resulted > in either a false positive or a false negative? Any case which can be > shown to result in either? CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i); SELECT pg_relation_size('corruptme'); postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme'); ┌─────────────────────────────────────┐ │ ?column? │ ├─────────────────────────────────────┤ │ /srv/dev/pgdev-dev/base/13390/16384 │ └─────────────────────────────────────┘ (1 row) dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc Try a basebackup and see how many times it'll detect the corrupt data. 
In the vast majority of cases you're going to see checksum failures when reading the data for normal operation, but not when using basebackup (or this new tool). At the very very least this would need to do a) checks that the page is all zeroes if PageIsNew() (like PageIsVerified() does for the backend). That avoids missing cases where corruption just zeroed out the header, but not the whole page. b) Check that pd_lsn is between startlsn and the insertion pointer. That avoids accepting just about all random data. And that'd *still* be less strenuous than what normal backends check. And that's already not great (due to not noticing zeroed out data). I fail to see how it's offensive to describe this as "shouldn't have been merged that way". Greetings, Andres Freund
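Spelled out, (a) and (b) together could look roughly like this (a sketch only, not committed code; startptr is the backup start LSN as in basebackup.c, and insert_ptr stands in for the current WAL insert position, e.g. GetXLogInsertRecPtr()):

#include "postgres.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Sketch of the stricter per-page test: (a) a "new" page must be all
 * zeroes, (b) a checksum mismatch is only excused when pd_lsn lies
 * between the backup start pointer and the current WAL insert position.
 */
static bool
page_verifies_or_is_excusable(Page page, BlockNumber blkno,
							  XLogRecPtr startptr, XLogRecPtr insert_ptr)
{
	if (PageIsNew(page))
	{
		size_t	   *pagebytes = (size_t *) page;
		int			i;

		/* (a) like PageIsVerified(): header says "new", so the rest must be zero */
		for (i = 0; i < (int) (BLCKSZ / sizeof(size_t)); i++)
		{
			if (pagebytes[i] != 0)
				return false;
		}
		return true;
	}

	if (pg_checksum_page((char *) page, blkno) == ((PageHeader) page)->pd_checksum)
		return true;

	/*
	 * (b) random garbage in pd_lsn is very unlikely to fall into the
	 * narrow window between startptr and the insert position, unlike the
	 * "anything newer than startptr" test used today.
	 */
	return PageGetLSN(page) >= startptr &&
		PageGetLSN(page) < insert_ptr;
}

Like PageIsVerified(), this still accepts a page that was silently zeroed out in its entirety, which is why even the backend-level check is described above as not great on that front.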
On 2019-03-19 13:00:50 -0700, Andres Freund wrote: > As it stands, the logic seems to give more false confidence than > anything else. To demonstrate that I ran a loop that verified that a) a normal backend query using the tale detects the corruption b) pg_basebackup doesn't. i=0; while true; do i=$(($i+1)); echo attempt $i; dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null; psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break; ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break; done (excuse the crappy one-off sh) had, during ~12k iterations, always detected the corruption in the backend, and never via pg_basebackup. Given the likely LSNs in a cluster, that's not too surprising. Greetings, Andres Freund
On Tue, Mar 19, 2019 at 4:49 PM Andres Freund <andres@anarazel.de> wrote: > To demonstrate that I ran a loop that verified that a) a normal backend > query using the tale detects the corruption b) pg_basebackup doesn't. > > i=0; > while true; do > i=$(($i+1)); > echo attempt $i; > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null; > psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break; > ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break; > done > > (excuse the crappy one-off sh) > > had, during ~12k iterations, always detected the corruption in the > backend, and never via pg_basebackup. Given the likely LSNs in a > cluster, that's not too surprising. Wow. So we shipped a checksum-verification feature (in pg_basebackup) that reliably fails to detect blatantly corrupt pages. That's pretty awful. Your chances get better the more WAL you've ever generated, but you have to generate 163 petabytes of WAL to have a 1% chance of detecting a page of random garbage, so realistically they never get very good. It's probably fair to point out that flipping a couple of random bytes on the page is a more likely error than replacing the entire page with garbage, and the check as designed will detect that fairly reliably -- unless those bytes are very near the beginning of the page. Still, that leaves a lot of kinds of corruption that this will not catch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
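For the record, here is the arithmetic behind that figure, assuming the pd_lsn of a random-garbage page is uniformly distributed over the 64-bit LSN space, so the check only trips when that random LSN happens to fall below the backup start pointer:

\[
P(\text{detect}) = \frac{\mathrm{startptr}}{2^{64}},
\qquad
P = 0.01 \;\Rightarrow\; \mathrm{startptr} \approx 0.01 \times 2^{64}\ \text{bytes}
\approx 1.8 \times 10^{17}\ \text{bytes} \approx 163\ \text{PiB}
\]

In other words, a cluster has to have generated on the order of 163 petabytes of WAL before a page of random garbage has even a 1% chance of being flagged by the current test.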
Hi, Am Dienstag, den 19.03.2019, 13:00 -0700 schrieb Andres Freund: > On 2019-03-20 03:27:55 +0800, Stephen Frost wrote: > > On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote: > > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote: > > > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas: > > > > > It's torn pages that I am concerned about - the server is writing and > > > > > we are reading, and we get a mix of old and new content. We have been > > > > > quite diligent about protecting ourselves from such risks elsewhere, > > > > > and checksum verification should not be held to any lesser standard. > > > > > > > > If we see a checksum failure on an otherwise correctly read block in > > > > online mode, we retry the block on the theory that we might have read a > > > > torn page. If the checksum verification still fails, we compare its LSN > > > > to the LSN of the current checkpoint and don't mind if its newer. This > > > > way, a torn page should not cause a false positive either way I > > > > think?. > > > > > > False positives, no. But there's plenty potential for false > > > negatives. In plenty clusters a large fraction of the pages is going to > > > be touched in most checkpoints. > > > > > > How is it a false negative? The page was in the middle of being > > written, > > You don't actually know that. It could just be random gunk in the LSN, > and this type of logic just ignores such failures as long as the random > gunk is above the system's LSN. Right, I think this needs to be taken into account. For pg_basebackup, that'd be an additional check for GetRedoRecPtr() or something in the below check: [...] > Well, I don't know what to tell you. But: > > /* > * Only check pages which have not been modified since the > * start of the base backup. Otherwise, they might have been > * written only halfway and the checksum would not be valid. > * However, replaying WAL would reinstate the correct page in > * this case. We also skip completely new pages, since they > * don't have a checksum yet. > */ > if (!PageIsNew(page) && PageGetLSN(page) < startptr) > { > > doesn't consider plenty scenarios, as pointed out above. It'd be one > thing if the concerns I point out above were actually commented upon and > weighed not substantial enough (not that I know how). But... > > > Do you have any example cases where the code in pg_basebackup has resulted > > in either a false positive or a false negative? Any case which can be > > shown to result in either? > > CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i); > SELECT pg_relation_size('corruptme'); > postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme'); > ┌─────────────────────────────────────┐ > │ ?column? │ > ├─────────────────────────────────────┤ > │ /srv/dev/pgdev-dev/base/13390/16384 │ > └─────────────────────────────────────┘ > (1 row) > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc > > Try a basebackup and see how many times it'll detect the corrupt > data. In the vast majority of cases you're going to see checksum > failures when reading the data for normal operation, but not when using > basebackup (or this new tool). Right, see above. > At the very very least this would need to do > > a) checks that the page is all zeroes if PageIsNew() (like > PageIsVerified() does for the backend). 
That avoids missing cases > where corruption just zeroed out the header, but not the whole page. We can't run pg_checksum_page() on those afterwards though as it would fire an assertion: |pg_checksums: [...]/../src/include/storage/checksum_impl.h:194: |pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper |== 0)' failed. But we should count it as a checksum error and generate an appropriate error message in that case. > b) Check that pd_lsn is between startlsn and the insertion pointer. That > avoids accepting just about all random data. However, for pg_checksums being a stand-alone application it can't just access the insertion pointer, can it? We could maybe set a threshold from the last checkpoint after which we consider the pd_lsn bogus. But what's a good threshold here? And/or we could port the other sanity checks from PageIsVerified: | if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 && | p->pd_lower <= p->pd_upper && | p->pd_upper <= p->pd_special && | p->pd_special <= BLCKSZ && | p->pd_special == MAXALIGN(p->pd_special)) | header_sane = true That should catch large-scale random corruption like you showed above. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
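If those sanity checks were ported over, a stand-alone helper for pg_checksums could look roughly like this, assembled from the PageIsVerified() conditions quoted above (whether all of these headers are usable unchanged in a frontend build is an assumption here):

#include "postgres_fe.h"
#include "storage/bufpage.h"

/* sketch: does the page header look structurally sane? */
static bool
header_looks_sane(Page page)
{
    PageHeader  p = (PageHeader) page;

    return (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
        p->pd_lower <= p->pd_upper &&
        p->pd_upper <= p->pd_special &&
        p->pd_special <= BLCKSZ &&
        p->pd_special == MAXALIGN(p->pd_special);
}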
Hi, On 2019-03-19 22:39:16 +0100, Michael Banck wrote: > Am Dienstag, den 19.03.2019, 13:00 -0700 schrieb Andres Freund: > > a) checks that the page is all zeroes if PageIsNew() (like > > PageIsVerified() does for the backend). That avoids missing cases > > where corruption just zeroed out the header, but not the whole page. > > We can't run pg_checksum_page() on those afterwards though as it would > fire an assertion: > > |pg_checksums: [...]/../src/include/storage/checksum_impl.h:194: > |pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper > |== 0)' failed. > > But we should count it as a checksum error and generate an appropriate > error message in that case. All I'm saying is that if PageIsNew() you need to run the same checks that PageIsVerified() runs in that case. Namely verifying that the page is all-zeroes, rather than just the pd_upper field. That's separate from running pg_checksum_page(). > > b) Check that pd_lsn is between startlsn and the insertion pointer. That > > avoids accepting just about all random data. > > However, for pg_checksums being a stand-alone application it can't just > access the insertion pointer, can it? We could maybe set a threshold > from the last checkpoint after which we consider the pd_lsn bogus. But > what's a good threshold here? That's *PRECISELY* my point. I think it's a bad idea to do online checksumming from outside the backend. It needs to be inside the backend, and if there's any verification failures on a block, it needs to acquire the IO lock on the page, and reread from disk. Greetings, Andres Freund
On Tue, Mar 19, 2019 at 02:44:52PM -0700, Andres Freund wrote: > That's *PRECISELY* my point. I think it's a bad idea to do online > checksumming from outside the backend. It needs to be inside the > backend, and if there's any verification failures on a block, it needs > to acquire the IO lock on the page, and reread from disk. Yeah, FWIW, Julien Rouhaud was mentioning to me that we could use mdread() and loop over the blocks so that we don't end up loading corrupted blocks into shared buffers, checking on the way whether the block is already in shared buffers or not. -- Michael
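To make that idea a bit more concrete, a backend-side scan along those lines could look roughly like the sketch below. This is only an illustration, not a patch: block_in_shared_buffers() is a hypothetical helper (the real lookup would go through the buffer mapping table), and the smgr calls are written from memory of their signatures at the time.

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/smgr.h"

/* hypothetical helper: is this block currently cached in shared buffers? */
extern bool block_in_shared_buffers(RelFileNode rnode, ForkNumber forknum,
                                    BlockNumber blkno);

/*
 * Sketch only: verify the checksums of all on-disk blocks of one fork,
 * without ever loading a corrupted block into shared buffers.
 */
static void
verify_fork_checksums(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber nblocks = smgrnblocks(reln, forknum);
    PGAlignedBlock buf;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* the copy in shared buffers is newer than what is on disk */
        if (block_in_shared_buffers(reln->smgr_rnode.node, forknum, blkno))
            continue;

        smgrread(reln, forknum, blkno, buf.data);

        if (PageIsNew(buf.data))
            continue;           /* no checksum yet (all-zero check omitted here) */

        if (pg_checksum_page(buf.data, blkno) !=
            ((PageHeader) buf.data)->pd_checksum)
            elog(WARNING, "checksum mismatch in block %u", blkno);
    }
}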
Hi, I have rebased this patch now. I also fixed the two issues Andres reported, namely a zeroed-out pageheader and a random LSN. The first is caught by checking for an all-zero page in the way PageIsVerified() does. The second is caught by comparing the upper 32 bits of the LSN as well and demanding that they are equal. If the LSN is corrupted, the upper 32 bits should be wildly different from the current checkpoint LSN. Well, at least that is a stab at a fix; there is a window where the upper 32 bits could legitimately be different. In order to make that as small as possible, I update the checkpoint LSN every once in a while. Am Montag, den 18.03.2019, 21:15 +0100 schrieb Michael Banck: > I have also added a paragraph to the documentation about possibly > skipping new or recently updated pages: > > + If the cluster is online, pages that have been (re-)written since the last > + checkpoint will not count as checksum failures if they cannot be read or > + verified correctly. I have removed that for now as it seems to be more confusing than helpful. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
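For readers following along, the "upper 32 bits must match" heuristic described here boils down to something like the following (a sketch of the idea, not the exact patch text; checkpoint_lsn is assumed to be the value read from pg_control):

#include "postgres_fe.h"
#include "access/xlogdefs.h"

/*
 * Treat a page as "recently written, skip it" only if its LSN is newer than
 * the checkpoint LSN *and* has identical upper 32 bits (i.e. lies in the same
 * 4 GiB WAL range), so that random garbage in pd_lsn rarely qualifies.
 */
static bool
lsn_plausibly_recent(XLogRecPtr page_lsn, XLogRecPtr checkpoint_lsn)
{
    return (uint32) (page_lsn >> 32) == (uint32) (checkpoint_lsn >> 32) &&
        page_lsn > checkpoint_lsn;
}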
On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote: >Hi, > >I have rebased this patch now. > >I also fixed the two issues Andres reported, namely a zeroed-out >pageheader and a random LSN. The first is caught be checking for an all- >zero-page in the way PageIsVerified() does. The second is caught by >comparing the upper 32 bits of the LSN as well and demanding that they >are equal. If the LSN is corrupted, the upper 32 bits should be wildly >different to the current checkpoint LSN. > >Well, at least that is a stab at a fix; there is a window where the >upper 32 bits could legitimately be different. In order to make that as >small as possible, I update the checkpoint LSN every once in a while. > Doesn't that mean we'll report a false positive? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Am Donnerstag, den 28.03.2019, 18:19 +0100 schrieb Tomas Vondra: > On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote: > > I also fixed the two issues Andres reported, namely a zeroed-out > > pageheader and a random LSN. The first is caught be checking for an all- > > zero-page in the way PageIsVerified() does. The second is caught by > > comparing the upper 32 bits of the LSN as well and demanding that they > > are equal. If the LSN is corrupted, the upper 32 bits should be wildly > > different to the current checkpoint LSN. > > > > Well, at least that is a stab at a fix; there is a window where the > > upper 32 bits could legitimately be different. In order to make that as > > small as possible, I update the checkpoint LSN every once in a while. I decided it makes more sense to just re-read the checkpoint LSN from the control file when we encounter a wrong checksum on re-read of a page as that is when it counts, instead of doing it only every once in a while. > Doesn't that mean we'll report a false positive? A false positive would be pg_checksums claiming a block has a wrong checksum while in fact it does not (after it is correctly written out and synced to disk), right? If pg_checksums reads a current first part and a stale second part twice in a row (we re-read the block), then the LSN of the first part would presumably(?) be higher than the latest checkpoint LSN. If there was a wraparound in the lower part of the LSN so that the upper part is now different to the latest checkpoint LSN, then pg_checksums would report this as a false positive I believe. We could add some additional heuristics like checking the upper part of the LSN has advanced by at most one but that does not seem to make it 100% certified robust either, does it? If pg_checksums reads a current second part and a stale first part twice, then the pageheader LSN would presumably be lower than the checkpoint LSN and again a false positive would be reported. At least in my testing I haven't seen the second case and the first (disregarding the wraparound issue for now) extremely rarely if at all (usually the torn page is gone on re-read). The first case requiring a wraparound since the latest checkpointLSN update also seems quite narrow compared to the issue of random data being written due to corruption. So I think it is more important to make sure random data won't be a false negative than this being a false positive. Maybe we can just issue a warning in online mode that some checksum failures could be false positives and advise the user to recheck those files (using the -r switch) again? I have added this in the attached new version: + printf(_("%s ran against an online cluster and found some bad checksums.\n"), progname); + printf(_("It could be that those are false positives due concurrently updated blocks,\n")); + printf(_("checking the offending files again with the -r option is advised.\n")); It was not mentioned on this thread, but I want to stress again that you cannot run the current pg_checksums on a basebackup due to the control file claiming it is still online. This makes the current program pretty useless for production setups right now in my opinion as few people have the luxury of regular maintenance downtimes when pg_checksums could run and running it against base backups is quite cumbersome. Maybe we can improve things by checking for the postmaster.pid as well and going ahead (only for --check of course) if it is missing, but that hasn't been implemented yet. 
I agree that the current patch might have some corner-cases where it does not guarantee 100% accuracy in online mode, but I hope the current version at least has no more false negatives. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
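To put a concrete (made-up) number on the wraparound case above: suppose the checkpoint LSN pg_checksums compares against is 7/FFFFE000, and a page is legitimately rewritten just after WAL insertion crosses into the next 4 GiB range, so its fresh pd_lsn is 8/00000108. The upper halves (7 vs. 8) no longer match, the "recently written" exemption is not applied, and if that page also happens to be read torn on both attempts, it would be reported as a checksum failure, i.e. a false positive. The window is narrow, but it exists.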
Hi, On 2019-03-28 21:09:22 +0100, Michael Banck wrote: > I agree that the current patch might have some corner-cases where it > does not guarantee 100% accuracy in online mode, but I hope the current > version at least has no more false negatives. False positives are *bad*. We shouldn't integrate code that has them. Greetings, Andres Freund
On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote: >Hi, > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote: >> I agree that the current patch might have some corner-cases where it >> does not guarantee 100% accuracy in online mode, but I hope the current >> version at least has no more false negatives. > >False positives are *bad*. We shouldn't integrate code that has them. > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode communicate with the server, which would presumably address these issues. Can someone explain why not to do that? FWIW I've initially argued against that, believing that we can address those issues in some other way, and I'd love if that was possible. But considering we're still trying to make that work reliably I think the reasonable conclusion is that Andres was right communicating with the server is necessary. Of course, I definitely appreciate people are working on this, otherwise we wouldn't be having this discussion ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
>> I agree that the current patch might have some corner-cases where it
>> does not guarantee 100% accuracy in online mode, but I hope the current
>> version at least has no more false negatives.
>
>False positives are *bad*. We shouldn't integrate code that has them.
>
Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
communicate with the server, which would presumably address these issues.
Can someone explain why not to do that?
I agree that this effort seems better spent on fixing those issues there (of which many are the same), and then re-use that.
FWIW I've initially argued against that, believing that we can address
those issues in some other way, and I'd love if that was possible. But
considering we're still trying to make that work reliably I think the
reasonable conclusion is that Andres was right communicating with the
server is necessary.
Of course, I definitely appreciate people are working on this, otherwise
we wouldn't be having this discussion ...
+1.
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > wrote: > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote: > > >Hi, > > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote: > > >> I agree that the current patch might have some corner-cases where it > > >> does not guarantee 100% accuracy in online mode, but I hope the current > > >> version at least has no more false negatives. > > > > > >False positives are *bad*. We shouldn't integrate code that has them. > > > > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode > > communicate with the server, which would presumably address these issues. > > Can someone explain why not to do that? > > I agree that this effort seems better spent on fixing those issues there > (of which many are the same), and then re-use that. This really seems like it depends on which of the options we're talking about.. Connecting to the server and asking what the current insert point is, so we can check that the LSN isn't completely insane, seems reasonable, but at least one option being discussed was to have pg_basebackup actually *lock the page* (even if just for I/O..) and then re-read it, and having an external tool doing that instead of the backend seems like a whole different level to me. That would involve having an SQL function for "lock this page against I/O" and then another for "unlock this page", wouldn't it? > > FWIW I've initially argued against that, believing that we can address > > those issues in some other way, and I'd love if that was possible. But > > considering we're still trying to make that work reliably I think the > > reasonable conclusion is that Andres was right communicating with the > > server is necessary. As part of a backup, you could check against the pages written out into the WAL as a cross-check and be able to be confident that at least everything which was backed up had been checked. That doesn't cover things like unlogged tables though. For my part, at least, adding additional checks around the LSN seems like a good solution (though we can't allow those checks to turn into false positives...) and would seriously reduce the risk that we have false negatives (we can *not* completely eliminate false negatives entirely.. we could possibly get to a point where at least we don't have any more false negatives than PG itself has but it looks like an awful lot of work and ends up adding its own risks...). As I've said before, I'd certainly support a background worker which performs ongoing checksum validation of pages and that would be able to use the same approach as what we do with pg_basebackup, but having an external tool locking pages seems really unlikely to be reasonable. Thanks! Stephen
Hi, On 2019-03-29 11:30:15 -0400, Stephen Frost wrote: > * Magnus Hagander (magnus@hagander.net) wrote: > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > > wrote: > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote: > > > >Hi, > > > > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote: > > > >> I agree that the current patch might have some corner-cases where it > > > >> does not guarantee 100% accuracy in online mode, but I hope the current > > > >> version at least has no more false negatives. > > > > > > > >False positives are *bad*. We shouldn't integrate code that has them. > > > > > > > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode > > > communicate with the server, which would presumably address these issues. > > > Can someone explain why not to do that? > > > > I agree that this effort seems better spent on fixing those issues there > > (of which many are the same), and then re-use that. > > This really seems like it depends on which of the options we're talking > about.. Connecting to the server and asking what the current insert > point is, so we can check that the LSN isn't completely insane, seems > reasonable, but at least one option being discussed was to have > pg_basebackup actually *lock the page* (even if just for I/O..) and then > re-read it, and having an external tool doing that instead of the > backend seems like a whole different level to me. That would involve > having an SQL function for "lock this page against I/O" and then another > for "unlock this page", wouldn't it? No, I don't think so. And we obviously couldn't have a SQL level function hold an LWLock after it has finished, that'd make undetected deadlocks triggerable by users. The way I'd imagine that being done is to just perform the checksum test in the commandline tool, and whenever there's a checksum failure that could plausibly be a torn read, call a server side function that re-tests the page after locking it. Which then would just return the error message in a string. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-03-29 11:30:15 -0400, Stephen Frost wrote: > > * Magnus Hagander (magnus@hagander.net) wrote: > > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > > > wrote: > > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote: > > > > >Hi, > > > > > > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote: > > > > >> I agree that the current patch might have some corner-cases where it > > > > >> does not guarantee 100% accuracy in online mode, but I hope the current > > > > >> version at least has no more false negatives. > > > > > > > > > >False positives are *bad*. We shouldn't integrate code that has them. > > > > > > > > > > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode > > > > communicate with the server, which would presumably address these issues. > > > > Can someone explain why not to do that? > > > > > > I agree that this effort seems better spent on fixing those issues there > > > (of which many are the same), and then re-use that. > > > > This really seems like it depends on which of the options we're talking > > about.. Connecting to the server and asking what the current insert > > point is, so we can check that the LSN isn't completely insane, seems > > reasonable, but at least one option being discussed was to have > > pg_basebackup actually *lock the page* (even if just for I/O..) and then > > re-read it, and having an external tool doing that instead of the > > backend seems like a whole different level to me. That would involve > > having an SQL function for "lock this page against I/O" and then another > > for "unlock this page", wouldn't it? > > No, I don't think so. And we obviously couldn't have a SQL level > function hold an LWLock after it has finished, that'd make undetected > deadlocks triggerable by users. The way I'd imagine that being done is > to just perform the checksum test in the commandline tool, and whenever > there's a checksum failure that could plausibly be a torn read, call a > server side function that re-tests the page after locking it. Which then > would just return the error message in a string. The server-side function would essentially lock the page against i/o, re-read it off disk into an independent location, unlock the page, then calculate the checksum and report back? That seems like it would be reasonable to me. Wouldn't it make sense to then have pg_basebackup use that same function..? Thanks, Stephen
Hi, On 2019-03-29 11:38:02 -0400, Stephen Frost wrote: > The server-side function would essentially lock the page against i/o, > re-read it off disk into an independent location, unlock the page, then > calculate the checksum and report back? Right. I think there's a few minor variations of how this could be done, but that'd be the basic approach. > That seems like it would be reasonable to me. Wouldn't it make sense to > then have pg_basebackup use that same function..? Yea, probably. Or at least reuse the majority of it, I can imagine the error reporting would be a bit different (sqlstates et al are needed for the basebackup.c case, but not the pg_checksum case). Greetings, Andres Freund
On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Magnus Hagander (magnus@hagander.net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
>
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.
This really seems like it depends on which of the options we're talking
about.. Connecting to the server and asking what the current insert
point is, so we can check that the LSN isn't completely insane, seems
reasonable, but at least one option being discussed was to have
pg_basebackup actually *lock the page* (even if just for I/O..) and then
re-read it, and having an external tool doing that instead of the
backend seems like a whole different level to me. That would involve
having an SQL function for "lock this page against I/O" and then another
for "unlock this page", wouldn't it?
Right.
But what if we just added a flag to the BASE_BACKUP command in the replication protocol that said "meh, I really just want to verify the checksums, so please send the data to devnull and only feed me regular status updates on this connection"?
Hi, Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander: > On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote: > > * Magnus Hagander (magnus@hagander.net) wrote: > > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > > > wrote: > > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote: > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote: > > > > >> I agree that the current patch might have some corner-cases where it > > > > >> does not guarantee 100% accuracy in online mode, but I hope the current > > > > >> version at least has no more false negatives. > > > > > > > > > >False positives are *bad*. We shouldn't integrate code that has them. > > > > > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode > > > > communicate with the server, which would presumably address these issues. > > > > Can someone explain why not to do that? > > > > > > I agree that this effort seems better spent on fixing those issues there > > > (of which many are the same), and then re-use that. > > > > This really seems like it depends on which of the options we're talking > > about.. Connecting to the server and asking what the current insert > > point is, so we can check that the LSN isn't completely insane, seems > > reasonable, but at least one option being discussed was to have > > pg_basebackup actually *lock the page* (even if just for I/O..) and then > > re-read it, and having an external tool doing that instead of the > > backend seems like a whole different level to me. That would involve > > having an SQL function for "lock this page against I/O" and then another > > for "unlock this page", wouldn't it? > > Right. > > But what if we just added a flag to the BASE_BACKUP command in the > replication protocol that said "meh, I really just want to verify the > checksums, so please send the data to devnull and only feed me regular > status updates on this connection"? I don't know whether BASE_BACKUP is the best interface for that (at least right now) - backend/replication/basebackup.c's sendFile() gets only an absolute filename to send, which is not adequate for more in- depth server-based things like locking a particular page in a particular relation of some particular tablespace. ISTM that the fact that we had to teach it about different segment files for checksum verification by splitting up the filename at "." implies that it is not the correct level of abstraction (but maybe it could get schooled some more about Postgres internals, e.g. by passing it a RefFileNode struct and not a filename). Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Fri, Mar 29, 2019 at 10:08 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > >
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> >
> > This really seems like it depends on which of the options we're talking
> > about.. Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me. That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
>
> Right.
>
> But what if we just added a flag to the BASE_BACKUP command in the
> replication protocol that said "meh, I really just want to verify the
> checksums, so please send the data to devnull and only feed me regular
> status updates on this connection"?
I don't know whether BASE_BACKUP is the best interface for that (at
least right now) - backend/replication/basebackup.c's sendFile() gets
only an absolute filename to send, which is not adequate for more in-
depth server-based things like locking a particular page in a particular
relation of some particular tablespace.
ISTM that the fact that we had to teach it about different segment files
for checksum verification by splitting up the filename at "." implies
that it is not the correct level of abstraction (but maybe it could get
schooled some more about Postgres internals, e.g. by passing it a
RefFileNode struct and not a filename).
But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if we fix it there, we only have to fix it once...
//Magnus
Hi, On 2019-03-30 12:56:21 +0100, Magnus Hagander wrote: > > ISTM that the fact that we had to teach it about different segment files > > for checksum verification by splitting up the filename at "." implies > > that it is not the correct level of abstraction (but maybe it could get > > schooled some more about Postgres internals, e.g. by passing it a > > RefFileNode struct and not a filename). > > > > But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if > we fix it there, we only have to fix it once... I'm not understanding the problem here. We already need to know all of this? sendFile() determines whether the file is checksummed, and computes the segment number: if (is_checksummed_file(readfilename, filename)) { verify_checksum = true; ... checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE); phdr = (PageHeader) page; I agree that the way checksumming works is a bit of a layering violation. In my opinion it belongs in the smgr level, not bufmgr.c etc, so different storage methods can store it differently. But that seems fairly independent of this problem. Greetings, Andres Freund
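For context, the filename splitting being discussed is a small amount of code; from memory (so treat the exact shape as an assumption), sendFile() recovers the segment number from names like "16384.2" along these lines:

#include <stdlib.h>
#include <string.h>

/* sketch: derive the segment number from a relation file name ("16384.2" -> 2) */
static int
parse_segment_number(const char *filename)
{
    const char *segmentpath = strstr(filename, ".");

    return (segmentpath != NULL) ? atoi(segmentpath + 1) : 0;
}

The block number handed to pg_checksum_page() is then blkno + segmentno * RELSEG_SIZE, as in the snippet Andres quoted.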
Hi, Am Mittwoch, den 27.03.2019, 11:37 +0100 schrieb Michael Banck: > Am Dienstag, den 26.03.2019, 19:23 +0100 schrieb Michael Banck: > > Am Dienstag, den 26.03.2019, 10:30 -0700 schrieb Andres Freund: > > > On 2019-03-26 18:22:55 +0100, Michael Banck wrote: > > > > /* > > > > - * Only check pages which have not been modified since the > > > > - * start of the base backup. Otherwise, they might have been > > > > - * written only halfway and the checksum would not be valid. > > > > - * However, replaying WAL would reinstate the correct page in > > > > - * this case. We also skip completely new pages, since they > > > > - * don't have a checksum yet. > > > > + * We skip completely new pages after checking they are > > > > + * all-zero, since they don't have a checksum yet. > > > > */ > > > > - if (!PageIsNew(page) && PageGetLSN(page) < startptr) > > > > + if (PageIsNew(page)) > > > > { > > > > - checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE); > > > > - phdr = (PageHeader) page; > > > > - if (phdr->pd_checksum != checksum) > > > > + all_zeroes = true; > > > > + pagebytes = (size_t *) page; > > > > + for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++) > > > > > > Can we please abstract the zeroeness check into a separate function to > > > be used both by PageIsVerified() and this? > > > > Ok, done so as PageIsZero further down in bufpage.c. > > It turns out that pg_checksums (current master and back branches, not > just the online version) needs this treatment as well as it won't catch > zeroed-out pageheader corruption, see attached patch to its TAP tests > which trigger it (I also added a random data check similar to > pg_basebackup as well which is not a problem for the current codebase). > > Any suggestion on how to handle this? Should I duplicate the > PageIsZero() code in pg_checksums? Should I move PageIsZero into > something like bufpage_impl.h for use by external programs, similar to > pg_checksum_page()? > > I've done the latter as a POC in the second attached patch. This is still an open item for the back branches I guess, i.e. zero page header for pg_verify_checksums and additionally random page header for pg_basebackup's base backup. Do you plan to work on the patch you have outlined, what would I need to change in the patches I submitted or is another approach warranted entirely? Should I add my patches to the next commitfest in order to track them? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote: > This is still an open item for the back branches I guess, i.e. zero page > header for pg_verify_checksums and additionally random page header for > pg_basebackup's base backup. I may be missing something, but could you add an entry in the future commit fest about the stuff discussed here? I have not looked at your patch closely, sorry. -- Michael
Hi, Am Samstag, den 04.05.2019, 21:50 +0900 schrieb Michael Paquier: > On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote: > > This is still an open item for the back branches I guess, i.e. zero page > > header for pg_verify_checksums and additionally random page header for > > pg_basebackup's base backup. > > I may be missing something, but could you add an entry in the future > commit fest about the stuff discussed here? I have not looked at your > patch closely.. Sorry. Here is finally a rebased patch for the (IMO) more important issue in pg_basebackup. I've added a commitfest entry for this now: https://commitfest.postgresql.org/25/2308/ Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Samstag, den 04.05.2019, 21:50 +0900 schrieb Michael Paquier:
> On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> > This is still an open item for the back branches I guess, i.e. zero page
> > header for pg_verify_checksums and additionally random page header for
> > pg_basebackup's base backup.
>
> I may be missing something, but could you add an entry in the future
> commit fest about the stuff discussed here? I have not looked at your
> patch closely.. Sorry.
Here is finally a rebased patch for the (IMO) more important issue in
pg_basebackup. I've added a commitfest entry for this now:
https://commitfest.postgresql.org/25/2308/
The patch does not seem to apply anymore, can you rebase it?
--
Asif Rehman
Hi, Am Dienstag, den 25.02.2020, 19:34 +0500 schrieb Asif Rehman: > On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael.banck@credativ.de> wrote: > > Here is finally a rebased patch for the (IMO) more important issue in > > pg_basebackup. I've added a commitfest entry for this now: > > https://commitfest.postgresql.org/25/2308/ > > The patch does not seem to apply anymore, can you rebase it? Thanks for letting me know, please find attached a rebased version. I hope the StaticAssertDecl() is still correct in bufpage.h. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

The patch applies cleanly and works as expected. Just a few minor observations:

- I would suggest refactoring the PageIsZero function by getting rid of the all_zeroes variable and simply returning false when a non-zero byte is found, rather than setting all_zeroes to false and breaking out of the for loop. The function should simply return true at the end otherwise.

- Remove the empty line:
+ * would throw an assertion failure. Consider this a
+ * checksum failure.
+ */
+
+ checksum_failures++;

- Code needs to run through pgindent.

Also, I'd suggest making "5" a define within the current file/function, perhaps something like "MAX_CHECKSUM_FAILURES". You could move the second warning outside the conditional statement as it appears in both "if" and "else" blocks.

Regards,
--Asif

The new status of this patch is: Waiting on Author
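In other words, the suggested shape of PageIsZero in bufpage.c would be roughly (a sketch, assuming the signature from the patch under review):

bool
PageIsZero(Page page)
{
    size_t     *pagebytes = (size_t *) page;

    for (int i = 0; i < BLCKSZ / sizeof(size_t); i++)
    {
        if (pagebytes[i] != 0)
            return false;
    }

    return true;
}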
Hi, thanks for reviewing this patch! Am Donnerstag, den 27.02.2020, 10:57 +0000 schrieb Asif Rehman: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: not tested > > The patch applies cleanly and works as expected. Just a few minor observations: > > - I would suggest refactoring PageIsZero function by getting rid of all_zeroes variable > and simply returning false when a non-zero byte is found, rather than setting all_zeros > variable to false and breaking the for loop. The function should simply return true at the > end otherwise. Good point, I have done so. > - Remove the empty line: > + * would throw an assertion failure. Consider this a > + * checksum failure. > + */ > + > + checksum_failures++; Done > - Code needs to run through pgindent. Done. > Also, I'd suggest to make "5" a define within the current file/function, perhaps > something like "MAX_CHECKSUM_FAILURES". You could move the second > warning outside the conditional statement as it appears in both "if" and "else" blocks. Well, I think you have a valid point, but that would be a different (non bug-fix) patch as this part is not changed by this patch, but code is at most moved around, is it? New version attached. Best regards, Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Michael Banck <michael.banck@credativ.de> writes: > [ 0001-Fix-checksum-verification-in-base-backups-for-random_V3.patch ] I noticed that the cfbot wasn't testing this because of a minor merge conflict. I rebased it over that, and also readjusted things a little bit to avoid unnecessarily reindenting existing code, in hopes of making the patch easier to review. Doing that reveals that the patch actually removes a chunk of code, namely a special case for EOF. Was that intentional, or a result of a faulty merge earlier? It certainly isn't mentioned in your proposed commit message. Another thing that's bothering me is that the patch compares page LSN against GetInsertRecPtr(); but that function says * NOTE: The value *actually* returned is the position of the last full * xlog page. It lags behind the real insert position by at most 1 page. * For that, we don't need to scan through WAL insertion locks, and an * approximation is enough for the current usage of this function. I'm not convinced that an approximation is good enough here. It seems like a page that's just now been updated could have an LSN beyond the current XLOG page start, potentially leading to a false checksum complaint. Maybe we could address that by adding one xlog page to the GetInsertRecPtr result? Kind of a hack, but ... regards, tom lane diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 5d94b9c..c7ff9a8 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -2028,15 +2028,47 @@ sendFile(const char *readfilename, const char *tarfilename, page = buf + BLCKSZ * i; /* - * Only check pages which have not been modified since the - * start of the base backup. Otherwise, they might have been - * written only halfway and the checksum would not be valid. - * However, replaying WAL would reinstate the correct page in - * this case. We also skip completely new pages, since they - * don't have a checksum yet. + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. */ - if (!PageIsNew(page) && PageGetLSN(page) < startptr) + if (PageIsNew(page)) { + if (!PageIsZero(page)) + { + /* + * pd_upper is zero, but the page is not all zero. We + * cannot run pg_checksum_page() on the page as it + * would throw an assertion failure. Consider this a + * checksum failure. + */ + checksum_failures++; + + if (checksum_failures <= 5) + ereport(WARNING, + (errmsg("checksum verification failed in " + "file \"%s\", block %d: pd_upper " + "is zero but page is not all-zero", + readfilename, blkno))); + if (checksum_failures == 5) + ereport(WARNING, + (errmsg("further checksum verification " + "failures in file \"%s\" will not " + "be reported", readfilename))); + } + } + else if (PageGetLSN(page) < startptr || + PageGetLSN(page) > GetInsertRecPtr()) + { + /* + * Only check pages which have not been modified since the + * start of the base backup. Otherwise, they might have + * been written only halfway and the checksum would not be + * valid. However, replaying WAL would reinstate the + * correct page in this case. If the page LSN is larger + * than the current insert pointer then we assume a bogus + * LSN due to random page header corruption and do verify + * the checksum. 
+ */ checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE); phdr = (PageHeader) page; if (phdr->pd_checksum != checksum) @@ -2064,20 +2096,6 @@ sendFile(const char *readfilename, const char *tarfilename, if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ) { - /* - * If we hit end-of-file, a concurrent - * truncation must have occurred, so break out - * of this loop just as if the initial fread() - * returned 0. We'll drop through to the same - * code that handles that case. (We must fix - * up cnt first, though.) - */ - if (feof(fp)) - { - cnt = BLCKSZ * i; - break; - } - ereport(ERROR, (errcode_for_file_access(), errmsg("could not reread block %d of file \"%s\": %m", diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index d708117..2dc8322 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -82,11 +82,8 @@ bool PageIsVerified(Page page, BlockNumber blkno) { PageHeader p = (PageHeader) page; - size_t *pagebytes; - int i; bool checksum_failure = false; bool header_sane = false; - bool all_zeroes = false; uint16 checksum = 0; /* @@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno) } /* Check all-zeroes case */ - all_zeroes = true; - pagebytes = (size_t *) page; - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) - { - if (pagebytes[i] != 0) - { - all_zeroes = false; - break; - } - } - - if (all_zeroes) + if (PageIsZero(page)) return true; /* @@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno) return false; } +/* + * PageIsZero + * Check that the page consists only of zero bytes. + * + */ +bool +PageIsZero(Page page) +{ + int i; + size_t *pagebytes = (size_t *) page; + + for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) + { + if (pagebytes[i] != 0) + return false; + } + + return true; +} /* * PageAddItemExtended diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index 6338176..598453e 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -6,7 +6,7 @@ use File::Basename qw(basename dirname); use File::Path qw(rmtree); use PostgresNode; use TestLib; -use Test::More tests => 109; +use Test::More tests => 112; program_help_ok('pg_basebackup'); program_version_ok('pg_basebackup'); @@ -497,21 +497,37 @@ my $file_corrupt2 = $node->safe_psql('postgres', my $pageheader_size = 24; my $block_size = $node->safe_psql('postgres', 'SHOW block_size;'); -# induce corruption +# induce corruption in the pageheader by writing random data into it system_or_bail 'pg_ctl', '-D', $pgdata, 'stop'; open $file, '+<', "$pgdata/$file_corrupt1"; -seek($file, $pageheader_size, 0); -syswrite($file, "\0\0\0\0\0\0\0\0\0"); +my $random_data = join '', map { ("a".."z")[rand 26] } 1 .. 
$pageheader_size; +syswrite($file, $random_data); +close $file; +system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; + +$node->command_checks_all( + [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt1" ], + 1, + [qr{^$}], + [qr/^WARNING.*checksum verification failed/s], + "pg_basebackup reports checksum mismatch for random pageheader data"); +rmtree("$tempdir/backup_corrupt1"); + +# zero out the pageheader completely +open $file, '+<', "$pgdata/$file_corrupt1"; +system_or_bail 'pg_ctl', '-D', $pgdata, 'stop'; +my $zero_data = "\0"x$pageheader_size; +syswrite($file, $zero_data); close $file; system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( - [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], + [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt1a" ], 1, [qr{^$}], [qr/^WARNING.*checksum verification failed/s], - 'pg_basebackup reports checksum mismatch'); -rmtree("$tempdir/backup_corrupt"); + "pg_basebackup reports checksum mismatch for zeroed pageheader"); +rmtree("$tempdir/backup_corrupt1a"); # induce further corruption in 5 more blocks system_or_bail 'pg_ctl', '-D', $pgdata, 'stop'; diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h index 3f88683..a1fcb21 100644 --- a/src/include/storage/bufpage.h +++ b/src/include/storage/bufpage.h @@ -419,17 +419,18 @@ do { \ ((is_heap) ? PAI_IS_HEAP : 0)) /* - * Check that BLCKSZ is a multiple of sizeof(size_t). In PageIsVerified(), - * it is much faster to check if a page is full of zeroes using the native - * word size. Note that this assertion is kept within a header to make - * sure that StaticAssertDecl() works across various combinations of - * platforms and compilers. + * Check that BLCKSZ is a multiple of sizeof(size_t). In PageIsZero(), it is + * much faster to check if a page is full of zeroes using the native word size. + * Note that this assertion is kept within a header to make sure that + * StaticAssertDecl() works across various combinations of platforms and + * compilers. */ StaticAssertDecl(BLCKSZ == ((BLCKSZ / sizeof(size_t)) * sizeof(size_t)), "BLCKSZ has to be a multiple of sizeof(size_t)"); extern void PageInit(Page page, Size pageSize, Size specialSize); extern bool PageIsVerified(Page page, BlockNumber blkno); +extern bool PageIsZero(Page page); extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size, OffsetNumber offsetNumber, int flags); extern Page PageGetTempPage(Page page);
I wrote: > Another thing that's bothering me is that the patch compares page LSN > against GetInsertRecPtr(); but that function says > ... > I'm not convinced that an approximation is good enough here. It seems > like a page that's just now been updated could have an LSN beyond the > current XLOG page start, potentially leading to a false checksum > complaint. Maybe we could address that by adding one xlog page to > the GetInsertRecPtr result? Kind of a hack, but ... Actually, after thinking about that a bit more: why is there an LSN-based special condition at all? It seems like it'd be far more useful to checksum everything, and on failure try to re-read and re-verify the page once or twice, so as to handle the corner case where we examine a page that's in process of being overwritten. regards, tom lane
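Sketched out, the "checksum everything, re-read and re-verify on failure" approach for a frontend scanner could look something like the following. This is an illustration only, with a made-up retry count and sleep; fd is assumed to be the open segment file and blkno the block number within that segment.

#include "postgres_fe.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include <unistd.h>

/* sketch: verify one block, retrying a couple of times on mismatch */
static bool
verify_block_with_retries(int fd, BlockNumber blkno, int segmentno)
{
    PGAlignedBlock buf;

    for (int attempt = 0; attempt < 3; attempt++)
    {
        if (pread(fd, buf.data, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return false;       /* short read; let the caller decide */

        if (PageIsNew(buf.data))
            return true;        /* no checksum yet (all-zero check still needed) */

        if (pg_checksum_page(buf.data, blkno + segmentno * RELSEG_SIZE) ==
            ((PageHeader) buf.data)->pd_checksum)
            return true;

        usleep(100 * 1000);     /* give a concurrent writer time to finish the 8kB write */
    }

    return false;               /* still failing after re-reads: report it */
}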
Hi, Am Montag, den 06.04.2020, 16:45 -0400 schrieb Tom Lane: > I wrote: > > Another thing that's bothering me is that the patch compares page LSN > > against GetInsertRecPtr(); but that function says > > ... > > I'm not convinced that an approximation is good enough here. It seems > > like a page that's just now been updated could have an LSN beyond the > > current XLOG page start, potentially leading to a false checksum > > complaint. Maybe we could address that by adding one xlog page to > > the GetInsertRecPtr result? Kind of a hack, but ... I was about to write that it sounds like a pragmatic solution to me, but... > Actually, after thinking about that a bit more: why is there an LSN-based > special condition at all? It seems like it'd be far more useful to > checksum everything, and on failure try to re-read and re-verify the page > once or twice, so as to handle the corner case where we examine a page > that's in process of being overwritten. Andres outlined something about a year ago which on re-reading sounds similar to what you suggest above in 20190326170820.6sylklg7eh6uhabd@alap3.anarazel.de but never posted a full patch. He seems to have had a few additional checks from PageIsVerified() in mind, though. The original check against the checkpoint LSN wasn't suggested by me; I've submitted this patch with the InsertRecPtr as an upper bound as a *(presumably) minimal-invasive patch which could be back-patched (when nothing came of the above thread for a while), but the issue seems to be quite a bit nuanced. Probably we need to take a step back; the question is whether something like what Andres suggested should/could be coded up for v13 still (before the feature freeze) and if so, by whom (I won't have the time), or whether it would still qualify as a back-patchable bug-fix and/or whether your suggestion above would. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
> On 6 Apr 2020, at 23:15, Michael Banck <michael.banck@credativ.de> wrote: > Probably we need to take a step back; This patch has been Waiting on Author since the last commitfest (and no longer applies as well), and by the sounds of the thread there are some open issues with it. Should it be Returned with Feedback to be re-opened with a fresh take on it? cheers ./daniel
> On 5 Jul 2020, at 13:52, Daniel Gustafsson <daniel@yesql.se> wrote: > >> On 6 Apr 2020, at 23:15, Michael Banck <michael.banck@credativ.de> wrote: > >> Probably we need to take a step back; > > This patch has been Waiting on Author since the last commitfest (and no longer > applies as well), and by the sounds of the thread there are some open issues > with it. Should it be Returned with Feedback to be re-opened with a fresh take > on it? Marked as Returned with Feedback, please open a new entry in case there is a renewed interest with a new patch. cheers ./daniel
Hi, Am Dienstag, den 20.10.2020, 18:11 +0900 schrieb Michael Paquier: > On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote: > > Actually, after thinking about that a bit more: why is there an LSN-based > > special condition at all? It seems like it'd be far more useful to > > checksum everything, and on failure try to re-read and re-verify the page > > once or twice, so as to handle the corner case where we examine a page > > that's in process of being overwritten. > > I was reviewing this area today, and that actually matches my > impression. Why do we need a LSN-based check at all? As said > upthread, that's of course weak with random data as we would miss most > of the real checksum failures, with odds getting better depending on > the current LSN of the cluster moving on. However, it seems to me > that we would have an extra advantage in removing this check > all together: it would be possible to check for pages even if these > are more recent than the start LSN of the backup, and that could be a > lot of pages that could be checked on a large cluster. So by keeping > this check we also delay the detection of real problems. The check was ported (or the concept of it adapted) from pgBackRest if I remember correctly. > As things stand, I'd like to think that it would be much more useful > to remove this check and to have one or two extra retries (the current > code only has one). I don't like much the possibility of false > positives for such critical checks, but as we need to live with what > has been released, that looks like a good move for stable branches. Sounds good to me. I think some were advocating for locking the page before re-reading. When I looked at it, the level of abstraction that pg_basebackup has (just a list of files chopped up into blocks, no notion of relations I think) made that non-trivial, but maybe still possible for v14 and beyond. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> I was referring to the patch I sent on this thread that fixes the
> detection of a corruption for the zero-only case and where pd_lsn
> and/or pg_upper are trashed by a corruption of the page header. Both
> cases allow a base backup to complete on HEAD, while sending pages
> that could be corrupted, which is wrong. Once you make the page
> verification rely only on pd_checksum, as the patch does because the
> checksum is the only source of truth in the page header, corrupted
> pages are correctly detected, causing pg_basebackup to complain as it
> should. However, it has also the risk to cause pg_basebackup to fail
> *and* to report as broken pages that are in the process of being
> written, depending on how slow a disk is able to finish a 8kB write.
> That's a different kind of wrongness, and users have two more reasons
> to be pissed. Note that if a page is found as torn we have a
> consistent page header, meaning that on HEAD the PageIsNew() and
> PageGetLSN() would pass, but the checksum verification would fail as
> the contents at the end of the page does not match the checksum.
Magnus, as the original committer of 4eb77d5, do you have an opinion
to share?
I admit that I at some point lost track of the overlapping threads around this, and just figured there was enough different checksum-involved-people on those threads to handle it :) Meaning the short answer is "no, I don't really have one at this point".
Slightly longer comment is that it does seem reasonable, but I have not read in on all the different issues discussed over the whole thread, so take that as a weak-certainty comment.
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote: > On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote: >> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote: >>> I was referring to the patch I sent on this thread that fixes the >>> detection of a corruption for the zero-only case and where pd_lsn >>> and/or pg_upper are trashed by a corruption of the page header. Both >>> cases allow a base backup to complete on HEAD, while sending pages >>> that could be corrupted, which is wrong. Once you make the page >>> verification rely only on pd_checksum, as the patch does because the >>> checksum is the only source of truth in the page header, corrupted >>> pages are correctly detected, causing pg_basebackup to complain as it >>> should. However, it has also the risk to cause pg_basebackup to fail >>> *and* to report as broken pages that are in the process of being >>> written, depending on how slow a disk is able to finish a 8kB write. >>> That's a different kind of wrongness, and users have two more reasons >>> to be pissed. Note that if a page is found as torn we have a >>> consistent page header, meaning that on HEAD the PageIsNew() and >>> PageGetLSN() would pass, but the checksum verification would fail as >>> the contents at the end of the page does not match the checksum. >> >> Magnus, as the original committer of 4eb77d5, do you have an opinion >> to share? >> > > I admit that I at some point lost track of the overlapping threads around > this, and just figured there was enough different checksum-involved-people > on those threads to handle it :) Meaning the short answer is "no, I don't > really have one at this point". > > Slightly longer comment is that it does seem reasonable, but I have not > read in on all the different issues discussed over the whole thread, so > take that as a weak-certainty comment. Which part are you considering as reasonable? The removal-feature part on a stable branch or perhaps something else? -- Michael
On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pg_upper are trashed by a corruption of the page header. Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong. Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should. However, it has also the risk to cause pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slow a disk is able to finish a 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed. Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() would pass, but the checksum verification would fail as
>>> the contents at the end of the page does not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
>
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
>
> Slightly longer comment is that it does seem reasonable, but I have not
> read in on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.
Which part are you considering as reasonable? The removal-feature
part on a stable branch or perhaps something else?
I was referring to the latest patch on the thread. But as I said, I have not read up on all the different issues raised in the thread, so take it with a big grain of salt.
And I would also echo the previous comment that this code was adapted from what the pgbackrest folks do. As such, it would be good to get a comment from for example David on that -- I don't see any of them having commented after that was mentioned?
On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> I was referring to the latest patch on the thread. But as I said, I have
> not read up on all the different issues raised in the thread, so take it
> with a big grain of salt.
>
> And I would also echo the previous comment that this code was adapted from
> what the pgbackrest folks do. As such, it would be good to get a comment
> from for example David on that -- I don't see any of them having commented
> after that was mentioned?

Agreed. I am adding Stephen as well in CC. From the code of backrest, the
same logic happens in src/command/backup/pageChecksum.c (see
pageChecksumProcess), where two checks on pd_upper and pd_lsn happen
before verifying the checksum. So, if the page header finishes with random
junk because of some kind of corruption, even corrupted pages would be
incorrectly considered as correct if the random data passes the pd_upper
and pd_lsn checks :/
--
Michael
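To make that failure mode concrete, here is a small, self-contained C sketch. The struct layout, the toy checksum and the function names are simplified stand-ins invented for illustration, not the real PostgreSQL or pgBackRest code. A verification gated on header sanity accepts a page without ever computing its checksum whenever corruption happens to leave pd_upper at zero or push pd_lsn past the cutoff; a checksum-only verification does not:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Simplified stand-in for PageHeaderData; not the real on-disk layout. */
typedef struct
{
    uint64_t pd_lsn;        /* LSN of last change, as one integer */
    uint16_t pd_checksum;   /* page checksum */
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
} SketchPageHeader;

/* Toy checksum for the sketch only; stands in for pg_checksum_page(). */
static uint16_t
sketch_checksum_page(const char *page, uint32_t blkno)
{
    uint32_t sum = blkno;

    for (int i = 0; i < BLCKSZ; i++)
    {
        if (i == 8 || i == 9)   /* skip the on-page checksum field itself */
            continue;
        sum = sum * 31 + (uint8_t) page[i];
    }
    return (uint16_t) (sum ^ (sum >> 16));
}

/*
 * Header-gated verification: the checksum is only consulted when the header
 * "looks sane".  Corruption that leaves pd_upper at 0 or pushes pd_lsn past
 * the cutoff is silently accepted.
 */
bool
verify_header_gated(const char *page, uint32_t blkno, uint64_t lsn_cutoff)
{
    SketchPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    if (hdr.pd_upper == 0)          /* treated as a new, empty page */
        return true;
    if (hdr.pd_lsn >= lsn_cutoff)   /* treated as "too recent to check" */
        return true;
    return sketch_checksum_page(page, blkno) == hdr.pd_checksum;
}

/* Checksum-only verification: pd_checksum is the single source of truth. */
bool
verify_checksum_only(const char *page, uint32_t blkno)
{
    SketchPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    return sketch_checksum_page(page, blkno) == hdr.pd_checksum;
}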
Hi Michael, On 11/20/20 2:28 AM, Michael Paquier wrote: > On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote: >> I was referring to the latest patch on the thread. But as I said, I have >> not read up on all the different issues raised in the thread, so take it >> with a big grain os salt. >> >> And I would also echo the previous comment that this code was adapted from >> what the pgbackrest folks do. As such, it would be good to get a comment >> from for example David on that -- I don't see any of them having commented >> after that was mentioned? > > Agreed. I am adding Stephen as well in CC. From the code of > backrest, the same logic happens in src/command/backup/pageChecksum.c > (see pageChecksumProcess), where two checks on pd_upper and pd_lsn > happen before verifying the checksum. So, if the page header finishes > with random junk because of some kind of corruption, even corrupted > pages would be incorrectly considered as correct if the random data > passes the pd_upper and pg_lsn checks :/ Indeed, this is not good, as Andres pointed out some time ago. My apologies for not getting to this sooner. Our current plan for pgBackRest: 1) Remove the LSN check as you have done in your patch and when rechecking see if the page has become valid *or* the LSN is ascending. 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it is valid. These do completely rule out any type of corruption, but they certainly narrows the possibility by a lot. In the future we would also like to scan the WAL to verify that the page is definitely being written to. As for your patch, it mostly looks good but my objection is that a page may be reported as invalid after 5 retries when in fact it may just be very hot. Maybe checking for an ascending LSN is a good idea there as well? At least in that case we could issue a different warning, instead of "checksum verification failed" perhaps "checksum verification skipped due to concurrent modifications". Regards, -- -David david@pgmasters.net
Greetings, * David Steele (david@pgmasters.net) wrote: > On 11/20/20 2:28 AM, Michael Paquier wrote: > >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote: > >>I was referring to the latest patch on the thread. But as I said, I have > >>not read up on all the different issues raised in the thread, so take it > >>with a big grain os salt. > >> > >>And I would also echo the previous comment that this code was adapted from > >>what the pgbackrest folks do. As such, it would be good to get a comment > >>from for example David on that -- I don't see any of them having commented > >>after that was mentioned? > > > >Agreed. I am adding Stephen as well in CC. From the code of > >backrest, the same logic happens in src/command/backup/pageChecksum.c > >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn > >happen before verifying the checksum. So, if the page header finishes > >with random junk because of some kind of corruption, even corrupted > >pages would be incorrectly considered as correct if the random data > >passes the pd_upper and pg_lsn checks :/ > > Indeed, this is not good, as Andres pointed out some time ago. My apologies > for not getting to this sooner. Yeah, it's been on our backlog to improve this. > Our current plan for pgBackRest: > > 1) Remove the LSN check as you have done in your patch and when rechecking > see if the page has become valid *or* the LSN is ascending. > 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it > is valid. Yup, that's my recollection also as to our plans for how to improve things here. > These do completely rule out any type of corruption, but they certainly > narrows the possibility by a lot. *don't :) > In the future we would also like to scan the WAL to verify that the page is > definitely being written to. Yeah, that'd certainly be nice to do too. > As for your patch, it mostly looks good but my objection is that a page may > be reported as invalid after 5 retries when in fact it may just be very hot. Yeah.. while unlikely that it'd actually get written out that much, it does seem at least possible. > Maybe checking for an ascending LSN is a good idea there as well? At least > in that case we could issue a different warning, instead of "checksum > verification failed" perhaps "checksum verification skipped due to > concurrent modifications". +1. Thanks, Stephen
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (david@pgmasters.net) wrote:
>> Our current plan for pgBackRest:
>>
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
>
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrows the possibility by a lot.
>
> *don't :)

Have you considered the possibility of only using pd_checksums for the
validation? This is the only source of truth in the page header we can
rely on to validate the full contents of the page, so if the logic relies
on anything but the checksum then you expose the logic to risks of
reporting pages as corrupted while they were just torn, or just miss
corrupted pages, which is what we should avoid for such things. Both are
bad.

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
>
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
>
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
>
> +1.

I don't quite understand how you can make sure that the page is not
corrupted here? It could be possible that the last 4kB of an 8kB page got
corrupted, where the header had valid data but the checksum verification
failed. So if you are not careful you could have at hand a corrupted page
discarded because it failed the retry multiple times in a row. The only
method I can think as being really reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents of
  the page.
- When doing a retry, make sure that there is no concurrent I/O activity
  in the shared buffers. This requires an API we don't have yet.
--
Michael
On 21.11.2020 04:30, Michael Paquier wrote: > The only method I can think as being really > reliable is based on two facts: > - Do a check only on pd_checksums, as that validates the full contents > of the page. > - When doing a retry, make sure that there is no concurrent I/O > activity in the shared buffers. This requires an API we don't have > yet. It seems reasonable to me to rely on checksums only. As for retry, I think that API for concurrent I/O will be complicated. Instead, we can introduce a function to read the page directly from shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof solution to me. Do you see any possible problems with it? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele (david@pgmasters.net) wrote:
> >> Our current plan for pgBackRest:
> >>
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> >
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrows the possibility by a lot.
> >
> > *don't :)
>
> Have you considered the possibility of only using pd_checksums for the
> validation? This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose the logic to risks
> of reporting pages as corrupted while they were just torn, or just
> miss corrupted pages, which is what we should avoid for such things.
> Both are bad.

There's no doubt that you'll get checksum failures from time to time, and
that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what. It's not at all clear what you're suggesting here as
to how you can use 'only' the checksum.

> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very hot.
> >
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> >
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> >
> > +1.
>
> I don't quite understand how you can make sure that the page is not
> corrupted here? It could be possible that the last 4kB of an 8kB page
> got corrupted, where the header had valid data but the checksum
> verification failed.

Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:

- read(), save the LSN seen
- calculate checksum - get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum

If checksum fails again AND the LSN has changed and increased (and perhaps
otherwise seems reasonable) then we have at least a bit more confidence
that the failing checksum is due to the page being rewritten concurrently
and not due to latent storage corruption, which is the specific
distinction that we're trying to discern here.

> So if you are not careful you could have at
> hand a corrupted page discarded because it failed the retry
> multiple times in a row.

The point of checking for an ascending LSN is to see if the page is being
concurrently modified. If it is, then we actually don't care if the page
is corrupted because it's going to be rewritten during WAL replay as part
of the restore process.

> The only method I can think as being really
> reliable is based on two facts:
> - Do a check only on pd_checksums, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers. This requires an API we don't have
> yet.
I don't think we actually want the backup process to start locking pages,
which it seems like is what you're suggesting here..? Trying to do a check
without a lock and without having PG end up reading the page back in if it
had been evicted due to pressure seems likely to be hard to do reliably
and without race conditions complicating things.

The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures for
pages that we know are in the WAL with FPIs. Unfortunately, reading WAL
for all different versions of PG is a fair bit of work and we haven't
quite gotten to biting that off yet (though it's on the roadmap), and the
core code certainly doesn't help us in that regard since any given version
only supports the current major version WAL (an issue pg_basebackup would
also have to deal with, were it to be modified to use such an approach and
to continue working with older versions of PG..).

In a similar vein to what we do (in pgbackrest) with pg_control, we expect
to develop our own library basically vendorizing WAL reading code from all
the major versions of PG which we support in order to track FPIs, restore
points, all the kinds of potential recovery targets, and other useful
information.

Thanks,

Stephen
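As a strawman, the per-page retry described above could look roughly like the following (plain C, reusing the BLCKSZ, SketchPageHeader and sketch_checksum_page() stand-ins from the sketch earlier in the thread; read_block() and the verdict names are likewise made up for the example):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
    PAGE_OK,                /* checksum verified */
    PAGE_CONCURRENT_WRITE,  /* failed, but the LSN advanced: treated as torn */
    PAGE_SUSPECT            /* failed twice with no LSN movement */
} PageVerdict;

/* Read one BLCKSZ block at the given block number; false on a short read. */
static bool
read_block(FILE *fp, uint32_t blkno, char *buf)
{
    if (fseek(fp, (long) blkno * BLCKSZ, SEEK_SET) != 0)
        return false;
    return fread(buf, 1, BLCKSZ, fp) == BLCKSZ;
}

/*
 * Verify one block with a single re-read on failure, classifying a failure
 * whose re-read shows a larger LSN as a concurrent write rather than
 * corruption.
 */
PageVerdict
verify_block_with_retry(FILE *fp, uint32_t blkno)
{
    char        page[BLCKSZ];
    SketchPageHeader hdr;
    uint64_t    first_lsn;

    if (!read_block(fp, blkno, page))
        return PAGE_SUSPECT;        /* short read; simplified handling */
    memcpy(&hdr, page, sizeof(hdr));
    if (sketch_checksum_page(page, blkno) == hdr.pd_checksum)
        return PAGE_OK;

    first_lsn = hdr.pd_lsn;         /* remember the LSN we saw */

    /* Re-read and re-verify once. */
    if (!read_block(fp, blkno, page))
        return PAGE_SUSPECT;
    memcpy(&hdr, page, sizeof(hdr));
    if (sketch_checksum_page(page, blkno) == hdr.pd_checksum)
        return PAGE_OK;

    /* Still failing: an LSN that moved forward points to a torn read. */
    if (hdr.pd_lsn > first_lsn)
        return PAGE_CONCURRENT_WRITE;

    return PAGE_SUSPECT;
}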
Greetings, * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote: > On 21.11.2020 04:30, Michael Paquier wrote: > >The only method I can think as being really > >reliable is based on two facts: > >- Do a check only on pd_checksums, as that validates the full contents > >of the page. > >- When doing a retry, make sure that there is no concurrent I/O > >activity in the shared buffers. This requires an API we don't have > >yet. > > It seems reasonable to me to rely on checksums only. > > As for retry, I think that API for concurrent I/O will be complicated. > Instead, we can introduce a function to read the page directly from shared > buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof > solution to me. Do you see any possible problems with it? We might end up reading pages back in that have been evicted, for one thing, which doesn't seem great, and this also seems likely to be awkward for cases which aren't using the replication protocol, unless every process maintains a connection to PG the entire time, which also doesn't seem great. Also- what is the point of reading the page from shared buffers anyway..? All we need to do is prove that the page will be rewritten during WAL replay. If we can prove that, we don't actually care what the contents of the page are. We certainly can't calculate the checksum on a page we plucked out of shared buffers since we only calculate the checksum when we go to write the page out. Thanks, Stephen
On 23.11.2020 18:35, Stephen Frost wrote:
> Greetings,
>
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> On 21.11.2020 04:30, Michael Paquier wrote:
>>> The only method I can think as being really reliable is based on two
>>> facts:
>>> - Do a check only on pd_checksums, as that validates the full contents
>>> of the page.
>>> - When doing a retry, make sure that there is no concurrent I/O
>>> activity in the shared buffers. This requires an API we don't have yet.
>>
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from
>> shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a
>> bullet-proof solution to me. Do you see any possible problems with it?
>
> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great,

TBH, I think it is highly unlikely that the page that was just updated
will be evicted.

> and this also seems likely to be awkward for cases which aren't using
> the replication protocol, unless every process maintains a connection
> to PG the entire time, which also doesn't seem great.

Have I missed something? Now pg_basebackup has only one process + one
child process for streaming. Anyway, I totally agree with your argument.
The need to maintain connection(s) to PG is the most unpleasant part of
the proposed approach.

> Also- what is the point of reading the page from shared buffers
> anyway..?

Well... Reading a page from shared buffers is a reliable way to get a
correct page from postgres under any concurrent load. So it just seems
natural to me.

> All we need to do is prove that the page will be rewritten during WAL
> replay.

Yes and this is a tricky part. Until you have explained it in your latest
message, I wasn't sure how we can distinct concurrent update from a page
header corruption. Now I agree that if page LSN updated and increased
between rereads, it is safe enough to conclude that we have some
concurrent load.

> If we can prove that, we don't actually care what the contents of the
> page are. We certainly can't calculate the checksum on a page we plucked
> out of shared buffers since we only calculate the checksum when we go to
> write the page out.

Good point. I was thinking that we can recalculate checksum. Or even save
a page without it, as we have checked LSN and know for sure that it will
be rewritten by WAL replay.
To sum up, I agree with your proposal to reread the page and rely on ascending LSNs. Can you submit a patch?
You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch from this message https://www.postgresql.org/message-id/20201030023028.GC1693@paquier.xyz
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Greetings, * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote: > On 23.11.2020 18:35, Stephen Frost wrote: > >* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote: > >>On 21.11.2020 04:30, Michael Paquier wrote: > >>>The only method I can think as being really > >>>reliable is based on two facts: > >>>- Do a check only on pd_checksums, as that validates the full contents > >>>of the page. > >>>- When doing a retry, make sure that there is no concurrent I/O > >>>activity in the shared buffers. This requires an API we don't have > >>>yet. > >>It seems reasonable to me to rely on checksums only. > >> > >>As for retry, I think that API for concurrent I/O will be complicated. > >>Instead, we can introduce a function to read the page directly from shared > >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof > >>solution to me. Do you see any possible problems with it? > >We might end up reading pages back in that have been evicted, for one > >thing, which doesn't seem great, > TBH, I think it is highly unlikely that the page that was just updated will > be evicted. Is it though..? Consider that the page which was being written out was being done so specifically to free a page for use by another backend- while perhaps that doesn't happen all the time, it certainly happens enough on very busy systems. > >and this also seems likely to be > >awkward for cases which aren't using the replication protocol, unless > >every process maintains a connection to PG the entire time, which also > >doesn't seem great. > Have I missed something? Now pg_basebackup has only one process + one child > process for streaming. Anyway, I totally agree with your argument. The need > to maintain connection(s) to PG is the most unpleasant part of the proposed > approach. I was thinking beyond pg_basebackup, yes; apologies for that not being clear but that's what I was meaning when I said "aren't using the replication protocol". > >Also- what is the point of reading the page from shared buffers > >anyway..? > Well... Reading a page from shared buffers is a reliable way to get a > correct page from postgres under any concurrent load. So it just seems > natural to me. Yes, that's true, but if a dirty page was just written out by a backend in order to be able to evict it, so that the backend can then pull in a new page, then having pg_basebackup pull that page back in really isn't great. > >All we need to do is prove that the page will be rewritten > >during WAL replay. > Yes and this is a tricky part. Until you have explained it in your latest > message, I wasn't sure how we can distinct concurrent update from a page > header corruption. Now I agree that if page LSN updated and increased > between rereads, it is safe enough to conclude that we have some concurrent > load. Even in this case, it's almost free to compare the LSN to the starting backup LSN, and to the current LSN position, and make sure it's somewhere between the two. While that doesn't entirely eliminite the possibility that the page happened to get corrupted *and* return a different result on subsequent reads *and* that it was corrupted in such a way that the LSN ended up falling between the starting backup LSN and the current LSN, it's certainly reducing the chances of a false negative a fair bit. A concern here, however, is- can we be 100% sure that we'll get a different result from the two subsequent reads? 
For my part, at least, I've been doubtful that it's possible but it'd be
nice to hear it from someone who has really looked at the kernel side.

To try and clarify, let me illustrate: pg_basebackup (the backend that's
sending data to it anyway) starts reading an 8K page, but gets interrupted
halfway through, meaning that it's read 4K and is now paused. PG writes
that same 8K page, and is able to successfully write the entire block.
pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure. At this point the question is: if
pg_basebackup loops, seeks and re-reads the same 8K block again, is it
possible that pg_basebackup will get the "old" starting 4K and the "new"
ending 4K again?

I'd like to think that the answer is 'no' and that the kernel will
guarantee that if we managed to read a "new" ending 4K block then the
following read of the full 8K block would be guaranteed to give us the
"new" starting 4K. If that is truly guaranteed then we could be much more
confident that the idea here of simply checking for an ascending LSN,
which falls between the starting LSN of the backup and the current LSN (or
perhaps the final LSN for the backup) would be sufficient to detect this
case. I would also think that, if we can trust that, then there really
isn't any need for the delay in performing the re-read, which I have to
admit that I don't particularly care for.
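The bounds check mentioned above is cheap to express. A tiny sketch, with the function name invented for illustration and LSNs treated as plain 64-bit integers rather than the real XLogRecPtr handling:

#include <stdbool.h>
#include <stdint.h>

/*
 * A re-read of a failing page is only treated as a concurrent write when
 * its LSN falls between the LSN at which the backup started and the LSN
 * current at the time of the check (both assumed to be tracked by the
 * caller); the exact inclusiveness of the bounds is a judgment call.
 */
bool
lsn_is_plausible(uint64_t page_lsn, uint64_t backup_start_lsn,
                 uint64_t current_lsn)
{
    return page_lsn >= backup_start_lsn && page_lsn <= current_lsn;
}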
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?

It seems to me that you are missing the point here. It is not necessary to
read a page from shared buffers. What is necessary is to make sure that
there is zero concurrent I/O activity in shared buffers while a page is
getting checked on disk, giving the assurance that there is zero risk of
having a torn page for a check for anything working with shared buffers.
You could do that only on a retry if we found a page where there was a
checksum mismatch, meaning that the page was either torn or corrupted, but
needs an extra verification anyway.

> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

I don't quite see a problem in checking pages that have been just evicted
if we are able to detect faster that a page is corrupted, because the
initial check may fail because a page was torn, meaning that it was in the
middle of an eviction, but the page could also be corrupted, meaning also
that it was *not* torn, and would fail a retry where we should make sure
that there is no s_b concurrent activity. So in the worst case you make
the detection of a corrupted page faster.

Please note that Andres also mentioned the potential need to worry about
table AMs that call directly smgrwrite(), bypassing shared buffers. The
only cases in-core where it is used are related to init forks when an
unlogged relation gets created, where it would not matter if you are doing
a page check while holding a database transaction as the newly-created
relation would not be visible yet, but it would matter in the case of base
backups doing direct page lookups. Fun.

> Also- what is the point of reading the page from shared buffers
> anyway..? All we need to do is prove that the page will be rewritten
> during WAL replay. If we can prove that, we don't actually care what
> the contents of the page are. We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

A LSN-based check makes the thing tricky. How do you make sure that pd_lsn
is not itself broken? It could be perfectly possible that a random on-disk
corruption makes pd_lsn seen as having a correct value, still the rest of
the page is borked.
--
Michael
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote: > * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote: >> Yes and this is a tricky part. Until you have explained it in your latest >> message, I wasn't sure how we can distinct concurrent update from a page >> header corruption. Now I agree that if page LSN updated and increased >> between rereads, it is safe enough to conclude that we have some concurrent >> load. > > Even in this case, it's almost free to compare the LSN to the starting > backup LSN, and to the current LSN position, and make sure it's > somewhere between the two. While that doesn't entirely eliminite the > possibility that the page happened to get corrupted *and* return a > different result on subsequent reads *and* that it was corrupted in such > a way that the LSN ended up falling between the starting backup LSN and > the current LSN, it's certainly reducing the chances of a false negative > a fair bit. FWIW, I am not much a fan of designs that are not bullet-proof by design. This reduces the odds of problems, sure, still it does not discard the possibility of incorrect results, confusing users as well as people looking at such reports. >> To sum up, I agree with your proposal to reread the page and rely on >> ascending LSNs. Can you submit a patch? > > Probably would make sense to give Michael an opportunity to comment and > get his thoughts on this, and for him to update the patch if he agrees. I think that a LSN check would be a safe thing to do iff pd_checksum is already checked first to make sure that the page contents are fine to use. Still, what's the point in doing a LSN check anyway if we know that the checksum is valid? Then on a retry if the first attempt failed you also need the guarantee that there is zero concurrent I/O activity while a page is rechecked (no need to do that unless the initial page check doing a checksum match failed). So the retry needs to do some s_b interactions, but then comes the much trickier point of concurrent smgrwrite() calls bypassing the shared buffers. > As it relates to pgbackrest, we're currently contemplating having a > higher level loop which, upon detecting any page with an invalid > checksum, continues to scan to the end of that file and perform the > compression, encryption, et al, but then loops back after we've > completed that file and skips through the file again, re-reading those > pages which didn't have a valid checksum the first time to see if their > LSN has changed and is within the range of the backup. This will > certainly give more opportunity for the kernel to 'catch up', if needed, > and give us an updated page without a random 100ms delay, and will also > make it easier for us to, eventually, check and make sure the page was > in the WAL that was been produced as part of the backup, to give us a > complete guarantee that the contents of this page don't matter and that > the failed checksum isn't a sign of latent storage corruption. That would reduce the likelyhood of facing torn pages, still you cannot fully discard the problem either as a same page may get changed again once you loop over, no? And what if a corruption has updated pd_lsn on-disk? Unlikely so, still possible. -- Michael
Greetings,
On Mon, Nov 23, 2020 at 20:28 Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> Yes and this is a tricky part. Until you have explained it in your latest
>> message, I wasn't sure how we can distinct concurrent update from a page
>> header corruption. Now I agree that if page LSN updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two. While that doesn't entirely eliminite the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.
FWIW, I am not much a fan of designs that are not bullet-proof by
design. This reduces the odds of problems, sure, still it does not
discard the possibility of incorrect results, confusing users as well
as people looking at such reports.
Let’s be clear about this- our checksums are, themselves, far from bulletproof, regardless of all of our other efforts. They are not foolproof against any corruption, and certainly not even close to being sufficient for guarantees you’d expect in, say, encryption integrity. We cannot say with certainty that a page which passes checksum validation isn’t corrupted in some way. A page which doesn’t pass checksum validation may be corrupted or may be torn and we aren’t 100% sure of that either, but we can work to try and make a sensible call about which it is.
>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.
I think that a LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use. Still, what's the point in doing a LSN check anyway if we
know that the checksum is valid? Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed). So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.
I agree that the LSN check isn’t interesting if the page passes the checksum validation. I do think we can look at the LSN and make reasonable inferences based off of it even if the checksum doesn’t validate- in particular, in my experience at least, the result of a read, without any intervening write, is very likely to be the same if performed multiple times quickly even if there is latent storage corruption- due to cache’ing, if nothing else. What’s interesting about the LSN check is that we are specifically looking to see if it *changed* in a reasonable and predictable manner, and that it was replaced with a new yet reasonable value. The chances of that happening due to latent storage corruption is vanishingly small.
> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup. This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was been produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.
That would reduce the likelyhood of facing torn pages, still you
cannot fully discard the problem either as a same page may get changed
again once you loop over, no? And what if a corruption has updated
pd_lsn on-disk? Unlikely so, still possible.
We surely don’t care about a page which has been changed multiple times by PG during the backup, since all those changes will be, by definition, in the WAL, no? Therefore, one loop to see that the value of the LSN *changed*, meaning something wrote something new there, with a cross-check to see that the LSN was in the expected range, is going an awfully long way to assuring that this isn’t a case of latent storage corruption. If there is an attacker who is not the PG process but who is modifying files then, yes, that’s a risk, and won’t be picked up by this, but why would they create an invalid checksum in the first place..?
We aren’t attempting to protect against a sophisticated attack, we are trying to detect latent storage corruption.
I would also ask for a clarification as to if you feel that checking the WAL for the page to be insufficient somehow, since I mentioned that as also being on the roadmap. If there’s some reason that checking the WAL for the page wouldn’t be sufficient, I am anxious to understand that reasoning.
Thanks,
Stephen
Hi Michael,

On 11/23/20 8:10 PM, Michael Paquier wrote:
> On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
>
>> Also- what is the point of reading the page from shared buffers
>> anyway..? All we need to do is prove that the page will be rewritten
>> during WAL replay. If we can prove that, we don't actually care what
>> the contents of the page are. We certainly can't calculate the
>> checksum on a page we plucked out of shared buffers since we only
>> calculate the checksum when we go to write the page out.
>
> A LSN-based check makes the thing tricky. How do you make sure that
> pd_lsn is not itself broken? It could be perfectly possible that a
> random on-disk corruption makes pd_lsn seen as having a correct value,
> still the rest of the page is borked.

We are not just looking at one LSN value. Here are the steps we are
proposing (I'll skip checks for zero pages here):

1) Test the page checksum. If it passes the page is OK.
2) If the checksum does not pass then record the page offset and LSN and
   continue.
3) After the file is copied, reopen and reread the file, seeking to
   offsets where possible invalid pages were recorded in the first pass.
   a) If the page is now valid then it is OK.
   b) If the page is not valid but the LSN has increased from the LSN
      recorded in the previous pass then it is OK. We can infer this
      because the LSN has been updated in a way that is not consistent
      with storage corruption.

This is what we are planning for the first round of improving our page
checksum validation. We believe that doing the retry in a second pass will
be faster and more reliable because some time will have passed since the
first read without having to build in a delay for each page error.

A further improvement is to check the ascending LSNs found in 3b against
PostgreSQL to be completely sure they are valid. We are planning this for
our second round of improvements.

Reopening the file for the second pass does require some additional logic:

1) The file may have been deleted by PG since the first pass and in that
   case we won't report any page errors.
2) The file may have been truncated by PG since the first pass so we won't
   report any errors past the point of truncation.

A malicious attacker could easily trick these checks, but as Stephen
pointed out elsewhere they would likely make the checksums valid which
would escape detection anyway.

We believe that the chances of random storage corruption passing all these
checks is incredibly small, but eventually we'll also check against the
WAL to be completely sure.

Regards,
--
-David
david@pgmasters.net
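For illustration only, a rough C sketch of that second pass follows, again using the simplified SketchPageHeader and sketch_checksum_page() stand-ins from earlier in the thread rather than any real pgBackRest code; the FailedPage bookkeeping is invented for the example. It reopens the file, seeks to each recorded offset, and clears a failure if the page now verifies or its LSN has moved forward, while a missing or truncated file is treated as nothing to report:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    uint32_t blkno;         /* block that failed verification in pass one */
    uint64_t first_lsn;     /* LSN seen on the failing read */
    bool     corrupt;       /* final verdict after the second pass */
} FailedPage;

/*
 * Second pass: re-examine the pages recorded as failing during the copy.
 * Returns the number of pages still considered corrupt.
 */
int
recheck_failed_pages(const char *path, FailedPage *failed, int nfailed)
{
    FILE   *fp = fopen(path, "rb");
    char    page[BLCKSZ];
    int     still_corrupt = 0;

    if (fp == NULL)
        return 0;           /* file dropped since pass one: nothing to report */

    for (int i = 0; i < nfailed; i++)
    {
        long    offset = (long) failed[i].blkno * BLCKSZ;
        SketchPageHeader hdr;

        failed[i].corrupt = false;

        if (fseek(fp, offset, SEEK_SET) != 0 ||
            fread(page, 1, BLCKSZ, fp) != BLCKSZ)
            continue;       /* truncated since pass one: nothing to report */

        memcpy(&hdr, page, sizeof(hdr));
        if (sketch_checksum_page(page, failed[i].blkno) == hdr.pd_checksum)
            continue;       /* step 3a: the page verifies now */
        if (hdr.pd_lsn > failed[i].first_lsn)
            continue;       /* step 3b: the LSN advanced, page was in flux */

        failed[i].corrupt = true;
        still_corrupt++;
    }

    fclose(fp);
    return still_corrupt;
}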
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote: > We are not just looking at one LSN value. Here are the steps we are > proposing (I'll skip checks for zero pages here): > > 1) Test the page checksum. If it passes the page is OK. > 2) If the checksum does not pass then record the page offset and LSN and > continue. But here the checksum is broken, so while the offset is something we can rely on how do you make sure that the LSN is fine? A broken checksum could perfectly mean that the LSN is actually *not* fine if the page header got corrupted. > 3) After the file is copied, reopen and reread the file, seeking to offsets > where possible invalid pages were recorded in the first pass. > a) If the page is now valid then it is OK. > b) If the page is not valid but the LSN has increased from the LSN Per se the previous point about the LSN value that we cannot rely on. > A malicious attacker could easily trick these checks, but as Stephen pointed > out elsewhere they would likely make the checksums valid which would escape > detection anyway. > > We believe that the chances of random storage corruption passing all these > checks is incredibly small, but eventually we'll also check against the WAL > to be completely sure. The lack of check for any concurrent I/O on the follow-up retries is disturbing. How do you guarantee that on the second retry what you have is a torn page and not something corrupted? Init forks for example are made of up to 2 blocks, so the window would get short for at least those. There are many instances with tables that have few pages as well. -- Michael
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote: > > We are not just looking at one LSN value. Here are the steps we are > > proposing (I'll skip checks for zero pages here): > > > > 1) Test the page checksum. If it passes the page is OK. > > 2) If the checksum does not pass then record the page offset and LSN and > > continue. > > But here the checksum is broken, so while the offset is something we > can rely on how do you make sure that the LSN is fine? A broken > checksum could perfectly mean that the LSN is actually *not* fine if > the page header got corrupted. > > > 3) After the file is copied, reopen and reread the file, seeking to offsets > > where possible invalid pages were recorded in the first pass. > > a) If the page is now valid then it is OK. > > b) If the page is not valid but the LSN has increased from the LSN > > Per se the previous point about the LSN value that we cannot rely on. We cannot rely on the LSN itself. But it's a lot more likely that we can rely on the LSN changing, and on the LSN changing in a "correct way". That is, if the LSN *decreases* we know it's corrupt. If the LSN *doesn't change* we know it's corrupt. But if the LSN *increases* AND the new page now has a correct checksum, it's very most likely to be correct. You could perhaps even put cap on it saying "if the LSN increased, but less than <n>", where <n> is a sufficiently high number that it's entirely unreasonable to advanced that far between the reading of two blocks. But it has to have a very high margin in that case. > > A malicious attacker could easily trick these checks, but as Stephen pointed > > out elsewhere they would likely make the checksums valid which would escape > > detection anyway. > > > > We believe that the chances of random storage corruption passing all these > > checks is incredibly small, but eventually we'll also check against the WAL > > to be completely sure. > > The lack of check for any concurrent I/O on the follow-up retries is > disturbing. How do you guarantee that on the second retry what you > have is a torn page and not something corrupted? Init forks for > example are made of up to 2 blocks, so the window would get short for > at least those. There are many instances with tables that have few > pages as well. Here I was more worried that the window might get *too long* if tables are large :) The risk is certainly that you get a torn page *again* on the second read. It could be the same torn page (if it hasn't changed), but you can detect that (by the fact that it hasn't actually changed) and possibly do a short delay before trying again if it gets that far. That could happen if the process is too quick. It could also be that you are unlucky and that you hit a *new* write, and you were so unlucky that both times it happened to hit exactly when you were reading the page the next time. I'm not sure the chance of that happening is even big enough we have to care about it, though? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Greetings, * Magnus Hagander (magnus@hagander.net) wrote: > On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote: > > > We are not just looking at one LSN value. Here are the steps we are > > > proposing (I'll skip checks for zero pages here): > > > > > > 1) Test the page checksum. If it passes the page is OK. > > > 2) If the checksum does not pass then record the page offset and LSN and > > > continue. > > > > But here the checksum is broken, so while the offset is something we > > can rely on how do you make sure that the LSN is fine? A broken > > checksum could perfectly mean that the LSN is actually *not* fine if > > the page header got corrupted. Of course that could be the case, but it gets to be a smaller and smaller chance by checking that the LSN read falls within reasonable bounds. > > > 3) After the file is copied, reopen and reread the file, seeking to offsets > > > where possible invalid pages were recorded in the first pass. > > > a) If the page is now valid then it is OK. > > > b) If the page is not valid but the LSN has increased from the LSN > > > > Per se the previous point about the LSN value that we cannot rely on. > > We cannot rely on the LSN itself. But it's a lot more likely that we > can rely on the LSN changing, and on the LSN changing in a "correct > way". That is, if the LSN *decreases* we know it's corrupt. If the LSN > *doesn't change* we know it's corrupt. But if the LSN *increases* AND > the new page now has a correct checksum, it's very most likely to be > correct. You could perhaps even put cap on it saying "if the LSN > increased, but less than <n>", where <n> is a sufficiently high number > that it's entirely unreasonable to advanced that far between the > reading of two blocks. But it has to have a very high margin in that > case. This is, in fact, included in what was proposed- the "max increase" would be "the ending LSN of the backup". I don't think we can make it any tighter than that though without risking false positives, which is surely worse than a false negative in this particular case- we already risk false negatives due to the fact that our checksum isn't perfect, so even a perfect check to make sure that the page will, in fact, be replayed over during crash recovery doesn't guarantee that there's no corruption. > > > A malicious attacker could easily trick these checks, but as Stephen pointed > > > out elsewhere they would likely make the checksums valid which would escape > > > detection anyway. > > > > > > We believe that the chances of random storage corruption passing all these > > > checks is incredibly small, but eventually we'll also check against the WAL > > > to be completely sure. > > > > The lack of check for any concurrent I/O on the follow-up retries is > > disturbing. How do you guarantee that on the second retry what you > > have is a torn page and not something corrupted? Init forks for > > example are made of up to 2 blocks, so the window would get short for > > at least those. There are many instances with tables that have few > > pages as well. If there's an easy and cheap way to see if there was concurrent i/o happening for the page, then let's hear it. One idea that has occured to me which hasn't been discussed is checking the file's mtime to see if it's changed since the backup started. 
In that case, I would think it'd be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal
corruption.

In general, however, I don't like the idea of reaching into PG and asking
PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that they'll
work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new? At the
least, it wouldn't be the same LSN as the first read that picked up a torn
page. In general though, I agree, we are getting to the point here where
the chances of missing something with this approach seem extremely slim.
I do still like the idea of doing better by actually scanning the WAL but
at least for now, this is far better than what we have today while not
introducing a huge amount of additional code or complexity.

Thanks,

Stephen
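A possible shape for that mtime cross-check, using plain stat(2); the function name and the way the backup start time is passed in are made up for this example, and a real implementation would need to think about filesystem timestamp granularity and clock behaviour, NFS being an obvious caveat:

#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * After a checksum failure whose LSN looks plausible, see whether the file
 * has been modified since the backup started.  An mtime older than the
 * backup start makes a concurrent write an unlikely explanation, so the
 * failure should be reported as possible corruption rather than skipped.
 */
bool
file_modified_since_backup_start(const char *path, time_t backup_start_time)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return true;    /* file gone since the backup started: treat as changed */

    return st.st_mtime >= backup_start_time;
}

With something along these lines, the "possible corruption" warning would only be raised when the function returns false, i.e. when nothing appears to have touched the file since the backup began.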
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote: > * Magnus Hagander (magnus@hagander.net) wrote: >> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote: >>> But here the checksum is broken, so while the offset is something we >>> can rely on how do you make sure that the LSN is fine? A broken >>> checksum could perfectly mean that the LSN is actually *not* fine if >>> the page header got corrupted. > > Of course that could be the case, but it gets to be a smaller and > smaller chance by checking that the LSN read falls within reasonable > bounds. FWIW, I find that scary. >> We cannot rely on the LSN itself. But it's a lot more likely that we >> can rely on the LSN changing, and on the LSN changing in a "correct >> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN >> *doesn't change* we know it's corrupt. But if the LSN *increases* AND >> the new page now has a correct checksum, it's very most likely to be >> correct. You could perhaps even put cap on it saying "if the LSN >> increased, but less than <n>", where <n> is a sufficiently high number >> that it's entirely unreasonable to advanced that far between the >> reading of two blocks. But it has to have a very high margin in that >> case. > > This is, in fact, included in what was proposed- the "max increase" > would be "the ending LSN of the backup". I don't think we can make it > any tighter than that though without risking false positives, which is > surely worse than a false negative in this particular case- we already > risk false negatives due to the fact that our checksum isn't perfect, so > even a perfect check to make sure that the page will, in fact, be > replayed over during crash recovery doesn't guarantee that there's no > corruption. > > If there's an easy and cheap way to see if there was concurrent i/o > happening for the page, then let's hear it. This has been discussed for a couple of months now. I would recommend to go through this thread: https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com And this bit is interesting, because that would give the guarantees you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS on the thread): https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de > One idea that has occured > to me which hasn't been discussed is checking the file's mtime to see if > it's changed since the backup started. In that case, I would think it'd > be something like: > > - Checksum is invalid > - LSN is within range > - Close file > - Stat file > - If mtime is from before the backup then signal possible corruption I suspect that relying on mtime may cause problems. One case coming to my mind is NFS. > In general, however, I don't like the idea of reaching into PG and > asking PG for this page. It seems to me that if we don't ask to PG what it thinks about a page, we will never have a fully bullet-proof design either. -- Michael
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> >> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >>> But here the checksum is broken, so while the offset is something we
> >>> can rely on how do you make sure that the LSN is fine? A broken
> >>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>> the page header got corrupted.
> >
> > Of course that could be the case, but it gets to be a smaller and
> > smaller chance by checking that the LSN read falls within reasonable
> > bounds.
>
> FWIW, I find that scary.

There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less. Both cases would result in a false
negative, which is surely bad, though that strikes me as better than a
false positive, where we say there's corruption when there isn't.

> And this bit is interesting, because that would give the guarantees
> you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
> on the thread):
> https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de

There's no guarantee that the page is still in shared buffers or that
we have a buffer descriptor still for it by the time we're doing this,
as I said up-thread. This approach requires that we reach into PG,
acquire at least a buffer descriptor and set BM_IO_IN_PROGRESS on it,
and then read the page again and checksum it again before finally
looking at the now-'trusted' LSN (even though it might have had some
bits flipped in it and we wouldn't know) and seeing whether it's higher
than the start of the backup, and maybe less than the current LSN.

Maybe we can avoid actually pulling the page into shared buffers
(reading it into our own memory instead) and just have the buffer
descriptor, but none of this is going to be very unobtrusive in either
code or the running system, and it isn't going to give us an actual
guarantee that there's been no corruption. The amount that it improves
on the checks I outline above seems to be exceedingly small, and the
question is whether it's worth it for, most likely, exclusively
pg_basebackup (unless we're going to figure out a way to expose this
via SQL, which seems unlikely).

> > One idea that has occured
> > to me which hasn't been discussed is checking the file's mtime to see if
> > it's changed since the backup started. In that case, I would think it'd
> > be something like:
> >
> > - Checksum is invalid
> > - LSN is within range
> > - Close file
> > - Stat file
> > - If mtime is from before the backup then signal possible corruption
>
> I suspect that relying on mtime may cause problems. One case coming
> to my mind is NFS.

I agree that it might not be perfect, but it also seems like something
which could be reasonably cheaply checked, and the window (between when
the backup started and the time we hit this torn page) is very likely
to be large enough that the mtime will have been updated and be
different from (and later than, if it was modified) what it was at the
time the backup started. It's also something that incremental backups
may be looking at, so if there are serious problems with it then
there's a good chance you've got bigger issues.

> > In general, however, I don't like the idea of reaching into PG and
> > asking PG for this page.
>
> It seems to me that if we don't ask PG what it thinks about a page,
> we will never have a fully bullet-proof design either.

None of this is bullet-proof; it's all trade-offs.

Thanks,

Stephen
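The stat-based tie-breaker being debated here is cheap to sketch. Under the same caveats as before (hypothetical names, and subject to the mtime-reliability concerns raised above for NFS and similar setups), it could be as small as this:

/*
 * Sketch only.  Returns true if the file's modification time is at or
 * after the start of the backup.  A torn page in a file whose mtime
 * predates the backup cannot be explained by a concurrent write, so the
 * caller would report it as possible corruption.
 */
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

static bool
file_modified_since(const char *path, time_t backup_start_time)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;               /* can't tell; leave it to the caller */

    return st.st_mtime >= backup_start_time;
}

Coarse mtime granularity, clock skew and network filesystems make this a heuristic rather than a guarantee, which is why it is framed in the thread as a last check after the checksum and LSN tests rather than a primary one.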
On 11/30/20 9:27 AM, Stephen Frost wrote:
> Greetings,
>
> * Michael Paquier (michael@paquier.xyz) wrote:
>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>> * Magnus Hagander (magnus@hagander.net) wrote:
>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>>>>> But here the checksum is broken, so while the offset is something we
>>>>> can rely on how do you make sure that the LSN is fine? A broken
>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>> the page header got corrupted.
>>>
>>> Of course that could be the case, but it gets to be a smaller and
>>> smaller chance by checking that the LSN read falls within reasonable
>>> bounds.
>>
>> FWIW, I find that scary.
>
> There's ultimately different levels of 'scary' and the risk here that
> something is actually wrong following these checks strikes me as being
> on the same order as random bits being flipped in the page and still
> getting a valid checksum (which is entirely possible with our current
> checksum...), or maybe even less.

I would say a lot less. First you'd need to corrupt one of the eight
bytes that make up the LSN (pretty likely since corruption will probably
affect the entire block) and then it would need to be updated to a value
that falls within the current backup range, a 1 in 16 million chance if
a terabyte of WAL is generated during the backup. Plus, the corruption
needs to happen during the backup since we are going to check for that,
and the corrupted LSN needs to be ascending, and the LSN originally read
needs to be within the backup range (another 1 in 16 million chance)
since pages written before the start backup checkpoint should not be
torn.

So as far as I can see there are more likely to be false negatives from
the checksum itself.

It would also be easy to add a few rounds of checks, i.e. test if the
LSN ascends but stays in the backup LSN range N times.

Honestly, I'm much more worried about corruption zeroing the entire
page. I don't know how likely that is, but I know none of our proposed
solutions would catch it.

Andres, since you brought this issue up originally perhaps you'd like to
weigh in?

Regards,
--
-David
david@pgmasters.net
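For what it's worth, the 1-in-16-million figure above can be reconstructed with a back-of-the-envelope calculation, under the simplifying assumption that a corrupted pd_lsn behaves like a uniformly random 64-bit value: with a terabyte (2^40 bytes) of WAL generated during the backup, the chance of such a value landing inside the backup's LSN range is

\[ 2^{40} / 2^{64} \;=\; 2^{-24} \;\approx\; 1/16{,}777{,}216 \]

Requiring the LSN to also ascend across several re-reads multiplies further such factors in, which is why the combined false-negative probability is argued to be smaller than that of the 16-bit page checksum itself.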
On Tue, Mar 9, 2021 at 10:43 PM David Steele <david@pgmasters.net> wrote:
On 11/30/20 6:38 PM, David Steele wrote:
> On 11/30/20 9:27 AM, Stephen Frost wrote:
>> * Michael Paquier (michael@paquier.xyz) wrote:
>>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>>> * Magnus Hagander (magnus@hagander.net) wrote:
>>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
>>>>> <michael@paquier.xyz> wrote:
>>>>>> But here the checksum is broken, so while the offset is something we
>>>>>> can rely on how do you make sure that the LSN is fine? A broken
>>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>>> the page header got corrupted.
>>>>
>>>> Of course that could be the case, but it gets to be a smaller and
>>>> smaller chance by checking that the LSN read falls within reasonable
>>>> bounds.
>>>
>>> FWIW, I find that scary.
>>
>> There's ultimately different levels of 'scary' and the risk here that
>> something is actually wrong following these checks strikes me as being
>> on the same order as random bits being flipped in the page and still
>> getting a valid checksum (which is entirely possible with our current
>> checksum...), or maybe even less.
>
> I would say a lot less. First you'd need to corrupt one of the eight
> bytes that make up the LSN (pretty likely since corruption will probably
> affect the entire block) and then it would need to be updated to a value
> that falls within the current backup range, a 1 in 16 million chance if
> a terabyte of WAL is generated during the backup. Plus, the corruption
> needs to happen during the backup since we are going to check for that,
> and the corrupted LSN needs to be ascending, and the LSN originally read
> needs to be within the backup range (another 1 in 16 million chance)
> since pages written before the start backup checkpoint should not be torn.
>
> So as far as I can see there are more likely to be false negatives from
> the checksum itself.
>
> It would also be easy to add a few rounds of checks, i.e. test if the
> LSN ascends but stays in the backup LSN range N times.
>
> Honestly, I'm much more worried about corruption zeroing the entire
> page. I don't know how likely that is, but I know none of our proposed
> solutions would catch it.
>
> Andres, since you brought this issue up originally perhaps you'd like to
> weigh in?
I had another look at this patch and though I think my suggestions above
would improve the patch, I have no objections to going forward as is (if
that is the consensus) since this seems an improvement over what we have
now.
It comes down to whether you prefer false negatives or false positives.
With the LSN checking Stephen and I advocate, it is theoretically
possible to have a false negative, but the chances of the LSN ascending N
times while staying within the backup LSN range due to corruption seem to
be approaching zero.
I think Michael's method is unlikely to throw false positives, but it
seems at least possible that a block would be hot enough to appear
torn N times in a row. Torn pages themselves are really easy to reproduce.
If we do go forward with this method I would likely propose the
LSN-based approach as a future improvement.
Regards,
--
-David
david@pgmasters.net
I am changing the status to "Waiting on Author" based on the latest comments of @David Steele
and secondly the patch does not apply cleanly.
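David's "LSN ascends but stays in the backup range N times" refinement, quoted above, would be only a small change to the retry loop sketched earlier in the thread. Roughly, reusing the same hypothetical helpers and constants from that sketch (again, not actual PostgreSQL APIs):

/*
 * Sketch only: accept a repeatedly-failing block as a hot (torn) page
 * only if its LSN keeps ascending while staying inside the backup's LSN
 * range; anything else is reported as corruption.
 */
static bool
block_is_corrupt_lsn_ascending(int fd, uint32_t blkno,
                               XLogRecPtr backup_start_lsn,
                               XLogRecPtr backup_end_lsn,
                               int nrounds)
{
    char        page[BLCKSZ];
    XLogRecPtr  prev_lsn = 0;

    for (int i = 0; i < nrounds; i++)
    {
        if (!read_block(fd, blkno, page))
            return true;                /* read failure: report it */

        if (page_checksum_ok(page, blkno))
            return false;               /* page settled down and verifies */

        XLogRecPtr  lsn = page_lsn(page);

        /* Corruption unless the LSN stays inside the backup range ... */
        if (lsn < backup_start_lsn || lsn > backup_end_lsn)
            return true;

        /* ... and keeps ascending, i.e. the page really is being rewritten. */
        if (i > 0 && lsn <= prev_lsn)
            return true;

        prev_lsn = lsn;
        usleep(RETRY_DELAY_US);
    }

    /* Ascending, in-range LSN N times in a row: treat as a hot page. */
    return false;
}

The trade-off described above is visible here: a plain "retry N times" check could still flag a sufficiently hot block as corrupt (a false positive), while this variant accepts it, at the cost of a false negative only if corruption manages to mimic an ascending, in-range LSN on every read.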
> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>
> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele
> and secondly the patch does not apply cleanly.

This patch hasn’t moved since marked as WoA in the last CF and still doesn’t
apply, unless there is a new version brewing it seems apt to close this as RwF
and await a new entry in a future CF.

--
Daniel Gustafsson   https://vmware.com/
> On 2 Sep 2021, at 13:18, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>>
>> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele
>> and secondly the patch does not apply cleanly.
>
> This patch hasn’t moved since marked as WoA in the last CF and still doesn’t
> apply, unless there is a new version brewing it seems apt to close this as RwF
> and await a new entry in a future CF.

As there has been no movement, I've marked this patch as RwF.

--
Daniel Gustafsson   https://vmware.com/