Thread: [patch] Fix checksum verification in base backups for zero page headers
Hi,

as a continuation of [1], I've split off the zero page header case from the last patch, as this one seems less contentious.

Michael

[1] https://commitfest.postgresql.org/28/2308/

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 01.09.2020 13:22, Michael Banck wrote:
> Hi,
>
> as a continuation of [1], I've split-off the zero page header case from
> the last patch, as this one seems less contentious.
>
> Michael
>
> [1] https://commitfest.postgresql.org/28/2308/

I've looked through the previous discussion. As far as I understand it, most of the controversy was about the online checksums improvements.

The warning about pd_upper inconsistency that you've added is a good addition. The patch is a bit messy, though, because a huge code block was shifted.

Will it be different if you just leave the
"if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
block as it was, and add
"else if (PageIsNew(page) && !PageIsZero(page))" ?

While on it, I also have a few questions about the surrounding code:

1) Maybe we can use other header sanity checks from PageIsVerified() as well? Or even the function itself.

2) > /* Reset loop to validate the block again */
How many times do we try to reread the block? Is one reread enough?

Speaking of which, the name 'reread_cnt' looks confusing to me. I would expect that this variable contains the number of attempts, not the number of bytes read.

If everyone agrees that for basebackup purposes it's enough to rely on a single reread, I'm ok with it. Another approach is to read the page directly from shared buffers to ensure that the page is fine. This way is more complicated, but should be almost bullet-proof. Using it we can also check pages with lsn >= startptr.

3) Judging by the warning messages, we count checksum failures per file, not per page, and don't report after the fifth failure. Why so? Is it a common case that so many pages of one file are corrupted?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
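[Editor's note: the PageIsZero() check discussed above amounts to verifying that every byte of the 8kB block is zero. A minimal sketch of such a helper follows; this is hypothetical illustration code, not the actual PostgreSQL implementation, which operates on PageHeader structures.]

```c
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192     /* PostgreSQL's default block size */

/*
 * Hypothetical helper in the spirit of the PageIsZero() check under
 * discussion: return true only if every byte of the block is zero.
 * A page that is "new" (pd_upper == 0) but not all-zero indicates
 * header corruption that a plain checksum skip would miss.
 */
static bool
page_is_zero(const char *page)
{
    for (size_t i = 0; i < BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}
```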
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Wed, Sep 02, 2020 at 04:50:14PM +0300, Anastasia Lubennikova wrote:
> I've looked through the previous discussion. As far as I got it, most of the
> controversy was about online checksums improvements.

This patch has been waiting on its author for two weeks now. Michael, could you reply to the points raised by Anastasia?
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Banck
Date:
Hi,

On Wednesday, September 2, 2020 at 16:50 +0300, Anastasia Lubennikova wrote:
> On 01.09.2020 13:22, Michael Banck wrote:
> > as a continuation of [1], I've split-off the zero page header case from
> > the last patch, as this one seems less contentious.
> > [1] https://commitfest.postgresql.org/28/2308/
>
> I've looked through the previous discussion. As far as I got it, most of
> the controversy was about online checksums improvements.
>
> The warning about pd_upper inconsistency that you've added is a good
> addition. The patch is a bit messy, though, because a huge code block
> was shifted.
>
> Will it be different, if you just leave
> "if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
> block as it was, and add
> "else if (PageIsNew(page) && !PageIsZero(page))" ?

Thanks, that does indeed look better as a patch, and I think it's fine as-is for the code as well; I've attached a v2.

> While on it, I also have a few questions about the code around:

All fair points, but I think those should be handled in another patch, if any.

> 1) Maybe we can use other header sanity checks from PageIsVerified() as
> well? Or even the function itself.

Yeah, I'll keep that in mind.

> 2) > /* Reset loop to validate the block again */
> How many times do we try to reread the block? Is one reread enough?

Not sure whether more rereads would help, but the general direction was to replace this with something better anyway (see below).

> Speaking of which, 'reread_cnt' name looks confusing to me. I would
> expect that this variable contains a number of attempts, not the number
> of bytes read.

Well, there are other "cnt" variables in that file that keep the number of bytes read, so it does not seem out of place to me. A comment might not hurt, though. On second glance, it should maybe also be of size_t and not int type, as is the other cnt variable?

> If everyone agrees, that for basebackup purpose it's enough to rely on a
> single reread, I'm ok with it.
> Another approach is to read the page directly from shared buffers to
> ensure that the page is fine. This way is more complicated, but should
> be almost bullet-proof. Using it we can also check pages with lsn >=
> startptr.

Right, that's what Andres also advocated for initially, I think, and what should be done going forward.

> 3) Judging by warning messages, we count checksum failures per file, not
> per page, and don't report after a fifth failure. Why so? Is it a
> common case that so many pages of one file are corrupted?

I think this was a request during the original review, on the basis that if we have more than a few checksum errors, it's likely there is something fundamentally broken and it does not make sense to spam the log with potentially millions of broken checksums.

Cheers,

Michael
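[Editor's note: the reread logic being questioned above ("is one reread enough?") boils down to: if verification fails, read the block once more and check it again, tolerating a torn read caused by a concurrent write. A hedged sketch of that shape follows; `read_block` and `verify_block` are hypothetical stand-ins for the actual file I/O and checksum check in basebackup.c.]

```c
#include <stdbool.h>

/* Function-pointer stand-ins for the real file read and checksum check. */
typedef bool (*read_fn)(int blkno, char *buf);
typedef bool (*verify_fn)(const char *buf);

/*
 * Sketch of a verify-with-one-reread loop: a first verification failure
 * may be a false positive caused by a concurrent write, so the block is
 * reread once and validated again before being reported as corrupted.
 */
static bool
check_block_with_reread(int blkno, char *buf,
                        read_fn read_block, verify_fn verify_block)
{
    if (!read_block(blkno, buf))
        return false;           /* short read: treat as failure */
    if (verify_block(buf))
        return true;

    /* Possible torn read under concurrent write: reread once and retry. */
    if (!read_block(blkno, buf))
        return false;
    return verify_block(buf);
}
```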
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Banck
Date:
Hi,

On Tuesday, September 22, 2020 at 16:26 +0200, Michael Banck wrote:
> On Wednesday, September 2, 2020 at 16:50 +0300, Anastasia Lubennikova wrote:
> > I've looked through the previous discussion. As far as I got it, most of
> > the controversy was about online checksums improvements.
> >
> > The warning about pd_upper inconsistency that you've added is a good
> > addition. The patch is a bit messy, though, because a huge code block
> > was shifted.
> >
> > Will it be different, if you just leave
> > "if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
> > block as it was, and add
> > "else if (PageIsNew(page) && !PageIsZero(page))" ?
>
> Thanks, that does indeed look better as a patch and I think it's fine
> as-is for the code as well, I've attached a v2.

Sorry, I forgot to add you as a reviewer in the proposed commit message; I've fixed that up now in V3.

Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 22.09.2020 17:30, Michael Banck wrote:
> On Tuesday, September 22, 2020 at 16:26 +0200, Michael Banck wrote:
>> On Wednesday, September 2, 2020 at 16:50 +0300, Anastasia Lubennikova wrote:
>>> Will it be different, if you just leave
>>> "if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
>>> block as it was, and add
>>> "else if (PageIsNew(page) && !PageIsZero(page))" ?
>> Thanks, that does indeed look better as a patch and I think it's fine
>> as-is for the code as well, I've attached a v2.
> Sorry, forgot to add you as reviewer in the proposed commit message,
> I've fixed that up now in V3.

Great. This version looks good to me. Thank you for answering my questions; I agree that we can work on them in separate threads. So I am marking this one as Ready for Committer.

The only minor problem is a typo (?) in the proposed commit message: "If a page is all zero, consider that a checksum failure." It should be "If a page is NOT all zero...".
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Banck
Date:
Hi,

On Thursday, September 24, 2020 at 17:24 +0300, Anastasia Lubennikova wrote:
> So I mark this one as ReadyForCommitter.

Thanks!

> The only minor problem is a typo (?) in the proposed commit message.
> "If a page is all zero, consider that a checksum failure." It should be
> "If a page is NOT all zero...".

Oh right, I've fixed up the commit message in the attached V4.

Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Fri, Sep 25, 2020 at 08:53:27AM +0200, Michael Banck wrote:
> Oh right, I've fixed up the commit message in the attached V4.

Not much of a fan of what's proposed here, for a couple of reasons:
- If the page is not new, we should check if the header is sane or not.
- It may be better to actually count checksum failures if these are repeatable in a given block, and report them.
- The code would be simpler with a "continue" added for a block detected as new or with an LSN newer than the backup start.
- The new error messages are confusing, and I think that they are hard to translate in a meaningful way.

So I think that we should try to use PageIsVerified() directly in the code path of basebackup.c, and this requires a couple of changes to make the routine more extensible:
- Addition of a dboid argument for pgstat_report_checksum_failure(), whose call needs to be changed to use the *_in_db() flavor.
- Addition of an option to decide if a log should be generated or not.
- Addition of an option to control if a checksum failure should be reported to pgstat or not. Even though this is linked to the previous point, I have seen cases where being able to control both separately would be helpful, particularly the log part.

For the last two, I would use an extensible argument based on bits32 with a set of flags that the caller can set at will.

Something else I would recommend here is to use an error context for the case where we want to log a retry number in the base backup loop that accepts false positives, with the blkno and the physical file name when we find failures after retries.
--
Michael
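[Editor's note: the bits32-flag design proposed above can be sketched as below. The flag names and behavior are illustrative assumptions modeled on the proposal, not the exact API as committed.]

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t bits32;

/* Hypothetical flag bits in the spirit of the proposal: the caller
 * chooses whether a failure is logged and/or reported to stats. */
#define PIV_LOG_WARNING  0x0001
#define PIV_REPORT_STAT  0x0002

static int reported_failures = 0;   /* stand-in for pgstat counters */

/*
 * Sketch of an extensible page-verification entry point: the checksum
 * result is passed in directly here so the flag handling is visible;
 * the real routine would compute it from the page contents.
 */
static bool
page_is_verified_sketch(const char *page, int blkno, bits32 flags,
                        bool checksum_ok)
{
    (void) page;                /* unused in this sketch */

    if (checksum_ok)
        return true;

    if (flags & PIV_LOG_WARNING)
        fprintf(stderr, "WARNING: page verification failed for block %d\n",
                blkno);
    if (flags & PIV_REPORT_STAT)
        reported_failures++;    /* stand-in for pgstat_report_checksum_failure() */

    return false;
}
```

A caller that handles its own reporting, like basebackup.c in the discussion above, would simply pass neither flag.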
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 07.10.2020 11:18, Michael Paquier wrote:
> On Fri, Sep 25, 2020 at 08:53:27AM +0200, Michael Banck wrote:
>> Oh right, I've fixed up the commit message in the attached V4.
> Not much a fan of what's proposed here, for a couple of reasons:
> - If the page is not new, we should check if the header is sane or not.
> - It may be better to actually count checksum failures if these are
> repeatable in a given block, and report them.
> - The code would be more simple with a "continue" added for a block
> detected as new or with a LSN newer than the backup start.
> - The new error messages are confusing, and I think that these are
> hard to translate in a meaningful way.

I am interested in moving this patch forward, so here is an updated v5. This version uses PageIsVerified(). All error messages are left unchanged.

> So I think that we should try to use PageIsVerified() directly in the
> code path of basebackup.c, and this requires a couple of changes to
> make the routine more extensible:
> - Addition of a dboid argument for pgstat_report_checksum_failure(),
> whose call needs to be changed to use the *_in_db() flavor.

In the current patch, PageIsVerified() called from basebackup.c simply doesn't report failures to pgstat, because in basebackup.c we already report checksum failures separately, once per file:

pgstat_report_checksum_failures_in_db(dboid, checksum_failures);

Should we change this too? I don't see any difference.

> - Addition of an option to decide if a log should be generated or
> not.
> - Addition of an option to control if a checksum failure should be
> reported to pgstat or not, even if this is actually linked to the
> previous point, I have seen cases where being able to control both
> separately would be helpful, particularly the log part.
>
> For the last two ones, I would use an extensible argument based on
> bits32 with a set of flags that the caller can set at will.

Done.
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Fri, Oct 16, 2020 at 06:02:52PM +0300, Anastasia Lubennikova wrote:
> In the current patch, PageIsVerified called from basebackup simply doesn't
> report failures to pgstat.
> Should we change this too? I don't see any difference.

I considered that, but now that does not seem worth bothering with here.

> Done.

Thanks for the updated patch. I have played with dd and some fake files to check the validity of the zero case (as long as pd_lsn does not go wild). Here are some comments, with an updated patch attached:
- Added a PageIsVerifiedExtended() to make this patch back-patchable, with the original routine keeping its original shape.
- I did not like much showing the WARNING from PageIsVerified() for the report, as we'd lose some context related to the base backups. The important part is to know which blocks from which files are found as problematic.
- Switched the checks to have PageIsVerified() placed first in the hierarchy, so that we do all the basic validity checks before looking at the LSN. This does not really change things logically.
- The meaning of block_retry is also the opposite of what it should be in the original code? IMO, the flag should be set to true if we are still fine to retry a block, and set to false once the one-time retry has failed.
- Moved the hardcoded failure report threshold of 5 into its own variable.
- The error strings are not really consistent with the project style in this area. These are usually not spread across multiple lines, to ease searches with grep or such.

Anastasia, Michael B, does that look OK to you?

NB: This patch fixes only one problem, the zero-page case, as it was Michael B's intention to split this part into its own thread. We still have, of course, a second problem when it comes to a random LSN put into the page header, which could mask an actual checksum failure, so this only takes care of half the issues. Having a correct LSN approximation is a tricky problem IMO, but we could improve things by having some delta with an extra WAL page on top of GetInsertRecPtr(). And this function cannot be used for base backups taken from standbys.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 20.10.2020 09:24, Michael Paquier wrote:
> Thanks for the updated patch. I have played with dd and some fake
> files to check the validity of the zero-case (as long as pd_lsn does
> not go wild). Here are some comments, with an updated patch attached:
> - Added a PageIsVerifiedExtended to make this patch back-patchable,
> with the original routine keeping its original shape.

Thank you. I always forget about this. Do we have any checklist for such changes that patch authors and reviewers can use?

> - I did not like much to show the WARNING from PageIsVerified() for
> the report, and we'd lose some context related to the base backups.
> The important part is to know which blocks from which files are found
> as problematic.
> - Switched the checks to have PageIsVerified() placed first in the
> hierarchy, so as we do all the basic validity checks before a look at
> the LSN. This does not really change things logically.
> - The meaning of block_retry is also the opposite of what it should be
> in the original code? IMO, the flag should be set to true if we still
> are fine to retry a block, and set it to false once the one-time retry
> has failed.

Agree.

> - The error strings are not really consistent with the project style
> in this area. These are usually not spread across multiple lines to
> ease searches with grep or such.

Same question as above. I don't see this info about formatting in the error message style guide in the documentation. Is it mentioned somewhere else?

> Anastasia, Michael B, does that look OK to you?

The final patch looks good to me.

> NB: This patch fixes only one problem, the zero-page case, as it was
> the intention of Michael B to split this part into its own thread. We
> still have, of course, a second problem when it comes to a random LSN
> put into the page header which could mask an actual checksum failure
> so this only takes care of half the issues. Having a correct LSN
> approximation is a tricky problem IMO, but we could improve things by
> having some delta with an extra WAL page on top of GetInsertRecPtr().
> And this function cannot be used for base backups taken from
> standbys.

We can also read such pages via shared buffers to be 100% sure.
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Thu, Oct 22, 2020 at 12:47:03AM +0300, Anastasia Lubennikova wrote:
> Thank you. I always forget about this. Do we have any checklist for such
> changes, that patch authors and reviewers can use?

Not really. That's more a habit than anything else: any non-static routine that we publish could be used by some out-of-core code, so maintaining pure API compatibility on stable branches is essential.

> We can also read such pages via shared buffers to be 100% sure.

Yeah, but this has its limits as well. One can use ignore_checksum_failure, but this can actually be very dangerous, as you can finish by loading into shared buffers a page whose header is thought sane but with a large portion of the page randomly corrupted, spreading corruption around and leading to more fancy logic failures in Postgres, with more panic from customers. Not using ignore_checksum_failure is safer, but it makes an analysis of the damage for a given file harder, as things would stop at the first failure of a file with a seqscan. pg_prewarm can help here, but that's not the goal of the tool either.

We definitely need a different approach that guarantees that a page is correctly locked with no concurrent I/O while checked on retry, and I am looking at that for the SQL-level check. That's not something I would do in v13, though, but we can make the existing logic much more reliable with a set of fixed retries and some sleeps in between. A maximum of 5 retries with 100ms seems like a good balance seen from here, but that's not a perfect science, of course, depending on the hardware latency.

This has been a sensitive topic during the stability period of v13, with many committers commenting on the issue, so I'd rather be very careful that there are no objections to what's published here, and in consequence I am going to ping the other thread on the matter. For now I have the attached to address the zero case and the random LSN case.
--
Michael
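[Editor's note: the fixed-retries-with-sleeps scheme described above (a maximum of 5 retries with 100ms in between) can be sketched as a simple bounded loop. `try_verify` is a hypothetical stand-in for the reread-and-check step, not the actual basebackup.c code.]

```c
#include <stdbool.h>
#include <time.h>

#define MAX_RETRIES 5

/*
 * Sketch of a bounded retry loop: retry a failed block check up to
 * MAX_RETRIES times, sleeping 100ms between attempts to give a
 * concurrent writer time to finish flushing the page. Only if every
 * attempt fails is the block reported as a genuine checksum failure.
 */
static bool
verify_with_retries(bool (*try_verify)(void))
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
    {
        if (try_verify())
            return true;
        if (attempt < MAX_RETRIES - 1)
        {
            struct timespec ts = {0, 100 * 1000 * 1000};    /* 100ms */
            nanosleep(&ts, NULL);
        }
    }
    return false;
}
```

Note that, as Julien points out downthread, any scheme like this trades false positives against wait time: I/O stalled longer than the total sleep budget still produces a spurious report.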
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Julien Rouhaud
Date:
On Thu, Oct 22, 2020 at 9:25 AM Michael Paquier <michael@paquier.xyz> wrote:
> We definitely need a different approach that guarantees that a page is
> correctly locked with no concurrent I/O while checked on retry, and I
> am looking at that for the SQL-level check. That's not something I
> would do in v13 though, but we can make the existing logic much more
> reliable with a set of fixed retries and some sleeps in-between. A
> maximum of 5 retries with 100ms seems like a good balance seen from
> here, but that's not a perfect science of course depending on the
> hardware latency.
>
> This has been a sensitive topic during the stability period of v13
> with many committers commenting on the issue, so I'd rather be very
> careful that there are no objections to what's published here, and in
> consequence I am going to ping the other thread on the matter. For
> now I have the attached to address the zero-case and the random LSN
> case.

I'm a bit worried about this approach, as if I understand correctly this can lead to false positive reports. I've certainly seen systems with IO stalled for more than 500ms, so while this is not frequent, it could still happen.

About the patch:

+ * doing this check, causing a false positive. If that
+ * happens, a page is retried once, with an error reported if
+ * the second attempt also fails.

[...]

+ /* The verification of a page has failed, retry once */
+ if (block_attempts < PAGE_RETRY_THRESHOLD)
+ {

Both those comments seem to refer to the previous "retry only once" approach.

Apart from that and my concerns about the new heuristics, the patch looks good. +1 for PageIsVerifiedExtended()!
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Thu, Oct 22, 2020 at 02:27:34PM +0800, Julien Rouhaud wrote:
> I'm a bit worried about this approach, as if I understand correctly
> this can lead to false positive reports. I've certainly seen systems
> with IO stalled for more than 500ms, so while this is not frequent
> this could still happen.

The possibility of false positives is not a new thing with this feature as currently shaped. On HEAD, this code actually just does one retry, without waiting at all for the operation potentially happening in parallel to finish, so that's actually worse. And that's assuming that the pd_lsn of the page did not get touched by corruption, as we would simply miss a broken page. So, with a non-locking approach, we limit ourselves to tweaking the number of retries and some sleeps :(  I am not sure that increasing the 100ms sleep is a good thing on not-so-slow disks, but we could increase the number of retries. The patch makes that easier to change, at least.

FWIW, I don't like that this code, with a real risk of false positives, got merged to begin with, and I think that other people share the same opinion, but it is not like we can just remove it on a branch already released either, and I am not sure if we have done such things in the past for stable branches. If we were to do that, we could just make the operation a no-op and keep some traces of the grammar for compatibility.

> About the patch:
>
> + * doing this check, causing a false positive. If that
> + * happens, a page is retried once, with an error reported if
> + * the second attempt also fails.
>
> [...]
>
> + /* The verification of a page has failed, retry once */
> + if (block_attempts < PAGE_RETRY_THRESHOLD)
> + {

Oops. Thanks, I got that changed on my branch.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 22.10.2020 04:25, Michael Paquier wrote:
> On Thu, Oct 22, 2020 at 12:47:03AM +0300, Anastasia Lubennikova wrote:
>> We can also read such pages via shared buffers to be 100% sure.
> Yeah, but this has its limits as well. One can use
> ignore_checksum_failure, but this can actually be very dangerous as
> you can finish by loading into shared buffers a page that has a header
> thought as sane but with a large portion of its page randomly
> corrupted, spreading corruption around and leading to more fancy
> logic failures in Postgres, with more panic from customers. Not using
> ignore_checksum_failure is safer, but it makes an analyze of the
> damages for a given file harder as things would stop at the first
> failure of a file with a seqscan. pg_prewarm can help here, but
> that's not the goal of the tool to do that either.

I was thinking about applying this only to pages with LSN > startLSN. Most such pages are valid and already in memory, because they were changed just recently, so there is no need for pg_prewarm here. If such an LSN appeared because of data corruption, page verification from inside ReadBuffer() will report an error first. In the proposed function, we can handle this error in any fashion we want. Something like:

if (PageGetLSN(page) > startptr)
{
    if (!read_page_via_buffercache())
        //throw a warning about corrupted page
        //handle checksum error as needed
    else
        //page is valid. No worries
}
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Thu, Oct 22, 2020 at 03:11:45PM +0300, Anastasia Lubennikova wrote:
> Most of such pages are valid and already in memory, because they were
> changed just recently, so no need for pg_prewarm here. If such LSN appeared
> because of a data corruption, page verification from inside ReadBuffer()
> will report an error first. In proposed function, we can handle this error
> in any fashion we want.

Yeah, we could try to make the logic a bit more complicated like that. However, for any code path relying on a page read without any locking insurance, we cannot really have a lot of trust in any of the fields assigned to the page, as this could just be random corruption garbage, and the only thing I am ready to trust here is a checksum mismatch check, because that's the only field on the page that's linked to its full contents on the 8k page. This also keeps the code simpler.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Fri, Oct 23, 2020 at 08:00:08AM +0900, Michael Paquier wrote:
> Yeah, we could try to make the logic a bit more complicated like
> that. However, for any code path relying on a page read without any
> locking insurance, we cannot really have a lot of trust in any of the
> fields assigned to the page as this could just be random corruption
> garbage, and the only thing I am ready to trust here a checksum
> mismatch check, because that's the only field on the page that's
> linked to its full contents on the 8k page. This also keeps the code
> simpler.

A small update here. I have extracted the refactored part for PageIsVerified() and committed it, as that's independently useful. This makes the patch proposed here simpler on HEAD, leading to the attached.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Anastasia Lubennikova
Date:
On 26.10.2020 04:13, Michael Paquier wrote:
> A small update here. I have extracted the refactored part for
> PageIsVerified() and committed it as that's independently useful.
> This makes the patch proposed here simpler on HEAD, leading to the
> attached.

Thank you for committing the first part. In case you need a second opinion on the remaining patch, it still looks good to me.
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Tue, Oct 27, 2020 at 10:56:23PM +0300, Anastasia Lubennikova wrote:
> In case you need a second opinion on the remaining patch, it still looks
> good to me.

Thanks. The patch for v13 cannot use a macro, but one of the versions from upthread would do just fine. I have been wondering about using the new CheckBuffer() for the purpose of the retry to make it concurrent-safe, but by looking at the code I think that we would run into problems when trying to open through smgr.c any relation file in global/, as these require an invalid backend ID, and a WAL sender does not satisfy that (see the assertion in GetRelationPath()).

I have been hesitating about increasing the number of retries, though, to give more room to avoid false positives. 20 perhaps? That would give a disk 2s to finish flushing a page that was caught in the middle of a check, with a sleep of 100ms, which sounds like plenty.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Wed, Oct 28, 2020 at 04:11:56PM +0900, Michael Paquier wrote:
> Thanks. The patch for v13 cannot use a macro, but one of the versions
> from upthread would do just fine. I have been wondering about using the
> new CheckBuffer() for the purpose of the retry to make it
> concurrent-safe, but by looking at the code I think that we would run
> into problems when trying to open through smgr.c any relation file in
> global/ as these require an invalid backend ID, and a WAL sender does
> not satisfy that (see the assertion in GetRelationPath()).

Actually, scratch that. It looks like I am wrong here. By using smgropen() with InvalidBackendId and a RelFileNode built using the path of the file being sent, similarly to what pg_rewind is doing in isRelDataFile(), we should have everything that's needed. This is too complicated for a backpatch, though, and we should have some consolidation with pg_rewind, so using the sleep/retry approach for v13 sounds like a safer path to take in the stable branch.
--
Michael
Attachment
Re: [patch] Fix checksum verification in base backups for zero page headers
From
Michael Paquier
Date:
On Wed, Oct 28, 2020 at 04:43:44PM +0900, Michael Paquier wrote:
> On Wed, Oct 28, 2020 at 04:11:56PM +0900, Michael Paquier wrote:
>> Thanks. The patch for v13 cannot use a macro, but one of the versions
>> of upthread would do just fine.

For the note, I have posted a set of patches to address all issues on the original thread where the problem applies:
https://www.postgresql.org/message-id/20201030023028.GC1693%40paquier.xyz
--
Michael