Thread: checksum verification code breaks backups in v16-
In 025584a168a4b3002e19350bb8db0ebf1fd10235, which shipped with v17, I changed the way that a base backup decides whether files have checksums. At the time, I thought this was harmless refactoring, but it turns out that it was better than harmless. The old way can cause pg_basebackup failures. To reproduce, I believe you just need: 1. A cluster created with v16 or earlier with checksums enabled (i.e. initdb -k). 2. A file inside "global", "base", or a tablespace directory whose name contains a period which is followed by something that atoi() is unable to convert to a positive number. Then pg_basebackup will fail like this: pg_basebackup: error: could not get COPY data stream: ERROR: invalid segment number 0 in file "pg_control.old" One way to fix this is to back-port 025584a168a4b3002e19350bb8db0ebf1fd10235. Before that commit, we assumed all files had checksums unless the filename like one of the files that we know isn't supposed to be checksummed. After that commit, we assume files don't have checksums unless the file format looks like something we know should be checksummed. That inherently avoids trying to checksum things we're not expecting to see, such as a pg_control.old file. If we want a narrower and thus less-risky fix, we could consider just adjusting this code here: segmentno = atoi(segmentpath + 1); if (segmentno == 0) ereport(ERROR, (errmsg("invalid segment number %d in file \"%s\"", segmentno, filename))); I think we could just replace the ereport() with verify_checksum = false (and update the comments). That would leave open the possibility of trying to checksum some random file with a name like this_file_should_not_be_here, which we'll interpret as segment 0 because the name does not contain a dot; but if the file were called this.file.should.not.be.here, we'd not try to checksum it at all because "here" is not an integer. That's a little random, but it avoids taking a position on whether 025584a168a4b3002e19350bb8db0ebf1fd10235 is fully correct. It simply takes the narrower position that we shouldn't fail the entire backup because we're unable to perform a checksum calculation that in the worst case would only produce a warning anyway. At the moment, I think I slightly prefer the narrower fix, but it might be a little too narrow, since the resultant behavior as described in the preceding paragraph is somewhat wonky and nonsensical. Opinions? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Dec 03, 2024 at 11:39:43AM -0500, Robert Haas wrote: > If we want a narrower and thus less-risky fix, we could consider just > adjusting this code here: > > segmentno = atoi(segmentpath + 1); > if (segmentno == 0) > ereport(ERROR, > (errmsg("invalid segment number %d in file \"%s\"", > segmentno, filename))); I'm only seeing this code in pg_checksums. Is that what you are proposing to change? > I think we could just replace the ereport() with verify_checksum = > false (and update the comments). That would leave open the possibility > of trying to checksum some random file with a name like > this_file_should_not_be_here, which we'll interpret as segment 0 > because the name does not contain a dot; but if the file were called > this.file.should.not.be.here, we'd not try to checksum it at all > because "here" is not an integer. That's a little random, but it > avoids taking a position on whether > 025584a168a4b3002e19350bb8db0ebf1fd10235 is fully correct. It simply > takes the narrower position that we shouldn't fail the entire backup > because we're unable to perform a checksum calculation that in the > worst case would only produce a warning anyway. Hm. > At the moment, I think I slightly prefer the narrower fix, but it > might be a little too narrow, since the resultant behavior as > described in the preceding paragraph is somewhat wonky and > nonsensical. Yeah, I'm a little worried that it is too narrow. I was hoping that the refactoring commit was older, because then I think we'd be much more willing to back-patch it. But there are a couple of v17 releases out there already, so maybe it's not too much of a stretch. I guess I'm leaning towards the back-patching idea... -- nathan
On Tue, Dec 3, 2024 at 5:26 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Dec 03, 2024 at 11:39:43AM -0500, Robert Haas wrote: > > If we want a narrower and thus less-risky fix, we could consider just > > adjusting this code here: > > > > segmentno = atoi(segmentpath + 1); > > if (segmentno == 0) > > ereport(ERROR, > > (errmsg("invalid segment number %d in file \"%s\"", > > segmentno, filename))); > > I'm only seeing this code in pg_checksums. Is that what you are proposing > to change? No. This code is present in src/backend/backup/basebackup.c in REL_16_STABLE. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Dec 04, 2024 at 09:29:50AM -0500, Robert Haas wrote: > On Tue, Dec 3, 2024 at 5:26 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I'm only seeing this code in pg_checksums. Is that what you are proposing >> to change? > > No. This code is present in src/backend/backup/basebackup.c in REL_16_STABLE. D'oh, sorry. Even so, I think I'd still vote for back-patching the v17 commit that inadvertently fixed this. -- nathan
On Wed, Dec 4, 2024 at 11:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > D'oh, sorry. Even so, I think I'd still vote for back-patching the v17 > commit that inadvertently fixed this. Gotcha. Let's see if anyone else votes. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Dec 4, 2024 at 11:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> D'oh, sorry. Even so, I think I'd still vote for back-patching the v17 >> commit that inadvertently fixed this. > Gotcha. Let's see if anyone else votes. +1 for a back-patch of 025584a16. It's made it through a couple of minor releases now, so I think it's more trustworthy than a one-liner that you have low confidence in. regards, tom lane