Thread: checksum verification code breaks backups in v16-

checksum verification code breaks backups in v16-

From
Robert Haas
Date:
In 025584a168a4b3002e19350bb8db0ebf1fd10235, which shipped with v17, I
changed the way that a base backup decides whether files have
checksums. At the time, I thought this was harmless refactoring, but
it turns out that it was better than harmless. The old way can cause
pg_basebackup failures. To reproduce, I believe you just need:

1. A cluster created with v16 or earlier with checksums enabled (i.e.
initdb -k).
2. A file inside "global", "base", or a tablespace directory whose
name contains a period which is followed by something that atoi() is
unable to convert to a positive number.

Then pg_basebackup will fail like this:

pg_basebackup: error: could not get COPY data stream: ERROR:  invalid
segment number 0 in file "pg_control.old"

One way to fix this is to back-port
025584a168a4b3002e19350bb8db0ebf1fd10235. Before that commit, we
assumed all files had checksums unless the filename like one of the
files that we know isn't supposed to be checksummed. After that
commit, we assume files don't have checksums unless the file format
looks like something we know should be checksummed. That inherently
avoids trying to checksum things we're not expecting to see, such as a
pg_control.old file.

If we want a narrower and thus less-risky fix, we could consider just
adjusting this code here:

                segmentno = atoi(segmentpath + 1);
                if (segmentno == 0)
                    ereport(ERROR,
                            (errmsg("invalid segment number %d in file \"%s\"",
                                    segmentno, filename)));

I think we could just replace the ereport() with verify_checksum =
false (and update the comments). That would leave open the possibility
of trying to checksum some random file with a name like
this_file_should_not_be_here, which we'll interpret as segment 0
because the name does not contain a dot; but if the file were called
this.file.should.not.be.here, we'd not try to checksum it at all
because "here" is not an integer. That's a little random, but it
avoids taking a position on whether
025584a168a4b3002e19350bb8db0ebf1fd10235 is fully correct. It simply
takes the narrower position that we shouldn't fail the entire backup
because we're unable to perform a checksum calculation that in the
worst case would only produce a warning anyway.

At the moment, I think I slightly prefer the narrower fix, but it
might be a little too narrow, since the resultant behavior as
described in the preceding paragraph is somewhat wonky and
nonsensical.

Opinions?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: checksum verification code breaks backups in v16-

From
Nathan Bossart
Date:
On Tue, Dec 03, 2024 at 11:39:43AM -0500, Robert Haas wrote:
> If we want a narrower and thus less-risky fix, we could consider just
> adjusting this code here:
> 
>                 segmentno = atoi(segmentpath + 1);
>                 if (segmentno == 0)
>                     ereport(ERROR,
>                             (errmsg("invalid segment number %d in file \"%s\"",
>                                     segmentno, filename)));

I'm only seeing this code in pg_checksums.  Is that what you are proposing
to change?

> I think we could just replace the ereport() with verify_checksum =
> false (and update the comments). That would leave open the possibility
> of trying to checksum some random file with a name like
> this_file_should_not_be_here, which we'll interpret as segment 0
> because the name does not contain a dot; but if the file were called
> this.file.should.not.be.here, we'd not try to checksum it at all
> because "here" is not an integer. That's a little random, but it
> avoids taking a position on whether
> 025584a168a4b3002e19350bb8db0ebf1fd10235 is fully correct. It simply
> takes the narrower position that we shouldn't fail the entire backup
> because we're unable to perform a checksum calculation that in the
> worst case would only produce a warning anyway.

Hm.

> At the moment, I think I slightly prefer the narrower fix, but it
> might be a little too narrow, since the resultant behavior as
> described in the preceding paragraph is somewhat wonky and
> nonsensical.

Yeah, I'm a little worried that it is too narrow.  I was hoping that the
refactoring commit was older, because then I think we'd be much more
willing to back-patch it.  But there are a couple of v17 releases out there
already, so maybe it's not too much of a stretch.  I guess I'm leaning
towards the back-patching idea...

-- 
nathan



Re: checksum verification code breaks backups in v16-

From
Robert Haas
Date:
On Tue, Dec 3, 2024 at 5:26 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Dec 03, 2024 at 11:39:43AM -0500, Robert Haas wrote:
> > If we want a narrower and thus less-risky fix, we could consider just
> > adjusting this code here:
> >
> >                 segmentno = atoi(segmentpath + 1);
> >                 if (segmentno == 0)
> >                     ereport(ERROR,
> >                             (errmsg("invalid segment number %d in file \"%s\"",
> >                                     segmentno, filename)));
>
> I'm only seeing this code in pg_checksums.  Is that what you are proposing
> to change?

No. This code is present in src/backend/backup/basebackup.c in REL_16_STABLE.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: checksum verification code breaks backups in v16-

From
Nathan Bossart
Date:
On Wed, Dec 04, 2024 at 09:29:50AM -0500, Robert Haas wrote:
> On Tue, Dec 3, 2024 at 5:26 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm only seeing this code in pg_checksums.  Is that what you are proposing
>> to change?
> 
> No. This code is present in src/backend/backup/basebackup.c in REL_16_STABLE.

D'oh, sorry.  Even so, I think I'd still vote for back-patching the v17
commit that inadvertently fixed this.

-- 
nathan



Re: checksum verification code breaks backups in v16-

From
Robert Haas
Date:
On Wed, Dec 4, 2024 at 11:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> D'oh, sorry.  Even so, I think I'd still vote for back-patching the v17
> commit that inadvertently fixed this.

Gotcha. Let's see if anyone else votes.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: checksum verification code breaks backups in v16-

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 4, 2024 at 11:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> D'oh, sorry.  Even so, I think I'd still vote for back-patching the v17
>> commit that inadvertently fixed this.

> Gotcha. Let's see if anyone else votes.

+1 for a back-patch of 025584a16.  It's made it through a couple of
minor releases now, so I think it's more trustworthy than a
one-liner that you have low confidence in.

            regards, tom lane