Thread: ThisTimeLineID is used uninitialized in basebackup.c, too

ThisTimeLineID is used uninitialized in basebackup.c, too

From
Robert Haas
Date:
I previously reported[1] that CreateReplicationSlot() was accessing
the global variable ThisTimeLineID when it might not be initialized.
Today I realized that the code in basebackup.c has the same problem.
perform_base_backup() blithely uses the variable six times without
doing anything at all to initialize it. In practice, it's always going
to be initialized to some non-zero value, because pg_basebackup is
always going to execute IDENTIFY_SYSTEM before it executes
BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect.
But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before
calling BASE_BACKUP, then you can reach this code with ThisTimeLineID
== 0 using pg_basebackup -Xfetch. Even if you don't do that, you're
only guaranteed that ThisTimeLineID is initialized to something, not
that it's initialized to the correct thing. The timeline on the
standby can change any time.

Fortunately, this has no serious consequences AFAICT. Heikki wrote
"I'd rather not worry about timelines here" and wrote the algorithm in
such a way that the only thing that happens if you feed a wrong TLI
through the logic is that the wrong TLI might be used to construct a
WAL file name for use in an error report. You won't get an error when
things should have worked, or the other way around, but if you get a
complaint about a segment file, it might include the wrong TLI in the
filename. I think possibly this logic should be rewritten to be fully
timeline-aware, but that's a bit of a project and this isn't really
broken, so what I'd like to do for right now is change it to use
non-bogus TLIs that we have available in local variables rather than a
possibly-nonsensical value from a global variable. Patch for that
attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

[1] http://postgr.es/m/CA+TgmoZC2aYe9eRXcUa_c_XBBMVLps9LZL_K5Oo7M9Xhco2KhA@mail.gmail.com

Attachment

Re: ThisTimeLineID is used uninitialized in basebackup.c, too

From
Kyotaro Horiguchi
Date:
At Thu, 28 Oct 2021 11:39:52 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> I previously reported[1] that CreateReplicationSlot() was accessing
> the global variable ThisTimeLineID when it might not be initialized.
> Today I realized that the code in basebackup.c has the same problem.
> perform_base_backup() blithely uses the variable six times without
> doing anything at all to initialize it. In practice, it's always going
> to be initialized to some non-zero value, because pg_basebackup is
> always going to execute IDENTIFY_SYSTEM before it executes
> BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect.
> But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before
> calling BASE_BACKUP, then you can reach this code with ThisTimeLineID
> == 0 using pg_basebackup -Xfetch. Even if you don't do that, you're
> only guaranteed that ThisTimeLineID is initialized to something, not
> that it's initialized to the correct thing. The timeline on the
> standby can change any time.
> 
> Fortunately, this has no serious consequences AFAICT. Heikki wrote
> "I'd rather not worry about timelines here" and wrote the algorithm in
> such a way that the only thing that happens if you feed a wrong TLI
> through the logic is that the wrong TLI might be used to construct a
> WAL file name for use in an error report. You won't get an error when
> things should have worked, or the other way around, but if you get a
> complaint about a segment file, it might include the wrong TLI in the
> filename. I think possibly this logic should be rewritten to be fully
> timeline-aware, but that's a bit of a project and this isn't really
> broken, so what I'd like to do for right now is change it to use
> non-bogus TLIs that we have available in local variables rather than a
> possibly-nonsensical value from a global variable. Patch for that
> attached.

The first hunk is useless but seems good for sanity. The second hunk
looks good. (I think we can call CheckXLogRemoved before collecting
file names but it would be another thing.) The third and fifth hunks
look fine.

The fourth hunk is somewhat dubious. The TLI for a new segment is not
necessarily equal to the skipped segments unless starttli ==
endtli. So we could change the message like "could not find WAL file
for segment %X" or such but also I don't oppose to using the tli of
the new segment as an approximate.

As a whole, it looks fine to me.

> --
> Robert Haas
> EDB: http://www.enterprisedb.com
> 
> [1] http://postgr.es/m/CA+TgmoZC2aYe9eRXcUa_c_XBBMVLps9LZL_K5Oo7M9Xhco2KhA@mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: ThisTimeLineID is used uninitialized in basebackup.c, too

From
Michael Paquier
Date:
On Thu, Oct 28, 2021 at 11:39:52AM -0400, Robert Haas wrote:
> I previously reported[1] that CreateReplicationSlot() was accessing
> the global variable ThisTimeLineID when it might not be initialized.
> Today I realized that the code in basebackup.c has the same problem.
> perform_base_backup() blithely uses the variable six times without
> doing anything at all to initialize it. In practice, it's always going
> to be initialized to some non-zero value, because pg_basebackup is
> always going to execute IDENTIFY_SYSTEM before it executes
> BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect.
> But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before
> calling BASE_BACKUP, then you can reach this code with ThisTimeLineID
> == 0 using pg_basebackup -Xfetch. Even if you don't do that, you're
> only guaranteed that ThisTimeLineID is initialized to something, not
> that it's initialized to the correct thing. The timeline on the
> standby can change any time.

Yes, I agree that it is a good idea to cut the dependency of those
code paths with ThisTimeLineID, expecting IDENTIFY_SYSTEM to have done
the job beforehand.  One argument in favor of your change, though I'd
like to think that nobody does so, is that users could run BASE_BACKUP
with a replication connection.  So no need to hack pg_basebackup to be
able to finish with a WAL sender that has no TLI set in the backend.

Your patch seems correct to me.
--
Michael

Attachment

Re: ThisTimeLineID is used uninitialized in basebackup.c, too

From
Robert Haas
Date:
On Thu, Oct 28, 2021 at 9:27 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> The first hunk is useless but seems good for sanity. The second hunk
> looks good. (I think we can call CheckXLogRemoved before collecting
> file names but it would be another thing.) The third and fifth hunks
> look fine.

Thanks for your careful review and analysis, with which I agree.
Instead of using starttli and endtli in the first hunk, we could
instead use an arbitrary constant, like 1, or 42. We could even go
completely nuts and write something like (TimeLineID) NBuffers. After
all, if the value doesn't matter, we can use anything. But I find
using the TLI values that are associated with the LSNs to be more
pleasing, and it seems like you agree, so let's do that.

> The fourth hunk is somewhat dubious. The TLI for a new segment is not
> necessarily equal to the skipped segments unless starttli ==
> endtli. So we could change the message like "could not find WAL file
> for segment %X" or such but also I don't oppose to using the tli of
> the new segment as an approximate.

Right. I think that users might be confused if we just tell them the
segment number; they might have a hard time knowing what that really
means. Normally all the TLI values here will be the same, so picking
any one of them and generating a WAL file name based on that will be
more helpful that a segment number from which they have to generate
the WAL file name themselves.

Really, when you stop to think about it, this logic is totally
ridiculous. Imagine that startsegno = 10, endsegno = 15, starttli = 4,
and endtli = 4. The code as written would be happy if it found segment
10 on timeline 4, segment 11 on timeline 3, segment 12 on timeline 2,
segment 13 on timeline 1, and segment 14 on timeline 525600, which,
needless to say, wouldn't work at all if you tried to use the backup.
We only get by with this logic because in practice such things don't
happen. I'd be happy to see someone rewrite this to perform better
checking, but as long as we're making the rather dubious assumption
that timelines somehow don't matter, I think changing the error
message would make things more confusing rather than less.

> As a whole, it looks fine to me.

Great. Thank you.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ThisTimeLineID is used uninitialized in basebackup.c, too

From
Robert Haas
Date:
On Fri, Oct 29, 2021 at 7:28 AM Michael Paquier <michael@paquier.xyz> wrote:
> Yes, I agree that it is a good idea to cut the dependency of those
> code paths with ThisTimeLineID, expecting IDENTIFY_SYSTEM to have done
> the job beforehand.  One argument in favor of your change, though I'd
> like to think that nobody does so, is that users could run BASE_BACKUP
> with a replication connection.  So no need to hack pg_basebackup to be
> able to finish with a WAL sender that has no TLI set in the backend.

Right. I mean, I think the current behavior is both unprincipled and
unintentional. It's not like somebody would have intentionally
designed the BASE_BACKUP command to rely on a global variable
happening to have been set by a previous command. And if for some
crazy reason they had done that, surely there would be some comments
or something talking about it. It's just a (minor) mistake.

> Your patch seems correct to me.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com