Thread: ThisTimeLineID is used uninitialized in basebackup.c, too
I previously reported[1] that CreateReplicationSlot() was accessing the global variable ThisTimeLineID when it might not be initialized. Today I realized that the code in basebackup.c has the same problem. perform_base_backup() blithely uses the variable six times without doing anything at all to initialize it. In practice, it's always going to be initialized to some non-zero value, because pg_basebackup is always going to execute IDENTIFY_SYSTEM before it executes BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect. But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before calling BASE_BACKUP, then you can reach this code with ThisTimeLineID == 0 using pg_basebackup -Xfetch. Even if you don't do that, you're only guaranteed that ThisTimeLineID is initialized to something, not that it's initialized to the correct thing. The timeline on the standby can change any time. Fortunately, this has no serious consequences AFAICT. Heikki wrote "I'd rather not worry about timelines here" and wrote the algorithm in such a way that the only thing that happens if you feed a wrong TLI through the logic is that the wrong TLI might be used to construct a WAL file name for use in an error report. You won't get an error when things should have worked, or the other way around, but if you get a complaint about a segment file, it might include the wrong TLI in the filename. I think possibly this logic should be rewritten to be fully timeline-aware, but that's a bit of a project and this isn't really broken, so what I'd like to do for right now is change it to use non-bogus TLIs that we have available in local variables rather than a possibly-nonsensical value from a global variable. Patch for that attached. -- Robert Haas EDB: http://www.enterprisedb.com [1] http://postgr.es/m/CA+TgmoZC2aYe9eRXcUa_c_XBBMVLps9LZL_K5Oo7M9Xhco2KhA@mail.gmail.com
Attachment
At Thu, 28 Oct 2021 11:39:52 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > I previously reported[1] that CreateReplicationSlot() was accessing > the global variable ThisTimeLineID when it might not be initialized. > Today I realized that the code in basebackup.c has the same problem. > perform_base_backup() blithely uses the variable six times without > doing anything at all to initialize it. In practice, it's always going > to be initialized to some non-zero value, because pg_basebackup is > always going to execute IDENTIFY_SYSTEM before it executes > BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect. > But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before > calling BASE_BACKUP, then you can reach this code with ThisTimeLineID > == 0 using pg_basebackup -Xfetch. Even if you don't do that, you're > only guaranteed that ThisTimeLineID is initialized to something, not > that it's initialized to the correct thing. The timeline on the > standby can change any time. > > Fortunately, this has no serious consequences AFAICT. Heikki wrote > "I'd rather not worry about timelines here" and wrote the algorithm in > such a way that the only thing that happens if you feed a wrong TLI > through the logic is that the wrong TLI might be used to construct a > WAL file name for use in an error report. You won't get an error when > things should have worked, or the other way around, but if you get a > complaint about a segment file, it might include the wrong TLI in the > filename. I think possibly this logic should be rewritten to be fully > timeline-aware, but that's a bit of a project and this isn't really > broken, so what I'd like to do for right now is change it to use > non-bogus TLIs that we have available in local variables rather than a > possibly-nonsensical value from a global variable. Patch for that > attached. The first hunk is useless but seems good for sanity. The second hunk looks good. (I think we can call CheckXLogRemoved before collecting file names but it would be another thing.) The third and fifth hunks look fine. The fourth hunk is somewhat dubious. The TLI for a new segment is not necessarily equal to the skipped segments unless starttli == endtli. So we could change the message like "could not find WAL file for segment %X" or such but also I don't oppose to using the tli of the new segment as an approximate. As a whole, it looks fine to me. > -- > Robert Haas > EDB: http://www.enterprisedb.com > > [1] http://postgr.es/m/CA+TgmoZC2aYe9eRXcUa_c_XBBMVLps9LZL_K5Oo7M9Xhco2KhA@mail.gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Oct 28, 2021 at 11:39:52AM -0400, Robert Haas wrote: > I previously reported[1] that CreateReplicationSlot() was accessing > the global variable ThisTimeLineID when it might not be initialized. > Today I realized that the code in basebackup.c has the same problem. > perform_base_backup() blithely uses the variable six times without > doing anything at all to initialize it. In practice, it's always going > to be initialized to some non-zero value, because pg_basebackup is > always going to execute IDENTIFY_SYSTEM before it executes > BASE_BACKUP, and that's going to set ThisTimeLineID as a side effect. > But, if you hack pg_basebackup to not call IDENTIFY_SYSTEM before > calling BASE_BACKUP, then you can reach this code with ThisTimeLineID > == 0 using pg_basebackup -Xfetch. Even if you don't do that, you're > only guaranteed that ThisTimeLineID is initialized to something, not > that it's initialized to the correct thing. The timeline on the > standby can change any time. Yes, I agree that it is a good idea to cut the dependency of those code paths with ThisTimeLineID, expecting IDENTIFY_SYSTEM to have done the job beforehand. One argument in favor of your change, though I'd like to think that nobody does so, is that users could run BASE_BACKUP with a replication connection. So no need to hack pg_basebackup to be able to finish with a WAL sender that has no TLI set in the backend. Your patch seems correct to me. -- Michael
Attachment
On Thu, Oct 28, 2021 at 9:27 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > The first hunk is useless but seems good for sanity. The second hunk > looks good. (I think we can call CheckXLogRemoved before collecting > file names but it would be another thing.) The third and fifth hunks > look fine. Thanks for your careful review and analysis, with which I agree. Instead of using starttli and endtli in the first hunk, we could instead use an arbitrary constant, like 1, or 42. We could even go completely nuts and write something like (TimeLineID) NBuffers. After all, if the value doesn't matter, we can use anything. But I find using the TLI values that are associated with the LSNs to be more pleasing, and it seems like you agree, so let's do that. > The fourth hunk is somewhat dubious. The TLI for a new segment is not > necessarily equal to the skipped segments unless starttli == > endtli. So we could change the message like "could not find WAL file > for segment %X" or such but also I don't oppose to using the tli of > the new segment as an approximate. Right. I think that users might be confused if we just tell them the segment number; they might have a hard time knowing what that really means. Normally all the TLI values here will be the same, so picking any one of them and generating a WAL file name based on that will be more helpful that a segment number from which they have to generate the WAL file name themselves. Really, when you stop to think about it, this logic is totally ridiculous. Imagine that startsegno = 10, endsegno = 15, starttli = 4, and endtli = 4. The code as written would be happy if it found segment 10 on timeline 4, segment 11 on timeline 3, segment 12 on timeline 2, segment 13 on timeline 1, and segment 14 on timeline 525600, which, needless to say, wouldn't work at all if you tried to use the backup. We only get by with this logic because in practice such things don't happen. I'd be happy to see someone rewrite this to perform better checking, but as long as we're making the rather dubious assumption that timelines somehow don't matter, I think changing the error message would make things more confusing rather than less. > As a whole, it looks fine to me. Great. Thank you. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Oct 29, 2021 at 7:28 AM Michael Paquier <michael@paquier.xyz> wrote: > Yes, I agree that it is a good idea to cut the dependency of those > code paths with ThisTimeLineID, expecting IDENTIFY_SYSTEM to have done > the job beforehand. One argument in favor of your change, though I'd > like to think that nobody does so, is that users could run BASE_BACKUP > with a replication connection. So no need to hack pg_basebackup to be > able to finish with a WAL sender that has no TLI set in the backend. Right. I mean, I think the current behavior is both unprincipled and unintentional. It's not like somebody would have intentionally designed the BASE_BACKUP command to rely on a global variable happening to have been set by a previous command. And if for some crazy reason they had done that, surely there would be some comments or something talking about it. It's just a (minor) mistake. > Your patch seems correct to me. Thanks, committed. -- Robert Haas EDB: http://www.enterprisedb.com