Thread: ThisTimeLineID can be used uninitialized

ThisTimeLineID can be used uninitialized

From
Robert Haas
Date:
Hi,

This is a followup to
http://postgr.es/m/CA+TgmoZ5A26C6OxKApafyuy_sx0VG6VXdD_Q6aSEzsvrPHDwzw@mail.gmail.com.
I'm suspicious of the following code in CreateReplicationSlot:

    /* setup state for WalSndSegmentOpen */
    sendTimeLineIsHistoric = false;
    sendTimeLine = ThisTimeLineID;

The first thing that's odd about this is that if this is physical
replication, it's apparently dead code, because AFAICT sendTimeLine
will not be used for anything in that case. If it's logical
replication, there's a case where this will set sendTImeLine to 0,
which seems very strange, since that is not a valid timeline. To
reproduce, do this:

1. Create a new primary database with tli=1. Create a standby.

2. On the standby, fire up a database-connected replication, something
like this:
psql 'port=5433 replication=database dbname=rhaas'
Don't execute any commands yet!

2. From some other backend, promote the standby:
select pg_promote();
It now gets a TLI of 2.

3. Try to create a logical replication slot perhaps using something like this:
CREATE_REPLICATION_SLOT "foo" LOGICAL "test_decoding" ( SNAPSHOT 'nothing');

If the system had been in normal running when you started the session,
it would be initialized, because InitPostgres() calls
RecoveryInProgress(). But since that only initializes it during normal
running, and not during recovery, it doesn't help in this scenario.
And if on the other hand you had not promoted the standby as in step
2, then we'd still set sendTimeLine = 0 here, but then we'd almost
immediately call CheckLogicalDecodingRequirements() and error out
without doing anything with the value. Here, however, we continue on.

But I don't know if it matters. We call CreateInitDecodingContext()
with sendTimeLine and ThisTimeLineID still zero; it doesn't call any
callbacks. Then we call DecodingContextFindStartpoint() with
sendTimeLine still 0 and the first callback that gets invoked is
logical_read_xlog_page. At this point sendTimeLine = 0 and
ThisTimeLineID = 0. That calls XLogReadDetermineTimeline() which
resets ThisTimeLineID to the correct value of 2, but when we get back
to logical_read_xlog_page, we still manage to call WALRead with a
timeline of 0 because state->seg.ws_tli is still 0. And when WALRead
eventually does call WalSndOpen, which unconditionally propagates
sendTimeLine into the TLI pointer that is passed to it. So now
state->seg_ws_tli also ends up being 2. So I guess maybe nothing bad
happens? But it sure seems strange that the code would apparently work
just as well as it does today with the following patch:

diff --git a/src/backend/replication/walsender.c
b/src/backend/replication/walsender.c
index b811a5c0ef..44fd598519 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -945,7 +945,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)

     /* setup state for WalSndSegmentOpen */
     sendTimeLineIsHistoric = false;
-    sendTimeLine = ThisTimeLineID;
+    sendTimeLine = rand() % 10;

     if (cmd->kind == REPLICATION_KIND_PHYSICAL)
     {

And in fact, that passes make check-world. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ThisTimeLineID can be used uninitialized

From
Andres Freund
Date:
Hi,

On 2021-10-19 15:13:04 -0400, Robert Haas wrote:
> This is a followup to
> http://postgr.es/m/CA+TgmoZ5A26C6OxKApafyuy_sx0VG6VXdD_Q6aSEzsvrPHDwzw@mail.gmail.com.
> I'm suspicious of the following code in CreateReplicationSlot:
> 
>     /* setup state for WalSndSegmentOpen */
>     sendTimeLineIsHistoric = false;
>     sendTimeLine = ThisTimeLineID;
> 
> The first thing that's odd about this is that if this is physical
> replication, it's apparently dead code, because AFAICT sendTimeLine
> will not be used for anything in that case.

It's quite confusing. It's *really* not helped by physical replication using
but not really using an xlogreader to keep state. Which presumably isn't
actually used during a physical CreateReplicationSlot(), but is referenced by
a comment :/


> But I don't know if it matters. We call CreateInitDecodingContext()
> with sendTimeLine and ThisTimeLineID still zero; it doesn't call any
> callbacks. Then we call DecodingContextFindStartpoint() with
> sendTimeLine still 0 and the first callback that gets invoked is
> logical_read_xlog_page. At this point sendTimeLine = 0 and
> ThisTimeLineID = 0. That calls XLogReadDetermineTimeline() which
> resets ThisTimeLineID to the correct value of 2, but when we get back
> to logical_read_xlog_page, we still manage to call WALRead with a
> timeline of 0 because state->seg.ws_tli is still 0. And when WALRead
> eventually does call WalSndOpen, which unconditionally propagates
> sendTimeLine into the TLI pointer that is passed to it. So now
> state->seg_ws_tli also ends up being 2. So I guess maybe nothing bad
> happens? But it sure seems strange that the code would apparently work
> just as well as it does today with the following patch:
> 
> diff --git a/src/backend/replication/walsender.c
> b/src/backend/replication/walsender.c
> index b811a5c0ef..44fd598519 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -945,7 +945,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> 
>      /* setup state for WalSndSegmentOpen */
>      sendTimeLineIsHistoric = false;
> -    sendTimeLine = ThisTimeLineID;
> +    sendTimeLine = rand() % 10;
> 
>      if (cmd->kind == REPLICATION_KIND_PHYSICAL)
>      {

Istm we should introduce an InvalidTimeLineID, and explicitly initialize
sendTimeLine to that, and assert that it's valid / invalid in a bunch of
places?

Greetings,

Andres Freund



Re: ThisTimeLineID can be used uninitialized

From
Alvaro Herrera
Date:
On 2021-Oct-19, Andres Freund wrote:

> Hi,
> 
> On 2021-10-19 15:13:04 -0400, Robert Haas wrote:
> > This is a followup to
> > http://postgr.es/m/CA+TgmoZ5A26C6OxKApafyuy_sx0VG6VXdD_Q6aSEzsvrPHDwzw@mail.gmail.com.
> > I'm suspicious of the following code in CreateReplicationSlot:
> > 
> >     /* setup state for WalSndSegmentOpen */
> >     sendTimeLineIsHistoric = false;
> >     sendTimeLine = ThisTimeLineID;
> > 
> > The first thing that's odd about this is that if this is physical
> > replication, it's apparently dead code, because AFAICT sendTimeLine
> > will not be used for anything in that case.
> 
> It's quite confusing. It's *really* not helped by physical replication using
> but not really using an xlogreader to keep state. Which presumably isn't
> actually used during a physical CreateReplicationSlot(), but is referenced by
> a comment :/

Yeah, that's not very nice.  My preference would be to change physical
replication to use xlogreader in the regular way, and avoid confounding
backdoors like its current approach.

> > But it sure seems strange that the code would apparently work just
> > as well as it does today with the following patch:
> > 
> > diff --git a/src/backend/replication/walsender.c
> > b/src/backend/replication/walsender.c
> > index b811a5c0ef..44fd598519 100644
> > --- a/src/backend/replication/walsender.c
> > +++ b/src/backend/replication/walsender.c
> > @@ -945,7 +945,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> > 
> >      /* setup state for WalSndSegmentOpen */
> >      sendTimeLineIsHistoric = false;
> > -    sendTimeLine = ThisTimeLineID;
> > +    sendTimeLine = rand() % 10;

Hah.  Yeah, when you can do things like that and the tests don't break,
that indicates a problem in the tests.

> Istm we should introduce an InvalidTimeLineID, and explicitly initialize
> sendTimeLine to that, and assert that it's valid / invalid in a bunch of
> places?

That's not a bad idea; it'll help discover bogus code.  Obviously, some
additional tests wouldn't harm -- we have a lot more coverage now than
in embarrasingly recent past, but it can still be improved.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Las mujeres son como hondas:  mientras más resistencia tienen,
 más lejos puedes llegar con ellas"  (Jonas Nightingale, Leap of Faith)



Re: ThisTimeLineID can be used uninitialized

From
Robert Haas
Date:
On Tue, Oct 19, 2021 at 7:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hah.  Yeah, when you can do things like that and the tests don't break,
> that indicates a problem in the tests.

I *think* the problem is actually in the code, not the tests. In other
words, from what I can tell, we copy the bogus timeline value (0, or a
random number) into several places, but then eventually overwrite all
copies of that value with a correct value before using it for
anything. So in other words I think that the comment saying that this
code is initializing values that WalSndSegmentOpen is going to need is
just wrong. I don't completely understand why it's wrong, but I think
it IS wrong.

> > Istm we should introduce an InvalidTimeLineID, and explicitly initialize
> > sendTimeLine to that, and assert that it's valid / invalid in a bunch of
> > places?
>
> That's not a bad idea; it'll help discover bogus code.  Obviously, some
> additional tests wouldn't harm -- we have a lot more coverage now than
> in embarrasingly recent past, but it can still be improved.

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ThisTimeLineID can be used uninitialized

From
Alvaro Herrera
Date:
On 2021-Oct-20, Robert Haas wrote:

> On Tue, Oct 19, 2021 at 7:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Hah.  Yeah, when you can do things like that and the tests don't break,
> > that indicates a problem in the tests.
> 
> I *think* the problem is actually in the code, not the tests. In other
> words, from what I can tell, we copy the bogus timeline value (0, or a
> random number) into several places, but then eventually overwrite all
> copies of that value with a correct value before using it for
> anything. So in other words I think that the comment saying that this
> code is initializing values that WalSndSegmentOpen is going to need is
> just wrong. I don't completely understand why it's wrong, but I think
> it IS wrong.

Oh, I'm not saying that there isn't a problem in the code.  I'm just
saying that there is *also* a problem (an omission) in the tests.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)



Re: ThisTimeLineID can be used uninitialized

From
Michael Paquier
Date:
On Wed, Oct 20, 2021 at 09:08:57AM -0400, Robert Haas wrote:
> On Tue, Oct 19, 2021 at 7:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Istm we should introduce an InvalidTimeLineID, and explicitly initialize
>>> sendTimeLine to that, and assert that it's valid / invalid in a bunch of
>>> places?
>>
>> That's not a bad idea; it'll help discover bogus code.  Obviously, some
>> additional tests wouldn't harm -- we have a lot more coverage now than
>> in embarrasingly recent past, but it can still be improved.
>
> +1.

There is already an assumption in walsender.c where an invalid
timeline is 0, by the way?  See sendTimeLineNextTLI and sendTimeLine.
Asserting here and there looks like a good thing to do for code paths
where the timeline should, or should not, be set.
--
Michael

Attachment

Re: ThisTimeLineID can be used uninitialized

From
Alvaro Herrera
Date:
On 2021-Oct-21, Michael Paquier wrote:

> There is already an assumption in walsender.c where an invalid
> timeline is 0, by the way?  See sendTimeLineNextTLI and sendTimeLine.
> Asserting here and there looks like a good thing to do for code paths
> where the timeline should, or should not, be set.

Sure, but as Robert suggested, let's make that value a known and obvious
constant InvalidTimeLineId rather than magic value 0.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)



Re: ThisTimeLineID can be used uninitialized

From
Robert Haas
Date:
On Tue, Oct 19, 2021 at 4:44 PM Andres Freund <andres@anarazel.de> wrote:
> It's quite confusing. It's *really* not helped by physical replication using
> but not really using an xlogreader to keep state. Which presumably isn't
> actually used during a physical CreateReplicationSlot(), but is referenced by
> a comment :/

I can't figure out what you're referring to here. I can't find
CreateReplicationSlot() using an xlogreader in the logical replication
case, or a comment that refers to it doing so.

> Istm we should introduce an InvalidTimeLineID, and explicitly initialize
> sendTimeLine to that, and assert that it's valid / invalid in a bunch of
> places?

I think the correct fix for this particular problem is just to delete
the initialization, as in the attached patch. I spent a long time
studying this today and eventually convinced myself that there's just
no way these initializations can ever do anything (details in proposed
commit message).  While it is important that we do not access the
global variable when it's uninitialized, here there is no reason to
access it in the first place.

Regarding the more general problem, I think we should consider (1)
reducing the number of places that access ThisTimeLineID directly,
preferring to add TimeLineID arguments to functions and pass the
relevant timeline value around explicitly and then (2) changing all of
the remaining accesses to ThisTimeLineID to function calls instead,
e.g. by inventing a function GetCurrentTimeLineID(). Once we do that,
I think this kind of problem just goes away. On the one hand,
GetCurrentTimeLineID() could assert that the value is valid before
returning it, and then we would have centralized checking that we're
not using a bogus value. But, there's no reason to stop there. If all
the callers are using this function rather than accessing the global
variable directly, then the function can just initialize the value
from shared memory as required! Or it can forget about having a local
copy stored in a global variable and just always read the current
value from shared memory! With a little thought, I think this approach
can avoid this sort of unfortunate coding:

        if (!RecoveryInProgress())
            read_upto = GetFlushRecPtr();
        else
            read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
        tli = ThisTimeLineID;

What is going on here? Well, if we're not still in recovery, then the
call to RecoveryInProgress() will initialize ThisTimeLineID as a side
effect, and after that it can't change. If we are still in recovery
then GetXLogReplayRecPtr()'s will update the global variable as a side
effect on every trip through the function. Either way, read_upto is
the end of WAL in the way that's relevant to whichever operating mode
is current. But imagine that we could code this in a way that didn't
depend on global variables getting updated as a side effect. For
example:

        if (!RecoveryInProgress())
            read_upto = GetFlushRecPtr();
        else
            read_upto = GetXLogReplayRecPtr();
        currTLI = GetCurrentTimeLineID();

Or perhaps:

        if (!RecoveryInProgress())
            read_upto = GetFlushRecPtr(&currTLI);
        else
            read_upto = GetXLogReplayRecPtr(&currTLI);

My point here is that the current idiom only makes sense if you
realize that RecoveryInProgress() has a side effect of updating
ThisTimeLineID, and on the other hand that the only reason we're using
ThisTimeLineID instead of a local variable here is that that's what
RecoveryInProgress() updates. It's just two mutually-reinforcing bad
decisions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: ThisTimeLineID can be used uninitialized

From
Robert Haas
Date:
On Thu, Oct 21, 2021 at 3:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the correct fix for this particular problem is just to delete
> the initialization, as in the attached patch. I spent a long time
> studying this today and eventually convinced myself that there's just
> no way these initializations can ever do anything (details in proposed
> commit message).  While it is important that we do not access the
> global variable when it's uninitialized, here there is no reason to
> access it in the first place.

I have committed this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com