Thread: A few nuances about specifying the timeline with START_REPLICATION

A few nuances about specifying the timeline with START_REPLICATION

From
Jeff Davis
Date:
A few questions about this comment in walsender.c, originating in
commit abfd192b1b5b:

/*
 * Found the requested timeline in the history. Check that
 * requested startpoint is on that timeline in our history.
 *
 * This is quite loose on purpose. We only check that we didn't
 * fork off the requested timeline before the switchpoint. We
 * don't check that we switched *to* it before the requested
 * starting point. This is because the client can legitimately
 * request to start replication from the beginning of the WAL
 * segment that contains switchpoint, but on the new timeline, so
 * that it doesn't end up with a partial segment. If you ask for
 * too old a starting point, you'll get an error later when we
 * fail to find the requested WAL segment in pg_wal.
 *
 * XXX: we could be more strict here and only allow a startpoint
 * that's older than the switchpoint, if it's still in the same
 * WAL segment.                                            
 */

1. I think there's a typo: it should be "fork off the requested
timeline before the startpoint", right?

2. It seems to imply that requesting an old start point is wrong, but I
don't see why. As long as the WAL is there (or at least the slot
boundaries), what's the problem? Can we either just change the comment
to say that it's fine to start on an ancestor of the requested timeline
(and maybe update the docs, too)?

3. I noticed when looking at this that the terminology in the docs is a
bit inconsistent between START_REPLICATION and
recovery_target_timeline.
  a. In recovery_target_timeline:
     i. a numeric value means "stop when this timeline forks"
     ii. "latest" means "follow forks along the newest timeline"
     iii. "current" is an alias for the backup's numerical timeline
  b. In the start START_REPLICATION docs:
     i. "current" means "follow forks along the newest timeline"
     ii. a numeric value that is equal to the current timeline is the
same as "current"
     iii. a numeric value that is less than the current timeline means
"stop when this timeline forks"

On point #3, it looks like START_REPLICATION could be improved:

  * Should we change the docs to say "latest" rather than "current"?
  * Should we change the behavior so that specifying the current
timeline as a number still means a historic timeline (e.g. it will stop
replicating there and emit a tuple)?
  * Should we add some keywords like "latest" or "current" to the
START_REPLICATION command?

Regards,
    Jeff Davis





Re: A few nuances about specifying the timeline with START_REPLICATION

From
Heikki Linnakangas
Date:
On 18/06/2021 20:27, Jeff Davis wrote:
> A few questions about this comment in walsender.c, originating in
> commit abfd192b1b5b:
> 
> /*
>   * Found the requested timeline in the history. Check that
>   * requested startpoint is on that timeline in our history.
>   *
>   * This is quite loose on purpose. We only check that we didn't
>   * fork off the requested timeline before the switchpoint. We
>   * don't check that we switched *to* it before the requested
>   * starting point. This is because the client can legitimately
>   * request to start replication from the beginning of the WAL
>   * segment that contains switchpoint, but on the new timeline, so
>   * that it doesn't end up with a partial segment. If you ask for
>   * too old a starting point, you'll get an error later when we
>   * fail to find the requested WAL segment in pg_wal.
>   *
>   * XXX: we could be more strict here and only allow a startpoint
>   * that's older than the switchpoint, if it's still in the same
>   * WAL segment.
>   */
> 
> 1. I think there's a typo: it should be "fork off the requested
> timeline before the startpoint", right?

Yes, I think you're right.

> 2. It seems to imply that requesting an old start point is wrong, but I
> don't see why. As long as the WAL is there (or at least the slot
> boundaries), what's the problem? Can we either just change the comment
> to say that it's fine to start on an ancestor of the requested timeline
> (and maybe update the docs, too)?

Hmm, I'm not sure if the logic in WalSndSegmentOpen() would work if you 
did that. For example, if you had the following WAL segments, because a 
timeline switch happened somewhere in the middle of segment 13:

000000040000000000000012
000000040000000000000013
000000050000000000000013
000000050000000000000014

and you requested to start streaming from timeline 5, 0/12000000, I 
think WalSndSegmentOpen() would try to open file 
"000000050000000000000012" and not find it.

We could teach it to look into the timeline history to find the correct 
file, though. Come to think of it, we could remove the START_REPLICATION 
TIMELINE option altogether (or rather, make it optional or backwards 
compatibility). The server doesn't need it for anything, it knows the 
timeline history so the LSN is enough to uniquely identify the starting 
point.

If the client asks for a historic timeline, the replication will stop 
when it reaches the end of that timeline. In hindsight, I think it would 
make more sense to send a message to the client to say that it's 
switching to a new timeline, and continue streaming from the new timeline.

> 3. I noticed when looking at this that the terminology in the docs is a
> bit inconsistent between START_REPLICATION and
> recovery_target_timeline.
>    a. In recovery_target_timeline:
>       i. a numeric value means "stop when this timeline forks"
>       ii. "latest" means "follow forks along the newest timeline"
>       iii. "current" is an alias for the backup's numerical timeline
>    b. In the start START_REPLICATION docs:
>       i. "current" means "follow forks along the newest timeline"
>       ii. a numeric value that is equal to the current timeline is the
> same as "current"
>       iii. a numeric value that is less than the current timeline means
> "stop when this timeline forks"
> 
> On point #3, it looks like START_REPLICATION could be improved:
> 
>    * Should we change the docs to say "latest" rather than "current"?
>    * Should we change the behavior so that specifying the current
> timeline as a number still means a historic timeline (e.g. it will stop
> replicating there and emit a tuple)?
>    * Should we add some keywords like "latest" or "current" to the
> START_REPLICATION command?
Hmm, the timeline in the START_REPLICATION command is not specifying a 
recovery target timeline, so I don't think "latest" or "current" make 
much sense there. Per above, it just tells the server which timeline the 
requested starting point belongs to, so it's actually redundant.

- Heikki



Re: A few nuances about specifying the timeline with START_REPLICATION

From
Jeff Davis
Date:
On Fri, 2021-06-18 at 21:48 +0300, Heikki Linnakangas wrote:
> On 18/06/2021 20:27, Jeff Davis wrote:
> We could teach it to look into the timeline history to find the
> correct 
> file, though.

That's how recovery_target_timeline behaves, and it would match my
intuition better if START_REPLICATION behaved that way.

> If the client asks for a historic timeline, the replication will
> stop 
> when it reaches the end of that timeline. In hindsight, I think it
> would 
> make more sense to send a message to the client to say that it's 
> switching to a new timeline, and continue streaming from the new
> timeline.

Why is it important for the standby to be told explicitly in the
protocol about timeline switches? If it is important, why only for
historical timelines?

> Hmm, the timeline in the START_REPLICATION command is not specifying
> a 
> recovery target timeline, so I don't think "latest" or "current"
> make 
> much sense there. Per above, it just tells the server which timeline
> the 
> requested starting point belongs to, so it's actually redundant.

That's not very clear from the docs: "if TIMELINE option is specified,
streaming starts on timeline tli...".

Part of the confusion is that there's not a good distinction in
terminology between:
   1. a timeline ID, which is a specific segment of a timeline
   2. a timeline made up of the given timeline ID and all its
ancestors, terminating at the given ID
   3. the timeline made up of the current ID, all ancestor IDs, and all
descendent IDs that the current active primary switches to
   4. the set of all timelines that contain a given ID

It seems you are saying that replication only concerns itself with #3,
which does not require a timeline ID at all. That seems basically
correct for now, but since we already document the protocol to take a
timeline, it makes sense to me to just have the primary serve it if
possible.

If we (continue to?) allow timelines for replication, it will start to
treat the primary like an archive. That might not be quite what was
intended, but could be powerful. You could imagine a special archive
that implements the replication protocol, and have replicas directly
off the archive, or maybe doing PITR off the archive.

Regards,
    Jeff Davis





Re: A few nuances about specifying the timeline with START_REPLICATION

From
Heikki Linnakangas
Date:
On 18/06/2021 22:55, Jeff Davis wrote:
> On Fri, 2021-06-18 at 21:48 +0300, Heikki Linnakangas wrote:
>> On 18/06/2021 20:27, Jeff Davis wrote:
>> We could teach it to look into the timeline history to find the
>> correct
>> file, though.
> 
> That's how recovery_target_timeline behaves, and it would match my
> intuition better if START_REPLICATION behaved that way.
> 
>> If the client asks for a historic timeline, the replication will
>> stop
>> when it reaches the end of that timeline. In hindsight, I think it
>> would
>> make more sense to send a message to the client to say that it's
>> switching to a new timeline, and continue streaming from the new
>> timeline.
> 
> Why is it important for the standby to be told explicitly in the
> protocol about timeline switches?

So that it knows to write the WAL to the correctly named WAL segment. 
You could do it differently, looking at the 'xlp_tli' field in the WAL 
page headers, or watching out for checkpoint records that change the 
timeline. But currently the standby (and pg_receivewal) depends on the 
protocol for that.

> If it is important, why only for historical timelines?

Well, the latest timeline doesn't have any timeline switches, by 
definition. If you're connected to a standby server, IOW you're doing 
cascading replication, then the current timeline can become historic, if 
the standby follows a timeline switch. In that case, the replication is 
stopped when you reach the timeline switch, just like when you request a 
historic timeline.

>> Hmm, the timeline in the START_REPLICATION command is not specifying
>> a
>> recovery target timeline, so I don't think "latest" or "current"
>> make
>> much sense there. Per above, it just tells the server which timeline
>> the
>> requested starting point belongs to, so it's actually redundant.
> 
> That's not very clear from the docs: "if TIMELINE option is specified,
> streaming starts on timeline tli...".
> 
> Part of the confusion is that there's not a good distinction in
> terminology between:
>     1. a timeline ID, which is a specific segment of a timeline
>     2. a timeline made up of the given timeline ID and all its
> ancestors, terminating at the given ID
>     3. the timeline made up of the current ID, all ancestor IDs, and all
> descendent IDs that the current active primary switches to
>     4. the set of all timelines that contain a given ID

Agreed, that's a bit confusing.

> It seems you are saying that replication only concerns itself with #3,
> which does not require a timeline ID at all. That seems basically
> correct for now, but since we already document the protocol to take a
> timeline, it makes sense to me to just have the primary serve it if
> possible.
> 
> If we (continue to?) allow timelines for replication, it will start to
> treat the primary like an archive. That might not be quite what was
> intended, but could be powerful. You could imagine a special archive
> that implements the replication protocol, and have replicas directly
> off the archive, or maybe doing PITR off the archive.

True.

- Heikki