Thread: Re: Fix logging for invalid recovery timeline

Re: Fix logging for invalid recovery timeline

From

"Andrey M. Borodin"

Date:

21 December 2024, 07:28:57

> On 20 Dec 2024, at 20:37, David Steele <david@pgbackrest.org> wrote:
>
> "Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off
fromthat timeline at %X/%X." 

I think errdetai here is very hard to follow. I seem to understand what is going on after reading errmsg, but errdetai
makesme uncertain. 

If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);"
don't we risk to have again
ereport(ERROR,
(errmsg("requested timeline %u is not in this server's history",
tli)));
?

Best regards, Andrey Borodin.

Re: Fix logging for invalid recovery timeline

From

David Steele

Date:

24 December 2024, 19:04:20

On 12/20/24 23:28, Andrey M. Borodin wrote:
> 
>> On 20 Dec 2024, at 20:37, David Steele <david@pgbackrest.org> wrote:
>>
>> "Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off
fromthat timeline at %X/%X."

> 
> I think errdetai here is very hard to follow. I seem to understand what is going on after reading errmsg, but
errdetaimakes me uncertain.

Yeah, this one confuses users a lot. We see it mostly when a user 
accidentally promotes a standby and that standby pushes a history file 
and maybe some WAL on a new timeline, e.g. 2. The original primary 
continues to make backups on the original timeline 1. At some point a 
restore is required and Postgres by default wants to recover to the most 
recent timeline, but timeline 2 forks from timeline 1 before the latest 
backup was started so it is not accessible.

The solution is to set the target timeline to current but first the user 
needs this figure out what is going on an this error message just 
doesn't contain enough information to do that. I have some ideas on how 
to make it better but that would probably be for HEAD only.

> If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);"
> don't we risk to have again
> ereport(ERROR,
> (errmsg("requested timeline %u is not in this server's history",
> tli)));
> ?

I'm not sure what you mean. For primary backups CheckPointTLI will 
always equal ControlFile->checkPointCopy.ThisTimeLineID so that 
shouldn't be a problem. For standby backups CheckPointTLI will be <= 
ControlFile->checkPointCopy.ThisTimeLineID since CheckPointTLI 
represents the timeline at the start of the backup. If a route from that 
timeline to the current timeline can't be found then I'd certainly 
expect an error.

I'll add this patch to the January CF.

Regards,
-David

Re: Fix logging for invalid recovery timeline

From

Benoit Lobréau

Date:

19 February, 12:51:49

Hi,

I think the changes make sense. Would it be helpful to add the origin of 
the checkpoint record we are referring to ? (i.e. control file or backup 
label).

For example:

* Latest checkpoint in the control file is at %X/%X on timeline %u,
* Checkpoint location in the backup_label file is at %X/%X on timeline %u,

The current message refers to "Latest checkpoint" which automatically 
makes me think about the control file (specifically, the pg_controldata 
output).

Now that I have read the code, I understand which one we are referring 
to, but for someone who hasn't (or me in a few month) it might be useful ?

-- 
Benoit Lobréau
Consultant
http://dalibo.com

Re: Fix logging for invalid recovery timeline

From

David Steele

Date:

19 February, 20:35:18

On 2/19/25 03:51, Benoit Lobréau wrote:
> 
> I think the changes make sense. Would it be helpful to add the origin of 
> the checkpoint record we are referring to ? (i.e. control file or backup 
> label).
> 
> For example:
> 
> * Latest checkpoint in the control file is at %X/%X on timeline %u,
> * Checkpoint location in the backup_label file is at %X/%X on timeline %u,

I like this idea but I would prefer to get the patch committed as-is 
first. The reason is that I'm hoping to see this batch-patched (since it 
is a bug) and that is less likely if the message wording is change.

Your idea would be perfect going forward, though.

Regards,
-David

Re: Fix logging for invalid recovery timeline

From

Michael Paquier

Date:

20 February, 04:45:55

On Wed, Feb 19, 2025 at 05:35:18PM +0000, David Steele wrote:
> I like this idea but I would prefer to get the patch committed as-is first.
> The reason is that I'm hoping to see this batch-patched (since it is a bug)
> and that is less likely if the message wording is change.

(Had this thread flagged as a TODO for some time, sorry for not
chiming in earlier.)
Yeah, that's clearly a bug in the information provided.

> Your idea would be perfect going forward, though.

We have a few logs that already track this information, but perhaps
that's better to track this extra element in the FATAL if you have
log_min_messages at fatal where LOG would not show up?  Feel free to
propose a separate patch if you think that this can be improved.

I thought first that this was an issue coming from 70e81861fadd, but
from what I can see the confusion is much older and comes from
ee994272ca50 that has begun mixing the data from the control and the
backup label for the checkpoint record in this FATAL message.

At a27048cbcb58, the check is done based on the copy of the checkpoint
record, and the log is generated based on the checkpoint data in the
control file.  4a92a1c3d1c3 has begun retrieving the replay TLI from
the backup label.  v13 and v14 have this issue, but I'm not really
tempted to poke at the beast more than necessary as this code had a
lot of changes in the last couple of years, with few to no complaints
as far as I am aware.

Applied down to v15 where we have xlogrecovery.c, then.  Thanks for
the report!
--
Michael

Attachment

signature.asc

Re: Fix logging for invalid recovery timeline

From

David Steele

Date:

20 February, 18:40:47

On 2/19/25 19:45, Michael Paquier wrote:
> On Wed, Feb 19, 2025 at 05:35:18PM +0000, David Steele wrote:
>> I like this idea but I would prefer to get the patch committed as-is first.
>> The reason is that I'm hoping to see this batch-patched (since it is a bug)
>> and that is less likely if the message wording is change.
> 
> (Had this thread flagged as a TODO for some time, sorry for not
> chiming in earlier.)

No worries. Thank you for having a look.

>> Your idea would be perfect going forward, though.
> 
> We have a few logs that already track this information, but perhaps
> that's better to track this extra element in the FATAL if you have
> log_min_messages at fatal where LOG would not show up?  Feel free to
> propose a separate patch if you think that this can be improved.

Benoit -- this was your idea. Did you want to submit a patch yourself?

> At a27048cbcb58, the check is done based on the copy of the checkpoint
> record, and the log is generated based on the checkpoint data in the
> control file.  4a92a1c3d1c3 has begun retrieving the replay TLI from
> the backup label.  v13 and v14 have this issue, but I'm not really
> tempted to poke at the beast more than necessary as this code had a
> lot of changes in the last couple of years, with few to no complaints
> as far as I am aware.

Fair enough -- this is subtle enough that I doubt almost anyone is going 
to notice and the general content of the error message is not really 
changed by the incorrect values.

> Applied down to v15 where we have xlogrecovery.c, then.  Thanks for
> the report!

Thank you for the commit!

Regards,
-David

Re: Fix logging for invalid recovery timeline

From

Benoit Lobréau

Date:

21 February, 18:09:34

On 2/20/25 4:40 PM, David Steele wrote:
> Benoit -- this was your idea. Did you want to submit a patch yourself?

Here is an attempt at that. I kept the wording I used above. Is it fine 
to repeat the whole ereport block twice?

-- 
Benoit Lobréau
Consultant
http://dalibo.com

Attachment

0001-Adjust-logging-for-invalid-recovery-timeline.patch

Re: Fix logging for invalid recovery timeline

From

David Steele

Date:

22 February, 19:17:44

On 2/21/25 09:09, Benoit Lobréau wrote:
> On 2/20/25 4:40 PM, David Steele wrote:
>> Benoit -- this was your idea. Did you want to submit a patch yourself?
> 
> Here is an attempt at that. I kept the wording I used above. Is it fine 
> to repeat the whole ereport block twice?

I think for translation purposes this is probably how it needs to be but 
I wonder if we could do something like:

errdetail("Latest checkpoint in %s is at %X/%X <...>",
           haveBackupLabel ? "pg_control" ? "backup_label",

I'll defer to Michael on that.

In general this patch and the new messages look good to me, though.

Regards,
-David