Re: Split xlog.c - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Split xlog.c
Date
Msg-id a462d79c-cb5a-47cc-e9ac-616b5003965f@iki.fi
Whole thread Raw
In response to Re: Split xlog.c  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Split xlog.c
List pgsql-hackers
Here's a new version. It includes two new smaller commits, before the 
main refactoring:

1. Refactor setting XLP_FIRST_IS_OVERWRITE_CONTRECORD. I moved the code 
to set that flag from AdvanceXLInsertBuffer() into 
CreateOverwriteContrecordRecord(). That avoids the need for accessing 
the global variable in AdvanceXLInsertBuffer(), which is nice with this 
patch set because I moved the global variables into xlogrecord.c. For 
comparison, when we are writing a continuation record, the 
XLP_FIRST_IS_CONTRECORD flag is also set by the caller, 
CopyXLogRecordToWAL(), not AdvanceXLInsertBuffer() itself. So I think 
this is marginally more clear anyway.

2. Use correct WAL position in error message on invalid XLOG page 
header. This is the thing that Robert pointed out in the "xlog.c: 
removing ReadRecPtr and EndRecPtr" thread. I needed to make the change 
for the refactoring anyway, but since it's a minor bug fix, it seemed 
better to extract it to a separate commit, after all.

Responses to Robert's comments below:

On 20/10/2021 22:06, Robert Haas wrote:
> - Some logic to (a) sanity-check the control file's REDO pointer, (b)
> set InRecovery = true, and (c) update various bits of control file
> state in memory has been moved substantially earlier. The actual
> update of the control file on disk stays where it was before. At least
> on first reading, I don't really like this. On the one hand, I don't
> see a reason why it's necessary prerequisite for splitting xlog.c. On
> the other hand, it seems a bit dangerous.

The new contents of the control file are determined by the checkpoint 
record, presence of backup label file, and whether we're doing archive 
recovery. We have that information at hand in InitWalRecovery(), whereas 
the caller doesn't know or care whether a backup label file was present, 
for example. That's why I wanted to move that logic to InitWalRecovery().

However, I was afraid of moving the actual call to UpdateControlFile() 
there. That would be a bigger behavioral change. What if initializing 
one of the subsystems fails? Currently, the control file is left 
unchanged, but if we called UpdateControlFile() earlier, then it would 
be modified already.

> There's now ~8 calls to functions in other modules between the time
> you change things in memory and the time that you call
> UpdateControlFile(). Perhaps none of those functions can call
> anything that might in turn call UpdateControlFile() but I don't know
> why we should take the chance. Is there some advantage to having the
> in-memory state out of sync with the on-disk state across all that
> code?
The functions that get called in between don't call UpdateControlFile() 
and don't affect what gets written there. It would be pretty 
questionable if they did, even on master. But for the sake of the 
argument, let's see what would happen if they did:

master: The later call to UpdateControlFile() writes out the same values 
again. Unless the changed field was one of the following: 'state', 
'checkPoint', 'checkPointCopy', 'minRecoveryPoint', 
'minRecoveryPointTLI', 'backupStartPoint', 'backupEndRequired' or 
'time'. If it was one of those, then it may be overwritten with the 
values deduced from the starting checkpoint.

After these patches: The later call to UpdateControlFile() writes out 
the same values again, even if it was one of those fields.

Seems like a wash to me. It's hard to tell which behavior would be the 
correct one.

On 'master', InRecovery might or might not already be set when we call 
those functions. It is already set if there was a backup label file, but 
if we're doing recover for any other reason, it's set only later. That's 
pretty sloppy. We check InRecovery in various assertions, and it affects 
whether UpdateMinRecoveryPoint() updates the control file or not, among 
other things. With these patches, InRecovery is always set at that point 
(or not, if recovery is not needed). That's a bit besides the point 
here, but it highlights that the current coding isn't very robust either 
if those startup functions tried to modify the control file. I think 
these patches make it a little better, or at least not worse.

> - The code to clear InArchiveRecovery and close the WAL segment we had
> open moves earlier. I think it might be possible to fail
> Assert(InArchiveRecovery), because where you've moved this code, we
> haven't yet verified that we reached the minimum recovery point. See
> the comment which begins "It's possible that archive recovery was
> requested, but we don't know how far we need to replay the WAL before
> we reach consistency." What if we reach that point, then fail the big
> hairy if-test and don't set InArchiveRecovery = true? In that case, we
> can still do it later, in ReadRecord. But maybe that will never
> happen. Actually it's not entirely clear to me that the assertion is
> bulletproof even where it is right now, but moving it earlier makes me
> even less confident. Possibly I just don't understand this well
> enough.

Hmm, yeah, this logic is hairy. I tried to find a case where that 
assertion would fail but couldn't find one. I believe it's correct, but 
we could probably make it more clear.

In a nutshell, PerformWalRecovery() will never return, if 
(ArchiveRecoveryRequested && !InArchiveRecovery). Why? There are two 
ways that PerformWalRecovery() can return:

1. After reaching end of WAL. ReadRecord() will always always set 
InArchiveRecovery in that case, if ArchiveRecoveryRequested was set. It 
won't return NULL without doing that.

2. We reached the requested recovery target point. There's a check for 
that case in PerformWalRecovery(), it will throw an "ERROR:  requested 
recovery stop point is before consistent recovery point" if that happens 
before InArchiveRecovery is set. Because reachedConsistency isn't set 
until crash recovery is finished.

That said, independently of this patch series, perhaps that assertion 
should be changed into something like this:

      if (ArchiveRecoveryRequested)
      {
-        Assert(InArchiveRecovery);
+        /*
+         * If archive recovery was requested, we should not finish
+         * recovery before starting archive recovery.
+         *
+         * There are other checks for this in PerformWalRecovery() so
+         * this shouldn't happen, but let's be safe.
+         */
+         if (!InArchiveRecovery)
+             elog(ERROR, "archive recovery was requested, but recovery 
finished before it started");

> It's a little tempting, too, to see if you could somehow consolidate
> the two places that do if (readFile >= 0) { close(readFile); readFile
> = -1 } down to one.

Yeah, I thought about that, but couldn't find a nice way to do it.

> - getRecoveryStopReason() is now called earlier than before, and is 
> now called whether or not ArchiveRecoveryRequested. This seems to
> just move the point of initialization further from the point of use
> to no real advantage, and I also think that the function is only
> designed to do something useful for archive recovery, so calling it
> in other cases just seems confusing.

On the other hand, it's now closer to the actual end-of-recovery. The 
idea here is that it seems natural to return the reason that recovery 
ended along with all the other end-of-recovery information, in the same 
EndOfWalRecoveryInfo struct.

Kyotaro commented on the same thing and suggested keeping the call 
getRecoveryStopReason() where it was. That'd require exposing 
getRecoveryStopReason() from xlogrecovery.c. Which isn't a big deal, we 
could do it, but in general I tried to minimize the surface area between 
xlog.c and xlogrecovery.c. If getRecoveryStopReason() was a separate 
function, should standby_signal_file_found and 
recovery_signal_file_found also be separate functions? I'd prefer to 
gather all the end-of-recovery information into one struct.

> That's all I have on 0001. Is this kind of review helpful?

Yes, very helpful, thank you!

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Reduce function call costs on ELF platforms
Next
From: Heikki Linnakangas
Date:
Subject: Re: Split xlog.c