Home > mailing lists
Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

From	Amul Sul
Subject	Re: [Patch] ALTER SYSTEM READ ONLY
Date	November 23, 2021 13:53:22
Msg-id	CAAJ_b97dAYCEbStO+2kXVi44ce60JS5DvFzuK8jLcuEV1iKjvQ@mail.gmail.com Whole thread Raw
In response to	Re: [Patch] ALTER SYSTEM READ ONLY (Amul Sul <sulamul@gmail.com>)
Responses	Re: [Patch] ALTER SYSTEM READ ONLY
List	pgsql-hackers
Tree view
   On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote:
>
>   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > >
> > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > > > Attached is the rebased version of refactoring as well as the
> > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > 39a3105678a).
> > > >
> > > > I spent a lot of time today studying 0002, and specifically the
> > > > question of whether EndOfLog must be the same as
> > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > XLogCtl->replayEndTLI.
> > > >
> > > > The answer to the former question is "no" because, if we don't enter
> > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > enter redo, then I think it has to be the same unless something very
> > > > weird happens. EndOfLog gets set like this:
> > > >
> > > >     XLogBeginRead(xlogreader, LastRec);
> > > >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > >     EndOfLog = EndRecPtr;
> > > >
> > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > same before these three lines of code as it is afterward. However, if
> > > > you test with recovery_target=immediate, you can get it to be
> > > > different, because in that case we drop out of the redo loop after
> > > > calling recoveryStopsBefore() rather than after calling
> > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > is ignored when you use recovery_target_name). It seems like a really
> > > > bad thing that neither recovery_target=immediate nor
> > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > add some.
> > > >
> > >
> > > recovery/t/003_recovery_targets.pl has test for
> > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > can add that for recovery_target_lsn, recovery_target_time, and
> > > recovery_target_xid case only where it affects.
> > >
> > > > Anyway, in effect, these three lines of code have the effect of
> > > > backing up the xlogreader by one record when we stop before rather
> > > > than after a record that we're replaying. What that means is that
> > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > replayed. There might be one more record that we read but did not
> > > > replay, and that record won't impact the value we end up with in
> > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > record that we actually replayed. To put that another way, there's no
> > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > and before we change LastRec. So in the cases where
> > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > different from EndOfLog if something different happens when we re-read
> > > > the last-replayed WAL record than what happened when we read it the
> > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > happen. I am inclined to think that it might be better not to reread
> > > > the record at all, though.
> > >
> > > There are two reasons that the record is reread; first, one that you
> > > have just explained where the redo loop drops out due to
> > > recoveryStopsBefore() and another one is that InRecovery is false.
> > >
> > > In the formal case at the end, redo while-loop does read a new record
> > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > reach the place where we do reread record -- where we do read the
> > > record (i.e. LastRec) before the record that redo loop has read and
> > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > don't need any adjustment to EndRecPtr.
> > >
> > > So technically one case needs reread but that is also not needed, we
> > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > be set conditionally something like:
> > >
> > > if (InRecovery)
> > > {
> > >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > }
> > > else
> > > {
> > >     EndOfLog = EndRecPtr;
> > >     EndOfLogTLI = replayTLI;
> > > }
> > >
> > > > As far as this patch goes, I think we need
> > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > that's only sometimes initialized and then not doing anything with it
> > > > except in the cases where it was initialized.
> > > >
> > >
> > > Another reason could be EndOfLog changes further in the following case:
> > >
> > > /*
> > >  * Actually, if WAL ended in an incomplete record, skip the parts that
> > >  * made it through and start writing after the portion that persisted.
> > >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> > >  * we'll do as soon as we're open for writing new WAL.)
> > >  */
> > > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > > {
> > >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> > >     EndOfLog = missingContrecPtr;
> > > }
> > >
> > > Now only solution that I can think is to copy EndOfLog (so
> > > EndOfLogTLI) into shared memory.
> > >
> > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > > think the regression tests contain any scenarios where we run recovery
> > > > and the values end up different. However, I think that the code sets
> > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > > record originated. So imagine that we are looking for WAL that ought
> > > > to be in 000000010000000000000003 but we don't find it; instead we
> > > > find 000000020000000000000003 because our recovery target timeline is
> > > > 2, or something that has 2 in its history. We will read the WAL for
> > > > timeline 1 from this file which has timeline 2 in the file name. I
> > > > think if recovery ends in this file before the timeline switch, these
> > > > values will be different. I did not try to construct a test case for
> > > > this today due to not having enough time, so it's possible that I'm
> > > > wrong about this, but that's how it looks to me from the code.
> > > >
> > >
> > > I am not sure, I have understood this scenario due to lack of
> > > expertise in this area -- Why would the record we looking that ought
> > > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > > corruption or that file is missing?
> > >
> >
> > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> > XLogFileReadAnyTLI(), I think I could make a sense that there could be
> > a case where the record belong to TLI 1 we are looking for; we might
> > open the file with TLI 2. But, I am wondering what's wrong if we say
> > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> > in its file name -- that statement is still true, and that record
> > should be still accessible from the filename with TLI 1.  Also, if we
> > going to consider this reading record exists before the timeline
> > switch point as the EndOfLog then why should be worried about the
> > latter timeline switch which eventually everything after the EndOfLog
> > going to be useless for us. We might continue switching TLI and/or
> > writing the WAL right after EndOfLog, correct me if I am missing
> > something here.
> >
> > Further, I still think replayEndTLI has set to the correct value what
> > we looking for EndOfLogTLI when we go through the redo loop. When it
> > read the record and finds a change in the current replayTLI then it
> > updates that as:
> >
> > if (newReplayTLI != replayTLI)
> > {
> >     /* Check that it's OK to switch to this TLI */
> >     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
> >                         prevReplayTLI, replayTLI);
> >
> >     /* Following WAL records should be run with new TLI */
> >     replayTLI = newReplayTLI;
> >     switchedTLI = true;
> > }
> >
> > Then replayEndTLI gets updated. If we going to skip the reread of
> > "LastRec" that we were discussing, then I think the following code
> > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> > should be enough, AFAICU.
> >
> > /*
> >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >  * the end-of-log. It could be different from the timeline that EndOfLog
> >  * nominally belongs to, if there was a timeline switch in that segment,
> >  * and we were reading the old WAL from a segment belonging to a higher
> >  * timeline.
> >  */
> > EndOfLogTLI = xlogreader->seg.ws_tli;
> >
>
> I think I found the right case for this, above TLI fetch is needed in
> the case where we do restore from the archived WAL files. In my trial,
> the archive directory has files as below (Kindly ignore the extra
> history file, I perform a few more trials to be sure):
>
> -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
> -rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> -rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> -rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
> -rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
> -rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history
>
> The timeline is switched in 1F file but the archiver has backup older
> timeline file and renamed it. While performing PITR using these
> archived files, the .partitial file seems to be skipped from the
> restore. The file with the next timeline id is selected to read the
> records that belong to the previous timeline id as well (i.e. 4 here,
> all the records before timeline switch point). Here is the files
> inside pg_wal directory after restore, note that in the current
> experiment, I chose recovery_target_xid = <just before the timeline#5
> switch point > and then recovery_target_action = 'promote':
>
> -rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> -rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> -rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
> -rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
> -rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
>
> The last one is the new WAL file created in that cluster.
>

With this experiment, I think it is clear that the EndOfLogTLI can be
different from the replayEndTLI or lastReplayedTLI, and we don't have
any other option to get that into other processes other than exporting
into shared memory.  Similarly, we have bunch of option (e.g.
replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN etc) to get
EndOfLog value but those are not perfect and reliable options.

Therefore, in the attached patch, I have exported EndOfLog and
EndOfLogTLI into shared memory and attached only the refactoring
patches since there a bunch of other work needs to be done on the main
ASRO patches what I discussed with Robert off-list, thanks.

Regards,
Amul
Attachment

pgsql-hackers by date:
From: Andy Fan
Date: 23 November 2021, 13:49:14
Subject: Re: Sequence's value can be rollback after a crashed recovery.
From: Juan José Santamaría Flecha
Date: 23 November 2021, 13:58:33
Subject: Re: Windows build warnings
Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

Attachment

Previous

Next