Thread: New WAL record to detect the checkpoint redo location

New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
As discussed [1 ][2] currently, the checkpoint-redo LSN can not be
accurately detected while processing the WAL. Although we have a
checkpoint WAL record containing the exact redo LSN, other WAL records
may be inserted between the checkpoint-redo LSN and the actual
checkpoint record. If we want to stop processing wal exactly at the
checkpoint-redo location then we cannot do that because we would have
already processed some extra records that got added after the redo
LSN.

The patch inserts a special wal record named CHECKPOINT_REDO WAL,
which is located exactly at the checkpoint-redo location.   We can
guarantee this record to be exactly at the checkpoint-redo point
because we already hold the exclusive WAL insertion lock while
identifying the checkpoint redo point and can insert this special
record exactly at the same time so that there are no concurrent WAL
insertions.

[1] https://www.postgresql.org/message-id/CA%2BTgmoYOYZfMCyOXFyC-P%2B-mdrZqm5pP2N7S-r0z3_402h9rsA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Andres Freund
Date:
Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote:
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index b2430f617c..a025fb91e2 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata,
>      XLogRecPtr    StartPos;
>      XLogRecPtr    EndPos;
>      bool        prevDoPageWrites = doPageWrites;
> +    bool        callerHoldingExlock = holdingAllLocks;
>      TimeLineID    insertTLI;
>  
>      /* we assume that all of the record header is in the first chunk */
> @@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata,
>       *----------
>       */
>      START_CRIT_SECTION();
> -    if (isLogSwitch)
> -        WALInsertLockAcquireExclusive();
> -    else
> -        WALInsertLockAcquire();
> +
> +    /*
> +     * Acquire wal insertion lock, nothing to do if the caller is already
> +     * holding the exclusive lock.
> +     */
> +    if (!callerHoldingExlock)
> +    {
> +        if (isLogSwitch)
> +            WALInsertLockAcquireExclusive();
> +        else
> +            WALInsertLockAcquire();
> +    }
>  
>      /*
>       * Check to see if my copy of RedoRecPtr is out of date. If so, may have

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.


> @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

>       */
>      RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
>  
> +    /*
> +     * Insert a special purpose CHECKPOINT_REDO record as the first record at
> +     * checkpoint redo lsn.  Although we have the checkpoint record that
> +     * contains the exact redo lsn, there might have been some other records
> +     * those got inserted between the redo lsn and the actual checkpoint
> +     * record.  So when processing the wal, we cannot rely on the checkpoint
> +     * record if we want to stop at the checkpoint-redo LSN.
> +     *
> +     * This special record, however, is not required when we doing a shutdown
> +     * checkpoint, as there will be no concurrent wal insertions during that
> +     * time.  So, the shutdown checkpoint LSN will be the same as
> +     * checkpoint-redo LSN.
> +     *
> +     * This record is guaranteed to be the first record at checkpoint redo lsn
> +     * because we are inserting this while holding the exclusive wal insertion
> +     * lock.
> +     */
> +    if (!shutdown)
> +    {
> +        int            dummy = 0;
> +
> +        XLogBeginInsert();
> +        XLogRegisterData((char *) &dummy, sizeof(dummy));
> +        recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
> +    }

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

Greetings,

Andres Freund



Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Fri, Jul 14, 2023 at 8:46 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> As I think I mentioned before, I like this idea. However, I don't like the
> implementation too much.

Thanks for looking into it.


> This might work right now, but doesn't really seem maintainable. Nor do I like
> adding branches into this code a whole lot.

Okay, Now I have moved the XlogInsert for the special record outside
the WalInsertLock so we don't need this special handling here.

> > @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)
>
> I think the commentary above this function would need a fair bit of
> revising...
>
> >        */
> >       RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
> >
> > +     /*
> > +      * Insert a special purpose CHECKPOINT_REDO record as the first record at
> > +      * checkpoint redo lsn.  Although we have the checkpoint record that
> > +      * contains the exact redo lsn, there might have been some other records
> > +      * those got inserted between the redo lsn and the actual checkpoint
> > +      * record.  So when processing the wal, we cannot rely on the checkpoint
> > +      * record if we want to stop at the checkpoint-redo LSN.
> > +      *
> > +      * This special record, however, is not required when we doing a shutdown
> > +      * checkpoint, as there will be no concurrent wal insertions during that
> > +      * time.  So, the shutdown checkpoint LSN will be the same as
> > +      * checkpoint-redo LSN.
> > +      *
> > +      * This record is guaranteed to be the first record at checkpoint redo lsn
> > +      * because we are inserting this while holding the exclusive wal insertion
> > +      * lock.
> > +      */
> > +     if (!shutdown)
> > +     {
> > +             int                     dummy = 0;
> > +
> > +             XLogBeginInsert();
> > +             XLogRegisterData((char *) &dummy, sizeof(dummy));
> > +             recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
> > +     }
>
> It seems to me that we should be able to do better than this.
>
> I suspect we might be able to get rid of the need for exclusive inserts
> here. If we rid of that, we could determine the redoa location based on the
> LSN determined by the XLogInsert().

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location.  So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record.  I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:
> Yeah, good idea, actually we can do this insert outside of the
> exclusive insert lock and set the LSN of this insert as the
> checkpoint. redo location.  So now we do not need to compute the
> checkpoint. redo based on the current insertion point we can directly
> use the LSN of this record.  I have modified this and I have also
> moved some other code that is not required to be inside the WAL
> insertion lock.

Looking at this patch, I am bit surprised to see that the redo point
maps with the end location of the CHECKPOINT_REDO record as it is the
LSN returned by XLogInsert(), not its start LSN.  For example after a
checkpoint:
=# CREATE EXTENSION pg_walinspect;
CREATE EXTENSION;
=# SELECT redo_lsn, checkpoint_lsn from pg_control_checkpoint();
 redo_lsn  | checkpoint_lsn
-----------+----------------
 0/19129D0 | 0/1912A08
(1 row)
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
     from pg_get_wal_record_info('0/19129D0');
 start_lsn | prev_lsn  |  end_lsn  |  record_type
-----------+-----------+-----------+---------------
 0/19129D0 | 0/19129B0 | 0/1912A08 | RUNNING_XACTS
(1 row)

In this case it matches with the previous record:
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
     from pg_get_wal_record_info('0/19129B0');
 start_lsn | prev_lsn  |  end_lsn  |   record_type
-----------+-----------+-----------+-----------------
 0/19129B0 | 0/1912968 | 0/19129D0 | CHECKPOINT_REDO
(1 row)

This could be used to cross-check that the first record replayed is of
the correct type.  The commit message of this patch tells that "the
checkpoint-redo location is set at LSN of this record", which implies
the start LSN of the record tracked as the redo LSN, not the end of
it?  What's the intention here?
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Thu, Aug 17, 2023 at 10:52 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:
> > Yeah, good idea, actually we can do this insert outside of the
> > exclusive insert lock and set the LSN of this insert as the
> > checkpoint. redo location.  So now we do not need to compute the
> > checkpoint. redo based on the current insertion point we can directly
> > use the LSN of this record.  I have modified this and I have also
> > moved some other code that is not required to be inside the WAL
> > insertion lock.
>
> Looking at this patch, I am bit surprised to see that the redo point
> maps with the end location of the CHECKPOINT_REDO record as it is the
> LSN returned by XLogInsert(), not its start LSN.

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record.  And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work.  I
will have to think of something else.  Thanks for pointing it out.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:
> Yeah right, actually I was confused, I assumed it will return the
> start LSN of the record.  And I do not see any easy way to identify
> the Start LSN of this record so maybe this solution will not work.  I
> will have to think of something else.  Thanks for pointing it out.

About that.  One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:
> > Yeah right, actually I was confused, I assumed it will return the
> > start LSN of the record.  And I do not see any easy way to identify
> > the Start LSN of this record so maybe this solution will not work.  I
> > will have to think of something else.  Thanks for pointing it out.
>
> About that.  One thing to consider may be ReserveXLogInsertLocation()
> while holding the WAL insert lock, but you can just rely on
> ProcLastRecPtr for the job after inserting the REDO record, no?

Yeah right, we can use ProcLastRecPtr.  I will send the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Fri, Aug 18, 2023 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:
> > > Yeah right, actually I was confused, I assumed it will return the
> > > start LSN of the record.  And I do not see any easy way to identify
> > > the Start LSN of this record so maybe this solution will not work.  I
> > > will have to think of something else.  Thanks for pointing it out.
> >
> > About that.  One thing to consider may be ReserveXLogInsertLocation()
> > while holding the WAL insert lock, but you can just rely on
> > ProcLastRecPtr for the job after inserting the REDO record, no?
>
> Yeah right, we can use ProcLastRecPtr.  I will send the updated patch.

Here is the updated version of the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:
> Here is the updated version of the patch.

The concept of the patch looks sound to me.  I have a few comments.

+     * This special record, however, is not required when we doing a shutdown
+     * checkpoint, as there will be no concurrent wal insertions during that
+     * time.  So, the shutdown checkpoint LSN will be the same as
+     * checkpoint-redo LSN.

This is missing "are", as in "when we are doing a shutdown
checkpoint".

-   freespace = INSERT_FREESPACE(curInsert);
-   if (freespace == 0)

The variable "freespace" can be moved within the if block introduced
by this patch when calculating the REDO location for the shutdown
case.  And you can do the same with curInsert?

-     * Compute new REDO record ptr = location of next XLOG record.
-     *
-     * NB: this is NOT necessarily where the checkpoint record itself will be,
-     * since other backends may insert more XLOG records while we're off doing
-     * the buffer flush work.  Those XLOG records are logically after the
-     * checkpoint, even though physically before it.  Got that?
+     * In case of shutdown checkpoint, compute new REDO record
+     * ptr = location of next XLOG record.

It seems to me that keeping around this comment is important,
particularly for the case where we have a shutdown checkpoint and we
expect nothing to run, no?

How about adding a test in pg_walinspect?  There is an online
checkpoint running there, so you could just add something like that
to check that the REDO record is at the expected LSN stored in the
control file:
+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Mon, Aug 28, 2023 at 5:14 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:
> > Here is the updated version of the patch.
>
> The concept of the patch looks sound to me.  I have a few comments.

Thanks for the review

> +        * This special record, however, is not required when we doing a shutdown
> +        * checkpoint, as there will be no concurrent wal insertions during that
> +        * time.  So, the shutdown checkpoint LSN will be the same as
> +        * checkpoint-redo LSN.
>
> This is missing "are", as in "when we are doing a shutdown
> checkpoint".

Fixed

> -   freespace = INSERT_FREESPACE(curInsert);
> -   if (freespace == 0)
>
> The variable "freespace" can be moved within the if block introduced
> by this patch when calculating the REDO location for the shutdown
> case.  And you can do the same with curInsert?

Done, I have also moved code related to computing curInsert in the
same if (shutdown) block.

> -        * Compute new REDO record ptr = location of next XLOG record.
> -        *
> -        * NB: this is NOT necessarily where the checkpoint record itself will be,
> -        * since other backends may insert more XLOG records while we're off doing
> -        * the buffer flush work.  Those XLOG records are logically after the
> -        * checkpoint, even though physically before it.  Got that?
> +        * In case of shutdown checkpoint, compute new REDO record
> +        * ptr = location of next XLOG record.
>
> It seems to me that keeping around this comment is important,
> particularly for the case where we have a shutdown checkpoint and we
> expect nothing to run, no?

I removed this mainly because now in other comments[1] where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place.  So I feel keeping old comments is not
required.  And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

[1]
+ /*
+ * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+ * as checkpoint.redo.  Although we have the checkpoint record that also
+ * contains the exact redo lsn, there might have been some other records
+ * those got inserted between the redo lsn and the actual checkpoint
+ * record.  So when processing the wal, we cannot rely on the checkpoint
+ * record if we want to stop at the checkpoint-redo LSN.
+ *
+ * This special record, however, is not required when we are doing a
+ * shutdown checkpoint, as there will be no concurrent wal insertions
+ * during that time.  So, the shutdown checkpoint LSN will be the same as
+ * checkpoint-redo LSN.
+ */

>
> How about adding a test in pg_walinspect?  There is an online
> checkpoint running there, so you could just add something like that
> to check that the REDO record is at the expected LSN stored in the
> control file:
> +-- Check presence of REDO record.
> +SELECT redo_lsn FROM pg_control_checkpoint() \gset
> +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
> +  FROM pg_get_wal_record_info(:'redo_lsn');
> --

Done, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:
> I removed this mainly because now in other comments[1] where we are
> introducing this new CHECKPOINT_REDO record we are explaining the
> problem
> that the redo location and the actual checkpoint records are not at
> the same place and that is because of the concurrent xlog insertion.
> I think we are explaining in more
> detail by also stating that in case of a shutdown checkpoint, there
> would not be any concurrent insertion so the shutdown checkpoint redo
> will be at the same place.  So I feel keeping old comments is not
> required.
> And we are explaining it when we are setting this for
> non-shutdown checkpoint because for shutdown checkpoint this statement
> is anyway not correct because for the shutdown checkpoint the
> checkpoint record will be at the same location and there will be no
> concurrent wal insertion, what do you think?

+    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+    * as checkpoint.redo.

I would add a "for a non-shutdown checkpoint" at the end of this
sentence.

+    * record.  So when processing the wal, we cannot rely on the checkpoint
+    * record if we want to stop at the checkpoint-redo LSN.

The term "checkpoint-redo" is also a bit confusing, I guess, because
you just mean to refer to the "redo" LSN here?  Maybe rework the last
sentence as:
"So, when processing WAL, we cannot rely on the checkpoint record if
we want to stop at the same position as the redo LSN".

+    * This special record, however, is not required when we are doing a
+    * shutdown checkpoint, as there will be no concurrent wal insertions
+    * during that time.  So, the shutdown checkpoint LSN will be the same as
+    * checkpoint-redo LSN.

Perhaps the last sentence could be merged with the first one, if we
are tweaking things, say:
"This special record is not required when doing a shutdown checkpoint;
the redo LSN is the same LSN as the checkpoint record as there cannot
be any WAL activity in a shutdown sequence."
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Wed, Aug 30, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:
> > I removed this mainly because now in other comments[1] where we are
> > introducing this new CHECKPOINT_REDO record we are explaining the
> > problem
> > that the redo location and the actual checkpoint records are not at
> > the same place and that is because of the concurrent xlog insertion.
> > I think we are explaining in more
> > detail by also stating that in case of a shutdown checkpoint, there
> > would not be any concurrent insertion so the shutdown checkpoint redo
> > will be at the same place.  So I feel keeping old comments is not
> > required.
> > And we are explaining it when we are setting this for
> > non-shutdown checkpoint because for shutdown checkpoint this statement
> > is anyway not correct because for the shutdown checkpoint the
> > checkpoint record will be at the same location and there will be no
> > concurrent wal insertion, what do you think?
>
> +    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
> +    * as checkpoint.redo.
>
> I would add a "for a non-shutdown checkpoint" at the end of this
> sentence.
>
> +    * record.  So when processing the wal, we cannot rely on the checkpoint
> +    * record if we want to stop at the checkpoint-redo LSN.
>
> The term "checkpoint-redo" is also a bit confusing, I guess, because
> you just mean to refer to the "redo" LSN here?  Maybe rework the last
> sentence as:
> "So, when processing WAL, we cannot rely on the checkpoint record if
> we want to stop at the same position as the redo LSN".
>
> +    * This special record, however, is not required when we are doing a
> +    * shutdown checkpoint, as there will be no concurrent wal insertions
> +    * during that time.  So, the shutdown checkpoint LSN will be the same as
> +    * checkpoint-redo LSN.
>
> Perhaps the last sentence could be merged with the first one, if we
> are tweaking things, say:
> "This special record is not required when doing a shutdown checkpoint;
> the redo LSN is the same LSN as the checkpoint record as there cannot
> be any WAL activity in a shutdown sequence."

Your suggestions LGTM so modified accordingly

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:
> Your suggestions LGTM so modified accordingly

I have been putting my HEAD on this patch for a few hours, reviewing
the surroundings, and somewhat missed that this computation is done
while we do not hold the WAL insert locks:
+       checkPoint.redo = ProcLastRecPtr;

Then a few lines down the shared Insert.RedoRecPtr is updated while
holding an exclusive lock.
    RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

If we have a bunch of records inserted between the moment when the
REDO record is inserted and the moment when the checkpointer takes the
exclusive WAL lock, aren't we potentially missing a lot of FPW's that
should exist since the redo LSN?
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Thu, Aug 31, 2023 at 9:36 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:
> > Your suggestions LGTM so modified accordingly
>
> I have been putting my HEAD on this patch for a few hours, reviewing
> the surroundings, and somewhat missed that this computation is done
> while we do not hold the WAL insert locks:
> +       checkPoint.redo = ProcLastRecPtr;
>
> Then a few lines down the shared Insert.RedoRecPtr is updated while
> holding an exclusive lock.
>     RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
>
> If we have a bunch of records inserted between the moment when the
> REDO record is inserted and the moment when the checkpointer takes the
> exclusive WAL lock, aren't we potentially missing a lot of FPW's that
> should exist since the redo LSN?

Yeah, good catch.  With this, it seems like we can not move this new
WAL Insert out of the Exclusive WAL insertion lock right? because if
we want to set the LSN of this record as the checkpoint. redo then
there should not be any concurrent insertion until we expose the
XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all
the record which has been inserted after the checkpoint. redo before
we acquired the exclusive WAL insertion lock.

So maybe I need to restart from the first version of the patch but
instead of moving the insertion of the new record out of the exclusive
lock need to do some better refactoring so that XLogInsertRecord()
doesn't look ugly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Thu, Aug 31, 2023 at 09:55:45AM +0530, Dilip Kumar wrote:
> Yeah, good catch.  With this, it seems like we can not move this new
> WAL Insert out of the Exclusive WAL insertion lock right?  Because if
> we want to set the LSN of this record as the checkpoint.redo then
> there should not be any concurrent insertion until we expose the
> XLogCtl->Insert.RedoRecPtr.  Otherwise, we will miss the FPW for all
> the record which has been inserted after the checkpoint.redo before
> we acquired the exclusive WAL insertion lock.

Yes.

> So maybe I need to restart from the first version of the patch but
> instead of moving the insertion of the new record out of the exclusive
> lock need to do some better refactoring so that XLogInsertRecord()
> doesn't look ugly.

Yes, I am not sure which interface would be less ugli-ish, but that's
enough material for a refactoring patch of the WAL insert routines on
top of the main patch that introduces the REDO record.
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Fri, Jul 14, 2023 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:
> I suspect we might be able to get rid of the need for exclusive inserts
> here. If we rid of that, we could determine the redoa location based on the
> LSN determined by the XLogInsert().

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

As Michael Paquier correctly noted downthread, we need to make sure
that a backend inserting a WAL record knows whether it needs to
contain an FPI. The comments in the do...while loop in XLogInsert are
pretty helpful here: doPageWrites can't change once XLogInsertRecord
acquires a WAL insertion lock. For that to be true, the redo pointer
can only move when holding all WAL insertion locks. That means that if
we add an XLOG_CHECKPOINT_REDO to mark the location of the redo
pointer, we've got to either (a) insert the record *and* update our
notion of the last redo pointer while holding all the WAL insertion
locks or (b) change the concurrency model in some way.

Let's explore (b) first. Perhaps my imagination is too limited here,
but I'm having trouble seeing a good way of doing this. One idea that
occurred to me was to make either the insertion of the
XLOG_CHECKPOINT_REDO record fail softly if somebody inserts a record
after it that omits FPIs, but that doesn't really work because then
we're left with a hole in the WAL. It's too late to move the later
record earlier. We could convert the intended XLOG_CHECKPOINT_REDO
record into a dummy record but that seems complex and wasteful.
Similarly, you could try to make the insertion of the later record
fail, but that basically has the same problem: there could be an even
later record being inserted after that which it's already too late to
reposition. Basically, it feels like once we get to the point where we
have a range of LSNs and we're copying data into wal_buffers, it's
awfully late to be trying to back out. Other people can already be
depending on us to put the amount of WAL that we promised to insert at
the place where we promised to put it.

The only other approach to (b) that I can think of is to force FPIs on
for all backends from just before to just after we insert the
XLOG_CHECKPOINT_REDO record. However, since we currently require
taking all the WAL insertion locks to start requiring full page
writes, this doesn't seem like it gains much. In theory perhaps we
could have an approach where we flip full page writes to sorta-on,
then wait until we've seen each WAL insertion lock unheld at least
once, and then at that point we know all new WAL insertions will see
them and can deem them fully on. However, when I start to think along
these lines, I feel like maybe I'm losing the plot. Checkpoints are
rare enough that the overhead of taking all the WAL insertion locks at
the same time isn't really a big problem, or at least I don't think it
is. I think the concern here is more about avoiding useless branches
in hot paths that potentially cost something for every single record
whether it has anything to do with this mechanism or not.

OK, so let's suppose we abandon the idea of changing the concurrency
model in any fundamental way and just try to figure out how to both
insert the record and update our notion of the last redo pointer while
holding all the WAL insertion locks i.e. (a) from the two options
above. Dilip's patch approaches this problem by pulling acquisition of
the WAL insertion locks up to the place where we're already setting
the redo pointer. I wonder if we could also consider the other
possible approach of pushing the update to Insert->RedoRecPtr down
into XLogInsertRecord(), which already has a special case for
acquiring all locks when the record being inserted is an XLOG_SWITCH
record. That would have the advantage of holding all of the insertion
locks for a smaller period of time than what Dilip's patch does -- in
particular, it wouldn't need to hold the lock across the
XLOG_CHECKPOINT_REDO's XLogRecordAssemble -- or across the rather
lengthy tail of XLogInsertRecord. But the obvious objection is that it
would put more branches into XLogInsertRecord which nobody wants.

But ... what if it didn't? Suppose we pulled the XLOG_SWITCH case out
of XLogInsertRecord and made a separate function for that case. It
looks to me like that would avoid 5 branches in that function in the
normal case where we're not inserting XLOG_SWITCH. We would want to
move some logic, particularly the WAL_DEBUG stuff and maybe other
things, into reusable subroutines. Then, we could decide to treat
XLOG_CHECKPOINT_REDO either in the normal path -- adding a couple of
those branches back again -- or in the XLOG_SWITCH function and either
way I think the normal path would have fewer branches than it does
today. One idea that I had was to create a new rmgr for "weird
records," initially XLOG_SWITCH and XLOG_CHECKPOINT_REDO. Then the
test as to whether to use the "normal" version of XLogInsertRecord or
the "weird" version could just be based on the rmid, and the "normal"
function wouldn't need to concern itself with anything specific to the
"weird" cases.

A variant on this idea would be to just accept a few extra branches
and hope it's not really that big of a deal. For instance, instead of
this:

    bool        isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
                               info == XLOG_SWITCH);

We could have this:

bool isAllLocks = (rechdr->xl_rmid == RM_BIZARRE_ID);
bool isLogSwitch = (isAllLocks && info == XLOG_SWITCH);

...and then conditionalize on either isAllLocks or isLogSwitch as
apppropriate. You'd still need an extra branch someplace to update
Insert->RedoRecPtr when isAllLocks && info == XLOG_CHECKPOINT_REDO,
but maybe that's not so bad?

> Alternatively, I think we should split XLogInsertRecord() so that the part
> with the insertion locks held is a separate function, that we could use here.

The difficulty that I see with this is that the function does a lot
more stuff after calling WALInsertLockRelease(). So just pushing the
part that's between acquiring and releasing WAL insertion locks into a
separate function wouldn't actually avoid a lot of code duplication,
if the goal was to do everything else that XLogInsertRecord() does
except for the lock manipulation. To get there, I think we'd need to
move all of the stuff after the lock release into one or more static
functions, too. Which is possibly an OK approach. I haven't checked
how much additional parameter passing we'd end up doing if we went
that way.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I've been brainstorming about this today, trying to figure out some
> ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

I've tested this to the extent of running the regression tests, and I
also did one (1) manual test where it looked like the right thing was
happening, but that's it, so this might be buggy or perform like
garbage for all I know. But my hope is that it isn't buggy and
performs adequately. If there's any chance of getting some comments on
the basic design choices before I spend time testing and polishing it,
that would be very helpful.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Amit Kapila
Date:
On Thu, Sep 21, 2023 at 7:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I've been brainstorming about this today, trying to figure out some
> > ideas to make it work.
>
> Here are some patches.
>
> 0001 refactors XLogInsertRecord to unify a couple of separate tests of
> isLogSwitch, hopefully making it cleaner and cheaper to add more
> special cases.
>
> 0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
> it for anything.
>
> 0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
> record for any non-shutdown checkpoint, and modifies
> XLogInsertRecord() to treat that as a new special case, wherein after
> inserting the record the redo pointer is reset while still holding the
> WAL insertion locks.
>

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

One minor comment:
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

Isn't the check needs to compare the record type with info?

Your v6-0001* patch looks like an improvement to me even without the
other two patches.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]. Basically in
[1], to allow slots migration from lower to higher version, we need to
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

[1] -
https://www.postgresql.org/message-id/TYAPR01MB586615579356A84A8CF29A00F5F9A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> After the 0003 patch, do we need acquire exclusive lock via
> WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
> comment "We must block concurrent insertions while examining insert
> state to determine the checkpoint REDO pointer." seems to indicate
> that it is not required. If it is required then we may want to change
> the comments and also acquiring the locks twice will have more cost
> than acquiring it once and write the new WAL record under that lock.

I think the comment needs updating. I don't think we can do curInsert
= XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
Same for Insert->fullPageWrites.

I agree that it looks a little wasteful to release the lock and then
reacquire it, but I suppose checkpoints don't happen often enough for
it to matter. You're not going to notice an extra set of insertion
lock acquisitions once every 5 minutes, or every half hour, or even
every 1 minute if your checkpoints are super-frequent.

Also notice that the current code is also quite inefficient in this
way. GetLastImportantRecPtr() acquires and releases each lock one at a
time, and then we immediately turn around and do
WALInsertLockAcquireExclusive(). If the overhead that you're concerned
about here were big enough to matter, we could reclaim what we're
losing by having a version of GetLastImportantRecPtr() that expects to
be called with all locks already held. But when I asked Andres, he
thought that it didn't matter, and I bet he's right.

> One minor comment:
> + else if (XLOG_CHECKPOINT_REDO)
> + class = WALINSERT_SPECIAL_CHECKPOINT;
> + }
>
> Isn't the check needs to compare the record type with info?

Yeah wow. That's a big mistake.

> Your v6-0001* patch looks like an improvement to me even without the
> other two patches.

Good to know, thanks.

> BTW, I would like to mention that there is a slight interaction of
> this work with the patch to upgrade/migrate slots [1]. Basically in
> [1], to allow slots migration from lower to higher version, we need to
> ensure that all the WAL has been consumed by the slots before clean
> shutdown. However, during upgrade we can generate few records like
> checkpoint which we will ignore for the slot consistency checking as
> such records doesn't matter for data consistency after upgrade. We
> probably need to add this record to that list. I'll keep an eye on
> both the patches so that we don't miss that interaction but mentioned
> it here to make others also aware of the same.

If your approach requires a code change every time someone adds a new
WAL record that doesn't modify table data, you might want to rethink
the approach a bit.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Dilip Kumar
Date:
On Thu, Sep 21, 2023 at 1:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I've been brainstorming about this today, trying to figure out some
> > ideas to make it work.
>
> Here are some patches.
>
> 0001 refactors XLogInsertRecord to unify a couple of separate tests of
> isLogSwitch, hopefully making it cleaner and cheaper to add more
> special cases.

Yeah, this looks improvement as it removes one isLogSwitch from the code.

> 0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
> it for anything.
>
> 0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
> record for any non-shutdown checkpoint, and modifies
> XLogInsertRecord() to treat that as a new special case, wherein after
> inserting the record the redo pointer is reset while still holding the
> WAL insertion locks.

Yeah from a design POV, it looks fine to me because the main goal was
to insert the XLOG_CHECKPOINT_REDO record and set the "RedoRecPtr"
under the same exclusive wal insertion lock and this patch is doing
this.  As you already mentioned it is an improvement over my first
patch because a) it holds the exclusive WAL insersion lock for a very
short duration b) not increasing the number of branches in
XLogInsertRecord().

Some review
1.
I feel we can reduce one more branch to the normal path by increasing
one branch in this special case i.e.

Your code is
if (class == WALINSERT_SPECIAL_SWITCH)
{
/*execute isSwitch case */
}
else if (class == WALINSERT_SPECIAL_CHECKPOINT)
{
/*execute checkpoint redo case */
}
else
{
/* common case*/
}

My suggestion
if (xl_rmid == RM_XLOG_ID)
{
    if (class == WALINSERT_SPECIAL_SWITCH)
    {
      /*execute isSwitch case */
    }
    else if (class == WALINSERT_SPECIAL_CHECKPOINT)
   {
     /*execute checkpoint redo case */
   }
}
else
{
 /* common case*/
}

2.
In fact, I feel that we can remove this branch as well right? I mean
why do we need to have this separate thing called "class"?  we can
very much use "info" for that purpose. right?

+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

So if we remove this then we do not have this class and the above case
would look like

if (xl_rmid == RM_XLOG_ID)
{
     if (info == XLOG_SWITCH)
    {
      /*execute isSwitch case */
    }
    else if (info == XLOG_CHECKPOINT_REDO)
   {
     /*execute checkpoint redo case */
   }
}
else
{
 /* common case*/
}

3.
+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }
+

the above check-in else if is wrong I mean
else if (XLOG_CHECKPOINT_REDO) should be else if (info == XLOG_CHECKPOINT_REDO)

That's all I have for now.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Amit Kapila
Date:
On Thu, Sep 21, 2023 at 9:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > After the 0003 patch, do we need acquire exclusive lock via
> > WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
> > comment "We must block concurrent insertions while examining insert
> > state to determine the checkpoint REDO pointer." seems to indicate
> > that it is not required. If it is required then we may want to change
> > the comments and also acquiring the locks twice will have more cost
> > than acquiring it once and write the new WAL record under that lock.
>
> I think the comment needs updating. I don't think we can do curInsert
> = XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
> Same for Insert->fullPageWrites.
>

If we can't do those without taking all the locks then it is fine but
just wanted to give it a try to see if there is a way to avoid in case
of online (non-shutdown) checkpoints. For example, curInsert is used
only for the shutdown path, so we don't need to acquire all locks for
it in the cases except for the shutdown case. Here, we are reading
Insert->fullPageWrites which requires an insertion lock but not all
the locks (as per comments in structure XLogCtlInsert). Now, I haven't
done detailed analysis for
XLogCtl->InsertTimeLineID/XLogCtl->PrevTimeLineID but some places
reading InsertTimeLineID have a comment like "Given that we're not in
recovery, InsertTimeLineID is set and can't change, so we can read it
without a lock." which suggests that some analysis is required whether
reading those requires all locks in this code path. OTOH, it won't
matter to acquire all locks in this code path for the reasons
mentioned by you and it may help in keeping the code simple. So, it is
up to you to take the final call on this matter. I am fine with your
decision.

>
> > BTW, I would like to mention that there is a slight interaction of
> > this work with the patch to upgrade/migrate slots [1]. Basically in
> > [1], to allow slots migration from lower to higher version, we need to
> > ensure that all the WAL has been consumed by the slots before clean
> > shutdown. However, during upgrade we can generate few records like
> > checkpoint which we will ignore for the slot consistency checking as
> > such records doesn't matter for data consistency after upgrade. We
> > probably need to add this record to that list. I'll keep an eye on
> > both the patches so that we don't miss that interaction but mentioned
> > it here to make others also aware of the same.
>
> If your approach requires a code change every time someone adds a new
> WAL record that doesn't modify table data, you might want to rethink
> the approach a bit.
>

I understand your hesitation and we have discussed several approaches
that do not rely on the WAL record type to determine if the slots have
caught up but the other approaches seem to have different other
downsides. I know it may not be a good idea to discuss those here but
as there was a slight interaction with this work, so I thought to
bring it up. To be precise, we need to ensure that we ignore WAL
records that got generated during pg_upgrade operation (say during
pg_upgrade --check).

The approach we initially followed was to check if the slot's
confirmed_flush_lsn is equal to the latest checkpoint in
pg_controldata (which is the shutdown checkpoint after stopping the
server). This approach doesn't work for the use case where the user
runs pg_upgrade --check before actually performing the upgrade [1].
This is because during the upgrade check, the server will be
stopped/started and update the position of the latest checkpoint,
causing the check to fail in the actual upgrade and leading pg_upgrade
to believe that the slot has not been caught up.

To address the issues in the above approach, we also discussed several
alternative approaches[2][3]: a) Adding a new field in pg_controldata
to record the last checkpoint that happens in non-upgrade mode, so
that we can compare the slot's confirmed_flush_lsn with this value.
However, we were not sure if this was a good enough reason to add a
new field in controldata field and sprinkle IsBinaryUpgrade check in
checkpointer code path. b) Advancing each slot's confirmed_flush_lsn
to the latest position if the first upgrade check passes. This way,
when performing the actual upgrade, the confirmed_flush_lsn will also
pass. However, internally advancing the LSN seems unconventional. c)
Introducing a new pg_upgrade option to skip the check for slot
catch-up so that if it is already done at the time of pg_upgrade
--check, we can avoid rechecking during actual upgrade. Although this
might work, the user would need to specify this manually, which is not
ideal. d) Document this and suggest users consume the WALs, but this
doesn't look acceptable to users.

All the above approaches have their downsides, prompting us to
consider the WAL scan approach which is to scan the end of the WAL for
records that should have been streamed out. This approach was first
proposed by Andres[4] and was chosen[5] after considering all other
approaches. If we don't like relying on WAL record types then I think
the alternative (a) to add a new field in ControlDataFile is worth
considering.

[1] https://www.postgresql.org/message-id/CAA4eK1LzeZLoTLaAuadmuiggc5mq39oLY6fK95oFKiPBPBf%2BeQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/OS0PR01MB571640E1B58741979A5E586594F7A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[3]
https://www.postgresql.org/message-id/TYAPR01MB5866EF7398CB13FFDBF230E7F5F0A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] https://www.postgresql.org/message-id/20230725170319.h423jbthfohwgnf7%40awork3.anarazel.de
[5] https://www.postgresql.org/message-id/CAA4eK1KqqWayKtRhvyRgkhEHvAUemW_dEqgFn7UOG3D4B6f0ew%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Wed, Sep 20, 2023 at 4:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here are some patches.

Here are some updated patches. Following some off-list conversation
with Andres, I restructured 0003 to put the common case first and use
likely(), and I fixed the brown-paper-bag noted by Amit. I then turned
my attention to performance testing. I was happy to find out when I
did a bunch of testing on Friday that my branch with these patches
applied outperformed master. I was then less happy to find that when I
repeated the same tests today, master outperformed the branch. So now
I don't know what is going on, but it doesn't seem like my test
results are stable enough to draw meaningful conclusions.

I was trying to think of a test case where XLogInsertRecord would be
exercised as heavily as possible, so I really wanted to generate a lot
of WAL while doing as little real work as possible. The best idea that
I had was to run pg_create_restore_point() in a loop. Initially,
performance was dominated by the log messages which that function
emits, so I set log_min_messages='FATAL' to suppress those. To try to
further reduce other bottlenecks, I also set max_wal_size='50GB',
fsync='off', synchronous_commit='off', and wal_buffers='256MB'. Then I
ran this query:

select count(*) from (SELECT pg_create_restore_point('banana') from
generate_series(1,100000000) g) x;

I can't help laughing at the comedy of creating 100 million
banana-named restore points with no fsyncs or logging, but here we
are. All of my test runs with master, and with the patches, and with
just the first patch run in between 34 and 39 seconds. As I say, I
can't really separate out which versions are faster and slower with
any confidence. Before I fixed the brown-paper bag that Amit pointed
out, it was using WALInsertLockAcquireExclusive() instead of
WALInsertLockAcquire() for *all* WAL records, and that created an
extremely large and obvious increase in the runtime of the tests. So
I'm relatively confident that this test case is sensitive to changes
in execution time of XLogInsertRecord(), but apparently the changes
caused by rearranging the branches are a bit too marginal for them to
show up here.

One possible conclusion is that the differences here aren't actually
big enough to get stressed about, but I don't want to jump to that
conclusion without investigating the competing hypothesis that this
isn't the right way to test this, and that some better test would show
clearer results. Suggestions?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Andres Freund
Date:
Hi,

On 2023-10-02 10:42:37 -0400, Robert Haas wrote:
> I was trying to think of a test case where XLogInsertRecord would be
> exercised as heavily as possible, so I really wanted to generate a lot
> of WAL while doing as little real work as possible. The best idea that
> I had was to run pg_create_restore_point() in a loop.

What I use for that is pg_logical_emit_message(). Something like

SELECT count(*)
FROM
    (
        SELECT pg_logical_emit_message(false, '1', 'short'), generate_series(1, 10000)
    );

run via pgbench does seem to exercise that path nicely.


> One possible conclusion is that the differences here aren't actually
> big enough to get stressed about, but I don't want to jump to that
> conclusion without investigating the competing hypothesis that this
> isn't the right way to test this, and that some better test would show
> clearer results. Suggestions?

I saw some small differences in runtime running pgbench with the above query,
with a single client. Comparing profiles showed a surprising degree of
difference. That turns out to mostly a consequence of the fact that
ReserveXLogInsertLocation() isn't inlined anymore, because there now are two
callers of the function in XLogInsertRecord().

Unfortunately, I still see a small performance difference after that. To get
the most reproducible numbers, I disable turbo boost, bound postgres to one
cpu core, bound pgbench to another core. Over a few runs I quite reproducibly
get ~319.323 tps with your patches applied (+ always inline), and ~324.674
with master.

If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
performance does improve. But that "only" brings it up to 322.406. Not sure
what the rest is.


One thing that's notable, but not related to the patch, is that we waste a
fair bit of cpu time below XLogInsertRecord() with divisions. I think they're
all due to the use of UsableBytesInSegment in
XLogBytePosToRecPtr/XLogBytePosToEndRecPtr.  The multiplication of
XLogSegNoOffsetToRecPtr() also shows.

Greetings,

Andres Freund



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
> performance does improve. But that "only" brings it up to 322.406. Not sure
> what the rest is.

I don't really think this is worth worrying about. A sub-one-percent
regression on a highly artificial test case doesn't seem like a big
deal. Anybody less determined than you would have been unable to
measure that there even is a regression in the first place, and that's
basically everyone.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> One thing that's notable, but not related to the patch, is that we waste a
> fair bit of cpu time below XLogInsertRecord() with divisions. I think they're
> all due to the use of UsableBytesInSegment in
> XLogBytePosToRecPtr/XLogBytePosToEndRecPtr.  The multiplication of
> XLogSegNoOffsetToRecPtr() also shows.

Despite what I said in my earlier email, and with a feeling like unto
that created by the proximity of the sword of Damocles or some ghostly
albatross, I spent some time reflecting on this. Some observations:

1. The reason why we're doing this multiplication and division is to
make sure that the code in ReserveXLogInsertLocation which executes
while holding insertpos_lck remains as simple and brief as possible.
We could eliminate the conversion between usable byte positions and
LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
ReserveXLogInsertLocation work out by how much to advance the LSN, but
it would have to be worked out while holding insertpos_lck (or some
replacement lwlock, perhaps) and that cure seems worse than the
disease. Given that, I think we're stuck with converting between
usable bye positions and LSNs, and that intrinsically needs some
multiplication and division.

2. It seems possible to remove one branch in each of
XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
the rest of the calculations can be performed as if every page in the
segment had a header of length SizeOfXLogShortPHD, with no need to
special-case the first page. However, that doesn't get rid of any
multiplication or division, just a branch.

3. Aside from that, there seems to be no simple way to reduce the
complexity of an individual calculation, but ReserveXLogInsertLocation
does perform 3 rather similar computations, and I believe that we know
that it will always be the case that *PrevPtr < *StartPos < *EndPos.
Maybe we could have a fast-path for the case where they are all in the
same segment. We could take prevbytepos modulo UsableBytesInSegment;
call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
endbytepos - prevbytepos, then all three pointers are in the same
segment, and maybe we could take advantage of that to avoid performing
the segment calculations more than once, but still needing to repeat
the page calculations. Or, instead or in addition, I think we could by
a similar technique check whether all three pointers are on the same
page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
by just adding the difference between the corresponding byte
positions.

I'm not really sure whether that would come out cheaper. It's just the
only idea that I have. It did also occur to me to wonder whether the
apparent delays performing multiplication and division here were
really the result of the arithmetic itself being slow or whether they
were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
being a memory barrier just before. But I assume you thought about
that and concluded that wasn't the issue here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Andres Freund
Date:
Hi,

On 2023-10-06 13:44:55 -0400, Robert Haas wrote:
> On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
> > performance does improve. But that "only" brings it up to 322.406. Not sure
> > what the rest is.
> 
> I don't really think this is worth worrying about. A sub-one-percent
> regression on a highly artificial test case doesn't seem like a big
> deal.

I agree. I think it's worth measuring and looking at, after all the fix might
be trivial (like the case of the unlikely for the earlier if()). But it
shouldn't block progress on significant features.

I think this "issue" might be measurable in some other, not quite as artifical
cases, like INSERT ... SELECT or such. But even then it's going to be tiny.

Greetings,

Andres Freund



Re: New WAL record to detect the checkpoint redo location

From
Andres Freund
Date:
Hi,

As noted in my email from a few minutes ago, I agree that optimizing this
shouldn't be a requirement for merging the patch.


On 2023-10-09 15:58:36 -0400, Robert Haas wrote:
> 1. The reason why we're doing this multiplication and division is to
> make sure that the code in ReserveXLogInsertLocation which executes
> while holding insertpos_lck remains as simple and brief as possible.
> We could eliminate the conversion between usable byte positions and
> LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
> ReserveXLogInsertLocation work out by how much to advance the LSN, but
> it would have to be worked out while holding insertpos_lck (or some
> replacement lwlock, perhaps) and that cure seems worse than the
> disease. Given that, I think we're stuck with converting between
> usable bye positions and LSNs, and that intrinsically needs some
> multiplication and division.

Right, that's absolutely crucial for scalability.


> 2. It seems possible to remove one branch in each of
> XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
> whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
> increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
> the rest of the calculations can be performed as if every page in the
> segment had a header of length SizeOfXLogShortPHD, with no need to
> special-case the first page. However, that doesn't get rid of any
> multiplication or division, just a branch.

This reminded me about something I've been bugged by for a while: The whole
idea of short xlog page headers seems like a completely premature
optimization.  The page header is a very small amount of the overall data
(long: 40/8192 ~= 0.00488, short: 24/8192 ~= 0.00292), compared to the space
we waste in many other places, including on a per-record level, it doesn't
seem worth the complexity.



> 3. Aside from that, there seems to be no simple way to reduce the
> complexity of an individual calculation, but ReserveXLogInsertLocation
> does perform 3 rather similar computations, and I believe that we know
> that it will always be the case that *PrevPtr < *StartPos < *EndPos.
> Maybe we could have a fast-path for the case where they are all in the
> same segment. We could take prevbytepos modulo UsableBytesInSegment;
> call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
> endbytepos - prevbytepos, then all three pointers are in the same
> segment, and maybe we could take advantage of that to avoid performing
> the segment calculations more than once, but still needing to repeat
> the page calculations. Or, instead or in addition, I think we could by
> a similar technique check whether all three pointers are on the same
> page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
> by just adding the difference between the corresponding byte
> positions.

I think we might be able to speed some of this up by pre-compute values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and the
modulo should be a mask).


> I'm not really sure whether that would come out cheaper. It's just the
> only idea that I have. It did also occur to me to wonder whether the
> apparent delays performing multiplication and division here were
> really the result of the arithmetic itself being slow or whether they
> were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
> being a memory barrier just before. But I assume you thought about
> that and concluded that wasn't the issue here.

I did verify that they continue to be a bottleneck even after (incorrectly
obviously), removing the spinlock. It's also not too surprising, the latency
of 64bit divs is just high, particularly on intel from a few years ago (my
cascade lake workstation) and IIRC there's just a single execution port for it
too, so multiple instructions can't be fully parallelized.

https://uops.info/table.html documents a worst case latency of 89 cycles on
cascade lake, with the division broken up into 36 uops (reducing what's
available to track other in-flight instructions). It's much better on alter
lake (9 cycles and 7 uops on the perf cores, 44 cycles and 4 uops on
efficiency cores) and on zen 3+ (19 cycles, 2 uops).

Greetings,

Andres Freund



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> I think we might be able to speed some of this up by pre-compute values so we
> can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
> already insist on power-of-two segment sizes, so instead of needing to divide
> by a runtime value, we should be able to shift by a runtime value (and the
> modulo should be a mask).

Huh, is there a general technique for this when dividing by a
non-power-of-two? The segment size is a power of two, as is the page
size, but UsableBytesIn{Page,Segment} are some random value slightly
less than a power of two.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Andres Freund
Date:
Hi,

On 2023-10-09 18:31:11 -0400, Robert Haas wrote:
> On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we might be able to speed some of this up by pre-compute values so we
> > can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
> > already insist on power-of-two segment sizes, so instead of needing to divide
> > by a runtime value, we should be able to shift by a runtime value (and the
> > modulo should be a mask).
> 
> Huh, is there a general technique for this when dividing by a
> non-power-of-two?

There is, but I was just having a brainfart, forgetting that UsableBytesInPage
isn't itself a power of two. The general technique is used by compilers, but
doesn't iirc lend itself well to be done at runtime.

Greetings,

Andres Freund



Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> As noted in my email from a few minutes ago, I agree that optimizing this
> shouldn't be a requirement for merging the patch.

Here's a new patch set. I think I've incorporated the performance
fixes that you've suggested so far into this version. I also adjusted
a couple of other things:

- After further study of a point previously raised by Amit, I adjusted
CreateCheckPoint slightly to call WALInsertLockAcquireExclusive
significantly later than before. I think that there's no real reason
to do it so early and that the current coding is probably just a
historical leftover, but it would be good to have some review here.

- I added a cross-check that when starting redo from a checkpoint
whose redo pointer points to an earlier LSN that the checkpoint
itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO
record.

- I combined what were previously 0002 and 0003 into a single patch,
since that's how this would get committed.

- I fixed up some comments.

- I updated commit messages.

Hopefully this is getting close to good enough.

> I did verify that they continue to be a bottleneck even after (incorrectly
> obviously), removing the spinlock. It's also not too surprising, the latency
> of 64bit divs is just high, particularly on intel from a few years ago (my
> cascade lake workstation) and IIRC there's just a single execution port for it
> too, so multiple instructions can't be fully parallelized.

The chipset on my laptop is even older. Coffee Lake, I think.

I'm not really sure that there's a whole lot we can reasonably do
about the divs unless you like the fastpath idea that I proposed
earlier, or unless you want to write a patch to either get rid of
short page headers or make long and short page headers the same number
of bytes. I have to admit I'm surprised by how visible the division
overhead is in this code path -- but I'm also somewhat inclined to
view that less as evidence that division is something we should be
desperate to eliminate and more as evidence that this code path is
quite fast already. In light of your findings, it doesn't seem
completely impossible to me that the speed of integer division in this
code path could be part of what limits performance for some users, but
I'm also not sure it's all that likely or all that serious, because
we're deliberating creating test cases that insert unreasonable
amounts of WAL without doing any actual work. In the real world,
there's going to be a lot more other code running along with this code
- probably at least the executor and some heap AM code - and I bet not
all of that is as well-optimized as this is already. And it's also
quite likely for many users that the real limits on the speed of the
workload will be related to I/O or lock contention rather than CPU
cost in any form. I'm not saying it's not worth worrying about it. I'm
just saying that we should make sure the amount of worrying we do is
calibrated to the true importance of the issue.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Thomas Munro
Date:
On Tue, Oct 10, 2023 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we might be able to speed some of this up by pre-compute values so we
> > can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
> > already insist on power-of-two segment sizes, so instead of needing to divide
> > by a runtime value, we should be able to shift by a runtime value (and the
> > modulo should be a mask).
>
> Huh, is there a general technique for this when dividing by a
> non-power-of-two? The segment size is a power of two, as is the page
> size, but UsableBytesIn{Page,Segment} are some random value slightly
> less than a power of two.

BTW in case someone is interested, Hacker's Delight (a book that has
come up on this list a few times before) devotes a couple of chapters
of magical incantations to this topic.  Compilers know that magic, and
one thought I had when I first saw this discussion was that we could
specialise the code for the permissible wal segment sizes.  But nuking
the variable sized page headers sounds better.



Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote:
> - I combined what were previously 0002 and 0003 into a single patch,
> since that's how this would get committed.
>
> - I fixed up some comments.
>
> - I updated commit messages.
>
> Hopefully this is getting close to good enough.

I have looked at 0001, for now..  And it looks OK to me.

+        * Nonetheless, this case is simpler than the normal cases handled
+        * above, which must check for changes in doPageWrites and RedoRecPtr.
+        * Those checks are only needed for records that can contain
+        * full-pages images, and an XLOG_SWITCH record never does.
+        Assert(fpw_lsn == InvalidXLogRecPtr);

Right, that's the core reason behind the refactoring.  The assertion
is a good idea.
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Thu, Oct 12, 2023 at 3:27 AM Michael Paquier <michael@paquier.xyz> wrote:
> I have looked at 0001, for now..  And it looks OK to me.

Cool. I've committed that one. Thanks for the review.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote:
> Here's a new patch set. I think I've incorporated the performance
> fixes that you've suggested so far into this version. I also adjusted
> a couple of other things:

Now looking at 0002, where you should be careful about the code
indentation or koel will complain.

> - After further study of a point previously raised by Amit, I adjusted
> CreateCheckPoint slightly to call WALInsertLockAcquireExclusive
> significantly later than before. I think that there's no real reason
> to do it so early and that the current coding is probably just a
> historical leftover, but it would be good to have some review here.

This makes the new code call LocalSetXLogInsertAllowed() and what we
set for checkPoint.PrevTimeLineID after taking the insertion locks,
which should be OK.

> - I added a cross-check that when starting redo from a checkpoint
> whose redo pointer points to an earlier LSN that the checkpoint
> itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO
> record.

I've mentioned as well a test in pg_walinspect after one of the
checkpoints generated there, but what you do here is enough for the
online case.

+       /*
+        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
+        * that into the record that will be inserted when the checkpoint is
+        * complete.
+        */
+       checkPoint.redo = RedoRecPtr;

For online checkpoints, a very important point is that
XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord().
Perhaps that's worth an addition?  I was a bit confused first that we
do the following for shutdown checkpoints:
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

Then repeat this pattern for non-shutdown checkpoints a few lines down
without touching the copy of the redo LSN in XLogCtl->Insert, because
of course we don't hold the WAL insert locks in an exclusive fashion
here:
checkPoint.redo = RedoRecPtr;

My point is that this is not only about RedoRecPtr, but also about
XLogCtl->Insert.RedoRecPtr here.  The comment in ReserveXLogSwitch()
says that.
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote:
> Now looking at 0002, where you should be careful about the code
> indentation or koel will complain.

Fixed in the attached version.

> This makes the new code call LocalSetXLogInsertAllowed() and what we
> set for checkPoint.PrevTimeLineID after taking the insertion locks,
> which should be OK.

Cool.

> I've mentioned as well a test in pg_walinspect after one of the
> checkpoints generated there, but what you do here is enough for the
> online case.

I don't quite understand what you're saying here. If you're suggesting
a potential improvement, can you be a bit more clear and explicit
about what the suggestion is?

> +       /*
> +        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
> +        * that into the record that will be inserted when the checkpoint is
> +        * complete.
> +        */
> +       checkPoint.redo = RedoRecPtr;
>
> For online checkpoints, a very important point is that
> XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord().
> Perhaps that's worth an addition?  I was a bit confused first that we
> do the following for shutdown checkpoints:
> RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
>
> Then repeat this pattern for non-shutdown checkpoints a few lines down
> without touching the copy of the redo LSN in XLogCtl->Insert, because
> of course we don't hold the WAL insert locks in an exclusive fashion
> here:
> checkPoint.redo = RedoRecPtr;
>
> My point is that this is not only about RedoRecPtr, but also about
> XLogCtl->Insert.RedoRecPtr here.  The comment in ReserveXLogSwitch()
> says that.

I have adjusted the comment in CreateCheckPoint to hopefully address
this concern.  I don't understand what you mean about
ReserveXLogSwitch(), though.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 12:45:52PM -0400, Robert Haas wrote:
> On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote:
>> I've mentioned as well a test in pg_walinspect after one of the
>> checkpoints generated there, but what you do here is enough for the
>> online case.
>
> I don't quite understand what you're saying here. If you're suggesting
> a potential improvement, can you be a bit more clear and explicit
> about what the suggestion is?

Suggestion is from here, with a test for pg_walinspect after it runs
its online checkpoint (see the full-page case):
https://www.postgresql.org/message-id/ZOvf1tu6rfL/B2PW@paquier.xyz

+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');

>> Then repeat this pattern for non-shutdown checkpoints a few lines down
>> without touching the copy of the redo LSN in XLogCtl->Insert, because
>> of course we don't hold the WAL insert locks in an exclusive fashion
>> here:
>> checkPoint.redo = RedoRecPtr;
>>
>> My point is that this is not only about RedoRecPtr, but also about
>> XLogCtl->Insert.RedoRecPtr here.  The comment in ReserveXLogSwitch()
>> says that.
>
> I have adjusted the comment in CreateCheckPoint to hopefully address
> this concern.

-        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
-        * that into the record that will be inserted when the checkpoint is
-        * complete.
+        * XLogInsertRecord will have updated XLogCtl->Insert.RedoRecPtr in
+        * shared memory and RedoRecPtr in backend-local memory, but we need
+        * to copy that into the record that will be inserted when the
+        * checkpoint is complete.

This comment diff between v8 and v9 looks OK to me.  Thanks.

> I don't understand what you mean about
> ReserveXLogSwitch(), though.

I am not sure either, looking back at that :p
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Tue, Oct 17, 2023 at 8:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> Suggestion is from here, with a test for pg_walinspect after it runs
> its online checkpoint (see the full-page case):
> https://www.postgresql.org/message-id/ZOvf1tu6rfL/B2PW@paquier.xyz
>
> +-- Check presence of REDO record.
> +SELECT redo_lsn FROM pg_control_checkpoint() \gset
> +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
> +  FROM pg_get_wal_record_info(:'redo_lsn');

I added a variant of this test case. Here's v10.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Michael Paquier
Date:
On Wed, Oct 18, 2023 at 10:24:50AM -0400, Robert Haas wrote:
> I added a variant of this test case. Here's v10.

+-- Verify that an XLOG_CHECKPOINT_REDO record begins at precisely the redo LSN
+-- of the checkpoint we just performed.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, resource_manager,
+    record_type FROM pg_get_wal_record_info(:'redo_lsn');
+ same_lsn | resource_manager |   record_type
+----------+------------------+-----------------
+ t        | XLOG             | CHECKPOINT_REDO
+(1 row)

Seems fine to me.  Thanks for considering the idea.
--
Michael

Attachment

Re: New WAL record to detect the checkpoint redo location

From
Robert Haas
Date:
On Thu, Oct 19, 2023 at 1:53 AM Michael Paquier <michael@paquier.xyz> wrote:
> Seems fine to me.  Thanks for considering the idea.

I think it was a good idea!

I've committed the patch.

--
Robert Haas
EDB: http://www.enterprisedb.com