Thread: New WAL record to detect the checkpoint redo location
As discussed [1 ][2] currently, the checkpoint-redo LSN can not be accurately detected while processing the WAL. Although we have a checkpoint WAL record containing the exact redo LSN, other WAL records may be inserted between the checkpoint-redo LSN and the actual checkpoint record. If we want to stop processing wal exactly at the checkpoint-redo location then we cannot do that because we would have already processed some extra records that got added after the redo LSN. The patch inserts a special wal record named CHECKPOINT_REDO WAL, which is located exactly at the checkpoint-redo location. We can guarantee this record to be exactly at the checkpoint-redo point because we already hold the exclusive WAL insertion lock while identifying the checkpoint redo point and can insert this special record exactly at the same time so that there are no concurrent WAL insertions. [1] https://www.postgresql.org/message-id/CA%2BTgmoYOYZfMCyOXFyC-P%2B-mdrZqm5pP2N7S-r0z3_402h9rsA%40mail.gmail.com [2] https://www.postgresql.org/message-id/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi, As I think I mentioned before, I like this idea. However, I don't like the implementation too much. On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote: > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index b2430f617c..a025fb91e2 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata, > XLogRecPtr StartPos; > XLogRecPtr EndPos; > bool prevDoPageWrites = doPageWrites; > + bool callerHoldingExlock = holdingAllLocks; > TimeLineID insertTLI; > > /* we assume that all of the record header is in the first chunk */ > @@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata, > *---------- > */ > START_CRIT_SECTION(); > - if (isLogSwitch) > - WALInsertLockAcquireExclusive(); > - else > - WALInsertLockAcquire(); > + > + /* > + * Acquire wal insertion lock, nothing to do if the caller is already > + * holding the exclusive lock. > + */ > + if (!callerHoldingExlock) > + { > + if (isLogSwitch) > + WALInsertLockAcquireExclusive(); > + else > + WALInsertLockAcquire(); > + } > > /* > * Check to see if my copy of RedoRecPtr is out of date. If so, may have This might work right now, but doesn't really seem maintainable. Nor do I like adding branches into this code a whole lot. > @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags) I think the commentary above this function would need a fair bit of revising... > */ > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > + /* > + * Insert a special purpose CHECKPOINT_REDO record as the first record at > + * checkpoint redo lsn. Although we have the checkpoint record that > + * contains the exact redo lsn, there might have been some other records > + * those got inserted between the redo lsn and the actual checkpoint > + * record. So when processing the wal, we cannot rely on the checkpoint > + * record if we want to stop at the checkpoint-redo LSN. > + * > + * This special record, however, is not required when we doing a shutdown > + * checkpoint, as there will be no concurrent wal insertions during that > + * time. So, the shutdown checkpoint LSN will be the same as > + * checkpoint-redo LSN. > + * > + * This record is guaranteed to be the first record at checkpoint redo lsn > + * because we are inserting this while holding the exclusive wal insertion > + * lock. > + */ > + if (!shutdown) > + { > + int dummy = 0; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &dummy, sizeof(dummy)); > + recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO); > + } It seems to me that we should be able to do better than this. I suspect we might be able to get rid of the need for exclusive inserts here. If we rid of that, we could determine the redoa location based on the LSN determined by the XLogInsert(). Alternatively, I think we should split XLogInsertRecord() so that the part with the insertion locks held is a separate function, that we could use here. Greetings, Andres Freund
On Fri, Jul 14, 2023 at 8:46 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > As I think I mentioned before, I like this idea. However, I don't like the > implementation too much. Thanks for looking into it. > This might work right now, but doesn't really seem maintainable. Nor do I like > adding branches into this code a whole lot. Okay, Now I have moved the XlogInsert for the special record outside the WalInsertLock so we don't need this special handling here. > > @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags) > > I think the commentary above this function would need a fair bit of > revising... > > > */ > > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > > > + /* > > + * Insert a special purpose CHECKPOINT_REDO record as the first record at > > + * checkpoint redo lsn. Although we have the checkpoint record that > > + * contains the exact redo lsn, there might have been some other records > > + * those got inserted between the redo lsn and the actual checkpoint > > + * record. So when processing the wal, we cannot rely on the checkpoint > > + * record if we want to stop at the checkpoint-redo LSN. > > + * > > + * This special record, however, is not required when we doing a shutdown > > + * checkpoint, as there will be no concurrent wal insertions during that > > + * time. So, the shutdown checkpoint LSN will be the same as > > + * checkpoint-redo LSN. > > + * > > + * This record is guaranteed to be the first record at checkpoint redo lsn > > + * because we are inserting this while holding the exclusive wal insertion > > + * lock. > > + */ > > + if (!shutdown) > > + { > > + int dummy = 0; > > + > > + XLogBeginInsert(); > > + XLogRegisterData((char *) &dummy, sizeof(dummy)); > > + recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO); > > + } > > It seems to me that we should be able to do better than this. > > I suspect we might be able to get rid of the need for exclusive inserts > here. If we rid of that, we could determine the redoa location based on the > LSN determined by the XLogInsert(). Yeah, good idea, actually we can do this insert outside of the exclusive insert lock and set the LSN of this insert as the checkpoint. redo location. So now we do not need to compute the checkpoint. redo based on the current insertion point we can directly use the LSN of this record. I have modified this and I have also moved some other code that is not required to be inside the WAL insertion lock. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote: > Yeah, good idea, actually we can do this insert outside of the > exclusive insert lock and set the LSN of this insert as the > checkpoint. redo location. So now we do not need to compute the > checkpoint. redo based on the current insertion point we can directly > use the LSN of this record. I have modified this and I have also > moved some other code that is not required to be inside the WAL > insertion lock. Looking at this patch, I am bit surprised to see that the redo point maps with the end location of the CHECKPOINT_REDO record as it is the LSN returned by XLogInsert(), not its start LSN. For example after a checkpoint: =# CREATE EXTENSION pg_walinspect; CREATE EXTENSION; =# SELECT redo_lsn, checkpoint_lsn from pg_control_checkpoint(); redo_lsn | checkpoint_lsn -----------+---------------- 0/19129D0 | 0/1912A08 (1 row) =# SELECT start_lsn, prev_lsn, end_lsn, record_type from pg_get_wal_record_info('0/19129D0'); start_lsn | prev_lsn | end_lsn | record_type -----------+-----------+-----------+--------------- 0/19129D0 | 0/19129B0 | 0/1912A08 | RUNNING_XACTS (1 row) In this case it matches with the previous record: =# SELECT start_lsn, prev_lsn, end_lsn, record_type from pg_get_wal_record_info('0/19129B0'); start_lsn | prev_lsn | end_lsn | record_type -----------+-----------+-----------+----------------- 0/19129B0 | 0/1912968 | 0/19129D0 | CHECKPOINT_REDO (1 row) This could be used to cross-check that the first record replayed is of the correct type. The commit message of this patch tells that "the checkpoint-redo location is set at LSN of this record", which implies the start LSN of the record tracked as the redo LSN, not the end of it? What's the intention here? -- Michael
Attachment
On Thu, Aug 17, 2023 at 10:52 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote: > > Yeah, good idea, actually we can do this insert outside of the > > exclusive insert lock and set the LSN of this insert as the > > checkpoint. redo location. So now we do not need to compute the > > checkpoint. redo based on the current insertion point we can directly > > use the LSN of this record. I have modified this and I have also > > moved some other code that is not required to be inside the WAL > > insertion lock. > > Looking at this patch, I am bit surprised to see that the redo point > maps with the end location of the CHECKPOINT_REDO record as it is the > LSN returned by XLogInsert(), not its start LSN. Yeah right, actually I was confused, I assumed it will return the start LSN of the record. And I do not see any easy way to identify the Start LSN of this record so maybe this solution will not work. I will have to think of something else. Thanks for pointing it out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote: > Yeah right, actually I was confused, I assumed it will return the > start LSN of the record. And I do not see any easy way to identify > the Start LSN of this record so maybe this solution will not work. I > will have to think of something else. Thanks for pointing it out. About that. One thing to consider may be ReserveXLogInsertLocation() while holding the WAL insert lock, but you can just rely on ProcLastRecPtr for the job after inserting the REDO record, no? -- Michael
Attachment
On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote: > > Yeah right, actually I was confused, I assumed it will return the > > start LSN of the record. And I do not see any easy way to identify > > the Start LSN of this record so maybe this solution will not work. I > > will have to think of something else. Thanks for pointing it out. > > About that. One thing to consider may be ReserveXLogInsertLocation() > while holding the WAL insert lock, but you can just rely on > ProcLastRecPtr for the job after inserting the REDO record, no? Yeah right, we can use ProcLastRecPtr. I will send the updated patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 18, 2023 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote: > > > Yeah right, actually I was confused, I assumed it will return the > > > start LSN of the record. And I do not see any easy way to identify > > > the Start LSN of this record so maybe this solution will not work. I > > > will have to think of something else. Thanks for pointing it out. > > > > About that. One thing to consider may be ReserveXLogInsertLocation() > > while holding the WAL insert lock, but you can just rely on > > ProcLastRecPtr for the job after inserting the REDO record, no? > > Yeah right, we can use ProcLastRecPtr. I will send the updated patch. Here is the updated version of the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote: > Here is the updated version of the patch. The concept of the patch looks sound to me. I have a few comments. + * This special record, however, is not required when we doing a shutdown + * checkpoint, as there will be no concurrent wal insertions during that + * time. So, the shutdown checkpoint LSN will be the same as + * checkpoint-redo LSN. This is missing "are", as in "when we are doing a shutdown checkpoint". - freespace = INSERT_FREESPACE(curInsert); - if (freespace == 0) The variable "freespace" can be moved within the if block introduced by this patch when calculating the REDO location for the shutdown case. And you can do the same with curInsert? - * Compute new REDO record ptr = location of next XLOG record. - * - * NB: this is NOT necessarily where the checkpoint record itself will be, - * since other backends may insert more XLOG records while we're off doing - * the buffer flush work. Those XLOG records are logically after the - * checkpoint, even though physically before it. Got that? + * In case of shutdown checkpoint, compute new REDO record + * ptr = location of next XLOG record. It seems to me that keeping around this comment is important, particularly for the case where we have a shutdown checkpoint and we expect nothing to run, no? How about adding a test in pg_walinspect? There is an online checkpoint running there, so you could just add something like that to check that the REDO record is at the expected LSN stored in the control file: +-- Check presence of REDO record. +SELECT redo_lsn FROM pg_control_checkpoint() \gset +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type + FROM pg_get_wal_record_info(:'redo_lsn'); -- Michael
Attachment
On Mon, Aug 28, 2023 at 5:14 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote: > > Here is the updated version of the patch. > > The concept of the patch looks sound to me. I have a few comments. Thanks for the review > + * This special record, however, is not required when we doing a shutdown > + * checkpoint, as there will be no concurrent wal insertions during that > + * time. So, the shutdown checkpoint LSN will be the same as > + * checkpoint-redo LSN. > > This is missing "are", as in "when we are doing a shutdown > checkpoint". Fixed > - freespace = INSERT_FREESPACE(curInsert); > - if (freespace == 0) > > The variable "freespace" can be moved within the if block introduced > by this patch when calculating the REDO location for the shutdown > case. And you can do the same with curInsert? Done, I have also moved code related to computing curInsert in the same if (shutdown) block. > - * Compute new REDO record ptr = location of next XLOG record. > - * > - * NB: this is NOT necessarily where the checkpoint record itself will be, > - * since other backends may insert more XLOG records while we're off doing > - * the buffer flush work. Those XLOG records are logically after the > - * checkpoint, even though physically before it. Got that? > + * In case of shutdown checkpoint, compute new REDO record > + * ptr = location of next XLOG record. > > It seems to me that keeping around this comment is important, > particularly for the case where we have a shutdown checkpoint and we > expect nothing to run, no? I removed this mainly because now in other comments[1] where we are introducing this new CHECKPOINT_REDO record we are explaining the problem that the redo location and the actual checkpoint records are not at the same place and that is because of the concurrent xlog insertion. I think we are explaining in more detail by also stating that in case of a shutdown checkpoint, there would not be any concurrent insertion so the shutdown checkpoint redo will be at the same place. So I feel keeping old comments is not required. And we are explaining it when we are setting this for non-shutdown checkpoint because for shutdown checkpoint this statement is anyway not correct because for the shutdown checkpoint the checkpoint record will be at the same location and there will be no concurrent wal insertion, what do you think? [1] + /* + * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record + * as checkpoint.redo. Although we have the checkpoint record that also + * contains the exact redo lsn, there might have been some other records + * those got inserted between the redo lsn and the actual checkpoint + * record. So when processing the wal, we cannot rely on the checkpoint + * record if we want to stop at the checkpoint-redo LSN. + * + * This special record, however, is not required when we are doing a + * shutdown checkpoint, as there will be no concurrent wal insertions + * during that time. So, the shutdown checkpoint LSN will be the same as + * checkpoint-redo LSN. + */ > > How about adding a test in pg_walinspect? There is an online > checkpoint running there, so you could just add something like that > to check that the REDO record is at the expected LSN stored in the > control file: > +-- Check presence of REDO record. > +SELECT redo_lsn FROM pg_control_checkpoint() \gset > +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type > + FROM pg_get_wal_record_info(:'redo_lsn'); > -- Done, thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote: > I removed this mainly because now in other comments[1] where we are > introducing this new CHECKPOINT_REDO record we are explaining the > problem > that the redo location and the actual checkpoint records are not at > the same place and that is because of the concurrent xlog insertion. > I think we are explaining in more > detail by also stating that in case of a shutdown checkpoint, there > would not be any concurrent insertion so the shutdown checkpoint redo > will be at the same place. So I feel keeping old comments is not > required. > And we are explaining it when we are setting this for > non-shutdown checkpoint because for shutdown checkpoint this statement > is anyway not correct because for the shutdown checkpoint the > checkpoint record will be at the same location and there will be no > concurrent wal insertion, what do you think? + * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record + * as checkpoint.redo. I would add a "for a non-shutdown checkpoint" at the end of this sentence. + * record. So when processing the wal, we cannot rely on the checkpoint + * record if we want to stop at the checkpoint-redo LSN. The term "checkpoint-redo" is also a bit confusing, I guess, because you just mean to refer to the "redo" LSN here? Maybe rework the last sentence as: "So, when processing WAL, we cannot rely on the checkpoint record if we want to stop at the same position as the redo LSN". + * This special record, however, is not required when we are doing a + * shutdown checkpoint, as there will be no concurrent wal insertions + * during that time. So, the shutdown checkpoint LSN will be the same as + * checkpoint-redo LSN. Perhaps the last sentence could be merged with the first one, if we are tweaking things, say: "This special record is not required when doing a shutdown checkpoint; the redo LSN is the same LSN as the checkpoint record as there cannot be any WAL activity in a shutdown sequence." -- Michael
Attachment
On Wed, Aug 30, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote: > > I removed this mainly because now in other comments[1] where we are > > introducing this new CHECKPOINT_REDO record we are explaining the > > problem > > that the redo location and the actual checkpoint records are not at > > the same place and that is because of the concurrent xlog insertion. > > I think we are explaining in more > > detail by also stating that in case of a shutdown checkpoint, there > > would not be any concurrent insertion so the shutdown checkpoint redo > > will be at the same place. So I feel keeping old comments is not > > required. > > And we are explaining it when we are setting this for > > non-shutdown checkpoint because for shutdown checkpoint this statement > > is anyway not correct because for the shutdown checkpoint the > > checkpoint record will be at the same location and there will be no > > concurrent wal insertion, what do you think? > > + * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record > + * as checkpoint.redo. > > I would add a "for a non-shutdown checkpoint" at the end of this > sentence. > > + * record. So when processing the wal, we cannot rely on the checkpoint > + * record if we want to stop at the checkpoint-redo LSN. > > The term "checkpoint-redo" is also a bit confusing, I guess, because > you just mean to refer to the "redo" LSN here? Maybe rework the last > sentence as: > "So, when processing WAL, we cannot rely on the checkpoint record if > we want to stop at the same position as the redo LSN". > > + * This special record, however, is not required when we are doing a > + * shutdown checkpoint, as there will be no concurrent wal insertions > + * during that time. So, the shutdown checkpoint LSN will be the same as > + * checkpoint-redo LSN. > > Perhaps the last sentence could be merged with the first one, if we > are tweaking things, say: > "This special record is not required when doing a shutdown checkpoint; > the redo LSN is the same LSN as the checkpoint record as there cannot > be any WAL activity in a shutdown sequence." Your suggestions LGTM so modified accordingly -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote: > Your suggestions LGTM so modified accordingly I have been putting my HEAD on this patch for a few hours, reviewing the surroundings, and somewhat missed that this computation is done while we do not hold the WAL insert locks: + checkPoint.redo = ProcLastRecPtr; Then a few lines down the shared Insert.RedoRecPtr is updated while holding an exclusive lock. RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; If we have a bunch of records inserted between the moment when the REDO record is inserted and the moment when the checkpointer takes the exclusive WAL lock, aren't we potentially missing a lot of FPW's that should exist since the redo LSN? -- Michael
Attachment
On Thu, Aug 31, 2023 at 9:36 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote: > > Your suggestions LGTM so modified accordingly > > I have been putting my HEAD on this patch for a few hours, reviewing > the surroundings, and somewhat missed that this computation is done > while we do not hold the WAL insert locks: > + checkPoint.redo = ProcLastRecPtr; > > Then a few lines down the shared Insert.RedoRecPtr is updated while > holding an exclusive lock. > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > If we have a bunch of records inserted between the moment when the > REDO record is inserted and the moment when the checkpointer takes the > exclusive WAL lock, aren't we potentially missing a lot of FPW's that > should exist since the redo LSN? Yeah, good catch. With this, it seems like we can not move this new WAL Insert out of the Exclusive WAL insertion lock right? because if we want to set the LSN of this record as the checkpoint. redo then there should not be any concurrent insertion until we expose the XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all the record which has been inserted after the checkpoint. redo before we acquired the exclusive WAL insertion lock. So maybe I need to restart from the first version of the patch but instead of moving the insertion of the new record out of the exclusive lock need to do some better refactoring so that XLogInsertRecord() doesn't look ugly. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 31, 2023 at 09:55:45AM +0530, Dilip Kumar wrote: > Yeah, good catch. With this, it seems like we can not move this new > WAL Insert out of the Exclusive WAL insertion lock right? Because if > we want to set the LSN of this record as the checkpoint.redo then > there should not be any concurrent insertion until we expose the > XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all > the record which has been inserted after the checkpoint.redo before > we acquired the exclusive WAL insertion lock. Yes. > So maybe I need to restart from the first version of the patch but > instead of moving the insertion of the new record out of the exclusive > lock need to do some better refactoring so that XLogInsertRecord() > doesn't look ugly. Yes, I am not sure which interface would be less ugli-ish, but that's enough material for a refactoring patch of the WAL insert routines on top of the main patch that introduces the REDO record. -- Michael
Attachment
On Fri, Jul 14, 2023 at 11:16 AM Andres Freund <andres@anarazel.de> wrote: > I suspect we might be able to get rid of the need for exclusive inserts > here. If we rid of that, we could determine the redoa location based on the > LSN determined by the XLogInsert(). I've been brainstorming about this today, trying to figure out some ideas to make it work. As Michael Paquier correctly noted downthread, we need to make sure that a backend inserting a WAL record knows whether it needs to contain an FPI. The comments in the do...while loop in XLogInsert are pretty helpful here: doPageWrites can't change once XLogInsertRecord acquires a WAL insertion lock. For that to be true, the redo pointer can only move when holding all WAL insertion locks. That means that if we add an XLOG_CHECKPOINT_REDO to mark the location of the redo pointer, we've got to either (a) insert the record *and* update our notion of the last redo pointer while holding all the WAL insertion locks or (b) change the concurrency model in some way. Let's explore (b) first. Perhaps my imagination is too limited here, but I'm having trouble seeing a good way of doing this. One idea that occurred to me was to make either the insertion of the XLOG_CHECKPOINT_REDO record fail softly if somebody inserts a record after it that omits FPIs, but that doesn't really work because then we're left with a hole in the WAL. It's too late to move the later record earlier. We could convert the intended XLOG_CHECKPOINT_REDO record into a dummy record but that seems complex and wasteful. Similarly, you could try to make the insertion of the later record fail, but that basically has the same problem: there could be an even later record being inserted after that which it's already too late to reposition. Basically, it feels like once we get to the point where we have a range of LSNs and we're copying data into wal_buffers, it's awfully late to be trying to back out. Other people can already be depending on us to put the amount of WAL that we promised to insert at the place where we promised to put it. The only other approach to (b) that I can think of is to force FPIs on for all backends from just before to just after we insert the XLOG_CHECKPOINT_REDO record. However, since we currently require taking all the WAL insertion locks to start requiring full page writes, this doesn't seem like it gains much. In theory perhaps we could have an approach where we flip full page writes to sorta-on, then wait until we've seen each WAL insertion lock unheld at least once, and then at that point we know all new WAL insertions will see them and can deem them fully on. However, when I start to think along these lines, I feel like maybe I'm losing the plot. Checkpoints are rare enough that the overhead of taking all the WAL insertion locks at the same time isn't really a big problem, or at least I don't think it is. I think the concern here is more about avoiding useless branches in hot paths that potentially cost something for every single record whether it has anything to do with this mechanism or not. OK, so let's suppose we abandon the idea of changing the concurrency model in any fundamental way and just try to figure out how to both insert the record and update our notion of the last redo pointer while holding all the WAL insertion locks i.e. (a) from the two options above. Dilip's patch approaches this problem by pulling acquisition of the WAL insertion locks up to the place where we're already setting the redo pointer. I wonder if we could also consider the other possible approach of pushing the update to Insert->RedoRecPtr down into XLogInsertRecord(), which already has a special case for acquiring all locks when the record being inserted is an XLOG_SWITCH record. That would have the advantage of holding all of the insertion locks for a smaller period of time than what Dilip's patch does -- in particular, it wouldn't need to hold the lock across the XLOG_CHECKPOINT_REDO's XLogRecordAssemble -- or across the rather lengthy tail of XLogInsertRecord. But the obvious objection is that it would put more branches into XLogInsertRecord which nobody wants. But ... what if it didn't? Suppose we pulled the XLOG_SWITCH case out of XLogInsertRecord and made a separate function for that case. It looks to me like that would avoid 5 branches in that function in the normal case where we're not inserting XLOG_SWITCH. We would want to move some logic, particularly the WAL_DEBUG stuff and maybe other things, into reusable subroutines. Then, we could decide to treat XLOG_CHECKPOINT_REDO either in the normal path -- adding a couple of those branches back again -- or in the XLOG_SWITCH function and either way I think the normal path would have fewer branches than it does today. One idea that I had was to create a new rmgr for "weird records," initially XLOG_SWITCH and XLOG_CHECKPOINT_REDO. Then the test as to whether to use the "normal" version of XLogInsertRecord or the "weird" version could just be based on the rmid, and the "normal" function wouldn't need to concern itself with anything specific to the "weird" cases. A variant on this idea would be to just accept a few extra branches and hope it's not really that big of a deal. For instance, instead of this: bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID && info == XLOG_SWITCH); We could have this: bool isAllLocks = (rechdr->xl_rmid == RM_BIZARRE_ID); bool isLogSwitch = (isAllLocks && info == XLOG_SWITCH); ...and then conditionalize on either isAllLocks or isLogSwitch as apppropriate. You'd still need an extra branch someplace to update Insert->RedoRecPtr when isAllLocks && info == XLOG_CHECKPOINT_REDO, but maybe that's not so bad? > Alternatively, I think we should split XLogInsertRecord() so that the part > with the insertion locks held is a separate function, that we could use here. The difficulty that I see with this is that the function does a lot more stuff after calling WALInsertLockRelease(). So just pushing the part that's between acquiring and releasing WAL insertion locks into a separate function wouldn't actually avoid a lot of code duplication, if the goal was to do everything else that XLogInsertRecord() does except for the lock manipulation. To get there, I think we'd need to move all of the stuff after the lock release into one or more static functions, too. Which is possibly an OK approach. I haven't checked how much additional parameter passing we'd end up doing if we went that way. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > I've been brainstorming about this today, trying to figure out some > ideas to make it work. Here are some patches. 0001 refactors XLogInsertRecord to unify a couple of separate tests of isLogSwitch, hopefully making it cleaner and cheaper to add more special cases. 0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using it for anything. 0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO record for any non-shutdown checkpoint, and modifies XLogInsertRecord() to treat that as a new special case, wherein after inserting the record the redo pointer is reset while still holding the WAL insertion locks. I've tested this to the extent of running the regression tests, and I also did one (1) manual test where it looked like the right thing was happening, but that's it, so this might be buggy or perform like garbage for all I know. But my hope is that it isn't buggy and performs adequately. If there's any chance of getting some comments on the basic design choices before I spend time testing and polishing it, that would be very helpful. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Thu, Sep 21, 2023 at 7:05 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I've been brainstorming about this today, trying to figure out some > > ideas to make it work. > > Here are some patches. > > 0001 refactors XLogInsertRecord to unify a couple of separate tests of > isLogSwitch, hopefully making it cleaner and cheaper to add more > special cases. > > 0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using > it for anything. > > 0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO > record for any non-shutdown checkpoint, and modifies > XLogInsertRecord() to treat that as a new special case, wherein after > inserting the record the redo pointer is reset while still holding the > WAL insertion locks. > After the 0003 patch, do we need acquire exclusive lock via WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the comment "We must block concurrent insertions while examining insert state to determine the checkpoint REDO pointer." seems to indicate that it is not required. If it is required then we may want to change the comments and also acquiring the locks twice will have more cost than acquiring it once and write the new WAL record under that lock. One minor comment: + else if (XLOG_CHECKPOINT_REDO) + class = WALINSERT_SPECIAL_CHECKPOINT; + } Isn't the check needs to compare the record type with info? Your v6-0001* patch looks like an improvement to me even without the other two patches. BTW, I would like to mention that there is a slight interaction of this work with the patch to upgrade/migrate slots [1]. Basically in [1], to allow slots migration from lower to higher version, we need to ensure that all the WAL has been consumed by the slots before clean shutdown. However, during upgrade we can generate few records like checkpoint which we will ignore for the slot consistency checking as such records doesn't matter for data consistency after upgrade. We probably need to add this record to that list. I'll keep an eye on both the patches so that we don't miss that interaction but mentioned it here to make others also aware of the same. [1] - https://www.postgresql.org/message-id/TYAPR01MB586615579356A84A8CF29A00F5F9A%40TYAPR01MB5866.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > After the 0003 patch, do we need acquire exclusive lock via > WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the > comment "We must block concurrent insertions while examining insert > state to determine the checkpoint REDO pointer." seems to indicate > that it is not required. If it is required then we may want to change > the comments and also acquiring the locks twice will have more cost > than acquiring it once and write the new WAL record under that lock. I think the comment needs updating. I don't think we can do curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks. Same for Insert->fullPageWrites. I agree that it looks a little wasteful to release the lock and then reacquire it, but I suppose checkpoints don't happen often enough for it to matter. You're not going to notice an extra set of insertion lock acquisitions once every 5 minutes, or every half hour, or even every 1 minute if your checkpoints are super-frequent. Also notice that the current code is also quite inefficient in this way. GetLastImportantRecPtr() acquires and releases each lock one at a time, and then we immediately turn around and do WALInsertLockAcquireExclusive(). If the overhead that you're concerned about here were big enough to matter, we could reclaim what we're losing by having a version of GetLastImportantRecPtr() that expects to be called with all locks already held. But when I asked Andres, he thought that it didn't matter, and I bet he's right. > One minor comment: > + else if (XLOG_CHECKPOINT_REDO) > + class = WALINSERT_SPECIAL_CHECKPOINT; > + } > > Isn't the check needs to compare the record type with info? Yeah wow. That's a big mistake. > Your v6-0001* patch looks like an improvement to me even without the > other two patches. Good to know, thanks. > BTW, I would like to mention that there is a slight interaction of > this work with the patch to upgrade/migrate slots [1]. Basically in > [1], to allow slots migration from lower to higher version, we need to > ensure that all the WAL has been consumed by the slots before clean > shutdown. However, during upgrade we can generate few records like > checkpoint which we will ignore for the slot consistency checking as > such records doesn't matter for data consistency after upgrade. We > probably need to add this record to that list. I'll keep an eye on > both the patches so that we don't miss that interaction but mentioned > it here to make others also aware of the same. If your approach requires a code change every time someone adds a new WAL record that doesn't modify table data, you might want to rethink the approach a bit. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Sep 21, 2023 at 1:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I've been brainstorming about this today, trying to figure out some > > ideas to make it work. > > Here are some patches. > > 0001 refactors XLogInsertRecord to unify a couple of separate tests of > isLogSwitch, hopefully making it cleaner and cheaper to add more > special cases. Yeah, this looks improvement as it removes one isLogSwitch from the code. > 0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using > it for anything. > > 0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO > record for any non-shutdown checkpoint, and modifies > XLogInsertRecord() to treat that as a new special case, wherein after > inserting the record the redo pointer is reset while still holding the > WAL insertion locks. Yeah from a design POV, it looks fine to me because the main goal was to insert the XLOG_CHECKPOINT_REDO record and set the "RedoRecPtr" under the same exclusive wal insertion lock and this patch is doing this. As you already mentioned it is an improvement over my first patch because a) it holds the exclusive WAL insersion lock for a very short duration b) not increasing the number of branches in XLogInsertRecord(). Some review 1. I feel we can reduce one more branch to the normal path by increasing one branch in this special case i.e. Your code is if (class == WALINSERT_SPECIAL_SWITCH) { /*execute isSwitch case */ } else if (class == WALINSERT_SPECIAL_CHECKPOINT) { /*execute checkpoint redo case */ } else { /* common case*/ } My suggestion if (xl_rmid == RM_XLOG_ID) { if (class == WALINSERT_SPECIAL_SWITCH) { /*execute isSwitch case */ } else if (class == WALINSERT_SPECIAL_CHECKPOINT) { /*execute checkpoint redo case */ } } else { /* common case*/ } 2. In fact, I feel that we can remove this branch as well right? I mean why do we need to have this separate thing called "class"? we can very much use "info" for that purpose. right? + /* Does this record type require special handling? */ + if (rechdr->xl_rmid == RM_XLOG_ID) + { + if (info == XLOG_SWITCH) + class = WALINSERT_SPECIAL_SWITCH; + else if (XLOG_CHECKPOINT_REDO) + class = WALINSERT_SPECIAL_CHECKPOINT; + } So if we remove this then we do not have this class and the above case would look like if (xl_rmid == RM_XLOG_ID) { if (info == XLOG_SWITCH) { /*execute isSwitch case */ } else if (info == XLOG_CHECKPOINT_REDO) { /*execute checkpoint redo case */ } } else { /* common case*/ } 3. + /* Does this record type require special handling? */ + if (rechdr->xl_rmid == RM_XLOG_ID) + { + if (info == XLOG_SWITCH) + class = WALINSERT_SPECIAL_SWITCH; + else if (XLOG_CHECKPOINT_REDO) + class = WALINSERT_SPECIAL_CHECKPOINT; + } + the above check-in else if is wrong I mean else if (XLOG_CHECKPOINT_REDO) should be else if (info == XLOG_CHECKPOINT_REDO) That's all I have for now. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 21, 2023 at 9:06 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > After the 0003 patch, do we need acquire exclusive lock via > > WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the > > comment "We must block concurrent insertions while examining insert > > state to determine the checkpoint REDO pointer." seems to indicate > > that it is not required. If it is required then we may want to change > > the comments and also acquiring the locks twice will have more cost > > than acquiring it once and write the new WAL record under that lock. > > I think the comment needs updating. I don't think we can do curInsert > = XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks. > Same for Insert->fullPageWrites. > If we can't do those without taking all the locks then it is fine but just wanted to give it a try to see if there is a way to avoid in case of online (non-shutdown) checkpoints. For example, curInsert is used only for the shutdown path, so we don't need to acquire all locks for it in the cases except for the shutdown case. Here, we are reading Insert->fullPageWrites which requires an insertion lock but not all the locks (as per comments in structure XLogCtlInsert). Now, I haven't done detailed analysis for XLogCtl->InsertTimeLineID/XLogCtl->PrevTimeLineID but some places reading InsertTimeLineID have a comment like "Given that we're not in recovery, InsertTimeLineID is set and can't change, so we can read it without a lock." which suggests that some analysis is required whether reading those requires all locks in this code path. OTOH, it won't matter to acquire all locks in this code path for the reasons mentioned by you and it may help in keeping the code simple. So, it is up to you to take the final call on this matter. I am fine with your decision. > > > BTW, I would like to mention that there is a slight interaction of > > this work with the patch to upgrade/migrate slots [1]. Basically in > > [1], to allow slots migration from lower to higher version, we need to > > ensure that all the WAL has been consumed by the slots before clean > > shutdown. However, during upgrade we can generate few records like > > checkpoint which we will ignore for the slot consistency checking as > > such records doesn't matter for data consistency after upgrade. We > > probably need to add this record to that list. I'll keep an eye on > > both the patches so that we don't miss that interaction but mentioned > > it here to make others also aware of the same. > > If your approach requires a code change every time someone adds a new > WAL record that doesn't modify table data, you might want to rethink > the approach a bit. > I understand your hesitation and we have discussed several approaches that do not rely on the WAL record type to determine if the slots have caught up but the other approaches seem to have different other downsides. I know it may not be a good idea to discuss those here but as there was a slight interaction with this work, so I thought to bring it up. To be precise, we need to ensure that we ignore WAL records that got generated during pg_upgrade operation (say during pg_upgrade --check). The approach we initially followed was to check if the slot's confirmed_flush_lsn is equal to the latest checkpoint in pg_controldata (which is the shutdown checkpoint after stopping the server). This approach doesn't work for the use case where the user runs pg_upgrade --check before actually performing the upgrade [1]. This is because during the upgrade check, the server will be stopped/started and update the position of the latest checkpoint, causing the check to fail in the actual upgrade and leading pg_upgrade to believe that the slot has not been caught up. To address the issues in the above approach, we also discussed several alternative approaches[2][3]: a) Adding a new field in pg_controldata to record the last checkpoint that happens in non-upgrade mode, so that we can compare the slot's confirmed_flush_lsn with this value. However, we were not sure if this was a good enough reason to add a new field in controldata field and sprinkle IsBinaryUpgrade check in checkpointer code path. b) Advancing each slot's confirmed_flush_lsn to the latest position if the first upgrade check passes. This way, when performing the actual upgrade, the confirmed_flush_lsn will also pass. However, internally advancing the LSN seems unconventional. c) Introducing a new pg_upgrade option to skip the check for slot catch-up so that if it is already done at the time of pg_upgrade --check, we can avoid rechecking during actual upgrade. Although this might work, the user would need to specify this manually, which is not ideal. d) Document this and suggest users consume the WALs, but this doesn't look acceptable to users. All the above approaches have their downsides, prompting us to consider the WAL scan approach which is to scan the end of the WAL for records that should have been streamed out. This approach was first proposed by Andres[4] and was chosen[5] after considering all other approaches. If we don't like relying on WAL record types then I think the alternative (a) to add a new field in ControlDataFile is worth considering. [1] https://www.postgresql.org/message-id/CAA4eK1LzeZLoTLaAuadmuiggc5mq39oLY6fK95oFKiPBPBf%2BeQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/OS0PR01MB571640E1B58741979A5E586594F7A%40OS0PR01MB5716.jpnprd01.prod.outlook.com [3] https://www.postgresql.org/message-id/TYAPR01MB5866EF7398CB13FFDBF230E7F5F0A%40TYAPR01MB5866.jpnprd01.prod.outlook.com [4] https://www.postgresql.org/message-id/20230725170319.h423jbthfohwgnf7%40awork3.anarazel.de [5] https://www.postgresql.org/message-id/CAA4eK1KqqWayKtRhvyRgkhEHvAUemW_dEqgFn7UOG3D4B6f0ew%40mail.gmail.com -- With Regards, Amit Kapila.
On Wed, Sep 20, 2023 at 4:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > Here are some patches. Here are some updated patches. Following some off-list conversation with Andres, I restructured 0003 to put the common case first and use likely(), and I fixed the brown-paper-bag noted by Amit. I then turned my attention to performance testing. I was happy to find out when I did a bunch of testing on Friday that my branch with these patches applied outperformed master. I was then less happy to find that when I repeated the same tests today, master outperformed the branch. So now I don't know what is going on, but it doesn't seem like my test results are stable enough to draw meaningful conclusions. I was trying to think of a test case where XLogInsertRecord would be exercised as heavily as possible, so I really wanted to generate a lot of WAL while doing as little real work as possible. The best idea that I had was to run pg_create_restore_point() in a loop. Initially, performance was dominated by the log messages which that function emits, so I set log_min_messages='FATAL' to suppress those. To try to further reduce other bottlenecks, I also set max_wal_size='50GB', fsync='off', synchronous_commit='off', and wal_buffers='256MB'. Then I ran this query: select count(*) from (SELECT pg_create_restore_point('banana') from generate_series(1,100000000) g) x; I can't help laughing at the comedy of creating 100 million banana-named restore points with no fsyncs or logging, but here we are. All of my test runs with master, and with the patches, and with just the first patch run in between 34 and 39 seconds. As I say, I can't really separate out which versions are faster and slower with any confidence. Before I fixed the brown-paper bag that Amit pointed out, it was using WALInsertLockAcquireExclusive() instead of WALInsertLockAcquire() for *all* WAL records, and that created an extremely large and obvious increase in the runtime of the tests. So I'm relatively confident that this test case is sensitive to changes in execution time of XLogInsertRecord(), but apparently the changes caused by rearranging the branches are a bit too marginal for them to show up here. One possible conclusion is that the differences here aren't actually big enough to get stressed about, but I don't want to jump to that conclusion without investigating the competing hypothesis that this isn't the right way to test this, and that some better test would show clearer results. Suggestions? -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hi, On 2023-10-02 10:42:37 -0400, Robert Haas wrote: > I was trying to think of a test case where XLogInsertRecord would be > exercised as heavily as possible, so I really wanted to generate a lot > of WAL while doing as little real work as possible. The best idea that > I had was to run pg_create_restore_point() in a loop. What I use for that is pg_logical_emit_message(). Something like SELECT count(*) FROM ( SELECT pg_logical_emit_message(false, '1', 'short'), generate_series(1, 10000) ); run via pgbench does seem to exercise that path nicely. > One possible conclusion is that the differences here aren't actually > big enough to get stressed about, but I don't want to jump to that > conclusion without investigating the competing hypothesis that this > isn't the right way to test this, and that some better test would show > clearer results. Suggestions? I saw some small differences in runtime running pgbench with the above query, with a single client. Comparing profiles showed a surprising degree of difference. That turns out to mostly a consequence of the fact that ReserveXLogInsertLocation() isn't inlined anymore, because there now are two callers of the function in XLogInsertRecord(). Unfortunately, I still see a small performance difference after that. To get the most reproducible numbers, I disable turbo boost, bound postgres to one cpu core, bound pgbench to another core. Over a few runs I quite reproducibly get ~319.323 tps with your patches applied (+ always inline), and ~324.674 with master. If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the performance does improve. But that "only" brings it up to 322.406. Not sure what the rest is. One thing that's notable, but not related to the patch, is that we waste a fair bit of cpu time below XLogInsertRecord() with divisions. I think they're all due to the use of UsableBytesInSegment in XLogBytePosToRecPtr/XLogBytePosToEndRecPtr. The multiplication of XLogSegNoOffsetToRecPtr() also shows. Greetings, Andres Freund
On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the > performance does improve. But that "only" brings it up to 322.406. Not sure > what the rest is. I don't really think this is worth worrying about. A sub-one-percent regression on a highly artificial test case doesn't seem like a big deal. Anybody less determined than you would have been unable to measure that there even is a regression in the first place, and that's basically everyone. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > One thing that's notable, but not related to the patch, is that we waste a > fair bit of cpu time below XLogInsertRecord() with divisions. I think they're > all due to the use of UsableBytesInSegment in > XLogBytePosToRecPtr/XLogBytePosToEndRecPtr. The multiplication of > XLogSegNoOffsetToRecPtr() also shows. Despite what I said in my earlier email, and with a feeling like unto that created by the proximity of the sword of Damocles or some ghostly albatross, I spent some time reflecting on this. Some observations: 1. The reason why we're doing this multiplication and division is to make sure that the code in ReserveXLogInsertLocation which executes while holding insertpos_lck remains as simple and brief as possible. We could eliminate the conversion between usable byte positions and LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had ReserveXLogInsertLocation work out by how much to advance the LSN, but it would have to be worked out while holding insertpos_lck (or some replacement lwlock, perhaps) and that cure seems worse than the disease. Given that, I think we're stuck with converting between usable bye positions and LSNs, and that intrinsically needs some multiplication and division. 2. It seems possible to remove one branch in each of XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then the rest of the calculations can be performed as if every page in the segment had a header of length SizeOfXLogShortPHD, with no need to special-case the first page. However, that doesn't get rid of any multiplication or division, just a branch. 3. Aside from that, there seems to be no simple way to reduce the complexity of an individual calculation, but ReserveXLogInsertLocation does perform 3 rather similar computations, and I believe that we know that it will always be the case that *PrevPtr < *StartPos < *EndPos. Maybe we could have a fast-path for the case where they are all in the same segment. We could take prevbytepos modulo UsableBytesInSegment; call the result prevsegoff. If UsableBytesInSegment - prevsegoff > endbytepos - prevbytepos, then all three pointers are in the same segment, and maybe we could take advantage of that to avoid performing the segment calculations more than once, but still needing to repeat the page calculations. Or, instead or in addition, I think we could by a similar technique check whether all three pointers are on the same page; if so, then *StartPos and *EndPos can be computed from *PrevPtr by just adding the difference between the corresponding byte positions. I'm not really sure whether that would come out cheaper. It's just the only idea that I have. It did also occur to me to wonder whether the apparent delays performing multiplication and division here were really the result of the arithmetic itself being slow or whether they were synchronization-related, SpinLockRelease(&Insert->insertpos_lck) being a memory barrier just before. But I assume you thought about that and concluded that wasn't the issue here. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-10-06 13:44:55 -0400, Robert Haas wrote: > On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > > If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the > > performance does improve. But that "only" brings it up to 322.406. Not sure > > what the rest is. > > I don't really think this is worth worrying about. A sub-one-percent > regression on a highly artificial test case doesn't seem like a big > deal. I agree. I think it's worth measuring and looking at, after all the fix might be trivial (like the case of the unlikely for the earlier if()). But it shouldn't block progress on significant features. I think this "issue" might be measurable in some other, not quite as artifical cases, like INSERT ... SELECT or such. But even then it's going to be tiny. Greetings, Andres Freund
Hi, As noted in my email from a few minutes ago, I agree that optimizing this shouldn't be a requirement for merging the patch. On 2023-10-09 15:58:36 -0400, Robert Haas wrote: > 1. The reason why we're doing this multiplication and division is to > make sure that the code in ReserveXLogInsertLocation which executes > while holding insertpos_lck remains as simple and brief as possible. > We could eliminate the conversion between usable byte positions and > LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had > ReserveXLogInsertLocation work out by how much to advance the LSN, but > it would have to be worked out while holding insertpos_lck (or some > replacement lwlock, perhaps) and that cure seems worse than the > disease. Given that, I think we're stuck with converting between > usable bye positions and LSNs, and that intrinsically needs some > multiplication and division. Right, that's absolutely crucial for scalability. > 2. It seems possible to remove one branch in each of > XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing > whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply > increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then > the rest of the calculations can be performed as if every page in the > segment had a header of length SizeOfXLogShortPHD, with no need to > special-case the first page. However, that doesn't get rid of any > multiplication or division, just a branch. This reminded me about something I've been bugged by for a while: The whole idea of short xlog page headers seems like a completely premature optimization. The page header is a very small amount of the overall data (long: 40/8192 ~= 0.00488, short: 24/8192 ~= 0.00292), compared to the space we waste in many other places, including on a per-record level, it doesn't seem worth the complexity. > 3. Aside from that, there seems to be no simple way to reduce the > complexity of an individual calculation, but ReserveXLogInsertLocation > does perform 3 rather similar computations, and I believe that we know > that it will always be the case that *PrevPtr < *StartPos < *EndPos. > Maybe we could have a fast-path for the case where they are all in the > same segment. We could take prevbytepos modulo UsableBytesInSegment; > call the result prevsegoff. If UsableBytesInSegment - prevsegoff > > endbytepos - prevbytepos, then all three pointers are in the same > segment, and maybe we could take advantage of that to avoid performing > the segment calculations more than once, but still needing to repeat > the page calculations. Or, instead or in addition, I think we could by > a similar technique check whether all three pointers are on the same > page; if so, then *StartPos and *EndPos can be computed from *PrevPtr > by just adding the difference between the corresponding byte > positions. I think we might be able to speed some of this up by pre-compute values so we can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we already insist on power-of-two segment sizes, so instead of needing to divide by a runtime value, we should be able to shift by a runtime value (and the modulo should be a mask). > I'm not really sure whether that would come out cheaper. It's just the > only idea that I have. It did also occur to me to wonder whether the > apparent delays performing multiplication and division here were > really the result of the arithmetic itself being slow or whether they > were synchronization-related, SpinLockRelease(&Insert->insertpos_lck) > being a memory barrier just before. But I assume you thought about > that and concluded that wasn't the issue here. I did verify that they continue to be a bottleneck even after (incorrectly obviously), removing the spinlock. It's also not too surprising, the latency of 64bit divs is just high, particularly on intel from a few years ago (my cascade lake workstation) and IIRC there's just a single execution port for it too, so multiple instructions can't be fully parallelized. https://uops.info/table.html documents a worst case latency of 89 cycles on cascade lake, with the division broken up into 36 uops (reducing what's available to track other in-flight instructions). It's much better on alter lake (9 cycles and 7 uops on the perf cores, 44 cycles and 4 uops on efficiency cores) and on zen 3+ (19 cycles, 2 uops). Greetings, Andres Freund
On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote: > I think we might be able to speed some of this up by pre-compute values so we > can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we > already insist on power-of-two segment sizes, so instead of needing to divide > by a runtime value, we should be able to shift by a runtime value (and the > modulo should be a mask). Huh, is there a general technique for this when dividing by a non-power-of-two? The segment size is a power of two, as is the page size, but UsableBytesIn{Page,Segment} are some random value slightly less than a power of two. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-10-09 18:31:11 -0400, Robert Haas wrote: > On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote: > > I think we might be able to speed some of this up by pre-compute values so we > > can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we > > already insist on power-of-two segment sizes, so instead of needing to divide > > by a runtime value, we should be able to shift by a runtime value (and the > > modulo should be a mask). > > Huh, is there a general technique for this when dividing by a > non-power-of-two? There is, but I was just having a brainfart, forgetting that UsableBytesInPage isn't itself a power of two. The general technique is used by compilers, but doesn't iirc lend itself well to be done at runtime. Greetings, Andres Freund
On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote: > As noted in my email from a few minutes ago, I agree that optimizing this > shouldn't be a requirement for merging the patch. Here's a new patch set. I think I've incorporated the performance fixes that you've suggested so far into this version. I also adjusted a couple of other things: - After further study of a point previously raised by Amit, I adjusted CreateCheckPoint slightly to call WALInsertLockAcquireExclusive significantly later than before. I think that there's no real reason to do it so early and that the current coding is probably just a historical leftover, but it would be good to have some review here. - I added a cross-check that when starting redo from a checkpoint whose redo pointer points to an earlier LSN that the checkpoint itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO record. - I combined what were previously 0002 and 0003 into a single patch, since that's how this would get committed. - I fixed up some comments. - I updated commit messages. Hopefully this is getting close to good enough. > I did verify that they continue to be a bottleneck even after (incorrectly > obviously), removing the spinlock. It's also not too surprising, the latency > of 64bit divs is just high, particularly on intel from a few years ago (my > cascade lake workstation) and IIRC there's just a single execution port for it > too, so multiple instructions can't be fully parallelized. The chipset on my laptop is even older. Coffee Lake, I think. I'm not really sure that there's a whole lot we can reasonably do about the divs unless you like the fastpath idea that I proposed earlier, or unless you want to write a patch to either get rid of short page headers or make long and short page headers the same number of bytes. I have to admit I'm surprised by how visible the division overhead is in this code path -- but I'm also somewhat inclined to view that less as evidence that division is something we should be desperate to eliminate and more as evidence that this code path is quite fast already. In light of your findings, it doesn't seem completely impossible to me that the speed of integer division in this code path could be part of what limits performance for some users, but I'm also not sure it's all that likely or all that serious, because we're deliberating creating test cases that insert unreasonable amounts of WAL without doing any actual work. In the real world, there's going to be a lot more other code running along with this code - probably at least the executor and some heap AM code - and I bet not all of that is as well-optimized as this is already. And it's also quite likely for many users that the real limits on the speed of the workload will be related to I/O or lock contention rather than CPU cost in any form. I'm not saying it's not worth worrying about it. I'm just saying that we should make sure the amount of worrying we do is calibrated to the true importance of the issue. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Tue, Oct 10, 2023 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote: > > I think we might be able to speed some of this up by pre-compute values so we > > can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we > > already insist on power-of-two segment sizes, so instead of needing to divide > > by a runtime value, we should be able to shift by a runtime value (and the > > modulo should be a mask). > > Huh, is there a general technique for this when dividing by a > non-power-of-two? The segment size is a power of two, as is the page > size, but UsableBytesIn{Page,Segment} are some random value slightly > less than a power of two. BTW in case someone is interested, Hacker's Delight (a book that has come up on this list a few times before) devotes a couple of chapters of magical incantations to this topic. Compilers know that magic, and one thought I had when I first saw this discussion was that we could specialise the code for the permissible wal segment sizes. But nuking the variable sized page headers sounds better.
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote: > - I combined what were previously 0002 and 0003 into a single patch, > since that's how this would get committed. > > - I fixed up some comments. > > - I updated commit messages. > > Hopefully this is getting close to good enough. I have looked at 0001, for now.. And it looks OK to me. + * Nonetheless, this case is simpler than the normal cases handled + * above, which must check for changes in doPageWrites and RedoRecPtr. + * Those checks are only needed for records that can contain + * full-pages images, and an XLOG_SWITCH record never does. + Assert(fpw_lsn == InvalidXLogRecPtr); Right, that's the core reason behind the refactoring. The assertion is a good idea. -- Michael
Attachment
On Thu, Oct 12, 2023 at 3:27 AM Michael Paquier <michael@paquier.xyz> wrote: > I have looked at 0001, for now.. And it looks OK to me. Cool. I've committed that one. Thanks for the review. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote: > Here's a new patch set. I think I've incorporated the performance > fixes that you've suggested so far into this version. I also adjusted > a couple of other things: Now looking at 0002, where you should be careful about the code indentation or koel will complain. > - After further study of a point previously raised by Amit, I adjusted > CreateCheckPoint slightly to call WALInsertLockAcquireExclusive > significantly later than before. I think that there's no real reason > to do it so early and that the current coding is probably just a > historical leftover, but it would be good to have some review here. This makes the new code call LocalSetXLogInsertAllowed() and what we set for checkPoint.PrevTimeLineID after taking the insertion locks, which should be OK. > - I added a cross-check that when starting redo from a checkpoint > whose redo pointer points to an earlier LSN that the checkpoint > itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO > record. I've mentioned as well a test in pg_walinspect after one of the checkpoints generated there, but what you do here is enough for the online case. + /* + * XLogInsertRecord will have updated RedoRecPtr, but we need to copy + * that into the record that will be inserted when the checkpoint is + * complete. + */ + checkPoint.redo = RedoRecPtr; For online checkpoints, a very important point is that XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord(). Perhaps that's worth an addition? I was a bit confused first that we do the following for shutdown checkpoints: RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; Then repeat this pattern for non-shutdown checkpoints a few lines down without touching the copy of the redo LSN in XLogCtl->Insert, because of course we don't hold the WAL insert locks in an exclusive fashion here: checkPoint.redo = RedoRecPtr; My point is that this is not only about RedoRecPtr, but also about XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch() says that. -- Michael
Attachment
On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote: > Now looking at 0002, where you should be careful about the code > indentation or koel will complain. Fixed in the attached version. > This makes the new code call LocalSetXLogInsertAllowed() and what we > set for checkPoint.PrevTimeLineID after taking the insertion locks, > which should be OK. Cool. > I've mentioned as well a test in pg_walinspect after one of the > checkpoints generated there, but what you do here is enough for the > online case. I don't quite understand what you're saying here. If you're suggesting a potential improvement, can you be a bit more clear and explicit about what the suggestion is? > + /* > + * XLogInsertRecord will have updated RedoRecPtr, but we need to copy > + * that into the record that will be inserted when the checkpoint is > + * complete. > + */ > + checkPoint.redo = RedoRecPtr; > > For online checkpoints, a very important point is that > XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord(). > Perhaps that's worth an addition? I was a bit confused first that we > do the following for shutdown checkpoints: > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > Then repeat this pattern for non-shutdown checkpoints a few lines down > without touching the copy of the redo LSN in XLogCtl->Insert, because > of course we don't hold the WAL insert locks in an exclusive fashion > here: > checkPoint.redo = RedoRecPtr; > > My point is that this is not only about RedoRecPtr, but also about > XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch() > says that. I have adjusted the comment in CreateCheckPoint to hopefully address this concern. I don't understand what you mean about ReserveXLogSwitch(), though. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Tue, Oct 17, 2023 at 12:45:52PM -0400, Robert Haas wrote: > On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote: >> I've mentioned as well a test in pg_walinspect after one of the >> checkpoints generated there, but what you do here is enough for the >> online case. > > I don't quite understand what you're saying here. If you're suggesting > a potential improvement, can you be a bit more clear and explicit > about what the suggestion is? Suggestion is from here, with a test for pg_walinspect after it runs its online checkpoint (see the full-page case): https://www.postgresql.org/message-id/ZOvf1tu6rfL/B2PW@paquier.xyz +-- Check presence of REDO record. +SELECT redo_lsn FROM pg_control_checkpoint() \gset +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type + FROM pg_get_wal_record_info(:'redo_lsn'); >> Then repeat this pattern for non-shutdown checkpoints a few lines down >> without touching the copy of the redo LSN in XLogCtl->Insert, because >> of course we don't hold the WAL insert locks in an exclusive fashion >> here: >> checkPoint.redo = RedoRecPtr; >> >> My point is that this is not only about RedoRecPtr, but also about >> XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch() >> says that. > > I have adjusted the comment in CreateCheckPoint to hopefully address > this concern. - * XLogInsertRecord will have updated RedoRecPtr, but we need to copy - * that into the record that will be inserted when the checkpoint is - * complete. + * XLogInsertRecord will have updated XLogCtl->Insert.RedoRecPtr in + * shared memory and RedoRecPtr in backend-local memory, but we need + * to copy that into the record that will be inserted when the + * checkpoint is complete. This comment diff between v8 and v9 looks OK to me. Thanks. > I don't understand what you mean about > ReserveXLogSwitch(), though. I am not sure either, looking back at that :p -- Michael
Attachment
On Tue, Oct 17, 2023 at 8:35 PM Michael Paquier <michael@paquier.xyz> wrote: > Suggestion is from here, with a test for pg_walinspect after it runs > its online checkpoint (see the full-page case): > https://www.postgresql.org/message-id/ZOvf1tu6rfL/B2PW@paquier.xyz > > +-- Check presence of REDO record. > +SELECT redo_lsn FROM pg_control_checkpoint() \gset > +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type > + FROM pg_get_wal_record_info(:'redo_lsn'); I added a variant of this test case. Here's v10. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Wed, Oct 18, 2023 at 10:24:50AM -0400, Robert Haas wrote: > I added a variant of this test case. Here's v10. +-- Verify that an XLOG_CHECKPOINT_REDO record begins at precisely the redo LSN +-- of the checkpoint we just performed. +SELECT redo_lsn FROM pg_control_checkpoint() \gset +SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, resource_manager, + record_type FROM pg_get_wal_record_info(:'redo_lsn'); + same_lsn | resource_manager | record_type +----------+------------------+----------------- + t | XLOG | CHECKPOINT_REDO +(1 row) Seems fine to me. Thanks for considering the idea. -- Michael
Attachment
On Thu, Oct 19, 2023 at 1:53 AM Michael Paquier <michael@paquier.xyz> wrote: > Seems fine to me. Thanks for considering the idea. I think it was a good idea! I've committed the patch. -- Robert Haas EDB: http://www.enterprisedb.com