Thread: archive status ".ready" files may be created too early
Hi hackers,

I believe I've uncovered a bug that may cause archive status ".ready"
files to be created too early, which in turn may cause an incorrect
version of the corresponding WAL segment to be archived.

The crux of the issue seems to be that XLogWrite() does not wait for
the entire record to be written to disk before creating the ".ready"
file.  Instead, it just waits for the last page of the segment to be
written before notifying the archiver.  If PostgreSQL crashes before
it is able to write the rest of the record, it will end up reusing the
".ready" segment at the end of crash recovery.  In the meantime, the
archiver process may have already processed the old version of the
segment.

This issue seems to most often manifest as WAL corruption on standby
servers after the primary server has crashed because it ran out of
disk space.

I have attached a proof-of-concept patch (ready_file_fix.patch) that
waits to create any ".ready" files until closer to the end of
XLogWrite().  The patch is incorrect for a few reasons, but I hope it
helps illustrate the problem.  I have also attached another patch
(repro_helper.patch) to be used in conjunction with the following
steps to reproduce the issue:

    initdb .
    pg_ctl -D . -o "-c archive_mode=on -c archive_command='exit 0'" -l log.log start
    pgbench -i -s 1000 postgres
    psql postgres -c "SELECT pg_switch_wal();"

With just repro_helper.patch applied, these commands should produce
both of the following log statements:

    PANIC:  failing at inconvenient time
    LOG:  status file already exists for "000000010000000000000017"

With both patches applied, the commands will only produce the first
PANIC statement.

Another thing I am exploring is whether a crash in between writing the
last page of a segment and creating the ".ready" file could cause the
archiver process to skip processing it altogether.  In the scenario I
mention earlier, the server seems to recreate the ".ready" file since
it rewrites a portion of the segment.  However, if a WAL record fits
perfectly into the last section of the segment, I am not sure whether
the ".ready" file would be created after restart.

I am admittedly in the early stages of working on this problem, but I
thought it would be worth reporting to the community early on in case
anyone has any thoughts on or past experiences with this issue.

Nathan
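For readers following along in the source, the notification happens in
the segment-finishing branch of XLogWrite().  Roughly, in simplified
paraphrase (not the exact code; the same call sites appear in the
patches later in this thread):

    /* inside XLogWrite(), once the last page of a segment is written */
    if (finishing_seg)
    {
        issue_xlog_fsync(openLogFile, openLogSegNo);

        LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */

        /*
         * The segment is marked ".ready" as soon as its last page is on
         * disk, even if a record that begins in this segment continues
         * into the next one and has not been written yet.
         */
        if (XLogArchivingActive())
            XLogArchiveNotifySeg(openLogSegNo);
    }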
Hello.

At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> Hi hackers,
>
> I believe I've uncovered a bug that may cause archive status ".ready"
> files to be created too early, which in turn may cause an incorrect
> version of the corresponding WAL segment to be archived.
>
> The crux of the issue seems to be that XLogWrite() does not wait for
> the entire record to be written to disk before creating the ".ready"
> file.  Instead, it just waits for the last page of the segment to be
> written before notifying the archiver.  If PostgreSQL crashes before
> it is able to write the rest of the record, it will end up reusing the
> ".ready" segment at the end of crash recovery.  In the meantime, the
> archiver process may have already processed the old version of the
> segment.

Yeah, that can happen if the server is restarted after the crash.

> This issue seems to most often manifest as WAL corruption on standby
> servers after the primary server has crashed because it ran out of
> disk space.

In the first place, it's quite bad to set restart_after_crash to on,
or to just restart a crashed master in a replication setup.  The
standby can be inconsistent at the time of the master crash, so it
should be fixed using pg_rewind or should be recreated from a base
backup.

Even without that archiving behavior, a standby may receive WAL bytes
inconsistent with the bytes the same master had just before the
crash.  This is not limited to segment boundaries; it can happen on
every block boundary, and could happen anywhere with more complicated
steps.

What you are calling a "problem" seems to come from allowing the
restart_after_crash behavior.  On the other hand, as recommended in
the documentation, archive_command can refuse to overwrite an
existing segment, but we don't require that.

As a result, the patch doesn't seem to buy anything over setting
things up and operating correctly.

> Another thing I am exploring is whether a crash in between writing the
> last page of a segment and creating the ".ready" file could cause the
> archiver process to skip processing it altogether.  In the scenario I
> mention earlier, the server seems to recreate the ".ready" file since
> it rewrites a portion of the segment.  However, if a WAL record fits
> perfectly into the last section of the segment, I am not sure whether
> the ".ready" file would be created after restart.

Why would that segment need a .ready file after restart, even though
nothing more could be written to the old segment?

> I am admittedly in the early stages of working on this problem, but I
> thought it would be worth reporting to the community early on in case
> anyone has any thoughts on or past experiences with this issue.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 12/12/19, 8:08 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> Hi hackers,
>>
>> I believe I've uncovered a bug that may cause archive status ".ready"
>> files to be created too early, which in turn may cause an incorrect
>> version of the corresponding WAL segment to be archived.
>>
>> The crux of the issue seems to be that XLogWrite() does not wait for
>> the entire record to be written to disk before creating the ".ready"
>> file.  Instead, it just waits for the last page of the segment to be
>> written before notifying the archiver.  If PostgreSQL crashes before
>> it is able to write the rest of the record, it will end up reusing the
>> ".ready" segment at the end of crash recovery.  In the meantime, the
>> archiver process may have already processed the old version of the
>> segment.
>
> Yeah, that can happen if the server is restarted after the crash.
>
>> This issue seems to most often manifest as WAL corruption on standby
>> servers after the primary server has crashed because it ran out of
>> disk space.
>
> In the first place, it's quite bad to set restart_after_crash to on,
> or to just restart a crashed master in a replication setup.  The
> standby can be inconsistent at the time of the master crash, so it
> should be fixed using pg_rewind or should be recreated from a base
> backup.
>
> Even without that archiving behavior, a standby may receive WAL bytes
> inconsistent with the bytes the same master had just before the
> crash.  This is not limited to segment boundaries; it can happen on
> every block boundary, and could happen anywhere with more complicated
> steps.
>
> What you are calling a "problem" seems to come from allowing the
> restart_after_crash behavior.  On the other hand, as recommended in
> the documentation, archive_command can refuse to overwrite an
> existing segment, but we don't require that.
>
> As a result, the patch doesn't seem to buy anything over setting
> things up and operating correctly.

Disregarding the behavior of standby servers for a minute, I think
that what I've described is still a problem for archiving.  If the
segment is archived too early, point-in-time restores that require it
will fail.  If the server refuses to overwrite existing archive files,
the archiver process may fail to process the "good" version of the
segment until someone takes action to fix it.  I think this is
especially troubling for backup utilities like pgBackRest that check
the archive_status directory independently, since it is difficult to
know if the segment is truly ".ready".

I've attached a slightly improved patch to show how this might be
fixed.  I am curious what concerns there are about doing something
like it to prevent this scenario.

>> Another thing I am exploring is whether a crash in between writing the
>> last page of a segment and creating the ".ready" file could cause the
>> archiver process to skip processing it altogether.  In the scenario I
>> mention earlier, the server seems to recreate the ".ready" file since
>> it rewrites a portion of the segment.  However, if a WAL record fits
>> perfectly into the last section of the segment, I am not sure whether
>> the ".ready" file would be created after restart.
>
> Why would that segment need a .ready file after restart, even though
> nothing more could be written to the old segment?

If a ".ready" file is never created for a segment, I don't think it
will be archived.

Nathan
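The attached patch itself is not inline, but the idea described above,
deferring the notification until the end of XLogWrite(), can be
sketched roughly like this (hypothetical local variables; a sketch of
the described approach, not the actual patch):

    static void
    XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
    {
        XLogSegNo   pendingNotifySegNo = 0;     /* hypothetical */
        bool        havePendingNotify = false;  /* hypothetical */

        ...
        if (finishing_seg)
        {
            issue_xlog_fsync(openLogFile, openLogSegNo);

            /* remember the segment instead of notifying immediately */
            havePendingNotify = true;
            pendingNotifySegNo = openLogSegNo;
            ...
        }
        ...

        /* at the very end, after everything requested has been flushed */
        if (havePendingNotify && XLogArchivingActive())
            XLogArchiveNotifySeg(pendingNotifySegNo);
    }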
On 2019-Dec-13, Kyotaro Horiguchi wrote:

> At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> > The crux of the issue seems to be that XLogWrite() does not wait for
> > the entire record to be written to disk before creating the ".ready"
> > file.  Instead, it just waits for the last page of the segment to be
> > written before notifying the archiver.  If PostgreSQL crashes before
> > it is able to write the rest of the record, it will end up reusing the
> > ".ready" segment at the end of crash recovery.  In the meantime, the
> > archiver process may have already processed the old version of the
> > segment.
>
> Yeah, that can happen if the server is restarted after the crash.

... which is the normal way to run things, no?

> > servers after the primary server has crashed because it ran out of
> > disk space.
>
> In the first place, it's quite bad to set restart_after_crash to on,
> or to just restart a crashed master in a replication setup.

Why is it bad?  It's the default value.

> The standby can be inconsistent at the time of the master crash, so
> it should be fixed using pg_rewind or should be recreated from a base
> backup.

Surely the master will just come up and replay its WAL, and there
should be no inconsistency.

You seem to be thinking that a standby is promoted immediately on
crash of the master, but this is not a given.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Ugh.  I must apologize for my last bogus comment.

At Fri, 13 Dec 2019 21:24:36 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 12/12/19, 8:08 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > As a result, the patch doesn't seem to buy anything over setting
> > things up and operating correctly.
>
> Disregarding the behavior of standby servers for a minute, I think

I'm sorry.  A continuation record split across a segment boundary
doesn't seem to harm replication.  Please forget that.

> that what I've described is still a problem for archiving.  If the

Yeah, I think that happens and it seems like a problem.

> segment is archived too early, point-in-time restores that require it
> will fail.  If the server refuses to overwrite existing archive files,
> the archiver process may fail to process the "good" version of the
> segment until someone takes action to fix it.  I think this is
> especially troubling for backup utilities like pgBackRest that check
> the archive_status directory independently, since it is difficult to
> know if the segment is truly ".ready".
>
> I've attached a slightly improved patch to show how this might be
> fixed.  I am curious what concerns there are about doing something
> like it to prevent this scenario.

Basically, I agree with the direction, where the .ready notification
is delayed until all requested WAL bytes are written out.

But I think I found a corner case where the patch doesn't work.  As I
mentioned in another message, if the WAL buffer is full,
AdvanceXLInsertBuffer calls XLogWrite to write out the victim buffer
regardless of whether the last record in the page was the first half
of a continuation record.  XLogWrite can mark the segment as .ready
even with the patch.

Is that correct?  And do you think the corner case is worth fixing?

If so, we could also handle that case by marking the last segment as
.ready when XLogWrite writes the first bytes of the next segment.  (As
a further corner case, it still doesn't work if a continuation record
spans three or more segments... but I don't think we need to consider
that case.)

Thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
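To make the corner case concrete: when no WAL buffer page is free,
AdvanceXLInsertBuffer() has to evict an old page to make room, and that
write path reaches XLogWrite() with no regard for record boundaries.
In rough paraphrase (not the exact source):

    /* inside AdvanceXLInsertBuffer(), when no buffer page can be reused */
    WriteRqst.Write = OldPageRqstPtr;   /* only up to the victim page */
    WriteRqst.Flush = 0;
    XLogWrite(WriteRqst, false);

    /*
     * If the victim page happens to be the last page of a segment,
     * XLogWrite() reaches its segment-finishing branch and notifies the
     * archiver, even though the record filling that page may be only
     * half inserted.
     */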
Thank you, Alvaro, for the comment (on my comment).

At Fri, 13 Dec 2019 18:33:44 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
> On 2019-Dec-13, Kyotaro Horiguchi wrote:
>
> > At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> >
> > > The crux of the issue seems to be that XLogWrite() does not wait for
> > > the entire record to be written to disk before creating the ".ready"
> > > file.  Instead, it just waits for the last page of the segment to be
> > > written before notifying the archiver.  If PostgreSQL crashes before
> > > it is able to write the rest of the record, it will end up reusing the
> > > ".ready" segment at the end of crash recovery.  In the meantime, the
> > > archiver process may have already processed the old version of the
> > > segment.
> >
> > Yeah, that can happen if the server is restarted after the crash.
>
> ... which is the normal way to run things, no?

Yes.  In older versions (< 10), the default value for wal_level was
minimal.  In 10, the default for wal_level was changed to replica.
Still, I'm not sure restart_after_crash can be recommended for
streaming replication...

> Why is it bad?  It's the default value.

I reconsidered it more deeply, and concluded that it doesn't harm
replication as I had thought.  A WAL-buffer overflow may write out a
partial continuation record, and I thought it could be flushed
immediately.  That made me misunderstand that a standby could receive
only the first half of a continuation record.  Actually, that write
doesn't advance LogwrtResult.Flush, so a standby doesn't receive a
record split on a page boundary.  (The case where a crashed master is
used as-is as a new standby might have contaminated my thinking...)

Sorry for the bogus comment.  My conclusion here is that
restart_after_crash doesn't seem to harm a standby immediately.

> > The standby can be inconsistent at the time of the master crash, so
> > it should be fixed using pg_rewind or should be recreated from a base
> > backup.
>
> Surely the master will just come up and replay its WAL, and there
> should be no inconsistency.
>
> You seem to be thinking that a standby is promoted immediately on
> crash of the master, but this is not a given.

Basically no, but my thinking might have been a bit mixed up.

Anyway, returning to the proposal: XLogWrite can be called when the
WAL buffer is full, and that call can go into the last page of a
segment.  The proposed patch doesn't work there, since that XLogWrite
call may not have written the whole continuation record.  But I'm not
sure that corner case is worth fixing...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Dec 13, 2019 at 7:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> I believe I've uncovered a bug that may cause archive status ".ready"
> files to be created too early, which in turn may cause an incorrect
> version of the corresponding WAL segment to be archived.
>
> The crux of the issue seems to be that XLogWrite() does not wait for
> the entire record to be written to disk before creating the ".ready"
> file.  Instead, it just waits for the last page of the segment to be
> written before notifying the archiver.  If PostgreSQL crashes before
> it is able to write the rest of the record, it will end up reusing the
> ".ready" segment at the end of crash recovery.  In the meantime, the
> archiver process may have already processed the old version of the
> segment.

Maybe I'm missing something...  But since XLogWrite() seems to call
issue_xlog_fsync() before XLogArchiveNotifySeg(), ISTM that this
trouble shouldn't happen.  No?

Regards,

--
Fujii Masao
Hello.

At Wed, 18 Dec 2019 14:10:04 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
> On Fri, Dec 13, 2019 at 7:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >
> > Hi hackers,
> >
> > I believe I've uncovered a bug that may cause archive status ".ready"
> > files to be created too early, which in turn may cause an incorrect
> > version of the corresponding WAL segment to be archived.
> >
> > The crux of the issue seems to be that XLogWrite() does not wait for
> > the entire record to be written to disk before creating the ".ready"
> > file.  Instead, it just waits for the last page of the segment to be
> > written before notifying the archiver.  If PostgreSQL crashes before
> > it is able to write the rest of the record, it will end up reusing the
> > ".ready" segment at the end of crash recovery.  In the meantime, the
> > archiver process may have already processed the old version of the
> > segment.
>
> Maybe I'm missing something...  But since XLogWrite() seems to call
> issue_xlog_fsync() before XLogArchiveNotifySeg(), ISTM that this
> trouble shouldn't happen.  No?

The trouble happens after the synced file is archived.  If the last
record in the archived segment was the first half of a continuation
record and the server crashes before writing the last half, crash
recovery stops just before the first half, and a different record can
be written over it.  As a result, the archived version of the segment
becomes bogus.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
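Spelling out that sequence (a schematic restatement of the failure mode
just described, not new analysis):

    1. A record starts near the end of segment N and continues into N+1.
    2. The last page of segment N is written and fsync'd; XLogWrite()
       creates N.ready, and the archiver copies segment N away.
    3. The server crashes before the second half of the record reaches
       disk in segment N+1.
    4. Crash recovery can only replay up to the record *before* the
       incomplete continuation record, and new WAL is inserted from that
       point, overwriting the tail of segment N.
    5. The archived copy of segment N now ends with a record that no
       longer exists in the server's WAL history.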
On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> But I think I found a corner case where the patch doesn't work.  As I
> mentioned in another message, if the WAL buffer is full,
> AdvanceXLInsertBuffer calls XLogWrite to write out the victim buffer
> regardless of whether the last record in the page was the first half
> of a continuation record.  XLogWrite can mark the segment as .ready
> even with the patch.
>
> Is that correct?  And do you think the corner case is worth fixing?

I certainly think it is worth trying to prevent potential WAL archive
corruption in known corner cases.  Your comment highlights a potential
shortcoming of my patch.  AFAICT there is no guarantee that
XLogWrite() is called with a complete WAL record.  Even if that
assumption is true at the moment, it might not hold up over time.

> If so, we could also handle that case by marking the last segment as
> .ready when XLogWrite writes the first bytes of the next segment.  (As
> a further corner case, it still doesn't work if a continuation record
> spans three or more segments... but I don't think we need to consider
> that case.)

I'm working on a new version of the patch that will actually look at
the WAL page metadata to determine when it is safe to mark a segment
as ready for archival.  It seems relatively easy to figure out whether
a page is the last one for the current WAL record.

Nathan
On 12/18/19, 8:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> If so, we could also handle that case by marking the last segment as
>> .ready when XLogWrite writes the first bytes of the next segment.  (As
>> a further corner case, it still doesn't work if a continuation record
>> spans three or more segments... but I don't think we need to consider
>> that case.)
>
> I'm working on a new version of the patch that will actually look at
> the WAL page metadata to determine when it is safe to mark a segment
> as ready for archival.  It seems relatively easy to figure out whether
> a page is the last one for the current WAL record.

I stand corrected.  My attempts to add logic to check the WAL records
added quite a bit more complexity than seemed reasonable to maintain
in this code path.  For example, I didn't anticipate things like
XLOG_SWITCH records.

I am still concerned about the corner case you noted, but I have yet
to find a practical way to handle it.  You suggested waiting until
writing the first bytes of the next segment before marking a segment
as ready, but I'm not sure that fixes this problem either, and I
wonder if it could result in waiting arbitrarily long before creating
a ".ready" file in some cases.  Perhaps I am misunderstanding your
suggestion.

Another thing I noticed is that any changes in this area could impact
archive_timeout.  If we reset the archive_timeout timer when we mark
the segments ready, we could force WAL switches more often.  If we do
not move the timer logic, we could be resetting it before the file is
ready for the archiver.  However, these differences might be subtle
enough to be okay.

Nathan
At Sat, 21 Dec 2019 01:18:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 12/18/19, 8:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> > On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >> If so, we could also handle that case by marking the last segment as
> >> .ready when XLogWrite writes the first bytes of the next segment.  (As
> >> a further corner case, it still doesn't work if a continuation record
> >> spans three or more segments... but I don't think we need to consider
> >> that case.)
> >
> > I'm working on a new version of the patch that will actually look at
> > the WAL page metadata to determine when it is safe to mark a segment
> > as ready for archival.  It seems relatively easy to figure out whether
> > a page is the last one for the current WAL record.
>
> I stand corrected.  My attempts to add logic to check the WAL records
> added quite a bit more complexity than seemed reasonable to maintain
> in this code path.  For example, I didn't anticipate things like
> XLOG_SWITCH records.
>
> I am still concerned about the corner case you noted, but I have yet
> to find a practical way to handle it.  You suggested waiting until
> writing the first bytes of the next segment before marking a segment
> as ready, but I'm not sure that fixes this problem either, and I
> wonder if it could result in waiting arbitrarily long before creating
> a ".ready" file in some cases.  Perhaps I am misunderstanding your
> suggestion.
>
> Another thing I noticed is that any changes in this area could impact
> archive_timeout.  If we reset the archive_timeout timer when we mark
> the segments ready, we could force WAL switches more often.  If we do
> not move the timer logic, we could be resetting it before the file is
> ready for the archiver.  However, these differences might be subtle
> enough to be okay.

You're right.  That doesn't seem to work.  Another thing I had in mind
was giving XLogWrite an additional flag so that AdvanceXLInsertBuffer
can tell it not to mark .ready.  The function is called while it *is*
writing a complete record, so even if AdvanceXLInsertBuffer inhibits
marking .ready, the succeeding bytes come soon and we can mark the old
segment as .ready at that time.

..
+ * If record_write == false, we don't mark the last segment as .ready
+ * if the caller requested to write up to a segment boundary.
..
 static void
-XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)

When XLogWrite is called with record_write = false, we don't mark
.ready and don't advance lastSegSwitchTime/LSN.  The next time
XLogWrite is called with record_write = true, if lastSegSwitchLSN is
behind the latest segment boundary at or before LogwrtResult.Write, we
mark the skipped segments as .ready and update lastSegSwitchTime/LSN.

Does the above make sense?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 12/23/19, 6:09 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> You're right.  That doesn't seem to work.  Another thing I had in mind
> was giving XLogWrite an additional flag so that AdvanceXLInsertBuffer
> can tell it not to mark .ready.  The function is called while it *is*
> writing a complete record, so even if AdvanceXLInsertBuffer inhibits
> marking .ready, the succeeding bytes come soon and we can mark the old
> segment as .ready at that time.
>
> ..
> + * If record_write == false, we don't mark the last segment as .ready
> + * if the caller requested to write up to a segment boundary.
> ..
>  static void
> -XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> +XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)
>
> When XLogWrite is called with record_write = false, we don't mark
> .ready and don't advance lastSegSwitchTime/LSN.  The next time
> XLogWrite is called with record_write = true, if lastSegSwitchLSN is
> behind the latest segment boundary at or before LogwrtResult.Write, we
> mark the skipped segments as .ready and update lastSegSwitchTime/LSN.

Thanks for the suggestion.  I explored this proposal a bit today.  It
looks like there are three places where XLogWrite() is called:
AdvanceXLInsertBuffer(), XLogFlush(), and XLogBackgroundFlush().  IIUC
while XLogFlush() generally seems to be used to write complete records
to disk, this might not be true for XLogBackgroundFlush(), and we're
reasonably sure we cannot make such an assumption for
AdvanceXLInsertBuffer().  Therefore, we would likely only set
record_write to true for XLogFlush() and for certain calls to
XLogBackgroundFlush() (e.g. flushing asynchronous commits).

I'm worried that this approach could be fragile and that we could end
up waiting an arbitrarily long time before marking segments as ready
for archival.  Even if we pay very close attention to the latest
flushed LSN, it seems possible that a non-record_write call to
XLogWrite() advances things such that we avoid ever calling it with
record_write = true.  For example, XLogBackgroundFlush() may have
flushed the completed blocks, which we cannot assume are complete
records.  Then, XLogFlush() would skip calling XLogWrite() if
LogwrtResult.Flush is sufficiently far ahead.  In this scenario, I
don't think we would mark any eligible segments as ".ready" until the
next call to XLogWrite() with record_write = true, which may never
happen.

The next approach I'm going to try is having the callers of
XLogWrite() manage marking segments ready.  That might make it easier
to mitigate some of my concerns above, but I'm not tremendously
optimistic that this approach will fare any better.

Nathan
Sorry for the long delay.  I've finally gotten to a new approach that
I think is promising.  My previous attempts to fix this within
XLogWrite() or within the associated code paths all seemed to miss
corner cases or to add far too much complexity.  The new
proof-of-concept patch that I have attached is much different.
Instead of trying to adjust the ready-for-archive logic in the
XLogWrite() code paths, I propose relocating the ready-for-archive
logic to a separate process.

The v3 patch is a proof-of-concept patch that moves the
ready-for-archive logic to the WAL writer process.  We mark files as
ready-for-archive when the WAL flush pointer has advanced beyond a
known WAL record boundary.  In this patch, I am using the WAL insert
location as the known WAL record boundary.  The main idea is that it
should be safe to archive a segment once we know the last WAL record
for the WAL segment, which may overflow into the following segment,
has been completely written to disk.

There are many things missing from this proof-of-concept patch that
will need to be handled if this approach seems reasonable.  For
example, I haven't looked into any adjustments needed for the
archive_timeout parameter, I haven't added a way to persist the
"latest segment marked ready-for-archive" through crashes, I haven't
tried reducing the frequency of retrieving the WAL locations, and I'm
not sure the WAL writer process is even the right location for this
logic.  However, these remaining problems all seem tractable to me.

I would appreciate your feedback on whether you believe this approach
is worth pursuing.

Nathan
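A rough sketch of the check the v3 patch is describing, with
hypothetical names (the real code is in the attachment, so this is
only an illustration of the idea, not the patch itself):

    static XLogSegNo lastNotifiedSeg = 0;   /* hypothetical saved state */

    /* hypothetical helper, called periodically from the WAL writer */
    static void
    MarkSegmentsReadyForArchive(void)
    {
        /* The current insert position is a known record boundary. */
        XLogRecPtr  boundary = GetXLogInsertRecPtr();
        XLogRecPtr  flushPtr = GetFlushRecPtr();
        XLogSegNo   boundarySeg;

        /*
         * Once everything up to the boundary is flushed, every segment
         * before the one containing the boundary is stable: no record
         * can still spill into them.
         */
        if (flushPtr < boundary)
            return;     /* the last record may not be on disk yet */

        XLByteToSeg(boundary, boundarySeg, wal_segment_size);
        while (lastNotifiedSeg + 1 < boundarySeg)
            XLogArchiveNotifySeg(++lastNotifiedSeg);
    }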
2020-03-26 18:50:24 Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> The v3 patch is a proof-of-concept patch that moves the
> ready-for-archive logic to the WAL writer process.  We mark files as
> ready-for-archive when the WAL flush pointer has advanced beyond a
> known WAL record boundary.

I like such a simple resolution, but I cannot agree with it.

1. This patch gives wal_writer_delay two meanings.  For example, a
   user setting the parameter to a bigger value also gets archived
   files later.

2. Even if we create a new parameter, neither we nor users can
   determine the best value.

3. PostgreSQL guarantees that if a database cluster is stopped
   smartly, the cluster has flushed and archived all WAL records, as
   follows:

   [xlog.c]
    * If archiving is enabled, rotate the last XLOG file so that all the
    * remaining records are archived (postmaster wakes up the archiver
    * process one more time at the end of shutdown). The checkpoint
    * record will go to the next XLOG file and won't be archived (yet).

   Therefore, the idea may require shutdown-time synchronization
   between the WAL writer and the archiver (pgarch).  I cannot agree
   with that, because shutdown processing is inherently complex and
   the synchronization makes it more complicated.  Your idea trades
   promptness of notification for simplicity, but I think the
   synchronization may ruin that merit.

4. I found that the patch can miss a chance to notify.  We have to be
   more careful.  In the following case, the WAL writer notifies after
   a little less than three times wal_writer_delay in the worst case.
   That may not be acceptable depending on the value of
   wal_writer_delay, and if we create a new parameter, we cannot
   explain this behavior to users.

Premise:
- Seg1 has already been notified.
- FlushedPtr is 0/2D00000 (= all WAL records are flushed).

-----
Step 1.  Backend-A updates InsertPtr to 0/2E00000, but does not copy
         its WAL record to the buffer yet.
Step 2.  (sleep) WalWriter memorizes InsertPtr 0/2E00000 in a local
         variable (LocalInsertPtr) and sleeps, because FlushedPtr has
         not passed InsertPtr.
Step 3.  Backend-A copies its WAL record to the buffer.
Step 4.  Backend-B updates InsertPtr to 0/3100000, copies its record
         to the buffer, commits (flushing it by itself), and updates
         FlushedPtr to 0/3100000.
Step 5.  WalWriter detects that FlushedPtr (0/3100000) has passed
         LocalInsertPtr (0/2E00000), but it cannot notify Seg2 even
         though Seg2 should be notified.  This is caused by the fact
         that WalWriter does not know which record crosses the segment
         boundary.  Then, after up to two more sleeps to check that
         FlushedPtr passes InsertPtr again, Seg2 is notified.
Step 6.  (sleep) WalWriter sleeps.
Step 7.  Backend-C inserts a WAL record, flushes, and updates as
         follows:
           InsertPtr  --> 0/3200000
           FlushedPtr --> 0/3200000
Step 8.  Backend-D updates InsertPtr to 0/3300000, but does not copy
         its record to the buffer yet.
Step 9.  (sleep) WalWriter memorizes InsertPtr 0/3300000 in
         LocalInsertPtr and sleeps, because FlushedPtr is 0/3200000.
Step 10. Backend-D copies its record.
Step 11. Someone (Backend-X or WalWriter) flushes and updates
         FlushedPtr to 0/3300000.
Step 12. WalWriter detects that FlushedPtr (0/3300000) has reached
         LocalInsertPtr (0/3300000) and notifies Seg2.
-----

I'm preparing a patch in which a backend inserting a segment-crossing
WAL record leaves its EndRecPtr somewhere, and whoever flushes past
that EndRecPtr checks it and notifies.

Regards
Ryo Matsumura
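In outline, that direction might look like this (hypothetical helper
names; the actual patch follows later in the thread):

    /* inserter side: after reserving [StartPos, EndPos) for a record */
    XLByteToSeg(StartPos, startseg, wal_segment_size);
    XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
    if (startseg != endseg)
        RememberCrossBoundaryEndRecPtr(EndPos);         /* hypothetical */

    /* flusher side: before notifying segment 'seg' as .ready */
    if (FlushedPtr >= CrossBoundaryEndRecPtrFor(seg))   /* hypothetical */
        XLogArchiveNotifySeg(seg);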
On 5/28/20, 11:42 PM, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote:
> I'm preparing a patch in which a backend inserting a segment-crossing
> WAL record leaves its EndRecPtr somewhere, and whoever flushes past
> that EndRecPtr checks it and notifies.

Thank you for sharing your thoughts.  I will be happy to take a look
at your patch.

Nathan
> On 5/28/20, 11:42 PM, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote:
> > I'm preparing a patch in which a backend inserting a segment-crossing
> > WAL record leaves its EndRecPtr somewhere, and whoever flushes past
> > that EndRecPtr checks it and notifies.

I'm sorry for my slow work.  I attach a patch.  I also attach a simple
targeted test for the primary.

1. Description of the primary side

[Basic problem]
A process flushing WAL records doesn't know whether the flushed RecPtr
is the EndRecPtr of a record that crosses a segment boundary, because
only the process that inserted the WAL record knows that, and it never
memorizes the information anywhere.

[Basic concept of the patch in primary]
A process inserting a WAL record that crosses a segment boundary
memorizes its EndRecPtr (I call it the CrossBoundaryEndRecPtr) in a new
structure in XLogCtl.  A flushing process creates .ready (later, I call
this just 'notifying') for the segment preceding the one that contains
the CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
greater than the CrossBoundaryEndRecPtr.

[Detail of implementation in primary]

* Structure of CrossBoundaryEndRecPtrs
Requirements for the structure are as follows:
- The system must memorize multiple CrossBoundaryEndRecPtrs.
- A flushing process must be able to decide whether to notify cheaply,
  using only the flushed RecPtr.
Therefore, I implemented the structure as an array (I call it the
CrossBoundaryEndRecPtr array) of the same length as the xlblocks
array.  Strictly, a length of 'xbuffers/wal_segment_size' is enough,
but I chose 'xbuffers' for simplicity, which lets the flushing process
use XLogRecPtrToBufIdx().
See also the definition of XLogCtl, XLOGShmemSize(), and
XLOGShmemInit() in my patch.

* Action of the inserting process
An inserting process memorizes its CrossBoundaryEndRecPtr in the
CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx()
with its CrossBoundaryEndRecPtr.  If the WAL record crosses many
segments, only the element for the last segment, the one including the
EndRecPtr, is set; the others are not set.
See also CopyXLogRecordToWAL() in my patch.

* Action of the flushing process
The overview has already been written above: a flushing process
notifies the segment preceding the one that contains the
CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
greater than the CrossBoundaryEndRecPtr.

An additional detail is as follows.  The flushing process may notify
many segments if a record crosses many segments, so the process
memorizes the latest notified segment number in
latestArchiveNotifiedSegNo in XLogCtl.  The process notifies from
latestArchiveNotifiedSegNo + 1 up to the flushing segment number - 1.

latestArchiveNotifiedSegNo is set to EndOfLog after the Startup
process exits the replay loop.  A standby sets it at the same timing
(= before promoting).

Mutual exclusion for latestArchiveNotifiedSegNo is not required
because all accesses to it are already inside the WALWriteLock
critical section.

2. Description of the standby side

* Who notifies?
walreceiver also doesn't know whether the flushed RecPtr is the
EndRecPtr of a record that crosses a segment boundary.  On a standby,
only the Startup process knows that, because the information is hidden
in the WAL record itself and only the Startup process reads and builds
WAL records.

* Action of the Startup process
Therefore, I implemented it so that walreceiver never notifies and the
Startup process does it instead.  In detail, when the Startup process
reads one full-length WAL record, it notifies from the segment that
includes the head (ReadRecPtr) of the record up to the segment
preceding the one that includes the EndRecPtr of the record.

Now, we must pay attention to timeline switches.  The last segment of
the previous TimeLineID must be notified before switching.  This case
is handled when an RM_XLOG_ID record is detected.

3. About other notifications for segments
Two other notification paths remain.  They do not need to be fixed.

(1) Notifying for a partial segment
It does not need to be complete, so it's OK to notify without special
consideration.

(2) Re-notifying
Currently, the Checkpointer notifies through XLogArchiveCheckDone().
It is a safety net for failures of notification by a backend or the
WAL writer.  A backend or the WAL writer doesn't retry a failed
notification, but the Checkpointer retries when it removes old
segments.  If it fails to notify, it does not remove the segment,
which makes the Checkpointer retry until the notification succeeds.
Also, in this case, we can just notify without special consideration,
because the Checkpointer guarantees that all WAL records included in
the segment have already been flushed.

Please review and comment.

Regards
Ryo Matsumura
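A condensed view of the bookkeeping described above (field names as
described in the mail; treat this as an illustration, since the real
definitions are only in the attached patch):

    /* in XLogCtlData (sketch) */
    XLogRecPtr *crossBoundaryEndRecPtrs;    /* one slot per WAL buffer page,
                                             * indexed via XLogRecPtrToBufIdx() */
    XLogSegNo   latestArchiveNotifiedSegNo; /* protected by WALWriteLock */

    /* inserter: a record spanning [StartPos, EndPos) crossed a boundary */
    XLogCtl->crossBoundaryEndRecPtrs[XLogRecPtrToBufIdx(EndPos)] = EndPos;

    /*
     * flusher, holding WALWriteLock: after flushing up to flushPtr,
     * notify latestArchiveNotifiedSegNo + 1 .. (segment of flushPtr) - 1,
     * holding back any segment whose cross-boundary record's EndRecPtr
     * has not yet been covered by flushPtr.
     */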
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify segments.  And
the direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in
> 1. Description of the primary side
>
> [Basic problem]
> A process flushing WAL records doesn't know whether the flushed RecPtr
> is the EndRecPtr of a record that crosses a segment boundary, because
> only the process that inserted the WAL record knows that, and it never
> memorizes the information anywhere.
>
> [Basic concept of the patch in primary]
> A process inserting a WAL record that crosses a segment boundary
> memorizes its EndRecPtr (I call it the CrossBoundaryEndRecPtr) in a new
> structure in XLogCtl.  A flushing process creates .ready (later, I call
> this just 'notifying') for the segment preceding the one that contains
> the CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
> greater than the CrossBoundaryEndRecPtr.
...
> See also the definition of XLogCtl, XLOGShmemSize(), and
> XLOGShmemInit() in my patch.

I think we don't need most of that shmem stuff.  XLogWrite is called
after the WAL buffer is filled up to the requested position, so when
it crosses a segment boundary we know that all past
cross-segment-boundary records are stable.  That means all we need to
remember is the position of the latest cross-boundary record.

> * Action of the inserting process
> An inserting process memorizes its CrossBoundaryEndRecPtr in the
> CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx()
> with its CrossBoundaryEndRecPtr.  If the WAL record crosses many
> segments, only the element for the last segment, the one including the
> EndRecPtr, is set; the others are not set.
> See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which is wasteful.
XLogInsertRecord already has a code block executed only at page
boundaries.

> * Action of the flushing process
> The overview has already been written above: a flushing process
> notifies the segment preceding the one that contains the
> CrossBoundaryEndRecPtr only when its flushed RecPtr is equal to or
> greater than the CrossBoundaryEndRecPtr.
>
> An additional detail is as follows.  The flushing process may notify
> many segments if a record crosses many segments, so the process
> memorizes the latest notified segment number in
> latestArchiveNotifiedSegNo in XLogCtl.  The process notifies from
> latestArchiveNotifiedSegNo + 1 up to the flushing segment number - 1.
>
> latestArchiveNotifiedSegNo is set to EndOfLog after the Startup
> process exits the replay loop.  A standby sets it at the same timing
> (= before promoting).
>
> Mutual exclusion for latestArchiveNotifiedSegNo is not required
> because all accesses to it are already inside the WALWriteLock
> critical section.

Looks reasonable.

> 2. Description of the standby side
>
> * Who notifies?
> walreceiver also doesn't know whether the flushed RecPtr is the
> EndRecPtr of a record that crosses a segment boundary.  On a standby,
> only the Startup process knows that, because the information is hidden
> in the WAL record itself and only the Startup process reads and builds
> WAL records.

A standby doesn't write its own WAL records.  Even if the primary sent
an immature record at a segment boundary, the standby would just
promote to a new TLI and start its own history.  Nothing breaks.
However, it could be a problem if a standby that crashed in the
problematic way were started as-is as a primary, but such a scenario
is out of our scope.

Now we can identify the stable portion of the WAL stream.  It's enough
to prevent walsender from sending data that can be overwritten
afterwards.  GetReplicationTargetRecPtr() in the attached patch does
that.

> * Action of the Startup process
> Therefore, I implemented it so that walreceiver never notifies and the
> Startup process does it instead.  In detail, when the Startup process
> reads one full-length WAL record, it notifies from the segment that
> includes the head (ReadRecPtr) of the record up to the segment
> preceding the one that includes the EndRecPtr of the record.

I don't like that archiving on a standby relies on replay progress.
We should avoid that, and fortunately I think we don't need it.

> Now, we must pay attention to timeline switches.  The last segment of
> the previous TimeLineID must be notified before switching.  This case
> is handled when an RM_XLOG_ID record is detected.

That segment is archived later, after being renamed with a ".partial"
suffix.  We don't archive the last incomplete segment of the previous
timeline as-is.

> 3. About other notifications for segments
> Two other notification paths remain.  They do not need to be fixed.
>
> (1) Notifying for a partial segment
> It does not need to be complete, so it's OK to notify without special
> consideration.
>
> (2) Re-notifying
> Currently, the Checkpointer notifies through XLogArchiveCheckDone().
> It is a safety net for failures of notification by a backend or the
> WAL writer.  A backend or the WAL writer doesn't retry a failed
> notification, but the Checkpointer retries when it removes old
> segments.  If it fails to notify, it does not remove the segment,
> which makes the Checkpointer retry until the notification succeeds.
> Also, in this case, we can just notify without special consideration,
> because the Checkpointer guarantees that all WAL records included in
> the segment have already been flushed.

So it can be simplified as attached.

Any thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 6a2475aec9a871def5f194058f62f3f6991777e9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 08:50:54 +0900
Subject: [PATCH] Avoid to archive immature records

For a segment-spanning record, if primary crashes after the first
segment is archived and before finishing the full record, crash
recovery causes the last record of the first segment overwritten and
history diverges between pg_wal and archive. Avoid that corruption by
preventing immature records from being archived. Prevent walsender
from sending immature records for the same reason.
---
 src/backend/access/transam/xlog.c   | 124 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 ++--
 src/include/access/xlog.h           |   1 +
 3 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..b3d49fbc8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,16 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;

+	/* The last segment notified to be archived. Protected by WALWriteLock */
+	XLogSegNo	lastNotifiedSeg;
+
+	/*
+	 * Remember the range of the last segment-spanning record. Protected by
+	 * info_lck
+	 */
+	XLogRecPtr	lastSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;

@@ -1158,6 +1168,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo	startseg;
+		XLogSegNo	endseg;
+
 		SpinLockAcquire(&XLogCtl->info_lck);
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1165,6 +1178,21 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
+
+		/* Remember the range of the record if it spans over segments */
+		XLByteToSeg(StartPos, startseg, wal_segment_size);
+		XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+		if (startseg != endseg)
+		{
+			SpinLockAcquire(&XLogCtl->info_lck);
+			if (XLogCtl->lastSegContRecEnd < StartPos)
+			{
+				XLogCtl->lastSegContRecStart = StartPos;
+				XLogCtl->lastSegContRecEnd = EndPos;
+			}
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 	}

 	/*
@@ -2396,6 +2424,56 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }

+/*
+ * Notify segments that are surely stable.
+ *
+ * If the last segment in pg_wal is complete and ended with a continuation
+ * record, crash recovery results in a diverged history from archive. Don't
+ * archive a segment until the whole record is finished writing.
+ */
+static void
+NotifyStableSegments(XLogSegNo notifySegNo)
+{
+	XLogRecPtr	archiveTargetRecPtr;
+	XLogSegNo	i;
+
+	if (XLogCtl->lastNotifiedSeg < notifySegNo)
+	{
+		XLogRecPtr	lastSegContRecStart;
+		XLogRecPtr	lastSegContRecEnd;
+		XLogSegNo	notifyUpTo = 0;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		lastSegContRecStart = XLogCtl->lastSegContRecStart;
+		lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * Use start position of the last segment-spanning continuation
+		 * record when the record is not flushed completely.
+		 */
+		if (lastSegContRecStart < LogwrtResult.Flush &&
+			LogwrtResult.Flush <= lastSegContRecEnd)
+			archiveTargetRecPtr = lastSegContRecStart;
+		else
+			archiveTargetRecPtr = LogwrtResult.Flush;
+
+		XLByteToSeg(archiveTargetRecPtr, notifyUpTo, wal_segment_size);
+
+		/* back to the last complete segment */
+		notifyUpTo--;
+
+		/* cap by given segment */
+		if (notifyUpTo > notifySegNo)
+			notifyUpTo = notifySegNo;
+
+		for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifyUpTo ; i++)
+			XLogArchiveNotifySeg(i);
+
+		XLogCtl->lastNotifiedSeg = notifyUpTo;
+	}
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2583,7 +2661,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */

 			if (XLogArchivingActive())
-				XLogArchiveNotifySeg(openLogSegNo);
+				NotifyStableSegments(openLogSegNo);

 			XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 			XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2653,6 +2731,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			WalSndWakeupRequest();

 		LogwrtResult.Flush = LogwrtResult.Write;
+
+		/* Now the record is fully written, try to notify stable segments */
+		if (XLogArchivingActive())
+			NotifyStableSegments(openLogSegNo - 1);
 	}

 	/*
@@ -7703,6 +7785,18 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;

+	/*
+	 * We have archived up to the previous segment of EndOfLog so far.
+	 * Initialize lastNotifiedSeg if needed.
+	 */
+	if (XLogArchivingActive())
+	{
+		XLogSegNo	endLogSegNo;
+
+		XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+	}
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -8426,6 +8520,34 @@ GetFlushRecPtr(void)
 	return LogwrtResult.Flush;
 }

+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that can be
+ * replicated. WAL records up to this position won't be overwritten even after
+ * a crash of primary.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+	XLogRecPtr	lastSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	lastSegContRecStart = XLogCtl->lastSegContRecStart;
+	lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Use start position of the last segment-spanning continuation record
+	 * when the record is not flushed completely.
+	 */
+	if (lastSegContRecStart < LogwrtResult.Flush &&
+		LogwrtResult.Flush <= lastSegContRecEnd)
+		return lastSegContRecStart;
+
+	return LogwrtResult.Flush;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..c5682a8836 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2630,14 +2630,14 @@ XLogSendPhysical(void)
 		/*
 		 * Streaming the current timeline on a master.
 		 *
-		 * Attempt to send all data that's already been written out and
-		 * fsync'd to disk.  We cannot go further than what's been written out
-		 * given the current implementation of WALRead().  And in any case
-		 * it's unsafe to send WAL that is not securely down to disk on the
-		 * master: if the master subsequently crashes and restarts, standbys
-		 * must not have applied any WAL that got lost on the master.
+		 * Attempt to send all data that can be replicated.  We cannot go
+		 * further than what's been written out given the current
+		 * implementation of WALRead().  And in any case it's unsafe to send
+		 * WAL that is not securely down to disk on the master: if the master
+		 * subsequently crashes and restarts, standbys must not have applied
+		 * any WAL that got lost on the master.
 		 */
-		SendRqstPtr = GetFlushRecPtr();
+		SendRqstPtr = GetReplicationTargetRecPtr();
 	}

 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..ef21418093 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -335,6 +335,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);

 extern void RemovePromoteSignalFiles(void);
--
2.18.4
Hello, Horiguchi-san.

Thank you for your comment and patch.

At Thursday, June 25, 2020 3:36 PM (JST), "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> I think we don't need most of that shmem stuff.  XLogWrite is called

I also wanted to avoid more shmem stuff, but the other ideas need more
locking that makes inserters and writers exclude each other.

> after the WAL buffer is filled up to the requested position, so when
> it crosses a segment boundary we know that all past
> cross-segment-boundary records are stable.  That means all we need to
> remember is the position of the latest cross-boundary record.

I could not agree.  In the following case, it may not work well:
- record-A and record-B (record-B is the newer one) have been copied, and
- lastSegContRecStart/End point to record-B's positions, and
- FlushPtr has advanced to the middle of record-A.

In the above case, the writer should notify only segments before
record-A, but it notifies ones before record-B.  If the writer
notifies only when it has flushed the latest record completely, it
works well, but then the writer may never be able to notify any
segment when records crossing segment boundaries are inserted
continuously.  So I think we must remember the EndRecPtr of every such
cross-segment-boundary record in a buffer.

> If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> every time a record is written, most of which is wasteful.
> XLogInsertRecord already has a code block executed only at page
> boundaries.

I agree.
XLogMarkEndRecPtrIfNeeded() has been moved into that code block,
before updating LogwrtRqst.Write, to avoid racing with the writer.

> Now we can identify the stable portion of the WAL stream.  It's enough
> to prevent walsender from sending data that can be overwritten
> afterwards.  GetReplicationTargetRecPtr() in the attached patch does
> that.

I hadn't noticed that.  I basically agree, but it is based on
lastSegContRecStart/End, so first of all we have to agree on what
should be remembered.

Regards
Ryo Matsumura
Hello, Matsumura-san.

At Mon, 6 Jul 2020 04:02:23 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in
> Hello, Horiguchi-san.
>
> Thank you for your comment and patch.
>
> At Thursday, June 25, 2020 3:36 PM (JST), "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > I think we don't need most of that shmem stuff.  XLogWrite is called
>
> I also wanted to avoid more shmem stuff, but the other ideas need more
> locking that makes inserters and writers exclude each other.
>
> > after the WAL buffer is filled up to the requested position, so when
> > it crosses a segment boundary we know that all past
> > cross-segment-boundary records are stable.  That means all we need to
> > remember is the position of the latest cross-boundary record.
>
> I could not agree.  In the following case, it may not work well:
> - record-A and record-B (record-B is the newer one) have been copied, and
> - lastSegContRecStart/End point to record-B's positions, and
> - FlushPtr has advanced to the middle of record-A.

IIUC, that means record-B is a cross-segment-border record and we have
flushed beyond record-B.  In that case, crash recovery afterwards can
read the complete record-B and will finish recovery *after* record-B.
That's what we need here.

> In the above case, the writer should notify only segments before
> record-A, but it notifies ones before record-B.  If the writer
> notifies

If you mean that NotifyStableSegments notifies up to the previous
segment of the segment where record-A is placed, that's wrong.  The
issue here is that crash recovery sees an incomplete record at a
segment border.  So it is sufficient that crash recovery can read the
last record by looking at pg_wal.

> only when it has flushed the latest record completely, it works well,

It confirms that "lastSegContRecEnd < LogwrtResult.Flush", which means
the last record (B) is completely flushed out, doesn't it?  So it
works well.

> but then the writer may never be able to notify any segment when
> records crossing segment boundaries are inserted continuously.

No.  As I mentioned in the previous mail, if we see a
cross-segment-boundary record, the previous cross-segment-boundary
record has been flushed completely, and the segment containing the
first half of the previous cross-segment-boundary record has already
been flushed.  I didn't do that, but we could put an assertion in
XLogInsertRecord like this:

+	/* Remember the range of the record if it spans over segments */
+	XLByteToSeg(StartPos, startseg, wal_segment_size);
+	XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+	if (startseg != endseg)
+	{
++		/* we shouldn't have a record spanning over three or more segments */
++		Assert(endseg == startseg + 1);
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->lastSegContRecEnd < StartPos)
+		{
+			XLogCtl->lastSegContRecStart = StartPos;
+			XLogCtl->lastSegContRecEnd = EndPos;

> So I think we must remember the EndRecPtr of every such
> cross-segment-boundary record in a buffer.
>
> > If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> > every time a record is written, most of which is wasteful.
> > XLogInsertRecord already has a code block executed only at page
> > boundaries.
>
> I agree.
> XLogMarkEndRecPtrIfNeeded() has been moved into that code block,
> before updating LogwrtRqst.Write, to avoid racing with the writer.
>
> > Now we can identify the stable portion of the WAL stream.  It's
> > enough to prevent walsender from sending data that can be
> > overwritten afterwards.  GetReplicationTargetRecPtr() in the
> > attached patch does that.
>
> I hadn't noticed that.
> I basically agree, but it is based on lastSegContRecStart/End, so
> first of all we have to agree on what should be remembered.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, Horiguchi-san

At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > > after the WAL buffer is filled up to the requested position, so when
> > > it crosses a segment boundary we know that all past
> > > cross-segment-boundary records are stable.  That means all we need
> > > to remember is the position of the latest cross-boundary record.
> >
> > I could not agree.  In the following case, it may not work well:
> > - record-A and record-B (record-B is the newer one) have been copied, and
> > - lastSegContRecStart/End point to record-B's positions, and
> > - FlushPtr has advanced to the middle of record-A.
>
> IIUC, that means record-B is a cross-segment-border record and we have
> flushed beyond record-B.  In that case, crash recovery afterwards can
> read the complete record-B and will finish recovery *after* record-B.
> That's what we need here.

I'm sorry I didn't explain enough.

Record-A and Record-B are cross-segment-border records.
Record-A spans segments X and X+1.
Record-B spans segments X+2 and X+3.
If both records have been inserted into the WAL buffer,
lastSegContRecStart/End point to Record-B.
If a writer flushes up to the middle of segment X+1,
NotifyStableSegments() allows the writer to notify segment X.
Is my understanding correct?

Regards
Ryo Matsumura
Hello.

# Sorry, I wrongly thought that I had already replied to this thread..

At Tue, 7 Jul 2020 09:02:56 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in
> At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > > > after the WAL buffer is filled up to the requested position, so when
> > > > it crosses a segment boundary we know that all past
> > > > cross-segment-boundary records are stable.  That means all we need
> > > > to remember is the position of the latest cross-boundary record.
> > >
> > > I could not agree.  In the following case, it may not work well:
> > > - record-A and record-B (record-B is the newer one) have been copied, and
> > > - lastSegContRecStart/End point to record-B's positions, and
> > > - FlushPtr has advanced to the middle of record-A.
> >
> > IIUC, that means record-B is a cross-segment-border record and we have
> > flushed beyond record-B.  In that case, crash recovery afterwards can
> > read the complete record-B and will finish recovery *after* record-B.
> > That's what we need here.
>
> I'm sorry I didn't explain enough.
>
> Record-A and Record-B are cross-segment-border records.
> Record-A spans segments X and X+1.
> Record-B spans segments X+2 and X+3.

Ok.

> If both records have been inserted into the WAL buffer,
> lastSegContRecStart/End point to Record-B.
> If a writer flushes up to the middle of segment X+1,
> NotifyStableSegments() allows the writer to notify segment X.
> Is my understanding correct?

I think that cannot happen, since segment X must have been flushed by
the time Record-A is completely flushed out.  When we write to the
next segment, we have already flushed and closed the whole previous
segment.  If that were not the case, we would be archiving segment
files that are not fully flushed, and would get broken archive files.

Am I missing something here?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, > At Mon, 13 Jul 2020 01:57:36 +0000, "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in > Am I missing something here? I will explain in more detail (*). Record-A and Record-B are cross segment-border records. Record-A spans segment X and X+1. Record-B spans segment X+2 and X+3. If both records have been inserted to WAL buffer, lastSegContRecStart/End points to Record-B * If a writer flushes segment X and a part of X+1 but record-A is not flushed completely, NotifyStableSegments() allows the writer to notify segment-X. Then, Record-A may be invalidated by crash recovery and overwritten by a new WAL record. Segment-X is then not the same as the archived one. Regards Ryo Matsumura
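To see the failure concretely, here is a tiny self-contained C program (an editor's illustration only, not PostgreSQL code; SEG and the hard-coded LSNs are assumptions, with segment X taken to start at offset 0) that plays this scenario through the single-pair check:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define SEG ((XLogRecPtr) 16 * 1024 * 1024)   /* stand-in for wal_segment_size */

int
main(void)
{
    /* Record-A crosses from segment X into X+1; Record-B from X+2 into X+3. */
    XLogRecPtr recA_end = 1 * SEG + 200;
    XLogRecPtr contStart = 3 * SEG - 100;   /* only Record-B is remembered */
    XLogRecPtr contEnd = 3 * SEG + 200;

    /* The writer has flushed only to the middle of segment X+1. */
    XLogRecPtr flush = 1 * SEG + 50;

    bool held_off = (flush >= contStart && flush < contEnd);

    printf("single-pair check says segment X is %s to notify\n",
           held_off ? "unsafe" : "safe");
    printf("but Record-A is %s flushed\n",
           flush >= recA_end ? "fully" : "only partially");
    return 0;
}

The check reports segment X as safe because the flush position lies outside the remembered Record-B range, even though Record-A is still only partially flushed -- which is exactly the hole being pointed out here.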
On Wed, Jul 22, 2020 at 02:53:49AM +0000, matsumura.ryo@fujitsu.com wrote: > Then, Record-A may be invalidated by crash recovery and overwritten by a new WAL record. > Segment-X is then not the same as the archived one. Please note that the latest patch fails to apply per the CF bot, so a rebase would be in order to have at least some automated tests for the last patch. -- Michael
Attachment
On 07/07/2020 12:02, matsumura.ryo@fujitsu.com wrote: > At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in >>>> after WAL buffer is filled up to the requested position. So when it >>>> crosses segment boundary we know the all past cross segment-boundary >>>> records are stable. That means all we need to remember is only the >>>> position of the latest cross-boundary record. >>> >>> I could not agree. In the following case, it may not work well. >>> - record-A and record-B (record-B is a newer one) is copied, and >>> - lastSegContRecStart/End points to record-B's, and >>> - FlushPtr is proceeded to in the middle of record-A. >> >> IIUC, that means record-B is a cross segment-border record and we have >> flushed beyond record-B. In that case crash recovery afterwards >> can read the complete record-B and will finish recovery *after* the >> record-B. That's what we need here. > > I'm sorry I didn't explain enough. > > Record-A and Record-B are cross segment-border records. > Record-A spans segment X and X+1 > Record-B spans segment X+2 and X+3. > If both records have been inserted to WAL buffer, lastSegContRecStart/End points to Record-B. > If a writer flushes up to the middle of segment-X+1, NotifyStableSegments() allows the writer to notify segment-X. > Is my understanding correct? I think this little ASCII drawing illustrates the above scenario: AAAAA F BBBBB |---------|---------|---------| seg X seg X+1 seg X+2 AAAAA and BBBBB are Record-A and Record-B. F is the current flush pointer. In this case, it would be OK to notify segment X, as long as F is greater than the end of record A. And if I'm reading Kyotaro's patch correctly, that's what would happen with the patch. The patch seems correct to me. I'm a bit sad that we have to track yet another WAL position (two, actually) to fix this, but I don't see a better way. I wonder if we should arrange things so that XLogwrtResult.Flush never points in the middle of a record? I'm not totally convinced that all the current callers of GetFlushRecPtr() are OK with a middle-of-WAL record value. Could we get into similar trouble if a standby replicates half of a cross-segment record to a cascaded standby, and the cascaded standby has WAL archiving enabled? - Heikki
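One way to picture the last point -- never handing out a flush position that sits inside a record -- is to clamp whatever LSN is exposed for replication back to the start of a partially-flushed cross-segment record. A minimal C sketch (an illustration only; replication_target(), the contRecStart/contRecEnd pair, and the flat 64-bit LSNs are simplifications of what the patch's GetReplicationTargetRecPtr() does):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* start/end of the remembered cross-segment record; 0 means "none" */
static XLogRecPtr contRecStart = 0;
static XLogRecPtr contRecEnd = 0;

/* Clamp the position handed to walsender so a standby never receives
 * the first half of a record whose second half may still be rewritten. */
static XLogRecPtr
replication_target(XLogRecPtr flush)
{
    if (contRecStart != 0 && flush > contRecStart && flush < contRecEnd)
        return contRecStart;    /* stop just before the partial record */
    return flush;
}

int
main(void)
{
    contRecStart = 1000;
    contRecEnd = 1500;

    printf("flush 1200 -> send up to %llu\n",
           (unsigned long long) replication_target(1200));
    printf("flush 1500 -> send up to %llu\n",
           (unsigned long long) replication_target(1500));
    return 0;
}

With the record remembered as 1000..1500, a flush position of 1200 is clamped back to 1000, while 1500 (the record fully flushed) passes through unchanged.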
Thanks for visiting this thread. At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 07/07/2020 12:02, matsumura.ryo@fujitsu.com wrote: > > At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi > > <horikyota(dot)ntt(at)gmail(dot)com>" wrote in > >>>> after WAL buffer is filled up to the requested position. So when it > >>>> crosses segment boundary we know the all past cross segment-boundary > >>>> records are stable. That means all we need to remember is only the > >>>> position of the latest cross-boundary record. > >>> > >>> I could not agree. In the following case, it may not work well. > >>> - record-A and record-B (record-B is a newer one) is copied, and > >>> - lastSegContRecStart/End points to record-B's, and > >>> - FlushPtr is proceeded to in the middle of record-A. > >> > >> IIUC, that means record-B is a cross segment-border record and we have > >> flushed beyond record-B. In that case crash recovery afterwards > >> can read the complete record-B and will finish recovery *after* the > >> record-B. That's what we need here. > > I'm sorry I didn't explain enough. > > Record-A and Record-B are cross segment-border records. > > Record-A spans segment X and X+1 > > Record-B spans segment X+2 and X+3. > > If both records have been inserted to WAL buffer, > > lastSegContRecStart/End points to Record-B. > > If a writer flushes up to the middle of segment-X+1, > > NotifyStableSegments() allows the writer to notify segment-X. > > Is my understanding correct? > > I think this little ASCII drawing illustrates the above scenario: > > AAAAA F BBBBB > |---------|---------|---------| > seg X seg X+1 seg X+2 > > AAAAA and BBBBB are Record-A and Record-B. F is the current flush > pointer. I modified the figure a bit for the explanation below. F0 F1 AAAAA F BBBBB |---------|---------|---------| seg X seg X+1 seg X+2 Matsumura-san has a concern about the case where there are two (or more) partially-flushed segment-spanning records at the same time. This patch remembers only the last cross-segment record. If we were going to flush up to F0 after Record-B had been written, we would fail to hold off archiving seg-X. This patch is based on the assumption that that case cannot happen, because we don't leave a pending page at the time of segment switch and no record spans over three or more segments. > In this case, it would be OK to notify segment X, as long as F is > greater than the end of record A. And if I'm reading Kyotaro's patch > correctly, that's what would happen with the patch. > > The patch seems correct to me. I'm a bit sad that we have to track yet > another WAL position (two, actually) to fix this, but I don't see a > better way. Does "the two" mean Record-A and B? Is it needed even with the assumption above? > I wonder if we should arrange things so that XLogwrtResult.Flush never > points in the middle of a record? I'm not totally convinced that all That happens for a good percentage of page boundaries. And a record can span over three or more pages. Do we need to avoid all such cases? I did that only for the cross-segment case. > the current callers of GetFlushRecPtr() are OK with a middle-of-WAL > record value. Could we get into similar trouble if a standby > replicates half of a cross-segment record to a cascaded standby, and > the cascaded standby has WAL archiving enabled? The patch includes a fix for the primary->standby case. But I'm not sure we can do that in the cascaded case. 
A standby is not aware of the structure of a WAL blob and has no idea up to where to send the received blobs. However, if we can rely on the behavior of CopyData that we always receive a blob as a whole, sent from the sender at once, the cascaded standbys are free from the trouble (as long as the cascaded standby doesn't crash just before writing the last half of a record into pg_wal and after archiving the last full segment, which seems unlikely). regards. -- Kyotaro Horiguchi NTT Open Source Software Center From e8320f7486c057ccb97be56ce1859296b8072256 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 13 Oct 2020 20:41:33 +0900 Subject: [PATCH v2] Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. --- src/backend/access/transam/xlog.c | 160 +++++++++++++++++++++++++++- src/backend/replication/walsender.c | 14 +-- src/include/access/xlog.h | 1 + 3 files changed, 163 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 52a67b1170..8fd0fdb598 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -723,6 +723,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the range of the last segment-spanning record. Protected by + * info_lck + */ + XLogRecPtr lastSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1167,6 +1180,21 @@ XLogInsertRecord(XLogRecData *rdata, /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); + + /* Remember the range of the record if it spans over segments */ + XLByteToSeg(StartPos, startseg, wal_segment_size); + XLByteToPrevSeg(EndPos, endseg, wal_segment_size); + + if (startseg != endseg) + { + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->lastSegContRecEnd < StartPos) + { + XLogCtl->lastSegContRecStart = StartPos; + XLogCtl->lastSegContRecEnd = EndPos; + } + SpinLockRelease(&XLogCtl->info_lck); + } } /* @@ -2399,6 +2427,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno) return false; } +/* + * Returns last notified segment. + */ +static XLogSegNo +GetLastNotifiedSegment(void) +{ + XLogSegNo last_notified; + + SpinLockAcquire(&XLogCtl->info_lck); + last_notified = XLogCtl->lastNotifiedSeg; + SpinLockRelease(&XLogCtl->info_lck); + + return last_notified; +} + +/* + * Notify segments that are not yet notified. + */ +static void +NotifySegmentsUpTo(XLogSegNo notifySegNo) +{ + XLogSegNo last_notified = GetLastNotifiedSegment(); + XLogSegNo i; + + if (notifySegNo <= last_notified) + return; + + for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++) + XLogArchiveNotifySeg(i); + + /* Don't go back in the case someone else has made it go further. 
*/ + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->lastNotifiedSeg < notifySegNo) + XLogCtl->lastNotifiedSeg = notifySegNo; + SpinLockRelease(&XLogCtl->info_lck); +} + /* * Write and/or fsync the log at least as far as WriteRqst indicates. * @@ -2422,6 +2487,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) int npages; int startidx; uint32 startoffset; + bool extended = false; /* We should always be inside a critical section here */ Assert(CritSectionCount > 0); @@ -2578,15 +2644,48 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) */ if (finishing_seg) { + XLogRecPtr lastSegContRecStart; + XLogRecPtr lastSegContRecEnd; + issue_xlog_fsync(openLogFile, openLogSegNo); /* signal that we need to wakeup walsenders later */ WalSndWakeupRequest(); - LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + SpinLockAcquire(&XLogCtl->info_lck); + lastSegContRecStart = XLogCtl->lastSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); - if (XLogArchivingActive()) - XLogArchiveNotifySeg(openLogSegNo); + /* + * Don't expose the flush location in the middle of a cross-segment WAL + * record. + */ + if (LogwrtResult.Write < lastSegContRecStart || + lastSegContRecEnd <= LogwrtResult.Write) + { + LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + + if (XLogArchivingActive()) + NotifySegmentsUpTo(openLogSegNo); + + extended = false; + } + else + { + /* + * Sometimes we are told to flush not up to the end of a + * record but to a WAL buffer page boundary. We advance the + * request LSN to the end of the record in the case the + * record at the requested LSN continues to the next segment. + */ + if (lastSegContRecStart <= WriteRqst.Write && + WriteRqst.Write <= lastSegContRecEnd) + WriteRqst.Write = lastSegContRecEnd; + + /* Continue to the next iteration ignoring flexible flag */ + extended = true; + } XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL); XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush; @@ -2615,11 +2714,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) } curridx = NextBufIdx(curridx); - /* If flexible, break out of loop as soon as we wrote something */ - if (flexible && npages == 0) + /* + * If flexible, break out of loop as soon as we wrote something. + * However, we don't leave the loop if the last record in the just + * finished segment needs to be finished. + */ + if (flexible && !extended && npages == 0) break; } + /* + * We have extended the write request to the next segment if the record at + * the initial WriteRqst.Write continues to the next segment. In that case + * we need to notify the last segment here. + */ + if (extended) + NotifySegmentsUpTo(openLogSegNo - 1); + Assert(npages == 0); /* @@ -7710,6 +7821,18 @@ StartupXLOG(void) XLogCtl->LogwrtRqst.Write = EndOfLog; XLogCtl->LogwrtRqst.Flush = EndOfLog; + /* + * We have archived up to the previous segment of EndOfLog so far. + * Initialize lastNotifiedSeg if needed. + */ + if (XLogArchivingActive()) + { + XLogSegNo endLogSegNo; + + XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size); + XLogCtl->lastNotifiedSeg = endLogSegNo - 1; + } + /* * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE * record before resource manager writes cleanup WAL records or checkpoint @@ -8434,6 +8557,33 @@ GetFlushRecPtr(void) return LogwrtResult.Flush; } +/* + * GetReplicationTargetRecPtr -- Returns the latest position that is safe to + * replicate. 
+ */ +XLogRecPtr +GetReplicationTargetRecPtr(void) +{ + XLogRecPtr lastSegContRecStart; + XLogRecPtr lastSegContRecEnd; + + SpinLockAcquire(&XLogCtl->info_lck); + LogwrtResult = XLogCtl->LogwrtResult; + lastSegContRecStart = XLogCtl->lastSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * Use start position of the last segment-spanning continuation record + * when the record is not flushed completely. + */ + if (lastSegContRecStart < LogwrtResult.Flush && + LogwrtResult.Flush <= lastSegContRecEnd) + return lastSegContRecStart; + + return LogwrtResult.Flush; +} + /* * GetLastImportantRecPtr -- Returns the LSN of the last important record * inserted. All records not explicitly marked as unimportant are considered diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 7c9d1b67df..e9eaca5ffa 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2642,14 +2642,14 @@ XLogSendPhysical(void) /* * Streaming the current timeline on a primary. * - * Attempt to send all data that's already been written out and - * fsync'd to disk. We cannot go further than what's been written out - * given the current implementation of WALRead(). And in any case - * it's unsafe to send WAL that is not securely down to disk on the - * primary: if the primary subsequently crashes and restarts, standbys - * must not have applied any WAL that got lost on the primary. + * Attempt to send all data that can be replicated. We cannot go + * further than what's been written out given the current + * implementation of WALRead(). And in any case it's unsafe to send + * WAL that is not securely down to disk on the primary: if the primary + * subsequently crashes and restarts, standbys must not have applied + * any WAL that got lost on the primary. */ - SendRqstPtr = GetFlushRecPtr(); + SendRqstPtr = GetReplicationTargetRecPtr(); } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 221af87e71..94876f628c 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p) extern XLogRecPtr GetRedoRecPtr(void); extern XLogRecPtr GetInsertRecPtr(void); extern XLogRecPtr GetFlushRecPtr(void); +extern XLogRecPtr GetReplicationTargetRecPtr(void); extern XLogRecPtr GetLastImportantRecPtr(void); extern void RemovePromoteSignalFiles(void); -- 2.18.4
On 14.10.2020 03:06, Kyotaro Horiguchi wrote: > The patch includes a fix for the primary->standby case. But I'm not sure > we can do that in the cascaded case. A standby is not aware of the > structure of a WAL blob and has no idea up to where to send the > received blobs. However, if we can rely on the behavior of CopyData > that we always receive a blob as a whole, sent from the sender at once, > the cascaded standbys are free from the trouble (as long as the > cascaded standby doesn't crash just before writing the last half of a > record into pg_wal and after archiving the last full segment, which > seems unlikely). > > regards. > Status update for a commitfest entry. This entry was "Waiting on author" during this CF. As I see, the latest message contains a new version of the patch. Does it need more work? Are you going to continue working on it? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Apologies for the long delay. I've spent a good amount of time thinking about this bug and trying out a few different approaches for fixing it. I've attached a work-in-progress patch for my latest attempt. On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > F0 F1 > AAAAA F BBBBB > |---------|---------|---------| > seg X seg X+1 seg X+2 > > Matsumura-san has a concern about the case where there are two (or > more) partially-flushed segment-spanning records at the same time. > > This patch remembers only the last cross-segment record. If we were > going to flush up to F0 after Record-B had been written, we would fail > to hold off archiving seg-X. This patch is based on the assumption that > that case cannot happen, because we don't leave a pending page at the > time of segment switch and no record spans over three or more > segments. I wonder if these are safe assumptions to make. For your example, if we've written record B to the WAL buffers, but neither record A nor B has been written to disk or flushed, aren't we still in trouble? Also, is there actually any limit on WAL record length that means that it is impossible for a record to span over three or more segments? Perhaps these assumptions are true, but it doesn't seem obvious to me that they are, and they might be pretty fragile. The attached patch doesn't make use of these assumptions. Instead, we track the positions of the records that cross segment boundaries in a small hash map, and we use that to determine when it is safe to mark a segment as ready for archival. I think this approach resembles Matsumura-san's patch from June. As before, I'm not handling replication, archive_timeout, and persisting latest-marked-ready through crashes yet. For persisting the latest-marked-ready segment through crashes, I was thinking of using a new file that stores the segment number. Nathan
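As a concrete picture of this bookkeeping (an editor-made sketch, not taken from the attached patch), here is a self-contained C program that tracks the end LSN of each record crossing out of a segment -- a tiny fixed array stands in for the patch's shared hash map -- and marks segments .ready as the flushed LSN advances. track_boundary(), advance_flush(), SEG, and the 1-based segment numbering (segment n covers bytes [(n-1)*SEG, n*SEG)) are all assumptions:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

#define SEG ((XLogRecPtr) 16 * 1024 * 1024)   /* stand-in for wal_segment_size */
#define MAX_BOUNDARIES 16

/* end LSN of the record that crosses out of each pending segment */
static XLogRecPtr pending_end[MAX_BOUNDARIES];
static XLogSegNo pending_seg[MAX_BOUNDARIES];
static int npending = 0;
static XLogSegNo last_ready = 0;    /* highest segment marked .ready */

/* Insert time: a record crosses out of segment "seg" and ends at "end". */
static void
track_boundary(XLogSegNo seg, XLogRecPtr end)
{
    if (npending < MAX_BOUNDARIES)
    {
        pending_seg[npending] = seg;
        pending_end[npending] = end;
        npending++;
    }
}

/* Flush time: mark .ready every finished segment whose crossing record,
 * if any, is now fully flushed; stop at the first one that isn't. */
static void
advance_flush(XLogRecPtr flush)
{
    XLogSegNo finished = flush / SEG;   /* segments 1..finished are complete */

    for (XLogSegNo seg = last_ready + 1; seg <= finished; seg++)
    {
        for (int i = 0; i < npending; i++)
            if (pending_seg[i] == seg && pending_end[i] > flush)
                return;     /* crossing record not fully flushed yet */
        printf("segment %llu -> .ready\n", (unsigned long long) seg);
        last_ready = seg;
    }
}

int
main(void)
{
    /* Record-A crosses out of segment 1; Record-B out of segment 3. */
    track_boundary(1, 1 * SEG + 200);
    track_boundary(3, 3 * SEG + 200);

    advance_flush(1 * SEG + 50);    /* mid Record-A: nothing marked */
    advance_flush(2 * SEG);         /* Record-A done: segments 1 and 2 marked */
    return 0;
}

Real code would also need to evict consumed entries, bound the table, and keep the bookkeeping crash-safe and lock-friendly, which is where the discussion downthread goes.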
Attachment
At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > Apologies for the long delay. > > I've spent a good amount of time thinking about this bug and trying > out a few different approaches for fixing it. I've attached a > work-in-progress patch for my latest attempt. > > On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > > F0 F1 > > AAAAA F BBBBB > > |---------|---------|---------| > > seg X seg X+1 seg X+2 > > > > Matsumura-san has a concern about the case where there are two (or > > more) partially-flushed segment-spanning records at the same time. > > > > This patch remembers only the last cross-segment record. If we were > > going to flush up to F0 after Record-B had been written, we would fail > > to hold off archiving seg-X. This patch is based on the assumption that > > that case cannot happen, because we don't leave a pending page at the > > time of segment switch and no record spans over three or more > > segments. > > I wonder if these are safe assumptions to make. For your example, if > we've written record B to the WAL buffers, but neither record A nor B > has been written to disk or flushed, aren't we still in trouble? You're right in that regard. There's a window where a partial record is written when the write location passes F0 after the insertion location passes F1. However, remembering all spanning records seems like overkill to me. I modified the previous patch so that it remembers the start LSN of the *oldest* cross-segment continuation record in the last consecutive bonded segments, and the end LSN of the latest cross-segment continuation record. This doesn't forget past segment boundaries. The region is cleared when the WAL-write LSN goes beyond the remembered end LSN. So the region may contain several WAL segments that are not connected to the next one, but that doesn't matter so much. > Also, is there actually any limit on WAL record length that means that > it is impossible for a record to span over three or more segments? Even though it is not a hard limit, AFAICS as mentioned before the longest possible record is what log_newpages() emits, that is, up to about 500 kB for now. I think we don't want to make the length longer. If we set wal_segment_size to 1MB and set the block size to 16kB or more, we would have a record spanning over three or more segments, but I don't think that is a sane configuration, and that kind of issue could happen elsewhere. > Perhaps these assumptions are true, but it doesn't seem obvious to me > that they are, and they might be pretty fragile. I added an assertion that a record must be shorter than a wal segment to XLogRecordAssemble(). This guarantees the assumption to be true? (The condition is tentative, would need to be adjusted.) > The attached patch doesn't make use of these assumptions. Instead, we > track the positions of the records that cross segment boundaries in a > small hash map, and we use that to determine when it is safe to mark a > segment as ready for archival. I think this approach resembles > Matsumura-san's patch from June. > > As before, I'm not handling replication, archive_timeout, and > persisting latest-marked-ready through crashes yet. For persisting > the latest-marked-ready segment through crashes, I was thinking of > using a new file that stores the segment number. Also, the attached is a PoC. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center From 5714bd064d61135c41a64ecc39aeff74c25a0e74 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 15 Dec 2020 16:24:13 +0900 Subject: [PATCH v3] PoC: Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. --- src/backend/access/transam/xlog.c | 187 +++++++++++++++++++++++- src/backend/access/transam/xloginsert.c | 3 + src/backend/replication/walsender.c | 14 +- src/include/access/xlog.h | 1 + 4 files changed, 193 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8dd225c2e1..98da521601 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -723,6 +723,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the oldest and newest known segment that ends with a + * continuation record. + */ + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1167,6 +1180,19 @@ XLogInsertRecord(XLogRecData *rdata, /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); + + /* Remember the range of segments that end with a continuation record */ + XLByteToSeg(StartPos, startseg, wal_segment_size); + XLByteToPrevSeg(EndPos, endseg, wal_segment_size); + + if (startseg != endseg) + { + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) + XLogCtl->firstSegContRecStart = StartPos; + XLogCtl->lastSegContRecEnd = EndPos; + SpinLockRelease(&XLogCtl->info_lck); + } } /* @@ -2400,6 +2426,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno) return false; } +/* + * Returns last notified segment. + */ +static XLogSegNo +GetLastNotifiedSegment(void) +{ + XLogSegNo last_notified; + + SpinLockAcquire(&XLogCtl->info_lck); + last_notified = XLogCtl->lastNotifiedSeg; + SpinLockRelease(&XLogCtl->info_lck); + + return last_notified; +} + +/* + * Notify segments that are not yet notified. + */ +static void +NotifySegmentsUpTo(XLogSegNo notifySegNo) +{ + XLogSegNo last_notified = GetLastNotifiedSegment(); + XLogSegNo i; + + if (notifySegNo <= last_notified) + return; + + for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++) + XLogArchiveNotifySeg(i); + + /* Don't go back in the case someone else has made it go further. */ + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->lastNotifiedSeg < notifySegNo) + XLogCtl->lastNotifiedSeg = notifySegNo; + SpinLockRelease(&XLogCtl->info_lck); +} + /* * Write and/or fsync the log at least as far as WriteRqst indicates. 
* @@ -2423,6 +2486,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) int npages; int startidx; uint32 startoffset; + bool extended = false; /* We should always be inside a critical section here */ Assert(CritSectionCount > 0); @@ -2579,15 +2643,73 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) */ if (finishing_seg) { + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + issue_xlog_fsync(openLogFile, openLogSegNo); /* signal that we need to wakeup walsenders later */ WalSndWakeupRequest(); - LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + SpinLockAcquire(&XLogCtl->info_lck); + firstSegContRecStart = XLogCtl->firstSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); - if (XLogArchivingActive()) - XLogArchiveNotifySeg(openLogSegNo); + /* + * If we're on a continuation record that spans over segments, don't + * expose the flush location until the next record is written. This + * prevents exposing a flush location in the middle of a + * cross-segment WAL record. + */ + if (LogwrtResult.Write < firstSegContRecStart || + lastSegContRecEnd <= LogwrtResult.Write) + { + LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + + if (XLogArchivingActive()) + NotifySegmentsUpTo(openLogSegNo); + + extended = false; + } + else + { + /* + * There's a case where we are told to flush not up to the + * end of a record but to a WAL buffer page boundary. We + * advance the request LSN to the end of the record in the + * case the record at the requested LSN continues to the + * next segment. + */ + XLogSegNo oldseg; + XLogSegNo currseg; + + XLByteToPrevSeg(WriteRqst.Write, currseg, wal_segment_size); + XLByteToPrevSeg(lastSegContRecEnd, oldseg, + wal_segment_size); + + if (oldseg == currseg && + WriteRqst.Write <= lastSegContRecEnd) + WriteRqst.Write = lastSegContRecEnd; + + /* + * We need to finish writing at least the current record so + * that the old segment can be safely archived. + */ + extended = true; + } + + if (lastSegContRecEnd <= LogwrtResult.Write) + { + /* + * We got out of the continuation region, reset the + * locations. + */ + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->firstSegContRecStart = InvalidXLogRecPtr; + XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr; + SpinLockRelease(&XLogCtl->info_lck); + } XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL); XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush; @@ -2616,11 +2738,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) } curridx = NextBufIdx(curridx); - /* If flexible, break out of loop as soon as we wrote something */ - if (flexible && npages == 0) + /* + * If flexible, break out of loop as soon as we wrote something. + * However, we don't leave the loop if the last record in the just + * finished segment needs to be finished. + */ + if (flexible && !extended && npages == 0) break; } + /* + * We have extended the write request to the next segment if the record at + * the initial WriteRqst.Write continues to the next segment. In that case + * we need to notify the last segment here. + */ + if (extended) + NotifySegmentsUpTo(openLogSegNo - 1); + Assert(npages == 0); /* @@ -7705,6 +7839,18 @@ StartupXLOG(void) XLogCtl->LogwrtRqst.Write = EndOfLog; XLogCtl->LogwrtRqst.Flush = EndOfLog; + /* + * We have archived up to the previous segment of EndOfLog so far. + * Initialize lastNotifiedSeg if needed. 
+ */ + if (XLogArchivingActive()) + { + XLogSegNo endLogSegNo; + + XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size); + XLogCtl->lastNotifiedSeg = endLogSegNo - 1; + } + /* * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE * record before resource manager writes cleanup WAL records or checkpoint @@ -8429,6 +8575,37 @@ GetFlushRecPtr(void) return LogwrtResult.Flush; } +/* + * GetReplicationTargetRecPtr -- Returns the latest position that is safe to + * replicate. + */ +XLogRecPtr +GetReplicationTargetRecPtr(void) +{ + static XLogRecPtr lastTargetRecPtr = InvalidXLogRecPtr; + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + + SpinLockAcquire(&XLogCtl->info_lck); + LogwrtResult = XLogCtl->LogwrtResult; + firstSegContRecStart = XLogCtl->firstSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * Don't move forward if the current flush position may be within a + * continuation record that spans over segments. + */ + if (lastTargetRecPtr == InvalidXLogRecPtr || + firstSegContRecStart == InvalidXLogRecPtr || + XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 || + LogwrtResult.Flush < firstSegContRecStart || + lastSegContRecEnd <= LogwrtResult.Flush) + lastTargetRecPtr = LogwrtResult.Flush; + + return lastTargetRecPtr; +} + /* * GetLastImportantRecPtr -- Returns the LSN of the last important record * inserted. All records not explicitly marked as unimportant are considered diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 1f0e4e01e6..af53d1f514 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -815,6 +815,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, rechdr->xl_prev = InvalidXLogRecPtr; rechdr->xl_crc = rdata_crc; + /* we shouldn't have a record longer than a segment */ + Assert(total_len < wal_segment_size); + return &hdr_rdt; } diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 2eb19ad293..00d8701a60 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2637,14 +2637,14 @@ XLogSendPhysical(void) /* * Streaming the current timeline on a primary. * - * Attempt to send all data that's already been written out and - * fsync'd to disk. We cannot go further than what's been written out - * given the current implementation of WALRead(). And in any case - * it's unsafe to send WAL that is not securely down to disk on the - * primary: if the primary subsequently crashes and restarts, standbys - * must not have applied any WAL that got lost on the primary. + * Attempt to send all data that can be replicated. We cannot go + * further than what's been written out given the current + * implementation of WALRead(). And in any case it's unsafe to send + * WAL that is not securely down to disk on the primary: if the primary + * subsequently crashes and restarts, standbys must not have applied + * any WAL that got lost on the primary. 
*/ - SendRqstPtr = GetFlushRecPtr(); + SendRqstPtr = GetReplicationTargetRecPtr(); } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 221af87e71..94876f628c 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p) extern XLogRecPtr GetRedoRecPtr(void); extern XLogRecPtr GetInsertRecPtr(void); extern XLogRecPtr GetFlushRecPtr(void); +extern XLogRecPtr GetReplicationTargetRecPtr(void); extern XLogRecPtr GetLastImportantRecPtr(void); extern void RemovePromoteSignalFiles(void); -- 2.27.0
At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > > I wonder if these are safe assumptions to make. For your example, if > > we've written record B to the WAL buffers, but neither record A nor B > > has been written to disk or flushed, aren't we still in trouble? > > You're right in that regard. There's a window where a partial record is > written when the write location passes F0 after the insertion location > passes F1. However, remembering all spanning records seems like overkill to me. > > I modified the previous patch so that it remembers the start LSN of the > *oldest* cross-segment continuation record in the last consecutive > bonded segments, and the end LSN of the latest cross-segment > continuation record. This doesn't forget past segment boundaries. > The region is cleared when the WAL-write LSN goes beyond the remembered end > LSN. So the region may contain several WAL segments that are not > connected to the next one, but that doesn't matter so much. Mmm. Even though it's a PoC, it was too bogus. I fixed it to work in a saner way. - Record the beginning LSN of the first cross-seg record and the end LSN of the last cross-seg record in consecutive segments bonded by cross-seg records. Specifically X and Y below. X Z Y [recA] [recB] [recC] [seg A] [seg B] [seg C] [seg D] [seg E] (1) (2.2) (2.2) (2.1) (2.1) (1) 1. If we wrote up to before X or beyond Y at a segment boundary, notify the finished segment immediately. 1.1. If we have written beyond Y, clear the recorded region. 2. Otherwise we don't notify the segment immediately: 2.1. If the write request was up to exactly the current segment boundary and we know the end LSN of the record there (that is, it is recC above), extend the request to the end LSN. Then notify the segment after the record is written to the end. 2.2. Otherwise (that is recA or recB), we don't know whether the last record of the last segment ends just at the segment boundary (Z) or a record spans between segments (recB). Anyway even if there is such a record there, we don't know where it ends. As a result, what we can do there is only to refrain from notifying. It doesn't matter so much since we have already inserted recC, so we will soon reach recC and will notify up to seg C. (A reduced C sketch of this decision logic appears just after the patch below.) There might be a case where we insert up to Y before writing up to Z; in that case the segment region X-Y contains a non-connected segment boundary. It is processed as if it is a connected segment boundary. However, like 2.2 above, it doesn't matter since we write up to Y soon. At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in me> I added an assertion that a record must be shorter than a wal segment me> to XLogRecordAssemble(). This guarantees the assumption to be true? me> (The condition is tentative, would need to be adjusted.) Changed the assertion in a more direct way. me> Also, the attached is a PoC. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 50b3b05dd0eed79cd0b97991e82090b9d569cbac Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 16 Dec 2020 10:36:14 +0900 Subject: [PATCH v4 1/2] Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. 
Otherwise crash recovery can overwrite the last record of a segment and history diverges between archive and pg_wal. --- src/backend/access/transam/xlog.c | 214 +++++++++++++++++++++++++++- src/backend/replication/walsender.c | 14 +- src/include/access/xlog.h | 1 + 3 files changed, 217 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8dd225c2e1..8705809160 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -723,6 +723,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the oldest and newest known segment that ends with a + * continuation record. + */ + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1167,6 +1180,24 @@ XLogInsertRecord(XLogRecData *rdata, /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); + + /* Remember the range of segments that end with a continuation record */ + XLByteToSeg(StartPos, startseg, wal_segment_size); + XLByteToPrevSeg(EndPos, endseg, wal_segment_size); + + if (startseg != endseg) + { + /* + * We shouldn't have a record spanning over more than two segments + */ + Assert (startseg + 1 == endseg); + + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) + XLogCtl->firstSegContRecStart = StartPos; + XLogCtl->lastSegContRecEnd = EndPos; + SpinLockRelease(&XLogCtl->info_lck); + } } /* @@ -2400,6 +2431,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno) return false; } +/* + * Returns last notified segment. + */ +static XLogSegNo +GetLastNotifiedSegment(void) +{ + XLogSegNo last_notified; + + SpinLockAcquire(&XLogCtl->info_lck); + last_notified = XLogCtl->lastNotifiedSeg; + SpinLockRelease(&XLogCtl->info_lck); + + return last_notified; +} + +/* + * Notify segments that are not yet notified. + */ +static void +NotifySegmentsUpTo(XLogSegNo notifySegNo) +{ + XLogSegNo last_notified = GetLastNotifiedSegment(); + XLogSegNo i; + + if (notifySegNo <= last_notified) + return; + + for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++) + XLogArchiveNotifySeg(i); + + /* Don't go back in the case someone else has made it go further. */ + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->lastNotifiedSeg < notifySegNo) + XLogCtl->lastNotifiedSeg = notifySegNo; + SpinLockRelease(&XLogCtl->info_lck); +} + /* * Write and/or fsync the log at least as far as WriteRqst indicates. 
* @@ -2423,6 +2491,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) int npages; int startidx; uint32 startoffset; + bool seg_notify = false; /* We should always be inside a critical section here */ Assert(CritSectionCount > 0); @@ -2579,15 +2648,95 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) */ if (finishing_seg) { + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + issue_xlog_fsync(openLogFile, openLogSegNo); /* signal that we need to wakeup walsenders later */ WalSndWakeupRequest(); - LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + SpinLockAcquire(&XLogCtl->info_lck); + firstSegContRecStart = XLogCtl->firstSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); - if (XLogArchivingActive()) - XLogArchiveNotifySeg(openLogSegNo); + /* + * If we may be on a continuation record that spans over segments, + * don't archive the segment until the record is written to the + * end. If we do, we could end up with a corrupt archive having + * different records at the boundary after a server crash + * around here. For the same reason, also for replication, + * don't expose the flush location until the record is written to + * the end so that an incomplete record at a segment boundary + * won't be sent to a standby. + */ + if (LogwrtResult.Write < firstSegContRecStart || + lastSegContRecEnd <= LogwrtResult.Write) + { + LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ + + if (XLogArchivingActive()) + NotifySegmentsUpTo(openLogSegNo); + + if (lastSegContRecEnd <= LogwrtResult.Write) + { + /* + * We got out of the continuation region, reset the + * locations. + */ + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->firstSegContRecStart = InvalidXLogRecPtr; + XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr; + SpinLockRelease(&XLogCtl->info_lck); + } + + /* already notified */ + seg_notify = false; + } + else + { + seg_notify = true; + + /* + * There's a case where we are told to flush not up to the + * end of a record but to a WAL buffer page boundary. We + * advance the request LSN to the end of the record in the + * case the record at the requested LSN continues to the + * next segment. + */ + if (XLogSegmentOffset(WriteRqst.Write, wal_segment_size) + == 0) + { + XLogSegNo oldseg; + XLogSegNo currseg; + + XLByteToSeg(WriteRqst.Write, currseg, + wal_segment_size); + XLByteToPrevSeg(lastSegContRecEnd, oldseg, + wal_segment_size); + + /* + * If we know the exact placement of the record, extend the + * request to the end of the record. + */ + if (oldseg == currseg && + WriteRqst.Write < lastSegContRecEnd) + WriteRqst.Write = lastSegContRecEnd; + else + { + /* + * We forgot the exact placement of the record + * around requested write LSN. Refrain from + * notifying the segment to avoid archiving a + * segment having an incomplete record at the end. In + * this case we are going to write another couple + * of further segments, thus we will soon reach the + * next segment boundary. */ + seg_notify = false; + } + } + } XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL); XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush; @@ -2616,11 +2765,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) } curridx = NextBufIdx(curridx); - /* If flexible, break out of loop as soon as we wrote something */ - if (flexible && npages == 0) + /* + * If flexible, break out of loop as soon as we wrote something. + * However, we don't leave the loop if the last record in the just + * finished segment needs to be finished. 
+ */ + if (flexible && seg_notify && npages == 0) break; } + /* + * We have extended the write request to the next segment if the record at + * the initial WriteRqst.Write continues to the next segment. In that case + * need to notify the last segment here. + */ + if (seg_notify) + NotifySegmentsUpTo(openLogSegNo - 1); + Assert(npages == 0); /* @@ -7705,6 +7866,18 @@ StartupXLOG(void) XLogCtl->LogwrtRqst.Write = EndOfLog; XLogCtl->LogwrtRqst.Flush = EndOfLog; + /* + * We have archived up to the previous segment of EndOfLog so far. + * Initialize lastNotifiedSeg if needed. + */ + if (XLogArchivingActive()) + { + XLogSegNo endLogSegNo; + + XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size); + XLogCtl->lastNotifiedSeg = endLogSegNo - 1; + } + /* * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE * record before resource manager writes cleanup WAL records or checkpoint @@ -8429,6 +8602,37 @@ GetFlushRecPtr(void) return LogwrtResult.Flush; } +/* + * GetReplicationTargetRecPtr -- Returns the latest position that is safe to + * replicate. + */ +XLogRecPtr +GetReplicationTargetRecPtr(void) +{ + static XLogRecPtr lastTargetRecPtr = InvalidXLogRecPtr; + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + + SpinLockAcquire(&XLogCtl->info_lck); + LogwrtResult = XLogCtl->LogwrtResult; + firstSegContRecStart = XLogCtl->firstSegContRecStart; + lastSegContRecEnd = XLogCtl->lastSegContRecEnd; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * Don't move forward if the current flush position may be within a + * continuation record that spans over segments. + */ + if (lastTargetRecPtr == InvalidXLogRecPtr || + firstSegContRecStart == InvalidXLogRecPtr || + XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 || + LogwrtResult.Flush < firstSegContRecStart || + lastSegContRecEnd <= LogwrtResult.Flush) + lastTargetRecPtr = LogwrtResult.Flush; + + return lastTargetRecPtr; +} + /* * GetLastImportantRecPtr -- Returns the LSN of the last important record * inserted. All records not explicitly marked as unimportant are considered diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 2eb19ad293..00d8701a60 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2637,14 +2637,14 @@ XLogSendPhysical(void) /* * Streaming the current timeline on a primary. * - * Attempt to send all data that's already been written out and - * fsync'd to disk. We cannot go further than what's been written out - * given the current implementation of WALRead(). And in any case - * it's unsafe to send WAL that is not securely down to disk on the - * primary: if the primary subsequently crashes and restarts, standbys - * must not have applied any WAL that got lost on the primary. + * Attempt to send all data that can be replicated. We cannot go + * further than what's been written out given the current + * implementation of WALRead(). And in any case it's unsafe to send + * WAL that is not securely down to disk on the primary: if the primary + * subsequently crashes and restarts, standbys must not have applied + * any WAL that got lost on the primary. 
*/ - SendRqstPtr = GetFlushRecPtr(); + SendRqstPtr = GetReplicationTargetRecPtr(); } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 221af87e71..94876f628c 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p) extern XLogRecPtr GetRedoRecPtr(void); extern XLogRecPtr GetInsertRecPtr(void); extern XLogRecPtr GetFlushRecPtr(void); +extern XLogRecPtr GetReplicationTargetRecPtr(void); extern XLogRecPtr GetLastImportantRecPtr(void); extern void RemovePromoteSignalFiles(void); -- 2.27.0 From 6367ea9ffac7b6e6e692ddacfb8dc997c222c296 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 16 Dec 2020 10:36:42 +0900 Subject: [PATCH v4 2/2] debug print --- src/backend/access/transam/xlog.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8705809160..a1bd00ae1b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -1192,6 +1192,11 @@ XLogInsertRecord(XLogRecData *rdata, */ Assert (startseg + 1 == endseg); + if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) + ereport(LOG, (errmsg("REG-REG: (%lX, %lX)", StartPos, EndPos), errhidestmt(true),errhidecontext(true))); + else + ereport(LOG, (errmsg("UPD-REG: (%lX, %lX)", XLogCtl->firstSegContRecStart, EndPos), errhidestmt(true),errhidecontext(true))); + SpinLockAcquire(&XLogCtl->info_lck); if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) XLogCtl->firstSegContRecStart = StartPos; @@ -2674,6 +2679,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) if (LogwrtResult.Write < firstSegContRecStart || lastSegContRecEnd <= LogwrtResult.Write) { + ereport(LOG, (errmsg("NOTIFY1 %lX-%lX: (%lX, %lX) %lX(/%lX) %lX", GetLastNotifiedSegment() + 1, openLogSegNo,firstSegContRecStart, lastSegContRecEnd, WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write, wal_segment_size),LogwrtResult.Write), errhidestmt(true),errhidecontext(true))); + LogwrtResult.Flush = LogwrtResult.Write; /* end of page */ if (XLogArchivingActive()) @@ -2681,6 +2688,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) if (lastSegContRecEnd <= LogwrtResult.Write) { + ereport(LOG, (errmsg("CLEAR-REG: (%lX, %lX) %lX, %lX", firstSegContRecStart, lastSegContRecEnd,WriteRqst.Write, LogwrtResult.Write), errhidestmt(true),errhidecontext(true))); /* * We got out of the continuation region, reset the * locations. @@ -2705,6 +2713,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) * case the record at the requested LSN continues to the * next segment. 
*/ + ereport(LOG, (errmsg("NOT-NOTIFY: (%lX, %lX) %lX(/%lX) %lX(/%lX)", firstSegContRecStart, lastSegContRecEnd,LogwrtResult.Write, XLogSegmentOffset(LogwrtResult.Write, wal_segment_size), WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write,wal_segment_size)), errhidestmt(true),errhidecontext(true))); if (XLogSegmentOffset(WriteRqst.Write, wal_segment_size) == 0) { @@ -2722,9 +2731,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) */ if (oldseg == currseg && WriteRqst.Write < lastSegContRecEnd) + { + ereport(LOG, (errmsg("EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd,oldseg, WriteRqst.Write, currseg, (oldseg == currseg && WriteRqst.Write <= lastSegContRecEnd)), errhidestmt(true),errhidecontext(true))); WriteRqst.Write = lastSegContRecEnd; + } else { + ereport(LOG, (errmsg("SKIP-NOTIFY: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd,oldseg, WriteRqst.Write, currseg, (oldseg == currseg && WriteRqst.Write <= lastSegContRecEnd)), errhidestmt(true),errhidecontext(true))); /* * We forgot the exact placement of the record * around requested write LSN. Refrain from @@ -2780,7 +2793,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) * need to notify the last segment here. */ if (seg_notify) + { + ereport(LOG, (errmsg("NOTIFY2 %lX-%lX: (%lX, %lX) %lX, seg %lX", GetLastNotifiedSegment() + 1, openLogSegNo - 1,XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, LogwrtResult.Write, openLogSegNo - 1), errhidestmt(true),errhidecontext(true))); + NotifySegmentsUpTo(openLogSegNo - 1); + } Assert(npages == 0); -- 2.27.0
On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > You're right in that regard. There's a window where a partial record is > written when the write location passes F0 after the insertion location > passes F1. However, remembering all spanning records seems like overkill to me. I'm curious why you feel that recording all cross-segment records is overkill. IMO it seems far simpler to just do that rather than try to reason about all these different scenarios and rely on various (and possibly fragile) assumptions. You only need to record the end location of records that cross into the next segment (or that fit perfectly into the end of the current one) and to evaluate which segments to mark .ready as the "flushed" LSN advances. I'd expect that in most cases we wouldn't need to store more than a couple of record boundaries, so it's not like we'd normally be storing dozens of boundaries. Even if we did need to store several boundaries, AFAICT the approach I'm proposing should still work well enough. Nathan
At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > > You're right in that regard. There's a window where a partial record is > > written when the write location passes F0 after the insertion location > > passes F1. However, remembering all spanning records seems like overkill to me. > > I'm curious why you feel that recording all cross-segment records is > overkill. IMO it seems far simpler to just do that rather than try to Sorry, my words were not enough. Remembering all spanning records in *shared memory* seems to be overkill. Much more so if it is stored in a shared hash table. Even though it is rarely the case, it can fail in a hard way when reaching the limit. If we could do well by remembering just two locations, we wouldn't need to worry about such a limitation. > reason about all these different scenarios and rely on various > (and possibly fragile) assumptions. You only need to record the end After the previous mail was sent, I noticed that the assumption on record length was not needed. So that way no longer needs any of the assumptions^^; > location of records that cross into the next segment (or that fit > perfectly into the end of the current one) and to evaluate which > segments to mark .ready as the "flushed" LSN advances. I'd expect > that in most cases we wouldn't need to store more than a couple of > record boundaries, so it's not like we'd normally be storing dozens of > boundaries. Even if we did need to store several boundaries, AFAICT > the approach I'm proposing should still work well enough. I didn't say it doesn't work, just overkill. Another concern about the concrete patch: NotifySegmentsReadyForArchive() searches the shared hash, acquiring an LWLock, every time XLogWrite is called while segment archiving is being held off. I don't think that is acceptable and I think it could be a problem when many backends are competing on WAL. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 16 Dec 2020 11:01:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > - Record the beginning LSN of the first cross-seg record and the end > LSN of the last cross-seg record in consecutive segments bonded by > cross-seg records. Specifically X and Y below. > > X Z Y > [recA] [recB] [recC] > [seg A] [seg B] [seg C] [seg D] [seg E] > (1) (2.2) (2.2) (2.1) (2.1) (1) > > 1. If we wrote up to before X or beyond Y at a segment boundary, notify > the finished segment immediately. > > 1.1. If we have written beyond Y, clear the recorded region. > > 2. Otherwise we don't notify the segment immediately: > > 2.1. If the write request was up to exactly the current segment boundary > and we know the end LSN of the record there (that is, it is recC > above), extend the request to the end LSN. Then notify the segment > after the record is written to the end. > > 2.2. Otherwise (that is recA or recB), we don't know whether the > last record of the last segment ends just at the segment boundary > (Z) or a record spans between segments (recB). Anyway even if there > is such a record there, we don't know where it ends. As a result, > what we can do there is only to refrain from notifying. It doesn't > matter so much since we have already inserted recC, so we will soon > reach recC and will notify up to seg C. > > There might be a case where we insert up to Y before writing up to Z; > in that case the segment region X-Y contains a non-connected segment > boundary. It is processed as if it is a connected segment > boundary. However, like 2.2 above, it doesn't matter since we write up > to Y soon. I noticed that we can cause the continuation record to be flushed immediately. So in the attached, 1. If there's no remembered cross-segment boundary or we're out of the region X-Y, notify the finished segment immediately. 2. Otherwise we don't notify the segment immediately: 2.1. If we are finishing the last segment known to continue to the next segment, extend the write request to the end of the record *and* force writing and then flushing up to there. 2.2. (the same as the above) 3. In the case of 2.1, we can flush the previous segment immediately so do that. X. When we notify a segment, clear the remembered region if we have got out of the region. The attached is changed in the following points: - Fixed some bugs where I confusedly referred to write-lsn instead of flush-lsn. - Changed to urge flushing up to the end of a continuation record, not only waiting for the record to be written. - More aggressively clear the remembered region. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 95e438ee448c1686c946909f1fc84ec95ee6c7d4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Fri, 18 Dec 2020 13:45:29 +0900 Subject: [PATCH v5 1/2] Avoid archiving a WAL segment that continues to the next segment If the last record of a finished segment continues to the next segment, we need to defer archiving of the segment until the record is flushed to the end. 
--- src/backend/access/transam/xlog.c | 221 +++++++++++++++++++++++++++- src/backend/replication/walsender.c | 14 +- src/include/access/xlog.h | 1 + 3 files changed, 224 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b1e5d2dbff..b0d3ba2c5a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -733,6 +733,16 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* The last segment notified to be archived. Protected by WALWriteLock */ + XLogSegNo lastNotifiedSeg; + + /* + * Remember the region we need to consider refraining from archiving + * finished segments immediately. Protected by info_lck. + */ + XLogRecPtr firstSegContRecStart; + XLogRecPtr lastSegContRecEnd; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -1170,6 +1180,9 @@ XLogInsertRecord(XLogRecData *rdata, */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo startseg; + XLogSegNo endseg; + SpinLockAcquire(&XLogCtl->info_lck); /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) @@ -1177,6 +1190,22 @@ XLogInsertRecord(XLogRecData *rdata, /* update local result copy while I have the chance */ LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); + + /* + * Remember the range of segment boundaries that are connected by a + * continuation record. + */ + XLByteToSeg(StartPos, startseg, wal_segment_size); + XLByteToPrevSeg(EndPos, endseg, wal_segment_size); + + if (startseg != endseg) + { + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr) + XLogCtl->firstSegContRecStart = StartPos; + XLogCtl->lastSegContRecEnd = EndPos; + SpinLockRelease(&XLogCtl->info_lck); + } } /* @@ -2410,6 +2439,50 @@ XLogCheckpointNeeded(XLogSegNo new_segno) return false; } +/* + * Returns last notified segment. + */ +static XLogSegNo +GetLastNotifiedSegment(void) +{ + XLogSegNo last_notified; + + SpinLockAcquire(&XLogCtl->info_lck); + last_notified = XLogCtl->lastNotifiedSeg; + SpinLockRelease(&XLogCtl->info_lck); + + return last_notified; +} + +/* + * Notify segments that are not yet notified. + */ +static void +NotifySegmentsUpTo(XLogSegNo notifySegNo) +{ + XLogSegNo last_notified = GetLastNotifiedSegment(); + XLogSegNo i; + + if (notifySegNo <= last_notified) + return; + + for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++) + XLogArchiveNotifySeg(i); + + /* Don't go back in the case someone else has made it go further. */ + SpinLockAcquire(&XLogCtl->info_lck); + if (XLogCtl->lastNotifiedSeg < notifySegNo) + XLogCtl->lastNotifiedSeg = notifySegNo; + + /* Reset the locations if we have got out of the continuation region. */ + if (XLogCtl->lastSegContRecEnd <= XLogCtl->LogwrtRqst.Flush) + { + XLogCtl->firstSegContRecStart = InvalidXLogRecPtr; + XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr; + } + SpinLockRelease(&XLogCtl->info_lck); +} + /* * Write and/or fsync the log at least as far as WriteRqst indicates. 
  *
@@ -2433,6 +2506,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	int			npages;
 	int			startidx;
 	uint32		startoffset;
+	bool		notify_seg = false;
+	bool		force_continue = false;

 	/* We should always be inside a critical section here */
 	Assert(CritSectionCount > 0);
@@ -2589,15 +2664,97 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 */
 			if (finishing_seg)
 			{
+				XLogRecPtr	firstSegContRecStart;
+				XLogRecPtr	lastSegContRecEnd;
+
 				issue_xlog_fsync(openLogFile, openLogSegNo);

 				/* signal that we need to wakeup walsenders later */
 				WalSndWakeupRequest();

-				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				firstSegContRecStart = XLogCtl->firstSegContRecStart;
+				lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+				SpinLockRelease(&XLogCtl->info_lck);

-				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogSegNo);
+				/*
+				 * If we may be on a continuation record that spans segments,
+				 * don't archive the segment until the record is written to
+				 * the end. If we do, we could have a corrupt archive having
+				 * different records at the boundary after a server crash
+				 * around here. For the same reason, also for replication,
+				 * don't expose the flush location until the record is written
+				 * to the end so that an incomplete record at a segment
+				 * boundary won't be sent to a standby.
+				 */
+				if (firstSegContRecStart == InvalidXLogRecPtr ||
+					LogwrtResult.Write < firstSegContRecStart ||
+					lastSegContRecEnd <= LogwrtResult.Write)
+				{
+					LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
+
+					if (XLogArchivingActive())
+						NotifySegmentsUpTo(openLogSegNo);
+
+					/* Already notified */
+					notify_seg = false;
+					force_continue = false;
+				}
+				else
+				{
+					/*
+					 * The last record in the finishing segment may be
+					 * continuing into the next segment. Don't archive the
+					 * segment until we are sure that the continuation record
+					 * is completely written out.
+					 */
+					XLogSegNo	reqendseg;
+					XLogSegNo	contendseg;
+
+					force_continue = true;
+
+					XLByteToSeg(WriteRqst.Write, reqendseg, wal_segment_size);
+					XLByteToSeg(lastSegContRecEnd, contendseg,
+								wal_segment_size);
+
+					/*
+					 * There's a case where we are told to flush up not to the
+					 * end of a record but to a page boundary. Advance the
+					 * request LSN to the end of the record at the boundary if
+					 * any.
+					 */
+					if (reqendseg == contendseg)
+					{
+						/*
+						 * We finished the last segment in the region. Extend
+						 * the write request to the end of the record.
+						 */
+						if (WriteRqst.Write < lastSegContRecEnd)
+							WriteRqst.Write = lastSegContRecEnd;
+
+						/*
+						 * Flush up to at least the same location immediately
+						 * so that the segment can be archived.
+						 */
+						if (WriteRqst.Flush < lastSegContRecEnd)
+							WriteRqst.Flush = lastSegContRecEnd;
+
+						/* We know the record will be finished */
+						notify_seg = true;
+					}
+					else
+					{
+						/*
+						 * We don't know the exact placement of the record
+						 * around the requested write LSN. Do not archive this
+						 * segment since it might end with an incomplete
+						 * record. In this case we will end up archiving the
+						 * segment soon since we have written up to further
+						 * segments.
+						 */
+						notify_seg = false;
+					}
+				}

 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2626,8 +2783,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			}
 			curridx = NextBufIdx(curridx);

-			/* If flexible, break out of loop as soon as we wrote something */
-			if (flexible && npages == 0)
+			/*
+			 * If flexible, break out of loop as soon as we wrote something.
+			 * However, we don't leave the loop if we have to write a bit more.
+			 */
+			if (flexible && !force_continue && npages == 0)
 				break;
 		}
@@ -2685,6 +2845,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
+
+	/*
+	 * We have extended the write request to the next segment if the record at
+	 * the initial WriteRqst.Write continues into the next segment. In that
+	 * case we need to notify the last segment here.
+	 */
+	if (notify_seg)
+		NotifySegmentsUpTo(openLogSegNo - 1);
 }

 /*
@@ -7716,6 +7884,18 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;

+	/*
+	 * We have archived up to the previous segment of EndOfLog so far.
+	 * Initialize lastNotifiedSeg if needed.
+	 */
+	if (XLogArchivingActive())
+	{
+		XLogSegNo	endLogSegNo;
+
+		XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+		XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+	}
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
@@ -8440,6 +8620,37 @@ GetFlushRecPtr(void)
 	return LogwrtResult.Flush;
 }

+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+	static XLogRecPtr lastTargetRecPtr = InvalidXLogRecPtr;
+	XLogRecPtr	firstSegContRecStart;
+	XLogRecPtr	lastSegContRecEnd;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	firstSegContRecStart = XLogCtl->firstSegContRecStart;
+	lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Don't move forward if the current flush position may be within a
+	 * continuation record that spans segments.
+	 */
+	if (lastTargetRecPtr == InvalidXLogRecPtr ||
+		firstSegContRecStart == InvalidXLogRecPtr ||
+		XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 ||
+		LogwrtResult.Flush < firstSegContRecStart ||
+		lastSegContRecEnd <= LogwrtResult.Flush)
+		lastTargetRecPtr = LogwrtResult.Flush;
+
+	return lastTargetRecPtr;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d5c9bc31d8..c13cca1e64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2631,14 +2631,14 @@ XLogSendPhysical(void)
 		/*
 		 * Streaming the current timeline on a primary.
 		 *
-		 * Attempt to send all data that's already been written out and
-		 * fsync'd to disk.  We cannot go further than what's been written out
-		 * given the current implementation of WALRead().  And in any case
-		 * it's unsafe to send WAL that is not securely down to disk on the
-		 * primary: if the primary subsequently crashes and restarts, standbys
-		 * must not have applied any WAL that got lost on the primary.
+		 * Attempt to send all data that can be replicated.  We cannot go
+		 * further than what's been written out given the current
+		 * implementation of WALRead().  And in any case it's unsafe to send
+		 * WAL that is not securely down to disk on the primary: if the
+		 * primary subsequently crashes and restarts, standbys must not have
+		 * applied any WAL that got lost on the primary.
 		 */
-		SendRqstPtr = GetFlushRecPtr();
+		SendRqstPtr = GetReplicationTargetRecPtr();
 	}

 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);

 extern void RemovePromoteSignalFiles(void);
-- 
2.27.0

From db3b7a536199202539a29af50fa55cc46cdff1db Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 18 Dec 2020 13:49:43 +0900
Subject: [PATCH v5 2/2] debug print

---
 src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b0d3ba2c5a..ee5494701c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1200,6 +1200,11 @@ XLogInsertRecord(XLogRecData *rdata,

 		if (startseg != endseg)
 		{
+			if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+				ereport(LOG, (errmsg("REG-REG: (%lX, %lX)", StartPos, EndPos), errhidestmt(true), errhidecontext(true)));
+			else
+				ereport(LOG, (errmsg("UPD-REG: (%lX, %lX)", XLogCtl->firstSegContRecStart, EndPos), errhidestmt(true), errhidecontext(true)));
+
 			SpinLockAcquire(&XLogCtl->info_lck);
 			if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
 				XLogCtl->firstSegContRecStart = StartPos;
@@ -2477,6 +2482,8 @@ NotifySegmentsUpTo(XLogSegNo notifySegNo)
 	/* Reset the locations if we have got out of the continuation region. */
 	if (XLogCtl->lastSegContRecEnd <= XLogCtl->LogwrtRqst.Flush)
 	{
+		ereport(LOG, (errmsg("CLEAR-REG: (%lX, %lX) %lX, %lX", XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, XLogCtl->LogwrtRqst.Write, XLogCtl->LogwrtRqst.Flush), errhidestmt(true), errhidecontext(true)));
+
 		XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
 		XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
 	}
@@ -2691,6 +2698,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					LogwrtResult.Write < firstSegContRecStart ||
 					lastSegContRecEnd <= LogwrtResult.Write)
 				{
+					ereport(LOG, (errmsg("NOTIFY1 %lX-%lX: (%lX, %lX) %lX(/%lX) %lX", GetLastNotifiedSegment() + 1, openLogSegNo, firstSegContRecStart, lastSegContRecEnd, WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write, wal_segment_size), LogwrtResult.Write), errhidestmt(true), errhidecontext(true)));
+
 					LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */

 					if (XLogArchivingActive())
@@ -2731,7 +2740,12 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					 * the write request to the end of the record.
 					 */
 					if (WriteRqst.Write < lastSegContRecEnd)
+					{
+						ereport(LOG, (errmsg("EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX)", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg), errhidestmt(true), errhidecontext(true)));
 						WriteRqst.Write = lastSegContRecEnd;
+					}
+					else
+						ereport(LOG, (errmsg("NOT-EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX)", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg), errhidestmt(true), errhidecontext(true)));

 					/*
 					 * Flush up to at least the same location immediately
@@ -2752,6 +2766,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					 * the segment soon since we have written up to further
 					 * segments.
 					 */
+					ereport(LOG, (errmsg("SKIP-NOTIFY: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg, (contendseg == reqendseg && WriteRqst.Write < lastSegContRecEnd)), errhidestmt(true), errhidecontext(true)));
 					notify_seg = false;
 				}
 			}
@@ -2852,7 +2867,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 * need to notify the last segment here.
 	 */
 	if (notify_seg)
+	{
+		ereport(LOG, (errmsg("NOTIFY2 %lX-%lX: (%lX, %lX) %lX, seg %lX", GetLastNotifiedSegment() + 1, openLogSegNo - 1, XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, LogwrtResult.Flush, openLogSegNo - 1), errhidestmt(true), errhidecontext(true)));
+
 		NotifySegmentsUpTo(openLogSegNo - 1);
+	}
 }

 /*
-- 
2.27.0
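[Editor's note: to make the decision rule in the patch above easier to follow, here is a minimal standalone sketch of it, with simplified stand-in types; the names and condition mirror the patch's test on firstSegContRecStart/lastSegContRecEnd but this is illustrative, not the patch's actual code.]

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;		/* simplified stand-in */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Decide whether the segment finishing at write_pos may be notified
 * (i.e. its .ready file created) immediately.  first_cont_start and
 * last_cont_end model the X-Y region whose segment boundaries are
 * crossed by continuation records.
 */
static bool
may_notify_immediately(XLogRecPtr write_pos,
					   XLogRecPtr first_cont_start,
					   XLogRecPtr last_cont_end)
{
	/* No cross-segment record is remembered at all. */
	if (first_cont_start == InvalidXLogRecPtr)
		return true;

	/* Still before X: no spanning record is in flight yet. */
	if (write_pos < first_cont_start)
		return true;

	/* At or past Y: the last spanning record is fully written. */
	if (write_pos >= last_cont_end)
		return true;

	/* Inside X-Y: this boundary may split a record; hold off. */
	return false;
}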
Hi!

I was looking to review something in the CF, and this seems like a thread of some interest to me. Recently we had a somewhat related incident. Do I understand correctly that this incident is related to the bug discussed in this thread?

The primary instance was killed by OOM:

[ 2020-11-12 15:27:03.732 MSK ,,,739,00000 ]:LOG:  server process (PID 40189) was terminated by signal 9: Killed

After recovery it archived some WAL segments:

[ 2020-11-12 15:27:31.477 MSK ,,,739,00000 ]:LOG:  database system is ready to accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 0000000E0001C02F000000AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 0000000E0001C02F000000B3.br

Then PITR failed on another host:

[ 2020-11-12 16:26:33.024 MSK ,,,51414,00000 ]:LOG:  restored log file "0000000E0001C02F000000B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0

The archived segment has some zeroes at the end:

rmgr: XLOG        len (rec/tot):   51/  1634, tx:          0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT, blkref #0: rel 1663/14030/16384 blk 140 FPW
rmgr: Heap        len (rec/tot):  129/   129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):   46/    46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL:  error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile the next segment points back to the previous record at **B3FFF748**:

postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 0000000E0001C02F000000B4 | head
rmgr: Heap        len (rec/tot):  129/   129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):   46/    46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards, Andrey Borodin.
Hi! Thanks for working on this.

> On 18 Dec 2020, at 10:42, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I noticed that we can cause the continuation record to be flushed
> immediately.

I've taken a look at the code and want to share some thoughts.

1. Maybe we could avoid interlacing fields protected by different locks in XLogCtlData? We could place lastNotifiedSeg somewhere near the fields that are protected by WALWriteLock. I'm not sure it's a useful idea.

2. In XLogInsertRecord() we release &XLogCtl->info_lck just to compute a few bytes, and then possibly acquire it back. Maybe just hold the lock a little longer?

3. Could what GetLastNotifiedSegment() does just be an atomic read? I'm not sure it's common practice.

Thanks!

Best regards, Andrey Borodin.
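[Editor's note: for what it's worth, suggestion 3 might look roughly like the following if lastNotifiedSeg were declared as a pg_atomic_uint64 in XLogCtlData. This is only a sketch of the idea, not something the patch currently does.]

#include "port/atomics.h"

/* Assumes: pg_atomic_uint64 lastNotifiedSeg; in XLogCtlData. */
static XLogSegNo
GetLastNotifiedSegment(void)
{
	/* Lock-free read instead of taking info_lck. */
	return (XLogSegNo) pg_atomic_read_u64(&XLogCtl->lastNotifiedSeg);
}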
On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > You're right in that regard. There's a window where a partial record is
>> > written when the write location passes F0 after the insertion location
>> > passes F1. However, remembering all spanning records seems overkill to me.
>>
>> I'm curious why you feel that recording all cross-segment records is
>> overkill. IMO it seems far simpler to just do that rather than try to
>
> Sorry, my words are not enough. Remembering all spanning records in
> *shared memory* seems to be overkill. Much more so if it is stored in a
> shared hash table. Even though it is rarely the case, it can fail the
> hard way when reaching the limit. If we could do well by remembering
> just two locations, we wouldn't need to worry about such a limitation.

I don't think it will fail if we reach max_size for the hash table.
The comment above ShmemInitHash() has this note:

 * max_size is the estimated maximum number of hashtable entries.  This is
 * not a hard limit, but the access efficiency will degrade if it is
 * exceeded substantially (since it's used to compute directory size and
 * the hash table buckets will get overfull).

> Another concern about the concrete patch:
>
> NotifySegmentsReadyForArchive() searches the shared hash, acquiring a
> LWLock, every time XLogWrite is called while segment archiving is being
> held off. I don't think that is acceptable, and I think it could be a
> problem when many backends are competing on WAL.

This is a fair point.  I did some benchmarking with a few hundred
connections all doing writes, and I was not able to discern any
noticeable performance impact.  My guess is that contention on this
new lock is unlikely because callers of XLogWrite() must already hold
WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no
more than once per segment to record new record boundaries.

Nathan
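[Editor's note: for reference, a shared boundary map along these lines would typically be created with something like the following. This is a sketch; the entry layout and table name are illustrative stand-ins, not the patch's exact definitions.]

#include "utils/hsearch.h"
#include "storage/shmem.h"

typedef struct RecordBoundaryEntry
{
	XLogSegNo	seg;			/* hash key: segment whose boundary a record crosses */
	XLogRecPtr	endpos;			/* end LSN of the record crossing into seg + 1 */
} RecordBoundaryEntry;

static HTAB *RecordBoundaryMap = NULL;

static void
InitRecordBoundaryMap(void)
{
	HASHCTL		info;

	MemSet(&info, 0, sizeof(info));
	info.keysize = sizeof(XLogSegNo);
	info.entrysize = sizeof(RecordBoundaryEntry);

	/* max_size is only an estimate; dynahash can grow past it. */
	RecordBoundaryMap = ShmemInitHash("Record Boundary Map",
									  16, 16, &info,
									  HASH_ELEM | HASH_BLOBS);
}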
On 1/2/21, 8:55 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Recently we had a somewhat related incident. Do I understand correctly
> that this incident is related to the bug discussed in this thread?

I'm not sure that we can rule it out, but the log pattern I've
typically seen for this is "invalid contrecord length."  The issue is
that we're marking segments as ready for archive when the segment is
fully written versus when its WAL records are fully written (since its
WAL records may cross into the next segment).  The fact that you're
seeing zeroes at the end of your archived segments leads me to think
it is unlikely that you are experiencing this bug.

Nathan
At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >> > You're right in that regard. There's a window where a partial record is
> >> > written when the write location passes F0 after the insertion location
> >> > passes F1. However, remembering all spanning records seems overkill to me.
> >>
> >> I'm curious why you feel that recording all cross-segment records is
> >> overkill. IMO it seems far simpler to just do that rather than try to
> >
> > Sorry, my words are not enough. Remembering all spanning records in
> > *shared memory* seems to be overkill. Much more so if it is stored in a
> > shared hash table. Even though it is rarely the case, it can fail the
> > hard way when reaching the limit. If we could do well by remembering
> > just two locations, we wouldn't need to worry about such a limitation.
>
> I don't think it will fail if we reach max_size for the hash table.
> The comment above ShmemInitHash() has this note:
>
>  * max_size is the estimated maximum number of hashtable entries.  This is
>  * not a hard limit, but the access efficiency will degrade if it is
>  * exceeded substantially (since it's used to compute directory size and
>  * the hash table buckets will get overfull).

That description means that a shared hash has a directory of fixed
size, thus there may be synonyms, which cause degradation. And even
though buckets are preallocated with the specified number, since the
minimum directory size is 256, at least 256 buckets end up allocated in
the long run. The minimum on-the-fly allocation size is 32. I haven't
calculated further or precisely, but I'm worried about the amount of
spare shared memory the hash can allocate.

> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring a
> > LWLock, every time XLogWrite is called while segment archiving is being
> > held off. I don't think that is acceptable, and I think it could be a
> > problem when many backends are competing on WAL.
>
> This is a fair point.  I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact.  My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.

Thanks. I agree that the reader-reader contention is not a problem due
to the existing serialization by WALWriteLock. Adding an entry happens
only at a segment boundary, so ArchNotifyLock doesn't seem to be a
problem either.

However, the function prolongs the WALWriteLock section. Couldn't we
somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
out of the WALWriteLock section?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > Sorry, my words are not enough. Remembering all spanning records in
>> > *shared memory* seems to be overkill. Much more so if it is stored in a
>> > shared hash table. Even though it is rarely the case, it can fail the
>> > hard way when reaching the limit. If we could do well by remembering
>> > just two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>>  * max_size is the estimated maximum number of hashtable entries.  This is
>>  * not a hard limit, but the access efficiency will degrade if it is
>>  * exceeded substantially (since it's used to compute directory size and
>>  * the hash table buckets will get overfull).
>
> That description means that a shared hash has a directory of fixed
> size, thus there may be synonyms, which cause degradation. And even
> though buckets are preallocated with the specified number, since the
> minimum directory size is 256, at least 256 buckets end up allocated in
> the long run. The minimum on-the-fly allocation size is 32. I haven't
> calculated further or precisely, but I'm worried about the amount of
> spare shared memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, I've been able
to need up to 6 elements in this table, but that required turning off
synchronous_commit, adding a long sleep at the end of XLogWrite(), and
increasing wal_buffers substantially.  This leads me to think that a
max_size of 16 elements is typically sufficient.  (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)

> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring a
> > LWLock, every time XLogWrite is called while segment archiving is being
> > held off. I don't think that is acceptable, and I think it could be a
> > problem when many backends are competing on WAL.
>
> This is a fair point.  I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact.  My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.

> Thanks. I agree that the reader-reader contention is not a problem due
> to the existing serialization by WALWriteLock. Adding an entry happens
> only at a segment boundary, so ArchNotifyLock doesn't seem to be a
> problem either.
>
> However, the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move
NotifySegmentsReadyForArchive() to the callers of XLogWrite() if we
wanted to avoid holding onto WALWriteLock for longer.  Unless we can
point to a measurable performance penalty, I'm not sure this is worth
it.

Nathan
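[Editor's note: the shared-memory cost mentioned above can be checked directly with hash_estimate_size(); a sketch, assuming the RecordBoundaryEntry type from the patch and its max_size of 16:]

#include "utils/hsearch.h"

/*
 * Rough shared-memory footprint of the boundary map, as reported by
 * hash_estimate_size().  The entry size is whatever RecordBoundaryEntry
 * ends up being in the patch.
 */
static Size
EstimateRecordBoundaryMapSize(void)
{
	return hash_estimate_size(16, sizeof(RecordBoundaryEntry));
}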
Here is a rebased version of my patch.  As before, I'm not yet
handling replication, archive_timeout, and persisting
latest-marked-ready through crashes.  If this approach seems
reasonable to others, I'll go ahead and start working on these items.

Nathan
Attachment
Alright, I've attached a new patch set for this.

0001 is similar to the last patch I sent in this thread, although it
contains a few fixes.  The main difference is that we no longer
initialize lastNotifiedSeg in StartupXLOG().  Instead, we initialize it
in XLogWrite() where we previously were creating the archive status
files.  This ensures that standby servers do not create many
unnecessary archive status files after promotion.

0002 adds logic for persisting the last notified segment through
crashes.  This is needed because a poorly-timed crash could otherwise
cause us to skip marking segments as ready-for-archival altogether.
This file is only used for primary servers, as there exists a separate
code path for marking segments as ready-for-archive for standbys.

I considered attempting to prevent this bug from affecting standby
servers by withholding WAL for a segment until the previous segment has
been marked ready-for-archival.  However, that would require us to
track record boundaries even with archiving turned off.  Also, my patch
relied on the assumption that the flush pointer advances along record
boundaries except for records that span multiple segments.  This
assumption is likely not always true, and even if it is, it seems
pretty fragile.  Furthermore, I suspect that there are still problems
with standbys since the code path responsible for creating archive
status files on standbys has even less context about the WAL record
boundaries.  IMO patches 0001 and 0002 should be the focus for now, and
related bugs for standby servers should be picked up in a new thread.

I ended up not touching archive_timeout at all.  The documentation for
this parameter seems to be written ambiguously enough such that any
small differences in behavior with these patches are still acceptable.
I don't expect that users will see much change.  In the worst case, the
timer for archive_timeout may get reset a bit before the segment's
archive status file is created.

Nathan
Attachment
On 2021-Feb-19, Bossart, Nathan wrote:

> 0002 adds logic for persisting the last notified segment through
> crashes.  This is needed because a poorly-timed crash could otherwise
> cause us to skip marking segments as ready-for-archival altogether.
> This file is only used for primary servers, as there exists a separate
> code path for marking segments as ready-for-archive for standbys.

I'm not sure I understand the reason not to store this value in
pg_control; I feel like I'm missing something.  Can you please explain?

There were some comments earlier in the thread about the maximum size
of a record.  As I recall, you can have records of arbitrary size if
you have a COMMIT with a large number of relation invalidation messages
included in the xlog record, or a large number of XIDs of
subtransactions in the transaction.  Spanning several segments is
possible, AFAIU.

-- 
Álvaro Herrera       39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Rebellion is man's original virtue" (Arthur Schopenhauer)
On 7/27/21, 6:05 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Feb-19, Bossart, Nathan wrote:
>
>> 0002 adds logic for persisting the last notified segment through
>> crashes.  This is needed because a poorly-timed crash could otherwise
>> cause us to skip marking segments as ready-for-archival altogether.
>> This file is only used for primary servers, as there exists a separate
>> code path for marking segments as ready-for-archive for standbys.
>
> I'm not sure I understand the reason not to store this value in
> pg_control; I feel like I'm missing something.  Can you please explain?

Thanks for taking a look.

The only reason I can think of is that it could make back-patching
difficult.  I don't mind working on a version of the patch that uses
pg_control.  Back-patching this fix might be a stretch, anyway.

> There were some comments earlier in the thread about the maximum size
> of a record.  As I recall, you can have records of arbitrary size if
> you have a COMMIT with a large number of relation invalidation messages
> included in the xlog record, or a large number of XIDs of
> subtransactions in the transaction.  Spanning several segments is
> possible, AFAIU.

This is my understanding, too.

Nathan
On 2021-Jul-28, Bossart, Nathan wrote:

> On 7/27/21, 6:05 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > I'm not sure I understand the reason not to store this value in
> > pg_control; I feel like I'm missing something.  Can you please explain?
>
> Thanks for taking a look.
>
> The only reason I can think of is that it could make back-patching
> difficult.  I don't mind working on a version of the patch that uses
> pg_control.  Back-patching this fix might be a stretch, anyway.

Hmm ... I'm not sure we're prepared to backpatch this kind of change.
It seems a bit too disruptive to how replay works.  I think we should
be focusing solely on patch 0001 to surgically fix the precise bug you
see.  Does patch 0002 exist because you think that a system with only
0001 will not correctly deal with a crash at the right time?

Now, the reason I'm looking at this patch series is that we're seeing a
related problem with walsender/walreceiver, which apparently are capable
of creating a file on the replica that ends up not existing on the
primary after a crash, for a reason closely related to what you
describe for WAL archival.  I'm not sure what is going on just yet, so
I'm not going to try to explain, because I'm likely to get it wrong.

-- 
Álvaro Herrera              PostgreSQL Developer  —  https://www.EnterpriseDB.com/
On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> It seems a bit too disruptive to how replay works.  I think we should
> be focusing solely on patch 0001 to surgically fix the precise bug you
> see.  Does patch 0002 exist because you think that a system with only
> 0001 will not correctly deal with a crash at the right time?

Yes, that was what I was worried about.  However, I just performed a
variety of tests with just 0001 applied, and I am beginning to suspect
my concerns were unfounded.  With wal_buffers set very high,
synchronous_commit set to off, and a long sleep at the end of
XLogWrite(), I can reliably cause the archive status files to lag far
behind the current open WAL segment.  However, even if I crash at this
time, the .ready files are created when the server restarts (albeit
out of order).  This appears to be due to the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
likely abandon 0002.

> Now, the reason I'm looking at this patch series is that we're seeing a
> related problem with walsender/walreceiver, which apparently are capable
> of creating a file on the replica that ends up not existing on the
> primary after a crash, for a reason closely related to what you
> describe for WAL archival.  I'm not sure what is going on just yet, so
> I'm not going to try to explain, because I'm likely to get it wrong.

I've suspected that this is due to the use of the flushed location for
the send pointer, which AFAICT needn't align with a WAL record
boundary.

		/*
		 * Streaming the current timeline on a primary.
		 *
		 * Attempt to send all data that's already been written out and
		 * fsync'd to disk.  We cannot go further than what's been written out
		 * given the current implementation of WALRead().  And in any case
		 * it's unsafe to send WAL that is not securely down to disk on the
		 * primary: if the primary subsequently crashes and restarts, standbys
		 * must not have applied any WAL that got lost on the primary.
		 */
		SendRqstPtr = GetFlushRecPtr();

Nathan
On 2021-Jul-30, Bossart, Nathan wrote:

> On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> > It seems a bit too disruptive to how replay works.  I think we should
> > be focusing solely on patch 0001 to surgically fix the precise bug you
> > see.  Does patch 0002 exist because you think that a system with only
> > 0001 will not correctly deal with a crash at the right time?
>
> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).  This appears to be due to the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
> likely abandon 0002.

That's great to hear.  I'll give 0001 a look again.

> > Now, the reason I'm looking at this patch series is that we're seeing a
> > related problem with walsender/walreceiver, which apparently are capable
> > of creating a file on the replica that ends up not existing on the
> > primary after a crash, for a reason closely related to what you
> > describe for WAL archival.  I'm not sure what is going on just yet, so
> > I'm not going to try to explain, because I'm likely to get it wrong.
>
> I've suspected that this is due to the use of the flushed location for
> the send pointer, which AFAICT needn't align with a WAL record
> boundary.

Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked
down what happens with a contrecord.

-- 
Álvaro Herrera              PostgreSQL Developer  —  https://www.EnterpriseDB.com/
On 7/30/21, 3:23 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> That's great to hear.  I'll give 0001 a look again.

Much appreciated.

Nathan
On 2021-Jul-30, Bossart, Nathan wrote:

> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).

I think that creating files out of order might be problematic.  On the
archiver side, pgarch_readyXlog() expects to return the oldest
archivable file; but if we create a newer segment's .ready file first,
it is possible that a directory scan would return that newer file before
the older segment's .ready file appears.

However, the comments in pgarch_readyXlog() aren't super convincing that
processing the files in order is actually a correctness requirement, so
perhaps it doesn't matter all that much.

I noticed that XLogCtl->lastNotifiedSeg is protected by both the
info_lck and ArchNotifyLock.  I think if it's going to be protected by
the lwlock, then we should drop the use of the spinlock.

We set the archiver's latch on each XLogArchiveNotify(), but if we're
doing it in a loop such as in NotifySegmentsReadyForArchive(), perhaps
it is better to create all the .ready files first and do PgArchWakeup()
at the end.  I'm not convinced that this is useful, but let's at least
discard the idea explicitly if not.

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."                                (Julien PUYDT)
Attachment
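[Editor's note: the batching idea above might look roughly like the following. This is only a sketch: XLogArchiveNotifySeg() in the tree wakes the archiver itself, so the no-wakeup variant used here is hypothetical and exists only for illustration.]

/*
 * Sketch: create all pending .ready files first, then wake the
 * archiver once at the end instead of once per segment.
 */
static void
NotifySegmentsUpTo(XLogSegNo notifySegNo, XLogSegNo lastNotified)
{
	XLogSegNo	seg;

	for (seg = lastNotified + 1; seg <= notifySegNo; seg++)
		XLogArchiveNotifySegNoWakeup(seg);	/* hypothetical variant */

	PgArchWakeup();				/* single latch set at the end */
}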
On 2021-Jul-30, Alvaro Herrera wrote:

> We set the archiver's latch on each XLogArchiveNotify(), but if we're
> doing it in a loop such as in NotifySegmentsReadyForArchive(), perhaps
> it is better to create all the .ready files first and do PgArchWakeup()
> at the end.  I'm not convinced that this is useful, but let's at least
> discard the idea explicitly if not.

Hm, this causes an ABI change, so it's not backpatchable.

-- 
Álvaro Herrera       39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> I think that creating files out of order might be problematic.  On the
> archiver side, pgarch_readyXlog() expects to return the oldest
> archivable file; but if we create a newer segment's .ready file first,
> it is possible that a directory scan would return that newer file before
> the older segment's .ready file appears.
>
> However, the comments in pgarch_readyXlog() aren't super convincing that
> processing the files in order is actually a correctness requirement, so
> perhaps it doesn't matter all that much.

I can't think of a reason it'd be needed from a correctness
perspective.  After a quick scan, I couldn't find any promises about
archival order in the documentation, either.  In any case, it doesn't
look like there's a risk that the archiver will skip files when the
.ready files are created out of order.

> I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> info_lck and ArchNotifyLock.  I think if it's going to be protected by
> the lwlock, then we should drop the use of the spinlock.

That seems reasonable to me.  This means that the lock is acquired at
the end of every XLogWrite(), but the other places that acquire the
lock only do so once per WAL segment.  Plus, the call to
NotifySegmentsReadyForArchive() at the end of every XLogWrite() should
usually only need the lock for a short amount of time to retrieve a
value from shared memory.

> We set the archiver's latch on each XLogArchiveNotify(), but if we're
> doing it in a loop such as in NotifySegmentsReadyForArchive(), perhaps
> it is better to create all the .ready files first and do PgArchWakeup()
> at the end.  I'm not convinced that this is useful, but let's at least
> discard the idea explicitly if not.

I don't have a terribly strong opinion, but I would lean towards
setting the latch for each call to XLogArchiveNotify() so that the
archiver process can get started as soon as a segment is ready.
However, I doubt that holding off until the end of the loop has any
discernible effect in most cases.

Nathan
On 2021-Jul-31, Bossart, Nathan wrote:

> On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
> > the lwlock, then we should drop the use of the spinlock.
>
> That seems reasonable to me.  This means that the lock is acquired at
> the end of every XLogWrite(),

Uhh, actually that sounds really bad; it's going to be a serious
contention point.

Another option might be to make it an atomic.

-- 
Álvaro Herrera              PostgreSQL Developer  —  https://www.EnterpriseDB.com/
On 7/31/21, 9:12 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Jul-31, Bossart, Nathan wrote:
>
>> On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
>>
>> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
>> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
>> > the lwlock, then we should drop the use of the spinlock.
>>
>> That seems reasonable to me.  This means that the lock is acquired at
>> the end of every XLogWrite(),
>
> Uhh, actually that sounds really bad; it's going to be a serious
> contention point.
>
> Another option might be to make it an atomic.

This is why I was trying to get away with just using info_lck for
reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
RecordBoundaryMap.  However, since lastNotifiedSeg is used in
functions like GetLatestRecordBoundarySegment() that access the map, I
found it easier to reason about things if I knew that it wouldn't
change as long as I held ArchNotifyLock.

I think the main downside of making lastNotifiedSeg an atomic is that
the value we first read in NotifySegmentsReadyForArchive() might not
be up-to-date, which means we might hold off creating .ready files
longer than necessary.

Nathan
On 2021-Jul-31, Bossart, Nathan wrote:

> This is why I was trying to get away with just using info_lck for
> reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
> RecordBoundaryMap.  However, since lastNotifiedSeg is used in
> functions like GetLatestRecordBoundarySegment() that access the map, I
> found it easier to reason about things if I knew that it wouldn't
> change as long as I held ArchNotifyLock.

I think it's okay to make lastNotifiedSeg protected by just info_lck,
and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
acquire the spinlock inside the lwlock-protected area, as long as we
make sure never to do the opposite.  (And we sure don't want to hold
info_lck long enough that a LWLock acquisition would occur in the
meantime.)  So I modified things that way, and also added another
function to set the seg if it's unset, with a single spinlock
acquisition (rather than acquire, read, release, acquire, set,
release, which sounds like it would have trouble behaving.)

I haven't tried your repro with this yet.

I find it highly suspicious that the patch does an archiver notify
(i.e. creation of the .ready file) in XLogInsertRecord().  Is that a
sane thing to do?  Sounds to me like that should be attempted in
XLogFlush only.  This appeared after Kyotaro's patch at [1] and before
your patch at [2].

[1] https://postgr.es/m/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
[2] https://postgr.es/m/EFF40306-8E8A-4259-B181-C84F3F06636C@amazon.com

I also just realized that Kyotaro's patch there also tried to handle
the streaming replication issue I was talking about.

> I think the main downside of making lastNotifiedSeg an atomic is that
> the value we first read in NotifySegmentsReadyForArchive() might not
> be up-to-date, which means we might hold off creating .ready files
> longer than necessary.

I'm not sure I understand how this would be a problem.  If we block
somebody from setting a newer value, they'll just set the value
immediately after we release the lock.  Will we reread the value
afterwards to see if it changed?

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
Attachment
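[Editor's note: the lock-nesting rule described above, in sketch form. The names follow the patch (ArchNotifyLock is patch-defined, not in the tree); the function is illustrative only.]

static XLogSegNo
ReadLastNotifiedSegLocked(void)
{
	XLogSegNo	last_notified;

	/* OK: the spinlock nests briefly inside the LWLock section. */
	LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);

	SpinLockAcquire(&XLogCtl->info_lck);
	last_notified = XLogCtl->lastNotifiedSeg;
	SpinLockRelease(&XLogCtl->info_lck);

	/* ... RecordBoundaryMap work would go here, still under the LWLock ... */

	LWLockRelease(ArchNotifyLock);

	/*
	 * The opposite nesting (LWLockAcquire while holding info_lck) is
	 * forbidden: LWLock acquisition can sleep, and spinlock holders
	 * must never sleep.
	 */
	return last_notified;
}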
On 8/2/21, 2:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> I think it's okay to make lastNotifiedSeg protected by just info_lck,
> and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> acquire the spinlock inside the lwlock-protected area, as long as we
> make sure never to do the opposite.  (And we sure don't want to hold
> info_lck long enough that a LWLock acquisition would occur in the
> meantime.)  So I modified things that way, and also added another
> function to set the seg if it's unset, with a single spinlock
> acquisition (rather than acquire, read, release, acquire, set,
> release, which sounds like it would have trouble behaving.)

The patch looks good to me.

> I find it highly suspicious that the patch does an archiver notify
> (i.e. creation of the .ready file) in XLogInsertRecord().  Is that a
> sane thing to do?  Sounds to me like that should be attempted in
> XLogFlush only.  This appeared after Kyotaro's patch at [1] and before
> your patch at [2].

I believe my worry was that we'd miss notifying a segment as soon as
possible if the record was somehow flushed prior to registering the
record boundary in the map.  If that's actually impossible, then I
would agree that the extra call to NotifySegmentsReadyForArchive() is
unnecessary.

>> I think the main downside of making lastNotifiedSeg an atomic is that
>> the value we first read in NotifySegmentsReadyForArchive() might not
>> be up-to-date, which means we might hold off creating .ready files
>> longer than necessary.
>
> I'm not sure I understand how this would be a problem.  If we block
> somebody from setting a newer value, they'll just set the value
> immediately after we release the lock.  Will we reread the value
> afterwards to see if it changed?

I think you are right.  If we see an old value for lastNotifiedSeg,
the worst case is that we take the ArchNotifyLock, read
lastNotifiedSeg again (which should then be up-to-date), and then
basically do nothing.  I suspect initializing lastNotifiedSeg might
still be a little tricky, though.  Do you think it is important to try
to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
end of every call to XLogWrite() already, and we'd still need to
acquire it for the flush pointer, anyway.

Nathan
At Mon, 2 Aug 2021 23:28:19 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 8/2/21, 2:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > I think it's okay to make lastNotifiedSeg protected by just info_lck,
> > and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> > acquire the spinlock inside the lwlock-protected area, as long as we
> > make sure never to do the opposite.  (And we sure don't want to hold
> > info_lck long enough that a LWLock acquisition would occur in the
> > meantime.)  So I modified things that way, and also added another
> > function to set the seg if it's unset, with a single spinlock
> > acquisition (rather than acquire, read, release, acquire, set,
> > release, which sounds like it would have trouble behaving.)
>
> The patch looks good to me.

+	for (seg = flushed_seg; seg > last_notified; seg--)
+	{
+		RecordBoundaryEntry *entry;
+
+		entry = (RecordBoundaryEntry *) hash_search(RecordBoundaryMap,
+													(void *) &seg, HASH_FIND,

I'm afraid that using a hash to store boundary info is too much.  Isn't
a ring buffer enough for this use?  In that case it is enough to
remember only the end LSNs of the segment-spanning records.  It is easy
to expand the buffer if needed, as shown in the sketch after this mail.

+	if (!XLogSegNoIsInvalid(latest_boundary_seg))

It is a matter of taste, but I see latest_boundary_seg != InvalidXLogSegNo
more frequently, maybe to avoid double negation.

@@ -1167,10 +1195,33 @@ XLogInsertRecord(XLogRecData *rdata,
 		SpinLockRelease(&XLogCtl->info_lck);
 	}

+	/*
+	 * Record the record boundary if we crossed the segment boundary. This is
...
+	XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+	XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+	if (StartSeg != EndSeg && XLogArchivingActive())
+	{

The immediately preceding if block is for cross-page records.  So we
can reduce the overhead of the above calculations by moving this into
the preceding if block.

+RegisterRecordBoundaryEntry(XLogSegNo seg, XLogRecPtr pos)

The seg is restricted to the segment that pos resides on.  The caller
is freed from caring about that restriction if the function takes only
pos.  It adds a small overhead to calculate the segment number from the
LSN, but I think it doesn't matter so much.  (Or, if we don't use a
hash, that calculation is not required at all.)

@@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */

 				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogSegNo);
+					SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Is it safe?  If the server wasn't notified of the WAL files for the
three most recently finished segments in its previous life, they need
to be archived in this lifetime.  But this maybe omits all of the
three.  (I didn't confirm that behavior..)

> > I find it highly suspicious that the patch does an archiver notify
> > (i.e. creation of the .ready file) in XLogInsertRecord().  Is that a
> > sane thing to do?  Sounds to me like that should be attempted in
> > XLogFlush only.  This appeared after Kyotaro's patch at [1] and before
> > your patch at [2].
>
> I believe my worry was that we'd miss notifying a segment as soon as
> possible if the record was somehow flushed prior to registering the
> record boundary in the map.  If that's actually impossible, then I
> would agree that the extra call to NotifySegmentsReadyForArchive() is
> unnecessary.

I don't think that XLogWrite(up to LSN=X) can happen before
XLogInsert(endpos = X) ends.

> >> I think the main downside of making lastNotifiedSeg an atomic is that
> >> the value we first read in NotifySegmentsReadyForArchive() might not
> >> be up-to-date, which means we might hold off creating .ready files
> >> longer than necessary.
> >
> > I'm not sure I understand how this would be a problem.  If we block
> > somebody from setting a newer value, they'll just set the value
> > immediately after we release the lock.  Will we reread the value
> > afterwards to see if it changed?
>
> I think you are right.  If we see an old value for lastNotifiedSeg,
> the worst case is that we take the ArchNotifyLock, read
> lastNotifiedSeg again (which should then be up-to-date), and then

Agreed.

> basically do nothing.  I suspect initializing lastNotifiedSeg might
> still be a little tricky, though.  Do you think it is important to try
> to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
> end of every call to XLogWrite() already, and we'd still need to
> acquire it for the flush pointer, anyway.

As mentioned above, I think it needs more consideration.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
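[Editor's note: a minimal standalone sketch of the ring-buffer alternative suggested above, with simplified stand-in types and no locking; the names and fixed size are illustrative, not from any posted patch.]

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* simplified stand-in */

#define BOUNDARY_RING_SIZE 16

/* Fixed-size ring of end LSNs of segment-spanning records. */
typedef struct BoundaryRing
{
	XLogRecPtr	endpos[BOUNDARY_RING_SIZE];
	int			head;			/* next slot to fill */
	int			count;			/* entries currently stored */
} BoundaryRing;

static void
ring_add(BoundaryRing *ring, XLogRecPtr endpos)
{
	ring->endpos[ring->head] = endpos;
	ring->head = (ring->head + 1) % BOUNDARY_RING_SIZE;
	if (ring->count < BOUNDARY_RING_SIZE)
		ring->count++;			/* oldest entry is overwritten when full */
}

/* Largest remembered record end at or below flush_pos, if any. */
static bool
ring_latest_flushed(const BoundaryRing *ring, XLogRecPtr flush_pos,
					XLogRecPtr *result)
{
	bool		found = false;
	int			i;

	for (i = 0; i < ring->count; i++)
	{
		XLogRecPtr	end = ring->endpos[i];

		if (end <= flush_pos && (!found || end > *result))
		{
			*result = end;
			found = true;
		}
	}
	return found;
}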
On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> I'm afraid that using a hash to store boundary info is too much.  Isn't
> a ring buffer enough for this use?  In that case it is enough to
> remember only the end LSNs of the segment-spanning records.  It is easy
> to expand the buffer if needed.

I agree that the hash table requires a bit more memory than what is
probably necessary, but I'm not sure I agree that maintaining a custom
data structure to save a few kilobytes of memory is worth the effort.

> @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> 				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
>
> 				if (XLogArchivingActive())
> -					XLogArchiveNotifySeg(openLogSegNo);
> +					SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
>
> Is it safe?  If the server wasn't notified of the WAL files for the
> three most recently finished segments in its previous life, they need
> to be archived in this lifetime.  But this maybe omits all of the
> three.  (I didn't confirm that behavior..)

I tested this scenario out earlier [0].  It looks like the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
creating any .ready files we missed.

>> I believe my worry was that we'd miss notifying a segment as soon as
>> possible if the record was somehow flushed prior to registering the
>> record boundary in the map.  If that's actually impossible, then I
>> would agree that the extra call to NotifySegmentsReadyForArchive() is
>> unnecessary.
>
> I don't think that XLogWrite(up to LSN=X) can happen before
> XLogInsert(endpos = X) ends.

Is there anything preventing that from happening?  At the location
where we are registering the record boundary, we've already called
CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
WALWriteLock are held.  Even if we register the boundary before
updating the shared LogwrtRqst.Write, there's a chance that someone
else has already moved it ahead and called XLogWrite().  I think the
worst case scenario is that we hold off creating .ready files longer
than necessary, but IMO that's still a worthwhile thing to do.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com
At Tue, 3 Aug 2021 21:32:18 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > I'm afraid that using a hash to store boundary info is too much.  Isn't
> > a ring buffer enough for this use?  In that case it is enough to
> > remember only the end LSNs of the segment-spanning records.  It is easy
> > to expand the buffer if needed.
>
> I agree that the hash table requires a bit more memory than what is
> probably necessary, but I'm not sure I agree that maintaining a custom
> data structure to save a few kilobytes of memory is worth the effort.

Memory is one of my concerns, but the more significant point was the
CPU cycles required by GetLatestRecordBoundarySegment.  So I don't mind
it using a hash if the loop over the hash doesn't block other backends.

In addition to that, while NotifySegmentsReadyForArchive() is notifying
pending segments, other backends that simultaneously reach there are
blocked until the notification, including file creation, finishes.  I
don't think that's great.  Couldn't we set lastNotifiedSegment before
the loop?  At the moment a backend decides to notify some segments,
others no longer need to consider those segments.  Even if the backend
crashes meanwhile, as you mentioned below, it's safe since the
unnotified segments are notified after restart.  (See the sketch after
this mail.)

> > @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> > 				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
> >
> > 				if (XLogArchivingActive())
> > -					XLogArchiveNotifySeg(openLogSegNo);
> > +					SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
> >
> > Is it safe?  If the server wasn't notified of the WAL files for the
> > three most recently finished segments in its previous life, they need
> > to be archived in this lifetime.  But this maybe omits all of the
> > three.  (I didn't confirm that behavior..)
>
> I tested this scenario out earlier [0].  It looks like the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
> creating any .ready files we missed.

Yeah, I recalled that behavior.  In that case, crash recovery reads up
to just before the last (continued) record in the last finished
segment.  On the other hand, if crash recovery was able to read that
record, it's safe to archive the last segment immediately after
recovery.  So that behavior is safe.  Thanks!

> >> I believe my worry was that we'd miss notifying a segment as soon as
> >> possible if the record was somehow flushed prior to registering the
> >> record boundary in the map.  If that's actually impossible, then I
> >> would agree that the extra call to NotifySegmentsReadyForArchive() is
> >> unnecessary.
> >
> > I don't think that XLogWrite(up to LSN=X) can happen before
> > XLogInsert(endpos = X) ends.
>
> Is there anything preventing that from happening?  At the location
> where we are registering the record boundary, we've already called
> CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
> WALWriteLock are held.  Even if we register the boundary before
> updating the shared LogwrtRqst.Write, there's a chance that someone
> else has already moved it ahead and called XLogWrite().  I think the
> worst case scenario is that we hold off creating .ready files longer
> than necessary, but IMO that's still a worthwhile thing to do.

Oh, boundary registration actually happens after an insertion ends
(but before XLogInsert ends).  The missed segment is never processed
due to the check against lastNotifiedSeg.

Does it work that RegisterRecordBoundaryEntry omits registering the
boundary if it finds lastNotifiedSeg has gone too far?

> Nathan
>
> [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
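[Editor's note: in sketch form, the reordering being suggested above: advance lastNotifiedSeg before creating the status files. Names follow the patch; the exact code is illustrative only.]

static void
NotifySegmentsUpTo(XLogSegNo notifySegNo)
{
	XLogSegNo	first;
	XLogSegNo	seg;

	/* Claim the range first, so concurrent callers skip these segments. */
	SpinLockAcquire(&XLogCtl->info_lck);
	first = XLogCtl->lastNotifiedSeg + 1;
	if (XLogCtl->lastNotifiedSeg < notifySegNo)
		XLogCtl->lastNotifiedSeg = notifySegNo;
	SpinLockRelease(&XLogCtl->info_lck);

	/*
	 * Create the status files after releasing the lock.  A crash here
	 * is repaired after restart by XLogArchiveCheckDone() during
	 * RemoveOldXlogFiles(), as discussed above.
	 */
	for (seg = first; seg <= notifySegNo; seg++)
		XLogArchiveNotifySeg(seg);
}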
By the way, about the v3 patch:

+#define InvalidXLogSegNo	((XLogSegNo) 0xFFFFFFFFFFFFFFFF)

Like InvalidXLogRecPtr, the first valid segment number is 1, so we can
use 0 as InvalidXLogSegNo.

BootStrapXLOG():
	/* Create first XLOG segment file */
	openLogFile = XLogFileInit(1);

KeepLogSeg():
	/* avoid underflow, don't go below 1 */
	if (currSegNo <= keep_segs)
		segno = 1;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/4/21, 6:58 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> In addition to that, while NotifySegmentsReadyForArchive() is notifying
> pending segments, other backends that simultaneously reach there are
> blocked until the notification, including file creation, finishes.  I
> don't think that's great.  Couldn't we set lastNotifiedSegment before
> the loop?  At the moment a backend decides to notify some segments,
> others no longer need to consider those segments.  Even if the backend
> crashes meanwhile, as you mentioned below, it's safe since the
> unnotified segments are notified after restart.

That seems reasonable to me.  It looks like we rely on
RemoveOldXlogFiles() even today for when XLogArchiveNotify() fails.  I
updated this in v4 of the patch.  In addition to this change, I also
addressed your other feedback by changing XLogSegNoIsInvalid() to
XLogSegNoIsValid() and by moving record boundary registration to the
"if" block for cross-page records.

> Does it work that RegisterRecordBoundaryEntry omits registering the
> boundary if it finds lastNotifiedSeg has gone too far?

Yeah, there's no reason to add a record boundary if we've already
notified the prior segment.  For that to happen, another cross-segment
record would have to be flushed to disk and
NotifySegmentsReadyForArchive() would have to be called before
registering the boundary.  With that being said, I don't expect an
extra map entry here and there to impact performance enough for us to
worry about it.

Nathan
Attachment
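For concreteness, the rename discussed above amounts to something like the following. This is a sketch based on the v3/v4 discussion; the exact macro spellings are assumptions rather than quotes from the patch:

	/* "uint64" is PostgreSQL's c.h typedef, shown here for self-containment. */
	typedef uint64 XLogSegNo;

	/*
	 * v3's sentinel is all ones rather than 0, because segment number 0 can
	 * arise from "openLogSegNo - 1" (see the next messages in the thread).
	 */
	#define InvalidXLogSegNo	((XLogSegNo) 0xFFFFFFFFFFFFFFFF)

	/* v4 flips the sense of the test: XLogSegNoIsInvalid() becomes */
	#define XLogSegNoIsValid(segno)	((segno) != InvalidXLogSegNo)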
On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> By the way, about the v3 patch:
>
> +#define InvalidXLogSegNo ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
>
> As with InvalidXLogRecPtr, the first valid segment number is 1, so we
> could use 0 as InvalidXLogSegNo.

It's been a while since I wrote this, but if I remember correctly, the issue with using 0 is that we could end up initializing lastNotifiedSeg to InvalidXLogSegNo in XLogWrite(). Eventually, we'd initialize it to 1, but we will have skipped creating the .ready file for the first segment.

Nathan
At Thu, 5 Aug 2021 05:15:04 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > By the way, about the v3 patch:
> >
> > +#define InvalidXLogSegNo ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
> >
> > As with InvalidXLogRecPtr, the first valid segment number is 1, so we
> > could use 0 as InvalidXLogSegNo.
>
> It's been a while since I wrote this, but if I remember correctly, the
> issue with using 0 is that we could end up initializing
> lastNotifiedSeg to InvalidXLogSegNo in XLogWrite(). Eventually, we'd
> initialize it to 1, but we will have skipped creating the .ready file
> for the first segment.

Maybe this?

+ SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Hmm. Theoretically 0 is invalid as a segment number, so we'd better not use 0 as a valid value of lastNotifiedSeg.

Honestly, I don't like having this initialization in XLogWrite(). We should, and I think can, initialize it earlier. The most appropriate timing to initialize the variable seems to be just before running the end-of-recovery checkpoint, since StartupXLOG() knows the first segment to write. If it were set to 0 there, that wouldn't matter at all, and we could get rid of the new symbol. Maybe something like this:

>	{
>		/*
>		 * There is no partial block to copy. Just set InitializedUpTo, and
>		 * let the first attempt to insert a log record to initialize the next
>		 * buffer.
>		 */
>		XLogCtl->InitializedUpTo = EndOfLog;
>	}

+	/*
+	 * EndOfLog resides on the next segment of the last finished one. Set the
+	 * last finished segment as lastNotifiedSeg now. In the case where the
+	 * last crash has left the last several segments not marked as .ready,
+	 * the checkpoint just after does that for all finished segments.
+	 * There's a corner case where the checkpoint advances the segment, but
+	 * that ends up at most with a duplicate archive notification.
+	 */
+	XLByteToSeg(EndOfLog, EndOfLogSeg, wal_segment_size);
+	Assert(EndOfLogSeg > 0);
+	SetLastNotifiedSegment(EndOfLogSeg - 1);

>	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

Does this make sense?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/5/21, 12:39 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> Honestly, I don't like having this initialization in XLogWrite(). We
> should, and I think can, initialize it earlier. The most appropriate
> timing to initialize the variable seems to be just before running the
> end-of-recovery checkpoint, since StartupXLOG() knows the first segment
> to write. If it were set to 0 there, that wouldn't matter at all, and
> we could get rid of the new symbol.

This seems like a good idea to me. I made this change in v5. I performed some basic testing, and it seems to reliably initialize lastNotifiedSeg correctly.

> + /*
> + * EndOfLog resides on the next segment of the last finished one. Set the
> + * last finished segment as lastNotifiedSeg now. In the case where the
> + * last crash has left the last several segments not marked as .ready,
> + * the checkpoint just after does that for all finished segments.
> + * There's a corner case where the checkpoint advances the segment, but
> + * that ends up at most with a duplicate archive notification.
> + */

I'm not quite following the corner case you've described here. Is it possible that the segment that EndOfLog points to will be eligible for removal after the checkpoint?

In v5 of the patch, I've also added an extra call to NotifySegmentsReadyForArchive() in the same place we previously created the .ready files. I think this helps notify the archiver sooner in certain cases (e.g., asynchronous commit).

Nathan
Attachment
At Fri, 6 Aug 2021 00:21:34 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 8/5/21, 12:39 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > Honestly, I don't like having this initialization in XLogWrite(). We
> > should, and I think can, initialize it earlier. The most appropriate
> > timing to initialize the variable seems to be just before running the
> > end-of-recovery checkpoint, since StartupXLOG() knows the first segment
> > to write. If it were set to 0 there, that wouldn't matter at all, and
> > we could get rid of the new symbol.
>
> This seems like a good idea to me. I made this change in v5. I
> performed some basic testing, and it seems to reliably initialize
> lastNotifiedSeg correctly.
>
> > + /*
> > + * EndOfLog resides on the next segment of the last finished one. Set the
> > + * last finished segment as lastNotifiedSeg now. In the case where the
> > + * last crash has left the last several segments not marked as .ready,
> > + * the checkpoint just after does that for all finished segments.
> > + * There's a corner case where the checkpoint advances the segment, but
> > + * that ends up at most with a duplicate archive notification.
> > + */
>
> I'm not quite following the corner case you've described here. Is it
> possible that the segment that EndOfLog points to will be eligible for
> removal after the checkpoint?

Archiving doesn't immediately mean removal; a finished segment ought to be archived right away. Since the EndOfLog segment must not get marked .ready, setting lastNotifiedSeg to the previous segment is quite right, but if the end-of-recovery checkpoint advances the segment, EndOfLog is marked .ready at the XLogFlush() just after. But, sorry, what I forgot at the time was that the checkpoint also moves lastNotifiedSeg. So that corner case does not exist.

> In v5 of the patch, I've also added an extra call to
> NotifySegmentsReadyForArchive() in the same place we previously
> created the .ready files. I think this helps notify the archiver
> sooner in certain cases (e.g., asynchronous commit).

In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock across the .ready file creations. Since the notification loop doesn't need the hash itself, couldn't the loop be taken out of the locked section?

Current:

	LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
	last_notified = GetLastNotifiedSegment();
	latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified,
														 flushed, &found);
	if (found)
	{
		SetLastNotifiedSegment(latest_boundary_seg - 1);
		for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
			XLogArchiveNotifySeg(seg, false);
		RemoveRecordBoundariesUpTo(latest_boundary_seg);
		PgArchWakeup();
	}
	LWLockRelease(ArchNotifyLock);

But we can release the lock earlier:

	LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
	last_notified = GetLastNotifiedSegment();
	latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified,
														 flushed, &found);
	if (found)
	{
		SetLastNotifiedSegment(latest_boundary_seg - 1);
		RemoveRecordBoundariesUpTo(latest_boundary_seg);
	}
	LWLockRelease(ArchNotifyLock);

	if (found)
	{
		for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
			XLogArchiveNotifySeg(seg, false);
		PgArchWakeup();
	}

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/6/21, 12:42 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock
> across the .ready file creations. Since the notification loop doesn't
> need the hash itself, couldn't the loop be taken out of the locked
> section?

I think that works. This creates another opportunity for archive status files to be created out of order, but as discussed elsewhere, I think we have to be prepared for that regardless. I moved the notification loop out of the lock section in v6.

Nathan
Attachment
So why do we call this structure "record boundary map", when the boundaries it refers to are segment boundaries? I think we should call it "segment boundary map" instead ... and also I think we should use that name in the lock that protects it, so instead of ArchNotifyLock, it could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock.

The reason for the latter is that I suspect the segment boundary map will also have a use in fixing the streaming replication issue I mentioned earlier in the thread. This also makes me think that we'll want the WAL record *start* address to be in the hash table too, not just its *end* address. So we'll use start-1 as the position to send, rather than the end-of-segment, when GetFlushRecPtr() returns that.

As for 0xFFFFFFFFFFFFFFFF, I think it would be cleaner to do a #define MaxXLogSegNo with that value in the line immediately after the typedef of XLogSegNo, rather than use the numerical value directly in the assignment.

Typo in the comment atop RemoveRecordBoundariesUpTo: it reads "up to an", should read "up to and".

I think the API of GetLatestRecordBoundarySegment would be better if it returned the boolean and took the segment as an out argument. Then the caller could be written more cleanly:

	if (GetLatestRecordBoundarySegment(last_notified, flushed,
									   &latest_boundary_segment))
	{
		SetLastNotified( ... );
		RemoveRecordBoundaries( ... );
		LWLockRelease( ... );
		for (..)
			XLogArchiveNotifySeg(...);
		PgArchWakeup();
	}
	else
		LWLockRelease(...);

--
Álvaro Herrera	39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Attached is a new version of the patch with all feedback addressed. On 8/16/21, 5:09 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > The reason for the latter is that I suspect the segment boundary map > will also have a use to fix the streaming replication issue I mentioned > earlier in the thread. This also makes me think that we'll want the wal > record *start* address to be in the hash table too, not just its *end* > address. So we'll use the start-1 as position to send, rather than the > end-of-segment when GetFlushRecPtr() returns that. I've been thinking about the next steps for this one, too. ISTM we'll need to basically assume that the flush pointer jumps along record boundaries except for the cross-segment records. I don't know if that is the safest assumption, but I think the alternative involves recording every record boundary in the map. Nathan
Attachment
On 2021-Aug-17, Bossart, Nathan wrote: > On 8/16/21, 5:09 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > > The reason for the latter is that I suspect the segment boundary map > > will also have a use to fix the streaming replication issue I mentioned > > earlier in the thread. This also makes me think that we'll want the wal > > record *start* address to be in the hash table too, not just its *end* > > address. So we'll use the start-1 as position to send, rather than the > > end-of-segment when GetFlushRecPtr() returns that. > > I've been thinking about the next steps for this one, too. ISTM we'll > need to basically assume that the flush pointer jumps along record > boundaries except for the cross-segment records. I don't know if that > is the safest assumption, but I think the alternative involves > recording every record boundary in the map. I'm not sure I understand your idea correctly. Perhaps another solution is to assume that the flush pointer jumps along record boundaries *including* for cross-segment records. The problem stems precisely from the fact that we set the flush pointer at segment boundaries, even when they aren't record boundary. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'. After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree. (Don Knuth)
On 8/17/21, 10:44 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-17, Bossart, Nathan wrote: >> I've been thinking about the next steps for this one, too. ISTM we'll >> need to basically assume that the flush pointer jumps along record >> boundaries except for the cross-segment records. I don't know if that >> is the safest assumption, but I think the alternative involves >> recording every record boundary in the map. > > I'm not sure I understand your idea correctly. Perhaps another solution > is to assume that the flush pointer jumps along record boundaries > *including* for cross-segment records. The problem stems precisely from > the fact that we set the flush pointer at segment boundaries, even when > they aren't record boundary. I think we are in agreement. If we assume that the flush pointer jumps along record boundaries and segment boundaries, the solution would be to avoid using the flush pointer when it points to a segment boundary (given that the segment boundary is not also a record boundary). Instead, we'd only send up to the start position of the last record in the segment to standbys. Nathan
On 2021-Aug-17, Bossart, Nathan wrote:

> I think we are in agreement. If we assume that the flush pointer
> jumps along record boundaries and segment boundaries, the solution
> would be to avoid using the flush pointer when it points to a segment
> boundary (given that the segment boundary is not also a record
> boundary). Instead, we'd only send up to the start position of the
> last record in the segment to standbys.

Agreed. One implementation of that would be to test whether the flush pointer is at a segment boundary and, in that case, (acquire the segment boundary lock and) test for its presence in the segment boundary map. If present, then retreat the pointer to the record's start address.

This means that we acquire the segment boundary lock rarely. I was concerned that we'd need to acquire it every time we read the flush pointer, which would have been a disaster.

Thanks

--
Álvaro Herrera	PostgreSQL Developer  —  https://www.EnterpriseDB.com/
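To make that concrete, here is a minimal sketch of the clamp being described. XLogSegmentOffset() is the real macro from xlog_internal.h; LookupSegmentBoundary() is an invented stand-in for the locked probe of the segment boundary map, and the exact off-by-one at the record start is left open in the thread:

	/*
	 * Sketch only: retreat the streaming send pointer when the flush pointer
	 * lands exactly on a segment boundary that splits a WAL record.
	 */
	static XLogRecPtr
	ClampStreamingFlushPtr(XLogRecPtr flushed)
	{
		XLogRecPtr	recstart;

		/* Common case: not on an exact segment boundary, no lock taken. */
		if (XLogSegmentOffset(flushed, wal_segment_size) != 0)
			return flushed;

		/*
		 * On a boundary: if a record crosses it, send only up to the record's
		 * start (the thread suggests start - 1), so a standby never holds
		 * half a record that a crashed primary might later rewrite.
		 */
		if (LookupSegmentBoundary(flushed, &recstart))	/* hypothetical */
			return recstart;

		return flushed;
	}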
The thing I still don't understand about this patch is why we call RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in XLogInsertRecord.

My model concept of this would have these routines called only in XLogFlush / XLogWrite, which are conceptually about transferring data from in-memory WAL buffers into the on-disk WAL store (pg_xlog files). As I understand it, XLogInsertRecord is about copying bytes from the high-level operation (heap insert etc.) into WAL buffers. So doing the register/notify dance in both places should be redundant and unnecessary.

In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite, we use flushed=InvalidXLogRecPtr. Why is that? Surely we can use LogwrtResult.Flush, just like in the other callsite there? (If we're covering for somebody advancing FlushRecPtr concurrently, I think we should add a comment to explain that. But why do we need to do that in the first place?)

--
Álvaro Herrera	PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as when he submitted to learn of the crow."
(William Blake)
On 8/17/21, 1:24 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > The thing I still don't understand about this patch is why we call > RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in > XLogInsertRecord. > > My model concept of this would have these routines called only in > XLogFlush / XLogWrite, which are conceptually about transferring data > from in-memory WAL buffers into the on-disk WAL store (pg_xlog files). > As I understand, XLogInsertRecord is about copying bytes from the > high-level operation (heap insert etc) into WAL buffers. So doing the > register/notify dance in both places should be redundant and > unnecessary. The main reason for registering the boundaries in XLogInsertRecord() is that it has the required information about the WAL record boundaries. I do not think that XLogWrite() has this information. If we assumed that write requests always pointed to record boundaries, we could probably just move the XLogArchiveNotifySeg() calls to the end of XLogWrite(), which is what my original patch [0] did. > In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite, > we use flushed=InvalidXLogRecPtr. Why is that? Surely we can use > LogwrtResult.Flush, just like in the other callsite there? (If we're > covering for somebody advancing FlushRecPtr concurrently, I think we > add a comment to explain that. But why do we need to do that in the > first place?) Good point. I did this in the new version of the patch. Nathan [0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
Attachment
On 2021-Aug-17, Bossart, Nathan wrote: > The main reason for registering the boundaries in XLogInsertRecord() > is that it has the required information about the WAL record > boundaries. I do not think that XLogWrite() has this information. Doh, of course. So, why isn't it that we call Register in XLogInsertRecord, and Notify in XLogWrite? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La fuerza no está en los medios físicos sino que reside en una voluntad indomable" (Gandhi)
On 8/17/21, 2:13 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-17, Bossart, Nathan wrote: > >> The main reason for registering the boundaries in XLogInsertRecord() >> is that it has the required information about the WAL record >> boundaries. I do not think that XLogWrite() has this information. > > Doh, of course. So, why isn't it that we call Register in > XLogInsertRecord, and Notify in XLogWrite? We do. However, we also call NotifySegmentsReadyForArchive() in XLogInsertRecord() to handle the probably-unlikely scenario that the flush pointer has already advanced past the to-be-registered boundary. This ensures that the .ready files are created as soon as possible. Nathan
On 2021-Aug-17, Bossart, Nathan wrote: > On 8/17/21, 2:13 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > > > So, why isn't it that we call Register in XLogInsertRecord, and > > Notify in XLogWrite? > > We do. However, we also call NotifySegmentsReadyForArchive() in > XLogInsertRecord() to handle the probably-unlikely scenario that the > flush pointer has already advanced past the to-be-registered boundary. > This ensures that the .ready files are created as soon as possible. I see. I have two thoughts on that. First, why not do it outside the block that tests for crossing a segment boundary? If that's a good thing to do, then we should do it always. However, why do it in a WAL-producing client-connected backend? It strikes me as a bad thing to do, because you are possibly causing delays for client-connected backends. I suggest that we should give this task to the WAL writer process -- say, have XLogBackgroundFlush do it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
On 2021-Aug-17, alvherre@alvh.no-ip.org wrote:

> However, why do it in a WAL-producing client-connected backend? It
> strikes me as a bad thing to do, because you are possibly causing delays
> for client-connected backends. I suggest that we should give this task
> to the WAL writer process -- say, have XLogBackgroundFlush do it.

Reading the comments in walwriter.c, I am hesitant about having walwriter do it:

> * Because the walwriter's cycle is directly linked to the maximum delay
> * before async-commit transactions are guaranteed committed, it's probably
> * unwise to load additional functionality onto it. For instance, if you've
> * got a yen to create xlog segments further in advance, that'd be better done
> * in bgwriter than in walwriter.

So that comment suggests that we should give the responsibility to bgwriter. This seems good enough to me. I suppose if bgwriter has a long run of buffers to write it could take a little bit of time (a few hundred milliseconds?) but I think that should be okay.

--
Álvaro Herrera	39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."
On 8/18/21, 10:06 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > So that comment suggests that we should give the responsibility to bgwriter. > This seems good enough to me. I suppose if bgwriter has a long run of > buffers to write it could take a little bit of time (a few hundred > milliseconds?) but I think that should be okay. Do you think bgwriter should be the only caller of NotifySegmentsReadyForArchive(), or should we still have XLogWrite() call it? Nathan
On 2021-Aug-18, Bossart, Nathan wrote: > On 8/18/21, 10:06 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > > So that comment suggests that we should give the responsibility to bgwriter. > > This seems good enough to me. I suppose if bgwriter has a long run of > > buffers to write it could take a little bit of time (a few hundred > > milliseconds?) but I think that should be okay. > > Do you think bgwriter should be the only caller of > NotifySegmentsReadyForArchive(), or should we still have XLogWrite() > call it? I think XLogWrite should absolutely be the primary caller. The one in bgwriter should be a backstop for the case you describe where the flush pointer advanced past the registration point in XLogInsertRecord. I realize this means there's a contradiction with my previous argument, in that synchronous transaction commit calls XLogWrite at some point, so we *are* putting the client-connected backend in charge of creating the notify files. However, that only happens on transaction commit, where we already accept responsibility for the WAL flush, not on each individual XLOG record insert; also, the WAL writer will take care of it sometimes, for transactions that are long-enough lived. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
On 2021-Aug-18, alvherre@alvh.no-ip.org wrote: > I realize this means there's a contradiction with my previous argument, > in that synchronous transaction commit calls XLogWrite at some point, so > we *are* putting the client-connected backend in charge of creating the > notify files. However, that only happens on transaction commit, where > we already accept responsibility for the WAL flush, not on each > individual XLOG record insert; also, the WAL writer will take care of it > sometimes, for transactions that are long-enough lived. Eh. I just said WAL writer will sometimes do it, and that's true because it'll occur in XLogBackgroundFlush. But upthread I wimped out of having WAL writer call NotifySegmentsReadyForArchive() and instead opined to give responsibility to bgwriter. However, thinking about it again, maybe it does make sense to have walwriter do it too directly. This causes no harm to walwriter's time constraints, since *it will have to do it via XLogBackgroundFlush anyway*. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On 8/18/21, 10:48 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-18, alvherre@alvh.no-ip.org wrote: > >> I realize this means there's a contradiction with my previous argument, >> in that synchronous transaction commit calls XLogWrite at some point, so >> we *are* putting the client-connected backend in charge of creating the >> notify files. However, that only happens on transaction commit, where >> we already accept responsibility for the WAL flush, not on each >> individual XLOG record insert; also, the WAL writer will take care of it >> sometimes, for transactions that are long-enough lived. > > Eh. I just said WAL writer will sometimes do it, and that's true > because it'll occur in XLogBackgroundFlush. But upthread I wimped out > of having WAL writer call NotifySegmentsReadyForArchive() and instead > opined to give responsibility to bgwriter. However, thinking about it > again, maybe it does make sense to have walwriter do it too directly. > This causes no harm to walwriter's time constraints, since *it will have > to do it via XLogBackgroundFlush anyway*. I'll add it after XLogBackgroundFlush(). I think we'll also want to set the WAL writer's latch in case it is hibernating. Another approach could be to keep the NotifySegmentsReadyForArchive() call in XLogInsertRecord(), but only call it if the flush pointer is beyond the boundary we just registered. Or we could only set the latch in XLogInsertRecord() if we detect that the flush pointer has advanced. Nathan
On 2021-Aug-18, Bossart, Nathan wrote: > I'll add it after XLogBackgroundFlush(). I was wondering which would be better: before or after. XLogBackgroundFlush would do it anyway, so if you do it after then it's not clear to me that it'd do anything (I mean we should not do any new calls of NotifySegmentsReadyForArchive and just rely on the one in XLogBackgroundFlush -> XLogWrite). The advantage of doing NotifySegmentsReadyForArchive before XLogBackgroundFlush is that the files would be created sooner, so the archiver can be working in parallel while walwriter does its other thing; then we'd reach the NotifySegmentsReadyForArchive in XLogBackgroundFlush and it'd find nothing to do most of the time, which is just fine. > I think we'll also want to set the WAL writer's latch in case it is > hibernating. Yeah. (That's another advantage of doing it in walwriter rather than bgwriter: we don't publish bgwriter's latch anywhere AFAICS). > Another approach could be to keep the NotifySegmentsReadyForArchive() > call in XLogInsertRecord(), but only call it if the flush pointer is > beyond the boundary we just registered. Or we could only set the > latch in XLogInsertRecord() if we detect that the flush pointer has > advanced. Hmm. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
On 8/18/21, 11:31 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > I was wondering which would be better: before or after. > XLogBackgroundFlush would do it anyway, so if you do it after then it's > not clear to me that it'd do anything (I mean we should not do any new > calls of NotifySegmentsReadyForArchive and just rely on the one in > XLogBackgroundFlush -> XLogWrite). > > The advantage of doing NotifySegmentsReadyForArchive before > XLogBackgroundFlush is that the files would be created sooner, so the > archiver can be working in parallel while walwriter does its other > thing; then we'd reach the NotifySegmentsReadyForArchive in > XLogBackgroundFlush and it'd find nothing to do most of the time, which > is just fine. As long as XLogBackgroundFlush() found work to do, it would take care of notifying, but I don't think we can depend on that. However, since we're primarily using the WAL writer to take care of the case when the record has already been flushed, notifying beforehand seems fine to me. If XLogBackgroundFlush() does end up calling XLogWrite(), it'll call it again, anyway. In the attached patch, I modified XLogInsertRecord() to simply set the latch if we detect that flushRecPtr has advanced. Nathan
Attachment
On 2021-Aug-18, Bossart, Nathan wrote: > As long as XLogBackgroundFlush() found work to do, it would take care > of notifying, but I don't think we can depend on that. However, since > we're primarily using the WAL writer to take care of the case when the > record has already been flushed, notifying beforehand seems fine to > me. If XLogBackgroundFlush() does end up calling XLogWrite(), it'll > call it again, anyway. Agreed. > In the attached patch, I modified XLogInsertRecord() to simply set the > latch if we detect that flushRecPtr has advanced. Right, that's what I was thinking. I modified that slightly to use LogwrtResult.Flush (which should be fresh enough) instead of calling GetFlushRecPtr again, which saves a bit. I also changed it to > instead of >=, because if I understand you correctly we only care to notify if the flush pointer advanced, not in the case it stayed the same. I made a few other cosmetic tweaks -- added comment to SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument 'notify' of XLogArchiveNotify to 'nudge' (because having two different "notify" concepts in that function seemed confusing). -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
Attachment
On 8/18/21, 4:47 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-18, Bossart, Nathan wrote: >> In the attached patch, I modified XLogInsertRecord() to simply set the >> latch if we detect that flushRecPtr has advanced. > > Right, that's what I was thinking. I modified that slightly to use > LogwrtResult.Flush (which should be fresh enough) instead of calling > GetFlushRecPtr again, which saves a bit. I also changed it to > instead > of >=, because if I understand you correctly we only care to notify if > the flush pointer advanced, not in the case it stayed the same. My thinking was that we needed to read flushRecPtr after registering the boundary in case it advanced just before registration. And I used >= because if flushRecPtr points to the end of the record, we should be able to create the .ready file for the segment. We can avoid acquiring the spinlock an extra time if we move the first part of the cross-segment logic to before we update the local copy of LogwrtResult. I attached a new version of the patch that does this. The rest looks good to me. Nathan
Attachment
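Piecing this exchange together, the tail of XLogInsertRecord() being negotiated would look roughly as below. This is a sketch of the described behavior (boundary registration plus the conditional latch nudge), not a quote from the v11 patch; StartSeg, EndSeg, and EndPos are assumed to have been computed from the record's start and end LSNs earlier in the function:

	if (StartSeg != EndSeg && XLogArchivingActive())
	{
		RegisterSegmentBoundary(EndSeg, EndPos);

		/*
		 * If the record's end position was already flushed before we could
		 * register its boundary, no future flush will notify those segments
		 * for us, so wake the WAL writer (which may be hibernating).
		 */
		if (LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
			SetLatch(ProcGlobal->walwriterLatch);
	}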
In v12 I moved the code around a bit and reworded some comments. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachment
Two things.

1. We use a hash table in shared memory. That's great. The part that's
   not so great is that in both places where we read items from it, we
   have to iterate in some way. This seems a bit silly. An array would
   serve us better, if only we could expand it as needed. However, in
   shared memory we can't do that. (I think the list of elements we
   need to memoize is arbitrarily long, if enough processes can be
   writing WAL at the same time.) Now that I think about this again,
   maybe it's limited by NUM_XLOGINSERT_LOCKS, since there can only be
   that many records being written down at a time ...

2. There is a new LWLock acquisition that may be a new contention point.
   We acquire the lock in these cases:
   - WAL record insert, when a segment boundary is crossed (rare enough).
   - XLogWrite, when a segment needs to be notified.

Looking again, I think the last point might be a problem actually, because XLogWrite is called with WALWriteLock held. Maybe we should take the NotifySegmentsReadyForArchive() call outside the section locked by WALWriteLock (so put it in XLogWrite's callers instead of XLogWrite itself).

--
Álvaro Herrera	PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)
On Fri, Aug 20, 2021 at 10:50 AM alvherre@alvh.no-ip.org <alvherre@alvh.no-ip.org> wrote:
> 1. We use a hash table in shared memory. That's great. The part that's
>    not so great is that in both places where we read items from it, we
>    have to iterate in some way. This seems a bit silly. An array would
>    serve us better, if only we could expand it as needed. However, in
>    shared memory we can't do that. (I think the list of elements we
>    need to memoize is arbitrarily long, if enough processes can be
>    writing WAL at the same time.)

We can't expand the hash table either. It has an initial and maximum size of 16 elements, which means it's basically an expensive array, and which also means that it imposes a new limit of 16 * wal_segment_size on the size of WAL records. If you exceed that limit, I think things just go boom... which I think is not acceptable. I think we can have records in the multi-GB range with wal_level=logical if someone chooses a stupid replica identity setting.

It's actually not clear to me why we need to track multiple entries anyway. The scenario postulated by Horiguchi-san in https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com seems to require that the write position be multiple segments ahead of the flush position, but that seems impossible with the present code, because XLogWrite() calls issue_xlog_fsync() at once if the segment is filled. So I think, at least with the present code, any record that isn't completely flushed to disk has to be at least partially in the current segment. And there can be only one record that starts in some earlier segment and ends in this one.

I will be the first to admit that the forced end-of-segment syncs suck. They often stall every backend in the entire system at the same time. Everyone fills up the xlog segment really fast and then stalls HARD while waiting for that sync to happen. So it's arguably better not to do more things that depend on that being how it works, but I think needing a variable-size amount of shared memory is even worse. If we're going to track multiple entries here we need some rule that bounds how many of them we can need to track. If the number of entries is defined by the number of segment boundaries that a particular record crosses, it's effectively unbounded, because right now WAL records can be pretty much arbitrarily big.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 8/20/21, 8:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 10:50 AM alvherre@alvh.no-ip.org
> <alvherre@alvh.no-ip.org> wrote:
>> 1. We use a hash table in shared memory. That's great. The part that's
>>    not so great is that in both places where we read items from it, we
>>    have to iterate in some way. This seems a bit silly. An array would
>>    serve us better, if only we could expand it as needed. However, in
>>    shared memory we can't do that. (I think the list of elements we
>>    need to memoize is arbitrarily long, if enough processes can be
>>    writing WAL at the same time.)
>
> We can't expand the hash table either. It has an initial and maximum
> size of 16 elements, which means it's basically an expensive array,
> and which also means that it imposes a new limit of 16 *
> wal_segment_size on the size of WAL records. If you exceed that limit,
> I think things just go boom... which I think is not acceptable. I
> think we can have records in the multi-GB range with wal_level=logical
> if someone chooses a stupid replica identity setting.

If a record spans multiple segments, we only register one segment boundary. For example, if I insert a record that starts at segment number 1 and stops at 10, I'll insert one segment boundary for segment 10. We'll only create .ready files for segments 1 through 9 once this record is completely flushed to disk.

I was under the impression that shared hash tables could be expanded as necessary, but from your note and the following comment, that does not seem to be true:

 * Note: for a shared-memory hashtable, nelem needs to be a pretty good
 * estimate, since we can't expand the table on the fly. But an unshared
 * hashtable can be expanded on-the-fly, so it's better for nelem to be
 * on the small side and let the table grow if it's exceeded. An overly
 * large nelem will penalize hash_seq_search speed without buying much.

> It's actually not clear to me why we need to track multiple entries
> anyway. The scenario postulated by Horiguchi-san in
> https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
> seems to require that the write position be multiple segments ahead of
> the flush position, but that seems impossible with the present code,
> because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> filled. So I think, at least with the present code, any record that
> isn't completely flushed to disk has to be at least partially in the
> current segment. And there can be only one record that starts in some
> earlier segment and ends in this one.

We register the boundaries in XLogInsertRecord(), which AFAICT just bumps the global write request pointer ahead, so I'm not sure we can make any assumptions about what is written/flushed at that time. (I see that we do end up calling XLogFlush() for XLOG_SWITCH records in XLogInsertRecord(), but I don't see any other cases where we actually write anything in this function.) Am I missing something?

> I will be the first to admit that the forced end-of-segment syncs
> suck. They often stall every backend in the entire system at the same
> time. Everyone fills up the xlog segment really fast and then stalls
> HARD while waiting for that sync to happen. So it's arguably better
> not to do more things that depend on that being how it works, but I
> think needing a variable-size amount of shared memory is even worse.
> If we're going to track multiple entries here we need some rule that
> bounds how many of them we can need to track. If the number of entries
> is defined by the number of segment boundaries that a particular
> record crosses, it's effectively unbounded, because right now WAL
> records can be pretty much arbitrarily big.

If there isn't a way to ensure that the number of entries we need to store is bounded, I'm tempted to propose my original patch [0], which just moves .ready file creation to the very end of XLogWrite(). It's probably not a complete solution, but it might be better than what's there today.

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
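For reference, the effectively-fixed-size behavior under discussion comes from how the shared table is created. A hedged sketch follows; the table name, flags, and entry layout here are assumptions rather than quotes from the patch, and what matters is init_size == max_size == 16:

	typedef struct SegmentBoundaryEntry
	{
		XLogSegNo	seg;			/* hash key */
		XLogRecPtr	endpos;			/* end of the record crossing into seg */
	} SegmentBoundaryEntry;

	static HTAB *SegmentBoundaryMap;

	static void
	SegmentBoundaryShmemInit(void)
	{
		HASHCTL		info;

		memset(&info, 0, sizeof(info));
		info.keysize = sizeof(XLogSegNo);
		info.entrysize = sizeof(SegmentBoundaryEntry);

		/* init and max both 16: in practice the table cannot grow */
		SegmentBoundaryMap = ShmemInitHash("Segment Boundary Map",
										   16, 16,
										   &info,
										   HASH_ELEM | HASH_BLOBS);
	}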
On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan <bossartn@amazon.com> wrote: > If a record spans multiple segments, we only register one segment > boundary. For example, if I insert a record that starts at segment > number 1 and stops at 10, I'll insert one segment boundary for segment > 10. We'll only create .ready files for segments 1 through 9 once this > record is completely flushed to disk. Oh ... OK. So is there any experimental scenario in which the hash table ends up with more than 1 entry? And if so, how does that happen? > > It's actually not clear to me why we need to track multiple entries > > anyway. The scenario postulated by Horiguchi-san in > > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com > > seems to require that the write position be multiple segments ahead of > > the flush position, but that seems impossible with the present code, > > because XLogWrite() calls issue_xlog_fsync() at once if the segment is > > filled. So I think, at least with the present code, any record that > > isn't completely flushed to disk has to be at least partially in the > > current segment. And there can be only one record that starts in some > > earlier segment and ends in this one. > > We register the boundaries XLogInsertRecord(), which AFAICT just bumps > the global write request pointer ahead, so I'm not sure we can make > any assumptions about what is written/flushed at that time. (I see > that we do end up calling XLogFlush() for XLOG_SWITCH records in > XLogInsertRecord(), but I don't see any other cases where we actually > write anything in this function.) Am I missing something? Well, I'm not sure. But I *think* that the code as it exists today is smart enough not to try to archive a segment that hasn't been completely flushed, and the gap is only that even though the segment might be completely flushed, some portion of the record that is part of a later segment might not be flushed, and thus after a crash we might overwrite the already-flushed contents. The patch can make an implementation choice to do some work at XLogInsertRecord() time if it likes, but there's no real hazard at that point. The hazard only exists, or so I think, once a segment that contains part of the record is fully on disk. But that means, if my previous logic is correct, that the hazard can only exist for at most 1 record at any point in time. > If there isn't a way to ensure that the number of entries we need to > store is bounded, I'm tempted to propose my original patch [0], which > just moves .ready file creation to the very end of XLogWrite(). It's > probably not a complete solution, but it might be better than what's > there today. Doesn't that allocate memory inside a critical section? I would have thought it would cause an immediate assertion failure. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/20/21, 10:08 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> If a record spans multiple segments, we only register one segment >> boundary. For example, if I insert a record that starts at segment >> number 1 and stops at 10, I'll insert one segment boundary for segment >> 10. We'll only create .ready files for segments 1 through 9 once this >> record is completely flushed to disk. > > Oh ... OK. So is there any experimental scenario in which the hash > table ends up with more than 1 entry? And if so, how does that happen? I was able to do this by turning synchronous_commit off, increasing wal_buffers substantially, and adding sleeps to XLogWrite(). >> If there isn't a way to ensure that the number of entries we need to >> store is bounded, I'm tempted to propose my original patch [0], which >> just moves .ready file creation to the very end of XLogWrite(). It's >> probably not a complete solution, but it might be better than what's >> there today. > > Doesn't that allocate memory inside a critical section? I would have > thought it would cause an immediate assertion failure. I could probably replace the list with two local variables (start and end segments). Thinking about this stuff further, I was wondering if one way to handle the bounded shared hash table problem would be to replace the latest boundary in the map whenever it was full. But at that point, do we even need a hash table? This led me to revisit the two-element approach that was discussed upthread. What if we only stored the earliest and latest segment boundaries at any given time? Once the earliest boundary is added, it never changes until the segment is flushed and it is removed. The latest boundary, however, will be updated any time we register another segment. Once the earliest boundary is removed, we replace it with the latest boundary. This strategy could cause us to miss intermediate boundaries, but AFAICT the worst case scenario is that we hold off creating .ready files a bit longer than necessary. I'll work on a patch to illustrate what I'm thinking. Nathan
On 2021-Aug-20, Bossart, Nathan wrote:

> On 8/20/21, 8:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> > We can't expand the hash table either. It has an initial and maximum
> > size of 16 elements, which means it's basically an expensive array,
> > and which also means that it imposes a new limit of 16 *
> > wal_segment_size on the size of WAL records. If you exceed that limit,
> > I think things just go boom... which I think is not acceptable. I
> > think we can have records in the multi-GB range with wal_level=logical
> > if someone chooses a stupid replica identity setting.
>
> I was under the impression that shared hash tables could be expanded
> as necessary, but from your note and the following comment, that does
> not seem to be true:

Actually, you were right. Hash tables in shared memory can be expanded. There are some limitations (the hash "directory" is fixed size, which means the hash table gets less efficient if it grows too much), but you can definitely create more hash entries than the initial size. See for example element_alloc(), which covers the case of a hash table being IS_PARTITIONED -- something that only shmem hash tables can be. Note that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError as its allocation function, which acquires memory from the shared segment.

This is a minor thing -- it doesn't affect the fact that the hash table is possibly being misused and inefficient -- but I thought it was worth pointing out.

As an example, consider the LOCK / PROCLOCK hash tables. These can contain more elements than max_backends * max_locks_per_transaction. Those elements consume shared memory from the "allocation slop" in the shared memory segment. It's tough when it happens (as far as I know the memory is never "returned" once such a hash table grows to use that space), but it does work.

--
Álvaro Herrera	Valdivia, Chile  —  https://www.EnterpriseDB.com/
On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Thinking about this stuff further, I was wondering if one way to > handle the bounded shared hash table problem would be to replace the > latest boundary in the map whenever it was full. But at that point, > do we even need a hash table? This led me to revisit the two-element > approach that was discussed upthread. What if we only stored the > earliest and latest segment boundaries at any given time? Once the > earliest boundary is added, it never changes until the segment is > flushed and it is removed. The latest boundary, however, will be > updated any time we register another segment. Once the earliest > boundary is removed, we replace it with the latest boundary. This > strategy could cause us to miss intermediate boundaries, but AFAICT > the worst case scenario is that we hold off creating .ready files a > bit longer than necessary. I think this is a promising approach. We could also have a small fixed-size array, so that we only have to risk losing track of anything when we overflow the array. But I guess I'm still unconvinced that there's a real possibility of genuinely needing multiple elements. Suppose we are thinking of adding a second element to the array (or the hash table). I feel like it's got to be safe to just remove the first one. If not, then apparently the WAL record that caused us to make the first entry isn't totally flushed yet - which I still think is impossible. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 1:52 PM alvherre@alvh.no-ip.org <alvherre@alvh.no-ip.org> wrote: > Actually, you were right. Hash tables in shared memory can be expanded. > There are some limitations (the hash "directory" is fixed size, which > means the hash table get less efficient if it grows too much), but you > can definitely create more hash entries than the initial size. See for > example element_alloc(), which covers the case of a hash table being > IS_PARTITIONED -- something that only shmem hash tables can be. Note > that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError > as allocation function, which acquires memory from the shared segment. I realize that the code supports this ... but I thought we had established a policy that only the main lock manager's shared hash tables, and not any others, are actually allowed to make use of this functionality. See commit 7c797e7194d969f974abf579cacf30ffdccdbb95. It seems like a dangerous thing to rely on in any case, since we can't predict how much extra shared memory might actually be available. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/20/21, 11:20 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> Thinking about this stuff further, I was wondering if one way to >> handle the bounded shared hash table problem would be to replace the >> latest boundary in the map whenever it was full. But at that point, >> do we even need a hash table? This led me to revisit the two-element >> approach that was discussed upthread. What if we only stored the >> earliest and latest segment boundaries at any given time? Once the >> earliest boundary is added, it never changes until the segment is >> flushed and it is removed. The latest boundary, however, will be >> updated any time we register another segment. Once the earliest >> boundary is removed, we replace it with the latest boundary. This >> strategy could cause us to miss intermediate boundaries, but AFAICT >> the worst case scenario is that we hold off creating .ready files a >> bit longer than necessary. > > I think this is a promising approach. We could also have a small > fixed-size array, so that we only have to risk losing track of > anything when we overflow the array. But I guess I'm still unconvinced > that there's a real possibility of genuinely needing multiple > elements. Suppose we are thinking of adding a second element to the > array (or the hash table). I feel like it's got to be safe to just > remove the first one. If not, then apparently the WAL record that > caused us to make the first entry isn't totally flushed yet - which I > still think is impossible. I've attached a patch to demonstrate what I'm thinking. Nathan
Attachment
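For readers without the attachment, the two-slot registration Nathan describes might look roughly like this. The field names anticipate ones suggested later in the thread, the MaxXLogSegNo sentinel follows Álvaro's earlier suggestion, and the guards are assumptions, so treat this as a sketch rather than the patch itself:

	typedef struct SegmentBoundaries
	{
		XLogSegNo	earliestSegBoundary;	/* fixed until notified */
		XLogRecPtr	earliestSegBoundaryEndPtr;
		XLogSegNo	latestSegBoundary;		/* overwritten as new ones arrive */
		XLogRecPtr	latestSegBoundaryEndPtr;
	} SegmentBoundaries;

	static void
	RegisterSegmentBoundary(SegmentBoundaries *b, XLogSegNo seg,
							XLogRecPtr endpos)
	{
		if (b->earliestSegBoundary == MaxXLogSegNo)
		{
			/* First registration since the last notify: take the earliest slot. */
			b->earliestSegBoundary = seg;
			b->earliestSegBoundaryEndPtr = endpos;
		}
		else if (seg != b->earliestSegBoundary)
		{
			/*
			 * Overwrite the latest slot; intermediate boundaries may be lost,
			 * which only delays .ready creation, never creates it too early.
			 * When the earliest slot is notified and cleared, the notify path
			 * (not shown) slides the latest slot into its place.
			 */
			b->latestSegBoundary = seg;
			b->latestSegBoundaryEndPtr = endpos;
		}
	}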
On 2021-Aug-20, Bossart, Nathan wrote:

> > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> >> This led me to revisit the two-element
> >> approach that was discussed upthread. What if we only stored the
> >> earliest and latest segment boundaries at any given time? Once the
> >> earliest boundary is added, it never changes until the segment is
> >> flushed and it is removed. The latest boundary, however, will be
> >> updated any time we register another segment. Once the earliest
> >> boundary is removed, we replace it with the latest boundary. This
> >> strategy could cause us to miss intermediate boundaries, but AFAICT
> >> the worst case scenario is that we hold off creating .ready files a
> >> bit longer than necessary.
>
> I've attached a patch to demonstrate what I'm thinking.

There is only one thing I didn't like in this new version, which is that we're holding info_lck too much. I've seen info_lck contention be a problem in some workloads, and I'd rather not add more stuff to it. I'd rather we stick with using a new lock object to protect all the data we need for this job.

Should this new lock object be a spinlock or an lwlock? I think a spinlock would generally be better because of its lower overhead, and we can't use it in shared mode anywhere, which would be the greatest argument for an lwlock. However, we generally avoid running anything but straight-line code while holding a spinlock, and we have some function calls there.

--
Álvaro Herrera	Valdivia, Chile  —  https://www.EnterpriseDB.com/
Attached is v14 which uses a separate spinlock. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia)
Attachment
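A rough reconstruction of the v14 notify path as described in this thread follows. Slot maintenance is elided and the names are taken from earlier messages, so this is a sketch of the claim-inside-the-lock, create-files-outside-the-lock shape, not the attached patch itself:

	static void
	NotifySegmentsReadyForArchive(XLogRecPtr flushRecPtr)
	{
		XLogSegNo	latest_boundary_seg;
		XLogSegNo	last_notified;
		XLogSegNo	seg;

		SpinLockAcquire(&XLogCtl->segtrack.mutex);

		/* Pick the newest fully flushed boundary, if any. */
		if (XLogCtl->segtrack.latestSegBoundary != MaxXLogSegNo &&
			flushRecPtr >= XLogCtl->segtrack.latestSegBoundaryEndPtr)
			latest_boundary_seg = XLogCtl->segtrack.latestSegBoundary;
		else if (XLogCtl->segtrack.earliestSegBoundary != MaxXLogSegNo &&
				 flushRecPtr >= XLogCtl->segtrack.earliestSegBoundaryEndPtr)
			latest_boundary_seg = XLogCtl->segtrack.earliestSegBoundary;
		else
		{
			SpinLockRelease(&XLogCtl->segtrack.mutex);
			return;
		}

		/* Claim the segments inside the lock, so other backends skip them. */
		last_notified = XLogCtl->lastNotifiedSeg;
		XLogCtl->lastNotifiedSeg = latest_boundary_seg - 1;
		/* ... slide/reset the two boundary slots here ... */

		SpinLockRelease(&XLogCtl->segtrack.mutex);

		/* Create the .ready files outside the spinlock. */
		for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
			XLogArchiveNotifySeg(seg, false);

		PgArchWakeup();
	}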
On 8/20/21, 4:00 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> Attached is v14 which uses a separate spinlock.

This looks good to me.

I was looking at moving the function calls out of the spinlock region. I don't think the functions are doing anything too expensive, and they help clean up NotifySegmentsReadyForArchive() quite a bit, but I understand why it might be against project policy to do something like that. It would be easy enough to get rid of the helper functions if that is a concern.

Nathan
On 2021-Aug-20, Bossart, Nathan wrote: > I was looking at moving the function calls out of the spinlock region. > I don't think the functions are doing anything too expensive, and they > help clean up NotifySegmentsReadyForArchive() quite a bit, but I > understand why it might be against project policy to do something like > that. It would be easy enough to get rid of the helper functions if > that was concern. Well, the thing I realized is that these three helper functions have exactly one caller each. I think the compiler is going to inline them, so there isn't going to be a function call in the assembly. I haven't verified this, though. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los dioses no protegen a los insensatos. Éstos reciben protección de otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)
On 8/20/21, 4:52 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-20, Bossart, Nathan wrote: > >> I was looking at moving the function calls out of the spinlock region. >> I don't think the functions are doing anything too expensive, and they >> help clean up NotifySegmentsReadyForArchive() quite a bit, but I >> understand why it might be against project policy to do something like >> that. It would be easy enough to get rid of the helper functions if >> that was concern. > > Well, the thing I realized is that these three helper functions have > exactly one caller each. I think the compiler is going to inline them, > so there isn't going to be a function call in the assembly. I haven't > verified this, though. Good point. It looks like they're getting inlined for me. Nathan
On 2021-Aug-21, Bossart, Nathan wrote:

> > Well, the thing I realized is that these three helper functions have
> > exactly one caller each. I think the compiler is going to inline them,
> > so there isn't going to be a function call in the assembly. I haven't
> > verified this, though.
>
> Good point. It looks like they're getting inlined for me.

I still didn't like it, because it looks like we're creating an API for which there can be only one caller. So I expanded the functions into their caller. It doesn't look too bad. However ...

... while reading the resulting code after backpatching to all branches, I realized that if there are no registrations whatsoever, then archiving won't do anything, which surely is the wrong thing to do. The correct behavior should be "if there are no registrations, then *all* flushed segments can be notified". I'll fix that ...

Another thing I didn't like is that you used a name ending in RecPtr for the LSN, which gives no indication that it really is the *end* LSN, not the start pointer. And it won't play nicely with the *start* LSN we'll need to add when we implement the fix for the equivalent problem in streaming replication. I'll rename those to earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr.

--
Álvaro Herrera	Valdivia, Chile  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.
Attachment
On 8/23/21, 8:50 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> ... while reading the resulting code after backpatching to all branches,
> I realized that if there are no registrations whatsoever, then archiving
> won't do anything, which surely is the wrong thing to do. The correct
> behavior should be "if there are no registrations, then *all* flushed
> segments can be notified".

Hm. My expectation would be that if there are no registrations, we cannot create .ready files for the flushed segments. The scenario where I can see that happening is when a record gets flushed to disk prior to registration. In that case, we'll still eventually register the record and wake up the WAL writer process, which will take care of creating the .ready files that were missed earlier. Is there another case you are thinking of where we could miss registration for a cross-segment record altogether?

Nathan
On 2021-Aug-23, Bossart, Nathan wrote: > Hm. My expectation would be that if there are no registrations, we > cannot create .ready files for the flushed segments. The scenario > where I can see that happening is when a record gets flushed to disk > prior to registration. In that case, we'll still eventually register > the record and wake up the WAL writer process, which will take care of > creating the .ready files that were missed earlier. Is there another > case you are thinking of where we could miss registration for a cross- > segment record altogether? I'm thinking of the case where no record cross segment boundaries ever. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 8/23/21, 9:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote: > On 2021-Aug-23, Bossart, Nathan wrote: > >> Hm. My expectation would be that if there are no registrations, we >> cannot create .ready files for the flushed segments. The scenario >> where I can see that happening is when a record gets flushed to disk >> prior to registration. In that case, we'll still eventually register >> the record and wake up the WAL writer process, which will take care of >> creating the .ready files that were missed earlier. Is there another >> case you are thinking of where we could miss registration for a cross- >> segment record altogether? > > I'm thinking of the case where no record cross segment boundaries ever. Sorry, I'm still not following this one. If we skipped creating .ready segments due to a crash, we rely on RemoveOldXlogFiles() to create them as needed in the end-of-recovery checkpoint. If a record fits perfectly in the end of a segment, we'll still register it as a boundary for the next segment (hence why we use XLByteToSeg() instead of XLByteToPrevSeg()). If database activity stops completely, there shouldn't be anything to mark ready. Nathan
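The distinction between the two macros can be seen directly from their definitions in xlog_internal.h, together with a short worked example assuming 16 MB segments:

	#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
		logSegNo = (xlrp) / (wal_segsz_bytes)

	#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
		logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

	/*
	 * A record that "fits perfectly" ends at, say, LSN 0x2000000 (32 MB),
	 * which is the exact end of segment 1 with 16 MB (0x1000000) segments:
	 *
	 *   XLByteToSeg(0x2000000, seg, 0x1000000)      => seg == 2
	 *   XLByteToPrevSeg(0x2000000, seg, 0x1000000)  => seg == 1
	 *
	 * Registering the boundary with XLByteToSeg() files it under segment 2,
	 * so segment 1 is notified only once the flush pointer has reached the
	 * record's true end, which is the behavior described above.
	 */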
On 2021-Aug-23, Bossart, Nathan wrote:

> Sorry, I'm still not following this one. If we skipped creating
> .ready files due to a crash, we rely on RemoveOldXlogFiles() to
> create them as needed in the end-of-recovery checkpoint. If a record
> fits perfectly at the end of a segment, we'll still register it as a
> boundary for the next segment (hence why we use XLByteToSeg() instead
> of XLByteToPrevSeg()). If database activity stops completely, there
> shouldn't be anything to mark ready.

The only way .ready files are created is when XLogNotifyWrite() is called. For regular WAL files during regular operation, that only happens in XLogNotifyWriteSeg(). That, in turn, only happens in NotifySegmentsReadyForArchive(). But if the system runs and never writes a WAL record that crosses a segment boundary, that function will see that both earliestSegBoundary and latestSegBoundary are MaxXLogSegNo, and return without doing anything. So no segments will be notified.

--
Álvaro Herrera  39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
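[To make the early-return concrete, here is a minimal standalone model of the guard described above. The types and the single boundary slot are hypothetical stand-ins, not the actual xlog.c code; only the field and sentinel names (earliestSegBoundary, MaxXLogSegNo) follow the discussion.]

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the backend's WAL types. */
typedef uint64_t XLogSegNo;
typedef uint64_t XLogRecPtr;

#define MaxXLogSegNo UINT64_MAX	/* sentinel: "no boundary registered" */

/* Simplified model of the boundary state kept in shared memory. */
static XLogSegNo earliestSegBoundary = MaxXLogSegNo;
static XLogRecPtr earliestSegBoundaryEndPtr = 0;

static void
NotifySegmentsReadyForArchive(XLogRecPtr flushRecPtr)
{
	/*
	 * If no boundary has ever been registered, the slot still holds the
	 * sentinel and we return without creating any .ready files, which is
	 * the scenario described in the message above.
	 */
	if (earliestSegBoundary == MaxXLogSegNo ||
		flushRecPtr < earliestSegBoundaryEndPtr)
		return;

	printf("would mark segments up to %llu as .ready\n",
		   (unsigned long long) earliestSegBoundary);
}

int
main(void)
{
	NotifySegmentsReadyForArchive(12345);	/* no output: nothing registered */
	return 0;
}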
On 2021-Aug-23, alvherre@alvh.no-ip.org wrote:

> The only way .ready files are created is when XLogNotifyWrite() is
> called. For regular WAL files during regular operation, that only
> happens in XLogNotifyWriteSeg(). That, in turn, only happens in
> NotifySegmentsReadyForArchive(). But if the system runs and never
> writes a WAL record that crosses a segment boundary, that function
> will see that both earliestSegBoundary and latestSegBoundary are
> MaxXLogSegNo, and return without doing anything. So no segments will
> be notified.

Nevermind -- I realized that all segments get registered, not just those for which we generate continuation records.

--
Álvaro Herrera  39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
On 8/23/21, 10:31 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-23, alvherre@alvh.no-ip.org wrote:
>
>> The only way .ready files are created is when XLogNotifyWrite() is
>> called. For regular WAL files during regular operation, that only
>> happens in XLogNotifyWriteSeg(). That, in turn, only happens in
>> NotifySegmentsReadyForArchive(). But if the system runs and never
>> writes a WAL record that crosses a segment boundary, that function
>> will see that both earliestSegBoundary and latestSegBoundary are
>> MaxXLogSegNo, and return without doing anything. So no segments will
>> be notified.
>
> Nevermind -- I realized that all segments get registered, not just those
> for which we generate continuation records.

Ah, okay. BTW the other changes you mentioned made sense to me.

Nathan
On 2021-Aug-23, Bossart, Nathan wrote:

> Ah, okay. BTW the other changes you mentioned made sense to me.

Thanks. I've pushed this now to all live branches.

--
Álvaro Herrera  PostgreSQL Developer — https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.
On 8/23/21, 12:55 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> Thanks. I've pushed this now to all live branches.

Thank you! I appreciate the thorough reviews. Should we make a new thread for the streaming replication fix?

Nathan
On 2021-Aug-23, Bossart, Nathan wrote:

> On 8/23/21, 12:55 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > Thanks. I've pushed this now to all live branches.
>
> Thank you! I appreciate the thorough reviews. Should we make a new
> thread for the streaming replication fix?

Yeah, this one is long enough :-)

--
Álvaro Herrera  PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2021/08/24 4:55, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
>> Ah, okay. BTW the other changes you mentioned made sense to me.
>
> Thanks. I've pushed this now to all live branches.

Thanks a lot!

+	/*
+	 * There's a chance that the record was already flushed to disk and we
+	 * missed marking segments as ready for archive.  If this happens, we
+	 * nudge the WALWriter, which will take care of notifying segments as
+	 * needed.
+	 */
+	if (StartSeg != EndSeg && XLogArchivingActive() &&
+		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+		SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary? If LogwrtResult.Flush >= EndPos, that means another process has already flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush. It also means that that other process called NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right? If this understanding is right, there seems to be no need to wake the walwriter up here so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 8/25/21, 11:01 AM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
> If LogwrtResult.Flush >= EndPos, that means another process has already
> flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
> It also means that that other process called
> NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If the segment boundary wasn't registered before the other process called NotifySegmentsReadyForArchive(), then that process couldn't have used the boundary when deciding which .ready files to create.

> If this understanding is right, there seems to be no need to wake the
> walwriter up here so that it can call
> NotifySegmentsReadyForArchive(LogwrtResult.Flush) again. Thoughts?

We're actually discussing this right now in another thread [0]. I think we might be able to get rid of that part if we move the boundary registration to before we release the WAL insert lock(s).

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com
Hi,

On 2021-08-23 15:55:03 -0400, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
> > Ah, okay. BTW the other changes you mentioned made sense to me.
>
> Thanks. I've pushed this now to all live branches.

While rebasing the aio patchset on top of HEAD, I noticed that this commit added another atomic operation to XLogWrite() with archiving enabled. The WAL write path is really quite hot, and at least some of the NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.

I think we should at least try to make the fast path where no segment boundaries were crossed use no atomic operations.

Greetings,

Andres Freund
On 2021-Aug-28, Andres Freund wrote:

> While rebasing the aio patchset on top of HEAD, I noticed that this commit
> added another atomic operation to XLogWrite() with archiving enabled. The
> WAL write path is really quite hot, and at least some of the
> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>
> I think we should at least try to make the fast path where no segment
> boundaries were crossed use no atomic operations.

I think the best way to achieve this is to rely completely on the walwriter doing the segment notification, so that the WAL write done by a backend would only do a latch set.

--
Álvaro Herrera  Valdivia, Chile — https://www.EnterpriseDB.com/
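[A rough sketch of that division of labor, with a hand-rolled flag standing in for the real latch machinery. All names here are hypothetical; this illustrates the idea being proposed, not the eventual patch, and it glosses over the costs raised in the next message.]

#include <stdbool.h>
#include <stdio.h>

/* Hand-rolled stand-in for the walwriter's latch. */
static bool walwriterLatchSet = false;

/*
 * Backend-side write path under this scheme: no boundary bookkeeping and
 * no atomics, just a cheap wakeup when a registered boundary was written.
 */
static void
XLogWriteFastPath(bool crossedSegmentBoundary)
{
	if (crossedSegmentBoundary)
		walwriterLatchSet = true;	/* SetLatch() in the real backend */
}

/* Walwriter-side loop body: all .ready notification happens here. */
static void
WalWriterIteration(void)
{
	if (walwriterLatchSet)
	{
		walwriterLatchSet = false;
		printf("walwriter: notify segments ready for archive\n");
	}
}

int
main(void)
{
	XLogWriteFastPath(true);
	WalWriterIteration();	/* prints the notification line */
	return 0;
}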
On 8/30/21, 12:52 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
>> While rebasing the aio patchset on top of HEAD, I noticed that this commit
>> added another atomic operation to XLogWrite() with archiving enabled. The
>> WAL write path is really quite hot, and at least some of the
>> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>>
>> I think we should at least try to make the fast path where no segment
>> boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on the
> walwriter doing the segment notification, so that the WAL write done by
> a backend would only do a latch set.

+1. If we do that, we may also want to move NotifySegmentsReadyForArchive() to after the call to XLogBackgroundFlush() in the WAL writer.

Nathan
Hi,

On 2021-08-30 15:51:54 -0400, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
> > While rebasing the aio patchset on top of HEAD, I noticed that this commit
> > added another atomic operation to XLogWrite() with archiving enabled. The
> > WAL write path is really quite hot, and at least some of the
> > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> >
> > I think we should at least try to make the fast path where no segment
> > boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on the
> walwriter doing the segment notification, so that the WAL write done by
> a backend would only do a latch set.

When were you thinking of doing the latch sets? Adding a latch set for every XLogWrite() wouldn't be cheap either, both because syscalls under a lock aren't free and because waking up walsender even more often isn't free (we already have a few threads about reducing the signalling frequency).

There's also the question of what to do with single-user mode. We shouldn't just skip creating .ready files there...

Although, the more I think about it, the more I am confused about the trailing

	if (XLogArchivingActive())
		NotifySegmentsReadyForArchive(LogwrtResult.Flush);

in XLogWrite(). Shouldn't that at the very least be inside the "If asked to flush, do so" branch? Outside that and the finishing_seg branch, LogwrtResult.Flush won't have moved, right? So the call to NotifySegmentsReadyForArchive() can't do anything, no?

Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive() if we started writing on the current page - flushRecPtr can't move across a segment boundary in that case.

I hadn't yet realized that this commit doesn't just make XLogWrite() more expensive, it also makes XLogInsertRecord() more expensive :(. Adding two divisions to XLogInsertRecord() isn't nice, especially as it happens even if !XLogArchivingActive().

I can't really convince myself that this deals correctly with multiple segment-spanning records, or with records spanning more than one segment. It'd be easier to understand if the new XLogCtlData variables were documented...

If there's one record from segment s0 to s1 and one from s1 to s4, and wal_buffers is big enough to contain them all, the first record will set earliestSegBoundary = s1 and the second latestSegBoundary = s4. When s1 is fully written out, NotifySegmentsReadyForArchive() will set earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 - ok. But when s2 is flushed, we'll afaict happily create .ready files for s2 and s3 despite s4 not yet being written, because earliestSegBoundary is now s4.

I think there are other issues as well. The more I look at this commit, the less I believe it's right.

The whole approach here of delaying .ready creation for these types of segments seems wrong to me. Doesn't the exact same problem also exist for streaming rep - which one can also use to maintain a PITR archive? walsender sends up to the flush location, and pg_receivewal's FindStreamingStart() will afaict just continue receiving from after that point.

Greetings,

Andres Freund
On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
> When were you thinking of doing the latch sets? Adding a latch set for every
> XLogWrite() wouldn't be cheap either, both because syscalls under a lock
> aren't free and because waking up walsender even more often isn't free (we
> already have a few threads about reducing the signalling frequency).
>
> There's also the question of what to do with single-user mode. We shouldn't
> just skip creating .ready files there...

Good point.

> Although, the more I think about it, the more I am confused about the trailing
> 	if (XLogArchivingActive())
> 		NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>
> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> flush, do so" branch? Outside that and the finishing_seg branch,
> LogwrtResult.Flush won't have moved, right? So the call to
> NotifySegmentsReadyForArchive() can't do anything, no?

The registration logic looks like this:

  1. Register boundary
  2. Get flush location from shared memory
  3. If flush location >= our just-registered boundary, nudge the WAL
     writer to create .ready files if needed

If we called NotifySegmentsReadyForArchive() before we updated the flush location in shared memory, we might skip nudging the WAL writer even though we should.

> Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
> if we started writing on the current page - flushRecPtr can't move across a
> segment boundary in that case.

I think there is a chance that we've crossed one of our recorded segment boundaries anytime the flush pointer moves.

> When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> ok. But when s2 is flushed, we'll afaict happily create .ready files
> for s2 and s3 despite s4 not yet being written, because earliestSegBoundary is
> now s4.

In this case, the .ready files for s2 and s3 wouldn't be created until s4 is flushed to disk.

> The whole approach here of delaying .ready creation for these types of
> segments seems wrong to me. Doesn't the exact same problem also exist for
> streaming rep - which one can also use to maintain a PITR archive? walsender
> sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> afaict just continue receiving from after that point.

The problem with streaming replication is being discussed in a new thread [0].

Nathan

[0] https://postgr.es/m/202108232252.dh7uxf6oxwcy%40alvherre.pgsql
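[As an illustration of the three-step ordering in the message above, here is a compressed single-file model of why the re-check in step 3 matters. The names are hypothetical; the real code reads the flush location from shared memory under the appropriate locking.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Model of the shared state involved in the race. */
static XLogRecPtr sharedFlushPtr = 0;		/* advanced by whoever flushes */
static XLogRecPtr registeredBoundaryEndPtr = 0;
static bool walwriterNudged = false;

/*
 * Register first, then re-check the flush pointer.  If a concurrent
 * flusher raced past our boundary before the registration became
 * visible, the re-check catches it and we nudge the WAL writer, which
 * then creates the missed .ready files.
 */
static void
RegisterBoundaryThenCheck(XLogRecPtr boundaryEndPtr)
{
	registeredBoundaryEndPtr = boundaryEndPtr;	/* step 1 */
	XLogRecPtr flushed = sharedFlushPtr;		/* step 2 */

	if (flushed >= boundaryEndPtr)				/* step 3 */
	{
		walwriterNudged = true;
		printf("nudge walwriter: boundary at %llu already flushed\n",
			   (unsigned long long) boundaryEndPtr);
	}
}

int
main(void)
{
	sharedFlushPtr = 2000;				/* a flush already happened */
	RegisterBoundaryThenCheck(1500);	/* late registration: nudge fires */
	return 0;
}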
On 8/30/21, 3:40 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
>> Although, the more I think about it, the more I am confused about the trailing
>> 	if (XLogArchivingActive())
>> 		NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>>
>> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
>> flush, do so" branch? Outside that and the finishing_seg branch,
>> LogwrtResult.Flush won't have moved, right? So the call to
>> NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
>
>   1. Register boundary
>   2. Get flush location from shared memory
>   3. If flush location >= our just-registered boundary, nudge the WAL
>      writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

In the other thread [0], we're considering moving boundary registration to before WALInsertLockRelease(). If I'm right that this removes the race condition in question, we should be able to move the call to NotifySegmentsReadyForArchive() at the end of XLogWrite() to the if-asked-to-flush branch.

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com
Hi,

On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > When were you thinking of doing the latch sets? Adding a latch set for every
> > XLogWrite() wouldn't be cheap either, both because syscalls under a lock
> > aren't free and because waking up walsender even more often isn't free (we
> > already have a few threads about reducing the signalling frequency).
> >
> > There's also the question of what to do with single-user mode. We shouldn't
> > just skip creating .ready files there...
>
> Good point.
>
> > Although, the more I think about it, the more I am confused about the trailing
> > 	if (XLogArchivingActive())
> > 		NotifySegmentsReadyForArchive(LogwrtResult.Flush);
> >
> > in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> > flush, do so" branch? Outside that and the finishing_seg branch,
> > LogwrtResult.Flush won't have moved, right? So the call to
> > NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
>
>   1. Register boundary
>   2. Get flush location from shared memory
>   3. If flush location >= our just-registered boundary, nudge the WAL
>      writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

That's trivial to address - just have a local variable saying whether we need to call NotifySegmentsReadyForArchive(). Note that the finishing_seg path currently calls NotifySegmentsReadyForArchive() before the shared memory flush location is updated.

> > > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> > > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> > > ok. But when s2 is flushed, we'll afaict happily create .ready files
> > > for s2 and s3 despite s4 not yet being written, because earliestSegBoundary is
> > > now s4.
> >
> > In this case, the .ready files for s2 and s3 wouldn't be created until
> > s4 is flushed to disk.

I don't think that's true as the code stands today? The NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4, because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a segment < s4 will then be able to flush s2 and s3?

> > > The whole approach here of delaying .ready creation for these types of
> > > segments seems wrong to me. Doesn't the exact same problem also exist for
> > > streaming rep - which one can also use to maintain a PITR archive? walsender
> > > sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> > > afaict just continue receiving from after that point.
> >
> > The problem with streaming replication is being discussed in a new
> > thread [0].

I don't think it's sensible to fix these separately. It'd be one thing to do that for HEAD, but on the back branches? And that this patch hasn't gotten any performance testing is scary.

Greetings,

Andres Freund
On 8/30/21, 7:39 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
>> If we called NotifySegmentsReadyForArchive() before we updated the
>> flush location in shared memory, we might skip nudging the WAL writer
>> even though we should.
>
> That's trivial to address - just have a local variable saying whether we need
> to call NotifySegmentsReadyForArchive().

I think we can remove the race condition entirely by moving boundary registration to before WALInsertLockRelease(). I attached a patch for discussion.

>> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
>> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
>> > ok. But when s2 is flushed, we'll afaict happily create .ready files
>> > for s2 and s3 despite s4 not yet being written, because earliestSegBoundary is
>> > now s4.
>>
>> In this case, the .ready files for s2 and s3 wouldn't be created until
>> s4 is flushed to disk.
>
> I don't think that's true as the code stands today? The
> NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
> keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> segment < s4 will then be able to flush s2 and s3?

When flushRecPtr is less than both of the segment boundaries, NotifySegmentsReadyForArchive() will return without doing anything. At least, that was the intent. If there is some reason it's not actually working that way, I can work on fixing it.

> I don't think it's sensible to fix these separately. It'd be one thing to do
> that for HEAD, but on the back branches? And that this patch hasn't gotten any
> performance testing is scary.

Are there any specific performance tests you'd like to see? I don't mind running a couple.

Nathan
Attachment
Hi,

On 2021-08-31 06:45:06 +0000, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease(). I attached a patch for
> discussion.

I think it's a bad idea to move more code to before WALInsertLockRelease. There's a very limited number of xlog insert slots, and WAL flushes (e.g. commits) need to wait for insertions to finish.

> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> >> > for s2 and s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < s4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent. If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either! Consider a scenario where there are small records, each spanning just a bit into the next segment, and initially all the data is in wal_buffers.

RegisterSegmentBoundary(s1, s1+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s3, s3+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s4, s4+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10

If there's now a flush request including all of s3, we'll have the following sequence of notifies:

NotifySegmentsReadyForArchive(s1)
  nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
  earliestSegBoundary = s4
  earliestSegBoundaryEndPtr = s4+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10
  latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
  nothing happens, flush is smaller than s4

If the record ending at s4+10 isn't an async commit (and thus XLogCtl->asyncXactLSN is smaller), and there are no further records, we can end up waiting effectively forever for s2 (and s3) to be archived. If all other connections (and autovac etc) are idle, there will be no XLogFlush() calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If already known flushed" path, because the first page in s4 is only partially filled.

Am I missing something?
> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see? I don't
> mind running a couple.

- Parallel copy with > 8 processes
- Parallel non-transactional insertion of small-medium records
  Simulates inserting rows within a transaction
- Parallel transactional insertion of small-medium sized records, with fsync=on
  Plain oltp writes
- Parallel transactional insertion of small-medium sized records, with fsync=off
  fsync=off to simulate a fast server-class SSD (where fsync is instantaneous).
  Of course, if you have one of those, you can also use that.

For the oltp ones I've had good experience simulating workloads with pg_logical_emit_message(). That just hits the WAL path, *drastically* reducing the variance / shortening the required test duration.

Greetings,

Andres Freund
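[For readers following the trace above, here is a standalone toy model of the two-slot bookkeeping. It is a hypothetical simplification: the exact update rules in the committed code may differ, and the notify side's behavior is precisely what is being debated in the surrounding messages. Its point is only the end state: with two slots, the boundaries for s2 and s3 are not separately remembered once s4 registers.]

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogSegNo;
typedef uint64_t XLogRecPtr;

#define MaxXLogSegNo UINT64_MAX

/* Two slots: the earliest boundary sticks, the latest gets overwritten. */
static XLogSegNo earliestSegBoundary = MaxXLogSegNo;
static XLogRecPtr earliestSegBoundaryEndPtr = 0;
static XLogSegNo latestSegBoundary = MaxXLogSegNo;
static XLogRecPtr latestSegBoundaryEndPtr = 0;

static void
RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
{
	if (earliestSegBoundary == MaxXLogSegNo)
	{
		earliestSegBoundary = seg;
		earliestSegBoundaryEndPtr = endpos;
	}
	else if (seg > earliestSegBoundary)
	{
		/* A newer boundary simply replaces the previous "latest" one. */
		latestSegBoundary = seg;
		latestSegBoundaryEndPtr = endpos;
	}
}

int
main(void)
{
	const XLogRecPtr segSize = 1000;	/* toy segment size for readability */

	for (XLogSegNo s = 1; s <= 4; s++)
		RegisterSegmentBoundary(s, s * segSize + 10);

	/* Prints: earliest = s1, latest = s4; s2 and s3 were overwritten. */
	printf("earliest = s%llu, latest = s%llu\n",
		   (unsigned long long) earliestSegBoundary,
		   (unsigned long long) latestSegBoundary);
	return 0;
}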
On 8/31/21, 12:44 AM, "Andres Freund" <andres@anarazel.de> wrote:
> If there's now a flush request including all of s3, we'll have the following
> sequence of notifies:
>
> NotifySegmentsReadyForArchive(s1)
>   nothing happens, smaller than s1+10
>
> NotifySegmentsReadyForArchive(s2)
>   earliestSegBoundary = s4
>   earliestSegBoundaryEndPtr = s4+10
>   latestSegBoundary = s4
>   latestSegBoundaryEndPtr = s4+10
>   latest_boundary_seg = s1
>
> NotifySegmentsReadyForArchive(s3)
>   nothing happens, flush is smaller than s4

When earliestSegBoundary is set to s4, latestSegBoundary will be set to MaxXLogSegNo.

> If the record ending at s4+10 isn't an async commit (and thus
> XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> end up waiting effectively forever for s2 (and s3) to be archived. If all
> other connections (and autovac etc) are idle, there will be no XLogFlush()
> calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> already known flushed" path, because the first page in s4 is only
> partially filled.

I'm not following why s4 wouldn't be flushed in this example. Even if the first page in s4 is only partially filled, that portion of the record should still get flushed, and we'll create the .ready files for s2 and s3 at that time. I tested this by adding some debug logging and creating a small record that crossed segment boundaries but didn't fill the first page of the next segment, and the .ready file was created as expected. Is there a case where we wouldn't flush the end of the record to disk?

During my testing, I did find an obvious bug. We probably shouldn't be calling NotifySegmentsReadyForArchive() when archiving isn't enabled.

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 6a1e16edc2..8ca0d8e616 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,7 +253,8 @@ WalWriterMain(void)
 	 * here to handle a race condition where WAL is flushed to disk prior
 	 * to registering the segment boundary.
 	 */
-	NotifySegmentsReadyForArchive(GetFlushRecPtr());
+	if (XLogArchivingActive())
+		NotifySegmentsReadyForArchive(GetFlushRecPtr());
 
 	/*
 	 * Do what we're here for; then, if XLogBackgroundFlush() found useful

Nathan
Hi,

On 2021-08-31 17:01:31 +0000, Bossart, Nathan wrote:
> > If the record ending at s4+10 isn't an async commit (and thus
> > XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> > end up waiting effectively forever for s2 (and s3) to be archived. If all
> > other connections (and autovac etc) are idle, there will be no XLogFlush()
> > calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> > already known flushed" path, because the first page in s4 is only
> > partially filled.
>
> I'm not following why s4 wouldn't be flushed in this example. Even if
> the first page in s4 is only partially filled, that portion of the
> record should still get flushed, and we'll create the .ready files for
> s2 and s3 at that time.

What would trigger the flushing? We don't write out partially filled pages unless

a) we're explicitly flushing an LSN on the partial page (e.g. because a
   synchronous commit record resides on it)
b) there's an async commit (i.e. a commit with synchronous_commit=off) on
   the page

Greetings,

Andres Freund
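[Stated as code, the rule above is roughly the following. This is a hypothetical restatement for illustration, not the actual flush-decision logic in xlog.c; all parameter names are invented.]

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Hypothetical model of when a partially filled WAL page gets written:
 * case a) someone explicitly flushes an LSN that lies on the page, or
 * case b) an async commit on the page advanced the async-xact LSN past
 * what is already flushed.  Otherwise the partial page just sits in
 * wal_buffers.
 */
static bool
WouldWritePartialPage(XLogRecPtr explicitFlushUpTo,
					  XLogRecPtr asyncXactLSN,
					  XLogRecPtr pageStart,
					  XLogRecPtr flushedUpTo)
{
	bool explicitFlush = explicitFlushUpTo > pageStart;		/* case a) */
	bool asyncCommit = asyncXactLSN > flushedUpTo &&
		asyncXactLSN > pageStart;							/* case b) */

	return explicitFlush || asyncCommit;
}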
On 8/31/21, 10:21 AM, "Andres Freund" <andres@anarazel.de> wrote:
> What would trigger the flushing? We don't write out partially filled pages
> unless
> a) we're explicitly flushing an LSN on the partial page (e.g. because a
>    synchronous commit record resides on it)
> b) there's an async commit (i.e. a commit with synchronous_commit=off) on
>    the page

Ah, so your point is that an open transaction that has written a partial page on the next segment wouldn't trigger a flush. What appears to happen in this case is that the bgwriter eventually creates an xl_running_xacts record and nudges the walwriter to flush it to disk, at which point the .ready file(s) will be created. That's admittedly a bit fragile.

Nathan
Hi,

On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> On 8/31/21, 10:21 AM, "Andres Freund" <andres@anarazel.de> wrote:
> > What would trigger the flushing? We don't write out partially filled pages
> > unless
> > a) we're explicitly flushing an LSN on the partial page (e.g. because a
> >    synchronous commit record resides on it)
> > b) there's an async commit (i.e. a commit with synchronous_commit=off) on
> >    the page
>
> Ah, so your point is that an open transaction that has written a
> partial page on the next segment wouldn't trigger a flush.

It doesn't have to be a transaction; it can be a checkpoint or xl_running_xacts, or ... as well.

> What appears to happen in this case is that the bgwriter eventually creates an
> xl_running_xacts record and nudges the walwriter to flush it to disk, at which
> point the .ready file(s) will be created. That's admittedly a bit fragile.

That's not guaranteed to happen. If e.g. the partial record is a checkpoint or an xl_running_xacts, we'll not trigger further WAL writes in the background, unless autovacuum ends up doing something.

Regards,

Andres
On 8/31/21, 1:30 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
>> What appears to happen in this case is that the bgwriter eventually creates an
>> xl_running_xacts record and nudges the walwriter to flush it to disk, at which
>> point the .ready file(s) will be created. That's admittedly a bit fragile.
>
> That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> an xl_running_xacts, we'll not trigger further WAL writes in the background,
> unless autovacuum ends up doing something.

Right. Per the attached patch, a simple way to handle that could be to teach XLogBackgroundFlush() to flush to the "earliest" segment boundary if it doesn't find anything else to do. I think you could still miss creating a .ready file for the previous segment in single-user mode, though.

Nathan
Attachment
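[A sketch of roughly what that fallback could look like, as a standalone model rather than the actual XLogBackgroundFlush(); the names and the exact rule are assumptions for illustration, and the attached patch is authoritative for the real proposal.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Model of the relevant state. */
static XLogRecPtr flushedUpTo = 0;			/* everything <= this is on disk */
static XLogRecPtr normalFlushTarget = 0;	/* what the usual heuristics picked */
static XLogRecPtr earliestSegBoundaryEndPtr = InvalidXLogRecPtr;

/*
 * If the usual heuristics found nothing to do, fall back to flushing up
 * to the earliest registered segment boundary, so that a cross-segment
 * record sitting on a partially filled page still gets flushed and its
 * .ready files created.
 */
static bool
BackgroundFlushModel(void)
{
	XLogRecPtr target = normalFlushTarget;

	if (target <= flushedUpTo &&
		earliestSegBoundaryEndPtr != InvalidXLogRecPtr &&
		earliestSegBoundaryEndPtr > flushedUpTo)
		target = earliestSegBoundaryEndPtr;

	if (target <= flushedUpTo)
		return false;			/* really nothing to do */

	flushedUpTo = target;		/* XLogWrite() plus fsync in the backend */
	printf("flushed to %llu; pending .ready files can now be created\n",
		   (unsigned long long) flushedUpTo);
	return true;
}

int
main(void)
{
	earliestSegBoundaryEndPtr = 4010;	/* record ends just past a boundary */
	BackgroundFlushModel();				/* flushes to 4010 */
	return 0;
}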
On 2021-08-31 23:31:15 +0000, Bossart, Nathan wrote:
> On 8/31/21, 1:30 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> >> What appears to happen in this case is that the bgwriter eventually creates an
> >> xl_running_xacts record and nudges the walwriter to flush it to disk, at which
> >> point the .ready file(s) will be created. That's admittedly a bit fragile.
> >
> > That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> > an xl_running_xacts, we'll not trigger further WAL writes in the background,
> > unless autovacuum ends up doing something.
>
> Right. Per the attached patch, a simple way to handle that could be
> to teach XLogBackgroundFlush() to flush to the "earliest" segment
> boundary if it doesn't find anything else to do. I think you could
> still miss creating a .ready file for the previous segment in
> single-user mode, though.

Maybe, but this is getting uglier and uglier.

I think the patch should be reverted. It's not in a state that's appropriate for the back branches.
On 2021-Aug-31, Andres Freund wrote:

> Maybe, but this is getting uglier and uglier.
>
> I think the patch should be reverted. It's not in a state that's
> appropriate for the back branches.

Yeah, that's becoming my conclusion too -- undo that, and start from scratch using the other idea.

--
Álvaro Herrera  Valdivia, Chile — https://www.EnterpriseDB.com/
On Tue, Aug 31, 2021 at 08:52:05PM -0400, alvherre@alvh.no-ip.org wrote:
> Yeah, that's becoming my conclusion too -- undo that, and start from
> scratch using the other idea.

That's about 515e3d8, right? I have not looked in detail at what you have here, but this produces a compilation warning on Windows for me with this part of the patch:

+RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
+{
+	XLogSegNo	segno PG_USED_FOR_ASSERTS_ONLY;

segno ends up being an unreferenced local variable.

--
Michael
Attachment
On 2021-Sep-01, Michael Paquier wrote:

> That's about 515e3d8, right?

Yes.

> I have not looked in detail at what you have here, but this produces
> a compilation warning on Windows for me with this part of the patch:

This seems a tiny speck in a sea of bogosity. If you want to silence the warning, be my guest, but in the long run I am inclined to revert the whole commit once I have a better picture of a way forward.

--
Álvaro Herrera  39°49'30"S 73°17'W — https://www.EnterpriseDB.com/