Thread: prevent immature WAL streaming
Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
variable").

As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet.  This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

This patch is probably incomplete.  First, I'm not sure that logical
replication is affected by this problem.  I think it isn't, because
logical replication will halt until the record can be read completely --
maybe I'm wrong and there is a way for things to go wrong with logical
replication as well.  But also, I need to look at the other uses of
GetFlushRecPtr() and see if those need to change to the new function too
or they can remain what they are now.

I'd also like to have tests.  That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done.  (It
sounds easier to write tests with a molasses-archive_command.)

[1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Attachment
At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
> 
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.
> 
> This patch is probably incomplete.  First, I'm not sure that logical
> replication is affected by this problem.  I think it isn't, because
> logical replication will halt until the record can be read completely --
> maybe I'm wrong and there is a way for things to go wrong with logical
> replication as well.  But also, I need to look at the other uses of
> GetFlushRecPtr() and see if those need to change to the new function too
> or they can remain what they are now.
> 
> I'd also like to have tests.  That seems moderately hard, but if we had
> WAL-molasses that could be used in walreceiver, it could be done.  (It
> sounds easier to write tests with a molasses-archive_command.)
> 
> [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

For our information, this issue is related to commit 0668719801, which
makes XLogPageRead restart reading a (continued or segment-spanning)
record when switching sources.  In that thread, I modified the code to
cause a server crash under the desired situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/23/21, 3:53 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.

Nathan
On 2021-Aug-24, Bossart, Nathan wrote:

> I wonder if we need to move the call to RegisterSegmentBoundary() to
> somewhere before WALInsertLockRelease() for this to work as intended.
> Right now, boundary registration could take place after the flush
> pointer has been advanced, which means GetSafeFlushRecPtr() could
> still return an unsafe position.

Yeah, you're right -- that's a definite risk.  I didn't try to reproduce
a problem with that, but it seems pretty obvious that it can happen.

I didn't have a lot of luck with a reliable reproducer script.  I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred, but a few times it would just get stuck with a WAL segment
containing a bogus record.  Then, after the patch, the problem no longer
occurs.

I attach the patch with the change you suggested.

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
               http://archives.postgresql.org/message-id/482D1632.8010507@sigaev.ru
Attachment
On 8/24/21, 4:03 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>
>> I wonder if we need to move the call to RegisterSegmentBoundary() to
>> somewhere before WALInsertLockRelease() for this to work as intended.
>> Right now, boundary registration could take place after the flush
>> pointer has been advanced, which means GetSafeFlushRecPtr() could
>> still return an unsafe position.
>
> Yeah, you're right -- that's a definite risk.  I didn't try to reproduce
> a problem with that, but it seems pretty obvious that it can happen.
>
> I didn't have a lot of luck with a reliable reproducer script.  I was
> able to reproduce the problem starting with Ryo Matsumura's script and
> attaching a replica; most of the time the replica would recover by
> restarting from a streaming position earlier than where the problem
> occurred; but a few times it would just get stuck with a WAL segment
> containing a bogus record.  Then, after the patch, the problem no longer
> occurs.

If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one.  It's just
the first one that has been registered.  I did this for simplicity for
the .ready file fix, but I can see it causing problems here.  I think
we can move earliestSegBoundary backwards as long as it is greater
than lastNotifiedSeg + 1.  However, it's still not necessarily the
earliest one if we copied latestSegBoundary to earliestSegBoundary in
NotifySegmentsReadyForArchive().  To handle this, we could track
several boundaries in an array, but then we'd have to hold the safe
LSN back whenever the array overflowed and we started forgetting
boundaries.  Perhaps there's a simpler solution.  I'll keep thinking...

Nathan
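[To make the array idea above concrete, here is a hypothetical sketch of
tracking several boundaries at once.  None of these names
(SegBoundaryEntry, GetSafeStreamPtr, MAX_TRACKED_BOUNDARIES) exist in the
tree; locking and eviction are omitted, and only the conservative
"hold the safe LSN back after overflow" fallback is shown.]

#include "access/xlogdefs.h"	/* XLogRecPtr, InvalidXLogRecPtr */

#define MAX_TRACKED_BOUNDARIES 8

typedef struct SegBoundaryEntry
{
	XLogRecPtr	startPtr;		/* start of a record that crosses a segment */
	XLogRecPtr	endPtr;			/* end+1 of that record */
} SegBoundaryEntry;

typedef struct SegBoundaryState
{
	SegBoundaryEntry entries[MAX_TRACKED_BOUNDARIES];
	int			nentries;
	XLogRecPtr	oldestForgotten;	/* oldest start we had to forget, or invalid */
} SegBoundaryState;

/*
 * Return the LSN up to which streaming/archiving is safe: the flush
 * pointer, held back to the start of any registered cross-segment record
 * that the flush pointer currently falls inside.  If a boundary was ever
 * forgotten due to overflow, never report past its start.
 */
static XLogRecPtr
GetSafeStreamPtr(SegBoundaryState *state, XLogRecPtr flushPtr)
{
	XLogRecPtr	safe = flushPtr;

	for (int i = 0; i < state->nentries; i++)
	{
		SegBoundaryEntry *e = &state->entries[i];

		/* a record straddling the flush point must not be sent out yet */
		if (e->startPtr < flushPtr && e->endPtr > flushPtr &&
			e->startPtr < safe)
			safe = e->startPtr;
	}

	if (!XLogRecPtrIsInvalid(state->oldestForgotten) &&
		state->oldestForgotten < safe)
		safe = state->oldestForgotten;

	return safe;
}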
Hi Álvaro, -hackers,

> I attach the patch with the change you suggested.

I've given the v02 patch a shot on top of REL_12_STABLE (already
including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc).  Previously
(yesterday), without the v02 patch, I was getting standby corruption
every time, via simulation by having a separate /pg_xlog dedicated fs,
archive_mode=on, wal_keep_segments=120, archive_command set to rsync to
a different dir on the same fs, and wal_init_zero at default (true).

Today (with v02) I've got corruption in only the initial 2 runs out of
more than ~30 tries on the standby.  Probably the 2 failures were
somehow my fault (?) or some rare condition (and in 1 of those 2 cases
simply restarting the standby did help).  To be honest I've tried to
force this error, but with v02 I simply cannot force this error anymore,
so that's good! :)

> I didn't have a lot of luck with a reliable reproducer script.  I was able to
> reproduce the problem starting with Ryo Matsumura's script and attaching
> a replica; most of the time the replica would recover by restarting from a
> streaming position earlier than where the problem occurred; but a few
> times it would just get stuck with a WAL segment containing a bogus
> record.

In order to get a reliable reproducer and proper fault injection instead
of playing with really filling up the fs, apparently one could
substitute the fd with an fd for /dev/full using e.g. dup2(), so that
every write is going to throw this error too:

root@hive:~# ./t &   # simple while(1) { fprintf() flush() } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==> fflush/write(): : No space left on device

So I've also tried to be malicious while writing to the DB and inject
ENOSPC near places like:

a) XLogWrite()->XLogFileInit() near line 3322 // assuming wal_init_zero
   is true, one gets the classic "PANIC: could not write to file
   "pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get
   "PANIC: could not write to log file 000000010000003B000000A8 at
   offset 0, length 15466496: No space left on device" (that would be
   possible with wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync, to get
   "PANIC: could not fdatasync file "000000010000004300000004": Invalid
   argument"; that would pretty much mean the same as above but with the
   last possible offset near the end of WAL?

This was done with gdb voodoo:

handle SIGUSR1 noprint nostop
break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has the current log as fd=3

After restarting the master and inspecting the standby -- in all of
those above 3 cases -- the standby didn't exhibit the "invalid
contrecord length" corruption, at least here, while without this v02
patch it is notorious.  So if it passes the worst-case code review
assumptions, I would be wondering if it shouldn't even be committed as
it stands right now.

-J.
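[For what it's worth, the /dev/full trick above can also be demonstrated
outside gdb with a few lines of plain C -- a sketch; the scratch file
name is arbitrary.  After the dup2() every write on the original
descriptor fails with ENOSPC.]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fd = open("/tmp/testwrite", O_WRONLY | O_CREAT | O_TRUNC, 0666);
	int			full = open("/dev/full", O_WRONLY);

	if (fd < 0 || full < 0)
		return 1;

	/* make fd point at /dev/full, as the gdb dup2() call above does */
	if (dup2(full, fd) < 0)
		return 1;

	if (write(fd, "x", 1) < 0)
		perror("write");		/* prints: No space left on device */
	return 0;
}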
On 2021-Aug-24, Bossart, Nathan wrote:

> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
> pointer from advancing before we register the boundary, I bet we could
> also remove the WAL writer nudge.

Can you elaborate on this?  I'm not sure I see the connection.

> Another interesting thing I see is that the boundary stored in
> earliestSegBoundary is not necessarily the earliest one.  It's just
> the first one that has been registered.  I did this for simplicity for
> the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here?  Surely the flush point cannot go
past whatever has been written.  If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position.  So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> > I'd also like to have tests.  That seems moderately hard, but if we had
> > WAL-molasses that could be used in walreceiver, it could be done.  (It
> > sounds easier to write tests with a molasses-archive_command.)
> >
> > [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> > [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
>
> (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>
>> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
>> pointer from advancing before we register the boundary, I bet we could
>> also remove the WAL writer nudge.
>
> Can you elaborate on this?  I'm not sure I see the connection.

The reason we are moving RegisterSegmentBoundary() to before
WALInsertLockRelease() is because we believe it will prevent boundary
registration from taking place after the flush pointer has already
advanced past the boundary in question.  We had added the WAL writer
nudge to make sure we called NotifySegmentsReadyForArchive() whenever
that happened.  If moving boundary registration to before we release
the lock(s) is enough to prevent the race condition with the flush
pointer, then ISTM we no longer have to worry about nudging the WAL
writer.

>> Another interesting thing I see is that the boundary stored in
>> earliestSegBoundary is not necessarily the earliest one.  It's just
>> the first one that has been registered.  I did this for simplicity for
>> the .ready file fix, but I can see it causing problems here.
>
> Hmm, is there really a problem here?  Surely the flush point cannot go
> past whatever has been written.  If somebody is writing an earlier
> section of WAL, then we cannot move the flush pointer to a later
> position.  So it doesn't matter if the earliest point we have registered
> is the true earliest -- we only care for it to be the earliest that is
> past the flush point.

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

      F           E           L
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Then, we write out WAL to disk and create .ready files as needed.  If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

                        F     E
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

Nathan
On 2021-Aug-25, Jakub Wartak wrote:

> In order to get a reliable reproducer and proper fault injection instead
> of playing with really filling up the fs, apparently one could
> substitute the fd with an fd for /dev/full using e.g. dup2(), so that
> every write is going to throw this error too:

Oh, this is a neat trick that I didn't know about.  Thanks.

> After restarting the master and inspecting the standby -- in all of
> those above 3 cases -- the standby didn't exhibit the "invalid
> contrecord length" corruption, at least here, while without this v02
> patch it is notorious.  So if it passes the worst-case code review
> assumptions, I would be wondering if it shouldn't even be committed as
> it stands right now.

Well, Nathan is right that this patch isn't really closing the hole,
because we aren't tracking segment boundaries that are neither
"earliest" nor "latest" at the moment of registration.  The simplistic
approach seems okay for the archive case, but not for the replication
case.

I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive().  Doh.  This part is easy to solve.  The
other doesn't look easy ...

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery.  That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction.  However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
                                                           (Paul Graham)
Attachment
At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > On 2021-Aug-24, Bossart, Nathan wrote:
> >> Another interesting thing I see is that the boundary stored in
> >> earliestSegBoundary is not necessarily the earliest one.  It's just
> >> the first one that has been registered.  I did this for simplicity for
> >> the .ready file fix, but I can see it causing problems here.
> >
> > Hmm, is there really a problem here?  Surely the flush point cannot go
> > past whatever has been written.  If somebody is writing an earlier
> > section of WAL, then we cannot move the flush pointer to a later
> > position.  So it doesn't matter if the earliest point we have registered
> > is the true earliest -- we only care for it to be the earliest that is
> > past the flush point.
> 
> Let's say we have the following situation (F = flush, E = earliest
> registered boundary, and L = latest registered boundary), and let's
> assume that each segment has a cross-segment record that ends in the
> next segment.
> 
>       F           E           L
> |-----|-----|-----|-----|-----|-----|-----|-----|
> 1     2     3     4     5     6     7     8
> 
> Then, we write out WAL to disk and create .ready files as needed.  If
> we didn't flush beyond the latest registered boundary, the latest
> registered boundary now becomes the earliest boundary.
> 
>                         F     E
> |-----|-----|-----|-----|-----|-----|-----|-----|
> 1     2     3     4     5     6     7     8
> 
> At this point, the earliest segment boundary past the flush point is
> before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the region [E L], so we
cannot add a .ready file if the flush is in that region.  I haven't
looked at the latest patch (or I may misunderstand the discussion here)
but I think we shouldn't move E before F exceeds the previous (i.e., in
the first picture above) L.  Things are done that way in my ancient
proposal in [1].

https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch

+  if (LogwrtResult.Write < firstSegContRecStart ||
+      lastSegContRecEnd <= LogwrtResult.Write)
+  {
     <notify the last segment>

By doing so, at the time of the second picture, the pointers are set as:

                  E     F     L
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Then the pointers are cleared at the time F reaches L:

                              F
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Doesn't this work correctly?  As I think I mentioned in the thread, I
don't think we have so many (more than several, specifically) segments
in the region [E L].

[1] https://www.postgresql.org/message-id/20201216.110120.887433782054853494.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
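[For readers following along, the rule in the snippet above can be read
as a small predicate -- a sketch only, using the variable names from the
quoted patch; locking and how E/L get registered and cleared are
omitted.]

/*
 * Finished segments may be notified for archiving only while the flush
 * pointer is not inside the region [E, L) covered by known cross-segment
 * records, where E = firstSegContRecStart and L = lastSegContRecEnd.
 */
static bool
CanNotifyFinishedSegments(XLogRecPtr flush,
						  XLogRecPtr firstSegContRecStart,
						  XLogRecPtr lastSegContRecEnd)
{
	/* no cross-segment record is currently registered */
	if (XLogRecPtrIsInvalid(firstSegContRecStart))
		return true;

	return flush < firstSegContRecStart || lastSegContRecEnd <= flush;
}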
At Wed, 25 Aug 2021 20:20:04 -0400, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote in 
> BTW while going about testing this, I noticed that we forbid
> pg_walfile_name() while in recovery.  That restriction was added by
> commit 370f770c15a4 because ThisTimeLineID was not set correctly during
> recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
> thought that it should be possible to remove the restriction.  However,
> I did that per the attached patch, but was quickly disappointed because
> ThisTimeLineID seems to remain zero in a standby for reasons that I
> didn't investigate.

On an intermediate node of a cascading replication set, the timeline ID
on the walsender and the walreceiver can differ, and ordinary backends
cannot decide which to use.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
(this is off-topic here)

At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> > > I'd also like to have tests.  That seems moderately hard, but if we had
> > > WAL-molasses that could be used in walreceiver, it could be done.  (It
> > > sounds easier to write tests with a molasses-archive_command.)
> > >
> > > [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> > > [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
> >
> > (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
>
> I think, but am not 100% sure, that "molasses" here is being used to
> refer to fault injection.

Oh.  That makes sense, thanks.  I sometimes inject artificial faults (a
server crash, in this case) to create specific on-disk states, but I
cannot imagine that that kind of machinery can be statically placed in
the source tree..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> Let's say we have the following situation (F = flush, E = earliest
>> registered boundary, and L = latest registered boundary), and let's
>> assume that each segment has a cross-segment record that ends in the
>> next segment.
>>
>>       F           E           L
>> |-----|-----|-----|-----|-----|-----|-----|-----|
>> 1     2     3     4     5     6     7     8
>>
>> Then, we write out WAL to disk and create .ready files as needed.  If
>> we didn't flush beyond the latest registered boundary, the latest
>> registered boundary now becomes the earliest boundary.
>>
>>                         F     E
>> |-----|-----|-----|-----|-----|-----|-----|-----|
>> 1     2     3     4     5     6     7     8
>>
>> At this point, the earliest segment boundary past the flush point is
>> before the "earliest" boundary we are tracking.
>
> We know we have some cross-segment records in the region [E L], so we
> cannot add a .ready file if the flush is in that region.  I haven't
> looked at the latest patch (or I may misunderstand the discussion here)
> but I think we shouldn't move E before F exceeds the previous (i.e., in
> the first picture above) L.  Things are done that way in my ancient
> proposal in [1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it, as well as the latest
registered boundary.  I think it is important that any segment
boundary tracking mechanism does at least those two things.  I don't
see how we could do that if we didn't update E until F passed both E
and L.

Nathan
At Thu, 26 Aug 2021 03:24:48 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> >> Let's say we have the following situation (F = flush, E = earliest
> >> registered boundary, and L = latest registered boundary), and let's
> >> assume that each segment has a cross-segment record that ends in the
> >> next segment.
> >>
> >>       F           E           L
> >> |-----|-----|-----|-----|-----|-----|-----|-----|
> >> 1     2     3     4     5     6     7     8
> >>
> >> Then, we write out WAL to disk and create .ready files as needed.  If
> >> we didn't flush beyond the latest registered boundary, the latest
> >> registered boundary now becomes the earliest boundary.
> >>
> >>                         F     E
> >> |-----|-----|-----|-----|-----|-----|-----|-----|
> >> 1     2     3     4     5     6     7     8
> >>
> >> At this point, the earliest segment boundary past the flush point is
> >> before the "earliest" boundary we are tracking.
> >
> > We know we have some cross-segment records in the region [E L], so we
> > cannot add a .ready file if the flush is in that region.  I haven't
> > looked at the latest patch (or I may misunderstand the discussion here)
> > but I think we shouldn't move E before F exceeds the previous (i.e., in
> > the first picture above) L.  Things are done that way in my ancient
> > proposal in [1].
>
> The strategy in place ensures that we track a boundary that doesn't
> change until the flush position passes it, as well as the latest
> registered boundary.  I think it is important that any segment
> boundary tracking mechanism does at least those two things.  I don't
> see how we could do that if we didn't update E until F passed both E
> and L.

(Sorry, but I didn't get you clearly.  So the discussion below might be
pointless.)

The ancient patch did this: if a flush didn't reach E, we can archive
finished segments.  If a flush ends between E and L, we shouldn't
archive finished segments at all.  L can be moved further while in this
state, while E sits at the same location.  Once a flush passes L, we can
archive all finished segments and can erase both E and L.

A drawback of this strategy is that the region [E L] can contain gaps
(that is, segment boundaries that are not crossed by a continuation
record), and archiving can be excessively delayed.  Perhaps if the flush
falls behind the write head by more than two segments, the probability
of creating such gaps would be higher.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
>
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case,
  rather than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider
  what happens if we need to promote with the end of local WAL spanning
  a segment boundary, and what happens to cascading replicas).  We have
  some logic to try to deal with that, but it's pretty grotty and I
  think incomplete.

- It seems to make some future optimizations harder - we should work
  towards replicating data sooner, rather than the opposite.  Right now
  that's a major bottleneck around syncrep.

- Once XLogFlush() for some LSN returned, we can write that LSN to disk.
  The LSN doesn't necessarily have to correspond to a specific on-disk
  location (it could e.g. be the return value from GetFlushRecPtr()).
  But "rewinding" to before the last record makes that problematic.

- I suspect that schemes with heuristic knowledge of segment-boundary-
  spanning records have deadlock or at least latency spike issues.  What
  if synchronous commit needs to flush up to a certain record boundary,
  but streaming rep doesn't replicate it out because there are
  segment-spanning records both before and after?

I think a better approach might be to handle this on the WAL layout
level.  What if we never overwrite partial records but instead just
skipped over them during decoding?

Of course there's some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.  But we
could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page.  If that flag is set,
xlp_rem_len could contain the checksum of the partial record.

Greetings,

Andres Freund
On 2021-Aug-30, Andres Freund wrote:

> I'm doubtful that the approach of adding awareness of record boundaries
> is a good path to go down:

Honestly, I do not like it one bit, and if I can avoid relying on them
while making the whole thing work correctly, I am happy.  Clearly it
wasn't a problem for the ancient recovery-only WAL design, but as soon
as we added replication on top, the whole issue of continuation records
became a bug.  I do think that the code should be first correct and
second performant, though.

> - There are very similar issues with promotions of replicas (consider
>   what happens if we need to promote with the end of local WAL spanning
>   a segment boundary, and what happens to cascading replicas).  We have
>   some logic to try to deal with that, but it's pretty grotty and I
>   think incomplete.

Ouch, I hadn't thought of cascading replicas.

> - It seems to make some future optimizations harder - we should work
>   towards replicating data sooner, rather than the opposite.  Right now
>   that's a major bottleneck around syncrep.

Absolutely.

> I think a better approach might be to handle this on the WAL layout
> level.  What if we never overwrite partial records but instead just
> skipped over them during decoding?

Maybe this is a workable approach; let's work it out fully.

Let me see if I understand what you mean:

* We would remove the logic to inhibit archiving and streaming-
  replicating the tail end of a split WAL record; that logic deals with
  bytes only, so it doesn't have to be aware of record boundaries.

* On WAL replay, we ignore records that are split across a segment
  boundary and whose checksum does not match.

* On WAL write ... ?  How do we detect after recovery that a record that
  was being written, and potentially was sent to the archive, needs to
  be "skipped"?

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
Hi,

On 2021-08-31 09:56:30 -0400, Alvaro Herrera wrote:
> On 2021-Aug-30, Andres Freund wrote:
> > I think a better approach might be to handle this on the WAL layout
> > level.  What if we never overwrite partial records but instead just
> > skipped over them during decoding?
>
> Maybe this is a workable approach; let's work it out fully.
>
> Let me see if I understand what you mean:
> * We would remove the logic to inhibit archiving and streaming-
>   replicating the tail end of a split WAL record; that logic deals with
>   bytes only, so it doesn't have to be aware of record boundaries.
> * On WAL replay, we ignore records that are split across a segment
>   boundary and whose checksum does not match.
> * On WAL write ... ?

I was thinking that on a normal WAL write we'd do nothing.  Instead we
would have dedicated code at the end of recovery that, if the WAL ends
in a partial record, changes the page following the "valid" portion of
the WAL to indicate that an incomplete record is to be skipped.

Of course, we need to be careful to not weaken WAL validity checking too
much.  How about the following:

If we're "aborting" a continued record, we set
XLP_FIRST_IS_ABORTED_PARTIAL on the page at which we do so (i.e. the
page after the valid end of the WAL).

On a page with XLP_FIRST_IS_ABORTED_PARTIAL we expect a special type of
record to start just after the page header.  That record contains
sufficient information for us to verify the validity of the partial
record (since its checksum and length aren't valid, and may not even be
all readable if the record header itself was split).  I think it would
make sense to include the LSN of the aborted record, and a checksum of
the partial data.

> How do we detect after recovery that a record that was being written,
> and potentially was sent to the archive, needs to be "skipped"?

I think we can just read the WAL and see if it ends with a partial
record.  It'd add a bit of complication to the error checking in
xlogreader, because we'd likely want to treat verification from page
headers a bit differently from verification due to record data.  But
that seems doable.

Does this make sense?

Greetings,

Andres Freund
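[To visualize the proposal, a hypothetical shape for the new page flag
and the special record could look like the following.  Neither the flag
value nor the struct exists at this point in the thread; the names are
illustrative only.]

/*
 * New xlp_info bit: the first thing on this page announces that the
 * continuation record started on an earlier page was abandoned.
 */
#define XLP_FIRST_IS_ABORTED_PARTIAL	0x0008

/*
 * Body of the special record expected immediately after the page header
 * of a page carrying XLP_FIRST_IS_ABORTED_PARTIAL, per the description
 * above.
 */
typedef struct xl_aborted_contrecord
{
	XLogRecPtr	aborted_lsn;	/* start LSN of the record being abandoned */
	pg_crc32c	partial_crc;	/* checksum of the partial data actually written */
} xl_aborted_contrecord;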
On 2021/09/01 0:53, Andres Freund wrote:
> I was thinking that on a normal WAL write we'd do nothing.  Instead we would
> have dedicated code at the end of recovery that, if the WAL ends in a partial
> record, changes the page following the "valid" portion of the WAL to indicate
> that an incomplete record is to be skipped.

Agreed!

> Of course, we need to be careful to not weaken WAL validity checking too
> much.  How about the following:
>
> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> the page at which we do so (i.e. the page after the valid end of the WAL).

When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
when recovery finds a partially-flushed segment-spanning record?  But
maybe we cannot do that (i.e., cannot overwrite the page) because the
page that the flag is set in might have already been archived.  No?

> I think we can just read the WAL and see if it ends with a partial
> record.  It'd add a bit of complication to the error checking in xlogreader,
> because we'd likely want to treat verification from page headers a bit
> differently from verification due to record data.  But that seems doable.

Yes.

> Does this make sense?

Yes, I think!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi,

On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> On 2021/09/01 0:53, Andres Freund wrote:
> > Of course, we need to be careful to not weaken WAL validity checking too
> > much.  How about the following:
> >
> > If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> > the page at which we do so (i.e. the page after the valid end of the WAL).
>
> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> when recovery finds a partially-flushed segment-spanning record?
> But maybe we cannot do that (i.e., cannot overwrite the page) because
> the page that the flag is set in might have already been archived.  No?

I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
"tail end" of a partial record.  I.e. if there's a partial record
starting in the successfully archived segment A, but the end of the
record, in B, has not been written to disk before a crash, we'd set
XLP_FIRST_IS_ABORTED_PARTIAL at the end of the valid data in B.  Which
could not have been archived yet, or we'd not have a partial record.  So
we should never need to set the flag on an already archived page.

Greetings,

Andres Freund
At Tue, 31 Aug 2021 20:15:24 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
>
> On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> > On 2021/09/01 0:53, Andres Freund wrote:
> > > Of course, we need to be careful to not weaken WAL validity checking too
> > > much.  How about the following:
> > >
> > > If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> > > the page at which we do so (i.e. the page after the valid end of the WAL).
> >
> > When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> > when recovery finds a partially-flushed segment-spanning record?
> > But maybe we cannot do that (i.e., cannot overwrite the page) because
> > the page that the flag is set in might have already been archived.  No?
>
> I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> end" of a partial record.  I.e. if there's a partial record starting in the
> successfully archived segment A, but the end of the record, in B, has not been
> written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> end of the valid data in B.  Which could not have been archived yet, or we'd
> not have a partial record.  So we should never need to set the flag on an
> already archived page.

I agree that that makes sense.  Is that to say that crash recovery
remembers whether the last record was an immature record that looks like
it continues into the next segment, and if so, sets the flag when
inserting the first record, which would be CHECKPOINT_SHUTDOWN?  (And
the reader deals with it.)

I'll try to show how it looks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/09/01 12:15, Andres Freund wrote:
> Hi,
>
> On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
>> On 2021/09/01 0:53, Andres Freund wrote:
>>> Of course, we need to be careful to not weaken WAL validity checking too
>>> much.  How about the following:
>>>
>>> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
>>> the page at which we do so (i.e. the page after the valid end of the WAL).
>>
>> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
>> when recovery finds a partially-flushed segment-spanning record?
>> But maybe we cannot do that (i.e., cannot overwrite the page) because
>> the page that the flag is set in might have already been archived.  No?
>
> I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> end" of a partial record.  I.e. if there's a partial record starting in the
> successfully archived segment A, but the end of the record, in B, has not been
> written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> end of the valid data in B.  Which could not have been archived yet, or we'd
> not have a partial record.  So we should never need to set the flag on an
> already archived page.

Thanks for clarifying that!  Unless I misunderstand it, when recovery
ends at a partially-flushed segment-spanning record, it sets
XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to
write new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following a
partially-flushed segment-spanning record.  If neither of them is set,
the validation code in recovery should report an error.

Yes, this design looks good to me.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
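[A sketch of that validation rule, as it might look in the xlogreader
page checks -- simplified, and XLP_FIRST_IS_ABORTED_PARTIAL is the flag
proposed above, not an existing one.]

static bool
ContinuationPageIsValid(XLogPageHeader pageHeader, bool expectingContrecord)
{
	bool		is_cont = (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
	bool		is_abort = (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) != 0;

	/* the two flags are mutually exclusive */
	if (is_cont && is_abort)
		return false;

	if (expectingContrecord)
		return is_cont || is_abort;	/* one of them must be set */

	/* not expecting a continuation: neither flag may appear */
	return !is_cont && !is_abort;
}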
Hi,

On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
> a partially-flushed segment-spanning record.  If neither of them is set,
> the validation code in recovery should report an error.

Right.  With the small addition that I think we shouldn't just do this
for segment-spanning records, but for all records spanning pages.

Greetings,

Andres Freund
On 9/1/21, 10:06 AM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
>> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
>> at a partially-flushed segment-spanning record, it sets
>> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
>> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
>> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
>> a partially-flushed segment-spanning record.  If neither of them is set,
>> the validation code in recovery should report an error.
>
> Right.  With the small addition that I think we shouldn't just do this for
> segment-spanning records, but for all records spanning pages.

This approach seems promising.  I like that it avoids adding extra work
in the hot path for writing WAL.  I'm assuming it won't be
back-patchable, though.

Nathan
At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> On 2021/09/01 12:15, Andres Freund wrote:
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking too
> >>> much.  How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> >> when recovery finds a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived.  No?
> >
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> > end" of a partial record.  I.e. if there's a partial record starting in the
> > successfully archived segment A, but the end of the record, in B, has not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> > end of the valid data in B.  Which could not have been archived yet, or we'd
> > not have a partial record.  So we should never need to set the flag on an
> > already archived page.
>
> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
> a partially-flushed segment-spanning record.  If neither of them is set,
> the validation code in recovery should report an error.
>
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
  continuation record should be seen, xlogreader->ContRecAbortPtr is set
  to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
  as the next WAL insertion location, and the same value is set to
  XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer() (called
  under XLogInsertRecord()) sets the XLP_FIRST_IS_ABORTED_PARTIAL flag on
  the page.

While in recovery:

- When XLogReadRecord meets an XLP_FIRST_IS_ABORT_PARTIAL page, it
  rereads a record from there.

In this PoC:

1. This patch is written on the current master, but it doesn't interfere
   with the seg-boundary-memorize patch since it removes the calls to
   RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log message immediately, we don't have
   a means to leave a log message to inform that recovery met an aborted
   partial continuation record.  (In this PoC, it is done by fprintf :p)

3. Maybe we need pg_waldump to show partial continuation records, but
   I'm not sure how to realize that.

4. By the way, is this (method) applicable at this stage?

The first attachment is the PoC, including a debug-crash aid.  The
second is the repro script.  It fails to reproduce the situation once
per several trials.  The following log messages are shown by a run of
the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG:  database system is shut down
> ...
>
> LOG:  redo starts at 0/2000028
> LOG:  redo done at 0/6FFFFA8 system usage:...
> LOG:  #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)

The record from 6FFFFE8 is missing the trailing part after 7000000.

> LOG:  #### EndOfLog=0/7000000
> LOG:  #### set XLP_FIRST_IS_ABORT_PARTIAL@0/7000000

So, WAL insertion starts from 7000000 and the first page is set the flag.

> LOG:  database system is ready to accept connections
> ...
> LOG:  database system is shut down
> ...
> #########################
> ...
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000100
> ...
> LOG:  restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 0/7000000

The record from 6FFFFE8 is immature, so skip it and continue reading
from 7000000.

> LOG:  last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG:  #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG:  restored log file "000000010000000000000007" from archive
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  #### EndOfLog=0/8000000

Recovery ends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..b0f18e4e5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 bool		track_wal_io_timing = false;
+bool		contrec_aborted = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -586,6 +587,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	replicationSlotMinLSN;	/* oldest LSN needed by any slot */
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
+	XLogRecPtr	contAbortedRecPtr;
 
 	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
 	XLogRecPtr	unloggedLSN;
@@ -735,6 +737,10 @@ typedef struct XLogCtlData
 	XLogSegNo	latestSegBoundary;
 	XLogRecPtr	latestSegBoundaryEndPtr;
 
+	/* BEGIN: FOR DEBUGGING-CRASH USE */
+	bool		crossseg;
+	/* END: DEBUGGING-CRASH USE */
+
 	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -860,6 +866,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
+static XLogRecPtr ContRecAbortPtr;	/* end+1 of last aborted contrec */
 
 /*
  * Local copies of equivalent fields in the control file. When running
@@ -1178,16 +1185,10 @@ XLogInsertRecord(XLogRecData *rdata,
 		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
 		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
 
-		/*
-		 * Register our crossing the segment boundary if that occurred.
-		 *
-		 * Note that we did not use XLByteToPrevSeg() for determining the
-		 * ending segment.  This is so that a record that fits perfectly into
-		 * the end of the segment causes the latter to get marked ready for
-		 * archival immediately.
-		 */
-		if (StartSeg != EndSeg && XLogArchivingActive())
-			RegisterSegmentBoundary(EndSeg, EndPos);
+		/* BEGIN: FOR DEBUGGING-CRASH USE */
+		if (StartSeg != EndSeg)
+			XLogCtl->crossseg = true;
+		/* END: FOR DEBUGGING-CRASH USE */
 
 		/*
 		 * Advance LogwrtRqst.Write so that it includes new block(s).
@@ -2292,6 +2293,27 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (!Insert->forcePageWrites)
 			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+		/*
+		 * If the last page ended with an aborted partial continuation record,
+		 * mark it to tell the parital record is omittable. Snice this happens
+		 * only at the end of crash recovery, no rece condition here.
+		 */
+		if (XLogCtl->contAbortedRecPtr >= NewPageBeginPtr)
+		{
+			if (XLogCtl->contAbortedRecPtr == NewPageBeginPtr)
+			{
+				NewPage->xlp_info |= XLP_FIRST_IS_ABORT_PARTIAL;
+				elog(LOG, "#### set XLP_FIRST_IS_ABORT_PARTIAL@%X/%X",
+					 LSN_FORMAT_ARGS(NewPageBeginPtr));
+			}
+			else
+				elog(LOG, "### incosistent abort location %X/%X, expected %X/%X",
+					 LSN_FORMAT_ARGS(XLogCtl->contAbortedRecPtr),
+					 LSN_FORMAT_ARGS(NewPageBeginPtr));
+
+			XLogCtl->contAbortedRecPtr = InvalidXLogRecPtr;
+		}
+
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
@@ -2644,6 +2666,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		{
 			issue_xlog_fsync(openLogFile, openLogSegNo);
 
+			/* BEGIN: FOR DEBUGGING-CRASH USE */
+			if (XLogCtl->crossseg)
+			{
+				static int	c = 0;
+				struct stat b;
+
+				if (stat("/tmp/hoge", &b) == 0)
+					Assert(c++ < 1);
+			}
+			/* END: FOR DEBUGGING-CRASH USE */
+
 			/* signal that we need to wakeup walsenders later */
 			WalSndWakeupRequest();
 
@@ -4568,6 +4601,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+		ContRecAbortPtr = xlogreader->ContRecAbortPtr;
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -7873,12 +7907,26 @@ StartupXLOG(void)
 	StandbyMode = false;
 
 	/*
-	 * Re-fetch the last valid or last applied record, so we can identify the
-	 * exact endpoint of what we consider the valid portion of WAL.
+	 * The last record may be an immature continuation record at the end of a
+	 * page. We continue writing from ContRecAbortPtr instead of EndRecPtr that
+	 * case.
 	 */
-	XLogBeginRead(xlogreader, LastRec);
-	record = ReadRecord(xlogreader, PANIC, false);
-	EndOfLog = EndRecPtr;
+	elog(LOG, "#### Recovery finished: ContRecAbort: %X/%X (EndRecPtr: %X/%X)", LSN_FORMAT_ARGS(ContRecAbortPtr), LSN_FORMAT_ARGS(EndRecPtr));
+	if (XLogRecPtrIsInvalid(ContRecAbortPtr))
+	{
+		/*
+		 * Re-fetch the last valid or last applied record, so we can identify
+		 * the exact endpoint of what we consider the valid portion of WAL.
+		 */
+		XLogBeginRead(xlogreader, LastRec);
+		record = ReadRecord(xlogreader, PANIC, false);
+		EndOfLog = EndRecPtr;
+	}
+	else
+	{
+		EndOfLog = ContRecAbortPtr;
+		XLogCtl->contAbortedRecPtr = ContRecAbortPtr;
+	}
 
 	/*
 	 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
@@ -8013,7 +8061,8 @@ StartupXLOG(void)
 	Insert = &XLogCtl->Insert;
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
-
+	elog(LOG, "#### EndOfLog=%X/%X", LSN_FORMAT_ARGS(EndOfLog));
+
 	/*
 	 * Tricky point here: readBuf contains the *last* block that the LastRec
 	 * record spans, not the one it starts in.
	 * The last block is indeed the

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5cf74e181a..404db7ce4d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -294,6 +294,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 	ResetDecoder(state);
 
+	state->ContRecAbortPtr = InvalidXLogRecPtr;
 	RecPtr = state->EndRecPtr;
 
 	if (state->ReadRecPtr != InvalidXLogRecPtr)
@@ -319,6 +320,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		randAccess = true;
 	}
 
+retry:
 	state->currRecPtr = RecPtr;
 
 	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
@@ -444,12 +446,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 											   XLOG_BLCKSZ));
 
 			if (readOff < 0)
-				goto err;
+				goto err_partial_contrec;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
 
 			/* Check that the continuation on next page looks valid */
 			pageHeader = (XLogPageHeader) state->readBuf;
+			if (pageHeader->xlp_info & XLP_FIRST_IS_ABORT_PARTIAL)
+			{
+				if (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD)
+				{
+					report_invalid_record(state,
+										  "both XLP_FIRST_IS_CONTRECORD and XLP_FIRST_IS_ABORT_PARTIAL are set at %X/%X",
+										  LSN_FORMAT_ARGS(RecPtr));
+					goto err;
+				}
+
+				fprintf(stderr, "#### aborted partial continuation record found at %X/%X, continue from %X/%X\n", LSN_FORMAT_ARGS(RecPtr), LSN_FORMAT_ARGS(targetPagePtr));
+				ResetDecoder(state);
+				RecPtr = targetPagePtr;
+				goto retry;
+			}
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
@@ -550,6 +567,10 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	else
 		return NULL;
 
+err_partial_contrec:
+	state->ContRecAbortPtr = targetPagePtr;
+	fprintf(stderr, "contrec aborted@%X/%X\n", LSN_FORMAT_ARGS(state->ContRecAbortPtr));
+
 err:
 
 	/*
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3b5eceff65..6390812a5a 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLP_LONG_HEADER				0x0002
 /* This flag indicates backup blocks starting in this page are optional */
 #define XLP_BKP_REMOVABLE			0x0004
+/* This flag indicates the first record in this page breaks a contrecord */
+#define XLP_FIRST_IS_ABORT_PARTIAL	0x0008
 /* All defined flag bits in xlp_info (used for validity checking of header) */
-#define XLP_ALL_FLAGS				0x0007
+#define XLP_ALL_FLAGS				0x000F
 
 #define XLogPageHeaderSize(hdr)		\
 	(((hdr)->xlp_info & XLP_LONG_HEADER) ?
	 SizeOfXLogLongPHD : SizeOfXLogShortPHD)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..00a03a628c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -175,6 +175,8 @@ struct XLogReaderState
 
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	XLogRecPtr	ContRecAbortPtr;	/* end+1 of aborted partial contrecord if
+									 * any */
 
 	/* ----------------------------------------
 	 * Decoded representation of current record
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce..26fc123cdb 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -76,6 +76,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY		0x90
 #define XLOG_FPI_FOR_HINT			0xA0
 #define XLOG_FPI					0xB0
+#define XLOG_ABORT_CONTRECORD		x0C0
 
 /*

PWD=`pwd`
DATA=data
BKUP=bkup
ARCH=$PWD/arch

rm -rf arch
mkdir arch
rm -rf $DATA
initdb -D $DATA
echo "restart_after_crash = off" >> $DATA/postgresql.conf
echo "archive_mode=on" >> $DATA/postgresql.conf
echo "archive_command='cp %p ${ARCH}/%f'" >> $DATA/postgresql.conf
echo "restart_after_crash = off" >> $DATA/postgresql.conf
rm /tmp/hoge
pg_ctl -D $DATA start
rm -rf $BKUP
pg_basebackup -D $BKUP -h /tmp
echo "archive_mode=off" >> $BKUP/postgresql.conf
echo "restore_command='cp ${ARCH}/%f %p'" >> $BKUP/postgresql.conf
touch bkup/recovery.signal
psql -c 'create table t(a int); insert into t (select a from generate_series(0, 600000) a)'
touch /tmp/hoge
psql -c 'insert into t (select a from generate_series(0, 600000) a)'
rm /tmp/hoge
sleep 5
pg_ctl -D $DATA -w start
psql -c 'checkpoint'
pg_ctl -D $DATA -w stop
echo "#########################"
pg_ctl -D $BKUP -w start
sleep 10
pg_ctl -D $BKUP -w stop
At Thu, 2 Sep 2021 18:43:33 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> On 2021-Sep-02, Kyotaro Horiguchi wrote:
>
> > So, this is a crude PoC of that.
>
> I had ended up with something very similar, except I was trying to cram
> the flag via the checkpoint record instead of hacking
> AdvanceXLInsertBuffer().  I removed that stuff and merged both, here's
> the result.
>
> > 1. This patch is written on the current master, but it doesn't
> >    interfere with the seg-boundary-memorize patch since it removes the
> >    calls to RegisterSegmentBoundary.
>
> I rebased on top of the revert patch.

Thanks!

> > 2. Since xlogreader cannot emit a log message immediately, we don't
> >    have a means to leave a log message to inform that recovery met an
> >    aborted partial continuation record.  (In this PoC, it is done by
> >    fprintf :p)
>
> Shrug.  We can just use an #ifndef FRONTEND / elog(LOG).  (I didn't keep
> this part, sorry.)

No problem, it was merely a development-time message for behavior
observation.

> > 3. Maybe we need pg_waldump to show partial continuation records,
> >    but I'm not sure how to realize that.
>
> Ah yes, we'll need to fix that.

I just believe 0001 does the right thing.

0002:

> +	XLogRecPtr	abortedContrecordPtr; /* LSN of incomplete record at end of
> +										 * WAL */

The name sounds like the start LSN.  Doesn't contrecordAbort(ed)Ptr
work?

>  		if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
>  		{
>  			report_invalid_record(state,
>  								  "there is no contrecord flag at %X/%X",
>  								  LSN_FORMAT_ARGS(RecPtr));
> -			goto err;
> +			goto aborted_contrecord;

This loses the exclusion check between XLP_FIRST_IS_CONTRECORD and
_IS_ABORTED_PARTIAL.  Is that okay?  (I don't object to removing the
check.)  I hadn't thought of this as an aborted contrecord, but on
second thought, when we see a record broken in any style, we stop
recovery at that point.  I agree with the change and all the similar
changes.

> +			/* XXX should we goto aborted_contrecord here? */

I think it should be aborted_contrecord.  When that happens, the loaded
bytes actually looked like the first fragment of a continuation record
to xlogreader, even if the cause was a broken total_len.  So if we abort
the record there, the next time xlogreader will meet
XLP_FIRST_IS_ABORTED_PARTIAL at the same page and correctly find a new
record there.  On the other hand, if we just errored out there, we would
step back to the beginning of the broken record in the previous page or
segment and then start writing a new record there, but that is exactly
what we want to avoid now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote:
> Oh, but of course we can't modify XLogReaderState in backbranches to add
> the new struct member abortedContrecordPtr (or whatever we end up naming
> that.)

Why is that?  Afaict it's always allocated via XLogReaderAllocate(), so
adding a new field at the end should be fine?

That said, I'm worried that this stuff is too complicated to get right
in the backbranches.  I suspect letting it stew in master for a while
before backpatching would be a good move.

Greetings,

Andres Freund
On 2021-Sep-03, Andres Freund wrote: > Hi, > > On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote: > > Oh, but of course we can't modify XLogReaderState in backbranches to add > > the new struct member abortedContrecordPtr (or whatever we end up naming > > that.) > > Why is that? Afaict it's always allocated via XLogReaderAllocate(), so adding > a new field at the end should be fine? Hmm, true, that works. > That said, I'm worried that this stuff is too complicated to get right in the > backbranches. I suspect letting it stew in master for a while before > backpatching would be a good move. Sure, we can put it in master now and backpatch before the November minors if everything goes well. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
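To illustrate the point about back branches, here is a minimal stand-alone sketch (the type and function names are invented for this example, not PostgreSQL's) of why appending a member to a struct that only the library ever allocates is binary compatible: existing members keep their offsets, and callers never compute sizeof() themselves.

#include <stdint.h>
#include <stdlib.h>

/* Pre-existing members keep their offsets; the new one is appended. */
typedef struct DemoReaderState
{
    uint64_t ReadRecPtr;            /* existing field */
    uint64_t EndRecPtr;             /* existing field */
    uint64_t abortedContrecordPtr;  /* hypothetical new field, added last */
} DemoReaderState;

/* Callers only ever obtain the struct from the library's allocator... */
DemoReaderState *
DemoReaderAllocate(void)
{
    return (DemoReaderState *) calloc(1, sizeof(DemoReaderState));
}

/*
 * ...so code compiled against the shorter struct still works: it only
 * dereferences the original members, whose offsets are unchanged.
 */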
Hi, On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote: > From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001 > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Tue, 31 Aug 2021 20:55:10 -0400 > Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files > too early" > This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b. I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that conflict with the prior changes, and rebasing back and forth isn't that much fun... > From f767cdddb3120f1f6c079c8eb00eaff38ea98c79 Mon Sep 17 00:00:00 2001 > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Thu, 2 Sep 2021 17:21:46 -0400 > Subject: [PATCH v3 2/5] Implement FIRST_IS_ABORTED_CONTRECORD > > --- > src/backend/access/transam/xlog.c | 53 +++++++++++++++++++++++-- > src/backend/access/transam/xlogreader.c | 39 +++++++++++++++++- > src/include/access/xlog_internal.h | 14 ++++++- > src/include/access/xlogreader.h | 3 ++ > 4 files changed, 103 insertions(+), 6 deletions(-) > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index e51a7a749d..411f1618df 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -586,6 +586,8 @@ typedef struct XLogCtlData > XLogRecPtr replicationSlotMinLSN; /* oldest LSN needed by any slot */ > > XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */ > + XLogRecPtr abortedContrecordPtr; /* LSN of incomplete record at end of > + * WAL */ > > /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */ > XLogRecPtr unloggedLSN; > @@ -848,6 +850,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY; > /* State information for XLOG reading */ > static XLogRecPtr ReadRecPtr; /* start of last record read */ > static XLogRecPtr EndRecPtr; /* end+1 of last record read */ > +static XLogRecPtr abortedContrecordPtr; /* end+1 of incomplete record */ > > /* > * Local copies of equivalent fields in the control file. When running > @@ -2246,6 +2249,30 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic) > if (!Insert->forcePageWrites) > NewPage->xlp_info |= XLP_BKP_REMOVABLE; > > + /* > + * If the last page ended with an aborted partial continuation record, > + * mark the new page to indicate that the partial record can be > + * omitted. > + * > + * This happens only once at the end of recovery, so there's no race > + * condition here. > + */ > + if (XLogCtl->abortedContrecordPtr >= NewPageBeginPtr) > + { Can we move this case out of AdvanceXLInsertBuffer()? As the comment says, this only happens at the end of recovery, so putting it into AdvanceXLInsertBuffer() doesn't really seem necessary? > +#ifdef WAL_DEBUG > + if (XLogCtl->abortedContrecordPtr != NewPageBeginPtr) > + elog(PANIC, "inconsistent aborted contrecord location %X/%X, expected %X/%X", > + LSN_FORMAT_ARGS(XLogCtl->abortedContrecordPtr), > + LSN_FORMAT_ARGS(NewPageBeginPtr)); > + ereport(LOG, > + (errmsg_internal("setting XLP_FIRST_IS_ABORTED_PARTIAL flag at %X/%X", > + LSN_FORMAT_ARGS(NewPageBeginPtr)))); > +#endif > + NewPage->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL; > + > + XLogCtl->abortedContrecordPtr = InvalidXLogRecPtr; > + } > /* > * If first page of an XLOG segment file, make it a long header. 
> */ > @@ -4392,6 +4419,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, > record = XLogReadRecord(xlogreader, &errormsg); > ReadRecPtr = xlogreader->ReadRecPtr; > EndRecPtr = xlogreader->EndRecPtr; > + abortedContrecordPtr = xlogreader->abortedContrecordPtr; > if (record == NULL) > { > if (readFile >= 0) > @@ -7691,10 +7719,29 @@ StartupXLOG(void) > /* > * Re-fetch the last valid or last applied record, so we can identify the > * exact endpoint of what we consider the valid portion of WAL. > + * > + * When recovery ended in an incomplete record, continue writing from the > + * point where it went missing. This leaves behind an initial part of > + * broken record, which rescues downstream which have already received > + * that first part. > */ > - XLogBeginRead(xlogreader, LastRec); > - record = ReadRecord(xlogreader, PANIC, false); > - EndOfLog = EndRecPtr; > + if (XLogRecPtrIsInvalid(abortedContrecordPtr)) > + { > + XLogBeginRead(xlogreader, LastRec); > + record = ReadRecord(xlogreader, PANIC, false); > + EndOfLog = EndRecPtr; > + } > + else > + { > +#ifdef WAL_DEBUG > + ereport(LOG, > + (errmsg_internal("recovery overwriting broken contrecord at %X/%X (EndRecPtr: %X/%X)", > + LSN_FORMAT_ARGS(abortedContrecordPtr), > + LSN_FORMAT_ARGS(EndRecPtr)))); > +#endif "broken" sounds a bit off. But then, it's just WAL_DEBUG. Which made me realize, isn't this missing a if (XLOG_DEBUG)? > @@ -442,14 +448,28 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) > readOff = ReadPageInternal(state, targetPagePtr, > Min(total_len - gotlen + SizeOfXLogShortPHD, > XLOG_BLCKSZ)); > - > if (readOff < 0) > goto err; > > Assert(SizeOfXLogShortPHD <= readOff); > > - /* Check that the continuation on next page looks valid */ > pageHeader = (XLogPageHeader) state->readBuf; > + > + /* > + * If we were expecting a continuation record and got an "aborted > + * partial" flag, that means the continuation record was lost. > + * Ignore the record we were reading, since we now know it's broken > + * and lost forever, and restart the read by assuming the address > + * to read is the location where we found this flag. > + */ > + if (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) > + { > + ResetDecoder(state); > + RecPtr = targetPagePtr; > + goto restart; > + } I think we need to add more validation to this path. What I was proposing earlier is that we add a new special type of record at the start of an XLP_FIRST_IS_ABORTED_PARTIAL page, which contains a) lsn of the record we're aborting, b) checksum of the data up to this point. Greetings, Andres Freund
On 2021-Sep-03, Andres Freund wrote: > Hi, > > On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote: > > From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001 > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > > Date: Tue, 31 Aug 2021 20:55:10 -0400 > > Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files > > too early" > > > This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b. > > I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that > conflict with the prior changes, and rebasing back and forth isn't that much > fun... Done. > > + /* > > + * If we were expecting a continuation record and got an "aborted > > + * partial" flag, that means the continuation record was lost. > > + * Ignore the record we were reading, since we now know it's broken > > + * and lost forever, and restart the read by assuming the address > > + * to read is the location where we found this flag. > > + */ > > + if (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) > > + { > > + ResetDecoder(state); > > + RecPtr = targetPagePtr; > > + goto restart; > > + } > > I think we need to add more validation to this path. What I was proposing > earlier is that we add a new special type of record at the start of an > XLP_FIRST_IS_ABORTED_PARTIAL page, which contains a) lsn of the record we're > aborting, b) checksum of the data up to this point. Hmm, a new record type? Yeah, we can do that, and sounds like it would make things simpler too -- I wasn't too happy about adding clutter to AdvanceXLInsertBuffer either, but the alternative I had in mind was that we'd pass the flag to the checkpointer, which seemed quite annoying API wise. But if we add a new record, seems we can write it directly in StartupXLOG and avoid most nastiness. I'm not too sure about the checksum up to this point, though, but I'll spend some time with the idea to see how bad it is. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
On 2021-Sep-03, Andres Freund wrote: > > +#ifdef WAL_DEBUG > > + ereport(LOG, > > + (errmsg_internal("recovery overwriting broken contrecord at %X/%X (EndRecPtr: %X/%X)", > > + LSN_FORMAT_ARGS(abortedContrecordPtr), > > + LSN_FORMAT_ARGS(EndRecPtr)))); > > +#endif > > "broken" sounds a bit off. But then, it's just WAL_DEBUG. Which made me > realize, isn't this missing a > if (XLOG_DEBUG)? Attached are the same patches as last night, except I added a test for XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on that, since it's a cross-check rather than informative.) Also fixed the silly pg_rewind mistake I made. I'll work on the new xlog record early next week. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "I can see support will not be a problem. 10 out of 10." (Simon Wittber) (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
Attachment
On 9/4/21, 10:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > Attached are the same patches as last night, except I added a test for > XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on > that, since it's a cross-check rather than informative.) Also fixed the > silly pg_rewind mistake I made. > > I'll work on the new xlog record early next week. Are these patches in a good state for some preliminary testing? I'd like to try them out, but I'll hold off if they're not quite ready yet. Nathan
At Tue, 7 Sep 2021 18:41:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 9/4/21, 10:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > Attached are the same patches as last night, except I added a test for
> > XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on
> > that, since it's a cross-check rather than informative.) Also fixed the
> > silly pg_rewind mistake I made.
> >
> > I'll work on the new xlog record early next week.
>
> Are these patches in a good state for some preliminary testing? I'd
> like to try them out, but I'll hold off if they're not quite ready
> yet.

Thanks! As I understand it, the new record adds the ability to cross-check between a torn-off contrecord and the new record inserted after the torn-off record. I didn't test this version myself, but the previous version implemented the essential machinery and that won't change fundamentally with the new record.

So I think the current patch deserves testing to see whether the algorithm actually works against the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 14 Sep 2021 22:32:04 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> On 2021-Sep-14, Alvaro Herrera wrote:
>
> > On 2021-Sep-08, Kyotaro Horiguchi wrote:
> >
> > > Thanks! As I understand it, the new record adds the ability to
> > > cross-check between a torn-off contrecord and the new record inserted
> > > after the torn-off record. I didn't test this version myself, but
> > > the previous version implemented the essential machinery and that
> > > won't change fundamentally with the new record.
> > >
> > > So I think the current patch deserves testing to see whether the
> > > algorithm actually works against the problem.
> >
> > Here's a version with the new record type. It passes check-world, and
> > it seems to work correctly to prevent overwrite of the tail end of a
> > segment containing a broken record. This is very much WIP still;
> > comments are missing and I haven't tried to implement any sort of
> > verification that the record being aborted is the right one.
>
> Here's the attachment I forgot earlier.

(I missed the chance to complain about that :p) Thanks for the patch!

- CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-                     StartPos, EndPos);
+ CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+                     flags & XLOG_SET_ABORTED_PARTIAL,
+                     rdata, StartPos, EndPos);

The new xlog flag XLOG_SET_ABORTED_PARTIAL is used only by RM_XLOG_ID/XLOG_OVERWRITE_CONTRECORD records, so the flag value is the equivalent of the record type. We might instead want a new flag XLOG_SPECIAL_TREATED_RECORD or something to quickly distinguish records that need special treatment, like XLOG_SWITCH:

if (flags & XLOG_SPECIAL_TREATED_RECORD)
{
    if (rechdr->xl_rmid == RM_XLOG_ID)
    {
        if (info == XLOG_SWITCH)
            isLogSwitch = true;
        if (info == XLOG_OVERWRITE_CONTRECORD)
            isOverwrite = true;
    }
}
..
CopyXLogRecordToWAL(.., isLogSwitch, isOverwrite, rdata, StartPos, EndPos);

+ /* XXX can only happen once in the loop. Verify? */
+ if (set_aborted_partial)
+     pagehdr->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL;
+

I'm not sure about the reason for the change from the previous patch (I might be missing something): this sets the flag on the page *after* the one where the record starts, and in the first place we shouldn't set the flag there. The page header flags of the first page are set by AdvanceXLInsertBuffer. If we want to set the flag in this function, we need to find the page header for the beginning of the record and make sure that the record is placed at the beginning of the page. (That is the reason I did that in AdvanceXLInsertBuffer.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > OK, this version is much more palatable, because here we verify that the > OVERWRITE_CONTRECORD we replay matches the record that was lost. Also, > I wrote a test script that creates such a broken record (by the simple > expedient of deleting the WAL file containing the second half while the > server is down); we then create a standby and we can observe that it > replays the sequence correctly. > > If you have some time to try your reproducers with this new proposed > fix, I would appreciate it. I haven't had a chance to look at the patch yet, but it appears to fix things with my original reproduction steps for the archive_status stuff [0]. Nathan [0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
On 2021-Sep-17, Bossart, Nathan wrote: > On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > > If you have some time to try your reproducers with this new proposed > > fix, I would appreciate it. > > I haven't had a chance to look at the patch yet, but it appears to fix > things with my original reproduction steps for the archive_status > stuff [0]. Thank you, this is good to hear. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Ed is the standard text editor." http://groups.google.com/group/alt.religion.emacs/msg/8d94ddab6a9b0ad3
On 9/17/21, 10:35 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > >> On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > >> > If you have some time to try your reproducers with this new proposed >> > fix, I would appreciate it. >> >> I haven't had a chance to look at the patch yet, but it appears to fix >> things with my original reproduction steps for the archive_status >> stuff [0]. > > Thank you, this is good to hear. I gave the patch a read-through. I'm wondering if the XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and xlp_pageaddr on that page will already be validated in XLogReaderValidatePageHeader(). Does adding this new record also help ensure the page header is filled in and flushed to disk? Nathan
On 2021-Sep-17, Bossart, Nathan wrote: > I gave the patch a read-through. I'm wondering if the > XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we > will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and > xlp_pageaddr on that page will already be validated in > XLogReaderValidatePageHeader(). Does adding this new record also help > ensure the page header is filled in and flushed to disk? That was the first implementation, a few versions of the patch ago. An added benefit of a separate WAL record is that you can carry additional data for validation, such as -- as suggested by Andres -- the CRC of the partial data contained in the message that we're skipping. I didn't implement that, but it should be trivial to add it. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
On 9/17/21, 1:32 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > >> I gave the patch a read-through. I'm wondering if the >> XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we >> will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and >> xlp_pageaddr on that page will already be validated in >> XLogReaderValidatePageHeader(). Does adding this new record also help >> ensure the page header is filled in and flushed to disk? > > That was the first implementation, a few versions of the patch ago. An > added benefit of a separate WAL record is that you can carry additional > data for validation, such as -- as suggested by Andres -- the CRC of the > partial data contained in the message that we're skipping. I didn't > implement that, but it should be trivial to add it. I see. IMO feels a bit counterintuitive to validate a partial record that you are ignoring anyway, but I suppose it's still valuable to know when the WAL is badly broken. It's not expensive, and it doesn't add a ton of complexity, either. Nathan
On 2021-Sep-17, Bossart, Nathan wrote:

> > That was the first implementation, a few versions of the patch ago. An
> > added benefit of a separate WAL record is that you can carry additional
> > data for validation, such as -- as suggested by Andres -- the CRC of the
> > partial data contained in the message that we're skipping. I didn't
> > implement that, but it should be trivial to add it.
>
> I see. IMO feels a bit counterintuitive to validate a partial record
> that you are ignoring anyway, but I suppose it's still valuable to
> know when the WAL is badly broken. It's not expensive, and it doesn't
> add a ton of complexity, either.

Yeah, we don't have any WAL record history validation other than verifying the LSN of the previous record; I suppose in this particular case you could argue that we shouldn't bother with any validation either. But it seems safer to do it. It doesn't hurt anything anyway.

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
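If the CRC idea is pursued, the accumulation itself would be straightforward with the existing CRC32C macros from src/include/port/pg_crc32c.h. The sketch below assumes the already-written fragments of the aborted record are available as an array, which is an assumption of the example rather than something xlogreader exposes in this form.

#include "postgres.h"
#include "port/pg_crc32c.h"

/*
 * Accumulate a CRC over the pieces of the partial record that were written
 * before the continuation went missing.
 */
static pg_crc32c
crc_of_partial_record(const char *fragments[], const Size lens[], int nfrags)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    for (int i = 0; i < nfrags; i++)
        COMP_CRC32C(crc, fragments[i], lens[i]);
    FIN_CRC32C(crc);

    return crc;
}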
I made one final pass over the whole thing to be sure it's all commented as thoroughly as possible, and changed the names of things to make them all consistent. So here's the last version which I intend to push to all branches soon. (The only difference in back-branches is that the xlogreader struct needs to be adjusted to have the new fields at the bottom.) One thing to note is that this is an un-downgradeable minor; and of course people should upgrade standbys before primaries. On 2021-Sep-17, Alvaro Herrera wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > > I see. IMO feels a bit counterintuitive to validate a partial record > > that you are ignoring anyway, but I suppose it's still valuable to > > know when the WAL is badly broken. It's not expensive, and it doesn't > > add a ton of complexity, either. > > Yeah, we don't have any WAL record history validation other than the > verifying the LSN of the previous record; I suppose in this particular > case you could argue that we shouldn't bother with any validation > either. But it seems safer to do it. It doesn't hurt anything anyway. BTW we do validate the CRC for all records, but we don't have any means to validate the CRC of a partial record; so in theory if we don't add CRC validation of the ignored contents, there would be no validation at all for that record. I don't think we care, but it's true that there would be a big blob in WAL that we don't really know anything about. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Thou shalt study thy libraries and strive not to reinvent them without cause, that thy code may be short and readable and thy days pleasant and productive. (7th Commandment for C Programmers)
Attachment
This took some time because backpatching the tests was more work than I anticipated -- function name changes, operators that don't exist, definition of the WAL segment size in pg_settings. I had to remove the second test in branches 13 and earlier due to lack of LSN+bytes operator. Fortunately, the first test (which is not as clean) still works all the way back. However, I notice now that the pg_rewind tests reproducibly fail in branch 14 for reasons I haven't yet understood. It's strange that no other branch fails, even when run quite a few times. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
On 2021-Sep-23, Alvaro Herrera wrote: > However, I notice now that the pg_rewind tests reproducibly fail in > branch 14 for reasons I haven't yet understood. It's strange that no > other branch fails, even when run quite a few times. Turns out that this is a real bug (setting EndOfLog seems insufficient). I'm looking into it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Hi Alvaro, I just started reading this thread, but maybe you can confirm or refute my understanding of what was done. In the first email you write > As mentioned in the course of thread [1], we're missing a fix for streaming replication to avoid sending records that the primary hasn't fully flushed yet. This patch is a first attempt at fixing that problem by retreating the LSN reported as FlushPtr whenever a segment is registered, based on the understanding that if no registration exists then the LogwrtResult.Flush pointer can be taken at face value; but if a registration exists, then we have to stream only till the start LSN of that registered entry. So did we end up holding back the wal_sender to not send anything that is not confirmed as flushed on master Are there measurements on how much this slows down replication compared to allowing sending the moment it is written in buffers but not necessarily flushed locally ? Did we investigate possibility of sending as fast as possible and controlling the flush synchronisation by sending separate flush pointers *both* ways ? And maybe there was even an alternative considered where we are looking at a more general Durability, for example 2-out-of-3 where primary is one of the 3 and not necessarily the most durable one? ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Fri, Sep 24, 2021 at 4:33 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. > > -- > Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ > "No necesitamos banderas > No reconocemos fronteras" (Jorge González) > >
On 2021-Sep-23, Alvaro Herrera wrote: > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. I had misdiagnosed it; the real problem is that this was taking action in standby mode and breaking things after promotion. I took quite a detour ... Here's the set for all branches, which I think are really final, in case somebody wants to play and reproduce their respective problem scenarios. Nathan already confirmed that his reproducer no longer shows a problem, and this version shouldn't affect that. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Attachment
- 0001-Fix-WAL-replay-in-presence-of-an-incomplete-record.patch
- v10-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v11-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v12-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v13-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v14-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v96-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
On 2021-Sep-24, Alvaro Herrera wrote: > Here's the set for all branches, which I think are really final, in case > somebody wants to play and reproduce their respective problem scenarios. I forgot to mention that I'll wait until 14.0 is tagged before getting anything pushed. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest "foo" someone someday shall type "supercalifragilisticexpialidocious" (5th Commandment for C programmers)
On 2021-Sep-24, Alvaro Herrera wrote: > Here's the set for all branches, which I think are really final, in case > somebody wants to play and reproduce their respective problem scenarios. > Nathan already confirmed that his reproducer no longer shows a problem, > and this version shouldn't affect that. Pushed. Watching for buildfarm fireworks now. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On 2021-Sep-24, Hannu Krosing wrote: Hi Hannu > In the first email you write > > > As mentioned in the course of thread [1], we're missing a fix for > streaming replication to avoid sending records that the primary hasn't > fully flushed yet. This patch is a first attempt at fixing that problem > by retreating the LSN reported as FlushPtr whenever a segment is > registered, based on the understanding that if no registration exists > then the LogwrtResult.Flush pointer can be taken at face value; but if a > registration exists, then we have to stream only till the start LSN of > that registered entry. > > So did we end up holding back the wal_sender to not send anything that > is not confirmed as flushed on master No. We eventually realized that that approach was a dead end, so I abandoned the whole thing and attacked the problem differently. So your other questions don't apply. I tried to make the commit message explain both the problem and the solution in as much detail as possible; please have a look at that and let me know if something is unclear. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El miedo atento y previsor es la madre de la seguridad" (E. Burke)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Pushed. Watching for buildfarm fireworks now.

jacana didn't like it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53

The relevant info seems to be

# Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
Bail out! system pg_basebackup failed

which looks like a pretty straightforward bogus-connection-configuration problem, except why wouldn't other BF members show it?

			regards, tom lane
On 9/29/21 4:33 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: >> Pushed. Watching for buildfarm fireworks now. > jacana didn't like it: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53 > > The relevant info seems to be > > # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replicationconnection from host "127.0.0.1", user "pgrunner", no encryption > Bail out! system pg_basebackup failed > > which looks like a pretty straightforward bogus-connection-configuration > problem, except why wouldn't other BF members show it? > > This: # Second test: a standby that receives WAL via archive/restore commands. $node = PostgresNode->new('primary2'); $node->init( has_archiving => 1, extra => ['--wal-segsize=1']); doesn't have "allows_streaming => 1". cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2021-Sep-29, Andrew Dunstan wrote:

> > The relevant info seems to be
> >
> > # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
> > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
> > Bail out! system pg_basebackup failed
> >
> > which looks like a pretty straightforward bogus-connection-configuration
> > problem, except why wouldn't other BF members show it?
>
> This:
>
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
>
> doesn't have "allows_streaming => 1".

Hmm, but I omitted allows_streaming on purpose -- I only wanted archiving, not streaming. I understand that your point is that set_replication_conf is not called unless allows_streaming is set.

So, do we take the stance that we have no right to expect pg_basebackup to work if we didn't pass allow_streaming => 1? If so, the fix is to add it. But my preferred fix would be to call set_replication_conf if either allows_streaming or has_archiving are given.

Another easy fix would be to call $primary2->set_replication_conf in the test file, but then you'd complain that that's supposed to be an internal method :-)

(This reminds me that I had to add something that seemed like it should have been unnecessary: wal_level=replica should become set if I request archiving, right? Otherwise the WAL archive is useless. I also had to add max_wal_senders=2 so that pg_basebackup would work, but I'm on the fence about setting that automatically if has_archiving is given.)

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
Andrew Dunstan <andrew@dunslane.net> writes: > On 9/29/21 4:33 PM, Tom Lane wrote: >> which looks like a pretty straightforward bogus-connection-configuration >> problem, except why wouldn't other BF members show it? > This: > ... > doesn't have "allows_streaming => 1". Oh, and that only breaks things on Windows, cf set_replication_conf. ... although I wonder why the fact that sub init otherwise sets wal_level = minimal doesn't cause a problem for this test. Maybe the test script is cowboy-ishly overriding that? regards, tom lane
On 2021-Sep-29, Tom Lane wrote: > ... although I wonder why the fact that sub init otherwise sets > wal_level = minimal doesn't cause a problem for this test. > Maybe the test script is cowboy-ishly overriding that? It is. (I claim that it should be set otherwise when has_archiving=>1). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
On 9/29/21 5:27 PM, Alvaro Herrera wrote:
> On 2021-Sep-29, Andrew Dunstan wrote:
>
>>> The relevant info seems to be
>>>
>>> # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
>>> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
>>> Bail out! system pg_basebackup failed
>>>
>>> which looks like a pretty straightforward bogus-connection-configuration
>>> problem, except why wouldn't other BF members show it?
>> This:
>>
>> # Second test: a standby that receives WAL via archive/restore commands.
>> $node = PostgresNode->new('primary2');
>> $node->init(
>>     has_archiving => 1,
>>     extra => ['--wal-segsize=1']);
>>
>> doesn't have "allows_streaming => 1".
> Hmm, but I omitted allows_streaming on purpose -- I only wanted
> archiving, not streaming. I understand that your point is that
> set_replication_conf is not called unless allows_streaming is set.
>
> So, do we take the stance that we have no right to expect pg_basebackup
> to work if we didn't pass allow_streaming => 1? If so, the fix is to
> add it. But my preferred fix would be to call set_replication_conf if
> either allows_streaming or has_archiving are given.
>
> Another easy fix would be to call $primary2->set_replication_conf in the
> test file, but then you'd complain that that's supposed to be an
> internal method :-)

It claims that but it's also used here: src/bin/pg_basebackup/t/010_pg_basebackup.pl

(Also, good perl style would start purely internal method names with an underscore.)

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 9/29/21 5:27 PM, Alvaro Herrera wrote: >> So, do we take the stance that we have no right to expect pg_basebackup >> to work if we didn't pass allow_streaming => 1? If so, the fix is to >> add it. But my preferred fix would be to call set_replication_conf if >> either allows_streaming or has_archiving are given. >> >> Another easy fix would be to call $primary2->set_replication_conf in the >> test file, but then you'd complain that that's supposed to be an >> internal method :-) > It claims that but it's also used here: > src/bin/pg_basebackup/t/010_pg_basebackup.pl Given that precedent, it seems like calling set_replication_conf is a good quick-fix for getting the buildfarm green again. But +1 for then refactoring this to get rid of these hacks (both with respect to pg_hba.conf and the postgresql.conf parameters) in both of these tests. regards, tom lane
Hi,

On 2021-09-29 17:04:48 -0400, Andrew Dunstan wrote:
> On 9/29/21 4:33 PM, Tom Lane wrote:
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
>
> doesn't have "allows_streaming => 1".

FWIW, with that fixed I see the test hanging (e.g. [1]):

can't unlink c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/000000010000000000000008.fail: Permission denied at t/026_overwrite_contrecord.pl line 189.
### Stopping node "primary2" using mode immediate
# Running: pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediate stop
waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
Bail out! command "pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediate stop" exited with value 1
Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction.

The hang seems to be fixed by uncommenting the $h->finish(). Was there a reason you commented that out?

The test still fails, but at least it doesn't hang anymore.

Greetings,

Andres Freund

[1] https://api.cirrus-ci.com/v1/artifact/task/6204050896060416/tap/src/test/recovery/tmp_check/log/regress_log_026_overwrite_contrecord
Hello On 2021-Sep-30, Andres Freund wrote: > FWIW, with that fixed I see the test hanging (e.g. [1]): > > can't unlink c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/000000010000000000000008.fail: Permissiondenied at t/026_overwrite_contrecord.pl line 189. > ### Stopping node "primary2" using mode immediate > # Running: pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediatestop > waiting for server to shut down........................................................................................................................... failed > pg_ctl: server does not shut down > Bail out! command "pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -mimmediate stop" exited with value 1 > Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction. > > The hang seems to be fixed by uncommenting the $h->finish(). Was there a > reason you commented that out? Hmm, no -- I was experimenting to see what effect it had, and because I saw none, I left it like that. Let me fix both things and see what happens next. > THe test still fails, but at least it doesn't hang anymore. I'll try and get it fixed, but ultimately we can always "fix" this test by removing it. It is only in 14 and master; the earlier branches only have the first part of this test file. (That's because I ran out of patience trying to port it to versions lacking LSN operators and such.) I think the second half doesn't provide all that much additional coverage anyway. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Ellos andaban todos desnudos como su madre los parió, y también las mujeres, aunque no vi más que una, harto moza, y todos los que yo vi eran todos mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)
On 9/29/21 5:29 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 9/29/21 4:33 PM, Tom Lane wrote: >>> which looks like a pretty straightforward bogus-connection-configuration >>> problem, except why wouldn't other BF members show it? >> This: >> ... >> doesn't have "allows_streaming => 1". > Oh, and that only breaks things on Windows, cf set_replication_conf. > > ... although I wonder why the fact that sub init otherwise sets > wal_level = minimal doesn't cause a problem for this test. > Maybe the test script is cowboy-ishly overriding that? > > Regardless of this problem, I think we should simply call set_replication_conf unconditionally in init(). Replication connections are now allowed by default on Unix, this would just bring Windows nodes into line with that. The function does have this: $self->host eq $test_pghost or die "set_replication_conf only works with the default host"; I'm not sure when that wouldn't be true. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > Regardless of this problem, I think we should simply call > set_replication_conf unconditionally in init(). Replication connections > are now allowed by default on Unix, this would just bring Windows nodes > into line with that. Yeah, I was thinking along the same lines yesterday. The fact that pre-commit testing failed to note the problem is exactly because we have this random difference between what works by default on Unix and what works by default on Windows. Let's close that gap before it bites us again. There's still the issue of these tests overriding postgresql.conf entries made by init(), but maybe we can live with that? regards, tom lane
On 2021-Sep-30, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Regardless of this problem, I think we should simply call > > set_replication_conf unconditionally in init(). Replication connections > > are now allowed by default on Unix, this would just bring Windows nodes > > into line with that. > > Yeah, I was thinking along the same lines yesterday. The fact that > pre-commit testing failed to note the problem is exactly because we > have this random difference between what works by default on Unix > and what works by default on Windows. Let's close that gap before > it bites us again. +1 > There's still the issue of these tests overriding postgresql.conf > entries made by init(), but maybe we can live with that? I vote to at least have has_archiving=>1 set wal_level=replica, and potentially max_wal_senders=2 too (allows pg_basebackup). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2021-Sep-30, Tom Lane wrote: >> There's still the issue of these tests overriding postgresql.conf >> entries made by init(), but maybe we can live with that? > I vote to at least have has_archiving=>1 set wal_level=replica, and > potentially max_wal_senders=2 too (allows pg_basebackup). I think this requires a bit more investigation. I looked quickly through the pre-existing tests that set has_archiving=>1, and every single one of them also sets allows_streaming=>1, which no doubt explains why this issue hasn't come up before. So now I'm wondering if any of those other tests is setting allows_streaming only because of this configuration issue. More to the point, since we've not previously used the combination of has_archiving without allows_streaming, I wonder exactly how we want to define it to work. I'm not really convinced that it should be defined as "allows basebackup even though replication is supposed to be off". Perhaps a compromise could be to invent a third option "allows_basebackup", so that init() actually knows what's going on? regards, tom lane
Just when you thought it was safe to go back in the water: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-09-29%2022%3A05%3A44 which is complaining that the (misspelled, BTW) log message 'sucessfully skipped missing contrecord at' doesn't show up. This machine is old, slow, and 32-bit bigendian. I first thought the problem might be "didn't wait long enough", but it seems like waiting for replay ought to be sufficient. What I'm now guessing is that the test case is making unwarranted assumptions about how much WAL will be generated, such that no page-crossing contrecord actually appears. Also, digging around, I see hornet showed the same problem: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-09-29%2018%3A19%3A55 hornet is 64-bit bigendian ... so maybe this actually reduces to an endianness question? regards, tom lane
On 2021-Sep-30, Tom Lane wrote:

> Just when you thought it was safe to go back in the water:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-09-29%2022%3A05%3A44
>
> which is complaining that the (misspelled, BTW)

Ah, the case of the missing juxtaposed consonants. Les Luthiers have something to say on that matter. https://www.youtube.com/watch?v=ptorPqV7D5s

> log message 'sucessfully skipped missing contrecord at' doesn't show
> up.

Hmm. Well, as I said, maybe this part of the test isn't worth much anyway. Rather than spending time trying to figure out why this isn't triggering the WAL overwriting, I compared the coverage report for running only the first test to the coverage report of running only the second test. It turns out that there's no relevant coverage increase in the second test. So I propose just removing that part.

(The reason I added that test in the first place was to try to reproduce the problem without having to physically unlink a WAL file from the primary's pg_wal subdir. But maybe it's just make-work.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Hmm. Well, as I said, maybe this part of the test isn't worth much > anyway. Rather than spending time trying to figure out why isn't this > triggering the WAL overwriting, I compared the coverage report for > running only the first test to the coverage report of running only the > second test. It turns out that there's no relevant coverage increase in > the second test. So I propose just removing that part. Seems reasonable. We don't need to spend buildfarm cycles forevermore on a test that's not adding useful coverage. regards, tom lane
Hi,

On Wed, Sep 29, 2021 at 8:14 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Sep-24, Alvaro Herrera wrote:
>
> > Here's the set for all branches, which I think are really final, in case
> > somebody wants to play and reproduce their respective problem scenarios.
> > Nathan already confirmed that his reproducer no longer shows a problem,
> > and this version shouldn't affect that.
>
> Pushed. Watching for buildfarm fireworks now.
>

While reading this commit (ff9f111bce24), I wondered: can't we skip the missingContrecPtr global variable declaration and calculate that from the abortedRecPtr value whenever it is needed? IIUC, missingContrecPtr is the page following the one that contains abortedRecPtr, and that can be calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr % XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something, thanks.

Regards,
Amul
On 2021-Oct-07, Amul Sul wrote: > While reading this commit (ff9f111bce24), wondered can't we skip > missingContrecPtr global variable declaration and calculate that from > abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the > next page to the page that abortedRecPtr contain and that can be > calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr % > XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something, > thanks. I don't think that works -- what if the missing record is not on the next page but on some future one? Imagine an enormously large record that starts on segment 1, covers all of segment 2 and ends in segment 3. We could have flushed segment 2 already, so with your idea we would skip ahead only to that position, but really we need to skip all the way to the first page of segment 3. This is easier to imagine if you set wal segment size to 1 MB, but it is possible with the default segment size too, since commit records can be arbitrarily large, and "logical message" records as well. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
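As a stand-alone illustration, the calculation Amul proposed (written here as a hypothetical helper, not code from the patch) simply rounds the aborted record's start LSN up to the next page boundary; per Alvaro's explanation it cannot replace missingContrecPtr, because the page where the continuation actually went missing may be many pages, or whole segments, later.

#include <stdint.h>

#define DEMO_XLOG_BLCKSZ 8192   /* stand-in for XLOG_BLCKSZ */

/* Round an LSN up to the start of the following WAL page. */
static uint64_t
next_page_boundary(uint64_t abortedRecPtr)
{
    return abortedRecPtr + (DEMO_XLOG_BLCKSZ - (abortedRecPtr % DEMO_XLOG_BLCKSZ));
}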
On Thu, 7 Oct 2021 at 6:41 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Oct-07, Amul Sul wrote:
>
> > While reading this commit (ff9f111bce24), wondered can't we skip
> > missingContrecPtr global variable declaration and calculate that from
> > abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the
> > next page to the page that abortedRecPtr contain and that can be
> > calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr %
> > XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something,
> > thanks.
>
> I don't think that works -- what if the missing record is not on the
> next page but on some future one? Imagine an enormously large record
> that starts on segment 1, covers all of segment 2 and ends in segment 3.
> We could have flushed segment 2 already, so with your idea we would skip
> ahead only to that position, but really we need to skip all the way to
> the first page of segment 3.
>
> This is easier to imagine if you set wal segment size to 1 MB, but it is
> possible with the default segment size too, since commit records can be
> arbitrarily large, and "logical message" records as well.
Makes sense, thanks for the explanation.
Regards,
Amul
On 2021-Oct-07, Amul Sul wrote:

> Makes sense, thanks for the explanation.

You're welcome. Also, I forgot: thank you for taking the time to review the code. Much appreciated.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Hi,

On Thu, Oct 7, 2021 at 6:57 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-07, Amul Sul wrote:
>
> > Makes sense, thanks for the explanation.
>
> You're welcome. Also, I forgot: thank you for taking the time to review
> the code. Much appreciated.

:)

I have one more question, regarding the need for the other global variable, i.e. abortedRecPtr. (Sorry for coming back after so long.)

Instead of the abortedRecPtr point, isn't it enough to write the overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both are pointing to the same location, so can't we use lastReplayedEndRecPtr instead of abortedRecPtr to write the overwrite-contrecord and remove the need for the extra global variable, like attached?

You might wonder why I am so concerned about the global variable. The reason is that I am working on another thread[1] where we are trying to club all the WAL write operations that happen at the end of StartupXLOG into a separate function. In the future, we might want to allow executing this function from other processes (e.g. Checkpointer). For that, we need to remove the dependency of those WAL write operations on global variables that are mostly valid only in the startup process. The easiest way to do that is to simply copy all the global variables into shared memory, but that would not be an optimal solution; the goal is to see if we could leverage the existing information already available in shared memory.

I would be grateful if you could share your thoughts on the same, thanks.

Regards,
Amul

[1] https://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
Attachment
On 2021-Sep-25, Alvaro Herrera wrote:

>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting anything
>> pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that the committed v2 fix really solves the initial production problem, I've done a limited test on it (just like with the v1-patch idea earlier, using wal_keep_segments, wal_init_zero=on, archive_mode=on and archive_command='/bin/true'):

- On 12.8, I was able, like last time, to manually reproduce it on 3 out of 3 tries, and I've got: 2x "invalid contrecord length", 1x "there is no contrecord flag" on the standby.

- On soon-to-become-12.9 REL_12_STABLE (with commit 1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), on 4 out of 4 tries I've got beautiful insight into what happened:

LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00

...and the slave was able to carry on automatically. In the 4th test, the cascade was tested too (m -> s1 -> s11) and both {s1, s11} did behave properly and log the above message. Also an additional check proved that after simulating an ENOSPC crash on the master the data contents were identical everywhere (m1=s1=s11).

Thank you Alvaro and also to everybody else who participated in solving this challenging and really edge-case nasty bug.

-J.
On Wed, Oct 13, 2021 at 2:01 AM Amul Sul <sulamul@gmail.com> wrote: > Instead of abortedRecPtr point, isn't enough to write > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > are pointing to the same location then can't we use > lastReplayedEndRecPtr instead of abortedRecPtr to write > overwrite-contrecord and remove need of extra global variable, like > attached? I think you mean missingContrecPtr, not abortedRecPtr. If I understand correctly, abortedRecPtr is going to be the location in some WAL segment which we replayed where a long record began, but missingContrecPtr seems like it would have to point to the beginning of the first segment we were unable to find to continue replay; and thus it ought to be the same as lastReplayedEndRecPtr. But the committed code doesn't seem to check that these are the same or verify the relationship between them in any way, so I'm worried there is some other case here. The comments in XLogReadRecord also suggest this: * We get here when a record that spans multiple pages needs to be * assembled, but something went wrong -- perhaps a contrecord piece * was lost. If caller is WAL replay, it will know where the aborted Saying that "perhaps" a contrecord piece was lost seems to imply that other explanations are possible as well, but I'm not sure what. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-Oct-13, Amul Sul wrote:

> I have one more question, regarding the need for the other global
> variable, i.e. abortedRecPtr. (Sorry for coming back after so long.)
>
> Instead of the abortedRecPtr point, isn't it enough to write the
> overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both
> are pointing to the same location, so can't we use
> lastReplayedEndRecPtr instead of abortedRecPtr to write the
> overwrite-contrecord and remove the need for the extra global variable,
> like attached?

I'm a bit fuzzy on the difference between "the end+1" and "the start of the next record". Are they always the same? We do have XLogRecPtrToBytePos() and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to "usable byte positions", which suggests to me that the proposed patch may fail if end+1 is a page or segment boundary.

The other difference is that abortedRecPtr is set if we fail to read a record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the record successfully. So you'd need a bool flag indicating that the overwrite contrecord record needs to be written. Your patch is using the fact that missingContrecPtr is non-invalid as such a flag ... I can't see anything wrong with that. So maybe your patch is okay in this aspect.

> You might wonder why I am so concerned about the global variable. The
> reason is that I am working on another thread[1] where we are trying
> to club all the WAL write operations that happen at the end of
> StartupXLOG into a separate function. In the future, we might want to
> allow executing this function from other processes (e.g.
> Checkpointer). For that, we need to remove the dependency of those WAL
> write operations on global variables that are mostly valid only in
> the startup process.

Seems a fine goal.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
On 2021-Oct-13, Robert Haas wrote: > On Wed, Oct 13, 2021 at 2:01 AM Amul Sul <sulamul@gmail.com> wrote: > > Instead of abortedRecPtr point, isn't enough to write > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > are pointing to the same location then can't we use > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > overwrite-contrecord and remove need of extra global variable, like > > attached? > > I think you mean missingContrecPtr, not abortedRecPtr. If I understand > correctly, abortedRecPtr is going to be the location in some WAL > segment which we replayed where a long record began, but > missingContrecPtr seems like it would have to point to the beginning > of the first segment we were unable to find to continue replay; and > thus it ought to be the same as lastReplayedEndRecPtr. So abortedRecPtr and missingContrecPtr both point to the same long record: the former is the start of the record, and the latter is some intermediate position where we failed to find the contrecord. lastReplayedEndRecPtr is the end+1 of the record *prior* to the long record. > But the committed code doesn't seem to check that these are the same > or verify the relationship between them in any way, so I'm worried > there is some other case here. Yeah, the only reason they are the same is that xlogreader sets both to Invalid when reading a record, and then sets both when a read fails. > The comments in XLogReadRecord also suggest this: > > * We get here when a record that spans multiple pages needs to be > * assembled, but something went wrong -- perhaps a contrecord piece > * was lost. If caller is WAL replay, it will know where the aborted > > Saying that "perhaps" a contrecord piece was lost seems to imply that > other explanations are possible as well, but I'm not sure what. Other explanations are possible. Imagine cosmic rays alter one byte in the last contrecord. WAL replay will stop there, and the contrecord will have been found all right, but CRC check would have failed to pass, so we would set xlogreader->missingContrecPtr to the final contrecord of that record (I didn't actually verify this particular scenario.) In fact, anything that causes xlogreader.c:XLogReadRecord to return NULL after setting "assembled=true" would set both abortedRecPtr and missingContrecPtr -- except DecodeXLogRecord failure, which perhaps should be handled in the same way. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
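To keep the three similarly-named pointers straight, here is a compact restatement of the relationships described above, written as the kind of comment one might keep next to the variables (illustrative only; the variable names are the real ones, the summary is drawn from the explanation in this exchange).

/*
 * When xlogreader fails while assembling a multi-page record:
 *
 *   abortedRecPtr          start LSN of the incomplete (long) record
 *   missingContrecPtr      intermediate position -- the page where the
 *                          continuation was expected but could not be read;
 *                          new WAL is written starting there
 *   lastReplayedEndRecPtr  end+1 of the record *prior* to the long record
 */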
Hi,

It seems that 026_overwrite_contrecord.pl test often fails under valgrind.  I
first thought that related failures on skink were due to me migrating the
animal to a new host (and then running out of space due to a mistake in ccache
config).  But it happened again after I fixed those, and I just reproduced the
issue locally.

It's a bit odd that we didn't start to see these failures immediately, but
only in the last few days.  I'd blame skink being migrated to a new home,
except that I can see the issue locally.

FWIW, the way skink runs all postgres instances through valgrind is by
replacing the postgres binary.  Here's my local version of that:

andres@awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
#!/bin/bash

exec /usr/bin/valgrind \
	--quiet \
	--error-exitcode=128 \
	--suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
	--trace-children=yes --track-origins=yes --read-var-info=no \
	--leak-check=no \
	--run-libc-freeres=no \
	--vgdb=no \
	--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
	/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
	"$@"

make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
...

not ok 1 - 000000010000000000000002 differs from 000000010000000000000002

#   Failed test '000000010000000000000002 differs from 000000010000000000000002'

Regards,

Andres
Hi,

On 2021-10-13 11:03:39 -0700, Andres Freund wrote:
> It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
> first thought that related failures on skink were due to me migrating the
> animal to a new host (and then running out of space due to a mistake in ccache
> config). But it happened again after I fixed those, and I just reproduced the
> issue locally.
>
> It's a bit odd that we didn't start to see these failures immediately, but
> only in the last few days. I'd blame skink being migrated to a new home,
> except that I can see the issue locally.
>
> FWIW, the way skink runs all postgres instances through valgrind is by
> replacing the postgres binary. Here's my local version of that:
>
> andres@awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
> #!/bin/bash
>
> exec /usr/bin/valgrind \
> 	--quiet \
> 	--error-exitcode=128 \
> 	--suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
> 	--trace-children=yes --track-origins=yes --read-var-info=no \
> 	--leak-check=no \
> 	--run-libc-freeres=no \
> 	--vgdb=no \
> 	--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
> 	/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
> 	"$@"
>
> make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
> ...
>
> not ok 1 - 000000010000000000000002 differs from 000000010000000000000002
>
> #   Failed test '000000010000000000000002 differs from 000000010000000000000002'

I added LSNs to the error message:

not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)

It appears that the problem is that in between the determination of
rows_walsize and the insertion, the insert LSN moved to the next
segment, presumably due to autovacuum/analyze or such.

<retries, with log_autovacuum_min_duration_statement=0, log_min_duration_statement=0>

2021-10-13 11:23:25.659 PDT [1491455] 026_overwrite_contrecord.pl LOG: statement: insert into filler select * from generate_series(1,1000)
2021-10-13 11:23:26.467 PDT [1491455] 026_overwrite_contrecord.pl LOG: duration: 861.112 ms
2021-10-13 11:23:27.055 PDT [1491458] 026_overwrite_contrecord.pl LOG: statement: select pg_current_wal_insert_lsn() - '0/0'
2021-10-13 11:23:27.357 PDT [1491458] 026_overwrite_contrecord.pl LOG: duration: 347.888 ms
2021-10-13 11:23:27.980 PDT [1491461] 026_overwrite_contrecord.pl LOG: statement: WITH setting AS ( SELECT setting::int AS wal_segsize FROM pg_settings WHERE name = 'wal_segment_size' ) INSERT INTO filler SELECT g FROM setting, generate_series(1, 1000 * (wal_segsize - ((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / 64232) g
2021-10-13 11:24:25.173 PDT [1491550] LOG: automatic analyze of table "postgres.public.filler" avg read rate: 3.185 MB/s, avg write rate: 0.039 MB/s buffer usage: 96 hits, 731 misses, 9 dirtied system usage: CPU: user: 1.79 s, system: 0.00 s, elapsed: 1.81 s
2021-10-13 11:24:26.255 PDT [1491461] 026_overwrite_contrecord.pl LOG: duration: 58360.811 ms
2021-10-13 11:24:26.857 PDT [1491557] 026_overwrite_contrecord.pl LOG: statement: SELECT pg_current_wal_insert_lsn()
2021-10-13 11:24:27.120 PDT [1491557] 026_overwrite_contrecord.pl LOG: duration: 300.562 ms

Hm. I guess we can disable autovac. But that's not a great solution, there
might be WAL files due to catalog access etc too.
Seems like it might be worth doing the "filling" of the segment with a loop in
a DO block instead, where the end condition is to be within some distance of
the end of the segment?  With plenty of headroom?

Another thing: filling a segment by inserting lots of very tiny rows is pretty
expensive.  Can't we use something a bit wider?  Perhaps even emit_message?

Greetings,

Andres Freund
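For concreteness, a rough SQL sketch (not the committed test; the row count and payload width are made-up numbers) of what filling WAL with wider, variable-size records via pg_logical_emit_message() could look like -- the function writes a WAL record of the given size and does not require a logical replication setup:

	-- Hedged sketch only: emit a few hundred non-transactional logical
	-- messages with random payload widths to advance the insert LSN quickly.
	SELECT pg_logical_emit_message(false, 'filler',
	                               repeat('x', 1 + (random() * 1024)::int))
	FROM generate_series(1, 500);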
On 2021-Oct-13, Andres Freund wrote:

> I added LSNs to the error message:
>
> not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)
>
> It appears that the problem is that in between the determination of
> rows_walsize and the insertion, the insert LSN moved to the next
> segment, presumably due to autovacuum/analyze or such.
>
> <retries, with log_autovacuum_min_duration_statement=0, log_min_duration_statement=0>

Oh, of course.

> Hm. I guess we can disable autovac. But that's not a great solution, there
> might be WAL files due to catalog access etc too.

Well, we don't expect anything else to happen -- the cluster is
otherwise idle.  I think we should do it regardless of any other
changes, just to keep things steadier.

> Seems like it might be worth doing the "filling" of the segment with a loop in
> a DO block instead, where the end condition is to be within some distance of
> the end of the segment?  With plenty of headroom?

Eh, good idea ... didn't think of that, but it should keep things more
stable under strange conditions.

> Another thing: filling a segment by inserting lots of very tiny rows is pretty
> expensive. Can't we use something a bit wider? Perhaps even emit_message?

I think I realized partway through writing the test that I could use
emit_message instead of using a batched row insert ... so, yeah, we
can use it here also.

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"
Hi, On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote: > > Hm. I guess we can disable autovac. But that's not a great solution, there > > might be WAL files due to catalog access etc too. > > Well, we don't expect anything else to happen -- the cluster is > otherwise idle. I think we should do it regardless of any other > changes, just to keep things steadier. IDK, it seems good to have a bit of variance as well. But I don't have a strong opinion on it. > > Another thing: filling a segment by inserting lots of very tiny rows is pretty > > expensive. Can't we use something a bit wider? Perhaps even emit_message? FWIW, the count of inserted rows is something like 171985 ;) > I think I realized partway through writing the test that I could use > emit_message instead of using a batched row insert ... so, yeah, we > can use it here also. Cool. Even if we want to use inserts, lets at least make the rows wide... I think it'd be good to have a bit of variance in record width. So perhaps adding a bit of random() in to influence record width would be a good idea? Greetings, Andres Freund
On 2021-Oct-13, Andres Freund wrote:

> > > Another thing: filling a segment by inserting lots of very tiny rows is pretty
> > > expensive. Can't we use something a bit wider? Perhaps even emit_message?
>
> FWIW, the count of inserted rows is something like 171985 ;)

This does ~1600 iterations to fill one segment, 10 rows per iteration, row
size is variable; exits when two BLCKSZ remain to complete the WAL segment:

create table filler (a int, b text);

do $$
declare
  wal_segsize int := setting::int from pg_settings where name = 'wal_segment_size';
  remain int;
  iters int := 0;
begin
  loop
    insert into filler
      select g, repeat(md5(g::text), (random() * 60 + 1)::int)
        from generate_series(1, 10) g;
    remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
    raise notice '(%) remain: %', iters, remain;
    if remain < 2 * setting::int from pg_settings where name = 'block_size' then
      exit;
    end if;
    iters := iters + 1;
  end loop;
end $$ ;

(Of course, I'm not proposing that the 'raise notice' be there in the
committed form.)

If I enlarge the 'repeat' count, it gets worse (more iterations required)
because a lot of the rows become toasted and thus subject to compression.
If I do 20 rows per iteration rather than 10, the risk is that we'll do too
many near the end of the segment and we'll have to continue running until
completing the next one.  So, this seems good enough.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of any
database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
Hi, On 2021-10-13 12:13:45 -0700, Andres Freund wrote: > On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote: > > I think I realized partway through writing the test that I could use > > emit_message instead of using a batched row insert ... so, yeah, we > > can use it here also. > > Cool. Even if we want to use inserts, lets at least make the rows wide... > > I think it'd be good to have a bit of variance in record width. So perhaps > adding a bit of random() in to influence record width would be a good idea? Something very roughly like the attached. Perhaps that's going a bit overboard though. But it seems like it might be something we could use in a few tests?
Attachment
This works nicely with the TAP test: -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Postgres is bloatware by design: it was built to house PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Attachment
Hi On 2021-Oct-13, Andres Freund wrote: > Something very roughly like the attached. Perhaps that's going a bit overboard > though. But it seems like it might be something we could use in a few tests? Hah, our emails crossed. If you want to turn this into a patch to the 026 test file, please go ahead. Failing that, I'd just push the patch I just sent in my other reply. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
Hi, On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote: > On 2021-Oct-13, Andres Freund wrote: > > > Something very roughly like the attached. Perhaps that's going a bit overboard > > though. But it seems like it might be something we could use in a few tests? > > Hah, our emails crossed. If you want to turn this into a patch to the > 026 test file, please go ahead. Failing that, I'd just push the patch I > just sent in my other reply. Yea, let's go for your patch then. I've verified that at least locally it passes under valgrind. Greetings, Andres Freund
On 2021-Oct-13, Andres Freund wrote: > Hi, > > On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote: > > On 2021-Oct-13, Andres Freund wrote: > > > > > Something very roughly like the attached. Perhaps that's going a bit overboard > > > though. But it seems like it might be something we could use in a few tests? > > > > Hah, our emails crossed. If you want to turn this into a patch to the > > 026 test file, please go ahead. Failing that, I'd just push the patch I > > just sent in my other reply. > > Yea, let's go for your patch then. I've verified that at least locally it > passes under valgrind. Ah great, thanks. Pushed then. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "But static content is just dynamic content that isn't moving!" http://smylers.hates-software.com/2007/08/15/fe244d0c.html
On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Oct-13, Amul Sul wrote: > > > I have one more question, regarding the need for other global > > variables i.e. abortedRecPtr. (Sorry for coming back after so long.) > > > > Instead of abortedRecPtr point, isn't enough to write > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > are pointing to the same location then can't we use > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > overwrite-contrecord and remove need of extra global variable, like > > attached? > > I'm a bit fuzzy on the difference "the end+1" and "the start of the next > record". Are they always the same? We do have XLogRecPtrToBytePos() > and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to > "usable byte positions", which suggests to me that the proposed patch > may fail if end+1 is a page or segment boundary. > Yes, you are correct, that could be a possible failure. How about calculating that from the lastReplayedEndRecPtr by converting it first to "usable byte positions" and then recalculating the record pointer from that, like attached? > The other difference is that abortedRecPtr is set if we fail to read a > record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the > record successfully. So you'd have need a bool flag that the overwrite > contrecord record needs to be written. Your patch is using the fact > that missingContrecPtr is non-invalid as such a flag ... I can't see > anything wrong with that. So maybe your patch is okay in this aspect. > > > You might wonder why I am so concerned about the global variable. The > > reason is that I am working on another thread[1] where we are trying > > to club all the WAL write operations that happen at the end of > > StartupXLOG into a separate function. In the future, we might want to > > allow executing this function from other processes (e.g. > > Checkpointer). For that, we need to remove the dependency of those WAL > > write operations having on the global variables which are mostly valid > > in the startup process. > > Seems a fine goal. Thanks for looking at the patch. Regards, Amul
Attachment
On Thu, Oct 14, 2021 at 6:14 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2021-Oct-13, Amul Sul wrote: > > > > > I have one more question, regarding the need for other global > > > variables i.e. abortedRecPtr. (Sorry for coming back after so long.) > > > > > > Instead of abortedRecPtr point, isn't enough to write > > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > > are pointing to the same location then can't we use > > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > > overwrite-contrecord and remove need of extra global variable, like > > > attached? > > > > I'm a bit fuzzy on the difference "the end+1" and "the start of the next > > record". Are they always the same? We do have XLogRecPtrToBytePos() > > and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to > > "usable byte positions", which suggests to me that the proposed patch > > may fail if end+1 is a page or segment boundary. > > > > Yes, you are correct, that could be a possible failure. > > How about calculating that from the lastReplayedEndRecPtr by > converting it first to "usable byte positions" and then recalculating > the record pointer from that, like attached? > Any thoughts about the patch posted previously? Regards, Amul
At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in > Any thoughts about the patch posted previously? Honestly, xlogreader looks fine with the current shape. The reason is that it seems cleaner as an interface boundary since the caller of xlogreader doesn't need to know about the details of xlogreader. The current code nicely hides the end+1 confusion. Even if we want to get rid of global variables in xlog.c, I don't understand why we remove only abortedRecPtr. That change makes things more complex as a whole by letting xlog.c be more conscious of xlogreader's internals. I'm not sure I like that aspect of the patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in > > Any thoughts about the patch posted previously? > > Honestly, xlogreader looks fine with the current shape. The reason is > that it seems cleaner as an interface boundary since the caller of > xlogreader doesn't need to know about the details of xlogreader. The > current code nicely hides the end+1 confusion. > > Even if we want to get rid of global variables in xlog.c, I don't > understand why we remove only abortedRecPtr. That change makes things > more complex as a whole by letting xlog.c be more conscious of > xlogreader's internals. I'm not sure I like that aspect of the patch. > Because we have other ways to get abortedRecPtr without having a global variable, but we don't have such a way for missingContrecPtr, AFAICU. I agree using global variables makes things a bit easier, but those are inefficient when you want to share those with other processes -- that would add extra burden to shared memory. Regards, Amul
At Mon, 25 Oct 2021 10:34:27 +0530, Amul Sul <sulamul@gmail.com> wrote in
> On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > > Any thoughts about the patch posted previously?
> >
> > Honestly, xlogreader looks fine with the current shape. The reason is
> > that it seems cleaner as an interface boundary since the caller of
> > xlogreader doesn't need to know about the details of xlogreader. The
> > current code nicely hides the end+1 confusion.
> >
> > Even if we want to get rid of global variables in xlog.c, I don't
> > understand why we remove only abortedRecPtr. That change makes things
> > more complex as a whole by letting xlog.c be more conscious of
> > xlogreader's internals. I'm not sure I like that aspect of the patch.
> >
>
> Because we have other ways to get abortedRecPtr without having a
> global variable, but we don't have such a way for missingContrecPtr,
> AFAICU.

That depends on the reason why you want to get rid of the global
variables.  Since we restart WAL reading before reading the two
variables, we cannot rely on the xlogreader's corresponding members; we
need another set of variables to preserve the values across the restart.

> I agree using global variables makes things a bit easier, but those
> are inefficient when you want to share those with other processes --
> that would add extra burden to shared memory.

We could simply add a new member in XLogCtlData.  Or we can create
another struct for ReadRecord's (not XLogReader's) state and then
allocate shared memory for it.  I don't think it is the right solution
to infer it from another variable using knowledge of xlogreader's
internals.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
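As an illustration of the shared-memory route mentioned here, a hypothetical sketch (these fields do not correspond to any posted patch; the names are made up):

	/*
	 * Hedged sketch only: if the aborted-record state must be usable
	 * outside the startup process, it could live in shared memory rather
	 * than in process-local globals.  Field names are hypothetical.
	 */
	typedef struct XLogCtlData
	{
		/* ... existing members ... */

		/* state needed to write the OVERWRITE_CONTRECORD record later */
		XLogRecPtr	abortedRecPtr;		/* start LSN of the record we gave up on */
		XLogRecPtr	missingContrecPtr;	/* LSN where its continuation was expected */

		/* ... existing members ... */
	} XLogCtlData;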
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Oct-13, Andres Freund wrote:
>> Yea, let's go for your patch then. I've verified that at least locally it
>> passes under valgrind.

> Ah great, thanks.  Pushed then.

Seems like this hasn't fixed the problem: skink still fails on
this test occasionally.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-10-22%2013%3A52%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-11-02%2000%3A28%3A30

Both of these look like

#   Failed test '000000010000000000000002 differs from 000000010000000000000002'
#   at t/026_overwrite_contrecord.pl line 61.
# Looks like you failed 1 test of 3.
t/026_overwrite_contrecord.pl ........
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests

which looks like the same thing we were seeing before.  010e52337
seems to have just decreased the probability of failure.

			regards, tom lane
Hello Alvaro, 14.10.2021 01:09, Alvaro Herrera wrote: >> Yea, let's go for your patch then. I've verified that at least locally it >> passes under valgrind. > Ah great, thanks. Pushed then. > While translating messages I've noticed that the version of the patch ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully". Please consider committing the fix. Best regards, Alexander
Attachment
On 2021-Nov-08, Alexander Lakhin wrote: > Hello Alvaro, > 14.10.2021 01:09, Alvaro Herrera wrote: > >> Yea, let's go for your patch then. I've verified that at least locally it > >> passes under valgrind. > > Ah great, thanks. Pushed then. > > > While translating messages I've noticed that the version of the patch > ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully". > Please consider committing the fix. Thanks, pushed. I also modified the .po files that contained the typo so as not to waste the translators' efforts. I blamed the wrong commit in the 9.6 commit message. Sigh. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
I wrote:
> Seems like this hasn't fixed the problem: skink still fails on
> this test occasionally.

> #   Failed test '000000010000000000000002 differs from 000000010000000000000002'
> #   at t/026_overwrite_contrecord.pl line 61.

This is still happening off and on, which makes it look like a
timing-sensitive problem.  Confirming that, I can make it fail
every time by adding a long sleep just ahead of where
026_overwrite_contrecord.pl captures $initfile.  On reflection
I think the problem is obvious: if autovacuum does anything
concurrently with the test's startup portion, it will cause the
carefully-established WAL insertion point to move into the
next segment.  I propose to add "autovacuum = off" to the
test's postmaster configuration.

Also, I think we want

-ok($initfile != $endfile, "$initfile differs from $endfile");
+ok($initfile ne $endfile, "$initfile differs from $endfile");

The existing coding works as long as all characters of these
WAL segment names happen to be decimal digits, but ...

			regards, tom lane
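For reference, a minimal sketch of the proposed configuration change, assuming the test sets up its primary through the usual TAP node object (the variable name here is an assumption, not a quote from the test):

	# Hedged sketch only: keep autovacuum from emitting WAL of its own
	# while the test is carefully filling a segment up to its last page.
	$node->append_conf('postgresql.conf', 'autovacuum = off');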
On 2021-Nov-09, Tom Lane wrote: > This is still happening off and on, which makes it look like a > timing-sensitive problem. Confirming that, I can make it fail > every time by adding a long sleep just ahead of where > 026_overwrite_contrecord.pl captures $initfile. On reflection > I think the problem is obvious: if autovacuum does anything > concurrently with the test's startup portion, it will cause the > carefully-established WAL insertion point to move into the > next segment. I propose to add "autovacuum = off" to the > test's postmaster configuration. Ooh, of course. > Also, I think we want > > -ok($initfile != $endfile, "$initfile differs from $endfile"); > +ok($initfile ne $endfile, "$initfile differs from $endfile"); > > The existing coding works as long as all characters of these > WAL segment names happen to be decimal digits, but ... Argh! Thanks for taking care of these issues. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Also, I think we want
>
> -ok($initfile != $endfile, "$initfile differs from $endfile");
> +ok($initfile ne $endfile, "$initfile differs from $endfile");
>
> The existing coding works as long as all characters of these
> WAL segment names happen to be decimal digits, but ...

Even better style (IMO) would be:

    isnt($initfile, $endfile, "WAL file name has changed");

Or some other more descriptive message of _why_ it should have changed.

- ilmari
We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     | l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.

			regards, tom lane
On 2021-Nov-23, Tom Lane wrote: > We're *still* not out of the woods with 026_overwrite_contrecord.pl, > as we are continuing to see occasional "mismatching overwritten LSN" > failures, further down in the test where it tries to start up the > standby: Augh. > Looking at adjacent successful runs, it seems that the exact point > where the "missing contrecord" starts varies substantially, even after > our previous fix to disable autovacuum in this test. How could that be? Well, there is intentionally some variability. Maybe not as much as one would wish, but I expect that that should explain why that point is not always the same. > It's probably for the best though, because I think this is exposing > an actual bug that we would not have seen if the start point were > completely consistent. I have not dug into the code, but it looks to > me like if the "consistent recovery state" is reached exactly at a > page boundary (0/1FFE000 in all these cases), then the standby expects > that to be what the OVERWRITE_CONTRECORD record will point at. But > actually it points to the first WAL record on that page, resulting > in a bogus failure. So what is happening is that we set state->overwrittenRecPtr to the LSN of page start, ignoring the page header. Is that the LSN of the first record in a page? I'll see if I can reproduce the problem. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La persona que no quería pecar / estaba obligada a sentarse en duras y empinadas sillas / desprovistas, por cierto de blandos atenuantes" (Patricio Vogel)
On Wed, Nov 24, 2021 at 2:10 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Nov-23, Tom Lane wrote:
>
> > We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> > as we are continuing to see occasional "mismatching overwritten LSN"
> > failures, further down in the test where it tries to start up the
> > standby:
>
> Augh.
>
> > Looking at adjacent successful runs, it seems that the exact point
> > where the "missing contrecord" starts varies substantially, even after
> > our previous fix to disable autovacuum in this test.  How could that be?
>
> Well, there is intentionally some variability.  Maybe not as much as one
> would wish, but I expect that that should explain why that point is not
> always the same.
>
> > It's probably for the best though, because I think this is exposing
> > an actual bug that we would not have seen if the start point were
> > completely consistent.  I have not dug into the code, but it looks to
> > me like if the "consistent recovery state" is reached exactly at a
> > page boundary (0/1FFE000 in all these cases), then the standby expects
> > that to be what the OVERWRITE_CONTRECORD record will point at.  But
> > actually it points to the first WAL record on that page, resulting
> > in a bogus failure.
>
> So what is happening is that we set state->overwrittenRecPtr to the LSN
> of page start, ignoring the page header.  Is that the LSN of the first
> record in a page?  I'll see if I can reproduce the problem.
>

In XLogReadRecord(), the two variables being compared are assigned
inconsistently -- one gets assigned from state->currRecPtr while the
other gets RecPtr:

    .....
    state->overwrittenRecPtr = state->currRecPtr;
    .....
    state->abortedRecPtr = RecPtr;
    .....

Before the place where the assembled flag gets set, there is a bunch of
code that adjusts RecPtr.  I think instead of RecPtr, the latter
assignment should use state->currRecPtr as well.

Regards,
Amul
On 2021-Nov-25, Amul Sul wrote:

> In XLogReadRecord(), the two variables being compared are assigned
> inconsistently -- one gets assigned from state->currRecPtr while the
> other gets RecPtr:
>
>     .....
>     state->overwrittenRecPtr = state->currRecPtr;
>     .....
>     state->abortedRecPtr = RecPtr;
>     .....
>
> Before the place where the assembled flag gets set, there is a bunch of
> code that adjusts RecPtr.  I think instead of RecPtr, the latter
> assignment should use state->currRecPtr as well.

You're exactly right.  I managed to reproduce the problem shown by
buildfarm members, and indeed this fixes it.  And it makes sense: the
adjustment you refer to is precisely to skip the page header when the
LSN is the start of the page, which is exactly the problem we're seeing
in the buildfarm ... except that on lapwing, in branch REL_11_STABLE,
we're seeing that the LSN is off by 0x14 instead of 0x18.  That seems
very strange.  I think the reason for this is that lapwing has MAXALIGN
4, so MAXALIGN(sizeof(XLogPageHeaderData)) is 20, not 24 as is the case
in the other failing members.

... checks buildfarm ...  Yeah, all the others in Tom's list are x86-64.

I'm pushing the fix in a minute.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Qué importan los años? Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo" (Mafalda)
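Spelling out the arithmetic behind those two offsets (illustrative only, based on the struct layout in xlog_internal.h):

    XLogPageHeaderData's fields add up to 2 + 2 + 4 + 8 + 4 = 20 bytes.
    With MAXALIGN 8 (the x86-64 members): SizeOfXLogShortPHD = 24, so the
      first record on the page starts at 0/1FFE000 + 0x18 = 0/1FFE018.
    With MAXALIGN 4 (lapwing):            SizeOfXLogShortPHD = 20, so the
      first record on the page starts at 0/1FFE000 + 0x14 = 0/1FFE014.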
Oh, but also I think I should push a mitigation in case a production system hits this problem: maybe reduce the message from FATAL to WARNING if the registered LSN is at a page boundary. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Entristecido, Wutra (canción de Las Barreras) echa a Freyr a rodar y a nosotros al mar"
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Oh, but also I think I should push a mitigation in case a production > system hits this problem: maybe reduce the message from FATAL to WARNING > if the registered LSN is at a page boundary. Uh, why? The fix should remove the problem, and if it doesn't, we're still looking at inconsistent WAL aren't we? regards, tom lane
On 2021-Nov-25, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Oh, but also I think I should push a mitigation in case a production > > system hits this problem: maybe reduce the message from FATAL to WARNING > > if the registered LSN is at a page boundary. > > Uh, why? The fix should remove the problem, and if it doesn't, we're > still looking at inconsistent WAL aren't we? The problem is that the bug occurs while writing the WAL record. Fixed servers won't produce such records, but if you run an unpatched server and it happens to write one, without a mitigation you cannot get away from FATAL during replay. Since this bug exists in released minors, we should allow people to upgrade to a newer version if they hit it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2021-Nov-25, Tom Lane wrote: >> Uh, why? The fix should remove the problem, and if it doesn't, we're >> still looking at inconsistent WAL aren't we? > The problem is that the bug occurs while writing the WAL record. Fixed > servers won't produce such records, but if you run an unpatched server > and it happens to write one, without a mitigation you cannot get away > from FATAL during replay. Really? AFAICS the WAL record contains the correct value, or at least we should define that one as being correct, for precisely this reason. regards, tom lane
On 2021-Nov-25, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > > The problem is that the bug occurs while writing the WAL record. Fixed > > servers won't produce such records, but if you run an unpatched server > > and it happens to write one, without a mitigation you cannot get away > > from FATAL during replay. > > Really? AFAICS the WAL record contains the correct value, or at least > we should define that one as being correct, for precisely this reason. I don't know what is the correct value for a record that comes exactly after the page header. But here's a patch that fixes the problem; and if a standby replays WAL written by an unpatched primary, it will be able to read past instead of dying of FATAL. I originally wrote this to have a WARNING in VerifyOverwriteContrecord (in the cases that are new), with the idea that it'd prompt people to upgrade, but that's probably a waste of time. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Nov-25, Tom Lane wrote:
>> Really?  AFAICS the WAL record contains the correct value, or at least
>> we should define that one as being correct, for precisely this reason.

> I don't know what is the correct value for a record that comes exactly
> after the page header.  But here's a patch that fixes the problem; and
> if a standby replays WAL written by an unpatched primary, it will be
> able to read past instead of dying of FATAL.

Meh ... but given the simplicity of the write-side fix, maybe changing
it is appropriate.

However, this seems too forgiving:

+	if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogShortPHD != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)

The latter two cases should only be accepted if overwrittenRecPtr is
exactly at a page boundary.

			regards, tom lane
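To make that concrete, a hedged sketch of the tightened condition (variable names follow the quoted patch; XLogSegmentOffset() and XLOG_BLCKSZ are the usual xlog macros; this is not the committed code):

	/*
	 * Hedged sketch only: accept a header-sized discrepancy only when the
	 * standby's computed LSN really sits on a page (or segment) boundary,
	 * as an unpatched primary would have recorded it.
	 */
	bool		lsn_ok;

	lsn_ok = xlrec->overwritten_lsn == state->overwrittenRecPtr ||
		/* record began right after a short page header */
		(state->overwrittenRecPtr % XLOG_BLCKSZ == 0 &&
		 XLogSegmentOffset(state->overwrittenRecPtr, wal_segment_size) != 0 &&
		 xlrec->overwritten_lsn - SizeOfXLogShortPHD == state->overwrittenRecPtr) ||
		/* record began right after a long (segment-start) page header */
		(XLogSegmentOffset(state->overwrittenRecPtr, wal_segment_size) == 0 &&
		 xlrec->overwritten_lsn - SizeOfXLogLongPHD == state->overwrittenRecPtr);

	if (!lsn_ok)
	{
		/* the existing "mismatching overwritten LSN" FATAL report goes here */
	}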
I wrote: > However, this seems too forgiving: ... also, I don't know if you intended this already, but the VerifyOverwriteContrecord change should only be applied in back branches. There's no need for it in HEAD. regards, tom lane
On Fri, Nov 26, 2021 at 1:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2021-Nov-25, Tom Lane wrote:
> >> Really?  AFAICS the WAL record contains the correct value, or at least
> >> we should define that one as being correct, for precisely this reason.
>
> > I don't know what is the correct value for a record that comes exactly
> > after the page header.  But here's a patch that fixes the problem; and
> > if a standby replays WAL written by an unpatched primary, it will be
> > able to read past instead of dying of FATAL.
>
> Meh ... but given the simplicity of the write-side fix, maybe changing
> it is appropriate.
>
> However, this seems too forgiving:
>
> +	if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
> +		xlrec->overwritten_lsn - SizeOfXLogShortPHD != state->overwrittenRecPtr &&
> +		xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)
>

Unless I am missing something, I am not sure why we need this adjustment
if we are going to use the state->currRecPtr value, which doesn't seem
to be changing at all.  AFAICU, state->currRecPtr will have the same
value whether we are going to set overwrittenRecPtr or abortedRecPtr.
Do the primary and standby ever see state->currRecPtr differently?  I
guess not.

Regards,
Amul
On 2021-Nov-26, Amul Sul wrote:

> On Fri, Nov 26, 2021 at 1:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Meh ... but given the simplicity of the write-side fix, maybe changing
> > it is appropriate.

Actually, fixing the other side is equally simple, and it is also more
correct.  What changed my mind is that upon completing a successful read
of a record, what we set as state->ReadRecPtr is the local variable
RecPtr -- so that is what becomes the true LSN of the record.  Using
state->currRecPtr is inconsistent with that definition.

> Unless I am missing something, I am not sure why we need this adjustment
> if we are going to use the state->currRecPtr value, which doesn't seem
> to be changing at all.  AFAICU, state->currRecPtr will have the same
> value whether we are going to set overwrittenRecPtr or abortedRecPtr.
> Do the primary and standby ever see state->currRecPtr differently?  I
> guess not.

You're right for the wrong reason.  We don't need the adjustment in the
verify routine.  The reason we don't is that we're not going to use
state->currRecPtr anymore, but rather RecPtr in both places.

You're thinking that primary and standby would never "see
state->currRecPtr differently", but that's only if they are both running
the same code.  If you had a primary running 14.1 and a standby running
14.2, with the previously proposed fix (using state->currRecPtr), you
would be in trouble.  With this fix (using RecPtr) it works fine.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
On 2021-Sep-03, Alvaro Herrera wrote: > The last commit is something I noticed in pg_rewind ... I had missed this one; it's pushed now. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I can see support will not be a problem. 10 out of 10." (Simon Wittber) (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)