Thread: .ready and .done files considered harmful
I and various colleagues of mine have from time to time encountered systems that got a bit behind on WAL archiving, because the archive_command started failing and nobody noticed right away. Ideally, people should have monitoring for this and put it to rights immediately, but some people don't. If those people happen to have a relatively small pg_wal partition, they will likely become aware of the issue when it fills up and takes down the server, but some users provision disk space pretty generously and therefore nothing compels them to notice the issue until they fill it up. In at least one case, on a system that was actually generating a reasonable amount of WAL, this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios, but the user is generally less concerned with solving that problem than they are with getting the system back up. It is doubtless true that the user would prefer to shrink the disk usage down to something more reasonable over time, but on the facts as presented, it can't really be an urgent issue for them. What they really need is just to free up a little disk space somehow or other and then get archiving running fast enough to keep up with future WAL generation. Regrettably, the archiver cannot do this, not even if you set archive_command = /bin/true, because the archiver will barely ever actually run the archive_command. Instead, it will spend virtually all of its time calling readdir(), because for some reason it feels a need to make a complete scan of the archive_status directory before archiving a WAL file, and then it has to make another scan before archiving the next one.

Someone - and it's probably for the best that the identity of that person remains unknown to me - came up with a clever solution to this problem, which is now used almost as a matter of routine whenever this comes up. You just run pg_archivecleanup on your pg_wal directory, and then remove all the corresponding .ready files and call it a day. I haven't scrutinized the code for pg_archivecleanup, but evidently it avoids needing O(n^2) time for this and therefore can clean up the whole directory in something like the amount of time the archiver would take to deal with a single file. While this seems to be quite an effective procedure and I have not yet heard any user complaints, it seems disturbingly error-prone, and honestly it shouldn't ever be necessary.

The issue here is only that pgarch.c acts as though after archiving 000000010000000000000001, 000000010000000000000002, and then 000000010000000000000003, we have no idea what file we might need to archive next. Could it, perhaps, be 000000010000000000000004? Only a full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people will have further suggestions. A relatively non-invasive fix would be to teach pgarch.c how to increment a WAL file name. After archiving segment N, check using stat() whether there's a .ready file for segment N+1. If so, do that one next. If not, then fall back to performing a full directory scan. As far as I can see, this is just cheap insurance. If archiving is keeping up, the extra stat() won't matter much. If it's not, this will save more system calls than it costs. Since during normal operation it shouldn't really be possible for files to show up in pg_wal out of order, I don't really see a scenario where this changes the behavior, either.
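To illustrate, the incremental check might look roughly like this (a sketch only, not a patch; it assumes the segment-name handling from xlog_internal.h and leaves timelines and history files to the existing code path):

#include "postgres.h"

#include <sys/stat.h>

#include "access/xlog.h"			/* wal_segment_size (assumed declaration) */
#include "access/xlog_internal.h"	/* XLogFileName, XLOGDIR, MAXFNAMELEN */

/*
 * Sketch of the "try segment N+1 first" idea: remember the TLI and segment
 * number of the file we just archived, guess that the next segment in
 * sequence is the next one to archive, and only fall back to the full
 * readdir() scan of archive_status when the guess is wrong.  The xlog buffer
 * must be at least MAXFNAMELEN bytes.
 */
static bool
pgarch_guess_next_ready(TimeLineID tli, XLogSegNo last_segno, char *xlog)
{
	char		status_path[MAXPGPATH];
	struct stat st;

	/* Build the name of segment N+1 on the same timeline. */
	XLogFileName(xlog, tli, last_segno + 1, wal_segment_size);
	snprintf(status_path, MAXPGPATH, XLOGDIR "/archive_status/%s.ready", xlog);

	/* One stat() instead of a whole directory scan. */
	if (stat(status_path, &st) == 0)
		return true;

	return false;	/* caller falls back to the existing directory scan */
}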
If there are gaps in the sequence at startup time, this will cope with them exactly the same as we do now, except with a better chance of finishing before I retire. However, that's still pretty wasteful. Every time we have to wait for the next file to be ready for archiving, we'll basically fall back to repeatedly scanning the whole directory, waiting for it to show up. And I think that we can't get around that by just using stat() to look for the appearance of the file we expect to see, because it's possible that we might be doing all of this on a standby which then gets promoted, or some upstream primary gets promoted, and WAL files start appearing on a different timeline, making our prediction of what the next filename will be incorrect.

But perhaps we could work around this by allowing pgarch.c to access shared memory, in which case it could examine the current timeline whenever it wants, and probably also whatever LSNs it needs to know what's safe to archive. If we did that, could we just get rid of the .ready and .done files altogether? Are they just a really expensive IPC mechanism to avoid a shared memory connection, or is there some more fundamental reason why we need them? And is there any good reason why the archiver shouldn't be connected to shared memory? It is certainly nice to avoid having more processes connected to shared memory than necessary, but the current scheme is so inefficient that I think we end up worse off.

Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-03 16:49:16 -0400, Robert Haas wrote: > I have two possible ideas for addressing this; perhaps other people > will have further suggestions. A relatively non-invasive fix would be > to teach pgarch.c how to increment a WAL file name. After archiving > segment N, check using stat() whether there's an .ready file for > segment N+1. If so, do that one next. If not, then fall back to > performing a full directory scan. Hm. I wonder if it'd not be better to determine multiple files to be archived in one readdir() pass? > As far as I can see, this is just cheap insurance. If archiving is > keeping up, the extra stat() won't matter much. If it's not, this will > save more system calls than it costs. Since during normal operation it > shouldn't really be possible for files to show up in pg_wal out of > order, I don't really see a scenario where this changes the behavior, > either. If there are gaps in the sequence at startup time, this will > cope with it exactly the same as we do now, except with a better > chance of finishing before I retire. There's definitely gaps in practice :(. Due to the massive performance issues with archiving there are several tools that archive multiple files as part of one archive command invocation (and mark the additional archived files as .done immediately). > However, that's still pretty wasteful. Every time we have to wait for > the next file to be ready for archiving, we'll basically fall back to > repeatedly scanning the whole directory, waiting for it to show up. Hm. That seems like it's only an issue because .done and .ready are in the same directory? Otherwise the directory would be empty while we're waiting for the next file to be ready to be archived. I hate that that's a thing but given the serial nature of archiving, with high per-call overhead, I don't think it'd be ok to just break that without a replacement :(. > But perhaps we could work around this by allowing pgarch.c to access > shared memory, in which case it could examine the current timeline > whenever it wants, and probably also whatever LSNs it needs to know > what's safe to archive. FWIW, the shared memory stats patch implies doing that, since the archiver reports stats. > If we did that, could we just get rid of the .ready and .done files > altogether? Are they just a really expensive IPC mechanism to avoid a > shared memory connection, or is there some more fundamental reason why > we need them? What kind of shared memory mechanism are you thinking of? Due to timelines and history files I don't think simple position counters would be quite enough. I think the aforementioned "batching" archive commands are part of the problem :(. > And is there any good reason why the archiver shouldn't be connected > to shared memory? It is certainly nice to avoid having more processes > connected to shared memory than necessary, but the current scheme is > so inefficient that I think we end up worse off. I think there is no fundamental reason for avoiding shared memory in the archiver. I guess there's a minor robustness advantage, because the forked shell to start the archive command won't be attached to shared memory. But that's only until the child exec()s to the archive command. There is some minor performance advantage as well, not having to process the often large and contended memory mapping for shared_buffers is probably measurable - but swamped by the cost of needing to actually archive the segment. 
My only "concern" with doing anything around this is that I think the whole approach of archive_command is just hopelessly broken, with even just halfway busy servers only able to keep up archiving if they muck around with postgres internal data during archive command execution. Add to that how hard it is to write a robust archive command (e.g. the one in our docs still suggests test ! -f && cp, which means that copy failing in the middle yields an incomplete archive)... While I don't think it's all that hard to design a replacement, it's however likely still more work than addressing the O(n^2) issue, so ... Greetings, Andres Freund
> On 4 May 2021, at 09:27, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-05-03 16:49:16 -0400, Robert Haas wrote: >> I have two possible ideas for addressing this; perhaps other people >> will have further suggestions. A relatively non-invasive fix would be >> to teach pgarch.c how to increment a WAL file name. After archiving >> segment N, check using stat() whether there's an .ready file for >> segment N+1. If so, do that one next. If not, then fall back to >> performing a full directory scan. > > Hm. I wonder if it'd not be better to determine multiple files to be > archived in one readdir() pass? FWIW we use both methods [0]. WAL-G has a pipe with WAL-push candidates. We add there some predictions, and if it does not fill upload concurrency - list archive_status contents (concurrently to background uploads). > > >> As far as I can see, this is just cheap insurance. If archiving is >> keeping up, the extra stat() won't matter much. If it's not, this will >> save more system calls than it costs. Since during normal operation it >> shouldn't really be possible for files to show up in pg_wal out of >> order, I don't really see a scenario where this changes the behavior, >> either. If there are gaps in the sequence at startup time, this will >> cope with it exactly the same as we do now, except with a better >> chance of finishing before I retire. > > There's definitely gaps in practice :(. Due to the massive performance > issues with archiving there are several tools that archive multiple > files as part of one archive command invocation (and mark the additional > archived files as .done immediately). Interestingly, we used to rename .ready->.done some years ago. But pgBackRest developers convinced me that it's not a good idea to mess with the data dir [1]. Then pg_probackup developers convinced me that renaming .ready->.done on our own scales better and implemented this functionality for us [2]. >> If we did that, could we just get rid of the .ready and .done files >> altogether? Are they just a really expensive IPC mechanism to avoid a >> shared memory connection, or is there some more fundamental reason why >> we need them? > > What kind of shared memory mechanism are you thinking of? Due to > timelines and history files I don't think simple position counters would > be quite enough. > > I think the aforementioned "batching" archive commands are part of the > problem :(. I'd be happy if we had a table with files that need to be archived, a table with registered archivers and a function to say "archiver number X has done its job on file Y". Archiver could listen to some archiver channel while sleeping or something like that. Thanks! Best regards, Andrey Borodin. [0] https://github.com/x4m/wal-g/blob/c8a785217fe1123197280fd24254e51492bf5a68/internal/bguploader.go#L119-L137 [1] https://www.postgresql.org/message-id/flat/20180828200754.GI3326%40tamriel.snowman.net#0b07304710b9ce5244438b7199447ee7 [2] https://github.com/wal-g/wal-g/pull/950
On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote: > On 2021-05-03 16:49:16 -0400, Robert Haas wrote: > > I have two possible ideas for addressing this; perhaps other people > > will have further suggestions. A relatively non-invasive fix would be > > to teach pgarch.c how to increment a WAL file name. After archiving > > segment N, check using stat() whether there's an .ready file for > > segment N+1. If so, do that one next. If not, then fall back to > > performing a full directory scan. > > Hm. I wonder if it'd not be better to determine multiple files to be > archived in one readdir() pass? I think both methods have some merit. If we had a way to pass a range of files to archive_command instead of just one, then your way is distinctly better, and perhaps we should just go ahead and invent such a thing. If not, your way doesn't entirely solve the O(n^2) problem, since you have to choose some upper bound on the number of file names you're willing to buffer in memory, but it may lower it enough that it makes no practical difference. I am somewhat inclined to think that it would be good to start with the method I'm proposing, since it is a clear-cut improvement over what we have today and can be done with a relatively limited amount of code change and no redesign, and then perhaps do something more ambitious afterward. > There's definitely gaps in practice :(. Due to the massive performance > issues with archiving there are several tools that archive multiple > files as part of one archive command invocation (and mark the additional > archived files as .done immediately). Good to know. > > However, that's still pretty wasteful. Every time we have to wait for > > the next file to be ready for archiving, we'll basically fall back to > > repeatedly scanning the whole directory, waiting for it to show up. > > Hm. That seems like it's only an issue because .done and .ready are in > the same directory? Otherwise the directory would be empty while we're > waiting for the next file to be ready to be archived. I think that's right. > I hate that that's > a thing but given teh serial nature of archiving, with high per-call > overhead, I don't think it'd be ok to just break that without a > replacement :(. I don't know quite what you mean by this. Moving .done files to a separate directory from .ready files could certainly be done and I don't think it even would be that hard. It does seem like a bit of a half measure though. If we're going to redesign this I think we ought to be more ambitious than that. > > But perhaps we could work around this by allowing pgarch.c to access > > shared memory, in which case it could examine the current timeline > > whenever it wants, and probably also whatever LSNs it needs to know > > what's safe to archive. > > FWIW, the shared memory stats patch implies doing that, since the > archiver reports stats. Are you planning to commit that for v15? If so, will it be early in the cycle, do you think? > What kind of shared memory mechanism are you thinking of? Due to > timelines and history files I don't think simple position counters would > be quite enough. I was thinking of simple position counters, but we could do something more sophisticated. I don't even care if we stick with .ready/.done for low-frequency stuff like timeline and history files. But I think we'd be better off avoiding it for WAL files, because there are just too many of them, and it's too hard to create a system that actually scales. 
Or else we need a way for a single .ready file to cover many WAL files in need of being archived, rather than just one. > I think there is no fundamental reason for avoiding shared memory in the > archiver. I guess there's a minor robustness advantage, because the > forked shell to start the archive command won't be attached to shared > memory. But that's only until the child exec()s to the archive command. That doesn't seem like a real issue because we're not running user-defined code between fork() and exec(). > There is some minor performance advantage as well, not having to process > the often large and contended memory mapping for shared_buffers is > probably measurable - but swamped by the cost of needing to actually > archive the segment. Process it how? Another option would be to have two processes. You could have one that stayed connected to shared memory and another that JUST ran the archive_command, and they could talk over a socket or something. But that would add a bunch of extra complexity, so I don't want to do it unless we actually need to do it. > My only "concern" with doing anything around this is that I think the > whole approach of archive_command is just hopelessly broken, with even > just halfway busy servers only able to keep up archiving if they muck > around with postgres internal data during archive command execution. Add > to that how hard it is to write a robust archive command (e.g. the one > in our docs still suggests test ! -f && cp, which means that copy > failing in the middle yields an incomplete archive)... > > While I don't think it's all that hard to design a replacement, it's > however likely still more work than addressing the O(n^2) issue, so ... I think it is probably a good idea to fix the O(n^2) issue first, and then as a separate step try to redefine things so that a decent archive command doesn't have to poke around as much at internal stuff. Part of that should probably involve having a way to pass a range of files to archive_command instead of a single file. I was also wondering whether we should go further and allow for the archiving to be performed by C code running inside the backend rather than shelling out to an external command. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, May 4, 2021 at 7:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote: > > On 2021-05-03 16:49:16 -0400, Robert Haas wrote: > > > I have two possible ideas for addressing this; perhaps other people > > > will have further suggestions. A relatively non-invasive fix would be > > > to teach pgarch.c how to increment a WAL file name. After archiving > > > segment N, check using stat() whether there's an .ready file for > > > segment N+1. If so, do that one next. If not, then fall back to > > > performing a full directory scan. > > > > Hm. I wonder if it'd not be better to determine multiple files to be > > archived in one readdir() pass? > > I think both methods have some merit. If we had a way to pass a range > of files to archive_command instead of just one, then your way is > distinctly better, and perhaps we should just go ahead and invent such > a thing. If not, your way doesn't entirely solve the O(n^2) problem, > since you have to choose some upper bound on the number of file names > you're willing to buffer in memory, but it may lower it enough that it > makes no practical difference. I am somewhat inclined to think that it > would be good to start with the method I'm proposing, since it is a > clear-cut improvement over what we have today and can be done with a > relatively limited amount of code change and no redesign, and then > perhaps do something more ambitious afterward. I agree that if we continue to archive one file using the archive command then Robert's solution of checking the existence of the next WAL segment (N+1) has an advantage. But, currently, if you notice pgarch_readyXlog always consider any history file as the oldest file but that will not be true if we try to predict the next WAL segment name. For example, if we have archived 000000010000000000000004 then next we will look for 000000010000000000000005 but after generating segment 000000010000000000000005, if there is a timeline switch then we will have the below files in the archive status (000000010000000000000005.ready, 00000002.history file). Now, the existing archiver will archive 00000002.history first whereas our code will archive 000000010000000000000005 first. Said that I don't see any problem with that because before archiving any segment file from TL 2 we will definitely archive the 00000002.history file because we will not find the 000000010000000000000006.ready and we will scan the full directory and now we will find 00000002.history as oldest file. > > > > However, that's still pretty wasteful. Every time we have to wait for > > > the next file to be ready for archiving, we'll basically fall back to > > > repeatedly scanning the whole directory, waiting for it to show up. Is this true? that only when we have to wait for the next file to be ready we go for scanning? If I read the code in "pgarch_ArchiverCopyLoop", for every single file to archive it is calling "pgarch_readyXlog", wherein it scans the directory every time. So I did not understand your point that only when it needs to wait for the next .ready file it need to scan the full directory. It appeared it always scans the full directory after archiving each WAL segment. What am I missing? > > Hm. That seems like it's only an issue because .done and .ready are in > > the same directory? Otherwise the directory would be empty while we're > > waiting for the next file to be ready to be archived. > > I think that's right. 
If we agree with your above point that it only needs to scan the full directory when it has to wait for the next file to be ready then making a separate directory for .done file can improve a lot because the directory will be empty so scanning will not be very costly. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I agree that if we continue to archive one file using the archive > command then Robert's solution of checking the existence of the next > WAL segment (N+1) has an advantage. But, currently, if you notice > pgarch_readyXlog always consider any history file as the oldest file > but that will not be true if we try to predict the next WAL segment > name. For example, if we have archived 000000010000000000000004 then > next we will look for 000000010000000000000005 but after generating > segment 000000010000000000000005, if there is a timeline switch then > we will have the below files in the archive status > (000000010000000000000005.ready, 00000002.history file). Now, the > existing archiver will archive 00000002.history first whereas our code > will archive 000000010000000000000005 first. Said that I don't see > any problem with that because before archiving any segment file from > TL 2 we will definitely archive the 00000002.history file because we > will not find the 000000010000000000000006.ready and we will scan the > full directory and now we will find 00000002.history as oldest file. OK, that makes sense and is good to know. > > > > However, that's still pretty wasteful. Every time we have to wait for > > > > the next file to be ready for archiving, we'll basically fall back to > > > > repeatedly scanning the whole directory, waiting for it to show up. > > Is this true? that only when we have to wait for the next file to be > ready we go for scanning? If I read the code in > "pgarch_ArchiverCopyLoop", for every single file to archive it is > calling "pgarch_readyXlog", wherein it scans the directory every time. > So I did not understand your point that only when it needs to wait for > the next .ready file it need to scan the full directory. It appeared > it always scans the full directory after archiving each WAL segment. > What am I missing? It's not true now, but my proposal would make it true. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, May 4, 2021 at 10:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Is this true? that only when we have to wait for the next file to be > > ready we go for scanning? If I read the code in > > "pgarch_ArchiverCopyLoop", for every single file to archive it is > > calling "pgarch_readyXlog", wherein it scans the directory every time. > > So I did not understand your point that only when it needs to wait for > > the next .ready file it need to scan the full directory. It appeared > > it always scans the full directory after archiving each WAL segment. > > What am I missing? > > It's not true now, but my proposal would make it true. Okay, got it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I agree that if we continue to archive one file using the archive > > command then Robert's solution of checking the existence of the next > > WAL segment (N+1) has an advantage. But, currently, if you notice > > pgarch_readyXlog always consider any history file as the oldest file > > but that will not be true if we try to predict the next WAL segment > > name. For example, if we have archived 000000010000000000000004 then > > next we will look for 000000010000000000000005 but after generating > > segment 000000010000000000000005, if there is a timeline switch then > > we will have the below files in the archive status > > (000000010000000000000005.ready, 00000002.history file). Now, the > > existing archiver will archive 00000002.history first whereas our code > > will archive 000000010000000000000005 first. Said that I don't see > > any problem with that because before archiving any segment file from > > TL 2 we will definitely archive the 00000002.history file because we > > will not find the 000000010000000000000006.ready and we will scan the > > full directory and now we will find 00000002.history as oldest file. > > OK, that makes sense and is good to know. I expect David will chime in on this thread too, but I did want to point out that when it comes to archiving history files you'd *really* like that to be done just about as quickly as absolutely possible, to avoid the case that we saw before that code was added, to wit: two promotions done too quickly that ended up with conflicting history and possibly conflicting WAL files trying to be archived, and ensuing madness. It's not just about making sure that we archive the history file for a timeline before archiving WAL segments along that timeline but also about making sure we get that history file into the archive as fast as we can, and archiving a 16MB WAL first would certainly delay that. Thanks, Stephen
On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote: > It's not just about making sure that we archive the history file for a > timeline before archiving WAL segments along that timeline but also > about making sure we get that history file into the archive as fast as > we can, and archiving a 16MB WAL first would certainly delay that. Ooph. That's a rather tough constraint. Could we get around it by introducing some kind of signalling mechanism, perhaps? Like if there's a new history file, that must mean the server has switched timelines -- I think, anyway -- so if we notified the archiver every time there was a timeline switch it could react accordingly. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote: > > It's not just about making sure that we archive the history file for a > > timeline before archiving WAL segments along that timeline but also > > about making sure we get that history file into the archive as fast as > > we can, and archiving a 16MB WAL first would certainly delay that. > > Ooph. That's a rather tough constraint. Could we get around it by > introducing some kind of signalling mechanism, perhaps? Like if > there's a new history file, that must mean the server has switched > timelines -- I think, anyway -- so if we notified the archiver every > time there was a timeline switch it could react accordingly. I would think something like that would be alright and not worse than what we've got now. That said, in an ideal world, we'd have a way to get the new timeline to switch to in a way that doesn't leave open race conditions, so as long as we're talking about big changes to the way archiving and archive_command work (or about throwing out the horrible idea that is archive_command in the first place and replacing it with appropriate hooks such that someone could install an extension which would handle archiving...), I would hope we'd have a way of saying "please, atomically, go get me a new timeline." Just as a reminder for those following along at home, as I'm sure you're already aware, the way we figure out what timeline to switch to when a replica is getting promoted is that we go run the restore command asking for history files until we get back "nope, there is no file named 0000123.history", and then we switch to that timeline and then try to push such a history file into the repo and hope that it works. Thanks, Stephen
On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote: > I would think something like that would be alright and not worse than > what we've got now. OK. > That said, in an ideal world, we'd have a way to get the new timeline to > switch to in a way that doesn't leave open race conditions, so as long > as we're talking about big changes to the way archiving and archive_command > work (or about throwing out the horrible idea that is archive_command in > the first place and replacing it with appropriate hooks such that > someone could install an extension which would handle archiving...), I > would hope we'd have a way of saying "please, atomically, go get me a new > timeline." > > Just as a reminder for those following along at home, as I'm sure you're > already aware, the way we figure out what timeline to switch to when a > replica is getting promoted is that we go run the restore command asking > for history files until we get back "nope, there is no file named > 0000123.history", and then we switch to that timeline and then try to > push such a history file into the repo and hope that it works. Huh, I had not thought about that problem. So, at the risk of getting sidetracked, what exactly are you asking for here? Let the extension pick the timeline using an algorithm of its own devising, rather than having core do it? Or what? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-05-05 16:13:08 -0400, Stephen Frost wrote: > Just as a reminder for those following along at home, as I'm sure you're > already aware, the way we figure out what timeline to switch to when a > replica is getting promoted is that we go run the restore command asking > for history files until we get back "nope, there is no file named > 0000123.history", and then we switch to that timeline and then try to > push such a history file into the repo and hope that it works. Which is why the whole concept of timelines as we have them right now is pretty much useless. It is fundamentally impossible to guarantee unique timeline ids in all cases if they are assigned sequentially at timeline creation - consider needing to promote a node on both ends of a split network. I'm quite doubtful that pretending to tackle this problem via archiving order is a good idea, given the fundamentally racy nature. Greetings, Andres Freund
Hi, On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > Huh, I had not thought about that problem. So, at the risk of getting > sidetracked, what exactly are you asking for here? Let the extension > pick the timeline using an algorithm of its own devising, rather than > having core do it? Or what? Not Stephen, but to me the most reasonable way to address this is to make timeline identifiers wider and randomly allocated. The sequential-looking nature of timelines imo is actively unhelpful. Greetings, Andres Freund
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote: > > That said, in an ideal world, we'd have a way to get the new timeline to > > switch to in a way that doesn't leave open race conditions, so as long > > as we're talking about big changes to the way archiving and archive_command > > work (or about throwing out the horrible idea that is archive_command in > > the first place and replacing it with appropriate hooks such that > > someone could install an extension which would handle archiving...), I > > would hope we'd have a way of saying "please, atomically, go get me a new > > timeline." > > > > Just as a reminder for those following along at home, as I'm sure you're > > already aware, the way we figure out what timeline to switch to when a > > replica is getting promoted is that we go run the restore command asking > > for history files until we get back "nope, there is no file named > > 0000123.history", and then we switch to that timeline and then try to > > push such a history file into the repo and hope that it works. > > Huh, I had not thought about that problem. So, at the risk of getting > sidetracked, what exactly are you asking for here? Let the extension > pick the timeline using an algorithm of its own devising, rather than > having core do it? Or what? Having the extension do it somehow is an interesting idea and one which might be kind of cool. The first thought I had was to make it archive_command's job to "pick" the timeline by just re-trying to push the .history file (the actual contents of it don't change, as the information in the file is about the timeline we are switching *from* and at what LSN). That requires an archive command which will fail if that file already exists though and, ideally, would perform the file archival in an atomic fashion (though this last bit isn't strictly necessary - anything along these lines would certainly be better than the current state). Having an entirely independent command/hook that's explicitly for this case would be another approach, of course, either in a manner that allows the extension to pick the destination timeline or is defined to be "return success only if the file is successfully archived, but do *not* overwrite any existing file of the same name and return an error instead." and then the same approach as outlined above. Thanks, Stephen
On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > > Huh, I had not thought about that problem. So, at the risk of getting > > sidetracked, what exactly are you asking for here? Let the extension > > pick the timeline using an algorithm of its own devising, rather than > > having core do it? Or what? > > Not Stephen, but to me the most reasonable way to address this is to > make timeline identifiers wider and randomly allocated. The sequential-looking > nature of timelines imo is actively unhelpful. Yeah, I always wondered why we didn't assign them randomly. -- Robert Haas EDB: http://www.enterprisedb.com
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote: > > On 2021-05-05 16:22:21 -0400, Robert Haas wrote: > > > Huh, I had not thought about that problem. So, at the risk of getting > > > sidetracked, what exactly are you asking for here? Let the extension > > > pick the timeline using an algorithm of its own devising, rather than > > > having core do it? Or what? > > > > Not Stephen, but to me the most reasonable way to address this is to > > make timeline identifiers wider and randomly allocated. The sequential-looking > > nature of timelines imo is actively unhelpful. > > Yeah, I always wondered why we didn't assign them randomly. Based on what we do today regarding the info we put into .history files, trying to figure out which is the "latest" timeline might be a bit tricky with randomly selected timelines. Maybe we could find a way to solve that though. I do note that this comment in timeline.c is, ahem, perhaps over-stating things a bit: * Note: while this is somewhat heuristic, it does positively guarantee * that (result + 1) is not a known timeline, and therefore it should * be safe to assign that ID to a new timeline. Thanks, Stephen
On Wed, May 5, 2021 at 4:53 PM Stephen Frost <sfrost@snowman.net> wrote: > I do note that this comment in timeline.c is, ahem, perhaps over-stating > things a bit: > > * Note: while this is somewhat heuristic, it does positively guarantee > * that (result + 1) is not a known timeline, and therefore it should > * be safe to assign that ID to a new timeline. OK, that made me laugh out loud. -- Robert Haas EDB: http://www.enterprisedb.com
At Tue, 4 May 2021 10:07:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in > On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote: > > On 2021-05-03 16:49:16 -0400, Robert Haas wrote: > > > But perhaps we could work around this by allowing pgarch.c to access > > > shared memory, in which case it could examine the current timeline > > > whenever it wants, and probably also whatever LSNs it needs to know > > > what's safe to archive. > > > > FWIW, the shared memory stats patch implies doing that, since the > > archiver reports stats. > > Are you planning to commit that for v15? If so, will it be early in > the cycle, do you think? FWIW It's already done for v14 individually. Author: Fujii Masao <fujii@postgresql.org> Date: Mon Mar 15 13:13:14 2021 +0900 Make archiver process an auxiliary process. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > FWIW It's already done for v14 individually. > > Author: Fujii Masao <fujii@postgresql.org> > Date: Mon Mar 15 13:13:14 2021 +0900 > > Make archiver process an auxiliary process. Oh, I hadn't noticed. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
How are you envisioning the shared-memory signaling should work in the original sample case, where the archiver had been failing for half a year? Or should we perhaps have a system table for ready-to-archive WAL files to get around the limitations of the file system to return just the needed files with ORDER BY ... LIMIT, as we already know how to make lookups in a database fast? Cheers Hannu On Thu, May 6, 2021 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > FWIW It's already done for v14 individually. > > > > Author: Fujii Masao <fujii@postgresql.org> > > Date: Mon Mar 15 13:13:14 2021 +0900 > > > > Make archiver process an auxiliary process. > > Oh, I hadn't noticed. Thanks. > > -- > Robert Haas > EDB: http://www.enterprisedb.com
Hi, On 2021-05-06 21:23:36 +0200, Hannu Krosing wrote: > How are you envisioning the shared-memory signaling should work in the > original sample case, where the archiver had been failing for half a > year? If we leave history files and gaps in the .ready sequence aside for a second, we really only need an LSN or segment number describing the current "archive position". Then we can iterate over the segments between the "archive position" and the flush position (which we already know). Even if we needed to keep statting .ready/.done files (to handle gaps due to archive command mucking around with .ready/done), it'd still be a lot cheaper than what we do today. It probably would even still be cheaper if we just statted all potentially relevant timeline history files all the time to send them first. > Or should we perhaps have a system table for ready-to-archive WAL > files to get around the limitations of the file system to return just the > needed files with ORDER BY ... LIMIT, as we already know how to make > lookups in a database fast? Archiving needs to work on a standby so that doesn't seem like an option. Regards, Andres Freund
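For illustration, the loop sketched in that paragraph might look roughly like this (a sketch only; it assumes GetFlushRecPtr() and the standard segment macros, reuses pgarch.c's existing pgarch_archiveXlog() helper, and ignores timelines and history files, which is exactly the part that still needs a design):

#include "postgres.h"

#include "access/xlog.h"			/* GetFlushRecPtr, wal_segment_size */
#include "access/xlog_internal.h"	/* XLByteToSeg, XLogFileName, MAXFNAMELEN */

/*
 * Sketch: archive every segment between the last archived position and the
 * current flush position, without consulting .ready files at all.  The
 * "archived up to here" counter would live in shared memory.
 */
static void
archive_up_to_flush(TimeLineID tli, XLogSegNo *archived_upto)
{
	XLogRecPtr	flushptr = GetFlushRecPtr();
	XLogSegNo	flush_segno;
	char		xlog[MAXFNAMELEN];

	XLByteToSeg(flushptr, flush_segno, wal_segment_size);

	/* Segments older than the one currently being written are complete. */
	while (*archived_upto + 1 < flush_segno)
	{
		XLogFileName(xlog, tli, *archived_upto + 1, wal_segment_size);
		if (!pgarch_archiveXlog(xlog))
			break;				/* retry later, keep the position unchanged */
		(*archived_upto)++;
	}
}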
Hi,

We have addressed the O(n^2) problem which involves directory scan for archiving individual WAL files by maintaining a WAL counter to identify the next WAL file in a sequence.

WAL archiver scans the status directory to identify the next WAL file which needs to be archived. This directory scan can be minimized by maintaining the log segment number of the current file which is being archived and incrementing it by '1' to get the next WAL file in a sequence. Archiver can check the availability of the next file in status directory and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file.

Please find attached patch v1.

Thanks,
Dipesh
On Tue, Jul 6, 2021 at 11:36 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > > Hi, > > We have addressed the O(n^2) problem which involves directory scan for > archiving individual WAL files by maintaining a WAL counter to identify > the next WAL file in a sequence. > > WAL archiver scans the status directory to identify the next WAL file > which needs to be archived. This directory scan can be minimized by > maintaining the log segment number of the current file which is being archived > and incrementing it by '1' to get the next WAL file in a sequence. Archiver > can check the availability of the next file in status directory and in case if the > file is not available then it should fall-back to directory scan to get the oldest > WAL file. > > Please find attached patch v1.

I have a few suggestions on the patch

1.
+
+ /*
+ * Found the oldest WAL, reset timeline ID and log segment number to generate
+ * the next WAL file in the sequence.
+ */
+ if (found && !historyFound)
+ {
+ XLogFromFileName(xlog, &curFileTLI, &nextLogSegNo, wal_segment_size);
+ ereport(LOG,
+ (errmsg("directory scan to archive write-ahead log file \"%s\"",
+ xlog)));
+ }

If a history file is found we are not updating curFileTLI and nextLogSegNo, so it will attempt the previously found segment. This is fine because it will not find that segment and it will rescan the directory. But I think we can do better, instead of searching the same old segment in the previous timeline we can search that old segment in the new TL so that if the TL switch happened within the segment then we will find the segment and we will avoid the directory search.

 /*
+ * Log segment number and timeline ID to get next WAL file in a sequence.
+ */
+static XLogSegNo nextLogSegNo = 0;
+static TimeLineID curFileTLI = 0;
+

So every time the archiver will start with searching segno=0 in timeline=0. Instead of doing this can't we first scan the directory and once we get the first segment to archive then only we can start predicting the next wal segment? I think there is nothing wrong even if we try to look for seg 0 in timeline 0 every time we start the archiver, but that will be true only once in the history of the cluster so why not skip this until we scan the directory once?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Greetings, * Dipesh Pandit (dipesh.pandit@gmail.com) wrote: > We have addressed the O(n^2) problem which involves directory scan for > archiving individual WAL files by maintaining a WAL counter to identify > the next WAL file in a sequence. This seems to have missed the concerns raised in https://postgr.es/m/20210505170601.GF20766@tamriel.snowman.net ..? And also the comments immediately above the ones being added here: > @@ -596,29 +606,55 @@ pgarch_archiveXlog(char *xlog) > * larger ID; the net result being that past timelines are given higher > * priority for archiving. This seems okay, or at least not obviously worth > * changing. > + * > + * WAL files are generated in a specific order of log segment number. The > + * directory scan for each WAL file can be minimized by identifying the next > + * WAL file in the sequence. This can be achieved by maintaining log segment > + * number and timeline ID corresponding to WAL file currently being archived. > + * The log segment number of current WAL file can be incremented by '1' upon > + * successful archival to point to the next WAL file. specifically about history files being given higher priority for archiving. If we go with this change then we'd at least want to rewrite or remove those comments, but I don't actually agree that we should remove that preference to archive history files ahead of WAL, for the reasons brought up previously. As was suggested on that subthread, it seems like it should be possible to just track the current timeline and adjust what we're doing if the timeline changes, and we should even know what the .history file is at that point and likely don't even need to scan the directory for it, as it'll be the old timeline ID. Thanks, Stephen
> specifically about history files being given higher priority for
> archiving. If we go with this change then we'd at least want to rewrite
> or remove those comments, but I don't actually agree that we should
> remove that preference to archive history files ahead of WAL, for the
> reasons brought up previously.

I agree, I missed this part. The .history file should be given higher preference. I will take care of it in the next patch.

> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.
+ * Log segment number already points to the next file in the sequence
+ * (as part of successful archival of the previous file). Generate the path
+ * for status file.
+ */
I think the name of the variable is appropriate here, but maybe we can reword
the comment something like:
+ /*
+ * We already have the next anticipated log segment number and the
+ * timeline, check if this WAL file is ready to be archived. If yes, skip
+ * the directory scan.
+ */
Archiver does not have access to shared memory and the current timeline ID is not available at archiver. In order to keep track of timeline switch we have to push a notification from backend to archiver. Backend can send a signal to notify archiver about the timeline change. Archiver can register this notification and perform a full directory scan to make sure that archiving history files take precedence over archiving WAL files.
> If a history file is found we are not updating curFileTLI and
> nextLogSegNo, so it will attempt the previously found segment. This
> is fine because it will not find that segment and it will rescan the
> directory. But I think we can do better, instead of searching the
> same old segment in the previous timeline we can search that old
> segment in the new TL so that if the TL switch happened within the
> segment then we will find the segment and we will avoid the directory
> search.

Considering archiving history files takes precedence over WAL files, we cannot update the "curFileTLI" whenever a history file is found.
> Instead of doing this can't we first scan the directory and once we
> get the first segment to archive then only we can start predicting the
> next wal segment?
> I think the name of the variable is appropriate here, but maybe we can reword
> the comment something like:
On Mon, Jul 19, 2021 at 5:43 PM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > > Hi, > > > I agree, I missed this part. The .history file should be given higher preference. > > I will take care of it in the next patch. > > Archiver does not have access to shared memory and the current timeline ID > is not available at archiver. In order to keep track of timeline switch we have > to push a notification from backend to archiver. Backend can send a signal > to notify archiver about the timeline change. Archiver can register this > notification and perform a full directory scan to make sure that archiving > history files take precedence over archiving WAL files. Yeah, that makes sense, some comments on v2. 1. +pgarch_timeline_switch(SIGNAL_ARGS) +{ + int save_errno = errno; + + /* Set the flag to register a timeline switch */ + timeline_switch = true; + SetLatch(MyLatch); + On the timeline switch, setting a flag should be enough, I don't think that we need to wake up the archiver. Because it will just waste the scan cycle. We have set the flag and that should be enough and let the XLogArchiveNotify() wake this up when something is ready to be archived and that time we will scan the directory first based on the flag. 2. + */ + if (XLogArchivingActive() && ArchiveRecoveryRequested) + XLogArchiveNotifyTLISwitch(); + + ..... /* + * Signal archiver to notify timeline switch + */ +void +XLogArchiveNotifyTLISwitch(void) +{ + if (IsUnderPostmaster) + PgArchNotifyTLISwitch(); +} Why do we need multi level interfaces? I mean instead of calling first XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch, can't we directly call PgArchNotifyTLISwitch()? 3. + if (timeline_switch) + { + /* Perform a full directory scan in next cycle */ + dirScan = true; + timeline_switch = false; + } I suggest you can add some comments atop this check. 4. +PgArchNotifyTLISwitch(void) +{ + int arch_pgprocno = PgArch->pgprocno; + + if (arch_pgprocno != INVALID_PGPROCNO) + { + int archiver_pid = ProcGlobal->allProcs[arch_pgprocno].pid; + + if (kill(archiver_pid, SIGINT) < 0) + elog(ERROR, "could not notify timeline change to archiver"); I think you should use %m in the error message so that it also prints the OS error code. 5. +/* Flag to specify a full directory scan to find next log file */ +static bool dirScan = true; Why is this a global variable? I mean whenever you enter the function pgarch_ArchiverCopyLoop(), this can be set to true and after that you can pass this as inout parameter to pgarch_readyXlog() there in it can be conditionally set to false once we get some segment and whenever the timeline switch we can set it back to the true. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
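Roughly what point 1 amounts to, as a sketch rather than the patch itself — the handler only records the event, and the next pass through pgarch_readyXlog() forces the full directory scan:

/* Sketch of point 1: remember the timeline switch, don't wake anyone up. */
static volatile sig_atomic_t timeline_switch_pending = false;

static void
pgarch_timeline_switch(SIGNAL_ARGS)
{
	/*
	 * No SetLatch() here; XLogArchiveNotify() will wake the archiver when
	 * the next segment is ready, and at that point the pending flag makes
	 * pgarch_readyXlog() do one full directory scan so the history file is
	 * picked up first.
	 */
	timeline_switch_pending = true;
}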
On Tue, Jul 6, 2021 at 9:34 AM Stephen Frost <sfrost@snowman.net> wrote: > As was suggested on that subthread, it seems like it should be possible > to just track the current timeline and adjust what we're doing if the > timeline changes, and we should even know what the .history file is at > that point and likely don't even need to scan the directory for it, as > it'll be the old timeline ID. I'm a little concerned that this might turn out to be more complicated than it's worth. It's not a case that should happen often, and if you handle it then you have to be careful to handle cases like two timeline switches in very rapid succession, which seems like it could be tricky. Maybe it's fine, though. I'm not really sure. -- Robert Haas EDB: http://www.enterprisedb.com
+ *
+ * "nextLogSegNo" identifies the next log file to be archived in a log
+ * sequence and the flag "dirScan" specifies a full directory scan to find
+ * the next log file.
IMHO, this comment should go atop of pgarch_readyXlog() as a description
of its parameters, and not in pgarch_ArchiverCopyLoop().
/*
+ * Interrupt handler for archiver
+ *
+ * There is a timeline switch and we have been notified by backend.
+ */
Instead, I would suggest having something like this:
+/*
+ * Interrupt handler for handling the timeline switch.
+ *
+ * A timeline switch has been notified, mark this event so that the next iteration
+ * of pgarch_ArchiverCopyLoop() archives the history file, and we set the
+ * timeline to the new one for the next anticipated log segment.
+ */
Regards,
Jeevan Ladhe
Hi,

> some comments on v2.

Thanks for your comments. I have incorporated the changes and updated a new patch. Please find the details below.

> On the timeline switch, setting a flag should be enough, I don't think
> that we need to wake up the archiver. Because it will just waste the
> scan cycle.

Yes, I modified it.

> Why do we need multi level interfaces? I mean instead of calling first
> XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
> can't we directly call PgArchNotifyTLISwitch()?

Yes, multilevel interfaces are not required. Removed extra interface.

> + if (timeline_switch)
> + {
> + /* Perform a full directory scan in next cycle */
> + dirScan = true;
> + timeline_switch = false;
> + }
> I suggest you can add some comments atop this check.

Added a comment to specify the action required in case of a timeline switch.

> I think you should use %m in the error message so that it also prints
> the OS error code.

Done.

> Why is this a global variable? I mean whenever you enter the function
> pgarch_ArchiverCopyLoop(), this can be set to true and after that you
> can pass this as inout parameter to pgarch_readyXlog() there in it can
> be conditionally set to false once we get some segment and whenever
> the timeline switch we can set it back to the true.

Yes, it is not necessary to have global scope for "dirScan". Changed the scope to local for "dirScan" and "nextLogSegNo".

PFA patch v3.

Thanks,
Dipesh
On 5/6/21, 1:01 PM, "Andres Freund" <andres@anarazel.de> wrote: > If we leave history files and gaps in the .ready sequence aside for a > second, we really only need an LSN or segment number describing the > current "archive position". Then we can iterate over the segments > between the "archive position" and the flush position (which we already > know). Even if we needed to keep statting .ready/.done files (to handle > gaps due to archive command mucking around with .ready/done), it'd still > be a lot cheaper than what we do today. It probably would even still be > cheaper if we just statted all potentially relevant timeline history > files all the time to send them first. My apologies for chiming in so late to this thread, but a similar idea crossed my mind while working on a bug where .ready files get created too early [0]. Specifically, instead of maintaining a status file per WAL segment, I was thinking we could narrow it down to a couple of files to keep track of the boundaries we care about: 1. earliest_done: the oldest segment that has been archived and can be recycled/removed 2. latest_done: the newest segment that has been archived 3. latest_ready: the newest segment that is ready for archival This might complicate matters for backup utilities that currently modify the .ready/.done files, but it would simplify this archive status stuff quite a bit and eliminate the need to worry about the directory scans in the first place. Nathan [0] https://www.postgresql.org/message-id/flat/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
On Fri, Jul 23, 2021 at 5:46 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> My apologies for chiming in so late to this thread, but a similar idea
> crossed my mind while working on a bug where .ready files get created
> too early [0]. Specifically, instead of maintaining a status file per
> WAL segment, I was thinking we could narrow it down to a couple of
> files to keep track of the boundaries we care about:
>
> 1. earliest_done: the oldest segment that has been archived and
>    can be recycled/removed
> 2. latest_done: the newest segment that has been archived
> 3. latest_ready: the newest segment that is ready for archival
>
> This might complicate matters for backup utilities that currently
> modify the .ready/.done files, but it would simplify this archive
> status stuff quite a bit and eliminate the need to worry about the
> directory scans in the first place.

In terms of immediate next steps, I think we should focus on
eliminating the O(n^2) problem and not get sucked into a bigger
redesign. The patch on the table aims to do just that much and I think
that's a good thing.

But in the longer term I agree that we want to redesign the signalling
somehow. I am not convinced that using a file is the right way to go.
If we had to rewrite that file for every change, and especially if we
had to fsync it, it would be almost as bad as what we're doing right
now in terms of the amount of traffic to the filesystem. Atomicity is a
problem too, because if we simply create a file then after a crash it
will either exist or not, but a file might end up garbled with a mix of
old and new contents unless we always write a temporary file and
atomically rename that over the existing one.

As I said in my original post, I'm kind of wondering about keeping the
information in shared memory instead of using the filesystem. I think
we would still need to persist it to disk at least occasionally but
perhaps there is a way to avoid having to do that as frequently as what
we do now. I haven't thought too deeply about what the requirements are
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 7/26/21, 6:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> In terms of immediate next steps, I think we should focus on
> eliminating the O(n^2) problem and not get sucked into a bigger
> redesign. The patch on the table aims to do just that much and I think
> that's a good thing.
I agree. I'll leave further discussion about a redesign for another
thread.
Nathan
On Tue, Jul 27, 2021 at 3:43 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> and updated a new patch. Please find the attached patch v4.

Some review:

 /*
+ * If archiver is active, send notification that timeline has switched.
+ */
+ if (XLogArchivingActive() && ArchiveRecoveryRequested &&
+ IsUnderPostmaster)
+ PgArchNotifyTLISwitch();

There are a few other places in xlog.c that are conditional on
XLogArchivingActive(), but none of them test ArchiveRecoveryRequested or
IsUnderPostmaster. It appears to me that PgArchStartupAllowed() controls
whether the archiver runs, and that's not contingent on
ArchiveRecoveryRequested and indeed couldn't be, since it's running in
the postmaster where that variable wouldn't be initialized. So why do we
care about ArchiveRecoveryRequested here? This is not entirely a
rhetorical question; maybe there's some reason we should care. If so,
the comment ought to mention it. If not, the test should go away.
IsUnderPostmaster does make a difference, but I think that test could be
placed inside PgArchNotifyTLISwitch() rather than putting it here in
StartupXLOG(). In fact, I think the test could be removed entirely,
since if PgArchNotifyTLISwitch() is called in single-user mode, it will
presumably just discover that arch_pgprocno == INVALID_PGPROCNO, so it
will simply do nothing even without the special-case code.

+ pqsignal(SIGINT, pgarch_timeline_switch);

I don't think it's great that we're using up SIGINT for this purpose.
There aren't that many signals available at the O/S level that we can
use for our purposes, and we generally try to multiplex them at the
application layer, e.g. by setting a latch or a flag in shared memory,
rather than using a separate signal. Can we do something of that sort
here? Or maybe we don't even need a signal. ThisTimeLineID is already
visible in shared memory, so why not just have the archiver check and
see whether it's changed, say via a new accessor function
GetCurrentTimeLineID()? I guess there could be a concern about the
expense of that, because we'd probably be taking a spinlock or an
lwlock for every cycle, but I don't think it's probably that bad,
because I doubt we can archive much more than a double-digit number of
files per second even with a very fast archive_command, and contention
on a lock generally requires a five digit number of acquisitions per
second. It would be worth testing to see if we can see a problem here,
but I'm fairly hopeful that it's not an issue. If we do feel that it's
important to avoid repeatedly taking a lock, let's see if we can find a
way to do it without dedicating a signal to this purpose.

+ *
+ * "nextLogSegNo" identifies the next log file to be archived in a log
+ * sequence and the flag "dirScan" specifies a full directory scan to find
+ * the next log file.
  */
- while (pgarch_readyXlog(xlog))
+ while (pgarch_readyXlog(xlog, &dirScan, &nextLogSegNo))

I do not like this very much. dirScan and nextLogSegNo aren't clearly
owned either by pgarch_ArchiverCopyLoop() or by pgarch_readyXlog(),
since both functions modify both variables, in each case conditionally,
while also relying on the way that the other function manipulates them.
Essentially these are global variables in disguise.
There's a third, related variable too, which is handled differently:

+ static TimeLineID curFileTLI = 0;

This is really the same kind of thing as the other two, but because
pgarch_readyXlog() happens not to need this one, you just made it static
inside pgarch_readyXlog() instead of passing it back and forth.

The problem with all this is that you can't understand either function
in isolation. Unless you read them both together and look at all of the
ways these three variables are manipulated, you can't really understand
the logic. And there's really no reason why that needs to be true. The
job of cleaning timeline_switch and setting dirScan could be done
entirely within pgarch_readyXlog(), and so could the job of incrementing
nextLogSegNo, because we're not going to again call pgarch_readyXlog()
unless archiving succeeded.

Also note that the TLI which is stored in curFileTLI corresponds to the
segment number stored in nextLogSegNo, yet one of them has "cur" for
"current" in the name and the other has "next". It would be easier to
read the code if the names were chosen more consistently.

My tentative idea as to how to clean this up is: declare a new struct
with a name like readyXlogState and members lastTLI and lastSegNo. Have
pgarch_ArchiverCopyLoop() declare a variable of this type, zero it, pass
it as a parameter to pgarch_readyXlog(), and otherwise leave it alone.
Then let pgarch_readyXlog() do all of the manipulation of the values
stored therein.

+ /*
+ * Fall-back to directory scan
+ *
+ * open xlog status directory and read through list of xlogs that have the
+ * .ready suffix, looking for earliest file. It is possible to optimise
+ * this code, though only a single file is expected on the vast majority
+ * of calls, so....
+ */

You've moved this comment from its original location, but the trouble is
that the comment is 100% false. In fact, the whole reason why you wrote
this patch is *because* this comment is 100% false. In fact it is not
difficult to create cases where each scan finds many files, and the
purpose of the patch is precisely to optimize the code that the person
who wrote this thought didn't need optimizing. Now it may take some work
to figure out what we want to say here exactly, but preserving the
comment as it's written here is certainly misleading.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
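For what it's worth, the shape being suggested here might look roughly like
the following sketch. It is illustrative only; the surrounding pgarch.c
details are elided and the struct name simply follows the suggestion above:

/* Sketch only: state owned and manipulated solely by pgarch_readyXlog(). */
typedef struct readyXlogState
{
    TimeLineID  lastTLI;    /* TLI of the segment most recently handed back */
    XLogSegNo   lastSegNo;  /* segment number most recently handed back */
} readyXlogState;

static bool pgarch_readyXlog(char *xlog, readyXlogState *state);

static void
pgarch_ArchiverCopyLoop(void)
{
    char            xlog[MAX_XFN_CHARS + 1];
    readyXlogState  state;

    memset(&state, 0, sizeof(state));   /* zero it once, then hand it off */

    while (pgarch_readyXlog(xlog, &state))
    {
        /*
         * Archive "xlog" here.  Only pgarch_readyXlog() decides whether to
         * advance state.lastSegNo or to fall back to a directory scan, so
         * the caller never touches the struct again.
         */
    }
}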
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now shared memory is not attached to the archiver. Archiver cannot
access ThisTimeLineID or a flag available in shared memory.
strcmp(argv[1], "--forkavlauncher") == 0 ||
strcmp(argv[1], "--forkavworker") == 0 ||
strcmp(argv[1], "--forkboot") == 0 ||
strncmp(argv[1], "--forkbgworker=", 15) == 0)
PGSharedMemoryReAttach();
else
PGSharedMemoryNoReAttach();
On Wed, Jul 28, 2021 at 6:48 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > As of now shared memory is not attached to the archiver. Archiver cannot > access ThisTimeLineID or a flag available in shared memory. If that is true, why are there functions PgArchShmemSize() and PgArchShmemInit(), and how does this statement in PgArchiverMain() manage not to core dump? /* * Advertise our pgprocno so that backends can use our latch to wake us up * while we're sleeping. */ PgArch->pgprocno = MyProc->pgprocno; I think what you are saying is true before v14, but not in v14 and master. -- Robert Haas EDB: http://www.enterprisedb.com
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expensive of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver specific shared memory.
If we switch to a new timeline then the backend process can update the new
timeline ID in shared memory. Archiver can keep a track of current timeline ID
and if it finds that there is a timeline switch then it can perform a full
directory scan to make sure that archiving history files takes precedence
over WAL files. Access to the shared memory area can be protected by adding
a WALArchiverLock. If we take this approach then it doesn't require to use a
dedicated signal to notify a timeline switch.
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of cleaning timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.
> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.
On Mon, Aug 2, 2021 at 9:06 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > We can maintain the current timeline ID in archiver specific shared memory. > If we switch to a new timeline then the backend process can update the new > timeline ID in shared memory. Archiver can keep a track of current timeline ID > and if it finds that there is a timeline switch then it can perform a full directory > scan to make sure that archiving history files takes precedence over WAL files. > Access to the shared memory area can be protected by adding a WALArchiverLock. > If we take this approach then it doesn't require to use a dedicated signal to notify > a timeline switch. Hi, I don't really understand why you are storing something in shared memory specifically for the archiver. Can't we use XLogCtl's ThisTimeLineID instead of storing another copy of the information? Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
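For illustration, the accessor discussed upthread is small; a sketch could
look like the following, assuming (as the example does) that XLogCtl's
existing info_lck spinlock guards the field:

/* Sketch only: read the shared timeline ID under XLogCtl's spinlock. */
TimeLineID
GetCurrentTimeLineID(void)
{
    TimeLineID  tli;

    SpinLockAcquire(&XLogCtl->info_lck);
    tli = XLogCtl->ThisTimeLineID;
    SpinLockRelease(&XLogCtl->info_lck);

    return tli;
}

The archiver could call this once per cycle and compare the result against
the TLI it last archived on, falling back to a directory scan whenever the
two differ.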
+ /*
+ * Perform a full directory scan to identify the next log segment. There
+ * may be one of the following scenarios which may require us to perform a
+ * full directory scan.
+ *
+ * 1. This is the first cycle since archiver has started and there is no
+ * idea about the next anticipated log segment.
+ *
+ * 2. There is a timeline switch, i.e. the timeline ID tracked at archiver
+ * does not match with current timeline ID. Archive history file as part of
+ * this timeline switch.
+ *
+ * 3. The next anticipated log segment is not available.

One benefit of the current implementation of pgarch_readyXlog() is that
.ready files created out of order will be prioritized before segments
with greater LSNs. IIUC, with this patch, as long as there is a "next
anticipated" segment available, the archiver won't go back and archive
segments it missed. I don't think the archive status files are regularly
created out of order, but XLogArchiveCheckDone() has handling for that
case, and the work to avoid creating .ready files too early [0] seems to
make it more likely. Perhaps we should also force a directory scan when
we detect that we are creating a .ready file for a segment that is older
than the "next anticipated" segment.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com
> I don't really understand why you are storing something in shared
> memory specifically for the archiver. Can't we use XLogCtl's
> ThisTimeLineID instead of storing another copy of the information?

Yes, we can avoid storing another copy of information. We can use
XLogCtl's ThisTimeLineID on Primary. However, XLogCtl's ThisTimeLineID
is not set to the current timeline ID on Standby server. Its value is
set to '0'. Can we use XLogCtl's replayEndTLI on the Standby server to
get the current timeline ID?
On Thu, Aug 5, 2021 at 7:39 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > Yes, we can avoid storing another copy of information. We can > use XLogCtl's ThisTimeLineID on Primary. However, > XLogCtl's ThisTimeLineID is not set to the current timeline ID on > Standby server. It's value is set to '0'. Can we use XLogCtl's > replayEndTLI on the Standby server to get the current timeline ID? I'm not sure. I think we need the value to be accurate during recovery, so I'm not sure whether replayEndTLI would get us there. Another approach might be to set ThisTimeLineID on standbys also. Actually just taking a fast look at the code I'm not quite sure why that isn't happening already. Do you have any understanding of that? -- Robert Haas EDB: http://www.enterprisedb.com
> I'm not sure. I think we need the value to be accurate during
> recovery, so I'm not sure whether replayEndTLI would get us there.
> Another approach might be to set ThisTimeLineID on standbys also.
> Actually just taking a fast look at the code I'm not quite sure why
> that isn't happening already. Do you have any understanding of that?

During investigation I found that the current timeline ID (ThisTimeLineID)
gets updated in XLogCtl's ThisTimeLineID once it gets finalised as part of
archive recovery.
/*
* Write the timeline history file, and have it archived. After this
* point (or rather, as soon as the file is archived), the timeline
* will appear as "taken" in the WAL archive and to any standby
* servers. If we crash before actually switching to the new
* timeline, standby servers will nevertheless think that we switched
* to the new timeline, and will try to connect to the new timeline.
* To minimize the window for that, try to do as little as possible
* between here and writing the end-of-recovery record.
*/
In case of Standby this happens only when it gets promoted.

If Standby is in recovery mode then replayEndTLI points to the most
recent TLI corresponding to the replayed records. Also, if replaying a
record causes a timeline switch then replayEndTLI gets updated with the
new timeline. As long as it is in recovery mode replayEndTLI should
point to the current timeline ID on Standby. Thoughts?
At Tue, 3 Aug 2021 20:46:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > + /* > + * Perform a full directory scan to identify the next log segment. There > + * may be one of the following scenarios which may require us to perform a > + * full directory scan. > + * > + * 1. This is the first cycle since archiver has started and there is no > + * idea about the next anticipated log segment. > + * > + * 2. There is a timeline switch, i.e. the timeline ID tracked at archiver > + * does not match with current timeline ID. Archive history file as part of > + * this timeline switch. > + * > + * 3. The next anticipated log segment is not available. > > One benefit of the current implementation of pgarch_readyXlog() is > that .ready files created out of order will be prioritized before > segments with greater LSNs. IIUC, with this patch, as long as there > is a "next anticipated" segment available, the archiver won't go back > and archive segments it missed. I don't think the archive status > files are regularly created out of order, but XLogArchiveCheckDone() > has handling for that case, and the work to avoid creating .ready > files too early [0] seems to make it more likely. Perhaps we should > also force a directory scan when we detect that we are creating a > .ready file for a segment that is older than the "next anticipated" > segment. > > Nathan > > [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com It works the current way always at the first iteration of pgarch_ArchiveCopyLoop() becuse in the last iteration of pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last anticipated segment. The shortcut works only when pgarch_ArchiveCopyLoop archives more than once successive segments at once. If the anticipated next segment found to be missing a .ready file while archiving multiple files, pgarch_readyXLog falls back to the regular way. So I don't see the danger to happen perhaps you are considering. In the first place, .ready are added while holding WALWriteLock in XLogWrite, and while removing old segments after a checkpoint (which happens while recovery). Assuming that no one manually remove .ready files on an active server, the former is the sole place doing that. So I don't see a chance that .ready files are created out-of-order way. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 8/5/21, 6:26 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > It works the current way always at the first iteration of > pgarch_ArchiveCopyLoop() becuse in the last iteration of > pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last > anticipated segment. The shortcut works only when > pgarch_ArchiveCopyLoop archives more than once successive segments at > once. If the anticipated next segment found to be missing a .ready > file while archiving multiple files, pgarch_readyXLog falls back to > the regular way. > > So I don't see the danger to happen perhaps you are considering. I think my concern is that there's no guarantee that we will ever do another directory scan. A server that's generating a lot of WAL could theoretically keep us in the next-anticipated-log code path indefinitely. > In the first place, .ready are added while holding WALWriteLock in > XLogWrite, and while removing old segments after a checkpoint (which > happens while recovery). Assuming that no one manually remove .ready > files on an active server, the former is the sole place doing that. So > I don't see a chance that .ready files are created out-of-order way. Perhaps a more convincing example is when XLogArchiveNotify() fails. AFAICT this can fail without ERROR-ing, in which case the server can continue writing WAL and creating .ready files for later segments. At some point, the checkpointer process will call RemoveOldXlogFiles() and try to create the missing .ready file. Nathan
At Thu, 5 Aug 2021 21:53:30 +0530, Dipesh Pandit <dipesh.pandit@gmail.com> wrote in > > I'm not sure. I think we need the value to be accurate during > > recovery, so I'm not sure whether replayEndTLI would get us there. > > Another approach might be to set ThisTimeLineID on standbys also. > > Actually just taking a fast look at the code I'm not quite sure why > > that isn't happening already. Do you have any understanding of that? > > During investigation I found that the current timeline ID (ThisTimeLineID) > gets updated in XLogCtl’s ThisTimeLineID once it gets finalised as part > of archive recovery. > > /* > * Write the timeline history file, and have it archived. After this > * point (or rather, as soon as the file is archived), the timeline > * will appear as "taken" in the WAL archive and to any standby > * servers. If we crash before actually switching to the new > * timeline, standby servers will nevertheless think that we > switched > * to the new timeline, and will try to connect to the new timeline. > * To minimize the window for that, try to do as little as possible > * between here and writing the end-of-recovery record. > */ > > In case of Standby this happens only when it gets promoted. > > If Standby is in recovery mode then replayEndTLI points to the most > recent TLI corresponding to the replayed records. Also, if replying a > record causes timeline switch then replayEndTLI gets updated with > the new timeline. As long as it is in recovery mode replayEndTLI should > point to the current timeline ID on Standby. Thoughts? As I mentioned in another branch of this thread, pgarch_readyXlog() always goes into the fall back path at the first iteration of pgarch_ArchiverCopyLoop() and the current (or expected) TLI is informed there. So no need of shared timeline ID at that time. When pgarch_ArchiverCopyLoop meets a timeline switch, the short cut path fails to find the next anticipated .ready file then goes into the fallback path, which should find the history file for the next TLI (unless any timing misalignment I'm not aware of happens). So the shared timeline id works only to let the fast path give way to the fall back path to find the just created history file as earlier as possible. Notifying the timeline ID that the startup process recognizes to archiver makes thing more complex than requied. Currently archiver doesn't use SIGINT, so I think we can use sigint for the purpose. Furthermore, it seems to me that we can make the TLI and the next anticipated segment number function-local static variables. It would be workable assuming that the only caller pgarch_ArchiverCopyLoop obeys the contract that it must call pgarch_readyXlog() until it returns false. However, there seems to be no reason for it not to work even otherwise, unless I'm missing something (that's likely), though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 6 Aug 2021 02:34:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 8/5/21, 6:26 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > > It works the current way always at the first iteration of > > pgarch_ArchiveCopyLoop() becuse in the last iteration of > > pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last > > anticipated segment. The shortcut works only when > > pgarch_ArchiveCopyLoop archives more than once successive segments at > > once. If the anticipated next segment found to be missing a .ready > > file while archiving multiple files, pgarch_readyXLog falls back to > > the regular way. > > > > So I don't see the danger to happen perhaps you are considering. > > I think my concern is that there's no guarantee that we will ever do > another directory scan. A server that's generating a lot of WAL could > theoretically keep us in the next-anticipated-log code path > indefinitely. Theoretically possible. Supposing that .ready may be created out-of-order (for the following reason, as a possibility), when once the fast path bailed out then the fallback path finds that the second oldest file has .ready, the succeeding fast path continues running leaving the oldest file. > > In the first place, .ready are added while holding WALWriteLock in > > XLogWrite, and while removing old segments after a checkpoint (which > > happens while recovery). Assuming that no one manually remove .ready > > files on an active server, the former is the sole place doing that. So > > I don't see a chance that .ready files are created out-of-order way. > > Perhaps a more convincing example is when XLogArchiveNotify() fails. > AFAICT this can fail without ERROR-ing, in which case the server can > continue writing WAL and creating .ready files for later segments. At > some point, the checkpointer process will call RemoveOldXlogFiles() > and try to create the missing .ready file. Mmm. Assuming that could happen, a history file gets cursed to lose a chance to be archived forever once that disaster falls onto it. Apart from this patch, maybe we need a measure to notify the history files that are once missed a chance. Assuming that all such forgotten files would be finally re-marked as .ready anywhere, they can be re-found by archiver by explicitly triggering the fallback path. Currently the trigger fires implicitly by checking shared timeline movement, but by causing the trigger by, for example by a signal as mentioned in a nearby message, that behavior would be easily to implement. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
The possible path that archiver can take for each cycle is either a fast
path or a fall-back path. The fast path involves checking availability of
the next anticipated log segment.

This can be achieved by maintaining a shared memory flag. If this flag is set
then archiver should take the fall-back path, otherwise it should continue
with the fast path.

This flag can be set by a backend in case an action like a timeline switch,
.ready files created out of order, etc. requires the archiver to perform a
full directory scan.
+ * This .ready file is created out of order, notify archiver to perform
+ * a full directory scan to archive corresponding WAL file.
+ */
+ StatusFilePath(archiveStatusPath, xlog, ".ready");
+ if (stat(archiveStatusPath, &stat_buf) == 0)
+ PgArchEnableDirScan();

We may want to call PgArchWakeup() after setting the flag.

+ * Perform a full directory scan to identify the next log segment. There
+ * may be one of the following scenarios which may require us to perform a
+ * full directory scan.
...
+ * - The next anticipated log segment is not available.

I wonder if we really need to perform a directory scan in this case.
Unless there are other cases where the .ready files are created out of
order, I think this just causes an unnecessary directory scan every time
the archiver catches up.

+ * Flag to enable/disable directory scan. If this flag is set then it
+ * forces archiver to perform a full directory scan to get the next log
+ * segment.
+ */
+ pg_atomic_flag dirScan;

I personally don't think it's necessary to use an atomic here. A
spinlock or LWLock would probably work just fine, as contention seems
unlikely. If we use a lock, we also don't have to worry about memory
barriers.

Nathan
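To make the lock-based alternative concrete, a rough sketch could look like
the snippet below. The spinlock field is an assumption of the example; only
PgArch, pgprocno, dirScan, and PgArchEnableDirScan() come from the patch
being reviewed here:

/* Sketch only: a plain bool guarded by a spinlock instead of pg_atomic_flag. */
typedef struct PgArchData
{
    int         pgprocno;   /* proc number of the archiver process */
    slock_t     arch_lck;   /* protects dirScan */
    bool        dirScan;    /* force a full directory scan next cycle */
} PgArchData;

static PgArchData *PgArch;  /* points into shared memory */

void
PgArchEnableDirScan(void)
{
    SpinLockAcquire(&PgArch->arch_lck);
    PgArch->dirScan = true;
    SpinLockRelease(&PgArch->arch_lck);
}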
On 8/15/21, 9:52 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > + * Perform a full directory scan to identify the next log segment. There > + * may be one of the following scenarios which may require us to perform a > + * full directory scan. > ... > + * - The next anticipated log segment is not available. > > I wonder if we really need to perform a directory scan in this case. > Unless there are other cases where the .ready files are created out of > order, I think this just causes an unnecessary directory scan every > time the archiver catches up. Thinking further, I suppose this is necessary for when lastSegNo gets reset after processing an out-of-order .ready file. Nathan
> + StatusFilePath(archiveStatusPath, xlog, ".ready");
> + if (stat(archiveStatusPath, &stat_buf) == 0)
> + PgArchEnableDirScan();
> We may want to call PgArchWakeup() after setting the flag.
> >
> > I wonder if we really need to perform a directory scan in this case.
> > Unless there are other cases where the .ready files are created out of
> > order, I think this just causes an unnecessary directory scan every
> > time the archiver catches up.
> Thinking further, I suppose this is necessary for when lastSegNo gets
> reset after processing an out-of-order .ready file.
> I personally don't think it's necessary to use an atomic here. A
> spinlock or LWLock would probably work just fine, as contention seems
> unlikely. If we use a lock, we also don't have to worry about memory
> barriers.
History file should be archived as soon as it gets created. The atomic flag
here will make sure that there is no reordering of read/write instructions
while accessing the flag in shared memory. Archiver needs to read this flag
at the beginning of each cycle. Write to atomic flag is synchronized and it
provides a lockless read. I think an atomic flag here is an efficient choice
unless I am missing something.
On 8/17/21, 5:53 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: >> I personally don't think it's necessary to use an atomic here. A >> spinlock or LWLock would probably work just fine, as contention seems >> unlikely. If we use a lock, we also don't have to worry about memory >> barriers. > > History file should be archived as soon as it gets created. The atomic flag > here will make sure that there is no reordering of read/write instructions while > accessing the flag in shared memory. Archiver needs to read this flag at the > beginning of each cycle. Write to atomic flag is synchronized and it provides > a lockless read. I think an atomic flag here is an efficient choice unless I am > missing something. Sorry, I think my note was not very clear. I agree that a flag should be used for this purpose, but I think we should just use a regular bool protected by a spinlock or LWLock instead of an atomic. The file atomics.h has the following note: * Use higher level functionality (lwlocks, spinlocks, heavyweight locks) * whenever possible. Writing correct code using these facilities is hard. IOW I don't think the extra complexity is necessary. From a performance standpoint, contention seems unlikely. We only need to read the flag roughly once per WAL segment, and we only ever set it in uncommon scenarios such as a timeline switch or the creation of an out-of-order .ready file. Nathan
On Tue, Aug 17, 2021 at 12:33 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Sorry, I think my note was not very clear. I agree that a flag should > be used for this purpose, but I think we should just use a regular > bool protected by a spinlock or LWLock instead of an atomic. The file > atomics.h has the following note: > > * Use higher level functionality (lwlocks, spinlocks, heavyweight locks) > * whenever possible. Writing correct code using these facilities is hard. > > IOW I don't think the extra complexity is necessary. From a > performance standpoint, contention seems unlikely. We only need to > read the flag roughly once per WAL segment, and we only ever set it in > uncommon scenarios such as a timeline switch or the creation of an > out-of-order .ready file. In the interest of full disclosure, I think that I was probably the one who suggested to Dipesh that he should look into using atomics, although I can't quite remember right now why I thought we might want to do that. I do not on general principle very much like code that does LWLockAcquire(whatever); exactly-one-assignment-statement-that-modifies-a-1-2-or-4-byte-quantity; LWLockRelease(whatever). If you had two assignments in there, then you know why you have a lock: it's to make those behave as an atomic, indivisible unit. But when you only have one, what are you protecting against? You're certainly not making anything atomic that would not have been anyway, so you must be using the LWLock as a memory barrier. But then you really kind of have to think about memory barriers anyway: why do you need one at all, and what things need to be separated? It's not clear that spelling pg_memory_barrier() as LWLockAcquire() and/or LWLockRelease() is actually saving you anything in terms of notional complexity. In this patch, it appears to me that the atomic flag is only ever being read unlocked, so I think that we're actually getting no benefit at all from the use of pg_atomic_flag here. We're not making anything atomic, because there's only one bit of shared state, and we're not getting any memory barrier semantics, because it looks to me like the flag is only ever tested using pg_atomic_unlocked_test_flag, which is documented not to have barrier semantics. So as far as I can see, there's no point in using either an LWLock or atomics here. We could just use bool with no lock and the code would do exactly what it does now. So I guess the question is whether that's correct or whether we need some kind of synchronization and, if so, of what sort. I can't actually see that there's any kind of hard synchronization requirement here at all. What we're trying to do is guarantee that if the timeline changes, we'll pick up the timeline history for the new timeline next, and that if files are archived out of order, we'll switch to archiving the oldest file that is now present rather than continuing with consecutive files. But suppose we just use an unsynchronized bool. The worst case is that we'll archive one extra file proceeding in order before we jump to the file that we were supposed to archive next. It's not evident to me that this is all that bad. The same thing would have happened if the previous file had been archived slightly faster than it actually was, so that we began archiving the next file just before, rather than just after, the notification was sent. And if it is bad, wrapping an LWLock around the accesses to the flag variable, or using an atomic, does nothing to stop it. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/17/21, 11:28 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > I can't actually see that there's any kind of hard synchronization > requirement here at all. What we're trying to do is guarantee that if > the timeline changes, we'll pick up the timeline history for the new > timeline next, and that if files are archived out of order, we'll > switch to archiving the oldest file that is now present rather than > continuing with consecutive files. But suppose we just use an > unsynchronized bool. The worst case is that we'll archive one extra > file proceeding in order before we jump to the file that we were > supposed to archive next. It's not evident to me that this is all that > bad. The same thing would have happened if the previous file had been > archived slightly faster than it actually was, so that we began > archiving the next file just before, rather than just after, the > notification was sent. And if it is bad, wrapping an LWLock around the > accesses to the flag variable, or using an atomic, does nothing to > stop it. I am inclined to agree. The archiver only ever reads the flag and sets it to false (if we are doing a directory scan). Others only ever set the flag to true. The only case I can think of where we might miss the timeline switch or out-of-order .ready file is when the archiver sets the flag to false and then ReadDir() fails. However, that seems to cause the archiver process to restart, and we always start with a directory scan at first. Nathan
On 8/17/21, 12:11 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > On 8/17/21, 11:28 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: >> I can't actually see that there's any kind of hard synchronization >> requirement here at all. What we're trying to do is guarantee that if >> the timeline changes, we'll pick up the timeline history for the new >> timeline next, and that if files are archived out of order, we'll >> switch to archiving the oldest file that is now present rather than >> continuing with consecutive files. But suppose we just use an >> unsynchronized bool. The worst case is that we'll archive one extra >> file proceeding in order before we jump to the file that we were >> supposed to archive next. It's not evident to me that this is all that >> bad. The same thing would have happened if the previous file had been >> archived slightly faster than it actually was, so that we began >> archiving the next file just before, rather than just after, the >> notification was sent. And if it is bad, wrapping an LWLock around the >> accesses to the flag variable, or using an atomic, does nothing to >> stop it. > > I am inclined to agree. The archiver only ever reads the flag and > sets it to false (if we are doing a directory scan). Others only ever > set the flag to true. The only case I can think of where we might > miss the timeline switch or out-of-order .ready file is when the > archiver sets the flag to false and then ReadDir() fails. However, > that seems to cause the archiver process to restart, and we always > start with a directory scan at first. Thinking further, I think the most important thing to ensure is that resetting the flag happens before we begin the directory scan. Consider the following scenario in which a timeline history file would potentially be lost: 1. Archiver completes directory scan. 2. A timeline history file is created and the flag is set. 3. Archiver resets the flag. I don't think there's any problem with the archiver reading a stale value for the flag. It should eventually be updated and route us to the directory scan code path. I'd also note that we're depending on the directory scan logic for picking up all timeline history files and out-of-order .ready files that may have been created each time the flag is set. AFAICT that is safe since we prioritize timeline history files and reset the archiver state anytime we do a directory scan. We'll first discover timeline history files via directory scans, and then we'll move on to .ready files, starting at the one with the lowest segment number. If a new timeline history file or out-of-order .ready file is created, the archiver is notified, and we start over. Nathan
On Tue, Aug 17, 2021 at 4:19 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Thinking further, I think the most important thing to ensure is that > resetting the flag happens before we begin the directory scan. > Consider the following scenario in which a timeline history file would > potentially be lost: > > 1. Archiver completes directory scan. > 2. A timeline history file is created and the flag is set. > 3. Archiver resets the flag. Dipesh says in his latest email that the archiver resets the flag just before it begins a directory scan. If that's accurate, then I think this sequence of events can't occur. If there is a race condition here with setting the flag, then an alternative design would be to use a counter - either a plain old uint64 or perhaps pg_atomic_uint64 - and have the startup process increment the counter when it wants to trigger a scan. In this design, the archiver would never modify the counter itself, but just remember the last value that it saw. If it later sees a different value it knows that a full scan is required. I think this kind of system is extremely robust against the general class of problems that you're talking about here, but I'm not sure whether we need it, because I'm not sure whether there is a race with just the bool. -- Robert Haas EDB: http://www.enterprisedb.com
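For illustration, the counter scheme described above could be sketched
roughly as follows; the struct and function names are made up for the
example:

/* Sketch only: a monotonically increasing "please rescan" counter. */
typedef struct PgArchSharedState
{
    pg_atomic_uint64 force_dir_scan;    /* bumped whenever a scan is wanted */
} PgArchSharedState;

static PgArchSharedState *PgArchShared;    /* set up at shmem init time */

/* Startup process (or any backend) requests a scan by bumping the counter. */
void
PgArchRequestDirScan(void)
{
    pg_atomic_fetch_add_u64(&PgArchShared->force_dir_scan, 1);
}

/* The archiver never modifies the counter; it only remembers what it saw. */
static bool
pgarch_dir_scan_requested(void)
{
    static uint64 last_seen = 0;
    uint64      current = pg_atomic_read_u64(&PgArchShared->force_dir_scan);

    if (current != last_seen)
    {
        last_seen = current;
        return true;
    }
    return false;
}

Because the counter only ever grows, a request can never be "unset" by the
archiver, which is what makes the design robust against the races discussed
here.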
Thanks for the new version of the patch. Overall, I think it is on the
right track.

+ /*
+ * This .ready file is created out of order, notify archiver to perform
+ * a full directory scan to archive corresponding WAL file.
+ */
+ StatusFilePath(archiveStatusPath, xlog, ".ready");
+ if (stat(archiveStatusPath, &stat_buf) == 0)
+ {
+ PgArchEnableDirScan();
+ PgArchWakeup();
+ }

Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
writeTimeLineHistoryFile() enable the directory scan instead? Else, we
have to exhaustively cover all such code paths, which may be difficult
to maintain. Another reason I am bringing this up is that my patch for
adjusting .ready file creation [0] introduces more opportunities for
.ready files to be created out-of-order.

+ /*
+ * This is a fall-back path, check if we are here due to the unavailability
+ * of next anticipated log segment or the archiver is being forced to
+ * perform a full directory scan. Reset the flag in shared memory only if
+ * it has been enabled to force a full directory scan and then proceed with
+ * directory scan.
+ */
+ if (PgArch->dirScan)
+ PgArch->dirScan = false;

Why do we need to check that the flag is set before we reset it? I think
we could just always reset it since we are about to do a directory scan
anyway.

On 8/18/21, 7:25 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Tue, Aug 17, 2021 at 4:19 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Thinking further, I think the most important thing to ensure is that
>> resetting the flag happens before we begin the directory scan.
>> Consider the following scenario in which a timeline history file would
>> potentially be lost:
>>
>> 1. Archiver completes directory scan.
>> 2. A timeline history file is created and the flag is set.
>> 3. Archiver resets the flag.
>
> Dipesh says in his latest email that the archiver resets the flag just
> before it begins a directory scan. If that's accurate, then I think
> this sequence of events can't occur.
>
> If there is a race condition here with setting the flag, then an
> alternative design would be to use a counter - either a plain old
> uint64 or perhaps pg_atomic_uint64 - and have the startup process
> increment the counter when it wants to trigger a scan. In this design,
> the archiver would never modify the counter itself, but just remember
> the last value that it saw. If it later sees a different value it
> knows that a full scan is required. I think this kind of system is
> extremely robust against the general class of problems that you're
> talking about here, but I'm not sure whether we need it, because I'm
> not sure whether there is a race with just the bool.

I'm not sure, either. Perhaps it would at least be worth adding a
pg_memory_barrier() after setting dirScan to false to avoid the scenario
I mentioned (which may or may not be possible). IMO this stuff would be
much easier to reason about if we used a lock instead, even if the
synchronization was not strictly necessary. However, I don't want to
hold this patch up too much on this point.

Nathan

[0] https://postgr.es/m/05AD5FE2-9A53-4D11-A3F8-3A83EBB0EB93%40amazon.com
> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
> writeTimeLineHistoryFile() enable the directory scan instead? Else,
> we have to exhaustively cover all such code paths, which may be
> difficult to maintain. Another reason I am bringing this up is that
> my patch for adjusting .ready file creation [0] introduces more
> opportunities for .ready files to be created out-of-order.

XLogArchiveNotify() notifies Archiver when a log segment is ready for
archival by creating a .ready file. This function is being called for each
log segment and placing a call to enable directory scan here will result
in directory scan for each log segment.
> + * This is a fall-back path, check if we are here due to the unavailability
> + * of next anticipated log segment or the archiver is being forced to
> + * perform a full directory scan. Reset the flag in shared memory only if
> + * it has been enabled to force a full directory scan and then proceed with
> + * directory scan.
> + */
> + if (PgArch->dirScan)
> + PgArch->dirScan = false;
> Why do we need to check that the flag is set before we reset it? I
> think we could just always reset it since we are about to do a
> directory scan anyway
> > If there is a race condition here with setting the flag, then an
> > alternative design would be to use a counter - either a plain old
> > uint64 or perhaps pg_atomic_uint64 - and have the startup process
> > increment the counter when it wants to trigger a scan. In this design,
> > the archiver would never modify the counter itself, but just remember
> > the last value that it saw. If it later sees a different value it
> > knows that a full scan is required. I think this kind of system is
> > extremely robust against the general class of problems that you're
> > talking about here, but I'm not sure whether we need it, because I'm
> > not sure whether there is a race with just the bool.
> I'm not sure, either. Perhaps it would at least be worth adding a
> pg_memory_barrier() after setting dirScan to false to avoid the
> scenario I mentioned (which may or may not be possible). IMO this
> stuff would be much easier to reason about if we used a lock instead,
> even if the synchronization was not strictly necessary. However, I
> don't want to hold this patch up too much on this point.

There is one possible scenario where it may run into a race condition. If
archiver has just finished archiving all .ready files and the next anticipated
log segment is not available then in this case archiver takes the fall-back
path to scan directory. It resets the flag before it begins directory scan.
Now, if a directory scan is enabled by a timeline switch or .ready file
created out of order in parallel to the event that the archiver resets the
flag then this might result in a race condition. But in this case also
archiver is eventually going to perform a directory scan and the desired file
will be archived as part of directory scan. Apart of this I can't think of
any other scenario which may result into a race condition unless I am missing
something.
On 8/19/21, 5:42 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: >> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and >> writeTimeLineHistoryFile() enable the directory scan instead? Else, >> we have to exhaustively cover all such code paths, which may be >> difficult to maintain. Another reason I am bringing this up is that >> my patch for adjusting .ready file creation [0] introduces more >> opportunities for .ready files to be created out-of-order. > > XLogArchiveNotify() notifies Archiver when a log segment is ready for > archival by creating a .ready file. This function is being called for each > log segment and placing a call to enable directory scan here will result > in directory scan for each log segment. Could we have XLogArchiveNotify() check the archiver state and only trigger a directory scan if we detect that we are creating an out-of- order .ready file? > There is one possible scenario where it may run into a race condition. If > archiver has just finished archiving all .ready files and the next anticipated > log segment is not available then in this case archiver takes the fall-back > path to scan directory. It resets the flag before it begins directory scan. > Now, if a directory scan is enabled by a timeline switch or .ready file created > out of order in parallel to the event that the archiver resets the flag then this > might result in a race condition. But in this case also archiver is eventually > going to perform a directory scan and the desired file will be archived as part > of directory scan. Apart of this I can't think of any other scenario which may > result into a race condition unless I am missing something. What do you think about adding an upper limit to the number of files we can archive before doing a directory scan? The more I think about the directory scan flag, the more I believe it is a best-effort tool that will remain prone to race conditions. If we have a guarantee that a directory scan will happen within the next N files, there's probably less pressure to make sure that it's 100% correct. On an unrelated note, do we need to add some extra handling for backup history files and partial WAL files? Nathan
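The upper-limit idea mentioned above is cheap to express; a sketch, with the
constant and function name invented purely for the example, could be as
simple as:

/*
 * Sketch only: never archive more than this many files via the fast path
 * before forcing one full scan of archive_status, so an out-of-order
 * .ready file cannot be starved indefinitely.
 */
#define FAST_PATH_FILES_PER_DIRECTORY_SCAN 1000

static bool
pgarch_periodic_scan_due(void)
{
    static int  files_since_dir_scan = 0;

    if (++files_since_dir_scan >= FAST_PATH_FILES_PER_DIRECTORY_SCAN)
    {
        files_since_dir_scan = 0;
        return true;            /* time to rescan archive_status */
    }
    return false;
}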
On 5/4/21, 7:07 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote: >> On 2021-05-03 16:49:16 -0400, Robert Haas wrote: >> > I have two possible ideas for addressing this; perhaps other people >> > will have further suggestions. A relatively non-invasive fix would be >> > to teach pgarch.c how to increment a WAL file name. After archiving >> > segment N, check using stat() whether there's an .ready file for >> > segment N+1. If so, do that one next. If not, then fall back to >> > performing a full directory scan. >> >> Hm. I wonder if it'd not be better to determine multiple files to be >> archived in one readdir() pass? > > I think both methods have some merit. If we had a way to pass a range > of files to archive_command instead of just one, then your way is > distinctly better, and perhaps we should just go ahead and invent such > a thing. If not, your way doesn't entirely solve the O(n^2) problem, > since you have to choose some upper bound on the number of file names > you're willing to buffer in memory, but it may lower it enough that it > makes no practical difference. I am somewhat inclined to think that it > would be good to start with the method I'm proposing, since it is a > clear-cut improvement over what we have today and can be done with a > relatively limited amount of code change and no redesign, and then > perhaps do something more ambitious afterward. I was curious about this, so I wrote a patch (attached) to store multiple files per directory scan and tested it against the latest patch in this thread (v9) [0]. Specifically, I set archive_command to 'false', created ~20K WAL segments, then restarted the server with archive_command set to 'true'. Both the v9 patch and the attached patch completed archiving all segments in just under a minute. (I tested the attached patch with NUM_FILES_PER_DIRECTORY_SCAN set to 64, 128, and 256 and didn't observe any significant difference.) The existing logic took over 4 minutes to complete. I'm hoping to do this test again with many more (100K+) status files, as I believe that the v9 patch will be faster at that scale, but I'm not sure how much faster it will be. Nathan [0] https://www.postgresql.org/message-id/attachment/125543/v9-0001-mitigate-directory-scan-for-WAL-archiver.patch
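The attachment itself isn't reproduced here, but the gist of the
multiple-files-per-scan approach can be sketched in standalone form as below.
The posted patch keeps a binary heap ordered lowest-priority-first; this
simplified version does a linear replacement instead and treats plain
lexicographic order as the archive priority, which ignores the special
handling of history files:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define NUM_FILES_PER_DIRECTORY_SCAN 64
#define MAX_ARCH_NAME 64

static char arch_files[NUM_FILES_PER_DIRECTORY_SCAN][MAX_ARCH_NAME];
static int  arch_files_count;

/* One readdir() pass that remembers up to N of the oldest ".ready" entries. */
static void
collect_ready_files(const char *status_dir)
{
    DIR        *dir = opendir(status_dir);
    struct dirent *de;

    arch_files_count = 0;
    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        size_t      len = strlen(de->d_name);
        char        seg[MAX_ARCH_NAME];

        if (len <= 6 || len - 6 >= MAX_ARCH_NAME ||
            strcmp(de->d_name + len - 6, ".ready") != 0)
            continue;

        /* strip the ".ready" suffix */
        snprintf(seg, sizeof(seg), "%.*s", (int) (len - 6), de->d_name);

        if (arch_files_count < NUM_FILES_PER_DIRECTORY_SCAN)
            strcpy(arch_files[arch_files_count++], seg);
        else
        {
            /* replace the newest remembered entry if this one is older */
            int         newest = 0;

            for (int i = 1; i < NUM_FILES_PER_DIRECTORY_SCAN; i++)
                if (strcmp(arch_files[i], arch_files[newest]) > 0)
                    newest = i;
            if (strcmp(seg, arch_files[newest]) < 0)
                strcpy(arch_files[newest], seg);
        }
    }
    closedir(dir);
}

The archiver would then archive the remembered names oldest-first and only
rescan once the array is exhausted, which is what amortizes the readdir()
cost across many segments.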
On 8/21/21, 9:29 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > I was curious about this, so I wrote a patch (attached) to store > multiple files per directory scan and tested it against the latest > patch in this thread (v9) [0]. Specifically, I set archive_command to > 'false', created ~20K WAL segments, then restarted the server with > archive_command set to 'true'. Both the v9 patch and the attached > patch completed archiving all segments in just under a minute. (I > tested the attached patch with NUM_FILES_PER_DIRECTORY_SCAN set to 64, > 128, and 256 and didn't observe any significant difference.) The > existing logic took over 4 minutes to complete. > > I'm hoping to do this test again with many more (100K+) status files, > as I believe that the v9 patch will be faster at that scale, but I'm > not sure how much faster it will be. I ran this again on a bigger machine with 200K WAL files pending archive. The v9 patch took ~5.5 minutes, the patch I sent took ~8 minutes, and the existing logic took just under 3 hours. Nathan
On Sun, Aug 22, 2021 at 10:31 PM Bossart, Nathan <bossartn@amazon.com> wrote: > I ran this again on a bigger machine with 200K WAL files pending > archive. The v9 patch took ~5.5 minutes, the patch I sent took ~8 > minutes, and the existing logic took just under 3 hours. Hmm. On the one hand, 8 minutes > 5.5 minutes, and presumably the gap would only get wider if the number of files were larger or if reading the directory were slower. I am pretty sure that reading the directory must be much slower in some real deployments where this problem has come up. On the other hand, 8.8 minutes << 3 hours, and your patch would win if somehow we had a ton of gaps in the sequence of files. I'm not sure how likely that is to be the cause - probably not very likely at all if you aren't using an archive command that cheats, but maybe really common if you are. Hmm, but I think if the archive_command cheats by marking a bunch of files done when it is tasked with archiving just one, your patch will break, because, unless I'm missing something, it doesn't re-evaluate whether things have changed on every pass through the loop as Dipesh's patch does. So I guess I'm not quite sure I understand why you think this might be the way to go? Maintaining the binary heap in lowest-priority-first order is very clever, and the patch does look quite elegant. I'm just not sure I understand the point. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/23/21, 6:42 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Sun, Aug 22, 2021 at 10:31 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> I ran this again on a bigger machine with 200K WAL files pending >> archive. The v9 patch took ~5.5 minutes, the patch I sent took ~8 >> minutes, and the existing logic took just under 3 hours. > > Hmm. On the one hand, 8 minutes > 5.5 minutes, and presumably the gap > would only get wider if the number of files were larger or if reading > the directory were slower. I am pretty sure that reading the directory > must be much slower in some real deployments where this problem has > come up. On the other hand, 8.8 minutes << 3 hours, and your patch > would win if somehow we had a ton of gaps in the sequence of files. > I'm not sure how likely that is to be the cause - probably not very > likely at all if you aren't using an archive command that cheats, but > maybe really common if you are. Hmm, but I think if the > archive_command cheats by marking a bunch of files done when it is > tasked with archiving just one, your patch will break, because, unless > I'm missing something, it doesn't re-evaluate whether things have > changed on every pass through the loop as Dipesh's patch does. So I > guess I'm not quite sure I understand why you think this might be the > way to go? To handle a "cheating" archive command, I'd probably need to add a stat() for every time pgarch_readyXLog() returned something from arch_files. I suspect something similar might be needed in Dipesh's patch to handle backup history files and partial WAL files. In any case, I think Dipesh's patch is the way to go. It obviously will perform better in the extreme cases discussed in this thread. I think it's important to make sure the patch doesn't potentially leave files behind to be picked up by a directory scan that might not happen, but there are likely ways to handle that. In the worst case, perhaps we need to force a directory scan every N files to make sure nothing gets left behind. But maybe we can do better. > Maintaining the binary heap in lowest-priority-first order is very > clever, and the patch does look quite elegant. I'm just not sure I > understand the point. This was mostly an exploratory exercise to get some numbers for the different approaches discussed in this thread. Nathan
On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote: > To handle a "cheating" archive command, I'd probably need to add a > stat() for every time pgarch_readyXLog() returned something from > arch_files. I suspect something similar might be needed in Dipesh's > patch to handle backup history files and partial WAL files. I think he's effectively got that already, although it's probably inside of pgarch_readyXLog(). The idea there is that instead of having a cache of files to be returned (as in your case) he just checks whether the next file in sequence happens to be present and if so returns that file name. To see whether it's present, he uses stat(). > In any case, I think Dipesh's patch is the way to go. It obviously > will perform better in the extreme cases discussed in this thread. I > think it's important to make sure the patch doesn't potentially leave > files behind to be picked up by a directory scan that might not > happen, but there are likely ways to handle that. In the worst case, > perhaps we need to force a directory scan every N files to make sure > nothing gets left behind. But maybe we can do better. It seems to me that we can handle that by just having the startup process notify the archiver every time some file is ready for archiving that's not the next one in the sequence. We have to make sure we cover all the relevant code paths, but that seems like it should be doable, and we have to decide on the synchronization details, but that also seems pretty manageable, even if we haven't totally got it sorted yet. The thing is, as soon as you go back to forcing a directory scan every N files, you've made it formally O(N^2) again, which might not matter in practice if the constant factor is low enough, but I don't think it will be. Either you force the scans every, say, 1000 files, in which case it's going to make the whole mechanism a lot less effective in terms of getting out from under problem cases -- or you force scans every, say, 1000000 files, in which case it's not really going to cause any missed files to get archived soon enough to make anyone happy. I doubt there is really a happy medium in there. I suppose the two approaches could be combined, too - remember the first N files you think you'll encounter and then after that try successive filenames until one is missing. That would be more resilient against O(N^2) behavior in the face of frequent gaps. But it might also be more engineering than is strictly required. -- Robert Haas EDB: http://www.enterprisedb.com
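For what it's worth, the combined approach could be sketched as follows;
every helper below is hypothetical and is declared only so the fragment is
self-contained, and timeline handling is deliberately omitted:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/* Hypothetical helpers, standing in for the real pgarch.c machinery. */
extern bool cached_pop(char *xlog);              /* names left over from the last scan */
extern bool ready_file_exists(XLogSegNo segno);  /* stat() the predicted .ready file */
extern void wal_file_name(char *xlog, XLogSegNo segno);
extern bool directory_scan(char *xlog);          /* full scan; refills the cache and
                                                  * updates last_segno as a side effect */

static XLogSegNo last_segno;

bool
next_file_to_archive(char *xlog)
{
    /* 1. Serve anything remembered from the previous directory scan. */
    if (cached_pop(xlog))
        return true;

    /*
     * 2. Then try the predicted successor, which stays cheap even when
     *    there are long gap-free runs of segments to archive.
     */
    if (last_segno != 0 && ready_file_exists(last_segno + 1))
    {
        last_segno++;
        wal_file_name(xlog, last_segno);
        return true;
    }

    /* 3. Only when both fail do we pay for another full scan. */
    return directory_scan(xlog);
}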
On 8/23/21, 10:49 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote: >> To handle a "cheating" archive command, I'd probably need to add a >> stat() for every time pgarch_readyXLog() returned something from >> arch_files. I suspect something similar might be needed in Dipesh's >> patch to handle backup history files and partial WAL files. > > I think he's effectively got that already, although it's probably > inside of pgarch_readyXLog(). The idea there is that instead of having > a cache of files to be returned (as in your case) he just checks > whether the next file in sequence happens to be present and if so > returns that file name. To see whether it's present, he uses stat(). IIUC partial WAL files are handled because the next file in the sequence with the given TimeLineID won't be there, so we will fall back to a directory scan and pick it up. Timeline history files are handled by forcing a directory scan, which should work because they always have the highest priority. Backup history files, however, do not seem to be handled. I think one approach to fixing that is to also treat backup history files similarly to timeline history files. If one is created, we force a directory scan, and the directory scan logic will consider backup history files as higher priority than everything but timeline history files. I've been looking at the v9 patch with fresh eyes, and I still think we should be able to force the directory scan as needed in XLogArchiveNotify(). Unless the file to archive is a regular WAL file that is > our stored location in archiver memory, we should force a directory scan. I think it needs to be > instead of >= because we don't know if the archiver has just completed a directory scan and found a later segment to use to update the archiver state (but hasn't yet updated the state in shared memory). Also, I think we need to make sure to set PgArch->dirScan back to true at the end of pgarch_readyXlog() unless we've found a new regular WAL file that we can use to reset the archiver's stored location. This ensures that we'll keep doing directory scans as long as there are timeline/backup history files to process. Nathan
At Tue, 24 Aug 2021 00:03:37 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 8/23/21, 10:49 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > > On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote: > >> To handle a "cheating" archive command, I'd probably need to add a > >> stat() for every time pgarch_readyXLog() returned something from > >> arch_files. I suspect something similar might be needed in Dipesh's > >> patch to handle backup history files and partial WAL files. > > > > I think he's effectively got that already, although it's probably > > inside of pgarch_readyXLog(). The idea there is that instead of having > > a cache of files to be returned (as in your case) he just checks > > whether the next file in sequence happens to be present and if so > > returns that file name. To see whether it's present, he uses stat(). > > IIUC partial WAL files are handled because the next file in the > sequence with the given TimeLineID won't be there, so we will fall > back to a directory scan and pick it up. Timeline history files are > handled by forcing a directory scan, which should work because they > always have the highest priority. Backup history files, however, do > not seem to be handled. I think one approach to fixing that is to > also treat backup history files similarly to timeline history files. > If one is created, we force a directory scan, and the directory scan > logic will consider backup history files as higher priority than > everything but timeline history files. Backup history files are (currently) just informational and they are finally processed at the end of a bulk-archiving performed by the fast path. However, I feel that it is cleaner to trigger a directory scan every time we add an other-than-a-regular-WAL-file, as base-backup or promotion are not supposed happen so infrequently. > I've been looking at the v9 patch with fresh eyes, and I still think > we should be able to force the directory scan as needed in > XLogArchiveNotify(). Unless the file to archive is a regular WAL file > that is > our stored location in archiver memory, we should force a > directory scan. I think it needs to be > instead of >= because we > don't know if the archiver has just completed a directory scan and > found a later segment to use to update the archiver state (but hasn't > yet updated the state in shared memory). I'm afraid that it can be seen as a violation of modularity. I feel that the wal-emitter side should not be aware of that detail of archiving. Instead, I would prefer to keep the directory scan as far as it found a smaller segment id than the next-expected segment id ever archived by the fast-path (if possible). This would be less performant in the case where out-of-order segments are frequent but I think the overall objective of the original patch will be kept. > Also, I think we need to make sure to set PgArch->dirScan back to true > at the end of pgarch_readyXlog() unless we've found a new regular WAL > file that we can use to reset the archiver's stored location. This > ensures that we'll keep doing directory scans as long as there are > timeline/backup history files to process. Right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
(sigh..) At Tue, 24 Aug 2021 11:35:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > IIUC partial WAL files are handled because the next file in the > > sequence with the given TimeLineID won't be there, so we will fall > > back to a directory scan and pick it up. Timeline history files are > > handled by forcing a directory scan, which should work because they > > always have the highest priority. Backup history files, however, do > > not seem to be handled. I think one approach to fixing that is to > > also treat backup history files similarly to timeline history files. > > If one is created, we force a directory scan, and the directory scan > > logic will consider backup history files as higher priority than > > everything but timeline history files. > > Backup history files are (currently) just informational and they are > finally processed at the end of a bulk-archiving performed by the fast > path. However, I feel that it is cleaner to trigger a directory scan > every time we add an other-than-a-regular-WAL-file, as base-backup or - promotion are not supposed happen so infrequently. + promotion are not supposed happen so frequently. -- Kyotaro Horiguchi NTT Open Source Software Center
[...] backup history file. Also, updated the archiver to prioritize archiving a backup history file over regular WAL files during a directory scan, to make sure that the backup history file gets archived before the directory scan gets disabled as part of archiving a regular WAL file.
On 8/24/21, 5:31 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: >> > I've been looking at the v9 patch with fresh eyes, and I still think >> > we should be able to force the directory scan as needed in >> > XLogArchiveNotify(). Unless the file to archive is a regular WAL file >> > that is > our stored location in archiver memory, we should force a >> > directory scan. I think it needs to be > instead of >= because we >> > don't know if the archiver has just completed a directory scan and >> > found a later segment to use to update the archiver state (but hasn't >> > yet updated the state in shared memory). >> >> I'm afraid that it can be seen as a violation of modularity. I feel >> that the wal-emitter side should not be aware of that detail of >> archiving. Instead, I would prefer to keep the directory scan as far as it >> found a smaller segment id than the next-expected segment id ever >> archived by the fast-path (if possible). This would be >> less performant in the case where out-of-order segments are frequent but I >> think the overall objective of the original patch will be kept. > > Archiver selects the file with the lowest segment number as part of directory > scan and the next segment number gets reset based on this file. It starts > a new sequence from here and checks the availability of the next file. If > there are holes then it will continue to fall back to directory scan. This will > continue until it finds the next sequence in order. I think this is already > handled unless I am missing something. I'm thinking of the following scenario: 1. Status file 2.ready is created. 2. Archiver finds 2.ready and uses it to update its state. 3. Status file 1.ready is created. At this point, the archiver will look for 3.ready next. If it finds 3.ready, it'll look for 4.ready. Let's say it keeps finding status files up until 1000000.ready. In this case, the archiver won't go back and archive segment 1 until we've archived ~1M files. I'll admit this is a contrived example, but I think it demonstrates how certain assumptions could fail with this approach. I think Horiguchi-san made a good point that the .ready file creators should ideally not need to understand archiving details. However, I think this approach requires them to be inextricably linked. In the happy case, the archiver will follow the simple path of processing each consecutive WAL file without incurring a directory scan. Any time there is something other than a regular WAL file to archive, we need to take special action to make sure it is picked up. This sort of problem doesn't really show up in the always-use-directory-scan approaches. If you imagine the .ready file creators as throwing status files over a fence at random times and in no particular order, directory scans are ideal because you are essentially starting with a clean slate each time. The logic to prioritize timeline history files is nice to have, but even if it wasn't there, the archiver would still pick it up eventually. IOW there's no situation (except perhaps infinite timeline history file generation) that puts us in danger of skipping files indefinitely. Even if we started creating a completely new type of status file, the directory scan approaches would probably work without any changes. Nathan
On Tue, Aug 24, 2021 at 1:26 PM Bossart, Nathan <bossartn@amazon.com> wrote: > I think Horiguchi-san made a good point that the .ready file creators > should ideally not need to understand archiving details. However, I > think this approach requires them to be inextricably linked. In the > happy case, the archiver will follow the simple path of processing > each consecutive WAL file without incurring a directory scan. Any > time there is something other than a regular WAL file to archive, we > need to take special action to make sure it is picked up. I think they should be inextricably linked, really. If we know something - like that there's a file ready to be archived - then it seems like we should not throw that information away and force somebody else to rediscover it through an expensive process. The whole problem here comes from the fact that we're using the filesystem as an IPC mechanism, and it's sometimes a very inefficient one. I can't quite decide whether the problems we're worrying about here are real issues or just kind of hypothetical. I mean, today, it seems to be possible that we fail to mark some file ready for archiving, emit a log message, and then a huge amount of time could go by before we try again to mark it ready for archiving. Are the problems we're talking about here objectively worse than that, or just different? Is it a problem in practice, or just in theory? I really want to avoid getting backed into a corner where we decide that the status quo is the best we can do, because I'm pretty sure that has to be the wrong conclusion. If we think that get-a-bunch-of-files-per-readdir approach is better than the keep-trying-the-next-file approach, I mean that's OK with me; I just want to do something about this. I am not sure whether or not that's the right course of action. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/24/21, 12:09 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > I can't quite decide whether the problems we're worrying about here > are real issues or just kind of hypothetical. I mean, today, it seems > to be possible that we fail to mark some file ready for archiving, > emit a log message, and then a huge amount of time could go by before > we try again to mark it ready for archiving. Are the problems we're > talking about here objectively worse than that, or just different? Is > it a problem in practice, or just in theory? If a .ready file is created out of order, the directory scan logic will pick it up about as soon as possible based on its priority. If the archiver is keeping up relatively well, there's a good chance such a file will have the highest archival priority and will be picked up the next time the archiver looks for a file to archive. With the patch proposed in this thread, an out-of-order .ready file has no such guarantee. As long as the archiver never has to fall back to a directory scan, it won't be archived. The proposed patch handles the case where RemoveOldXlogFiles() creates missing .ready files by forcing a directory scan, but I'm not sure this is enough. I think we have to check the archiver state each time we create a .ready file to see whether we're creating one out-of-order. While this may be an extremely rare problem in practice, archiving something after the next checkpoint completes seems better than never archiving it at all. IMO this isn't an area where there is much space to take risks. > I really want to avoid getting backed into a corner where we decide > that the status quo is the best we can do, because I'm pretty sure > that has to be the wrong conclusion. If we think that > get-a-bunch-of-files-per-readdir approach is better than the > keep-trying-the-next-file approach, I mean that's OK with me; I just > want to do something about this. I am not sure whether or not that's > the right course of action. I certainly think we can do better. The get-a-bunch-of-files-per-readdir approach can help us cut down on the directory scans by one or two orders of magnitude, which is still a huge win. Plus, such an approach retains much of the resilience of the current implementation (although there may be a bit more delay for the special cases). That being said, I still think the keep-trying-the-next-file approach is worth exploring, but I think it's really important to consider that there is no guarantee that a directory scan will happen anytime soon. Nathan
On 8/25/21, 4:11 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > Please find attached patch v11. Apologies for the delay. I still intend to review this. Nathan
On 8/25/21, 4:11 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > An alternate approach could be to force a directory scan at checkpoint to > break the infinite wait for a .ready file which is being missed due to the > fact that it is created out of order. This will make sure that the file > gets archived within the checkpoint boundaries. I think this is a good idea. > Please find attached patch v11. Thanks for the new version of the patch. + /* + * History files or a .ready file created out of order requires archiver to + * perform a full directory scan. + */ + if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) || + fileOutOfOrder) + PgArchEnableDirScan(); I think we should force a directory scan for everything that isn't a regular WAL file. IOW we can use !IsXLogFileName(xlog) instead of enumerating all the different kinds of files we might want to archive. + /* + * Segment number of the most recent .ready file found by archiver, + * protected by WALArchiveLock. + */ + XLogSegNo lastReadySegNo; } PgArchData; +/* + * Segment number and timeline ID to identify the next file in a WAL sequence + */ +typedef struct readyXLogState +{ + XLogSegNo lastSegNo; + TimeLineID lastTLI; +} readyXLogState; lastSegNo and lastReadySegNo appear to be the same thing. Couldn't we just use the value in PgArchData? + return (curSegNo < lastSegNo) ? true : false; I think this needs to be <=. If the two values are equal, pgarch_readyXlog() may have just completed a directory scan and might be just about to set PgArch->lastSegNo to a greater value. + LWLockAcquire(WALArchiveLock, LW_EXCLUSIVE); + PgArch->lastReadySegNo = segNo; + LWLockRelease(WALArchiveLock); IMO we should just use a spinlock instead of introducing a new LWLock. It looks like you really only need the lock for a couple of simple functions. I still think protecting PgArch->dirScan with a spinlock is a good idea, if for no other reason than it makes it easier to reason about this logic. + if (stat(xlogready, &st) == 0) I think we should ERROR if stat() fails for any other reason than ENOENT. + ishistory = IsTLHistoryFileName(basename) || + IsBackupHistoryFileName(basename); I suspect we still want to prioritize timeline history files over backup history files. TBH I find the logic below this point for prioritizing history files to be difficult to follow, and I think we should refactor it into some kind of archive priority comparator function. + /* + * Reset the flag only when we found a regular WAL file to make + * sure that we are done with processing history files. + */ + PgArch->dirScan = false; I think we really want to unset dirScan before we start the directory scan, and then we set it to true afterwards if we didn't find a regular WAL file. If someone requests a directory scan in the middle of an ongoing directory scan, we don't want to lose that request. I attached two patches that demonstrate what I'm thinking this change should look like. One is my take on the keep-trying-the-next-file approach, and the other is a new version of the multiple-files-per- readdir approach (with handling for "cheating" archive commands). I personally feel that the multiple-files-per-readdir approach winds up being a bit cleaner and more resilient than the keep-trying-the-next- file approach. However, the keep-trying-the-next-file approach will certainly be more efficient (especially for the extreme cases discussed in this thread), and I don't have any concrete concerns with this approach that seem impossible to handle. Nathan
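On the comparator idea: here is a standalone sketch of what an archive-priority comparator could look like. The classification is deliberately crude (plain suffix checks instead of IsTLHistoryFileName()/IsBackupHistoryFileName(), and the .ready suffix is assumed to be already stripped), so treat it as a shape for the refactoring, not proposed code.

/*
 * Sketch of an archive-priority comparator usable with qsort():
 * timeline history first, then backup history, then the oldest WAL.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
status_file_class(const char *name)
{
    size_t      len = strlen(name);

    if (len > 8 && strcmp(name + len - 8, ".history") == 0)
        return 0;               /* timeline history: highest priority */
    if (strstr(name, ".backup") != NULL)
        return 1;               /* backup history */
    return 2;                   /* regular WAL (and anything else) */
}

static int
archive_priority_cmp(const void *a, const void *b)
{
    const char *fa = *(const char *const *) a;
    const char *fb = *(const char *const *) b;
    int         ca = status_file_class(fa);
    int         cb = status_file_class(fb);

    if (ca != cb)
        return ca - cb;
    return strcmp(fa, fb);      /* within a class, oldest name first */
}

int
main(void)
{
    const char *files[] = {
        "000000010000000000000003",
        "000000010000000000000002.00000028.backup",
        "00000002.history",
        "000000010000000000000002",
    };
    size_t      n = sizeof(files) / sizeof(files[0]);

    qsort(files, n, sizeof(files[0]), archive_priority_cmp);
    for (size_t i = 0; i < n; i++)
        printf("%zu: %s\n", i + 1, files[i]);
    return 0;
}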
On 9/2/21, 6:22 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > I agree that multiple-files-pre-readdir is cleaner and has the resilience of the > current implementation. However, I have a few suggestion on keep-trying-the > -next-file approach patch shared in previous thread. Which approach do you think we should use? I think we have decent patches for both approaches at this point, so perhaps we should see if we can get some additional feedback from the community on which one we should pursue further. > + /* force directory scan the first time we call pgarch_readyXlog() */ > + PgArchForceDirScan(); > + > > We should not force a directory in pgarch_ArchiverCopyLoop(). This gets called > whenever archiver wakes up from the wait state. This will result in a > situation where the archiver performs a full directory scan despite having the > accurate information about the next anticipated log segment. > Instead we can check if lastSegNo is initialized and continue directory scan > until it gets initialized in pgarch_readyXlog(). The problem I see with this is that pgarch_archiveXlog() might end up failing. If it does, we won't retry archiving the file until we do a directory scan. I think we could try to avoid forcing a directory scan outside of these failure cases and archiver startup, but I'm not sure it's really worth it. When pgarch_readyXlog() returns false, it most likely means that there are no .ready files present, so I'm not sure we are gaining a whole lot by avoiding a directory scan in that case. I guess it might help a bit if there are a ton of .done files, though. > + return lastSegNo; > We should return "true" here. Oops. Good catch. > I am thinking if we can add a log message for files which are > archived as part of directory scan. This will be useful for diagnostic purpose > to check if desired files gets archived as part of directory scan in special > scenarios. I also think that we should add a few comments in pgarch_readyXlog(). I agree, but it should probably be something like DEBUG3 instead of LOG. > + /* > + * We must use <= because the archiver may have just completed a > + * directory scan and found a later segment (but hasn't updated > + * shared memory yet). > + */ > + if (this_segno <= arch_segno) > + PgArchForceDirScan(); > > I still think that we should use '<' operator here because > arch_segno represents the segment number of the most recent > .ready file found by the archiver. This gets updated in shared > memory only if archiver has successfully found a .ready file. > In a normal scenario this_segno will be greater than arch_segno > whereas in cases where a .ready file is created out of order > this_segno may be less than arch_segno. I am wondering > if there is a scenario where arch_segno is equal to this_segno > unless I am missing something. The pg_readyXlog() logic looks a bit like this: 1. Try to skip directory scan. If that succeeds, we're done. 2. Do a directory scan. 3. If we found a regular WAL file, update PgArch and return what we found. Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist yet. The following directory scan ends up finding 11.ready. Just before we update the PgArch state, XLogArchiveNotify() is called and creates 10.ready. However, pg_readyXlog() has already decided to return WAL segment 11 and update the state to look for 12 next. If we just used '<', we won't force a directory scan, and segment 10 will not be archived until the next one happens. If we use '<=', I don't think we have the same problem. 
I've also thought about another similar scenario. Let's say step 1 looks for WAL file 10, but it doesn't exist yet (just like the previous example). The following directory scan ends up finding 12.ready, but just before we update PgArch, we create 11.ready. In this case, we'll still skip forcing a directory scan until 10.ready is created later on. I believe it all eventually works out as long as we can safely assume that all files that should have .ready files will eventually get them. Nathan
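For readers trying to follow the race being discussed, here is a toy, single-process sketch of the pgarch_readyXlog() control flow, with an in-memory array standing in for archive_status and all names invented. It only demonstrates how an out-of-order .ready file has to wait until some later call falls back to a directory scan.

/*
 * Toy model of the readyXlog flow: a forced-scan flag, a fast path that
 * tries the expected next segment, and a fallback scan that resets the
 * expectation from whatever it finds.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogSegNo;

#define MAX_SEG 16
static bool ready[MAX_SEG];        /* ready[i]: "<i>.ready" exists */

static bool force_dir_scan = true; /* force a scan on the first call */
static XLogSegNo expected = 0;     /* segment the fast path will try */

/* Returns the segment to archive next, or 0 if nothing is ready. */
static XLogSegNo
readyXlog_sketch(void)
{
    XLogSegNo   seg = 0;

    /* 1. Fast path: skip the scan if the expected segment is ready. */
    if (!force_dir_scan && expected != 0 && ready[expected])
        seg = expected;
    else
    {
        /* 2. Directory scan: pick the oldest ready segment. */
        force_dir_scan = false;
        for (XLogSegNo i = 1; i < MAX_SEG; i++)
            if (ready[i]) { seg = i; break; }
    }

    /*
     * 3. Update the expectation.  A .ready file older than "seg" created
     * concurrently at this point would be skipped by the fast path until
     * something forces another scan, which is the hazard discussed above.
     */
    if (seg != 0)
    {
        expected = seg + 1;
        ready[seg] = false;        /* pretend we archived it */
    }
    return seg;
}

int
main(void)
{
    ready[2] = true;               /* 2.ready appears before 1.ready */
    printf("archived %llu\n", (unsigned long long) readyXlog_sketch());
    ready[1] = true;               /* 1.ready shows up late */
    ready[3] = true;
    printf("archived %llu\n", (unsigned long long) readyXlog_sketch());
    printf("archived %llu\n", (unsigned long long) readyXlog_sketch());
    return 0;
}

The run archives 2, then 3 via the fast path, and only picks up 1 once the fast path misses and falls back to a scan.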
At Fri, 3 Sep 2021 18:31:46 +0530, Dipesh Pandit <dipesh.pandit@gmail.com> wrote in > Hi, > > Thanks for the feedback. > > > Which approach do you think we should use? I think we have decent > > patches for both approaches at this point, so perhaps we should see if > > we can get some additional feedback from the community on which one we > > should pursue further. > > In my opinion both the approaches have benefits over current implementation. > I think in keep-trying-the-next-file approach we have handled all rare and > specific > scenarios which requires us to force a directory scan to archive the > desired files. > In addition to this with the recent change to force a directory scan at > checkpoint > we can avoid an infinite wait for a file which is still being missed out > despite > handling the special scenarios. It is also more efficient in extreme > scenarios > as discussed in this thread. However, multiple-files-per-readdir approach > is > cleaner with resilience of current implementation. > > I agree that we should decide on which approach to pursue further based on > additional feedback from the community. I was thinking that the multiple-files approach would work efficiently, but the patch still runs directory scans every 64 files. As Robert mentioned, it is still O(N^2). I'm not sure of the reason for the limit, but if it were to lower memory consumption or the cost to sort, we can resolve that issue by taking the trying-the-next approach, ignoring the case of having many gaps (discussed below). If it were to cause voluntary checking of out-of-order files, almost the same can be achieved by running directory scans every 64 files in the trying-the-next approach (and we would suffer O(N^2) again). On the other hand, if archiving is delayed by several segments, the multiple-files method might reduce the cost to scan the status directory, but it won't matter since the directory contains only several files. (I think that it might be better that we don't go to the trying-the-next path if we found only several files by running a directory scan.) The multiple-files approach reduces the number of directory scans if there were many gaps in the WAL file sequence. Although theoretically the last max_backend(+alpha?) segments could be written out of order, I suppose we only have gaps among the several latest files in reality. I'm not sure, though. In short, the trying-the-next approach seems to me to be the way to go, for the reason that it is simpler but it can cover the possible failures by almost the same measures as the multiple-files approach. > > The problem I see with this is that pgarch_archiveXlog() might end up > > failing. If it does, we won't retry archiving the file until we do a > > directory scan. I think we could try to avoid forcing a directory > > scan outside of these failure cases and archiver startup, but I'm not > > sure it's really worth it. When pgarch_readyXlog() returns false, it > > most likely means that there are no .ready files present, so I'm not > > sure we are gaining a whole lot by avoiding a directory scan in that > > case. I guess it might help a bit if there are a ton of .done files, > > though. > > Yes, I think it will be useful when we have a bunch of .done files and > the frequency of .ready files is such that the archiver goes to wait > state before the next WAL file is ready for archival. > > > I agree, but it should probably be something like DEBUG3 instead of > > LOG. > > I will update it in the next patch. regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
On 9/7/21, 1:42 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > I was thinking that the multiple-files approach would work efficiently > but the patch still runs directory scans every 64 files. As > Robert mentioned it is still O(N^2). I'm not sure of the reason for the > limit, but if it were to lower memory consumption or the cost to sort, > we can resolve that issue by taking the trying-the-next approach ignoring > the case of having many gaps (discussed below). If it were to cause > voluntary checking of out-of-order files, almost the same can be > achieved by running directory scans every 64 files in the > trying-the-next approach (and we would suffer O(N^2) again). On the > other hand, if archiving is delayed by several segments, the > multiple-files method might reduce the cost to scan the status > directory but it won't matter since the directory contains only > several files. (I think that it might be better that we don't go to > the trying-the-next path if we found only several files by running a > directory scan.) The multiple-files approach reduces the number of > directory scans if there were many gaps in the WAL file > sequence. Although theoretically the last max_backend(+alpha?) > segments could be written out of order, I suppose we only have > gaps among the several latest files in reality. I'm not sure, > though. > > In short, the trying-the-next approach seems to me to be the way to > go, for the reason that it is simpler but it can cover the possible > failures by almost the same measures as the multiple-files approach. Thanks for chiming in. The limit of 64 in the multiple-files-per-directory-scan approach was mostly arbitrary. My earlier testing [0] with different limits didn't reveal any significant difference, but using a higher limit might yield a small improvement when there are several hundred thousand .ready files. IMO increasing the limit isn't really worth it for this approach. For 500,000 .ready files, ordinarily you'd need 500,000 directory scans. When 64 files are archived for each directory scan, you need ~8,000 directory scans. With 128 files per directory scan, you need ~4,000. With 256, you need ~2000. The difference between 8,000 directory scans and 500,000 is quite significant. The difference between 2,000 and 8,000 isn't nearly as significant in comparison. Nathan [0] https://www.postgresql.org/message-id/3ECC212F-88FD-4FB2-BAF1-C2DD1563E310%40amazon.com
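For reference, the arithmetic behind those estimates is just a ceiling division over the batch size; a trivial check:

/* Directory scans needed to drain a backlog at a fixed batch size. */
#include <stdio.h>

int
main(void)
{
    const long  backlog = 500000;
    const long  batch_sizes[] = {1, 64, 128, 256};

    for (int i = 0; i < 4; i++)
    {
        long        b = batch_sizes[i];
        long        scans = (backlog + b - 1) / b;   /* ceiling division */

        printf("batch %4ld -> %ld directory scans\n", b, scans);
    }
    return 0;
}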
On Tue, Sep 7, 2021 at 1:28 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Thanks for chiming in. The limit of 64 in the multiple-files-per-directory-scan > approach was mostly arbitrary. My earlier testing [0] > with different limits didn't reveal any significant difference, but > using a higher limit might yield a small improvement when there are > several hundred thousand .ready files. IMO increasing the limit isn't > really worth it for this approach. For 500,000 .ready files, > ordinarily you'd need 500,000 directory scans. When 64 files are > archived for each directory scan, you need ~8,000 directory scans. > With 128 files per directory scan, you need ~4,000. With 256, you > need ~2000. The difference between 8,000 directory scans and 500,000 > is quite significant. The difference between 2,000 and 8,000 isn't > nearly as significant in comparison. That's certainly true. I guess what I don't understand about the multiple-files-per-directory-scan implementation is what happens when something happens that would require the keep-trying-the-next-file approach to perform a forced scan. It seems to me that you still need to force an immediate full scan, because if the idea is that you want to, say, prioritize archiving of new timeline files over any others, a cached list of files that you should archive next doesn't accomplish that, just like keeping on trying the next file in sequence doesn't accomplish that. So I'm wondering if in the end the two approaches converge somewhat, so that with either patch you get (1) some kind of optimization to scan the directory less often, plus (2) some kind of notification mechanism to tell you when you need to avoid applying that optimization. If you wanted to, (1) could even include both batching and then, when the batch is exhausted, trying files in sequence. I'm not saying that's the way to go, but you could. In the end, it seems less important that we do any particular thing here and more important that we do something - but if prioritizing timeline history files is important, then we have to preserve that behavior. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/7/21, 10:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > I guess what I don't understand about the multiple-files-per-dirctory > scan implementation is what happens when something happens that would > require the keep-trying-the-next-file approach to perform a forced > scan. It seems to me that you still need to force an immediate full > scan, because if the idea is that you want to, say, prioritize > archiving of new timeline files over any others, a cached list of > files that you should archive next doesn't accomplish that, just like > keeping on trying the next file in sequence doesn't accomplish that. Right. The latest patch for that approach [0] does just that. In fact, I think timeline files are the only files for which we need to force an immediate directory scan in the multiple-files-per-scan approach. For the keep-trying-the-next-file approach, we have to force a directory scan for anything but a regular WAL file that is ahead of our archiver state. > So I'm wondering if in the end the two approaches converge somewhat, > so that with either patch you get (1) some kind of optimization to > scan the directory less often, plus (2) some kind of notification > mechanism to tell you when you need to avoid applying that > optimization. If you wanted to, (1) could even include both batching > and then, when the batch is exhausted, trying files in sequence. I'm > not saying that's the way to go, but you could. In the end, it seems > less important that we do any particular thing here and more important > that we do something - but if prioritizing timeline history files is > important, then we have to preserve that behavior. Yeah, I would agree that the approaches basically converge into some form of "do fewer directory scans." Nathan [0] https://www.postgresql.org/message-id/attachment/125980/0001-Improve-performance-of-pgarch_readyXlog-with-many-st.patch
On Tue, Sep 7, 2021 at 2:13 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Right. The latest patch for that approach [0] does just that. In > fact, I think timeline files are the only files for which we need to > force an immediate directory scan in the multiple-files-per-scan > approach. For the keep-trying-the-next-file approach, we have to > force a directory scan for anything but a regular WAL file that is > ahead of our archiver state. Yeah, that makes sense. > Yeah, I would agree that the approaches basically converge into some > form of "do fewer directory scans." I guess we still have to pick one or the other, but I don't really know how to do that, since both methods seem to be relatively fine, and the scenarios where one is better than the other all feel a little bit contrived. I guess if no clear consensus emerges in the next week or so, I'll just pick one and commit it. Not quite sure yet how I'll do the picking, but we seem to all agree that something is better than nothing, so hopefully nobody will be too sad if I make an arbitrary decision. And if some clear agreement emerges before then, even better. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/7/21, 11:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > I guess we still have to pick one or the other, but I don't really > know how to do that, since both methods seem to be relatively fine, > and the scenarios where one is better than the other all feel a little > bit contrived. I guess if no clear consensus emerges in the next week > or so, I'll just pick one and commit it. Not quite sure yet how I'll > do the picking, but we seem to all agree that something is better than > nothing, so hopefully nobody will be too sad if I make an arbitrary > decision. And if some clear agreement emerges before then, even > better. I will be happy to see this fixed either way. Nathan
At Tue, 7 Sep 2021 18:40:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 9/7/21, 11:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > > I guess we still have to pick one or the other, but I don't really > > know how to do that, since both methods seem to be relatively fine, > > and the scenarios where one is better than the other all feel a little > > bit contrived. I guess if no clear consensus emerges in the next week > > or so, I'll just pick one and commit it. Not quite sure yet how I'll > > do the picking, but we seem to all agree that something is better than > > nothing, so hopefully nobody will be too sad if I make an arbitrary > > decision. And if some clear agreement emerges before then, even > > better. > > I will be happy to see this fixed either way. +1. I agree about the estimation on performance gain to Nathan. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 9/8/21, 10:49 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > Updated log level to DEBUG3 and rebased the patch. PFA patch. Thanks for the new patch. + * by checking the availability of next WAL file. "xlogState" specifies the + * segment number and timeline ID corresponding to the next WAL file. "xlogState" probably needs to be updated here. As noted before [0], I think we need to force a directory scan at the beginning of pgarch_MainLoop() and when pgarch_ArchiverCopyLoop() returns before we exit the "while" loop. Else, there's probably a risk that we skip archiving a file until the next directory scan. IMO forcing a directory scan at the beginning of pgarch_ArchiverCopyLoop() is a simpler way to do roughly the same thing. I'm skeptical that persisting the next-anticipated state between calls to pgarch_ArchiverCopyLoop() is worth the complexity. Nathan [0] https://www.postgresql.org/message-id/AC78607B-9DA6-41F4-B253-840D3DD964BF%40amazon.com
On Thu, Sep 2, 2021 at 5:52 PM Bossart, Nathan <bossartn@amazon.com> wrote: > The pg_readyXlog() logic looks a bit like this: > > 1. Try to skip directory scan. If that succeeds, we're done. > 2. Do a directory scan. > 3. If we found a regular WAL file, update PgArch and return > what we found. > > Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist > yet. The following directory scan ends up finding 11.ready. Just > before we update the PgArch state, XLogArchiveNotify() is called and > creates 10.ready. However, pg_readyXlog() has already decided to > return WAL segment 11 and update the state to look for 12 next. If we > just used '<', we won't force a directory scan, and segment 10 will > not be archived until the next one happens. If we use '<=', I don't > think we have the same problem. The latest post on this thread contained a link to this one, and it made me want to rewind to this point in the discussion. Suppose we have the following alternative scenario: Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist yet. The following directory scan ends up finding 12.ready. Just before we update the PgArch state, XLogArchiveNotify() is called and creates 11.ready. However, pg_readyXlog() has already decided to return WAL segment 12 and update the state to look for 13 next. Now, if I'm not mistaken, using <= doesn't help at all. In my opinion, the problem here is that the natural way to ask "is this file being archived out of order?" is to ask yourself "is the file that I'm marking as ready for archiving now the one that immediately follows the last one I marked as ready for archiving?" and then invert the result. That is, if I last marked 10 as ready, and now I'm marking 11 as ready, then it's in order, but if I'm now marking anything else whatsoever, then it's out of order. But that's not what this does. Instead of comparing what it's doing now to what it did last, it compares what it did now to what the archiver did last. And it's really not obvious that that's correct. I think that the above argument actually demonstrates a flaw in the logic, but even if not, or even if it's too small a flaw to be a problem in practice, it seems a lot harder to reason about. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/13/21, 1:14 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Thu, Sep 2, 2021 at 5:52 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist >> yet. The following directory scan ends up finding 11.ready. Just >> before we update the PgArch state, XLogArchiveNotify() is called and >> creates 10.ready. However, pg_readyXlog() has already decided to >> return WAL segment 11 and update the state to look for 12 next. If we >> just used '<', we won't force a directory scan, and segment 10 will >> not be archived until the next one happens. If we use '<=', I don't >> think we have the same problem. > > The latest post on this thread contained a link to this one, and it > made me want to rewind to this point in the discussion. Suppose we > have the following alternative scenario: > > Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist > yet. The following directory scan ends up finding 12.ready. Just > before we update the PgArch state, XLogArchiveNotify() is called and > creates 11.ready. However, pg_readyXlog() has already decided to > return WAL segment 12 and update the state to look for 13 next. > > Now, if I'm not mistaken, using <= doesn't help at all. I think this is the scenario I was trying to touch on in the paragraph immediately following the one you mentioned. My theory was that we'll still skip forcing a directory scan until 10.ready is created, so it would eventually work out as long as we can safely assume that all .ready files that should be created eventually will be. Thinking further, I don't think that's right. We might've already renamed 10.ready to 10.done and removed it long ago, so there's a chance that we wouldn't go back and pick up 11.ready until one of our "fallback" directory scans forced by the checkpointer. So, yes, I think you are right. > In my opinion, the problem here is that the natural way to ask "is > this file being archived out of order?" is to ask yourself "is the > file that I'm marking as ready for archiving now the one that > immediately follows the last one I marked as ready for archiving?" and > then invert the result. That is, if I last marked 10 as ready, and now > I'm marking 11 as ready, then it's in order, but if I'm now marking > anything else whatsoever, then it's out of order. But that's not what > this does. Instead of comparing what it's doing now to what it did > last, it compares what it did now to what the archiver did last. > > And it's really not obvious that that's correct. I think that the > above argument actually demonstrates a flaw in the logic, but even if > not, or even if it's too small a flaw to be a problem in practice, it > seems a lot harder to reason about. I certainly agree that it's harder to reason about. If we were to go the keep-trying-the-next-file route, we could probably minimize a lot of the handling for these rare cases by banking on the "fallback" directory scans. Provided we believe these situations are extremely rare, some extra delay for an archive every once in a while might be acceptable. Nathan
On 9/14/21, 7:23 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > I agree that when we are creating a .ready file we should compare > the current .ready file with the last .ready file to check if this file is > created out of order. We can store the state of the last .ready file > in shared memory and compare it with the current .ready file. I > believe that archiver specific shared memory area can be used > to store the state of the last .ready file unless I am missing > something and this needs to be stored in a separate shared > memory area. > > With this change, we have the flexibility to move the current archiver > state out of shared memory and keep it local to archiver. I have > incorporated these changes and updated a new patch. I wonder if this can be simplified even further. If we don't bother trying to catch out-of-order .ready files in XLogArchiveNotify() and just depend on the per-checkpoint/restartpoint directory scans, we can probably remove lastReadySegNo from archiver state completely. + /* Force a directory scan if we are archiving anything but a regular + * WAL file or if this WAL file is being created out-of-order. + */ + if (!IsXLogFileName(xlog)) + PgArchForceDirScan(); + else + { + TimeLineID tli; + XLogSegNo last_segno; + XLogSegNo this_segno; + + last_segno = PgArchGetLastReadySegNo(); + XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size); + + /* + * Force a directory scan in case if this .ready file created out of + * order. + */ + last_segno++; + if (last_segno != this_segno) + PgArchForceDirScan(); + + PgArchSetLastReadySegNo(this_segno); + } This is an interesting idea, but the "else" block here seems prone to race conditions. I think we'd have to hold arch_lck to prevent that. But as I mentioned above, if we are okay with depending on the fallback directory scans, I think we can remove the "else" block completely. + /* Initialize the current state of archiver */ + xlogState.lastSegNo = MaxXLogSegNo; + xlogState.lastTli = MaxTimeLineID; It looks like we have two ways to force a directory scan. We can either set force_dir_scan to true, or lastSegNo can be set to MaxXLogSegNo. Why not just set force_dir_scan to true here so that we only have one way to force a directory scan? + /* + * Failed to archive, make sure that archiver performs a + * full directory scan in the next cycle to avoid missing + * the WAL file which could not be archived due to some + * failure in current cycle. + */ + PgArchForceDirScan(); Don't we also need to force a directory scan in the other cases we return early from pgarch_ArchiverCopyLoop()? We will have already advanced the archiver state in pgarch_readyXlog(), so I think we'd end up skipping files if we didn't. For example, if archive_command isn't set, we'll just return, and the next call to pgarch_readyXlog() might return the next file. + /* Continue directory scan until we find a regular WAL file */ + SpinLockAcquire(&PgArch->arch_lck); + PgArch->force_dir_scan = true; + SpinLockRelease(&PgArch->arch_lck); nitpick: I think we should just call PgArchForceDirScan() here. Nathan
On 9/14/21, 9:18 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > This is an interesting idea, but the "else" block here seems prone to > race conditions. I think we'd have to hold arch_lck to prevent that. > But as I mentioned above, if we are okay with depending on the > fallback directory scans, I think we can remove the "else" block > completely. Thinking further, we probably need to hold a lock even when we are creating the .ready file to avoid race conditions. Nathan
At Tue, 14 Sep 2021 18:07:31 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 9/14/21, 9:18 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > > This is an interesting idea, but the "else" block here seems prone to > > race conditions. I think we'd have to hold arch_lck to prevent that. > > But as I mentioned above, if we are okay with depending on the > > fallback directory scans, I think we can remove the "else" block > > completely. > > Thinking further, we probably need to hold a lock even when we are > creating the .ready file to avoid race conditions. The race condition surely happens, but even if it does, all competing processes except one of them will detect the out-of-order condition and force a directory scan. But I'm not sure how it behaves in more complex situations, so I'm not sure I like that behavior. We could just use another lock for the logic there, but instead couldn't we merge PgArchGetLastReadySegNo and PgArchSetLastReadySegNo into one atomic test-and-(check-and-)set function? Like this.

====
XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size);
if (!PgArchReadySegIsInOrder(this_segno))
    PgArchForceDirScan();

bool
PgArchReadySegIsInOrder(XLogSegNo this_segno)
{
    bool in_order = true;

    SpinLockAcquire(&PgArch->arch_lck);
    if (PgArch->lastReadySegNo + 1 != this_segno)
        in_order = false;
    PgArch->lastReadySegNo = this_segno;
    SpinLockRelease(&PgArch->arch_lck);

    return in_order;
}
====

By the way, it seems to me that we only need to force a directory scan when the notified segment number steps back. If this is correct, we can reduce the number of times we force directory scans. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
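For anyone who wants to play with that idea outside the server, here is a runnable restatement of the sketch above, with the spinlock modeled by a pthread mutex and the shared struct reduced to the single field involved; it mirrors the pseudo-code in the mail, not any committed PostgreSQL code.

/*
 * Locked test-and-set: record this notification and report whether it
 * extended the sequence from the previous one.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogSegNo;

static struct
{
    pthread_mutex_t arch_lck;
    XLogSegNo       lastReadySegNo;
} PgArch = {PTHREAD_MUTEX_INITIALIZER, 0};

static bool
PgArchReadySegIsInOrder(XLogSegNo this_segno)
{
    bool        in_order = true;

    pthread_mutex_lock(&PgArch.arch_lck);
    if (PgArch.lastReadySegNo + 1 != this_segno)
        in_order = false;
    PgArch.lastReadySegNo = this_segno;
    pthread_mutex_unlock(&PgArch.arch_lck);

    return in_order;
}

int
main(void)
{
    PgArch.lastReadySegNo = 9;

    printf("10: %s\n", PgArchReadySegIsInOrder(10) ? "in order" : "force scan");
    printf("12: %s\n", PgArchReadySegIsInOrder(12) ? "in order" : "force scan");
    printf("11: %s\n", PgArchReadySegIsInOrder(11) ? "in order" : "force scan");
    return 0;
}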
On 9/15/21, 4:28 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote: > I have incorporated these changes and updated a new patch. PFA patch. Thanks for the new patch. I've attached my take on the latest version. My main focus this time was simplifying the patch for this approach as much as possible. Specifically, I've done the following: 1. I've removed several calls to PgArchForceDirScan() in favor of calling it at the top of pgarch_ArchiverCopyLoop(). I believe there is some disagreement about this change, but I don't think we gain enough to justify the complexity. The main reason we exit pgarch_ArchiverCopyLoop() should ordinarily be that we've run out of files to archive, so incurring a directory scan the next time it is called doesn't seem like it would normally be too bad. I'm sure there are exceptions (e.g., lots of .done files, archive failures), but the patch is still not making things any worse than they presently are for these cases. 2. I removed all the logic that attempted to catch out-of-order .ready files. Instead, XLogArchiveNotify() only forces a directory scan for files other than regular WAL files, and we depend on our periodic directory scans to pick up anything that's been left behind. 3. I moved the logic that forces directory scans every once in a while. We were doing that in the checkpoint/restartpoint logic, which, upon further thought, might not be the best idea. The checkpoint interval can vary widely, and IIRC we won't bother creating checkpoints at all if database activity stops. Instead, I've added logic in pgarch_readyXlog() that forces a directory scan if one hasn't happened in a few minutes. 4. Finally, I've tried to ensure comments are clear and that the logic is generally easy to reason about. What do you think? Nathan
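As a rough model of the time-based fallback in point 3: the interval and variable names below are illustrative only, and the toy does not model pgarch_readyXlog() itself, just the "scan if it has been too long" decision.

/* Time-based fallback: force a directory scan if none has run recently. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define DIR_SCAN_INTERVAL  (4 * 60)      /* seconds, made-up value */

static time_t last_dir_scan = 0;         /* 0 = never scanned */

static bool
need_forced_dir_scan(time_t now)
{
    return last_dir_scan == 0 ||
           now - last_dir_scan >= DIR_SCAN_INTERVAL;
}

int
main(void)
{
    time_t      now = time(NULL);

    if (need_forced_dir_scan(now))
    {
        printf("doing a full directory scan\n");
        /*
         * Update last_dir_scan only after the scan completes, so a slow
         * scan does not immediately force another one (see the later
         * message about moving this assignment).
         */
        last_dir_scan = time(NULL);
    }
    else
        printf("fast path allowed\n");

    printf("%s\n", need_forced_dir_scan(time(NULL)) ?
           "scan still due" : "next scan not due yet");
    return 0;
}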
On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> 1. I've removed several calls to PgArchForceDirScan() in favor of
> calling it at the top of pgarch_ArchiverCopyLoop(). I believe
> there is some disagreement about this change, but I don't think
> we gain enough to justify the complexity. The main reason we
> exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
> run out of files to archive, so incurring a directory scan the
> next time it is called doesn't seem like it would normally be too
> bad. I'm sure there are exceptions (e.g., lots of .done files,
> archive failures), but the patch is still not making things any
> worse than they presently are for these cases.

I was thinking that this might increase the number of directory scans by
a pretty large amount when we repeatedly catch up, then 1 new file gets
added, then we catch up, etc. But I guess your thought process is that
such directory scans, even if they happen many times per second, can't
really be that expensive, since the directory can't have much in it.
Which seems like a fair point. I wonder if there are any situations in
which there's not much to archive but the archive_status directory still
contains tons of files.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2021-Sep-20, Robert Haas wrote:

> I was thinking that this might increase the number of directory scans
> by a pretty large amount when we repeatedly catch up, then 1 new file
> gets added, then we catch up, etc.

I was going to say that perhaps we can avoid repeated scans by having a
bitmap of future files that were found by a scan; so if we need to do
one scan, we keep track of the presence of the next (say) 64 files in
our timeline, and then we only have to do another scan when we need to
archive a file that wasn't present the last time we scanned. However:

> But I guess your thought process is that such directory scans, even if
> they happen many times per second, can't really be that expensive,
> since the directory can't have much in it. Which seems like a fair
> point. I wonder if there are any situations in which there's not much
> to archive but the archive_status directory still contains tons of
> files.

(If we take this stance, which seems reasonable to me, then we don't
need to optimize.) But perhaps we should complain if we find extraneous
files in archive_status -- then it'd be on the users' heads not to leave
tons of files that would slow down the scan.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also
lost. (Lincoln Yeoh)
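A minimal sketch of that bookkeeping, assuming we key the bitmap on segment numbers and cap the lookahead at 64 segments; the names here are invented for illustration, and a real version would also need to be reset on a timeline switch:

#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogSegNo */

/*
 * Hypothetical cache of .ready files discovered by the last directory
 * scan: bit i set means segment (cache_base + i) had a .ready file.
 */
static XLogSegNo cache_base = 0;
static uint64 cache_bits = 0;   /* 64 segments of lookahead */

/* Remember that "segno" had a .ready file during the scan. */
static void
ready_cache_note(XLogSegNo segno)
{
    if (segno >= cache_base && segno < cache_base + 64)
        cache_bits |= UINT64CONST(1) << (segno - cache_base);
}

/*
 * Did the last scan see a .ready file for "segno"?  If not, the caller
 * needs another directory scan (which should also reset cache_base and
 * cache_bits).
 */
static bool
ready_cache_test(XLogSegNo segno)
{
    if (segno < cache_base || segno >= cache_base + 64)
        return false;
    return (cache_bits & (UINT64CONST(1) << (segno - cache_base))) != 0;
}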
On 9/20/21, 1:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Sep-20, Robert Haas wrote:
>
>> I was thinking that this might increase the number of directory scans
>> by a pretty large amount when we repeatedly catch up, then 1 new file
>> gets added, then we catch up, etc.
>
> I was going to say that perhaps we can avoid repeated scans by having a
> bitmap of future files that were found by a scan; so if we need to do
> one scan, we keep track of the presence of the next (say) 64 files in
> our timeline, and then we only have to do another scan when we need to
> archive a file that wasn't present the last time we scanned. However:

This sounds a bit like the other approach discussed earlier in this
thread [0].

>> But I guess your thought process is that such directory scans, even if
>> they happen many times per second, can't really be that expensive,
>> since the directory can't have much in it. Which seems like a fair
>> point. I wonder if there are any situations in which there's not much
>> to archive but the archive_status directory still contains tons of
>> files.
>
> (If we take this stance, which seems reasonable to me, then we don't
> need to optimize.) But perhaps we should complain if we find extraneous
> files in archive_status -- Then it'd be on the users' heads not to
> leave tons of files that would slow down the scan.

The simplest situation I can think of that might be a problem is when
checkpointing is stuck and the .done files are adding up. However, after
the lengthy directory scan, you should still be able to archive several
files without a scan of archive_status. And if you are repeatedly
catching up, the extra directory scans probably aren't hurting anything.
At the very least, this patch doesn't make things any worse in this
area.

BTW I attached a new version of the patch with a couple of small
changes. Specifically, I adjusted some of the comments and moved the
assignment of last_dir_scan to after the directory scan completes.
Before, we were resetting it before the directory scan, so if the
directory scan took too long, you'd still end up scanning archive_status
for every file. I think that's still possible if your archive_command is
especially slow, but archiving isn't going to keep up anyway in that
case.

Nathan

[0] https://www.postgresql.org/message-id/attachment/125980/0001-Improve-performance-of-pgarch_readyXlog-with-many-st.patch
Attachment
On Mon, Sep 20, 2021 at 4:42 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I was going to say that perhaps we can avoid repeated scans by having a
> bitmap of future files that were found by a scan; so if we need to do
> one scan, we keep track of the presence of the next (say) 64 files in
> our timeline, and then we only have to do another scan when we need to
> archive a file that wasn't present the last time we scanned.

There are two different proposed patches on this thread. One of them
works exactly that way, and the other one tries to optimize by assuming
that if we just archived WAL file N, we will likely want to archive WAL
file N+1 next. It's been hard to decide which way is better.

--
Robert Haas
EDB: http://www.enterprisedb.com
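For readers following along, the second approach boils down to something like the sketch below: after a successful archive, guess the next segment name and check for its .ready file with a single stat() before falling back to a scan. The helper name is invented here, the caller is assumed to supply a MAXFNAMELEN-sized buffer, and the real patch also has to cope with timeline switches rather than trusting the guess unconditionally:

#include "postgres.h"

#include <sys/stat.h>

#include "access/xlog.h"            /* wal_segment_size */
#include "access/xlog_internal.h"   /* XLogFileName, XLOGDIR */

/*
 * After archiving segment "last_segno" on timeline "tli", guess that the
 * next file to archive is last_segno + 1.  Returns true and fills
 * "nextfile" (at least MAXFNAMELEN bytes) if a .ready file for that
 * segment already exists; otherwise the caller does a directory scan.
 */
static bool
guess_next_ready(TimeLineID tli, XLogSegNo last_segno, char *nextfile)
{
    char        path[MAXPGPATH];
    struct stat st;

    XLogFileName(nextfile, tli, last_segno + 1, wal_segment_size);
    snprintf(path, MAXPGPATH, XLOGDIR "/archive_status/%s.ready", nextfile);

    return stat(path, &st) == 0;
}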
On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> What do you think?

I think this is committable. I also went back and looked at your
previous proposal to do files in batches, and I think that's also
committable. After some reflection, I think I have a slight preference
for the batching approach. It seems like it might lend itself to
archiving multiple files in a single invocation of the archive_command,
and Alvaro just suggested it again apparently not having realized that
it had been previously proposed by Andres, so I guess it has the further
advantage of being the thing that several committers intuitively feel
like we ought to be doing to solve this problem.

So what I am inclined to do is commit
v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
perhaps evolved a bit more than the other one, so I thought I should
first ask whether any of those changes have influenced your thinking
about the batching approach and whether you want to make any updates to
that patch first. I don't really see that this is needed, but I might be
missing something.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
On 9/24/21 12:28 PM, Robert Haas wrote:
> On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> What do you think?
>
> I think this is committable. I also went back and looked at your
> previous proposal to do files in batches, and I think that's also
> committable. After some reflection, I think I have a slight preference
> for the batching approach.
> It seems like it might lend itself to archiving multiple files in a
> single invocation of the archive_command, and Alvaro just suggested it
> again apparently not having realized that it had been previously
> proposed by Andres, so I guess it has the further advantage of being
> the thing that several committers intuitively feel like we ought to be
> doing to solve this problem.

I also prefer this approach. Reducing directory scans is an excellent
optimization, but from experience I know that execution time for the
archive_command can also be a significant bottleneck. Being able to
archive multiple segments per execution would be a big win in certain
scenarios.

> So what I am inclined to do is commit
> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.

I read the patch and it looks good to me. I do wish we had a way to test
that history files get archived first, but as I recall I was not able to
figure out how to do that reliably for [1] without writing a custom
archive_command just for testing. That is something we might want to
consider as we make this logic more complex.

Regards,

--
-David
david@pgmasters.net

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b981df4cc09aca978c5ce55e437a74913d09cccc
On 9/24/21, 9:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> So what I am inclined to do is commit
> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
> However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
> perhaps evolved a bit more than the other one, so I thought I should
> first ask whether any of those changes have influenced your thinking
> about the batching approach and whether you want to make any updates
> to that patch first. I don't really see that this is needed, but I
> might be missing something.

Besides sprucing up the comments a bit, I don't think there is anything
that needs to be changed. The only other thing I considered was getting
rid of the arch_files array. Instead, I would swap the comparator
function the heap uses with a reverse one, rebuild the heap, and then
have pgarch_readyXlog() return files via binaryheap_remove_first().
However, this seemed like a bit more complexity than necessary.

Attached is a new version of the patch with some expanded comments.

Nathan
Attachment
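To make the heap-based batching a bit more concrete for anyone skimming the thread, the per-file decision during the directory scan amounts to something like the sketch below, assuming a bounded max-heap whose root is the newest file kept so far. The constant and function names are invented for this illustration, and unlike the real patch it ignores the special priority given to timeline history files:

#include "postgres.h"
#include "lib/binaryheap.h"

/* How many .ready files to remember from one directory scan. */
#define NUM_FILES_PER_SCAN 64

/* Max-heap comparator: lexically later file names compare "greater". */
static int
ready_file_comparator(Datum a, Datum b, void *arg)
{
    return strcmp(DatumGetCString(a), DatumGetCString(b));
}

/*
 * Consider one file name found during the scan.  "heap" is assumed to
 * have been created with
 *     binaryheap_allocate(NUM_FILES_PER_SCAN, ready_file_comparator, NULL)
 * and we keep only the NUM_FILES_PER_SCAN oldest names seen so far; the
 * root of the max-heap is the newest of those, so it is the one to evict.
 */
static void
consider_ready_file(binaryheap *heap, const char *fname)
{
    if (heap->bh_size < NUM_FILES_PER_SCAN)
        binaryheap_add(heap, CStringGetDatum(pstrdup(fname)));
    else if (strcmp(fname, DatumGetCString(binaryheap_first(heap))) < 0)
    {
        /* Older than the newest file we kept: swap it in. */
        pfree(DatumGetPointer(binaryheap_first(heap)));
        binaryheap_replace_first(heap, CStringGetDatum(pstrdup(fname)));
    }
}

After the scan, the heap's contents can then be drained into a small sorted array (the arch_files array mentioned above) and handed out one file at a time without rescanning the directory.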
On 9/27/21, 11:06 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > On 9/24/21, 9:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: >> So what I am inclined to do is commit >> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch. >> However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has >> perhaps evolved a bit more than the other one, so I thought I should >> first ask whether any of those changes have influenced your thinking >> about the batching approach and whether you want to make any updates >> to that patch first. I don't really see that this is needed, but I >> might be missing something. > > Besides sprucing up the comments a bit, I don't think there is > anything that needs to be changed. The only other thing I considered > was getting rid of the arch_files array. Instead, I would swap the > comparator function the heap uses with a reverse one, rebuild the > heap, and then have pgarch_readyXlog() return files via > binaryheap_remove_first(). However, this seemed like a bit more > complexity than necessary. > > Attached is a new version of the patch with some expanded comments. I just wanted to gently bump this thread in case there is any additional feedback. I should have some time to work on it this week. Also, it's looking more and more like this patch will nicely assist the batching/loadable backup module stuff [0]. Nathan [0] https://postgr.es/m/E9035E94-EC76-436E-B6C9-1C03FBD8EF54%40amazon.com
On Fri, Sep 24, 2021 at 12:28 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote: > > What do you think? > > I think this is committable. I also went back and looked at your > previous proposal to do files in batches, and I think that's also > committable. After some reflection, I think I have a slight preference > for the batching approach. > It seems like it might lend itself to archiving multiple files in a > single invocation of the archive_command, and Alvaro just suggested it > again apparently not having realized that it had been previously > proposed by Andres, so I guess it has the further advantage of being > the thing that several committers intuitively feel like we ought to be > doing to solve this problem. > > So what I am inclined to do is commit > v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch. > However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has > perhaps evolved a bit more than the other one, so I thought I should > first ask whether any of those changes have influenced your thinking > about the batching approach and whether you want to make any updates > to that patch first. I don't really see that this is needed, but I > might be missing something. Nathan, I just realized we never closed the loop on this. Do you have any thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > Nathan, I just realized we never closed the loop on this. Do you have > any thoughts? IMO the patch is in decent shape. Happy to address any feedback you might have on the latest patch [0]. Nathan [0] https://www.postgresql.org/message-id/attachment/126789/v3-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch
On 10/19/21, 7:53 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: >> Nathan, I just realized we never closed the loop on this. Do you have >> any thoughts? > > IMO the patch is in decent shape. Happy to address any feedback you > might have on the latest patch [0]. This thread seems to have lost traction. The cfbot entry for the latest patch [0] is still all green, so I think it is still good to go. I'm happy to address any additional feedback, though. Nathan [0] https://postgr.es/m/attachment/126789/v3-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch
On Thu, Nov 11, 2021 at 10:37 AM Bossart, Nathan <bossartn@amazon.com> wrote: > On 10/19/21, 7:53 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > > On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > >> Nathan, I just realized we never closed the loop on this. Do you have > >> any thoughts? > > > > IMO the patch is in decent shape. Happy to address any feedback you > > might have on the latest patch [0]. > > This thread seems to have lost traction. The cfbot entry for the > latest patch [0] is still all green, so I think it is still good to > go. I'm happy to address any additional feedback, though. Somehow I didn't see your October 19th response previously. The threading in gmail seems to have gotten broken, which may have contributed. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Nov 11, 2021 at 2:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > Somehow I didn't see your October 19th response previously. The > threading in gmail seems to have gotten broken, which may have > contributed. And actually I also missed the September 27th email where you sent v3. Oops. Committed now. -- Robert Haas EDB: http://www.enterprisedb.com
On 11/11/21, 12:23 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Thu, Nov 11, 2021 at 2:49 PM Robert Haas <robertmhaas@gmail.com> wrote: >> Somehow I didn't see your October 19th response previously. The >> threading in gmail seems to have gotten broken, which may have >> contributed. > > And actually I also missed the September 27th email where you sent v3. Oops. > > Committed now. Thanks! I figured it was something like that. Sorry if I caused the thread breakage. Nathan
On Thu, Nov 11, 2021 at 3:58 PM Bossart, Nathan <bossartn@amazon.com> wrote: > Thanks! I figured it was something like that. Sorry if I caused the > thread breakage. I think it was actually that the thread went over 100 emails ... which usually causes Google to break it, but I don't know why it broke it into three pieces instead of two, or why I missed the new ones. Anyway, I don't think it was your fault, but no worries either way. -- Robert Haas EDB: http://www.enterprisedb.com
Hello, sorry for necro-posting here:

On 2021-May-03, Robert Haas wrote:

> I and various colleagues of mine have from time to time encountered
> systems that got a bit behind on WAL archiving, because the
> archive_command started failing and nobody noticed right away.

We've recently had a couple of cases where the archiver hasn't been able
to keep up on systems running 13 and 14 because of this problem, causing
serious production outages. Obviously that's not a great experience.

I understand that this has been significantly improved in branch 15 by
commit beb4e9ba1652, the fix in this thread; we hypothesize that both
these production problems wouldn't have occurred if the users had been
running the optimized pgarch.c code.

However, that commit was not backpatched. I think that was the correct
decision at the time, because it wasn't a trivial fix. It was
significantly modified by 1fb17b190341 a month later, both to fix a
critical bug as well as to make some efficiency improvements. Now that
the code has been battle-tested, I think we can consider putting it into
the older branches.

I did a quick cherry-pick experiment, and I found that it backpatches
cleanly to 14. It doesn't to 13, for lack of d75288fb27b8, which is far
too invasive to backpatch, and I don't think we should rewrite the code
so that it works on the previous state. Fortunately 13 only has one more
year to live, so I don't feel too bad about leaving it as is.

So, my question now is, would there be much opposition to backpatching
beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?

(On the other hand, we can always blame users for failing to implement
WAL archiving "correctly" ... but from my perspective, this is an
embarrassing Postgres problem, and one that's relatively easy to solve
with very low risk.)

Thanks,

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"
On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So, my question now is, would there be much opposition to backpatching > beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE? It seems like it's been long enough now that if the new logic had major problems we probably would have found them by now; so I feel like it's probably pretty safe. Perhaps it's questionable how many people we'll help by back-patching this into one additional release, but if you feel it's worth it I wouldn't be inclined to argue. -- Robert Haas EDB: http://www.enterprisedb.com
On 2024-Nov-14, Michael Paquier wrote:

> On Wed, Nov 13, 2024 at 02:52:31PM -0500, Robert Haas wrote:
> > On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >> So, my question now is, would there be much opposition to backpatching
> >> beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?
> >
> > It seems like it's been long enough now that if the new logic had
> > major problems we probably would have found them by now; so I feel
> > like it's probably pretty safe. Perhaps it's questionable how many
> > people we'll help by back-patching this into one additional release,
> > but if you feel it's worth it I wouldn't be inclined to argue.
>
> +1 for v14 as this version is still around for two years.

Thanks, I have pushed it now.

> Even getting that down to v13 would be OK for me at this stage, as,
> assuming that something is messed up for a reason or another, we would
> still have one year to address any issue that could pop up on
> REL_13_STABLE.
>
> Now, the conflicts that exist between v14 and v13 in pgarch.[c|h] are
> really risky due to the design changes done in d75288fb27b8, so it is
> much better to leave v13 out of scope..

Yeah, my first intention was to push it to 13 as well. I did try to add
a minimal version of d75288fb27b8 to get it there (it had a lot of
conflicts but it was easy to resolve by just removing a few lines), but
the tests fail with pgarchiver sometimes hanging. Clearly an indicator
that it would be far too risky to backpatch it there. I don't think it's
reasonable to expect much more from that. Let's just tell people to
upgrade if they have issues.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La libertad es como el dinero; el que no la sabe emplear la pierde"
(Alvarez)
On Thu, Dec 12, 2024 at 04:29:47PM +0100, Alvaro Herrera wrote:
> On 2024-Nov-14, Michael Paquier wrote:
>
>> On Wed, Nov 13, 2024 at 02:52:31PM -0500, Robert Haas wrote:
>> > On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> >> So, my question now is, would there be much opposition to backpatching
>> >> beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?
>> >
>> > It seems like it's been long enough now that if the new logic had
>> > major problems we probably would have found them by now; so I feel
>> > like it's probably pretty safe. Perhaps it's questionable how many
>> > people we'll help by back-patching this into one additional release,
>> > but if you feel it's worth it I wouldn't be inclined to argue.
>>
>> +1 for v14 as this version is still around for two years.
>
> Thanks, I have pushed it now.

I'm happy to see these changes are proving to be useful! FWIW commit
756e221 should further reduce archiving overhead, but that may make it
more likely that the server will attempt to re-archive files after a
crash, so I'm not sure I would advise back-patching it.

--
nathan