Thread: pg_receivewal makes a bad daemon
You might want to use pg_receivewal to save all of your WAL segments somewhere instead of relying on archive_command. It has, at the least, the advantage of working on the byte level rather than the segment level. But it seems to me that it is not entirely suitable as a substitute for archiving, for a couple of reasons. One is that as soon as it runs into a problem, it exits, which is not really what you want out of a daemon that's critical to the future availability of your system. Another is that you can't monitor it aside from looking at what it prints out, which is also not really what you want for a piece of critical infrastructure. The first problem seems somewhat more straightforward. Suppose we add a new command-line option, perhaps --daemon but we can bikeshed. If this option is specified, then it tries to keep going when it hits a problem, rather than just giving up. There's some fuzziness in my mind about exactly what this should mean. If the problem we hit is that we lost the connection to the remote server, then we should try to reconnect. But if the problem is something like a failure inside open_walfile() or close_walfile(), like a failed open() or fsync() or close() or something, it's a little less clear what to do. Maybe one idea would be to have a parent process and a child process, where the child process does all the work and the parent process just keeps re-launching it if it dies. 
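A minimal sketch of that parent/child arrangement, assuming a hypothetical run_worker() standing in for pg_receivewal's streaming loop - this is an illustration of the shape of the idea, not proposed code:

```c
/* Sketch of the parent/child idea: the parent does nothing but
 * re-launch the worker when it exits abnormally.  run_worker() is a
 * hypothetical stand-in for the WAL-streaming main loop. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int
run_worker(void)
{
    /* ... stream WAL until an error occurs ... */
    return 1;                   /* nonzero: worker hit a problem */
}

int
supervise(int max_restarts, unsigned int retry_seconds)
{
    int         restarts = 0;

    for (;;)
    {
        pid_t       pid = fork();
        int         status;

        if (pid < 0)
            return -1;          /* fork failed; give up */
        if (pid == 0)
            _exit(run_worker());

        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return 0;           /* clean exit: stop supervising */
        if (++restarts > max_restarts)
            return -1;          /* too many failures in a row */
        sleep(retry_seconds);   /* back off before relaunching */
    }
}
```

Since run_worker() above always fails, supervise() relaunches it until the restart budget is exhausted and then gives up, which is roughly the behavior in question.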
It's not entirely clear that this is a suitable way of recovering from, say, an fsync() failure, given previous discussions claiming that - and I might be exaggerating a bit here - there is essentially no way to recover from a failed fsync() because the kernel might have already thrown out your data and you might as well just set the data center on fire. But perhaps a retry system that can't cope with certain corner cases is better than not having one at all, and perhaps we could revise the logic here and there to have the process doing the work take some action other than exiting when that's an intelligent approach. The second problem is a bit more complex. If you were transferring WAL to another PostgreSQL instance rather than to a frontend process, you could log to some place other than standard output, like for example a file, and you could periodically rotate that file, or alternatively you could log to syslog or the Windows event log. Even better, you could connect to PostgreSQL and run SQL queries against monitoring views and see what results you get. If the existing monitoring views don't give users what they need, we can improve them, but the whole infrastructure needed for this kind of thing is altogether lacking for any frontend program. It does not seem very appealing to reinvent log rotation, connection management, and monitoring views inside pg_receivewal, let alone in every frontend process where similar monitoring might be useful. But at least for me, without such capabilities, it is a little hard to take pg_receivewal seriously. I wonder first of all whether other people agree with these concerns, and secondly what they think we ought to do about it. One option is - do nothing. This could be based either on the idea that pg_receivewal is hopeless, or else on the idea that pg_receivewal can be restarted by some external system when required and monitored well enough as things stand.
A second option is to start building out capabilities in pg_receivewal to turn it into something closer to what you'd expect of a normal daemon, with the addition of a retry capability as probably the easiest improvement. A third option is to somehow move towards a world where you can use the server to move WAL around even if you don't really want to run the server. Imagine a server running with no data directory and only a minimal set of running processes, just (1) a postmaster and (2) a walreceiver that writes to an archive directory and (3) non-database-connected backends that are just smart enough to handle queries for status information. This has the same problem that I mentioned on the thread about monitoring the recovery process, namely that we haven't got pg_authid. But against that, you get a lot of infrastructure for free: configuration files, process management, connection management, an existing wire protocol, memory contexts, rich error reporting, etc. I am curious to hear what other people think about the usefulness (or lack thereof) of pg_receivewal as things stand today, as well as ideas about future direction. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2021-05-05 at 11:04 -0400, Robert Haas wrote: > You might want to use pg_receivewal to save all of your WAL segments > somewhere instead of relying on archive_command. It has, at the least, > the advantage of working on the byte level rather than the segment > level. But it seems to me that it is not entirely suitable as a > substitute for archiving, for a couple of reasons. One is that as soon > as it runs into a problem, it exits, which is not really what you want > out of a daemon that's critical to the future availability of your > system. Another is that you can't monitor it aside from looking at > what it prints out, which is also not really what you want for a piece > of critical infrastructure. > > The first problem seems somewhat more straightforward. Suppose we add > a new command-line option, perhaps --daemon but we can bikeshed. If > this option is specified, then it tries to keep going when it hits a > problem, rather than just giving up. [...] That sounds like a good idea. I don't know what it takes to make that perfect (if such a thing exists), but simply trying to re-establish database connections and dying when we hit an I/O problem seems like a clear improvement. > The second problem is a bit more complex. [...] If I wanted to monitor pg_receivewal, I'd have it use a replication slot and monitor "pg_replication_slots" on the primary. That way I see if there is a WAL sender process, and I can measure the lag in bytes. What more could you want? Yours, Laurenz Albe
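As a concrete (hypothetical) example of the monitoring Laurenz describes - the slot name here is an assumption, whatever was passed to pg_receivewal's --slot option - the lag in bytes can be read off the primary like so:

```sql
-- Is anything streaming from the slot, and how far behind is it?
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'pg_receivewal_slot';  -- assumed slot name
```

If active is false or lag_bytes keeps growing, the receiver is down or stuck.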
On Wed, May 5, 2021 at 5:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > You might want to use pg_receivewal to save all of your WAL segments > somewhere instead of relying on archive_command. It has, at the least, > the advantage of working on the byte level rather than the segment > level. But it seems to me that it is not entirely suitable as a > substitute for archiving, for a couple of reasons. One is that as soon > as it runs into a problem, it exits, which is not really what you want > out of a daemon that's critical to the future availability of your > system. Another is that you can't monitor it aside from looking at > what it prints out, which is also not really what you want for a piece > of critical infrastructure. > > The first problem seems somewhat more straightforward. Suppose we add > a new command-line option, perhaps --daemon but we can bikeshed. If > this option is specified, then it tries to keep going when it hits a > problem, rather than just giving up. There's some fuzziness in my mind > about exactly what this should mean. If the problem we hit is that we > lost the connection to the remote server, then we should try to > reconnect. But if the problem is something like a failure inside > open_walfile() or close_walfile(), like a failed open() or fsync() or > close() or something, it's a little less clear what to do. Maybe one > idea would be to have a parent process and a child process, where the > child process does all the work and the parent process just keeps > re-launching it if it dies. 
It's not entirely clear that this is a > suitable way of recovering from, say, an fsync() failure, given > previous discussions claiming that - and I might be exaggerating a bit > here - there is essentially no way to recover from a failed fsync() > because the kernel might have already thrown out your data and you > might as well just set the data center on fire - but perhaps a retry > system that can't cope with certain corner cases is better than not > having one at all, and perhaps we could revise the logic here and > there to have the process doing the work take some action other than > exiting when that's an intelligent approach. Is this really a problem we should fix ourselves? Most daemon-managers today can happily be configured to automatically restart a daemon on failure with a single setting, and have been able to for a long time now. E.g. in systemd (which most linuxen use now) you just set Restart=on-failure (or maybe even Restart=always) and something like RestartSec=10. That said, it wouldn't cover an fsync() error -- they will always restart. The way to handle that is perhaps for the operator to capture the error message, and just "deal with it"? What could be more interesting there in a "systemd world" would be to add watchdog support. That'd obviously only be interesting on systemd platforms, but we already have some of that basic notification support in the postmaster for those. > The second problem is a bit more complex. If you were transferring WAL > to another PostgreSQL instance rather than to a frontend process, you > could log to some place other than standard output, like for example a > file, and you could periodically rotate that file, or alternatively > you could log to syslog or the Windows event log. Even better, you > could connect to PostgreSQL and run SQL queries against monitoring > views and see what results you get.
If the existing monitoring views > don't give users what they need, we can improve them, but the whole > infrastructure needed for this kind of thing is altogether lacking for > any frontend program. It does not seem very appealing to reinvent log > rotation, connection management, and monitoring views inside > pg_receivewal, let alone in every frontend process where similar > monitoring might be useful. But at least for me, without such > capabilities, it is a little hard to take pg_receivewal seriously. Again, isn't this the job of the daemon runner? At least in cases where it's not Windows :)? That is, taking the output and putting it in a log, and interfacing with log rotation. Now, having some sort of statistics *other* than parsing a log would definitely be useful. But perhaps that could be something as simple as having a --statsfile=/foo/bar parameter and then updating that one at regular intervals with "whatever is the current state"? And of course, the other point to monitor is the replication slot on the server it's connected to -- but I agree that being able to monitor both sides there would be good. > I wonder first of all whether other people agree with these concerns, > and secondly what they think we ought to do about it. One option is - > do nothing. This could be based either on the idea that pg_receivewal > is hopeless, or else on the idea that pg_receivewal can be restarted > by some external system when required and monitored well enough as > things stand. A second option is to start building out capabilities in > pg_receivewal to turn it into something closer to what you'd expect of > a normal daemon, with the addition of a retry capability as probably > the easiest improvement. A third option is to somehow move towards a > world where you can use the server to move WAL around even if you > don't really want to run the server.
Imagine a server running with no > data directory and only a minimal set of running processes, just (1) a > postmaster and (2) a walreceiver that writes to an archive directory > and (3) non-database-connected backends that are just smart enough to > handle queries for status information. This has the same problem that > I mentioned on the thread about monitoring the recovery process, > namely that we haven't got pg_authid. But against that, you get a lot > of infrastructure for free: configuration files, process management, > connection management, an existing wire protocol, memory contexts, > rich error reporting, etc. > > I am curious to hear what other people think about the usefulness (or > lack thereof) of pg_receivewal as thing stand today, as well as ideas > about future direction. Per above, I'm thinking maybe our efforts are better directed at documenting ways to do it now? Also, all the above also apply to pg_recvlogical, right? So if we do want to invent our own daemon-init-system, we should probably do one more generic that can handle both. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
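For the archives, the daemon-manager approach Magnus describes might be a systemd unit roughly like this - the paths, user, and slot name are assumptions, not anything we ship:

```ini
[Unit]
Description=PostgreSQL WAL streaming via pg_receivewal
After=network-online.target

[Service]
User=postgres
ExecStart=/usr/bin/pg_receivewal -D /var/lib/pgsql/walarchive --slot=pg_receivewal_slot --no-password
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

With Restart=always instead, even an fsync()-induced exit would be retried, for better or worse.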
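The hypothetical --statsfile idea above could be as small as a write-to-temp-then-rename routine, sketched here in C; the parameter and field names are invented for illustration:

```c
/* Sketch of the hypothetical --statsfile idea: write the current state
 * to a temporary file and rename() it into place, so a monitoring
 * script never sees a torn update.  Field names are invented. */
#include <stdio.h>
#include <time.h>

int
write_statsfile(const char *path, const char *last_flushed_lsn,
                long bytes_received)
{
    char        tmp[1024];
    FILE       *f;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    f = fopen(tmp, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "updated_at: %ld\n", (long) time(NULL));
    fprintf(f, "last_flushed_lsn: %s\n", last_flushed_lsn);
    fprintf(f, "bytes_received: %ld\n", bytes_received);
    if (fclose(f) != 0)
        return -1;

    /* rename() is atomic on POSIX filesystems */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

A monitoring agent could then poll the file and alarm if updated_at stops advancing.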
On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote: > Is this really a problem we should fix ourselves? Most daemon-managers > today will happily be configured to automatically restart a daemon on > failure with a single setting since a long time now. E.g. in systemd > (which most linuxen uses now) you just set Restart=on-failure (or > maybe even Restart=always) and something like RestartSec=10. > > That said, it wouldn't cover an fsync() error -- they will always > restart. The way to handle that is for the operator to capture the > error message perhaps, and just "deal with it"? Maybe, but if that's really a non-problem, why does postgres itself restart, and have facilities to write and rotate log files? I feel like this argument boils down to "a manual transmission ought to be good enough for anyone, let's not have automatics." But over the years people have found that automatics are a lot easier to drive. It may be true that if you know just how to configure your system's daemon manager, you can make all of this work, but it's not like we document how to do any of that, and it's probably not the same on every platform - Windows in particular - and, really, why should people have to do this much work? If I want to run postgres in the background I can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my crontab to make sure it gets restarted within a few minutes even if the postmaster dies. If I want to keep pg_receivewal running all the time ... I need a whole pile of extra mechanism to work around its inherent fragility. Documenting how that's typically done on modern systems, as you propose further on, would be great, but I can't do it, because I don't know how to make it work. Hence the thread. > Also, all the above also apply to pg_recvlogical, right? So if we do > want to invent our own daemon-init-system, we should probably do one > more generic that can handle both. Yeah. 
And I'm not really 100% convinced that trying to patch this functionality into pg_receive{wal,logical} is the best way forward ... but I'm not entirely convinced that it isn't, either. I think one of the basic problems with trying to deploy PostgreSQL in 2021 is that it needs so much supporting infrastructure and so much babysitting. archive_command has to be a complicated, almost magical program we don't provide, and we don't even tell you in the documentation that you need it. If you don't want to use that, you can stream with pg_receivewal instead, but now you need a complicated daemon-runner mechanism that we don't provide or document the need for. You also probably need a connection pooler that we don't provide, a failover manager that we don't provide, and backup management software that we don't provide. And the interfaces that those tools have to work with are so awkward and primitive that even the tool authors can't always get it right. So I'm sort of unimpressed by any arguments that boil down to "what we have is good enough" or "that's the job of some other piece of software". Too many things are the job of some piece of software that doesn't really exist, or is only available on certain platforms, or that has some other problem that makes it not usable for everyone. People want to be able to download and use PostgreSQL without needing a whole library of other bits and pieces from around the Internet. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, May 05, 2021 at 01:12:03PM -0400, Robert Haas wrote: > On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote: > > Is this really a problem we should fix ourselves? Most daemon-managers > > today will happily be configured to automatically restart a daemon on > > failure with a single setting since a long time now. E.g. in systemd > > (which most linuxen uses now) you just set Restart=on-failure (or > > maybe even Restart=always) and something like RestartSec=10. > > > > That said, it wouldn't cover an fsync() error -- they will always > > restart. The way to handle that is for the operator to capture the > > error message perhaps, and just "deal with it"? > > Maybe, but if that's really a non-problem, why does postgres itself > restart, and have facilities to write and rotate log files? I feel > like this argument boils down to "a manual transmission ought to be > good enough for anyone, let's not have automatics." But over the years > people have found that automatics are a lot easier to drive. It may be > true that if you know just how to configure your system's daemon > manager, you can make all of this work, but it's not like we document > how to do any of that, and it's probably not the same on every > platform - Windows in particular - and, really, why should people have > to do this much work? If I want to run postgres in the background I > can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my > crontab to make sure it gets restarted within a few minutes even if > the postmaster dies. If I want to keep pg_receivewal running all the > time ... I need a whole pile of extra mechanism to work around its > inherent fragility. Documenting how that's typically done on modern > systems, as you propose further on, would be great, but I can't do it, > because I don't know how to make it work. Hence the thread. > > > Also, all the above also apply to pg_recvlogical, right? 
So if we do > > want to invent our own daemon-init-system, we should probably do one > > more generic that can handle both. > > Yeah. And I'm not really 100% convinced that trying to patch this > functionality into pg_receive{wal,logical} is the best way forward ... > but I'm not entirely convinced that it isn't, either. I think one of > the basic problems with trying to deploy PostgreSQL in 2021 is that it > needs so much supporting infrastructure and so much babysitting. > archive_command has to be a complicated, almost magical program we > don't provide, and we don't even tell you in the documentation that > you need it. If you don't want to use that, you can stream with > pg_receivewal instead, but now you need a complicated daemon-runner > mechanism that we don't provide or document the need for. You also > probably need a connection pooler that we don't provide, a failover > manager that we don't provide, and backup management software that we > don't provide. And the interfaces that those tools have to work with > are so awkward and primitive that even the tool authors can't always > get it right. So I'm sort of unimpressed by any arguments that boil > down to "what we have is good enough" or "that's the job of some other > piece of software". Too many things are the job of some piece of > software that doesn't really exist, or is only available on certain > platforms, or that has some other problem that makes it not usable for > everyone. People want to be able to download and use PostgreSQL > without needing a whole library of other bits and pieces from around > the Internet. We do use at least one bit and piece from around the internet to make our software usable, namely libreadline, the absence of which makes psql pretty much unusable. That out of the way, am I understanding correctly that you're proposing that we make tools for daemon-izing, logging, connection management, and failover, and ship same with PostgreSQL?
I can see the appeal for people shipping proprietary forks of PostgreSQL, especially ones under restrictive licenses, and I guess we could make a pretty good case for continuing to center those interests as we have since the Berkeley days. Rather than, or maybe as a successor to, wiring such things into each tool we ship that requires them, I'd picture something along the lines of .sos that could then be repurposed, modified, etc., provided with the distribution as it is now. Another possibility would be to look around for mature capabilities that are cross-platform in the sense that they work on all the platforms we do. While I don't think it's likely we'd find them for all the above use cases under compatible licenses, it's probably worth a look. At worst, we'd get some idea of how (not) to design the APIs to them. I'm going to guess that anything with an incompatible license will upset people who are accustomed to ensuring that we have what legally amounts to an MIT-license-clean distribution, but I'm thinking that option is at least worth discussing, even if the immediate consensus is, "libreadline is bad enough. We went to a lot of trouble to purge that other stuff back in the bad old days. Let's not make that mistake again." Best, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, May 5, 2021 at 10:42 PM David Fetter <david@fetter.org> wrote: > We do use at least one bit and piece from around the internet to make > our software usable, namely libreadline, the absence of which make > psql pretty much unusable. I'm not talking about dependent libraries. We obviously have to depend on some external libraries; it would be crazy to write our own versions of libreadline, zlib, glibc, and everything else we use. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, May 5, 2021 at 7:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote: > > Is this really a problem we should fix ourselves? Most daemon-managers > > today will happily be configured to automatically restart a daemon on > > failure with a single setting since a long time now. E.g. in systemd > > (which most linuxen uses now) you just set Restart=on-failure (or > > maybe even Restart=always) and something like RestartSec=10. > > > > That said, it wouldn't cover an fsync() error -- they will always > > restart. The way to handle that is for the operator to capture the > > error message perhaps, and just "deal with it"? > > Maybe, but if that's really a non-problem, why does postgres itself > restart, and have facilities to write and rotate log files? I feel > like this argument boils down to "a manual transmission ought to be > good enough for anyone, let's not have automatics." But over the years > people have found that automatics are a lot easier to drive. It may be > true that if you know just how to configure your system's daemon > manager, you can make all of this work, but it's not like we document > how to do any of that, and it's probably not the same on every > platform - Windows in particular - and, really, why should people have > to do this much work? If I want to run postgres in the background I > can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my > crontab to make sure it gets restarted within a few minutes even if > the postmaster dies. If I want to keep pg_receivewal running all the > time ... I need a whole pile of extra mechanism to work around its > inherent fragility. Documenting how that's typically done on modern > systems, as you propose further on, would be great, but I can't do it, > because I don't know how to make it work. Hence the thread. If PostgreSQL was built today, I'm not sure we would've built that functionality TBH. 
The vast majority of people are not interested in manually starting postgres and then putting in a crontab to "restart it if it fails". That's not how anybody runs a server, and it hasn't been for a long time. It might be interesting for us as developers, but not to the vast majority of our users. Most of those get their startup scripts from our packagers -- so maybe we should encourage packagers to provide it, like they do for PostgreSQL itself. But I don't think adding log rotation and other independent functionality to pg_receivexyz would help almost anybody in our user base. In relation to the other thread about pid 1 handling and containers -- if anything, I bet a larger portion of our users would be interested in running pg_receivewal in a dedicated container than would want to start it manually and verify it's running using crontab... By a large margin. It is true that Windows is a special case in this. But it is, I'd say, equally true that adding something akin to "pg_ctl start" for pg_receivewal would be equally useless on Windows. We can certainly build and add such functionality. But my feeling is that it's going to be added complexity for very little practical gain. Much of the server world moved to "we don't want every single daemon to implement it its own way, ever so slightly differently". I like your car analogy though. But I'd consider it more like "we used to have to mix the right amount of oil into the gasoline manually. But modern engines don't really require us to do that anymore, so most people have stopped, only those who want very special cars do". Or something along that line. (Reality is probably somewhere in between, and I suck at car analogies) > > Also, all the above also apply to pg_recvlogical, right? So if we do > > want to invent our own daemon-init-system, we should probably do one > > more generic that can handle both. > > Yeah. 
And I'm not really 100% convinced that trying to patch this > functionality into pg_receive{wal,logical} is the best way forward ... It does in a lot of ways amount to basically a daemon-init system. It might be easier to just vendor one of the existing ones :) Or more realistically, suggest they use something that's already on their system. On linux that'll be systemd, on *bsd it'll probably be something like supervisord, on mac it'll be launchd. But this is really more a function of the operating system/distribution. Windows is again the one that stands out. But PostgreSQL *already* does a pretty weak job of solving that problem on Windows, so duplicating that is not that strong a win. > but I'm not entirely convinced that it isn't, either. I think one of > the basic problems with trying to deploy PostgreSQL in 2021 is that it > needs so much supporting infrastructure and so much babysitting. > archive_command has to be a complicated, almost magical program we > don't provide, and we don't even tell you in the documentation that > you need it. If you don't want to use that, you can stream with > pg_receivewal instead, but now you need a complicated daemon-runner > mechanism that we don't provide or document the need for. You also > probably need a connection pooler that we don't provide, a failover > manager that we don't provide, and backup management software that we > don't provide. And the interfaces that those tools have to work with > are so awkward and primitive that even the tool authors can't always > get it right. So I'm sort of unimpressed by any arguments that boil > down to "what we have is good enough" or "that's the job of some other > piece of software". Too many things are the job of some piece of > software that doesn't really exist, or is only available on certain > platforms, or that has some other problem that makes it not usable for > everyone. 
People want to be able to download > and use PostgreSQL without needing a whole library of other bits and > pieces from around the Internet. I definitely don't think what we have is good enough, and I agree with your general description of the problem. I just don't think turning a simple tool into a more complicated daemon is going to help with that in any material way. You still need some sort of *backup management* on that side, otherwise your pg_receivewal will now be the one that fills your disk along with the outputs of your pg_basebackups. So we'd be better off providing that management tool, which could then drive the lower-level tools as necessary. Or maybe the better solution in that case would be to actually bless one of the existing solutions out there by making it the official one. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
On Thu, May 6, 2021 at 5:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 5, 2021 at 10:42 PM David Fetter <david@fetter.org> wrote: > > We do use at least one bit and piece from around the internet to make > > our software usable, namely libreadline, the absence of which make > > psql pretty much unusable. FWIW, we did go with the idea of using readline, which doesn't work properly on Windows. So this is an excellent example of how we're already not solving the problem for Windows users, but are apparently OK with it in this case. > I'm not talking about dependent libraries. We obviously have to depend > on some external libraries; it would be crazy to write our own > versions of libreadline, zlib, glibc, and everything else we use. Why is that more crazy than building our own limited version of supervisord? readline and glibc might be one thing, but zlib (at least the parts we use) is probably less complex than building our own cross-platform daemon management. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
On 05.05.21 19:12, Robert Haas wrote: > Maybe, but if that's really a non-problem, why does postgres itself > restart, and have facilities to write and rotate log files? I think because those were invented at a time when the operating system facilities were less useful. And the log management facilities aren't even very good, because there is no support for remote logging. > It may be > true that if you know just how to configure your system's daemon > manager, you can make all of this work, but it's not like we document > how to do any of that, and it's probably not the same on every > platform - Windows in particular - and, really, why should people have > to do this much work? If I want to run postgres in the background I > can just type 'pg_ctl start'. Not really a solution, because systemd will kill it when you log out. > Documenting how that's typically done on modern > systems, as you propose further on, would be great, but I can't do it, > because I don't know how to make it work. Hence the thread. That is probably effort better spent. I think the issues that you alluded to, what should be done in case of what error, is important to work out in detail and document in any case, because it will be the foundation of any of the other solutions.
Hi, On 2021-05-05 18:34:36 +0200, Magnus Hagander wrote: > Is this really a problem we should fix ourselves? Most daemon-managers > today will happily be configured to automatically restart a daemon on > failure with a single setting since a long time now. E.g. in systemd > (which most linuxen uses now) you just set Restart=on-failure (or > maybe even Restart=always) and something like RestartSec=10. I'm not convinced by this. For two main reasons: 1) Our own code can know a lot more about the different error types than we can signal to systemd. The retry timeouts for e.g. a connection failure (whatever) are different from those for fsync failing (alarm alarm). If we run out of space we might want to clean up space / invoke a command to do so, but there's nothing equivalent for systemd. 2) Do we really want to either implement at least 3 different ways to do this kind of thing, or force users to do it over and over again? That's not to say that there's no space for handling "unexpected" errors outside of postgres binaries, but I think it's pretty obvious that that doesn't cover somewhat predictable types of errors. And looking at the server side of things - it is *not* the same for systemd to restart postgres as for the postmaster to do so internally. The latter can hold onto shared memory. Which e.g. with simple huge_pages configurations is crucial, because it prevents other processes from using that shared memory. And it accelerates restart by a lot - the kernel needing to zero shared memory on first access (or allocation) can be a very significant penalty. Greetings, Andres Freund
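Point (1) might amount to something like this inside the tool itself, picking a retry strategy per error class in a way no generic daemon manager can; the error classes and timeouts here are illustrative assumptions:

```c
/* Sketch: classify errors and choose a retry policy per class.
 * The enum values and delays are invented for illustration. */
typedef enum
{
    RW_ERR_CONNECTION,          /* lost connection to the server */
    RW_ERR_DISK_FULL,           /* out of space: maybe clean up, then retry */
    RW_ERR_FSYNC                /* fsync failed: data may be gone */
} rw_error;

/* Returns seconds to wait before retrying, or -1 to give up. */
int
retry_delay(rw_error err)
{
    switch (err)
    {
        case RW_ERR_CONNECTION:
            return 5;           /* transient: reconnect soon */
        case RW_ERR_DISK_FULL:
            return 60;          /* give cleanup a chance first */
        case RW_ERR_FSYNC:
        default:
            return -1;          /* kernel may have dropped dirty data */
    }
}
```

systemd's Restart=/RestartSec= can only apply one policy to every exit, which is exactly the distinction being made here.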
Hi, On 2021-05-07 12:03:36 +0200, Magnus Hagander wrote: > It might be interesting for us as developers, but not to the vast > majority of our users. Most of those get their startup scripts from > our packagers -- so maybe we should encourage packagers to provide it, > like they do for PostgreSQL itself. I think that's the entirely wrong direction to go. A lot of the usability problems around postgres precisely stem from us doing this kind of thing, where the user experience then ends up wildly varying, incomplete and incomprehensible. That's not to say that we need to reimplement everything just for a consistent experience. But just punting crucial things like how archiving can be made reliable in the face of normal-ish errors, and how it can be monitored, is just going to further force people to move purely onto managed services. > Or maybe the better solution in that case would perhaps be to actually > bless one of the existing solutions out there by making it the > official one. Which existing system currently does provide an archiving solution that does not imply the very significant overhead of archive_command? Even if an archiving solution internally batches things, the fsyncs and filesystem metadata operations for .ready/.done files are a *significant* cost, and all the forks are not cheap either. Greetings, Andres Freund