Thread: pg_receivewal makes a bad daemon

pg_receivewal makes a bad daemon

From
Robert Haas
Date:
You might want to use pg_receivewal to save all of your WAL segments
somewhere instead of relying on archive_command. It has, at the least,
the advantage of working on the byte level rather than the segment
level. But it seems to me that it is not entirely suitable as a
substitute for archiving, for a couple of reasons. One is that as soon
as it runs into a problem, it exits, which is not really what you want
out of a daemon that's critical to the future availability of your
system. Another is that you can't monitor it aside from looking at
what it prints out, which is also not really what you want for a piece
of critical infrastructure.

The first problem seems somewhat more straightforward. Suppose we add
a new command-line option, perhaps --daemon but we can bikeshed. If
this option is specified, then it tries to keep going when it hits a
problem, rather than just giving up. There's some fuzziness in my mind
about exactly what this should mean. If the problem we hit is that we
lost the connection to the remote server, then we should try to
reconnect. But if the problem is something like a failure inside
open_walfile() or close_walfile(), like a failed open() or fsync() or
close() or something, it's a little less clear what to do. Maybe one
idea would be to have a parent process and a child process, where the
child process does all the work and the parent process just keeps
re-launching it if it dies. It's not entirely clear that this is a
suitable way of recovering from, say, an fsync() failure, given
previous discussions claiming that - and I might be exaggerating a bit
here - there is essentially no way to recover from a failed fsync()
because the kernel might have already thrown out your data and you
might as well just set the data center on fire - but perhaps a retry
system that can't cope with certain corner cases is better than not
having one at all, and perhaps we could revise the logic here and
there to have the process doing the work take some action other than
exiting when that's an intelligent approach.
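The parent/child idea above can be sketched roughly as follows. This is only a minimal illustration, not a proposed implementation: the function name supervise() and its parameters are made up for this example, and the parent sees nothing but the child's exit status, which is exactly why it cannot treat a failed fsync() differently from a lost connection.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, delay_seconds=1.0):
    """Run cmd as a child process and relaunch it whenever it exits non-zero.

    Returns the number of times the child was started. Stops as soon as the
    child exits with status 0, or once max_restarts relaunches are used up.
    """
    starts = 0
    while True:
        starts += 1
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return starts          # clean exit: nothing to recover from
        if starts > max_restarts:
            return starts          # give up rather than loop forever
        # The parent only sees the exit status, so every kind of failure
        # (lost connection, failed open(), failed fsync()) looks the same.
        time.sleep(delay_seconds)


# Hypothetical usage; the directory and slot name are placeholders:
#   supervise(["pg_receivewal", "-D", "/path/to/archive", "-S", "some_slot"])
```

A real version would presumably also want to forward signals to the child and log each restart, which is where the complexity starts to grow.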

The second problem is a bit more complex. If you were transferring WAL
to another PostgreSQL instance rather than to a frontend process, you
could log to some place other than standard output, like for example a
file, and you could periodically rotate that file, or alternatively
you could log to syslog or the Windows event log. Even better, you
could connect to PostgreSQL and run SQL queries against monitoring
views and see what results you get. If the existing monitoring views
don't give users what they need, we can improve them, but the whole
infrastructure needed for this kind of thing is altogether lacking for
any frontend program. It does not seem very appealing to reinvent log
rotation, connection management, and monitoring views inside
pg_receivewal, let alone in every frontend process where similar
monitoring might be useful. But at least for me, without such
capabilities, it is a little hard to take pg_receivewal seriously.

I wonder first of all whether other people agree with these concerns,
and secondly what they think we ought to do about it. One option is -
do nothing. This could be based either on the idea that pg_receivewal
is hopeless, or else on the idea that pg_receivewal can be restarted
by some external system when required and monitored well enough as
things stand. A second option is to start building out capabilities in
pg_receivewal to turn it into something closer to what you'd expect of
a normal daemon, with the addition of a retry capability as probably
the easiest improvement. A third option is to somehow move towards a
world where you can use the server to move WAL around even if you
don't really want to run the server. Imagine a server running with no
data directory and only a minimal set of running processes, just (1) a
postmaster and (2) a walreceiver that writes to an archive directory
and (3) non-database-connected backends that are just smart enough to
handle queries for status information. This has the same problem that
I mentioned on the thread about monitoring the recovery process,
namely that we haven't got pg_authid. But against that, you get a lot
of infrastructure for free: configuration files, process management,
connection management, an existing wire protocol, memory contexts,
rich error reporting, etc.

I am curious to hear what other people think about the usefulness (or
lack thereof) of pg_receivewal as things stand today, as well as ideas
about future direction.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_receivewal makes a bad daemon

From
Laurenz Albe
Date:
On Wed, 2021-05-05 at 11:04 -0400, Robert Haas wrote:
> You might want to use pg_receivewal to save all of your WAL segments
> somewhere instead of relying on archive_command. It has, at the least,
> the advantage of working on the byte level rather than the segment
> level. But it seems to me that it is not entirely suitable as a
> substitute for archiving, for a couple of reasons. One is that as soon
> as it runs into a problem, it exits, which is not really what you want
> out of a daemon that's critical to the future availability of your
> system. Another is that you can't monitor it aside from looking at
> what it prints out, which is also not really what you want for a piece
> of critical infrastructure.
> 
> The first problem seems somewhat more straightforward. Suppose we add
> a new command-line option, perhaps --daemon but we can bikeshed. If
> this option is specified, then it tries to keep going when it hits a
> problem, rather than just giving up.  [...]

That sounds like a good idea.

I don't know what it takes to make that perfect (if such a thing exists),
but simply trying to re-establish database connections and dying when
we hit an I/O problem seems like a clear improvement.

> The second problem is a bit more complex.  [...]

If I wanted to monitor pg_receivewal, I'd have it use a replication
slot and monitor "pg_replication_slots" on the primary.  That way I see
if there is a WAL sender process, and I can measure the lag in bytes.

What more could you want?
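As a rough sketch of that kind of check: the query below reads the slot's state from pg_replication_slots, and the helper functions turn two textual LSNs into a lag in bytes on the client side. The slot name 'archive_slot', the function names, and the health rule are assumptions made for this example; on the server, pg_wal_lsn_diff() can do the same subtraction.

```python
# Query one might run on the primary; 'archive_slot' is a placeholder name.
SLOT_QUERY = """
SELECT slot_name, active, active_pid, restart_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_name = 'archive_slot'
"""


def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_lsn, slot_lsn):
    """Bytes of WAL the primary has generated beyond the slot's position."""
    return lsn_to_int(current_lsn) - lsn_to_int(slot_lsn)


def slot_is_healthy(active, current_lsn, slot_lsn, max_lag_bytes):
    """True when a WAL sender is attached and the lag is within the limit."""
    return bool(active) and lag_bytes(current_lsn, slot_lsn) <= max_lag_bytes
```

This only covers the primary's view of the slot; it says nothing about why the receiving side stopped, if it did.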

Yours,
Laurenz Albe




Re: pg_receivewal makes a bad daemon

From
Magnus Hagander
Date:
On Wed, May 5, 2021 at 5:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> You might want to use pg_receivewal to save all of your WAL segments
> somewhere instead of relying on archive_command. It has, at the least,
> the advantage of working on the byte level rather than the segment
> level. But it seems to me that it is not entirely suitable as a
> substitute for archiving, for a couple of reasons. One is that as soon
> as it runs into a problem, it exits, which is not really what you want
> out of a daemon that's critical to the future availability of your
> system. Another is that you can't monitor it aside from looking at
> what it prints out, which is also not really what you want for a piece
> of critical infrastructure.
>
> The first problem seems somewhat more straightforward. Suppose we add
> a new command-line option, perhaps --daemon but we can bikeshed. If
> this option is specified, then it tries to keep going when it hits a
> problem, rather than just giving up. There's some fuzziness in my mind
> about exactly what this should mean. If the problem we hit is that we
> lost the connection to the remote server, then we should try to
> reconnect. But if the problem is something like a failure inside
> open_walfile() or close_walfile(), like a failed open() or fsync() or
> close() or something, it's a little less clear what to do. Maybe one
> idea would be to have a parent process and a child process, where the
> child process does all the work and the parent process just keeps
> re-launching it if it dies. It's not entirely clear that this is a
> suitable way of recovering from, say, an fsync() failure, given
> previous discussions claiming that - and I might be exaggerating a bit
> here - there is essentially no way to recover from a failed fsync()
> because the kernel might have already thrown out your data and you
> might as well just set the data center on fire - but perhaps a retry
> system that can't cope with certain corner cases is better than not
> having one at all, and perhaps we could revise the logic here and
> there to have the process doing the work take some action other than
> exiting when that's an intelligent approach.

Is this really a problem we should fix ourselves? Most daemon-managers
today can be configured to automatically restart a daemon on failure
with a single setting, and have been able to for a long time. E.g. in
systemd (which most linuxen use now) you just set Restart=on-failure
(or maybe even Restart=always) and something like RestartSec=10.
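For illustration, a unit file using those settings might look something like the sketch below. The binary path, archive directory, slot name, user, and connection string are all placeholders, not recommendations; only Restart= and RestartSec= are the settings discussed here.

```ini
[Unit]
Description=pg_receivewal WAL streaming (illustrative sketch)
Wants=network-online.target
After=network-online.target

[Service]
# Placeholder user, paths, slot name and connection string.
User=postgres
ExecStart=/usr/bin/pg_receivewal -D /var/lib/pgarchive -S archive_slot -d "host=primary.example.com user=replicator"
# Restart whenever the process exits with a failure, after a 10 second pause.
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Output written to standard error ends up in the journal, which is where the log handling mentioned further down would come from.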

That said, it wouldn't cover an fsync() error -- they will always
restart. The way to handle that is for the operator to capture the
error message perhaps, and just "deal with it"?

What could be more interesting there in a "systemd world" would be to
add watchdog support. That'd obviously only be interesting on systemd
platforms, but we already have some of that basic notification support
in the postmaster for those.
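To give an idea of what watchdog support involves on the daemon side, here is a minimal sketch of the notification protocol: the service manager passes a socket address in the NOTIFY_SOCKET environment variable, and the daemon periodically sends a "WATCHDOG=1" datagram to it. The function names are made up for this example, and pg_receivewal does nothing like this today.

```python
import os
import socket


def sd_notify(message):
    """Send a notification datagram to the service manager, if there is one.

    Returns False when NOTIFY_SOCKET is unset (not running under systemd).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]     # leading '@' means an abstract socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), path)
    return True


def report_progress():
    """Call after each successful unit of work, e.g. a flushed WAL write."""
    return sd_notify("WATCHDOG=1")
```

With WatchdogSec= set in the unit, the service manager would then treat a daemon that stops sending these messages as hung, which is a failure mode a plain restart-on-exit policy does not catch.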

> The second problem is a bit more complex. If you were transferring WAL
> to another PostgreSQL instance rather than to a frontend process, you
> could log to some place other than standard output, like for example a
> file, and you could periodically rotate that file, or alternatively
> you could log to syslog or the Windows event log. Even better, you
> could connect to PostgreSQL and run SQL queries against monitoring
> views and see what results you get. If the existing monitoring views
> don't give users what they need, we can improve them, but the whole
> infrastructure needed for this kind of thing is altogether lacking for
> any frontend program. It does not seem very appealing to reinvent log
> rotation, connection management, and monitoring views inside
> pg_receivewal, let alone in every frontend process where similar
> monitoring might be useful. But at least for me, without such
> capabilities, it is a little hard to take pg_receivewal seriously.

Again, isn't this the job of the daemon runner? At least in cases
where it's not Windows :)? That is, taking the output and putting it
in a log, and interfacing with log rotation.

Now, having some sort of statistics *other* than parsing a log would
definitely be useful. But perhaps that could be something as simple as
having a --statsfile=/foo/bar parameter and then updating that file at
regular intervals with "whatever is the current state"?
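A rough sketch of how such a stats file could be written and consumed follows. Note that --statsfile is only the hypothetical option floated above and does not exist; the JSON layout, the "updated_at" field, and the function names are likewise assumptions for this example. The writer replaces the file atomically so a monitoring script never reads a half-written state.

```python
import json
import os
import tempfile
import time


def write_stats(path, stats):
    """Atomically replace the stats file so readers never see a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".stats-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(stats, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise


def read_stats(path, max_age_seconds):
    """Return the stats dict, or None if the file is missing or stale."""
    try:
        with open(path) as f:
            stats = json.load(f)
    except FileNotFoundError:
        return None
    if time.time() - stats["updated_at"] > max_age_seconds:
        return None
    return stats
```

Treating a stale file the same as a missing one matters here: if the process has died, the last state it wrote would otherwise keep looking healthy.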

And of course, the other point to monitor is the replication slot on
the server it's connected to -- but I agree that being able to monitor
both sides there would be good.


> I wonder first of all whether other people agree with these concerns,
> and secondly what they think we ought to do about it. One option is -
> do nothing. This could be based either on the idea that pg_receivewal
> is hopeless, or else on the idea that pg_receivewal can be restarted
> by some external system when required and monitored well enough as
> things stand. A second option is to start building out capabilities in
> pg_receivewal to turn it into something closer to what you'd expect of
> a normal daemon, with the addition of a retry capability as probably
> the easiest improvement. A third option is to somehow move towards a
> world where you can use the server to move WAL around even if you
> don't really want to run the server. Imagine a server running with no
> data directory and only a minimal set of running processes, just (1) a
> postmaster and (2) a walreceiver that writes to an archive directory
> and (3) non-database-connected backends that are just smart enough to
> handle queries for status information. This has the same problem that
> I mentioned on the thread about monitoring the recovery process,
> namely that we haven't got pg_authid. But against that, you get a lot
> of infrastructure for free: configuration files, process management,
> connection management, an existing wire protocol, memory contexts,
> rich error reporting, etc.
>
> I am curious to hear what other people think about the usefulness (or
> lack thereof) of pg_receivewal as things stand today, as well as ideas
> about future direction.

Per above, I'm thinking maybe our efforts are better directed at
documenting ways to do it now?

Also, all of the above applies to pg_recvlogical too, right? So if we do
want to invent our own daemon-init-system, we should probably do one
more generic that can handle both.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: pg_receivewal makes a bad daemon

From
Robert Haas
Date:
On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote:
> Is this really a problem we should fix ourselves? Most daemon-managers
> today can be configured to automatically restart a daemon on failure
> with a single setting, and have been able to for a long time. E.g. in
> systemd (which most linuxen use now) you just set Restart=on-failure
> (or maybe even Restart=always) and something like RestartSec=10.
>
> That said, it wouldn't cover an fsync() error -- they will always
> restart. The way to handle that is for the operator to capture the
> error message perhaps, and just "deal with it"?

Maybe, but if that's really a non-problem, why does postgres itself
restart, and have facilities to write and rotate log files? I feel
like this argument boils down to "a manual transmission ought to be
good enough for anyone, let's not have automatics." But over the years
people have found that automatics are a lot easier to drive. It may be
true that if you know just how to configure your system's daemon
manager, you can make all of this work, but it's not like we document
how to do any of that, and it's probably not the same on every
platform - Windows in particular - and, really, why should people have
to do this much work? If I want to run postgres in the background I
can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
crontab to make sure it gets restarted within a few minutes even if
the postmaster dies. If I want to keep pg_receivewal running all the
time ... I need a whole pile of extra mechanism to work around its
inherent fragility. Documenting how that's typically done on modern
systems, as you propose further on, would be great, but I can't do it,
because I don't know how to make it work. Hence the thread.

> Also, all of the above applies to pg_recvlogical too, right? So if we do
> want to invent our own daemon-init-system, we should probably do one
> more generic that can handle both.

Yeah. And I'm not really 100% convinced that trying to patch this
functionality into pg_receive{wal,logical} is the best way forward ...
but I'm not entirely convinced that it isn't, either. I think one of
the basic problems with trying to deploy PostgreSQL in 2021 is that it
needs so much supporting infrastructure and so much babysitting.
archive_command has to be a complicated, almost magical program we
don't provide, and we don't even tell you in the documentation that
you need it. If you don't want to use that, you can stream with
pg_receivewal instead, but now you need a complicated daemon-runner
mechanism that we don't provide or document the need for. You also
probably need a connection pooler that we don't provide, a failover
manager that we don't provide, and backup management software that we
don't provide. And the interfaces that those tools have to work with
are so awkward and primitive that even the tool authors can't always
get it right. So I'm sort of unimpressed by any arguments that boil
down to "what we have is good enough" or "that's the job of some other
piece of software". Too many things are the job of some piece of
software that doesn't really exist, or is only available on certain
platforms, or that has some other problem that makes it not usable for
everyone. People want to be able to download and use PostgreSQL
without needing a whole library of other bits and pieces from around
the Internet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_receivewal makes a bad daemon

From
David Fetter
Date:
On Wed, May 05, 2021 at 01:12:03PM -0400, Robert Haas wrote:
> On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Is this really a problem we should fix ourselves? Most daemon-managers
> > today can be configured to automatically restart a daemon on failure
> > with a single setting, and have been able to for a long time. E.g. in
> > systemd (which most linuxen use now) you just set Restart=on-failure
> > (or maybe even Restart=always) and something like RestartSec=10.
> >
> > That said, it wouldn't cover an fsync() error -- they will always
> > restart. The way to handle that is for the operator to capture the
> > error message perhaps, and just "deal with it"?
> 
> Maybe, but if that's really a non-problem, why does postgres itself
> restart, and have facilities to write and rotate log files? I feel
> like this argument boils down to "a manual transmission ought to be
> good enough for anyone, let's not have automatics." But over the years
> people have found that automatics are a lot easier to drive. It may be
> true that if you know just how to configure your system's daemon
> manager, you can make all of this work, but it's not like we document
> how to do any of that, and it's probably not the same on every
> platform - Windows in particular - and, really, why should people have
> to do this much work? If I want to run postgres in the background I
> can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
> crontab to make sure it gets restarted within a few minutes even if
> the postmaster dies. If I want to keep pg_receivewal running all the
> time ... I need a whole pile of extra mechanism to work around its
> inherent fragility. Documenting how that's typically done on modern
> systems, as you propose further on, would be great, but I can't do it,
> because I don't know how to make it work. Hence the thread.
> 
> > Also, all of the above applies to pg_recvlogical too, right? So if we do
> > want to invent our own daemon-init-system, we should probably do one
> > more generic that can handle both.
> 
> Yeah. And I'm not really 100% convinced that trying to patch this
> functionality into pg_receive{wal,logical} is the best way forward ...
> but I'm not entirely convinced that it isn't, either. I think one of
> the basic problems with trying to deploy PostgreSQL in 2021 is that it
> needs so much supporting infrastructure and so much babysitting.
> archive_command has to be a complicated, almost magical program we
> don't provide, and we don't even tell you in the documentation that
> you need it. If you don't want to use that, you can stream with
> pg_receivewal instead, but now you need a complicated daemon-runner
> mechanism that we don't provide or document the need for. You also
> probably need a connection pooler that we don't provide, a failover
> manager that we don't provide, and backup management software that we
> don't provide. And the interfaces that those tools have to work with
> are so awkward and primitive that even the tool authors can't always
> get it right. So I'm sort of unimpressed by any arguments that boil
> down to "what we have is good enough" or "that's the job of some other
> piece of software". Too many things are the job of some piece of
> software that doesn't really exist, or is only available on certain
> platforms, or that has some other problem that makes it not usable for
> everyone. People want to be able to download and use PostgreSQL
> without needing a whole library of other bits and pieces from around
> the Internet.

We do use at least one bit and piece from around the internet to make
our software usable, namely libreadline, the absence of which makes
psql pretty much unusable.

That out of the way, am I understanding correctly that you're
proposing that we make tools for daemon-izing, logging, connection
management, and failover, and ship same with PostgreSQL? I can see the
appeal for people shipping proprietary forks of PostgreSQL,
especially ones under restrictive licenses, and I guess we could make
a pretty good case for continuing to center those interests as we have
since the Berkeley days.  Rather than, or maybe as a successor to,
wiring such things into each tool we ship that requires them, I'd
picture something along the lines of .sos that could then be
repurposed, modified, etc., as we provide with the distribution as it
is now.

Another possibility would be to look around for mature capabilities
that are cross-platform in the sense that they work on all the
platforms we do.  While I don't think it's likely we'd find them for
all the above use cases under compatible licenses, it's probably worth
a look. At worst, we'd get some idea of how (not) to design the APIs
to them.

I'm going to guess that anything with an incompatible license will
upset people who are accustomed to ensuring that we have what legally
amounts to an MIT license clean distribution, but I'm thinking that
option is at least worth discussing, even if the immediate consensus
is, "libreadline is bad enough. We went to a lot of trouble to purge
that other stuff back in the bad old days. Let's not make that mistake
again."

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pg_receivewal makes a bad daemon

From
Robert Haas
Date:
On Wed, May 5, 2021 at 10:42 PM David Fetter <david@fetter.org> wrote:
> We do use at least one bit and piece from around the internet to make
> our software usable, namely libreadline, the absence of which makes
> psql pretty much unusable.

I'm not talking about dependent libraries. We obviously have to depend
on some external libraries; it would be crazy to write our own
versions of libreadline, zlib, glibc, and everything else we use.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_receivewal makes a bad daemon

From
Magnus Hagander
Date:
On Wed, May 5, 2021 at 7:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 5, 2021 at 12:34 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Is this really a problem we should fix ourselves? Most daemon-managers
> > today can be configured to automatically restart a daemon on failure
> > with a single setting, and have been able to for a long time. E.g. in
> > systemd (which most linuxen use now) you just set Restart=on-failure
> > (or maybe even Restart=always) and something like RestartSec=10.
> >
> > That said, it wouldn't cover an fsync() error -- they will always
> > restart. The way to handle that is for the operator to capture the
> > error message perhaps, and just "deal with it"?
>
> Maybe, but if that's really a non-problem, why does postgres itself
> restart, and have facilities to write and rotate log files? I feel
> like this argument boils down to "a manual transmission ought to be
> good enough for anyone, let's not have automatics." But over the years
> people have found that automatics are a lot easier to drive. It may be
> true that if you know just how to configure your system's daemon
> manager, you can make all of this work, but it's not like we document
> how to do any of that, and it's probably not the same on every
> platform - Windows in particular - and, really, why should people have
> to do this much work? If I want to run postgres in the background I
> can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
> crontab to make sure it gets restarted within a few minutes even if
> the postmaster dies. If I want to keep pg_receivewal running all the
> time ... I need a whole pile of extra mechanism to work around its
> inherent fragility. Documenting how that's typically done on modern
> systems, as you propose further on, would be great, but I can't do it,
> because I don't know how to make it work. Hence the thread.

If PostgreSQL was built today, I'm not sure we would've built that
functionality TBH.

The vast majority of people are not interested in manually starting
postgres and then putting in a crontab to "restart it if it fails".
That's not how anybody runs a server and hasn't for a long time.

It might be interesting for us as developers, but not to the vast
majority of our users. Most of those get their startup scripts from
our packagers -- so maybe we should encourage packagers to provide it,
like they do for PostgreSQL itself. But I don't think adding log
rotations and other independent functionality to pg_receivexyz would
help almost anybody in our user base.

In relation to the other thread about pid 1 handling and containers --
if anything, I bet a larger portion of our users would be interested
in running pg_receivewal in a dedicated container, than would want to
start it manually and verify it's running using crontab... By a large
margin.

It is true that Windows is a special case in this. But it is, I'd say,
equally true that adding something akin to "pg_ctl start" for
pg_receivewal would be equally useless on Windows.

We can certainly build and add such functionality. But my feeling is
that it's going to be added complexity for very little practical gain.
Much of the server world has moved to "we don't want every single daemon
to implement it its own way, ever so slightly different".

I like your car analogy though. But I'd consider it more like "we used
to have to mix the right amount of oil into the gasoline manually. But
modern engines don't really require us to do that anymore, so most
people have stopped, only those who want very special cars do". Or
something along that line. (Reality is probably somewhere in between,
and I suck at car analogies)


> > Also, all of the above applies to pg_recvlogical too, right? So if we do
> > want to invent our own daemon-init-system, we should probably do one
> > more generic that can handle both.
>
> Yeah. And I'm not really 100% convinced that trying to patch this
> functionality into pg_receive{wal,logical} is the best way forward ...

It does in a lot of ways amount to basically a daemon-init system. It
might be easier to just vendor one of the existing ones :) Or more
realistically, suggest they use something that's already on their
system. On linux that'll be systemd, on *bsd it'll probably be
something like supervisord, on mac it'll be launchd. But this is
really more a function of the operating system/distribution.

Windows is again the one that stands out. But PostgreSQL *already*
does a pretty weak job of solving that problem on Windows, so
duplicating that is not that strong a win.


> but I'm not entirely convinced that it isn't, either. I think one of
> the basic problems with trying to deploy PostgreSQL in 2021 is that it
> needs so much supporting infrastructure and so much babysitting.
> archive_command has to be a complicated, almost magical program we
> don't provide, and we don't even tell you in the documentation that
> you need it. If you don't want to use that, you can stream with
> pg_receivewal instead, but now you need a complicated daemon-runner
> mechanism that we don't provide or document the need for. You also
> probably need a connection pooler that we don't provide, a failover
> manager that we don't provide, and backup management software that we
> don't provide. And the interfaces that those tools have to work with
> are so awkward and primitive that even the tool authors can't always
> get it right. So I'm sort of unimpressed by any arguments that boil
> down to "what we have is good enough" or "that's the job of some other
> piece of software". Too many things are the job of some piece of
> software that doesn't really exist, or is only available on certain
> platforms, or that has some other problem that makes it not usable for
> everyone. People want to be able to download and use PostgreSQL
> without needing a whole library of other bits and pieces from around
> the Internet.

I definitely don't think what we have is good enough, and I agree with
your general description of the problem.

I just don't think turning a simple tool into a more complicated
daemon is going to help with that in any material way. You still
need some sort of *backup management* on that side, otherwise your
pg_receivewal will now be the one that fills your disk along with the
outputs of your pg_basebackups. So we'd be better off providing that
management tool, which could then drive the lower level tools as
necessary.

Or maybe the better solution in that case would perhaps be to actually
bless one of the existing solutions out there by making it the
official one.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: pg_receivewal makes a bad daemon

From
Magnus Hagander
Date:
On Thu, May 6, 2021 at 5:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 5, 2021 at 10:42 PM David Fetter <david@fetter.org> wrote:
> > We do use at least one bit and piece from around the internet to make
> > our software usable, namely libreadline, the absence of which makes
> > psql pretty much unusable.

FWIW, we did go with the idea of using readline. Which doesn't work
properly on Windows. So this is an excellent example of how we're
already not solving the problem for Windows users, but are apparently
OK with it in this case.


> I'm not talking about dependent libraries. We obviously have to depend
> on some external libraries; it would be crazy to write our own
> versions of libreadline, zlib, glibc, and everything else we use.

Why is that more crazy than building our own limited version of
supervisord? readline and glibc might be one thing, but zlib (at least
the parts we use) is probably less complex than building our own cross
platform daemon-management.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: pg_receivewal makes a bad daemon

From
Peter Eisentraut
Date:
On 05.05.21 19:12, Robert Haas wrote:
> Maybe, but if that's really a non-problem, why does postgres itself
> restart, and have facilities to write and rotate log files?

I think because those were invented at a time when the operating system 
facilities were less useful.  And the log management facilities aren't 
even very good, because there is no support for remote logging.

> It may be
> true that if you know just how to configure your system's daemon
> manager, you can make all of this work, but it's not like we document
> how to do any of that, and it's probably not the same on every
> platform - Windows in particular - and, really, why should people have
> to do this much work? If I want to run postgres in the background I
> can just type 'pg_ctl start'.

Not really a solution, because systemd will kill it when you log out.

> Documenting how that's typically done on modern
> systems, as you propose further on, would be great, but I can't do it,
> because I don't know how to make it work. Hence the thread.

That is probably effort better spent.

I think the issues that you alluded to, what should be done in case of 
what error, is important to work out in detail and document in any case, 
because it will be the foundation of any of the other solutions.



Re: pg_receivewal makes a bad daemon

From
Andres Freund
Date:
Hi,

On 2021-05-05 18:34:36 +0200, Magnus Hagander wrote:
> Is this really a problem we should fix ourselves? Most daemon-managers
> today can be configured to automatically restart a daemon on failure
> with a single setting, and have been able to for a long time. E.g. in
> systemd (which most linuxen use now) you just set Restart=on-failure
> (or maybe even Restart=always) and something like RestartSec=10.

I'm not convinced by this. For two main reasons:

1) Our own code can know a lot more about the different error types than
   we can signal to systemd. The retry timeout for e.g. a connection
   failure (whatever) is different from that for fsync failing (alarm
   alarm). If we run out of space we might want to clean up space /
   invoke a command to do so, but there's nothing equivalent for
   systemd.

2) Do we really want to either implement at least 3 different ways to do
   this kind of thing, or force users to do it over and over again?
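The distinction drawn in point 1 can be sketched as a small policy function that maps the kind of error to a recovery action. The Action type, the action names, and the delays are invented for this illustration; the point is only that the process itself can tell these cases apart, while an external restarter sees a single exit status.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str                       # "retry", "cleanup_then_retry" or "stop"
    delay_seconds: Optional[float]  # None when no retry is planned


def decide(exc):
    """Map an exception to a recovery action (illustrative policy only)."""
    if isinstance(exc, ConnectionError):
        # Lost connection to the server: harmless, reconnect after a pause.
        return Action("retry", 5.0)
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        # Out of disk space: give a cleanup command a chance to run first.
        return Action("cleanup_then_retry", 30.0)
    # Anything else, including an I/O error from fsync(), is not retried
    # blindly; stop and leave the decision to the operator.
    return Action("stop", None)
```

Whether stopping is really the right answer to a failed fsync() is of course part of what would need to be worked out.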

That's not to say that there's no space for handling "unexpected" errors
outside of postgres binaries, but I think it's pretty obvious that that
doesn't cover somewhat predictable types of errors.


And looking at the server side of things - it is *not* the same for
systemd to restart postgres, as postmaster doing so internally. The
latter can hold on to shared memory, which e.g. with simple huge_pages
configurations is crucial, because it prevents other processes from
using that memory. And it accelerates restart by a lot - the kernel
needing to zero shared memory on first access (or allocation) can be a
very significant penalty.

Greetings,

Andres Freund



Re: pg_receivewal makes a bad daemon

From
Andres Freund
Date:
Hi,

On 2021-05-07 12:03:36 +0200, Magnus Hagander wrote:
> It might be interesting for us as developers, but not to the vast
> majority of our users. Most of those get their startup scripts from
> our packagers -- so maybe we should encourage packagers to provide it,
> like they do for PostgreSQL itself.

I think that's entirely the wrong direction to go. A lot of the
usability problems around postgres precisely stem from us doing this
kind of thing, where the user experience then ends up wildly varying,
incomplete and incomprehensible.

That's not to say that we need to reimplement everything just for a
consistent experience. But punting on crucial things, like how
archiving can be made reliable in the face of normal-ish errors and how
it can be monitored, is just going to further force people to move
purely onto managed services.


> Or maybe the better solution in that case would perhaps be to actually
> bless one of the existing solutions out there by making it the
> official one.

Which existing system currently does provide an archiving solution that
does not imply the very significant overhead of archive_command? Even if
an archiving solution internally batches things, the fsyncs and
filesystem metadata operations for the .ready and .done files are a
*significant* cost, and all the forks are not cheap either.

Greetings,

Andres Freund