pg_receivewal makes a bad daemon - Mailing list pgsql-hackers

From Robert Haas
Subject pg_receivewal makes a bad daemon
Msg-id CA+TgmobgkyqMOwn64_5t3qQ6GdhOOMFki5w9f6278NhU5r7=oA@mail.gmail.com
List pgsql-hackers
You might want to use pg_receivewal to save all of your WAL segments
somewhere instead of relying on archive_command. It has, at the least,
the advantage of working on the byte level rather than the segment
level. But it seems to me that it is not entirely suitable as a
substitute for archiving, for a couple of reasons. One is that as soon
as it runs into a problem, it exits, which is not really what you want
out of a daemon that's critical to the future availability of your
system. Another is that you can't monitor it aside from looking at
what it prints out, which is also not really what you want for a piece
of critical infrastructure.

The first problem seems somewhat more straightforward. Suppose we add
a new command-line option, perhaps --daemon but we can bikeshed. If
this option is specified, then it tries to keep going when it hits a
problem, rather than just giving up. There's some fuzziness in my mind
about exactly what this should mean. If the problem we hit is that we
lost the connection to the remote server, then we should try to
reconnect. But if the problem is something like a failure inside
open_walfile() or close_walfile(), like a failed open() or fsync() or
close() or something, it's a little less clear what to do. Maybe one
idea would be to have a parent process and a child process, where the
child process does all the work and the parent process just keeps
re-launching it if it dies. It's not entirely clear that this is a
suitable way of recovering from, say, an fsync() failure. Previous
discussions have claimed - and I might be exaggerating a bit here -
that there is essentially no way to recover from a failed fsync(),
because the kernel might have already thrown out your data and you
might as well just set the data center on fire. But perhaps a retry
system that can't cope with certain corner cases is better than not
having one at all, and perhaps we could revise the logic here and
there to have the process doing the work take some action other than
exiting when that's an intelligent approach.
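
As an illustration of the parent/child idea, here is a minimal sketch
in Python, not pg_receivewal code. The function name supervise(), its
parameters, and the exponential backoff policy are all assumptions made
for this example; nothing like it exists in pg_receivewal today. The
parent does no work itself and simply relaunches the child whenever it
exits with a nonzero status.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, base_delay=1.0, max_delay=60.0,
              sleep=time.sleep):
    """Run cmd, relaunching it whenever it exits with a nonzero status.

    Hypothetical helper for illustration only. Returns the number of
    times the child was launched. max_restarts=None means retry forever.
    """
    launches = 0
    delay = base_delay
    while True:
        launches += 1
        started = time.monotonic()
        status = subprocess.run(cmd).returncode
        if status == 0:
            # Clean exit: the child was asked to stop, so the parent stops too.
            return launches
        if max_restarts is not None and launches > max_restarts:
            # Give up after the initial launch plus max_restarts relaunches.
            return launches
        if time.monotonic() - started > max_delay:
            # The child ran for a while before failing; start backoff over.
            delay = base_delay
        print(f"child exited with status {status}; restarting in {delay:.1f}s",
              file=sys.stderr)
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

It could be called as supervise(["pg_receivewal", "-D", archive_dir]),
where archive_dir is whatever directory you are receiving into. Note
that this blindly retries after every kind of failure, including a
failed fsync(), so it has exactly the corner-case weakness described
above.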

The second problem is a bit more complex. If you were transferring WAL
to another PostgreSQL instance rather than to a frontend process, you
could log to some place other than standard output, like for example a
file, and you could periodically rotate that file, or alternatively
you could log to syslog or the Windows event log. Even better, you
could connect to PostgreSQL and run SQL queries against monitoring
views and see what results you get. If the existing monitoring views
don't give users what they need, we can improve them, but the whole
infrastructure needed for this kind of thing is altogether lacking for
any frontend program. It does not seem very appealing to reinvent log
rotation, connection management, and monitoring views inside
pg_receivewal, let alone in every frontend process where similar
monitoring might be useful. But at least for me, without such
capabilities, it is a little hard to take pg_receivewal seriously.
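
To make the gap concrete, here is a rough sketch of the kind of check
an outside monitor is left with when it cannot query the process
itself: looking at how recently anything in the target directory was
modified. The function names and the staleness threshold are
assumptions made for this example. It relies only on the fact that
pg_receivewal writes the segment in progress under a .partial suffix in
the target directory.

```python
import os
import time


def newest_activity(archive_dir, now=None):
    """Return (file name, seconds since last modification) for the most
    recently modified regular file in archive_dir, or None if it is empty.
    """
    now = time.time() if now is None else now
    newest = None
    with os.scandir(archive_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            mtime = entry.stat().st_mtime
            if newest is None or mtime > newest[1]:
                newest = (entry.name, mtime)
    if newest is None:
        return None
    return newest[0], now - newest[1]


def looks_stalled(archive_dir, max_age_seconds, now=None):
    """Guess whether the receiver has stopped writing.

    An empty directory, or one with no write newer than max_age_seconds,
    is reported as stalled.
    """
    activity = newest_activity(archive_dir, now)
    return activity is None or activity[1] > max_age_seconds
```

A check like this cannot tell a dead receiver from an idle server that
is simply not generating WAL, which is part of why it is no substitute
for real monitoring views.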

I wonder first of all whether other people agree with these concerns,
and secondly what they think we ought to do about it. One option is to
do nothing. This could be based either on the idea that pg_receivewal
is hopeless, or else on the idea that pg_receivewal can be restarted
by some external system when required and monitored well enough as
things stand. A second option is to start building out capabilities in
pg_receivewal to turn it into something closer to what you'd expect of
a normal daemon, with the addition of a retry capability as probably
the easiest improvement. A third option is to somehow move towards a
world where you can use the server to move WAL around even if you
don't really want to run the server. Imagine a server running with no
data directory and only a minimal set of running processes, just (1) a
postmaster and (2) a walreceiver that writes to an archive directory
and (3) non-database-connected backends that are just smart enough to
handle queries for status information. This has the same problem that
I mentioned on the thread about monitoring the recovery process,
namely that we haven't got pg_authid. But against that, you get a lot
of infrastructure for free: configuration files, process management,
connection management, an existing wire protocol, memory contexts,
rich error reporting, etc.

I am curious to hear what other people think about the usefulness (or
lack thereof) of pg_receivewal as things stand today, as well as ideas
about future direction.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


