Thread: relaying errors from background workers

relaying errors from background workers

From
Robert Haas
Date:
Suppose a user backend starts a background worker for some purpose;
the background worker dies with an error.  The infrastructure we have
today is sufficient for the user backend to discover that the worker
backend has died, but not why.  There might be an error in the server
log, but the error information won't be transmitted back to the user
backend in any way.  I think it would be nice to fix that.  I also
think it would be nice to be able to relay not only errors, but also
messages logged via ereport() or elog() at lower log levels (WARNING,
NOTICE, INFO, DEBUG).

The design I have in mind is to teach elog.c how to write such
messages to a shm_mq.  This is in fact one of the major use cases I
had in mind when I designed the shm_mq infrastructure, because it
seems to me that almost anything we want to do in parallel is likely
to want to do this.  Even aside from parallelism, it's not too hard to
imagine wanting to use background workers to launch a job in the
background and then come back later and see what happened.  If there
was an error, you're going to want to know specifically what went
wrong, not just that something went wrong.

The main thing I'm not sure about is how to format the message that we
write to the shm_mq.  One option is to try to use the good old FEBE
protocol.  This doesn't look entirely straightforward, because
send_message_to_frontend() assembles the message using pq_sendbyte()
and pq_sendstring(), and then sends it out to the client using
pq_endmessage() and pq_flush().  This infrastructure assumes that the
connection to the frontend is a socket.  It doesn't seem impossible to
kludge that infrastructure to be able to send to either a socket or a
shm_mq, but I'm not sure whether it's a good idea.  Alternatively, we
could devise some other message format specific to this problem; it
would probably look a lot like an ErrorData protocol message, but
maybe that doesn't really matter.  Any thoughts?
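
To make the second option concrete, here is a minimal sketch of a flat,
pointer-free encoding in the style of the error message body that the server
sends to clients: each field is a one-byte type code followed by a
NUL-terminated string, and a single zero byte ends the list.  The field codes
'S' (severity), 'C' (SQLSTATE) and 'M' (primary message) are the ones the wire
protocol documents.  Everything else is an assumption made for illustration:
FlatMsg, flat_put_field() and flat_build_error() are hypothetical names, the
fixed 512-byte buffer merely stands in for whatever would be written to the
shm_mq, and the leading message-type byte and length word are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat buffer; a real design would hand this to a shm_mq. */
typedef struct FlatMsg
{
    char   data[512];
    size_t len;
} FlatMsg;

/* Append one field: a one-byte type code, then a NUL-terminated string. */
static int
flat_put_field(FlatMsg *m, char code, const char *value)
{
    size_t n = strlen(value) + 1;   /* include the terminating NUL */

    assert(code != '\0');           /* zero is reserved as the list terminator */

    /* Keep one byte in reserve for the final terminator. */
    if (m->len + 1 + n + 1 > sizeof(m->data))
        return -1;
    m->data[m->len++] = code;
    memcpy(m->data + m->len, value, n);
    m->len += n;
    return 0;
}

/* Build a body: severity, SQLSTATE, primary message, then a zero byte. */
static int
flat_build_error(FlatMsg *m, const char *severity,
                 const char *sqlstate, const char *message)
{
    m->len = 0;
    if (flat_put_field(m, 'S', severity) != 0 ||
        flat_put_field(m, 'C', sqlstate) != 0 ||
        flat_put_field(m, 'M', message) != 0)
        return -1;
    m->data[m->len++] = '\0';       /* terminator: no more fields */
    return 0;
}
```

A sender would call something like flat_build_error(&m, "ERROR", "22012",
"division by zero") and then pass m.data and m.len to the queue.  Because the
buffer holds no pointers, it means the same thing in whichever process reads it.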

A third alternative is to say, OK, we really ought to have an actual
socket connection between those backends, so that using FEBE just
works.  I don't think that's a good idea.  It would require passing a
socket descriptor from the user backend up to the postmaster and then
back down to the background worker, or else using some kind of FIFO.
That's a set of portability problems I'd rather not deal with.  I
think the shm_mq infrastructure is also better in that it gives us the
ability to use a very large queue size if, for example, we discover
that we need that in order to avoid having the client block because
the queue is full.  Socket buffer sizes can be adjusted at the OS
level, of course, but the details are different on every platform and
the upper limits tend not to be too large.

Another thing to think about is that, most likely, many users of the
background worker facility will want to respond to a relayed error by
rethrowing it.  That means that whatever format we use to send the
error from one process to the other has to be able to be decoded by
the receiving process.  That process will probably want to do
something like add a bit more to the context (e.g. "in background
worker PID %d") and throw the resulting error preserving the rest of
the original fields.  I'm not sure exactly what makes sense here, but
the point is that ideally the message format should be something that
the receiver can rethrow, possibly after tweaking it a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: relaying errors from background workers

From
Petr Jelinek
Date:
On 22/05/14 06:21, Robert Haas wrote:
>
> The main thing I'm not sure about is how to format the message that we
> write to the shm_mq.  One option is to try to use the good old FEBE
> protocol.  This doesn't look entirely straightforward, because
> send_message_to_frontend() assembles the message using pq_sendbyte()
> and pq_sendstring(), and then sends it out to the client using
> pq_endmessage() and pq_flush().  This infrastructure assumes that the
> connection to the frontend is a socket.  It doesn't seem impossible to
> kludge that infrastructure to be able to send to either a socket or a
> shm_mq, but I'm not sure whether it's a good idea.  Alternatively, we
> could devise some other message format specific to this problem; it
> would probably look a lot like an ErrorData protocol message, but
> maybe that doesn't really matter.  Any thoughts?
>

I played with this a bit already and even have a (very much hacked-up)
prototype based on ErrorData, as it seemed a better solution than using
sockets. Obviously some of the ErrorData fields don't really have
meaning during transport (send_to_client, for example), as those must be
set based on the connection options, and I didn't get to making this
nicer yet.

>
> Another thing to think about is that, most likely, many users of the
> background worker facility will want to respond to a relayed error by
> rethrowing it.  That means that whatever format we use to send the
> error from one process to the other has to be able to be decoded by
> the receiving process.  That process will probably want to do
> something like add a bit more to the context (e.g. "in background
> worker PID %d") and throw the resulting error preserving the rest of
> the original fields.  I'm not sure exactly what makes sense here, but
> the point is that ideally the message format should be something that
> the receiver can rethrow, possibly after tweaking it a bit.
>

One advantage of using ErrorData, or ErrorData-like structure-based
messaging, is that rethrowing is simpler, but I guess if we really needed to
we could provide some conversion API.

One side effect of the rethrowing that maybe deserves a bit of thought 
is that we are going to log the same error from both bgworker and 
backend if we rethrow it.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: relaying errors from background workers

From
Robert Haas
Date:
On Thu, May 22, 2014 at 6:33 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> On 22/05/14 06:21, Robert Haas wrote:
>> The main thing I'm not sure about is how to format the message that we
>> write to the shm_mq.  One option is to try to use the good old FEBE
>> protocol.  This doesn't look entirely straightforward, because
>> send_message_to_frontend() assembles the message using pq_sendbyte()
>> and pq_sendstring(), and then sends it out to the client using
>> pq_endmessage() and pq_flush().  This infrastructure assumes that the
>> connection to the frontend is a socket.  It doesn't seem impossible to
>> kludge that infrastructure to be able to send to either a socket or a
>> shm_mq, but I'm not sure whether it's a good idea.  Alternatively, we
>> could devise some other message format specific to this problem; it
>> would probably look a lot like an ErrorData protocol message, but
>> maybe that doesn't really matter.  Any thoughts?
>
> I played with this a bit already and even have a (very much hacked-up)
> prototype based on ErrorData, as it seemed a better solution than using
> sockets. Obviously some of the ErrorData fields don't really have meaning
> during transport (send_to_client, for example), as those must be set based on
> the connection options, and I didn't get to making this nicer yet.

I didn't mean the ErrorData structure that the backend uses
internally; it doesn't seem practical to use that because it contains
pointers.  Even if we stored all the data in the DSM (which seems
rather hard without a shared-memory allocator), it might not be mapped
at the same address in both backends.  I meant the libpq wire protocol
used to transmit errors to the client; I guess the message is actually
called ErrorResponse rather than ErrorData:

http://www.postgresql.org/docs/current/static/protocol-message-formats.html

ErrorResponse of course wouldn't contain anything like send_to_client,
as that's a server-internal thing.
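
The pointer problem can be illustrated with a small sketch.  This is not
something proposed in this thread, just the usual offset-based workaround
shown for contrast: a structure kept in shared memory records where its
strings live as offsets from the start of the segment, so each process
resolves them against its own mapping address.  SharedErr,
shared_err_store() and shared_err_message() are hypothetical names, and plain
char arrays stand in for a DSM segment mapped at two different addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header stored at the start of a shared segment. */
typedef struct SharedErr
{
    size_t message_off;     /* offset of the message text from segment start */
} SharedErr;

/* Write the header plus the message into seg; return 0 on success. */
static int
shared_err_store(char *seg, size_t seglen, const char *message)
{
    SharedErr hdr;
    size_t    n = strlen(message) + 1;

    if (sizeof(hdr) + n > seglen)
        return -1;
    hdr.message_off = sizeof(hdr);
    memcpy(seg, &hdr, sizeof(hdr));
    memcpy(seg + hdr.message_off, message, n);
    return 0;
}

/* Resolve the offset against wherever this process mapped the segment. */
static const char *
shared_err_message(const char *seg)
{
    SharedErr hdr;

    memcpy(&hdr, seg, sizeof(hdr));     /* avoids alignment assumptions */
    return seg + hdr.message_off;
}
```

A raw char pointer stored in the header would still point at the writer's
mapping after the segment is attached elsewhere, which is why a flat,
self-describing byte stream such as ErrorResponse is the simpler choice here.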

>> Another thing to think about is that, most likely, many users of the
>> background worker facility will want to respond to a relayed error by
>> rethrowing it.  That means that whatever format we use to send the
>> error from one process to the other has to be able to be decoded by
>> the receiving process.  That process will probably want to do
>> something like add a bit more to the context (e.g. "in background
>> worker PID %d") and throw the resulting error preserving the rest of
>> the original fields.  I'm not sure exactly what makes sense here, but
>> the point is that ideally the message format should be something that
>> the receiver can rethrow, possibly after tweaking it a bit.
>>
>
> One advantage of using ErrorData, or ErrorData-like structure-based
> messaging, is that rethrowing is simpler, but I guess if we really needed to
> we could provide some conversion API.
>
> One side effect of the rethrowing that maybe deserves a bit of thought is
> that we are going to log the same error from both bgworker and backend if we
> rethrow it.

Yeah, I'm not sure how to handle that.  It probably wouldn't be hard
to make the routine that re-throws the error suppress sending it to
the server log, but that could also be confusing if someone is trying
to follow the progress of a particular session.  Having the background
worker suppress sending it to the log also seems bad; it has no
guarantee that it will successfully relay the message, or that the
recipient will re-throw it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: relaying errors from background workers

From
Amit Kapila
Date:
On Thu, May 22, 2014 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Suppose a user backend starts a background worker for some purpose;
> the background worker dies with an error.  The infrastructure we have
> today is sufficient for the user backend to discover that the worker
> backend has died, but not why.  There might be an error in the server
> log, but the error information won't be transmitted back to the user
> backend in any way.  I think it would be nice to fix that.  I also
> think it would be nice to be able to relay not only errors, but also
> messages logged via ereport() or elog() at lower log levels (WARNING,
> NOTICE, INFO, DEBUG).
>
> The design I have in mind is to teach elog.c how to write such
> messages to a shm_mq.  This is in fact one of the major use cases I
> had in mind when I designed the shm_mq infrastructure, because it
> seems to me that almost anything we want to do in parallel is likely
> to want to do this.  Even aside from parallelism, it's not too hard to
> imagine wanting to use background workers to launch a job in the
> background and then come back later and see what happened.  If there
> was an error, you're going to want to know specifically what went
> wrong, not just that something went wrong.
>
> The main thing I'm not sure about is how to format the message that we
> write to the shm_mq.  One option is to try to use the good old FEBE
> protocol.  This doesn't look entirely straightforward, because
> send_message_to_frontend() assembles the message using pq_sendbyte()
> and pq_sendstring(), and then sends it out to the client using
> pq_endmessage() and pq_flush().  This infrastructure assumes that the
> connection to the frontend is a socket.  It doesn't seem impossible to
> kludge that infrastructure to be able to send to either a socket or a
> shm_mq, but I'm not sure whether it's a good idea.

I think it would be better to keep the message-assembly part similar to the
current code and then add a different way of communicating with the backend.
This will make rethrowing the message to the client simpler.
We already have mechanisms for reporting to the server log and to the client
(send_message_to_server_log()/send_message_to_frontend()), so
devising a similar way of communicating with the backend seems
plausible.
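
One way to picture this split is a sketch in which the message body is
assembled once and then handed to a pluggable transport.  Every name here is
hypothetical and none of it is existing PostgreSQL code: msg_sink_fn is an
assumed callback type, MemQueue is a toy stand-in for a shm_mq, and
report_message() plays the role of the shared assembly step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport: where an assembled message body is delivered. */
typedef int (*msg_sink_fn) (void *arg, const char *body, size_t len);

/* Toy stand-in for a queue such as shm_mq: it records a single message. */
typedef struct MemQueue
{
    char   data[256];
    size_t len;
} MemQueue;

static int
memqueue_sink(void *arg, const char *body, size_t len)
{
    MemQueue *q = arg;

    if (len > sizeof(q->data))
        return -1;              /* a real queue would block or report "full" */
    memcpy(q->data, body, len);
    q->len = len;
    return 0;
}

/* Assembly is transport-agnostic: build the body once, then call the sink. */
static int
report_message(msg_sink_fn sink, void *sink_arg,
               const char *severity, const char *message)
{
    char   body[256];
    size_t slen = strlen(severity) + 1;     /* lengths include the NUL */
    size_t mlen = strlen(message) + 1;
    size_t pos = 0;

    if (1 + slen + 1 + mlen + 1 > sizeof(body))
        return -1;
    body[pos++] = 'S';                      /* severity field */
    memcpy(body + pos, severity, slen);
    pos += slen;
    body[pos++] = 'M';                      /* primary message field */
    memcpy(body + pos, message, mlen);
    pos += mlen;
    body[pos++] = '\0';                     /* end of field list */
    return sink(sink_arg, body, pos);
}
```

A socket-backed sink with the same signature could sit next to
memqueue_sink(), so the assembly code would not need to know which
destination it is writing to.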


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com