Thread: relaying errors from background workers
Suppose a user backend starts a background worker for some purpose; the
background worker dies with an error. The infrastructure we have today
is sufficient for the user backend to discover that the worker backend
has died, but not why. There might be an error in the server log, but
the error information won't be transmitted back to the user backend in
any way. I think it would be nice to fix that. I also think it would be
nice to be able to relay not only errors, but also messages logged via
ereport() or elog() at lower log levels (WARNING, NOTICE, INFO, DEBUG).

The design I have in mind is to teach elog.c how to write such messages
to a shm_mq. This is in fact one of the major use cases I had in mind
when I designed the shm_mq infrastructure, because it seems to me that
almost anything we want to do in parallel is likely to want to do this.
Even aside from parallelism, it's not too hard to imagine wanting to use
background workers to launch a job in the background and then come back
later and see what happened. If there was an error, you're going to want
to know specifically what went wrong, not just that something went wrong.

The main thing I'm not sure about is how to format the message that we
write to the shm_mq. One option is to try to use the good old FEBE
protocol. This doesn't look entirely straightforward, because
send_message_to_frontend() assembles the message using pq_sendbyte()
and pq_sendstring(), and then sends it out to the client using
pq_endmessage() and pq_flush(). This infrastructure assumes that the
connection to the frontend is a socket. It doesn't seem impossible to
kludge that infrastructure to be able to send to either a socket or a
shm_mq, but I'm not sure whether it's a good idea. Alternatively, we
could devise some other message format specific to this problem; it
would probably look a lot like an ErrorData protocol message, but maybe
that doesn't really matter. Any thoughts?

A third alternative is to say, OK, we really ought to have an actual
socket connection between those backends, so that using FEBE just works.
I don't think that's a good idea. It would require passing a socket
descriptor from the user backend up to the postmaster and then back down
to the background worker, or else using some kind of FIFO. That's a set
of portability problems I'd rather not deal with. I think the shm_mq
infrastructure is also better in that it gives us the ability to use a
very large queue size if, for example, we discover that we need that in
order to avoid having the client block because the queue is full. Socket
buffer sizes can be adjusted at the OS level, of course, but the details
are different on every platform and the upper limits tend not to be too
large.

Another thing to think about is that, most likely, many users of the
background worker facility will want to respond to a relayed error by
rethrowing it. That means that whatever format we use to send the error
from one process to the other has to be able to be decoded by the
receiving process. That process will probably want to do something like
add a bit more to the context (e.g. "in background worker PID %d") and
throw the resulting error preserving the rest of the original fields.
I'm not sure exactly what makes sense here, but the point is that
ideally the message format should be something that the receiver can
rethrow, possibly after tweaking it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
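A minimal worker-side sketch of the data flow described above (not the
proposed elog.c integration itself, which doesn't exist yet): catch the
error, flatten a few ErrorData fields into one contiguous buffer, and
push it onto a shm_mq for the launching backend to read. The relay_mqh
handle, the function name, and the ad-hoc flattening format are
assumptions for illustration only.

    #include "postgres.h"

    #include "lib/stringinfo.h"
    #include "storage/shm_mq.h"
    #include "utils/elog.h"

    static shm_mq_handle *relay_mqh;    /* attached at worker startup */

    static void
    do_work_with_error_relay(void)
    {
        MemoryContext oldcxt = CurrentMemoryContext;

        PG_TRY();
        {
            /* ... the worker's real work goes here ... */
        }
        PG_CATCH();
        {
            ErrorData  *edata;
            StringInfoData buf;

            /* CopyErrorData() must not run in ErrorContext */
            MemoryContextSwitchTo(oldcxt);
            edata = CopyErrorData();

            /* flatten the fields we care about into one contiguous chunk */
            initStringInfo(&buf);
            appendBinaryStringInfo(&buf, (char *) &edata->elevel,
                                   sizeof(edata->elevel));
            appendStringInfoString(&buf, edata->message ? edata->message : "");
            appendStringInfoChar(&buf, '\0');
            appendStringInfoString(&buf, edata->detail ? edata->detail : "");
            appendStringInfoChar(&buf, '\0');

            /* hand the flat copy to the user backend; wait if the queue is full */
            (void) shm_mq_send(relay_mqh, buf.len, buf.data, false);

            FreeErrorData(edata);
            PG_RE_THROW();
        }
        PG_END_TRY();
    }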
On 22/05/14 06:21, Robert Haas wrote:
>
> The main thing I'm not sure about is how to format the message that we
> write to the shm_mq. One option is to try to use the good old FEBE
> protocol. This doesn't look entirely straightforward, because
> send_message_to_frontend() assembles the message using pq_sendbyte()
> and pq_sendstring(), and then sends it out to the client using
> pq_endmessage() and pq_flush(). This infrastructure assumes that the
> connection to the frontend is a socket. It doesn't seem impossible to
> kludge that infrastructure to be able to send to either a socket or a
> shm_mq, but I'm not sure whether it's a good idea. Alternatively, we
> could devise some other message format specific to this problem; it
> would probably look a lot like an ErrorData protocol message, but maybe
> that doesn't really matter. Any thoughts?
>

I played with this a bit already and even have some (very much hacked
up) prototype based on ErrorData, as it seemed a better solution than
using sockets. Obviously some of the ErrorData fields don't really have
meaning during transport (send_to_client, for example), as those must be
set based on the connection options; I didn't get to making this nicer
yet.

>
> Another thing to think about is that, most likely, many users of the
> background worker facility will want to respond to a relayed error by
> rethrowing it. That means that whatever format we use to send the
> error from one process to the other has to be able to be decoded by
> the receiving process. That process will probably want to do
> something like add a bit more to the context (e.g. "in background
> worker PID %d") and throw the resulting error preserving the rest of
> the original fields. I'm not sure exactly what makes sense here, but
> the point is that ideally the message format should be something that
> the receiver can rethrow, possibly after tweaking it a bit.
>

This is one advantage of using ErrorData or an ErrorData-like structure
for the messaging: rethrowing is simpler. But I guess if we really
needed to, we could provide some conversion API.

One side effect of the rethrowing that maybe deserves a bit of thought
is that we are going to log the same error from both the bgworker and
the backend if we rethrow it.

--
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
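As a rough illustration of that rethrow path (and of one possible answer
to the double-logging question), here is a sketch; rethrow_worker_error()
is an invented name, the ErrorData is assumed to have already been
rebuilt from whatever arrived over the queue, and clearing
output_to_server is only one option under discussion, not settled
behavior.

    #include "postgres.h"

    #include "utils/elog.h"

    static void
    rethrow_worker_error(ErrorData *edata, int worker_pid)
    {
        /* tack the worker's identity onto whatever context came across */
        edata->context = psprintf("%s%sbackground worker, PID %d",
                                  edata->context ? edata->context : "",
                                  edata->context ? "\n" : "",
                                  worker_pid);

        /* the worker already logged this; avoid a second server-log entry */
        edata->output_to_server = false;

        ReThrowError(edata);
    }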
On Thu, May 22, 2014 at 6:33 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> On 22/05/14 06:21, Robert Haas wrote:
>> The main thing I'm not sure about is how to format the message that we
>> write to the shm_mq. One option is to try to use the good old FEBE
>> protocol. This doesn't look entirely straightforward, because
>> send_message_to_frontend() assembles the message using pq_sendbyte()
>> and pq_sendstring(), and then sends it out to the client using
>> pq_endmessage() and pq_flush(). This infrastructure assumes that the
>> connection to the frontend is a socket. It doesn't seem impossible to
>> kludge that infrastructure to be able to send to either a socket or a
>> shm_mq, but I'm not sure whether it's a good idea. Alternatively, we
>> could devise some other message format specific to this problem; it
>> would probably look a lot like an ErrorData protocol message, but
>> maybe that doesn't really matter. Any thoughts?
>
> I played with this a bit already and even have some (very much hacked
> up) prototype based on ErrorData, as it seemed a better solution than
> using sockets. Obviously some of the ErrorData fields don't really have
> meaning during transport (send_to_client, for example), as those must
> be set based on the connection options; I didn't get to making this
> nicer yet.

I didn't mean the ErrorData structure that the backend uses internally;
it doesn't seem practical to use that because it contains pointers.
Even if we stored all the data in the DSM (which seems rather hard
without a shared-memory allocator), it might not be mapped at the same
address in both backends. I meant the libpq wire protocol used to
transmit errors to the client; I guess the message is actually called
ErrorResponse rather than ErrorData:

http://www.postgresql.org/docs/current/static/protocol-message-formats.html

ErrorResponse of course wouldn't contain anything like send_to_client,
as that's a server-internal thing.

>> Another thing to think about is that, most likely, many users of the
>> background worker facility will want to respond to a relayed error by
>> rethrowing it. That means that whatever format we use to send the
>> error from one process to the other has to be able to be decoded by
>> the receiving process. That process will probably want to do
>> something like add a bit more to the context (e.g. "in background
>> worker PID %d") and throw the resulting error preserving the rest of
>> the original fields. I'm not sure exactly what makes sense here, but
>> the point is that ideally the message format should be something that
>> the receiver can rethrow, possibly after tweaking it a bit.
>
> This is one advantage of using ErrorData or an ErrorData-like structure
> for the messaging: rethrowing is simpler. But I guess if we really
> needed to, we could provide some conversion API.
>
> One side effect of the rethrowing that maybe deserves a bit of thought
> is that we are going to log the same error from both the bgworker and
> the backend if we rethrow it.

Yeah, I'm not sure how to handle that. It probably wouldn't be hard to
make the routine that re-throws the error suppress sending it to the
server log, but that could also be confusing if someone is trying to
follow the progress of a particular session. Having the background
worker suppress sending it to the log also seems bad; it has no
guarantee that it will successfully relay the message, or that the
recipient will re-throw it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
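For concreteness, a small sketch of what decoding an ErrorResponse-style
body might look like on the receiving side: per the protocol
documentation linked above, the body is a sequence of one-byte field
codes, each followed by a NUL-terminated string, ending with a single
zero byte. parse_relayed_error() is an invented name and only a few
field codes are handled; a real version would restore more fields
(SQLSTATE, context, and so on) before handing the result to something
like the rethrow sketch earlier in the thread.

    #include "postgres.h"

    #include "utils/elog.h"

    static void
    parse_relayed_error(const char *data, Size len, ErrorData *edata)
    {
        const char *p = data;

        memset(edata, 0, sizeof(ErrorData));
        edata->elevel = ERROR;

        while (p < data + len && *p != '\0')
        {
            char        code = *p++;
            const char *value = p;

            p += strlen(value) + 1;     /* step past the NUL-terminated value */

            switch (code)
            {
                case PG_DIAG_MESSAGE_PRIMARY:   /* 'M' */
                    edata->message = pstrdup(value);
                    break;
                case PG_DIAG_MESSAGE_DETAIL:    /* 'D' */
                    edata->detail = pstrdup(value);
                    break;
                case PG_DIAG_MESSAGE_HINT:      /* 'H' */
                    edata->hint = pstrdup(value);
                    break;
                default:
                    break;              /* ignore fields we don't handle */
            }
        }
    }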
On Thu, May 22, 2014 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Suppose a user backend starts a background worker for some purpose;
> the background worker dies with an error. The infrastructure we have
> today is sufficient for the user backend to discover that the worker
> backend has died, but not why. There might be an error in the server
> log, but the error information won't be transmitted back to the user
> backend in any way. I think it would be nice to fix that. I also
> think it would be nice to be able to relay not only errors, but also
> messages logged via ereport() or elog() at lower log levels (WARNING,
> NOTICE, INFO, DEBUG).
>
> The design I have in mind is to teach elog.c how to write such
> messages to a shm_mq. This is in fact one of the major use cases I
> had in mind when I designed the shm_mq infrastructure, because it
> seems to me that almost anything we want to do in parallel is likely
> to want to do this. Even aside from parallelism, it's not too hard to
> imagine wanting to use background workers to launch a job in the
> background and then come back later and see what happened. If there
> was an error, you're going to want to know specifically what went
> wrong, not just that something went wrong.
>
> The main thing I'm not sure about is how to format the message that we
> write to the shm_mq. One option is to try to use the good old FEBE
> protocol. This doesn't look entirely straightforward, because
> send_message_to_frontend() assembles the message using pq_sendbyte()
> and pq_sendstring(), and then sends it out to the client using
> pq_endmessage() and pq_flush(). This infrastructure assumes that the
> connection to the frontend is a socket. It doesn't seem impossible to
> kludge that infrastructure to be able to send to either a socket or a
> shm_mq, but I'm not sure whether it's a good idea.
>
I think it will be better to keep the message-assembly part similar to
the current code and then have a different way of communicating with
the backend. This will make rethrowing the message to the client
simpler. We already have mechanisms for reporting to the client and to
the server log
(send_message_to_server_log()/send_message_to_frontend()), so devising
a similar way of communicating with the backend seems to be a plausible
approach.
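A sketch of what that might look like, reusing the same
pq_sendbyte()/pq_sendstring() assembly that send_message_to_frontend()
uses but swapping the transport for shm_mq_send();
send_message_to_backend() and worker_error_mqh are invented names, and
only a few fields are shown.

    #include "postgres.h"

    #include "libpq/pqformat.h"
    #include "storage/shm_mq.h"
    #include "utils/elog.h"

    static shm_mq_handle *worker_error_mqh;    /* set up at worker startup */

    static void
    send_message_to_backend(ErrorData *edata)
    {
        StringInfoData msgbuf;

        initStringInfo(&msgbuf);

        /* same field layout as an ErrorResponse/NoticeResponse body */
        pq_sendbyte(&msgbuf, PG_DIAG_SQLSTATE);
        pq_sendstring(&msgbuf, unpack_sql_state(edata->sqlerrcode));

        pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_PRIMARY);
        pq_sendstring(&msgbuf, edata->message ? edata->message : "missing error text");

        if (edata->detail)
        {
            pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_DETAIL);
            pq_sendstring(&msgbuf, edata->detail);
        }

        pq_sendbyte(&msgbuf, '\0');    /* terminator, as in the FEBE message */

        /* the transport step is the only part that differs from the frontend case */
        (void) shm_mq_send(worker_error_mqh, msgbuf.len, msgbuf.data, false);

        pfree(msgbuf.data);
    }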