Re: [HACKERS] Optional message to user when terminating/cancellingbackend - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Optional message to user when terminating/cancellingbackend
Date
Msg-id 20180613191621.dl5ko3v5gjltxsbj@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Optional message to user when terminating/cancellingbackend  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: [HACKERS] Optional message to user when terminating/cancellingbackend
List pgsql-hackers
Hi,

Onder, I CCed you because it seems worthwhile to ensure that the
relevant Citus code could use this instead of the gross hack you and I
committed...

On 2018-06-13 20:54:03 +0200, Daniel Gustafsson wrote:
> > On 9 Apr 2018, at 23:55, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-06-20 13:01:35 -0700, Andres Freund wrote:
> >> For extensions it'd also be useful if it'd be possible to overwrite the
> >> error code.  E.g. for citus there's a distributed deadlock detector,
> >> running out of process because there's no way to interrupt lock waits
> >> locally, and we've to do some ugly hacking to generate proper error
> >> messages and code from another session.
> > 
> > What happened to this request?  Seems we're out of the crunch mode and
> > could round the feature out a littlebit more…
> 
> Revisiting old patches, I took a stab at this request.
> 
> Since I don’t really have a use case for altering the sqlerrcode other than the
> on that Citus..  cited, I modelled the API around that.

Cool. Onder?


> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index b851fe023a..ad9763698f 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -18540,7 +18540,7 @@ SELECT set_config('log_statement_stats', 'off', false);
>       <tbody>
>        <row>
>         <entry>
> -        <literal><function>pg_cancel_backend(<parameter>pid</parameter> <type>int</type>)</function></literal>
> +        <literal><function>pg_cancel_backend(<parameter>pid</parameter> <type>int</type> [,
<parameter>message</parameter><type>text</type>])</function></literal>
 
>          </entry>
>         <entry><type>boolean</type></entry>
>         <entry>Cancel a backend's current query.  This is also allowed if the
> @@ -18565,7 +18565,7 @@ SELECT set_config('log_statement_stats', 'off', false);
>        </row>
>        <row>
>         <entry>
> -        <literal><function>pg_terminate_backend(<parameter>pid</parameter> <type>int</type>)</function></literal>
> +        <literal><function>pg_terminate_backend(<parameter>pid</parameter> <type>int</type> [,
<parameter>message</parameter><type>text</type>])</function></literal>
 
>          </entry>
>         <entry><type>boolean</type></entry>
>         <entry>Terminate a backend.  This is also allowed if the calling role
> @@ -18596,6 +18596,14 @@ SELECT set_config('log_statement_stats', 'off', false);
>      The role of an active backend can be found from the
>      <structfield>usename</structfield> column of the
>      <structname>pg_stat_activity</structname> view.
> +    If the optional <literal>message</literal> parameter is set, the text
> +    will be appended to the error message returned to the signalled backend.
> +    <literal>message</literal> is limited to 128 bytes, any longer text
> +    will be truncated. An example where we cancel our own backend:
> +<programlisting>
> +postgres=# SELECT pg_cancel_backend(pg_backend_pid(), 'Cancellation message text');
> +ERROR:  canceling statement due to user request: Cancellation message text
> +</programlisting>
>     </para>

I'm not sure I really like the appending bit. There's a security
argument to be made about doing so, but from a user POV that mostly
seems restrictive.  I wonder if that aspect would be better handled by
adding an error context that contains information about which process
did the cancellation (or DETAIL?)?


> +/*
> + * Structure for registering a feedback payload to be sent to a cancelled, or
> + * terminated backend. Each backend is registered per pid in the array which is
> + * indexed by Backend ID. Reading and writing the message is protected by a
> + * per-slot spinlock.
> + */
> +typedef struct
> +{
> +    pid_t    pid;

This is the pid of the receiving process? If so, why do we need this in
here? That's just duplicated data, no?


> +    slock_t    mutex;
> +    char    message[MAX_CANCEL_MSG];
> +    int        len;
> +    int        sqlerrcode;

I'd vote for including the pid of the process that did the cancelling in
here.

> +Datum
> +pg_cancel_backend(PG_FUNCTION_ARGS)
> +{
> +    PG_RETURN_BOOL(pg_cancel_backend_internal(PG_GETARG_INT32(0), NULL));
> +}
> +
> +Datum
> +pg_cancel_backend_msg(PG_FUNCTION_ARGS)
> +{
> +    pid_t        pid;
> +    char        *msg = NULL;
> +
> +    if (PG_ARGISNULL(0))
> +        PG_RETURN_NULL();
> +
> +    pid = PG_GETARG_INT32(0);
> +
> +    if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
> +        msg = text_to_cstring(PG_GETARG_TEXT_PP(1));
> +
> +    PG_RETURN_BOOL(pg_cancel_backend_internal(pid, msg));
> +}

Why isn't this just one function? Given that you already have a PG_NARGS
check, I don't quite see the point?

>  /*
>   * Signal to terminate a backend process.  This is allowed if you are a member
>   * of the role whose process is being terminated.
>   *
>   * Note that only superusers can signal superuser-owned processes.
>   */
> -Datum
> -pg_terminate_backend(PG_FUNCTION_ARGS)
> +static bool
> +pg_terminate_backend_internal(pid_t pid, char *msg)
>  {
> -    int            r = pg_signal_backend(PG_GETARG_INT32(0), SIGTERM);
> +    int        r = pg_signal_backend(pid, SIGTERM, msg);
>  
>      if (r == SIGNAL_BACKEND_NOSUPERUSER)
>          ereport(ERROR,
> @@ -146,7 +203,30 @@ pg_terminate_backend(PG_FUNCTION_ARGS)
>                  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>                   (errmsg("must be a member of the role whose process is being terminated or member of
pg_signal_backend"))));
>  
> -    PG_RETURN_BOOL(r == SIGNAL_BACKEND_SUCCESS);
> +    return (r == SIGNAL_BACKEND_SUCCESS);
> +}
> +
> +Datum
> +pg_terminate_backend(PG_FUNCTION_ARGS)
> +{
> +    PG_RETURN_BOOL(pg_terminate_backend_internal(PG_GETARG_INT32(0), NULL));
> +}
> +
> +Datum
> +pg_terminate_backend_msg(PG_FUNCTION_ARGS)
> +{
> +    pid_t        pid;
> +    char        *msg = NULL;
> +
> +    if (PG_ARGISNULL(0))
> +        PG_RETURN_NULL();
> +
> +    pid = PG_GETARG_INT32(0);
> +
> +    if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
> +        msg = text_to_cstring(PG_GETARG_TEXT_PP(1));
> +
> +    PG_RETURN_BOOL(pg_terminate_backend_internal(pid, msg));
>  }
>  
>  /*
> @@ -168,7 +248,6 @@ pg_reload_conf(PG_FUNCTION_ARGS)
>      PG_RETURN_BOOL(true);
>  }

Same.


> +/*
> + * Set a message for the cancellation of the backend with the specified pid,
> + * using the default sqlerrcode.
> + */
> +int
> +SetBackendCancelMessage(pid_t backend_pid, char *message)
> +{
> +    return backend_feedback(backend_pid, message, ERRCODE_QUERY_CANCELED);
> +}
> +
> +/*
> + * Set a message for the termination of the backend with the specified pid,
> + * using the default sqlerrcode.
> + */
> +int
> +SetBackendTerminationMessage(pid_t backend_pid, char *message)
> +{
> +    return backend_feedback(backend_pid, message, ERRCODE_ADMIN_SHUTDOWN);
> +}
> +
> +/*
> + * Set both a message and a sqlerrcode for use when signalling the backend
> + * with the specified pid.
> + */
> +int
> +SetBackendSignalFeedback(pid_t backend_pid, char *message, int sqlerrcode)
> +{
> +    return backend_feedback(backend_pid, message, sqlerrcode);
> +}

I'm not quite seeing the point of these variants.  What advantage are
they over just doing the same on the caller level?


> +/*
> + * Sets a cancellation message for the backend with the specified pid, and
> + * returns the length of the message actually created. If the returned length
> + * is less than the length of the message parameter, truncation has occurred.
> + * If the backend isn't found, -1 is returned. If no message is passed, zero is
> + * returned. If two backends collide in setting a message, the existing message
> + * will be overwritten by the last one in.
> + */
> +static int
> +backend_feedback(pid_t backend_pid, char *message, int sqlerrcode)
> +{
> +    int        i;
> +    int        len;
> +
> +    if (!message)
> +        return 1;
> +
> +    len = pg_mbcliplen(message, strlen(message), MAX_CANCEL_MSG - 1);
> +
> +    for (i = 0; i < MaxBackends; i++)
> +    {
> +        BackendSignalFeedbackShmemStruct *slot = &BackendSignalFeedbackSlots[i];
> +
> +        if (slot->pid != 0 && slot->pid == backend_pid)
> +        {
> +            SpinLockAcquire(&slot->mutex);
> +            if (slot->pid != backend_pid)
> +            {
> +                SpinLockRelease(&slot->mutex);
> +                goto error;
> +            }
> +
> +            MemSet(slot->message, '\0', sizeof(slot->message));
> +            memcpy(slot->message, message, len);

This seems unnecessarily expensive.

> +            slot->len = len;
> +            slot->sqlerrcode = sqlerrcode;
> +            SpinLockRelease(&slot->mutex);
> +
> +            if (len != strlen(message))
> +                ereport(NOTICE,
> +                        (errmsg("message is too long and has been truncated")));
> +            return 0;
> +        }
> +    }

So we made cancellation a O(N) proposition :/. Probably not too bad, but
still...

> +error:
> +
> +    elog(LOG, "Cancellation message requested for missing backend %d by %d",
> +         (int) backend_pid, MyProcPid);
> +
> +    return 1;
> +}

So we now do log spam if processes exit in the wrong moment?


> +/*
> + * Test whether there is feedback registered for the current backend that can
> + * be consumed and presented to the user.
> + */
> +bool
> +HasBackendSignalFeedback(void)
> +{
> +    volatile BackendSignalFeedbackShmemStruct *slot = MyCancelSlot;
> +    bool     has_message = false;
> +
> +    if (slot != NULL)
> +    {
> +        SpinLockAcquire(&slot->mutex);
> +        has_message = ((slot->len > 0) && (slot->sqlerrcode != 0));
> +        SpinLockRelease(&slot->mutex);
> +    }
> +
> +    return has_message;
> +}

I'm somewhat uncomfortable with acquiring a mutex in an error path.  We
used to prcoess at least some interrupts in signal handlers in some
cases - I think I removed all of them, but if not, we'd be in deep
trouble here.  Wonder if we shouldn't try to get around needing that...



Greetings,

Andres Freund


pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: [HACKERS] Optional message to user when terminating/cancellingbackend
Next
From: Jeff Davis
Date:
Subject: Re: Spilling hashed SetOps and aggregates to disk