Thread: GSSENC'ed connection stalls while reconnection attempts.

GSSENC'ed connection stalls while reconnection attempts.

From
Kyotaro Horiguchi
Date:
Hello.

If psql connected using GSSAPI auth and server restarted, reconnection
sequence stalls and won't return.

I found that psql(libpq) sends startup packet via gss
encryption. conn->gssenc should be reset when encryption state is
freed.

The reason that psql doesn't notice the error is pqPacketSend returns
STATUS_OK when write error occurred.  That behavior contradicts to the
comment of the function. The function is used only while making
connection so it's ok to error out immediately by write failure.  I
think other usage of pqFlush while making a connection should report
write failure the same way.

Finally, It's user-friendly if psql shows error message for error on
reset attempts. (This perhaps should be arguable.)

The attached does the above. Any thoughts and/or opinions are welcome.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 8cc5663f31744b384cf013f86062ea28b431baa6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 10 Jul 2020 17:30:11 +0900
Subject: [PATCH] Fix reconnection failure of GSSENC connections

A connection using GSS encryption fails to reconnect and freezes. Fix
that by resetting GSS encryption state on dropping a connection. On
the way fixing that, fix the behavior for write failure while
connection attempts.  Also let psql report the cause of reset attempt
failure.
---
 src/bin/psql/common.c             |  2 +-
 src/interfaces/libpq/fe-auth.c    |  7 ++++++-
 src/interfaces/libpq/fe-connect.c | 13 ++++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 6323a35c91..d5ab1d8ada 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -310,7 +310,7 @@ CheckConnection(void)
         OK = ConnectionUp();
         if (!OK)
         {
-            fprintf(stderr, _("Failed.\n"));
+            fprintf(stderr, _("Failed: %s\n"), PQerrorMessage(pset.db));
 
             /*
              * Transition to having no connection.  Keep this bit in sync with
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 9f5403d74c..3664ee10a0 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -588,7 +588,12 @@ pg_SASL_init(PGconn *conn, int payloadlen)
     }
     if (pqPutMsgEnd(conn))
         goto error;
-    if (pqFlush(conn))
+
+    /*
+     * We must not fail write here.  Write failure needs to be checked
+     * explicitly. See pqSendSome.
+     */
+    if (pqFlush(conn) || conn->write_failed)
         goto error;
 
     termPQExpBuffer(&mechanism_buf);
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 27c9bb46ee..f897003f42 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -496,6 +496,8 @@ pqDropConnection(PGconn *conn, bool flushInput)
             free(conn->gss_ResultBuffer);
             conn->gss_ResultBuffer = NULL;
         }
+
+        conn->gssenc = false;
     }
 #endif
 #ifdef ENABLE_SSPI
@@ -3455,8 +3457,10 @@ keep_going:                        /* We will come back to here until there is
                  * Just make sure that any data sent by pg_fe_sendauth is
                  * flushed out.  Although this theoretically could block, it
                  * really shouldn't since we don't send large auth responses.
+                 * Write failure needs to be checked explicitly. See
+                 * pqSendSome.
                  */
-                if (pqFlush(conn))
+                if (pqFlush(conn) || conn->write_failed)
                     goto error_return;
 
                 if (areq == AUTH_REQ_OK)
@@ -4546,8 +4550,11 @@ pqPacketSend(PGconn *conn, char pack_type,
     if (pqPutMsgEnd(conn))
         return STATUS_ERROR;
 
-    /* Flush to ensure backend gets it. */
-    if (pqFlush(conn))
+    /*
+     * Flush to ensure backend gets it. Write failure needs to be checked
+     * explicitly. See pqSendSome.
+     */
+    if (pqFlush(conn) || conn->write_failed)
         return STATUS_ERROR;
 
     return STATUS_OK;
-- 
2.18.4


Re: GSSENC'ed connection stalls while reconnection attempts.

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> If psql connected using GSSAPI auth and server restarted, reconnection
> sequence stalls and won't return.

Yeah, reproduced here.  (I wonder if there's any reasonable way to
exercise this scenario in src/test/kerberos/.)

> I found that psql(libpq) sends startup packet via gss
> encryption. conn->gssenc should be reset when encryption state is
> freed.

Actually, it looks to me like the GSS support was wedged in by somebody
who was paying no attention to how SSL is managed, or else we forgot
to pay attention to GSS the last time we rearranged SSL support.  It's
completely broken for the multiple-host-addresses scenario as well,
because try_gss is being set and cleared in the wrong places altogether.
conn->gcred is not being handled correctly either I think --- we need
to make sure that it's dropped in pqDropConnection.

The attached patch makes this all act more like the way SSL is handled,
and for me it resolves the reconnection problem.

> The reason that psql doesn't notice the error is pqPacketSend returns
> STATUS_OK when write error occurred.  That behavior contradicts to the
> comment of the function. The function is used only while making
> connection so it's ok to error out immediately by write failure.  I
> think other usage of pqFlush while making a connection should report
> write failure the same way.

I'm disinclined to mess with that, because (a) I don't think it's the
actual source of the problem, and (b) it would affect way more than
just GSS mode.

> Finally, It's user-friendly if psql shows error message for error on
> reset attempts. (This perhaps should be arguable.)

Meh, that's changing fairly longstanding behavior that I don't think
we've had many complaints about.

            regards, tom lane

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 27c9bb46ee..7bee9dd201 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -477,6 +477,11 @@ pqDropConnection(PGconn *conn, bool flushInput)
     {
         OM_uint32    min_s;

+        if (conn->gcred != GSS_C_NO_CREDENTIAL)
+        {
+            gss_release_cred(&min_s, &conn->gcred);
+            conn->gcred = GSS_C_NO_CREDENTIAL;
+        }
         if (conn->gctx)
             gss_delete_sec_context(&min_s, &conn->gctx, GSS_C_NO_BUFFER);
         if (conn->gtarg_nam)
@@ -496,6 +501,7 @@ pqDropConnection(PGconn *conn, bool flushInput)
             free(conn->gss_ResultBuffer);
             conn->gss_ResultBuffer = NULL;
         }
+        conn->gssenc = false;
     }
 #endif
 #ifdef ENABLE_SSPI
@@ -2027,11 +2033,6 @@ connectDBStart(PGconn *conn)
      */
     resetPQExpBuffer(&conn->errorMessage);

-#ifdef ENABLE_GSS
-    if (conn->gssencmode[0] == 'd') /* "disable" */
-        conn->try_gss = false;
-#endif
-
     /*
      * Set up to try to connect to the first host.  (Setting whichhost = -1 is
      * a bit of a cheat, but PQconnectPoll will advance it to 0 before
@@ -2468,6 +2469,9 @@ keep_going:                        /* We will come back to here until there is
         conn->allow_ssl_try = (conn->sslmode[0] != 'd');    /* "disable" */
         conn->wait_ssl_try = (conn->sslmode[0] == 'a'); /* "allow" */
 #endif
+#ifdef ENABLE_GSS
+        conn->try_gss = (conn->gssencmode[0] != 'd');    /* "disable" */
+#endif

         reset_connection_state_machine = false;
         need_new_connection = true;
@@ -3349,12 +3353,8 @@ keep_going:                        /* We will come back to here until there is
                      */
                     if (conn->gssenc && conn->gssencmode[0] == 'p')
                     {
-                        OM_uint32    minor;
-
                         /* postmaster expects us to drop the connection */
                         conn->try_gss = false;
-                        conn->gssenc = false;
-                        gss_delete_sec_context(&minor, &conn->gctx, NULL);
                         pqDropConnection(conn, true);
                         conn->status = CONNECTION_NEEDED;
                         goto keep_going;
@@ -3906,9 +3906,6 @@ makeEmptyPGconn(void)
     conn->verbosity = PQERRORS_DEFAULT;
     conn->show_context = PQSHOW_CONTEXT_ERRORS;
     conn->sock = PGINVALID_SOCKET;
-#ifdef ENABLE_GSS
-    conn->try_gss = true;
-#endif

     /*
      * We try to send at least 8K at a time, which is the usual size of pipe
@@ -4065,22 +4062,6 @@ freePGconn(PGconn *conn)
         free(conn->gsslib);
     if (conn->connip)
         free(conn->connip);
-#ifdef ENABLE_GSS
-    if (conn->gcred != GSS_C_NO_CREDENTIAL)
-    {
-        OM_uint32    minor;
-
-        gss_release_cred(&minor, &conn->gcred);
-        conn->gcred = GSS_C_NO_CREDENTIAL;
-    }
-    if (conn->gctx)
-    {
-        OM_uint32    minor;
-
-        gss_delete_sec_context(&minor, &conn->gctx, GSS_C_NO_BUFFER);
-        conn->gctx = NULL;
-    }
-#endif
     /* Note that conn->Pfdebug is not ours to close or free */
     if (conn->last_query)
         free(conn->last_query);

Re: GSSENC'ed connection stalls while reconnection attempts.

From
Tom Lane
Date:
I wrote:
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
>> If psql connected using GSSAPI auth and server restarted, reconnection
>> sequence stalls and won't return.

> Yeah, reproduced here.  (I wonder if there's any reasonable way to
> exercise this scenario in src/test/kerberos/.)

I tried writing such a test based on the IO::Pty infrastructure used
by src/bin/psql/t/010_tab_completion.pl, as attached.  It works, but
it feels pretty grotty, especially seeing that so much of the patch
is copy-and-pasted from 010_tab_completion.pl.  I think if we want
to have a test like this, it'd be good to work a little harder on
refactoring so that more of that code can be shared.  My Perl skillz
are a bit weak for that, though.

            regards, tom lane

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index b3aeea9574..552d2724e7 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -13,17 +13,30 @@
 
 use strict;
 use warnings;
+
 use TestLib;
 use PostgresNode;
 use Test::More;
+use IPC::Run qw(pump finish timer);
+use Data::Dumper;
 
-if ($ENV{with_gssapi} eq 'yes')
+# Do nothing unless Makefile has told us that the build is --with-gssapi.
+if ($ENV{with_gssapi} ne 'yes')
 {
+    plan skip_all => 'GSSAPI/Kerberos not supported by this build';
+}
+
+# If we don't have IO::Pty, we can't run the test with interactive_psql.
+my $have_pty = 1;
+eval { require IO::Pty; };
+if ($@)
+{
+    $have_pty = 0;
     plan tests => 18;
 }
 else
 {
-    plan skip_all => 'GSSAPI/Kerberos not supported by this build';
+    plan tests => 22;
 }
 
 my ($krb5_bin_dir, $krb5_sbin_dir);
@@ -275,6 +288,77 @@ test_query(
     "gssencmode=require",
     "sending 100K lines works");
 
+# Test that libpq can reconnect after a server restart.
+if ($have_pty)
+{
+    # fire up an interactive psql session
+    my $in  = '';
+    my $out = '';
+
+    my $timer = timer(5);
+
+    my $h = $node->interactive_psql(
+        'postgres',
+        \$in,
+        \$out,
+        $timer,
+        extra_params => [
+            '-XAtd',
+            $node->connstr('postgres')
+              . " host=$host hostaddr=$hostaddr gssencmode=require user=test1"
+        ]);
+
+    like(
+        $out,
+        qr/GSSAPI-encrypted connection/,
+        "startup banner shows GSSAPI encryption");
+
+    # subroutine to send a command and wait for response
+    sub interactive_command
+    {
+        my ($send, $pattern, $annotation) = @_;
+
+        # report test failures from caller location
+        local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+        # reset output collector
+        $out = "";
+        # restart per-command timer
+        $timer->start(5);
+        # send the data to be sent
+        $in .= $send;
+        # wait ...
+        pump $h until ($out =~ $pattern || $timer->is_expired);
+        my $okay = ($out =~ $pattern && !$timer->is_expired);
+        ok($okay, $annotation);
+        # for debugging, log actual output if it didn't match
+        local $Data::Dumper::Terse = 1;
+        local $Data::Dumper::Useqq = 1;
+        diag 'Actual output was '
+          . Dumper($out)
+          . "Did not match \"$pattern\"\n"
+          if !$okay;
+        return;
+    }
+
+    interactive_command("SELECT 2+2;\n", qr/4/, "interactive psql is alive");
+
+    $node->restart;
+
+    interactive_command(
+        "SELECT 2+2;\n",
+        qr/The connection to the server was lost.*GSSAPI-encrypted connection/s,
+        "startup banner shows GSSAPI encryption after reconnection");
+
+    interactive_command("SELECT 20+22;\n", qr/42/, "psql is still alive");
+
+    # send psql an explicit \q to shut it down, else pty won't close properly
+    $timer->start(5);
+    $in .= "\\q\n";
+    finish $h or die "psql returned $?";
+    $timer->reset;
+}
+
 unlink($node->data_dir . '/pg_hba.conf');
 $node->append_conf('pg_hba.conf',
     qq{hostgssenc all all $hostaddr/32 gss map=mymap});

Re: GSSENC'ed connection stalls while reconnection attempts.

From
Kyotaro Horiguchi
Date:
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > If psql connected using GSSAPI auth and server restarted, reconnection
> > sequence stalls and won't return.
> 
> Yeah, reproduced here.  (I wonder if there's any reasonable way to
> exercise this scenario in src/test/kerberos/.)
> 
> > I found that psql(libpq) sends startup packet via gss
> > encryption. conn->gssenc should be reset when encryption state is
> > freed.
> 
> Actually, it looks to me like the GSS support was wedged in by somebody
> who was paying no attention to how SSL is managed, or else we forgot
> to pay attention to GSS the last time we rearranged SSL support.  It's
> completely broken for the multiple-host-addresses scenario as well,
> because try_gss is being set and cleared in the wrong places altogether.
> conn->gcred is not being handled correctly either I think --- we need
> to make sure that it's dropped in pqDropConnection.
> 
> The attached patch makes this all act more like the way SSL is handled,
> and for me it resolves the reconnection problem.

It looks good to me.

> > The reason that psql doesn't notice the error is pqPacketSend returns
> > STATUS_OK when write error occurred.  That behavior contradicts to the
> > comment of the function. The function is used only while making
> > connection so it's ok to error out immediately by write failure.  I
> > think other usage of pqFlush while making a connection should report
> > write failure the same way.
> 
> I'm disinclined to mess with that, because (a) I don't think it's the
> actual source of the problem, and (b) it would affect way more than
> just GSS mode.

If I did that in pqFlush your objection would be right, but
pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
so not doing that is just wrong. (pqSendSome reported write failure in
this case.) For other parts in authentication code, I don't think it
doesn't affect badly because authentication should proceed without any
read/write overlapping.

> > Finally, It's user-friendly if psql shows error message for error on
> > reset attempts. (This perhaps should be arguable.)
> 
> Meh, that's changing fairly longstanding behavior that I don't think
> we've had many complaints about.

Yeah, I haven't seen the message for any other reasons than the
absence of server. :p And, I noticed that, in the first place, I would
see that message in the next connection attempt from scratch.

I agree to you on this point.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: GSSENC'ed connection stalls while reconnection attempts.

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
>> The attached patch makes this all act more like the way SSL is handled,
>> and for me it resolves the reconnection problem.

> It looks good to me.

OK, thanks.

>>> The reason that psql doesn't notice the error is pqPacketSend returns
>>> STATUS_OK when write error occurred.  That behavior contradicts to the
>>> comment of the function. The function is used only while making
>>> connection so it's ok to error out immediately by write failure.  I
>>> think other usage of pqFlush while making a connection should report
>>> write failure the same way.

>> I'm disinclined to mess with that, because (a) I don't think it's the
>> actual source of the problem, and (b) it would affect way more than
>> just GSS mode.

> If I did that in pqFlush your objection would be right, but
> pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
> so not doing that is just wrong. (pqSendSome reported write failure in
> this case.) For other parts in authentication code, I don't think it
> doesn't affect badly because authentication should proceed without any
> read/write overlapping.

As the comment for pqSendSome says, we report a write failure immediately
only if we also cannot read.  I don't really see a reason why the behavior
described there isn't fine during initial connection as well.  If you feel
that the comment for pqPacketSend needs adjustment, we can do that.
In any case, I'm quite against changing pqPacketSend's behavior because
"it's only used during initial connection"; there is nothing about the
function that restricts it to that case.

Bottom line here is that I'm suspicious of changing the behavior of
the read/write code on the strength of a bug in the GSS state management
logic.  If there's a reason to change the read/write code, we should be
able to demonstrate it without the GSS bug.

            regards, tom lane



Re: GSSENC'ed connection stalls while reconnection attempts.

From
Kyotaro Horiguchi
Date:
At Mon, 13 Jul 2020 11:08:09 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote intgl> >> I'm disinclined to mess with that,
because(a) I don't think it's the
 
> >> actual source of the problem, and (b) it would affect way more than
> >> just GSS mode.
> 
> > If I did that in pqFlush your objection would be right, but
> > pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
> > so not doing that is just wrong. (pqSendSome reported write failure in
> > this case.) For other parts in authentication code, I don't think it
> > doesn't affect badly because authentication should proceed without any
> > read/write overlapping.
> 
> As the comment for pqSendSome says, we report a write failure immediately
> only if we also cannot read.  I don't really see a reason why the behavior
> described there isn't fine during initial connection as well.  If you feel
> that the comment for pqPacketSend needs adjustment, we can do that.

I'm fine with that.

> In any case, I'm quite against changing pqPacketSend's behavior because
> "it's only used during initial connection"; there is nothing about the
> function that restricts it to that case.

That sounds fair enough.

> Bottom line here is that I'm suspicious of changing the behavior of
> the read/write code on the strength of a bug in the GSS state management
> logic.  If there's a reason to change the read/write code, we should be
> able to demonstrate it without the GSS bug.

Agreed to separate the change from this issue.  I also don't think
that change in behavior dramatically improve the situation since we
should have had a bunch of trouble when a write actually failed in the
normal case.

I'm going to post a patch to change the comment of pqPacketSend.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: GSSENC'ed connection stalls while reconnection attempts.

From
Kyotaro Horiguchi
Date:
At Tue, 14 Jul 2020 13:31:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Agreed to separate the change from this issue.  I also don't think
> that change in behavior dramatically improve the situation since we
> should have had a bunch of trouble when a write actually failed in the
> normal case.
> 
> I'm going to post a patch to change the comment of pqPacketSend.

So this is a proposal to add a description about the behavior on write
failure.  The last half of the addition is a copy from the comment of
pqFlush.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5e01ed1fccce083b1e1c13692d19474378c66db6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 15 Jul 2020 11:27:57 +0900
Subject: [PATCH] Explain pqPacketSend more precisely.

The comment explains that the function always returns STATUS_ERRUR for
write failure, but this may not be the case. Add description to make
it clear that there is such a case.
---
 src/interfaces/libpq/fe-connect.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 7bee9dd201..af2818add4 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4505,7 +4505,9 @@ PQrequestCancel(PGconn *conn)
  * buf, buf_len: contents of message.  The given length includes only what
  * is in buf; the message type and message length fields are added here.
  *
- * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise.
+ * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise. Note that the
+ * function may return STATUS_OK even if it failed to write to the underlying
+ * socket. (See pqSendSome comments about how failure should be handled.)
  * SIDE_EFFECTS: may block.
  *
  * Note: all messages sent with this routine have a length word, whether
-- 
2.18.4