Thread: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

The following bug has been logged on the website:

Bug reference:      18210
Logged by:          Lars Kanis
Email address:      lars@greiz-reinsdorf.de
PostgreSQL version: 16.1
Operating system:   Ubuntu-23.10
Description:

I'm the maintainer of the Ruby binding of libpq. This issue happens in
Ruby, but it isn't specific to the Ruby environment; it shows up there only
because the Ruby binding always runs libpq with non-blocking mode enabled
internally.

The error "GSSAPI caller failed to retransmit all data needing to be
retried" is raised here:

https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
It happens only in non-blocking mode over GSSAPI encrypted connections. It
isn't reliable and depends on network timing. When sending a 7 MB file in
alternating pieces of 27 KB and 180 bytes per PQputCopyData() call, there is
a 50% chance of hitting the failure over the local network. It doesn't
happen if TLS is used instead.

I added some debug prints to libpq to show the amounts of data
transferred. The output below also includes the gdb backtrace of where the
error is raised:
```
[...lots of data blocks successfully transferred...]
PQputCopyData() size=186
PQputCopyData() size=27714
pqPutMsgEnd() conn->outCount=30162 toSend=24576
pg_GSS_write() len=24576  PqGSSSendConsumed=0
PQputCopyData() size=189
PQputCopyData() size=27405
pqPutMsgEnd() conn->outCount=33190 toSend=32768
pg_GSS_write() len=32768  PqGSSSendConsumed=0
PQputCopyData() size=185
PQputCopyData() size=27481
pqPutMsgEnd() conn->outCount=28098 toSend=24576
pg_GSS_write() len=24576  PqGSSSendConsumed=0
PQputCopyData() size=182
pqPutMsgEnd() conn->outCount=11965 toSend=8192
pg_GSS_write() len=8192  PqGSSSendConsumed=8256

Breakpoint 1, 0x00007ffff20fe811 in appendPQExpBufferStr ()
   from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
(gdb) bt
#0  0x00007ffff20fe811 in appendPQExpBufferStr () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#1  0x00007ffff2103545 in pg_GSS_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#2  0x00007ffff20fbbad in pqsecure_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#3  0x00007ffff20f569d in pqSendSome () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#4  0x00007ffff20f517c in pqPutMsgEnd () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#5  0x00007ffff20f07ba in PQputCopyData () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#6  0x00007ffff1f259e2 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#7  0x00007ffff7c67002 in rb_nogvl () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#8  0x00007ffff1f261ac in gvl_PQputCopyData ()
   from
/home//kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#9  0x00007ffff1f2af90 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#10 0x00007ffff7c9ebff in vm_call_cfunc_with_frame ()
   from /home//kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#11 0x00007ffff7caefa0 in vm_exec_core () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#12 0x00007ffff7cb4a23 in rb_vm_exec () from /home/kanis/.rvm/rub
[...more ruby stack...]
```

The problem occurs when pqPutMsgEnd() sends some data via pqSendSome()
because there are more than 8192 bytes in the output buffer. The data size
is truncated down to the next lower 8 KB boundary. This happens here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-misc.c#L544
Due to the truncation, the new boundary can end up before the boundary of
the previous pqSendSome() call. This in turn triggers the error in
pg_GSS_write(), which cannot transfer any data in that case.

So the error message "GSSAPI caller failed to retransmit all data needing to
be retried" is not really about the application calling libpq. In fact, the
7 MB file always fits into libpq's internal buffer memory, so no retransmits
are done at the application level or at the level of the Ruby binding.

The issue is fixed if I change libpq in either pqPutMsgEnd() or
pg_GSS_write().
Either avoid the truncation of the send size to the next 8 KB boundary in
pqPutMsgEnd(), like so:
```
    if (conn->outCount >= 8192)
    {
        int            toSend = conn->outCount;    /* was: conn->outCount - (conn->outCount % 8192) */

        if (pqSendSome(conn, toSend) < 0)
            return EOF;
```

... or just return 0 bytes from pg_GSS_write() instead of raising an
error:
```
    if (len < PqGSSSendConsumed)
    {
        return 0;       /* added: report nothing sent so the caller retries */
        appendPQExpBufferStr(&conn->errorMessage,
                             "GSSAPI caller failed to retransmit all data needing to be retried\n");
```

Unfortunately so far I'm unable to provide a repro script. Maybe the above
information is enough to understand the problem.

Regards,
Lars Kanis


PG Bug reporting form <noreply@postgresql.org> writes:
> The error "GSSAPI caller failed to retransmit all data needing to be
> retried" is raised here:
>
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
> It happens only in non-blocking mode over GSSAPI encrypted connections. It
> isn't reliable and depends on the network timing. When sending a 7MB file in
> alternating pieces of 27KB and 180 Byte per PQputCopyData() there is a 50%
> chance to get the failure over the local network. It doesn't happen if TLS
> is used instead.

Hmm.

> The issue is fixed if I change libpq at either pqPutMsgEnd() or at
> pg_GSS_write().

Both of those seem like pretty obviously wrong "fixes".  However,
I'm not quite sure I see the problem clearly enough to propose
a fix I'd believe in.

> Unfortunately so far I'm unable to provide a repro script. Maybe the above
> information is enough to understand the problem.

A repro script would be really really helpful here.

            regards, tom lane



I wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> The error "GSSAPI caller failed to retransmit all data needing to be
>> retried" is raised here:
>>
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
>> It happens only in non-blocking mode over GSSAPI encrypted connections. It
>> isn't reliable and depends on the network timing. When sending a 7MB file in
>> alternating pieces of 27KB and 180 Byte per PQputCopyData() there is a 50%
>> chance to get the failure over the local network. It doesn't happen if TLS
>> is used instead.

> A repro script would be really really helpful here.

After consuming more caffeine, I was able to repro it with the
not-intended-for-commit hack in 0001 attached.  (On my machine,
the test just hangs up upon failing, because the hacked-up
logic in copy.c doesn't cope very well with the failure.  I don't
think that is copy.c's fault though, it's just an incomplete hack.)

I concur with the conclusion that it's really pqPutMsgEnd's fault.
By deciding not to send the last partial block that's in the
outBuffer, it runs the risk of not presenting some data that it did
present the last time, and that can trigger the "failed to retransmit
all data" error.  This happens because GSS's gss_MaxPktSize is a bit
less than 16K (it's 16320 on my machine).  So if we initially present
24K of data (3 blocks), and pg_GSS_write successfully encrypts and
sends one packet of data, then it will encrypt all the rest.  But if
its second pqsecure_raw_write call fails with EINTR, it will return
with bytes_sent = 16320 (and PqGSSSendConsumed = 8256), causing us to
reduce the outBuffer contents to 8256 bytes plus whatever partial
block we didn't try to send.  If we don't fill outBuffer to at least
16K before trying again, we'll try to send just 8192 bytes, and
kaboom.  (This is why the alternating-long-and-short-lines business
is important.)

The quick hack in 0002 attached fixes it, but I can't say that
I like this solution: it's propagating a bit of ugliness that
ought to be localized in pg_GSS_write out to callers.

I wonder if we should drop the idea of returning a positive bytecount
after a partial write, and just return the pqsecure_raw_write result,
and not reset PqGSSSendConsumed until we write everything presented.
In edge cases maybe that would result in some buffer bloat, but it
doesn't seem worse than what happens when the very first
pqsecure_raw_write returns EINTR.

In any case, the backend needs a look to see whether it requires a
similar fix.  We don't do nonblock mode there, but I don't think
that means we can never get EINTR.

            regards, tom lane

diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..2b9c278ef7 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -505,7 +505,7 @@ handleCopyOut(PGconn *conn, FILE *copystream, PGresult **res)
  */

 /* read chunk size for COPY IN - size is not critical */
-#define COPYBUFSIZ 8192
+#define COPYBUFSIZ 32768

 bool
 handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
@@ -531,6 +531,8 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
         goto copyin_cleanup;
     }

+    (void) PQsetnonblocking(conn, 1);
+
     /* Prompt if interactive input */
     if (isatty(fileno(copystream)))
     {
@@ -649,16 +651,17 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
             }

             /*
-             * If the buffer is full, or we've reached the EOF, flush it.
-             *
-             * Make sure there's always space for four more bytes in the
-             * buffer, plus a NUL terminator.  That way, an EOF marker is
-             * never split across two fgets() calls, which simplifies the
-             * logic.
+             * Flush after every line for test purposes.
              */
-            if (buflen >= COPYBUFSIZ - 5 || (copydone && buflen > 0))
+            if (buflen > 0)
             {
-                if (PQputCopyData(conn, buf, buflen) <= 0)
+                int pcdres;
+
+                while ((pcdres = PQputCopyData(conn, buf, buflen)) == 0)
+                {
+                    /* busy-wait */
+                }
+                if (pcdres < 0)
                 {
                     OK = false;
                     break;
@@ -686,6 +689,8 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)

 copyin_cleanup:

+    (void) PQsetnonblocking(conn, 0);
+
     /*
      * Clear the EOF flag on the stream, in case copying ended due to an EOF
      * signal.  This allows an interactive TTY session to perform another COPY
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 0deb9bffc8..711bbc1e29 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -419,6 +419,25 @@ test_query(
     'gssencmode=require',
     'sending 100K lines works');

+my $bigdata = '';
+for (my $rpt = 0; $rpt < 1000; $rpt++)
+{
+    $bigdata .= 'xyzzy' x (27*1024/5) . "\n";
+    $bigdata .= 'qwert' x (180/5) . "\n";
+}
+
+test_query(
+    $node,
+    'test1',
+    "CREATE TEMP TABLE mytab (f1 text);\n"
+      . "COPY mytab FROM STDIN;\n"
+      . $bigdata
+      . "\\.\n"
+      . "SELECT COUNT(*) FROM mytab;",
+    qr/^2000$/s,
+    'gssencmode=require',
+    'sending alternating-size lines works');
+
 # require_auth=gss succeeds if required.
 $node->connect_ok(
     $node->connstr('postgres')
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 660cdec93c..5a74a95968 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -541,7 +541,19 @@ pqPutMsgEnd(PGconn *conn)
 
     if (conn->outCount >= 8192)
     {
-        int            toSend = conn->outCount - (conn->outCount % 8192);
+        int            toSend = conn->outCount;
+
+        /*
+         * Normally we only send full 8K blocks.  However, if we don't present
+         * the last partial block we can trigger pg_GSS_write's "failed to
+         * retransmit all data" failure, since rounding toSend down could
+         * result in not presenting some data that we did present on an
+         * earlier try.  The full-block heuristic is pretty useless anyway
+         * when encrypting, since encryption overhead will throw off the
+         * number.  Hence, in GSSAPI mode just send all we have.
+         */
+        if (!conn->gssenc)
+            toSend -= toSend % 8192;
 
         if (pqSendSome(conn, toSend) < 0)
             return EOF;

On 22.11.23 at 16:11, Tom Lane wrote:

> A repro script would be really really helpful here.


After fiddling a bit longer I was able to write a reproducer in C. It needs a Kerberos-enabled environment, and it reliably fails as described when running over the Internet (via a VPN connection). Then the output is:

  PQputCopyData text2 command failed: GSSAPI caller failed to retransmit all data needing to be retried

The program is here: https://gist.github.com/larskanis/5546a20365b0ca64e294a32b138f441e

I wasn't able to reproduce the error on a LAN connection with the above program. Even adding a trigger on the copy table to slow the server down, or adding PQflush/PQconsumeInput calls, didn't trigger the GSSAPI error. In contrast, the original Ruby program sometimes fails even on a LAN connection.

Regards,
Lars Kanis


I wrote:
> I wonder if we should drop the idea of returning a positive bytecount
> after a partial write, and just return the pqsecure_raw_write result,
> and not reset PqGSSSendConsumed until we write everything presented.
> In edge cases maybe that would result in some buffer bloat, but it
> doesn't seem worse than what happens when the very first
> pqsecure_raw_write returns EINTR.

Here's a patch that fixes it along those lines.  I like this better,
I think, not least because it removes the assumption that "interesting"
pqsecure_raw_write failures will recur on the next try.

Still need to look at syncing the backend with this.

            regards, tom lane

diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 7e373236e9..ea8b0020d2 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -79,8 +79,8 @@
  * On success, returns the number of data bytes consumed (possibly less than
  * len).  On failure, returns -1 with errno set appropriately.  If the errno
  * indicates a non-retryable error, a message is added to conn->errorMessage.
- * For retryable errors, caller should call again (passing the same data)
- * once the socket is ready.
+ * For retryable errors, caller should call again (passing the same or more
+ * data) once the socket is ready.
  */
 ssize_t
 pg_GSS_write(PGconn *conn, const void *ptr, size_t len)
@@ -90,19 +90,25 @@ pg_GSS_write(PGconn *conn, const void *ptr, size_t len)
     gss_buffer_desc input,
                 output = GSS_C_EMPTY_BUFFER;
     ssize_t        ret = -1;
-    size_t        bytes_sent = 0;
     size_t        bytes_to_encrypt;
     size_t        bytes_encrypted;
     gss_ctx_id_t gctx = conn->gctx;

     /*
-     * When we get a failure, we must not tell the caller we have successfully
-     * transmitted everything, else it won't retry.  Hence a "success"
-     * (positive) return value must only count source bytes corresponding to
-     * fully-transmitted encrypted packets.  The amount of source data
-     * corresponding to the current partly-transmitted packet is remembered in
+     * When we get a retryable failure, we must not tell the caller we have
+     * successfully transmitted everything, else it won't retry.  For
+     * simplicity, we claim we haven't transmitted anything until we have
+     * successfully transmitted all "len" bytes.  Between calls, the amount of
+     * the current input data that's already been encrypted and placed into
+     * PqGSSSendBuffer (and perhaps transmitted) is remembered in
      * PqGSSSendConsumed.  On a retry, the caller *must* be sending that data
      * again, so if it offers a len less than that, something is wrong.
+     *
+     * Note: it may seem attractive to report partial write completion once
+     * we've successfully sent any encrypted packets.  However, that can cause
+     * problems for callers; notably, pqPutMsgEnd's heuristic to send only
+     * full 8K blocks interacts badly with such a hack.  We won't save much,
+     * typically, by letting callers discard data early, so don't risk it.
      */
     if (len < PqGSSSendConsumed)
     {
@@ -140,33 +146,20 @@ pg_GSS_write(PGconn *conn, const void *ptr, size_t len)

             retval = pqsecure_raw_write(conn, PqGSSSendBuffer + PqGSSSendNext, amount);
             if (retval <= 0)
-            {
-                /*
-                 * Report any previously-sent data; if there was none, reflect
-                 * the pqsecure_raw_write result up to our caller.  When there
-                 * was some, we're effectively assuming that any interesting
-                 * failure condition will recur on the next try.
-                 */
-                if (bytes_sent)
-                    return bytes_sent;
                 return retval;
-            }

             /*
              * Check if this was a partial write, and if so, move forward that
              * far in our buffer and try again.
              */
-            if (retval != amount)
+            if (retval < amount)
             {
                 PqGSSSendNext += retval;
                 continue;
             }

-            /* We've successfully sent whatever data was in that packet. */
-            bytes_sent += PqGSSSendConsumed;
-
-            /* All encrypted data was sent, our buffer is empty now. */
-            PqGSSSendLength = PqGSSSendNext = PqGSSSendConsumed = 0;
+            /* We've successfully sent whatever data was in the buffer. */
+            PqGSSSendLength = PqGSSSendNext = 0;
         }

         /*
@@ -192,7 +185,7 @@ pg_GSS_write(PGconn *conn, const void *ptr, size_t len)

         /*
          * Create the next encrypted packet.  Any failure here is considered a
-         * hard failure, so we return -1 even if bytes_sent > 0.
+         * hard failure, so we return -1 even if some data has been sent.
          */
         major = gss_wrap(&minor, gctx, 1, GSS_C_QOP_DEFAULT,
                          &input, &conf_state, &output);
@@ -236,10 +229,13 @@ pg_GSS_write(PGconn *conn, const void *ptr, size_t len)
     }

     /* If we get here, our counters should all match up. */
-    Assert(bytes_sent == len);
-    Assert(bytes_sent == bytes_encrypted);
+    Assert(len == PqGSSSendConsumed);
+    Assert(len == bytes_encrypted);
+
+    /* We're reporting all the data as sent, so reset PqGSSSendConsumed. */
+    PqGSSSendConsumed = 0;

-    ret = bytes_sent;
+    ret = bytes_encrypted;

 cleanup:
     /* Release GSSAPI buffer storage, if we didn't already */
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index c745facfec..22bc682ffc 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -583,8 +583,8 @@ struct pg_conn
     int            gss_SendLength; /* End of data available in gss_SendBuffer */
     int            gss_SendNext;    /* Next index to send a byte from
                                  * gss_SendBuffer */
-    int            gss_SendConsumed;    /* Number of *unencrypted* bytes consumed
-                                     * for current contents of gss_SendBuffer */
+    int            gss_SendConsumed;    /* Number of source bytes encrypted but
+                                     * not yet reported as sent */
     char       *gss_RecvBuffer; /* Received, encrypted data */
     int            gss_RecvLength; /* End of data available in gss_RecvBuffer */
     char       *gss_ResultBuffer;    /* Decryption of data in gss_RecvBuffer */

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I wrote:
> > I wonder if we should drop the idea of returning a positive bytecount
> > after a partial write, and just return the pqsecure_raw_write result,
> > and not reset PqGSSSendConsumed until we write everything presented.
> > In edge cases maybe that would result in some buffer bloat, but it
> > doesn't seem worse than what happens when the very first
> > pqsecure_raw_write returns EINTR.
>
> Here's a patch that fixes it along those lines.  I like this better,
> I think, not least because it removes the assumption that "interesting"
> pqsecure_raw_write failures will recur on the next try.

Thanks for looking into this.  I agree with this approach and took a
look through the patch and it all makes sense to me.

> Still need to look at syncing the backend with this.

I see you made similar changes to the backend and those also look good.

Thanks again,

Stephen
