BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection |
Date | |
Msg-id | 18210-4c6d0b14627f2eb8@postgresql.org Whole thread Raw |
Responses |
Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 18210 Logged by: Lars Kanis Email address: lars@greiz-reinsdorf.de PostgreSQL version: 16.1 Operating system: Ubuntu-23.10 Description: I'm the maintainer of the ruby binding of libpq and this issue happens in Ruby, but it isn't related to the Ruby environment. It is only that the ruby binding always runs with the libpq non-blocking mode enabled internally, so that this issue happens. The error "GSSAPI caller failed to retransmit all data needing to be retried" is raised here: https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110 It happens only in non-blocking mode over GSSAPI encrypted connections. It isn't reliable and depends on the network timing. When sending a 7MB file in alternating pieces of 27KB and 180 Byte per PQputCopyData() there is a 50% chance to get the failure over the local network. It doesn't happen if TLS is used instead. I added some debug prints into libpq to show the amounts of data transferred. It also shows the gdb backtrace of where the error is raised. It like so: ``` [...lots of data blocks successfully transferred...] PQputCopyData() size=186 PQputCopyData() size=27714 pqPutMsgEnd() conn->outCount=30162 toSend=24576 pg_GSS_write() len=24576 PqGSSSendConsumed=0 PQputCopyData() size=189 PQputCopyData() size=27405 pqPutMsgEnd() conn->outCount=33190 toSend=32768 pg_GSS_write() len=32768 PqGSSSendConsumed=0 PQputCopyData() size=185 PQputCopyData() size=27481 pqPutMsgEnd() conn->outCount=28098 toSend=24576 pg_GSS_write() len=24576 PqGSSSendConsumed=0 PQputCopyData() size=182 pqPutMsgEnd() conn->outCount=11965 toSend=8192 pg_GSS_write() len=8192 PqGSSSendConsumed=8256 Breakpoint 1, 0x00007ffff20fe811 in appendPQExpBufferStr () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 (gdb) bt #0 0x00007ffff20fe811 in appendPQExpBufferStr () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #1 0x00007ffff2103545 in pg_GSS_write () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #2 0x00007ffff20fbbad in pqsecure_write () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #3 0x00007ffff20f569d in pqSendSome () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #4 0x00007ffff20f517c in pqPutMsgEnd () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #5 0x00007ffff20f07ba in PQputCopyData () from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5 #6 0x00007ffff1f259e2 in ?? () from /home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so #7 0x00007ffff7c67002 in rb_nogvl () from /home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2 #8 0x00007ffff1f261ac in gvl_PQputCopyData () from /home//kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so #9 0x00007ffff1f2af90 in ?? () from /home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so #10 0x00007ffff7c9ebff in vm_call_cfunc_with_frame () from /home//kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2 #11 0x00007ffff7caefa0 in vm_exec_core () from /home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2 #12 0x00007ffff7cb4a23 in rb_vm_exec () from /home/kanis/.rvm/rub [...more ruby stack...] ``` The problem occurs when pqPutMsgEnd() sends some data per pqSendSome(), if there is more that 8192 Bytes in the output buffer. The data size is truncated to the next lower 8KB boundary. This happens here: https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-misc.c#L544 Due to the truncation it can happen that the boundary is before the boundary of the previous call of pqSendSome(). This in turn is raised by pg_GSS_write() which can not transfer any data in this case. So the error message "GSSAPI caller failed to retransmit all data needing to be retried" is not about the application calling libpq. In fact the 7MB file always fits into libpq's internal buffer memory, so that no retransmits are done on the application level or at the level of the ruby binding. The issue is fixed if I change libpq at either pqPutMsgEnd() or at pg_GSS_write(). So either avoid the truncation of send size to next 8KB boundary in pqPutMsgEnd() like so: ``` if (conn->outCount >= 8192) { int toSend = conn->outCount; // - (conn->outCount % 8192); if (pqSendSome(conn, toSend) < 0) return EOF; ``` ... or just returning 0 Bytes in pg_GSS_write() instead of raising an error: ``` if (len < PqGSSSendConsumed) { return 0; appendPQExpBufferStr(&conn->errorMessage, "GSSAPI caller failed to retransmit all data needing to be retried\n"); ``` Unfortunately so far I'm unable to provide a repro script. Maybe the above information is enough to understand the problem. Regards, Lars Kanis
pgsql-bugs by date: