Thread: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 18907
Logged by: Dorjpalam Batbaatar
Email address: htgn.dbat.95@gmail.com
PostgreSQL version: 16.4
Operating system: AlmaLinux 9

Description:

When using libpq to transfer large amounts of data to the server in pipeline mode (registering with COPY), an error "SSL error: bad length" sometimes occurs. The most common cause of the error is libpq's PQsendQueryParams(). PostgreSQL is version 16.4.

I looked into this, and it seems that the cause is that OpenSSL's SSL_write() is not being retried when it should be. According to the OpenSSL documentation for SSL_write(), if the return value of SSL_get_error() is SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be called again with the same data.
https://docs.openssl.org/3.0/man3/SSL_write/#warnings

In libpq's message-sending function pqPutMsgEnd(PGconn *conn), if not all data has been sent and the connection is in non-blocking mode, it just returns. However, libpq's exported APIs (e.g. PQsendQueryGuts(), called by PQsendQueryParams()) call pqPutMsgEnd() multiple times, so I think the data being sent changes. In the situation above the write needs to be retried with the same data, but it seems that the error occurs because the send data has changed.

As a test, I tried retrying when pqsecure_write() returned 0 in pqSendSome(), and it ran in pipeline mode without errors. pqSendSome() is the function called from pqPutMsgEnd(PGconn *conn), and it in turn calls pqsecure_write(), which performs the SSL_write(). Below is the patch I tried.

diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 488f7d6e55..bbafb189c9 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -914,22 +914,43 @@ pqSendSome(PGconn *conn, int len)
 			 * Note that errors here don't result in write_failed becoming
 			 * set.
 			 */
-			if (pqReadData(conn) < 0)
+			if (sent > 0)
 			{
-				result = -1;	/* error message already set up */
-				break;
-			}
+				if (pqReadData(conn) < 0)
+				{
+					result = -1;	/* error message already set up */
+					break;
+				}

-			if (pqIsnonblocking(conn))
-			{
-				result = 1;
-				break;
-			}
+				if (pqIsnonblocking(conn))
+				{
+					result = 1;
+					break;
+				}

-			if (pqWait(true, true, conn))
+				if (pqWait(true, true, conn))
+				{
+					result = -1;
+					break;
+				}
+			}
+			else
 			{
-				result = -1;
-				break;
+				/*
+				 * When sent is 0, retry the write. Before writing again, read
+				 * the responses which have arrived from the server.
+				 */
+				if (pqWait(true, true, conn))
+				{
+					result = -1;
+					break;
+				}
+
+				if (pqReadData(conn) < 0)
+				{
+					result = -1;	/* error message already set up */
+					break;
+				}
 			}
 		}
 	}
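For reference, the retry rule from the OpenSSL documentation linked above can be illustrated with a minimal sketch. The ssl_write_retry() helper below is hypothetical (it is not libpq code) and omits the wait-for-socket-readiness step between retries:

#include <openssl/ssl.h>

/*
 * Hypothetical helper illustrating the SSL_write() retry rule: after
 * SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, the next call must present
 * the same buffer and length.
 */
static int
ssl_write_retry(SSL *ssl, const void *buf, int len)
{
	for (;;)
	{
		int		n = SSL_write(ssl, buf, len);	/* same buf/len every time */

		if (n > 0)
			return n;			/* whole buffer was written */

		switch (SSL_get_error(ssl, n))
		{
			case SSL_ERROR_WANT_READ:
			case SSL_ERROR_WANT_WRITE:
				continue;		/* retry with identical arguments */
			default:
				return -1;		/* hard failure */
		}
	}
}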
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> When using libpq to transfer large amounts of data to the server in pipeline
> mode (registering with COPY), an error "SSL error: bad length"
> sometimes occurs.

Could you provide a self-contained test case demonstrating such failures? This is not the kind of code that we like to change on the basis of undocumented claims.

			regards, tom lane
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: Jacob Champion
Date:
On Tue, Apr 29, 2025 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Could you provide a self-contained test case demonstrating such
> failures? This is not the kind of code that we like to change
> on the basis of undocumented claims.

Agreed -- but also, let us know if the answer is "no, I can't", or if you get stuck and need some additional collaboration. These corner cases can be really nasty to track down and record.

Thanks,
--Jacob
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: BATBAATAR Dorjpalam
Date:
I am sending a sample program to reproduce this phenomenon.

The attached archive contains a Makefile to build against PostgreSQL 17. To run the program, all you need is a PostgreSQL 17 server with an SSL connection. After building, you will have an executable named query-data-send-error. Please execute it as follows:

./query-data-send-error -i 200 -u 200 -c "postgres://postgres:postgres@192.168.0.10/postgres?sslmode=require"

The -i option is the number of times to create test data records, -u is the number of times to update the test data records, and -c specifies the connection string of the PostgreSQL server to connect to.

The sample program does the following:
1) Creates the test_data table.
2) Registers test data in units of 100 records, repeated the number of times specified by -i.
3) Repeatedly updates the registered records the number of times specified by -u.

My environment is as follows:
PostgreSQL Server: 17.2
OS: Rocky Linux 9.5 (Blue Onyx)
Kernel: Linux 5.14.0-503.22.1.el9_5.x86_64
Spec: CPU 4vCore / Memory 8G / HDD 400G

At runtime, the following error occurs during the update phase:
Line : 552 SSL error: bad length
SSL SYSCALL error: EOF detected

Depending on the timing, this error may not occur, but if the number of iterations is increased, it will occur almost every time.

On 2025/04/30 3:48, Jacob Champion wrote:
> On Tue, Apr 29, 2025 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could you provide a self-contained test case demonstrating such
>> failures? This is not the kind of code that we like to change
>> on the basis of undocumented claims.
> Agreed -- but also, let us know if the answer is "no, I can't", or if
> you get stuck and need some additional collaboration. These corner
> cases can be really nasty to track down and record.
>
> Thanks,
> --Jacob
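For readers who don't have the attachment handy, the update phase of the reproducer presumably boils down to something like the sketch below: queue many UPDATEs in pipeline mode on a nonblocking SSL connection, then flush and drain the results. The connection string, table name, payload, and iteration count are placeholders; this is an assumption about the attached program's structure, not its actual source.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	/* placeholder conninfo; the real test takes it from the -c option */
	PGconn	   *conn = PQconnectdb("postgres://localhost/postgres?sslmode=require");

	if (PQstatus(conn) != CONNECTION_OK ||
		PQsetnonblocking(conn, 1) != 0 ||
		PQenterPipelineMode(conn) != 1)
	{
		fprintf(stderr, "setup failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	/* queue a batch of UPDATEs without waiting for their results */
	for (int i = 0; i < 100; i++)
	{
		const char *val = "a reasonably large text payload ...";

		if (PQsendQueryParams(conn,
							  "UPDATE test_data SET payload = $1",
							  1, NULL, &val, NULL, NULL, 0) != 1)
		{
			fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
			exit(1);
		}
	}
	PQpipelineSync(conn);

	/* push out buffered data; in nonblocking mode PQflush can return 1 */
	while (PQflush(conn) == 1)
		;						/* real code would wait for write-readiness */

	/* drain results until the pipeline sync point comes back */
	for (;;)
	{
		PGresult   *res = PQgetResult(conn);

		if (res == NULL)
			continue;			/* NULL separates one query's results from the next */
		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
		{
			PQclear(res);
			break;
		}
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}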
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: Tom Lane
Date:
BATBAATAR Dorjpalam <htgn.dbat.95@gmail.com> writes:
> I am sending a sample program to reproduce this phenomenon.

Thank you for the reproducer! (For anyone following along at home, it doesn't fail for me with the suggested "-i 200 -u 200" parameters, but it does fail in most runs with "-i 1000 -u 200".)

After some playing around, I figured out that the trouble scenario is like this:

* We have a bunch of data pending to be sent, and we try to pqFlush() it. OpenSSL returns SSL_WANT_WRITE, and since we're in nonblock mode we just accept the failure-to-write and continue on.

* The app provides a bit more data to be sent, and we get to pqPutMsgEnd(), which does this:

	if (conn->outCount >= 8192)
	{
		int			toSend = conn->outCount - (conn->outCount % 8192);

		if (pqSendSome(conn, toSend) < 0)
			return EOF;
		/* in nonblock mode, don't complain if unable to send it all */
	}

Because of rounding toSend down to an 8K multiple, we are asking OpenSSL to send less than the previous pqFlush call asked to send. That violates the SSL_write() API, and at least some of the time it results in SSL_R_BAD_LENGTH.

As a quick cross-check I've been running with

diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index c14e3c95250..75593ef0f72 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -555,7 +555,7 @@ pqPutMsgEnd(PGconn *conn)
 
 	if (conn->outCount >= 8192)
 	{
-		int			toSend = conn->outCount - (conn->outCount % 8192);
+		int			toSend = conn->outCount;
 
 		if (pqSendSome(conn, toSend) < 0)
 			return EOF;

and that seems to prevent the failure.

The SSL_write docs say that you should not either increase or decrease the length during a repeat call after SSL_WANT_WRITE, but that seems to be a lie: increasing the length doesn't cause any problems. (We do use SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, and perhaps that affects this?)

For a real fix, the narrowest answer would be to not round down toSend if we are using an SSL connection. I wonder though if the round-down behavior is of any use with GSSAPI either, or more generally if it's sensible for anything except a Unix-pipe connection.

			regards, tom lane
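To make the length mismatch concrete, here is a tiny standalone example with made-up byte counts showing how the round-down can produce a smaller length than the one an earlier pqFlush() had already presented to SSL_write():

#include <stdio.h>

/*
 * Made-up byte counts: pqFlush() offered the whole buffer to SSL_write()
 * and got SSL_ERROR_WANT_WRITE; after more data was appended, pqPutMsgEnd()'s
 * round-down offers a smaller length, which OpenSSL can reject with
 * SSL_R_BAD_LENGTH.
 */
int
main(void)
{
	int			offered_by_pqFlush = 20000; /* whole buffer at pqFlush time */
	int			outCount_now = 20300;		/* a bit more data appended since */
	int			toSend = outCount_now - (outCount_now % 8192);	/* 16384 */

	printf("previous SSL_write length %d, new length %d: %s\n",
		   offered_by_pqFlush, toSend,
		   toSend < offered_by_pqFlush ? "smaller -> bad length" : "ok");
	return 0;
}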
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: Tom Lane
Date:
I wrote:
> For a real fix, the narrowest answer would be to not round down
> toSend if we are using an SSL connection. I wonder though if
> the round-down behavior is of any use with GSSAPI either, or
> more generally if it's sensible for anything except a Unix-pipe
> connection.

Indeed, it looks like we'd better disable the round-down for GSSAPI too, because pg_GSS_write has exactly this same API requirement that the caller has to pass at least as much data as last time.

Interestingly, we got a report of such a failure with GSSAPI awhile ago, and "fixed" it in commit d053a879b. Apparently the test case we were looking at then did not trigger this specific pattern involving pqFlush followed by pqPutMsgEnd, because that commit did not do anything to prevent this failure pattern.

I'm disinclined to revert what d053a879b did, but we'd better remove or update this comment:

+ * Note: it may seem attractive to report partial write completion once
+ * we've successfully sent any encrypted packets. However, that can cause
+ * problems for callers; notably, pqPutMsgEnd's heuristic to send only
+ * full 8K blocks interacts badly with such a hack. We won't save much,
+ * typically, by letting callers discard data early, so don't risk it.

			regards, tom lane
Re: BUG #18907: SSL error: bad length failure during transfer data in pipeline mode with libpq
From: Tom Lane
Date:
I wrote:
> Indeed, it looks like we'd better disable the round-down for GSSAPI
> too, because pg_GSS_write has exactly this same API requirement that
> caller has to pass at least as much data as last time.

So in short, I propose the attached patch. I chose to disable the round-down behavior for all TCP connections, including ones that use neither SSL nor GSSAPI. I'm not sure if it's worth worrying too much about that case.

			regards, tom lane

diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index 3534f0b8111..5d98c58ffa8 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -121,9 +121,9 @@ be_gssapi_write(Port *port, const void *ptr, size_t len)
 	 * again, so if it offers a len less than that, something is wrong.
 	 *
 	 * Note: it may seem attractive to report partial write completion once
-	 * we've successfully sent any encrypted packets.  However, that can cause
-	 * problems for callers; notably, pqPutMsgEnd's heuristic to send only
-	 * full 8K blocks interacts badly with such a hack.  We won't save much,
+	 * we've successfully sent any encrypted packets.  However, doing that
+	 * expands the state space of this processing and has been responsible for
+	 * bugs in the past (cf. commit d053a879b).  We won't save much,
 	 * typically, by letting callers discard data early, so don't risk it.
 	 */
 	if (len < PqGSSSendConsumed)
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index c14e3c95250..dca44fdc5d2 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -553,9 +553,35 @@ pqPutMsgEnd(PGconn *conn)
 	/* Make message eligible to send */
 	conn->outCount = conn->outMsgEnd;
 
+	/* If appropriate, try to push out some data */
 	if (conn->outCount >= 8192)
 	{
-		int			toSend = conn->outCount - (conn->outCount % 8192);
+		int			toSend = conn->outCount;
+
+		/*
+		 * On Unix-pipe connections, it seems profitable to prefer sending
+		 * pipe-buffer-sized packets not randomly-sized ones, so retain the
+		 * last partial-8K chunk in our buffer for now.  On TCP connections,
+		 * the advantage of that is far less clear.  Moreover, it flat out
+		 * isn't safe when using SSL or GSSAPI, because those code paths have
+		 * API stipulations that if they fail to send all the data that was
+		 * offered in the previous write attempt, we mustn't offer less data
+		 * in this write attempt.  The previous write attempt might've been
+		 * pqFlush attempting to send everything in the buffer, so we mustn't
+		 * offer less now.  (Presently, we won't try to use SSL or GSSAPI on
+		 * Unix connections, so those checks are just Asserts.  They'll have
+		 * to become part of the regular if-test if we ever change that.)
+		 */
+		if (conn->raddr.addr.ss_family == AF_UNIX)
+		{
+#ifdef USE_SSL
+			Assert(!conn->ssl_in_use);
+#endif
+#ifdef ENABLE_GSS
+			Assert(!conn->gssenc);
+#endif
+			toSend -= toSend % 8192;
+		}
 
 		if (pqSendSome(conn, toSend) < 0)
 			return EOF;
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
index 62d05f68496..bc9e1ce06fa 100644
--- a/src/interfaces/libpq/fe-secure-gssapi.c
+++ b/src/interfaces/libpq/fe-secure-gssapi.c
@@ -112,9 +112,9 @@ pg_GSS_write(PGconn *conn, const void *ptr, size_t len)
 	 * again, so if it offers a len less than that, something is wrong.
 	 *
 	 * Note: it may seem attractive to report partial write completion once
-	 * we've successfully sent any encrypted packets.  However, that can cause
-	 * problems for callers; notably, pqPutMsgEnd's heuristic to send only
-	 * full 8K blocks interacts badly with such a hack.  We won't save much,
+	 * we've successfully sent any encrypted packets.  However, doing that
+	 * expands the state space of this processing and has been responsible for
+	 * bugs in the past (cf. commit d053a879b).  We won't save much,
 	 * typically, by letting callers discard data early, so don't risk it.
 	 */
 	if (len < PqGSSSendConsumed)