libpq copy error handling busted - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | libpq copy error handling busted |
Date | |
Msg-id | 20200603201242.ofvm4jztpqytwfye@alap3.anarazel.de Whole thread Raw |
Responses |
Re: libpq copy error handling busted
|
List | pgsql-hackers |
Hi, When libpq is used to COPY data to the server, it doesn't properly handle errors. An easy way to trigger the problem is to start pgbench -i with a sufficiently large scale, and then just shut the server down. pgbench will happily use 100% of the cpu trying to send data to the server, even though libpq knows that the connection is broken. It can't even be cancelled using ctrl-c anymore, because the cancel request cannot be sent: andres@awork3:~/src/postgresql$ pgbench -i -s 4000 -q dropping old tables... creating tables... generating data (client-side)... 80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s) ^CCould not send cancel request: PQcancel() -- connect() failed: No such file or directory This is partially an old problem, and partially got recently worse. Before the below commit we detected terminated connections, but we didn't handle copy failing. The failure to handle terminated connections originates in: commit 1f39a1c0641531e0462a4822f2dba904c5d4d699 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: 2019-03-19 16:20:20 -0400 Restructure libpq's handling of send failures. The problem is basically that pqSendSome() returns *success* in all failure cases. Both when a failure is already known: + /* + * If we already had a write failure, we will never again try to send data + * on that connection. Even if the kernel would let us, we've probably + * lost message boundary sync with the server. conn->write_failed + * therefore persists until the connection is reset, and we just discard + * all data presented to be written. + */ + if (conn->write_failed) + { + /* conn->write_err_msg should be set up already */ + conn->outCount = 0; + return 0; + } + and when initially "diagnosing" the failure: /* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */ switch (SOCK_ERRNO) ... /* Discard queued data; no chance it'll ever be sent */ conn->outCount = 0; return 0; The idea of the above commit was: Instead, let's fix this in a more general fashion by eliminating pqHandleSendFailure altogether, and instead arranging to postpone all reports of send failures in libpq until after we've made an attempt to read and process server messages. The send failure won't be reported at all if we find a server message or detect input EOF. but that doesn't work here, because we never process the error message. There's no code in pqParseInput3() to process server messages while doing copy. I'm honestly a bit baffled. How can we not have noticed that COPY FROM STDIN doesn't handle errors before the input is exhausted? It's not just pgbench, it's psql (and I asume pg_restore) etc as well. Greetings, Andres Freund
pgsql-hackers by date: