libpq copy error handling busted - Mailing list pgsql-hackers

From: Andres Freund
Subject: libpq copy error handling busted
Msg-id: 20200603201242.ofvm4jztpqytwfye@alap3.anarazel.de
List: pgsql-hackers

Hi,

When libpq is used to COPY data to the server, it doesn't properly
handle errors.

An easy way to trigger the problem is to start pgbench -i with a
sufficiently large scale, and then just shut the server down. pgbench
will happily use 100% of the CPU trying to send data to the server, even
though libpq knows that the connection is broken.
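
The sending loop in such clients is roughly the sketch below (simplified,
not pgbench's actual code; copy_rows() and its payload are made up for
illustration). It can only stop early if libpq actually reports the
failure through PQputCopyData()'s return value:

    /*
     * Simplified sketch of a client-side COPY data loop (not pgbench's
     * actual code).
     */
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    static int
    copy_rows(PGconn *conn, long nrows)
    {
        char        line[64];

        for (long i = 0; i < nrows; i++)
        {
            snprintf(line, sizeof(line), "%ld\t%ld\n", i, i * 2);

            /*
             * PQputCopyData() is documented to return -1 on error.  If
             * libpq swallows the write failure and keeps returning 1,
             * this loop burns CPU generating and "sending" rows that
             * libpq silently discards.
             */
            if (PQputCopyData(conn, line, strlen(line)) < 0)
            {
                fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
                return -1;
            }
        }

        if (PQputCopyEnd(conn, NULL) < 0)
        {
            fprintf(stderr, "ending COPY failed: %s", PQerrorMessage(conn));
            return -1;
        }
        return 0;
    }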

It can't even be cancelled using Ctrl-C anymore, because the cancel
request cannot be sent:

andres@awork3:~/src/postgresql$ pgbench -i -s 4000 -q
dropping old tables...
creating tables...
generating data (client-side)...
80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s)
^CCould not send cancel request: PQcancel() -- connect() failed: No such file or directory
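
The cancel path behind that message is essentially the usual
PQgetCancel()/PQcancel() signal handler pattern, sketched below
(simplified, not pgbench's exact code). PQcancel() opens a new connection
to the server to deliver the request, so once the server is gone the
underlying connect() fails, which is the error shown above:

    /*
     * Sketch of the cancel-on-SIGINT pattern (simplified; error
     * reporting in a real handler should stick to async-signal-safe
     * calls).  cancel_obj is assumed to have been set up earlier via
     * PQgetCancel(conn).
     */
    #include <stdio.h>
    #include <libpq-fe.h>

    static PGcancel *cancel_obj = NULL;

    static void
    handle_sigint(int signo)
    {
        char        errbuf[256];

        (void) signo;

        /* PQcancel() returns 0 on failure and fills errbuf */
        if (cancel_obj && !PQcancel(cancel_obj, errbuf, sizeof(errbuf)))
            fprintf(stderr, "Could not send cancel request: %s", errbuf);
    }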


This is partly an old problem, and partly a recent regression. Before
the commit below we detected terminated connections, but we didn't
handle COPY failing.


The failure to handle terminated connections originates in:

commit 1f39a1c0641531e0462a4822f2dba904c5d4d699
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2019-03-19 16:20:20 -0400

    Restructure libpq's handling of send failures.


The problem is basically that pqSendSome() returns *success* in all
failure cases, both when a failure is already known:

+    /*
+     * If we already had a write failure, we will never again try to send data
+     * on that connection.  Even if the kernel would let us, we've probably
+     * lost message boundary sync with the server.  conn->write_failed
+     * therefore persists until the connection is reset, and we just discard
+     * all data presented to be written.
+     */
+    if (conn->write_failed)
+    {
+        /* conn->write_err_msg should be set up already */
+        conn->outCount = 0;
+        return 0;
+    }
+

and when initially "diagnosing" the failure:
            /* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */
            switch (SOCK_ERRNO)
...
                    /* Discard queued data; no chance it'll ever be sent */
                    conn->outCount = 0;
                    return 0;
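
For reference, pqSendSome()'s result convention (per the fe-misc.c
comments) is -1 for a hard failure, 0 for "all data sent", and 1 for
"couldn't send everything" on a nonblocking socket; callers only treat a
negative result as an error, roughly like this (simplified sketch, not
the actual fe-misc.c code):

    #include <stdio.h>              /* for EOF */
    #include "libpq-int.h"          /* internals: pqSendSome(), outCount */

    /*
     * Simplified sketch of how callers such as pqPutMsgEnd()/pqFlush()
     * consume pqSendSome()'s result.  Because the failure paths quoted
     * above zero outCount and return 0, a dead connection looks exactly
     * like a successful flush, and nothing further up (PQputCopyData()
     * and friends) ever reports an error.
     */
    static int
    flush_sketch(PGconn *conn)
    {
        switch (pqSendSome(conn, conn->outCount))
        {
            case -1:
                return EOF;     /* the only case treated as failure */
            case 1:
                return 0;       /* nonblocking: caller retries later */
            default:
                return 0;       /* "all sent" -- or silently discarded! */
        }
    }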

The idea of the above commit was:
    Instead, let's fix this in a more general fashion by eliminating
    pqHandleSendFailure altogether, and instead arranging to postpone
    all reports of send failures in libpq until after we've made an
    attempt to read and process server messages.  The send failure won't
    be reported at all if we find a server message or detect input EOF.

but that doesn't work here, because we never get to process the error
message: there's no code in pqParseInput3() to process server messages
while a COPY is in progress.
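
Until that's fixed, about the only way for an application to notice is
to poke at the connection itself during the COPY, e.g. something like
this workaround sketch (my illustration, not something pgbench or psql
currently do):

    /*
     * Workaround sketch: periodically drain incoming data during a long
     * COPY and check the connection.  PQconsumeInput() returns 0 on
     * trouble, and a detected EOF from the server marks the connection
     * CONNECTION_BAD.
     */
    #include <stdio.h>
    #include <libpq-fe.h>

    static int
    connection_still_ok(PGconn *conn)
    {
        if (!PQconsumeInput(conn) || PQstatus(conn) == CONNECTION_BAD)
        {
            fprintf(stderr, "lost connection: %s", PQerrorMessage(conn));
            return 0;
        }
        return 1;
    }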


I'm honestly a bit baffled. How can we not have noticed that COPY FROM
STDIN doesn't handle errors before the input is exhausted? It's not just
pgbench, it's psql (and I assume pg_restore) etc. as well.

Greetings,

Andres Freund


