Thread: Unhappy with error handling in psql's handleCopyOut()
While looking at the pending patch to make psql report a line count
after COPY, I came across this business in handleCopyOut():

 * Check command status and return to normal libpq state.  After a
 * client-side error, the server will remain ready to deliver data.  The
 * cleanest thing is to fully drain and discard that data.  If the
 * client-side error happened early in a large file, this takes a long
 * time.  Instead, take advantage of the fact that PQexec() will silently
 * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
 * results of any commands following the COPY in a single command string.
 * It also only works for protocol version 3.  XXX should we clean up
 * using the slow way when the connection is using protocol version 2?

which git blames on commit 08146775 (committed by Alvaro on behalf of
Noah).

This does not make me happy.  In the first place, we have not dropped
support for protocol version 2.  In the second place, I fail to see
what the advantage is of kluging things like this.  The main costs of
draining the remaining COPY data are the server-side work of generating
the data and the network transmission costs, neither of which will go
away with this technique.  So I'm thinking we should revert this kluge
and just drain the data straightforwardly, which would also eliminate
the mentioned edge-case misbehavior when there were more commands in
the query string.

Is there a reason I'm overlooking not to do this?

			regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> While looking at the pending patch to make psql report a line count
> after COPY, I came across this business in handleCopyOut():
>
>  * Check command status and return to normal libpq state.  After a
>  * client-side error, the server will remain ready to deliver data.  The
>  * cleanest thing is to fully drain and discard that data.  If the
>  * client-side error happened early in a large file, this takes a long
>  * time.  Instead, take advantage of the fact that PQexec() will silently
>  * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
>  * results of any commands following the COPY in a single command string.
>  * It also only works for protocol version 3.  XXX should we clean up
>  * using the slow way when the connection is using protocol version 2?
>
> which git blames on commit 08146775 (committed by Alvaro on behalf of
> Noah).
>
> This does not make me happy.  In the first place, we have not dropped
> support for protocol version 2.  In the second place, I fail to see
> what the advantage is of kluging things like this.  The main costs of
> draining the remaining COPY data are the server-side work of generating
> the data and the network transmission costs, neither of which will go
> away with this technique.  So I'm thinking we should revert this kluge
> and just drain the data straightforwardly, which would also eliminate
> the mentioned edge-case misbehavior when there were more commands in
> the query string.
>
> Is there a reason I'm overlooking not to do this?

I've not gotten back to it yet, but I ran into a related-seeming issue
where psql would happily chew up 2G of memory trying to send "COPY
failed" notices when it gets disconnected from a server that it's trying
to send data to mid-COPY.
conn->sock was -1, connection was 'CONNECTION_BAD', but the loop towards
the end of handleCopyIn doesn't care and nothing in libpq is changing
PQresultStatus():

    /*
     * Check command status and return to normal libpq state
     *
     * We must not ever return with the status still PGRES_COPY_IN.  Our
     * caller is unable to distinguish that situation from reaching the next
     * COPY in a command string that happened to contain two consecutive COPY
     * FROM STDIN commands.  XXX if something makes PQputCopyEnd() fail
     * indefinitely while retaining status PGRES_COPY_IN, we get an infinite
     * loop.  This is more realistic than handleCopyOut()'s counterpart risk.
     */
    while (res = PQgetResult(conn), PQresultStatus(res) == PGRES_COPY_IN)
    {
        OK = false;
        PQclear(res);
        PQputCopyEnd(pset.db, _("trying to exit copy mode"));
    }

And so it would just keep looping, first building up the 2G of 'copy
failed' notices from the PQputCopyEnd, and then just spinning forever
because it couldn't drain the queue.

	Thanks,

		Stephen
Stephen Frost <sfrost@snowman.net> writes:
> I've not gotten back to it yet, but I ran into a related-seeming issue
> where psql would happily chew up 2G of memory trying to send "COPY
> failed" notices when it gets disconnected from a server that it's trying
> to send data to mid-COPY.  conn->sock was -1, connection was
> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
> care and nothing in libpq is changing PQresultStatus():

[ scratches head... ]  Offhand I'd have expected PQgetResult to start
returning error indications.  It definitely will do that if it gets
error indications from the I/O layer.  Perhaps it didn't see any
because asyncStatus had already been reset from PGASYNC_BUSY?

If so, maybe we need an explicit test on the connection being good before
we return valid PGRES_COPY_IN etc results.

			regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I've not gotten back to it yet, but I ran into a related-seeming issue
> > where psql would happily chew up 2G of memory trying to send "COPY
> > failed" notices when it gets disconnected from a server that it's trying
> > to send data to mid-COPY.  conn->sock was -1, connection was
> > 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
> > care and nothing in libpq is changing PQresultStatus():
>
> [ scratches head... ]  Offhand I'd have expected PQgetResult to start
> returning error indications.  It definitely will do that if it gets
> error indications from the I/O layer.  Perhaps it didn't see any
> because asyncStatus had already been reset from PGASYNC_BUSY?

The I/O layer was definitely returning errors, I traced through that
using gdb (tho I don't have it open any more, had to move on to other
things; been planning to reproduce it).  Here's the full p *conn:

$9 = {pghost = 0x7fa742753300 "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", pghostaddr = 0x0,
  pgport = 0x7fa742753340 "XXXXX", pgunixsocket = 0x0, pgtty = 0x7fa742753360 "",
  connect_timeout = 0x0, client_encoding_initial = 0x0, pgoptions = 0x7fa742753380 "",
  appname = 0x0, fbappname = 0x7fa7427533a0 "psql", dbName = 0x7fa7427532e0 "XXXX",
  replication = 0x0, pguser = 0x7fa7427532c0 "XXXXXXXXXX",
  pgpass = 0x7fa742755070 "XXXXXXXXXXX", keepalives = 0x0, keepalives_idle = 0x0,
  keepalives_interval = 0x0, keepalives_count = 0x0, sslmode = 0x7fa7427533c0 "prefer",
  sslcompression = 0x7fa7427533e0 "1", sslkey = 0x0, sslcert = 0x0, sslrootcert = 0x0,
  sslcrl = 0x0, requirepeer = 0x0, krbsrvname = 0x7fa742753400 "postgres", Pfdebug = 0x0,
  noticeHooks = {noticeRec = 0x7fa73fed3ec0 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x7fa740572f20 <NoticeProcessor>, noticeProcArg = 0x0},
  events = 0x0, nEvents = 0, eventArraySize = 0, status = CONNECTION_BAD,
  asyncStatus = PGASYNC_COPY_IN, xactStatus = PQTRANS_INTRANS, queryclass = PGQUERY_SIMPLE,
  last_query = 0x7fa7427809f0 "COPY XXXXXXXXXXXXXXXXXXXXXXXXXX (XXXXXX, XXXXXX, XXXXXX, XXXXXX, XXXXXX) FROM stdin;",
  last_sqlstate = "\000\000\000\000\000", options_valid = 1 '\001', nonblocking = 0 '\000',
  singleRowMode = 0 '\000', copy_is_binary = 0 '\000', copy_already_done = 0,
  notifyHead = 0x0, notifyTail = 0x0, sock = -1,
  laddr = {addr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 16},
  raddr = {addr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 16},
  pversion = 196608, sversion = 90302, auth_req_received = 1 '\001',
  password_needed = 1 '\001', dot_pgpass_used = 1 '\001', sigpipe_so = 0 '\000',
  sigpipe_flag = 0 '\000', addrlist = 0x0, addr_cur = 0x0, addrlist_family = 0,
  setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, send_appname = 1 '\001',
  be_pid = 25857, be_key = 393001970, md5Salt = "XXXXXX", pstatus = 0x7fa742788630,
  client_encoding = 6, std_strings = 1 '\001', verbosity = PQERRORS_DEFAULT,
  lobjfuncs = 0x0, inBuffer = 0x7fa74274a740 "G", inBufSize = 16384, inStart = 0,
  inCursor = 0, inEnd = 0,
  outBuffer = 0x7fa63a495010 "00229878919\t1", '0' <repeats 12 times>, "13462\tsupplemental\tt\nd",
  outBufSize = 2147475456, outCount = 2147475445, outMsgStart = 2147475446,
  outMsgEnd = 2147475450, rowBuf = 0x7fa742752760, rowBufLen = 32, result = 0x0,
  next_result = 0x0, allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\000', ssl = 0x0,
  peer = 0x0, engine = 0x0, gctx = 0x0, gtarg_nam = 0x0,
  ginbuf = {length = 0, value = 0x0}, goutbuf = {length = 0, value = 0x0},
  errorMessage = {data = 0x7fa742752970 "cannot allocate memory for output buffer\n", len = 41, maxlen = 256},
  workBuffer = {data = 0x7fa742752a80 "SAVEPOINT", len = 9, maxlen = 256}}

You can see asyncStatus is still PGASYNC_COPY_IN and also how outBufSize
grew up to the 2G mark, along with the 'cannot allocate' error.
The p *res looked like:

$2 = {ntups = 0, numAttributes = 0, attDescs = 0x0, tuples = 0x0, tupArrSize = 0,
  numParameters = 0, paramDescs = 0x0, resultStatus = PGRES_COPY_IN,
  cmdStatus = "\000\177\000\000@\000\000\000\000\000\000\000!\000\000\000\000\000\000\000\210ף?\247\177\000\000\000BxB\247\177\000\000`\000\000\000\000\000\000\000\060\000\000\000\000\000\000\000\300dxB\247\177\000\000\000\000\000",
  binary = 0,
  noticeHooks = {noticeRec = 0x7fa73fed3ec0 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x7fa740572f20 <NoticeProcessor>, noticeProcArg = 0x0},
  events = 0x0, nEvents = 0, client_encoding = 6, errMsg = 0x0, errFields = 0x0,
  null_field = "", curBlock = 0x0, curOffset = 0, spaceLeft = 0}

> If so, maybe we need an explicit test on the connection being good before
> we return valid PGRES_COPY_IN etc results.

Right, PQresultStatus() just checks if the PGresult is non-NULL, it's not
doing any checking of the connection state itself.  I had been trying to
work out if something else should have been responsible for changing
asyncStatus away from PGASYNC_COPY_IN when the connection failure is
detected, but having PQresultStatus() check the connection status seems
like it would work.

	Thanks,

		Stephen
I wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> I've not gotten back to it yet, but I ran into a related-seeming issue
>> where psql would happily chew up 2G of memory trying to send "COPY
>> failed" notices when it gets disconnected from a server that it's trying
>> to send data to mid-COPY.  conn->sock was -1, connection was
>> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
>> care and nothing in libpq is changing PQresultStatus():

> [ scratches head... ]  Offhand I'd have expected PQgetResult to start
> returning error indications.

After some study of the code I have a theory about this.  Note that we
don't close the socket on send failures anymore; if the status is
CONNECTION_BAD, that pretty much has to have been done by pqReadData.
Which wouldn't be called in a PQputCopyData loop, as long as everything
was going normally.

Ordinarily, while we're pumping data to the backend, we'll flush libpq's
output buffer every 8K (see pqPutMsgEnd).  Suppose that, in one of those
calls to pqSendSome, we fail to send everything in the buffer but send()
doesn't report a hard error --- maybe it says EINTR, or maybe it just
returns zero.  Since there's still data to send, now pqSendSome will call
pqReadData.  Assume that the read attempt *does* give a hard error;
whereupon pqReadData closes the socket, sets CONNECTION_BAD, and returns
-1.  pqSendSome falls out, leaving data still in the output buffer, and
then we'll return -1 to handleCopyIn, which abandons its data sending
loop and calls PQputCopyEnd.  But there's still >= 8K in the buffer, so
when PQputCopyEnd calls pqPutMsgEnd, the latter tries to send it.  And
fails.  (What's more, pqSendSome doesn't clear the output buffer when it
fails at the top because the socket's gone.)  This results in
PQputCopyEnd exiting without having changed the asyncStatus back to
ASYNC_BUSY from ASYNC_COPY_IN.
Now PQgetResult will happily return a COPY_IN result, so we loop, and the
whole thing repeats, with no state change except we keep adding data to
the buffer.

I think we need to change pqSendSome to reset outCount to zero if it
exits at the top due to no socket.  It would do that if it got a hard
error from pqsecure_write, so why's it not doing so if there's not even a
socket?  This would avoid the problem of bloating the output buffer while
we continue to queue data that can never be sent.  (It might be that some
of the other error situations in pqSendSome should also abandon unwritten
data, but that's less clear.)

Another thought is that maybe PQputCopyEnd should change asyncStatus even
if it fails to queue the copy-end message.  I think the reason why it's
coded like it is is the idea that if we've not queued the message, then
the backend still thinks the copy is active, so we should not change
state.  However, if we've already closed the socket then this is pretty
much moot, so perhaps there needs to be an exception allowing the state
change in that case.  Also, if the output buffer is already over 8K,
we're failing to distinguish "couldn't queue the message" from "couldn't
send the message"; but I'm not sure it's worth refactoring to detect that
case.

Alternatively, we could do what I suggested earlier and change
PQgetResult to not return an ordinary COPY_IN or COPY_OUT status if the
connection is known dead.  This seems probably more robust than tinkering
at the margins in PQputCopyEnd.

			regards, tom lane
I wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> I've not gotten back to it yet, but I ran into a related-seeming issue
>>> where psql would happily chew up 2G of memory trying to send "COPY
>>> failed" notices when it gets disconnected from a server that it's trying
>>> to send data to mid-COPY.  conn->sock was -1, connection was
>>> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
>>> care and nothing in libpq is changing PQresultStatus():

> After some study of the code I have a theory about this.

I was able to reproduce this misbehavior by setting a gdb breakpoint
at pqReadData and then killing the connected server process while psql's
COPY IN was stopped there.  Resetting outCount to zero in the
socket-already-gone case in pqSendSome is enough to fix the problem.
However, I think it's also prudent to hack PQgetResult so that it
won't return a "copy in progress" status if the connection is known
dead.

The error recovery behavior in pqSendSome has been like this since 8.1
or thereabouts, so I'm inclined to push something like the attached into
all branches.

			regards, tom lane

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 5f371b4..764e90c 100644
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
*************** static int PQsendQueryGuts(PGconn *conn,
*** 63,68 ****
--- 63,69 ----
  				 const int *paramFormats, int resultFormat);
  static void parseInput(PGconn *conn);
+ static PGresult *getCopyResult(PGconn *conn, ExecStatusType copytype);
  static bool PQexecStart(PGconn *conn);
  static PGresult *PQexecFinish(PGconn *conn);
  static int PQsendDescribe(PGconn *conn, char desc_type,
*************** PQgetResult(PGconn *conn)
*** 1734,1755 ****
  			conn->asyncStatus = PGASYNC_BUSY;
  			break;
  		case PGASYNC_COPY_IN:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_IN)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_IN);
  			break;
  		case PGASYNC_COPY_OUT:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_OUT)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_OUT);
  			break;
  		case PGASYNC_COPY_BOTH:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_BOTH)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_BOTH);
  			break;
  		default:
  			printfPQExpBuffer(&conn->errorMessage,
--- 1735,1747 ----
  			conn->asyncStatus = PGASYNC_BUSY;
  			break;
  		case PGASYNC_COPY_IN:
! 			res = getCopyResult(conn, PGRES_COPY_IN);
  			break;
  		case PGASYNC_COPY_OUT:
! 			res = getCopyResult(conn, PGRES_COPY_OUT);
  			break;
  		case PGASYNC_COPY_BOTH:
! 			res = getCopyResult(conn, PGRES_COPY_BOTH);
  			break;
  		default:
  			printfPQExpBuffer(&conn->errorMessage,
*************** PQgetResult(PGconn *conn)
*** 1786,1791 ****
--- 1778,1813 ----
  	return res;
  }

+ /*
+  * getCopyResult
+  *	  Helper for PQgetResult: generate result for COPY-in-progress cases
+  */
+ static PGresult *
+ getCopyResult(PGconn *conn, ExecStatusType copytype)
+ {
+ 	/*
+ 	 * If the server connection has been lost, don't pretend everything is
+ 	 * hunky-dory; instead return a PGRES_FATAL_ERROR result, and reset the
+ 	 * asyncStatus to idle (corresponding to what we'd do if we'd detected I/O
+ 	 * error in the earlier steps in PQgetResult).  The text returned in the
+ 	 * result is whatever is in conn->errorMessage; we expect that was filled
+ 	 * with something useful when the lost connection was detected.
+ 	 */
+ 	if (conn->status != CONNECTION_OK)
+ 	{
+ 		pqSaveErrorResult(conn);
+ 		conn->asyncStatus = PGASYNC_IDLE;
+ 		return pqPrepareAsyncResult(conn);
+ 	}
+ 
+ 	/* If we have an async result for the COPY, return that */
+ 	if (conn->result && conn->result->resultStatus == copytype)
+ 		return pqPrepareAsyncResult(conn);
+ 
+ 	/* Otherwise, invent a suitable PGresult */
+ 	return PQmakeEmptyPGresult(conn, copytype);
+ }
+ 
  /*
   * PQexec
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 1eb3ac6..a7afd42 100644
*** a/src/interfaces/libpq/fe-misc.c
--- b/src/interfaces/libpq/fe-misc.c
*************** pqSendSome(PGconn *conn, int len)
*** 804,809 ****
--- 804,811 ----
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("connection not open\n"));
+ 		/* Discard queued data; no chance it'll ever be sent */
+ 		conn->outCount = 0;
  		return -1;
  	}
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I was able to reproduce this misbehavior by setting a gdb breakpoint
> at pqReadData and then killing the connected server process while psql's
> COPY IN was stopped there.  Resetting outCount to zero in the
> socket-already-gone case in pqSendSome is enough to fix the problem.
> However, I think it's also prudent to hack PQgetResult so that it
> won't return a "copy in progress" status if the connection is known
> dead.

Agreed.

> The error recovery behavior in pqSendSome has been like this since 8.1
> or thereabouts, so I'm inclined to push something like the attached into
> all branches.

Looks good to me, thanks for running this down!

	Stephen
On Tue, Feb 11, 2014 at 03:43:08PM -0500, Tom Lane wrote:
> While looking at the pending patch to make psql report a line count
> after COPY, I came across this business in handleCopyOut():
>
>  * Check command status and return to normal libpq state.  After a
>  * client-side error, the server will remain ready to deliver data.  The
>  * cleanest thing is to fully drain and discard that data.  If the
>  * client-side error happened early in a large file, this takes a long
>  * time.  Instead, take advantage of the fact that PQexec() will silently
>  * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
>  * results of any commands following the COPY in a single command string.
>  * It also only works for protocol version 3.  XXX should we clean up
>  * using the slow way when the connection is using protocol version 2?
>
> which git blames on commit 08146775 (committed by Alvaro on behalf of
> Noah).
>
> This does not make me happy.  In the first place, we have not dropped
> support for protocol version 2.

As you may realize, that commit helped protocol version 3 most but also
strictly improved things for protocol version 2.  I didn't feel the need
to improve them both to the same extent, a judgment that still seems
reasonable.

> In the second place, I fail to see
> what the advantage is of kluging things like this.  The main costs of
> draining the remaining COPY data are the server-side work of generating
> the data and the network transmission costs, neither of which will go
> away with this technique.

Agreed.  From your commit b8f00a4 of 2014-02-13:

> +	 * If for some reason libpq is still reporting PGRES_COPY_OUT state, we
> +	 * would like to forcibly exit that state, since our caller would be
> +	 * unable to distinguish that situation from reaching the next COPY in a
> +	 * command string that happened to contain two consecutive COPY TO STDOUT
> +	 * commands.  However, libpq provides no API for doing that, and in
> +	 * principle it's a libpq bug anyway if PQgetCopyData() returns -1 or -2
> +	 * but hasn't exited COPY_OUT state internally.  So we ignore the
> +	 * possibility here.
> 	 */

PQgetCopyData() returns -2 without exiting COPY_OUT when its malloc()
call fails.  My submission of the patch that became commit 08146775
included a test procedure for getting an infinite loop in
handleCopyOut(); that commit fixed the infinite loop.  After commit
b8f00a4, the same test procedure once again initiates an infinite loop.
To a large extent, that's bad luck on the part of commit b8f00a4.  I bet
malloc() failure in pqCheckInBufferSpace() could initiate an infinite
loop even with the longstanding 9.2/9.3 code.

> So I'm thinking we should revert this kluge
> and just drain the data straightforwardly, which would also eliminate
> the mentioned edge-case misbehavior when there were more commands in
> the query string.
>
> Is there a reason I'm overlooking not to do this?

The moral of the infinite loop test case story is that draining the data
straightforwardly has strictly more ways to fail.  Consequently, commit
b8f00a4 improved some scenarios while regressing others.  Overall, I do
think it was an improvement.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com