Thread: Unhappy with error handling in psql's handleCopyOut()
While looking at the pending patch to make psql report a line count
after COPY, I came across this business in handleCopyOut():

 * Check command status and return to normal libpq state.  After a
 * client-side error, the server will remain ready to deliver data.  The
 * cleanest thing is to fully drain and discard that data.  If the
 * client-side error happened early in a large file, this takes a long
 * time.  Instead, take advantage of the fact that PQexec() will silently
 * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
 * results of any commands following the COPY in a single command string.
 * It also only works for protocol version 3.  XXX should we clean up
 * using the slow way when the connection is using protocol version 2?

which git blames on commit 08146775 (committed by Alvaro on behalf of
Noah).

This does not make me happy.  In the first place, we have not dropped
support for protocol version 2.  In the second place, I fail to see
what the advantage is of kluging things like this.  The main costs of
draining the remaining COPY data are the server-side work of generating
the data and the network transmission costs, neither of which will go
away with this technique.  So I'm thinking we should revert this kluge
and just drain the data straightforwardly, which would also eliminate
the mentioned edge-case misbehavior when there were more commands in
the query string.

Is there a reason I'm overlooking not to do this?

			regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> While looking at the pending patch to make psql report a line count
> after COPY, I came across this business in handleCopyOut():
>
>  * Check command status and return to normal libpq state.  After a
>  * client-side error, the server will remain ready to deliver data.  The
>  * cleanest thing is to fully drain and discard that data.  If the
>  * client-side error happened early in a large file, this takes a long
>  * time.  Instead, take advantage of the fact that PQexec() will silently
>  * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
>  * results of any commands following the COPY in a single command string.
>  * It also only works for protocol version 3.  XXX should we clean up
>  * using the slow way when the connection is using protocol version 2?
>
> which git blames on commit 08146775 (committed by Alvaro on behalf of
> Noah).
>
> This does not make me happy.  In the first place, we have not dropped
> support for protocol version 2.  In the second place, I fail to see
> what the advantage is of kluging things like this.  The main costs of
> draining the remaining COPY data are the server-side work of generating
> the data and the network transmission costs, neither of which will go
> away with this technique.  So I'm thinking we should revert this kluge
> and just drain the data straightforwardly, which would also eliminate
> the mentioned edge-case misbehavior when there were more commands in
> the query string.
>
> Is there a reason I'm overlooking not to do this?

I've not gotten back to it yet, but I ran into a related-seeming issue
where psql would happily chew up 2G of memory trying to send "COPY
failed" notices when it gets disconnected from a server that it's trying
to send data to mid-COPY.
conn->sock was -1, connection was 'CONNECTION_BAD', but the loop towards
the end of handleCopyIn doesn't care and nothing in libpq is changing
PQresultStatus():

    /*
     * Check command status and return to normal libpq state
     *
     * We must not ever return with the status still PGRES_COPY_IN.  Our
     * caller is unable to distinguish that situation from reaching the next
     * COPY in a command string that happened to contain two consecutive COPY
     * FROM STDIN commands.  XXX if something makes PQputCopyEnd() fail
     * indefinitely while retaining status PGRES_COPY_IN, we get an infinite
     * loop.  This is more realistic than handleCopyOut()'s counterpart risk.
     */
    while (res = PQgetResult(conn), PQresultStatus(res) == PGRES_COPY_IN)
    {
        OK = false;
        PQclear(res);
        PQputCopyEnd(pset.db, _("trying to exit copy mode"));
    }

And so it would just keep looping, first building up the 2G of 'copy
failed' notices from the PQputCopyEnd, and then just spinning forever
because it couldn't drain the queue.

	Thanks,

		Stephen
Stephen Frost <sfrost@snowman.net> writes:
> I've not gotten back to it yet, but I ran into a related-seeming issue
> where psql would happily chew up 2G of memory trying to send "COPY
> failed" notices when it gets disconnected from a server that it's trying
> to send data to mid-COPY.  conn->sock was -1, connection was
> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
> care and nothing in libpq is changing PQresultStatus():

[ scratches head... ]  Offhand I'd have expected PQgetResult to start
returning error indications.  It definitely will do that if it gets
error indications from the I/O layer.  Perhaps it didn't see any
because asyncStatus had already been reset from PGASYNC_BUSY?

If so, maybe we need an explicit test on the connection being good before
we return valid PGRES_COPY_IN etc results.

			regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I've not gotten back to it yet, but I ran into a related-seeming issue
> > where psql would happily chew up 2G of memory trying to send "COPY
> > failed" notices when it gets disconnected from a server that it's trying
> > to send data to mid-COPY.  conn->sock was -1, connection was
> > 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
> > care and nothing in libpq is changing PQresultStatus():
>
> [ scratches head... ]  Offhand I'd have expected PQgetResult to start
> returning error indications.  It definitely will do that if it gets
> error indications from the I/O layer.  Perhaps it didn't see any
> because asyncStatus had already been reset from PGASYNC_BUSY?

The I/O layer was definitely returning errors, I traced through that
using gdb (tho I don't have it open any more, had to move on to other
things; been planning to reproduce it).  Here's the full p *conn:

$9 = {pghost = 0x7fa742753300 "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", pghostaddr = 0x0,
  pgport = 0x7fa742753340 "XXXXX", pgunixsocket = 0x0, pgtty = 0x7fa742753360 "",
  connect_timeout = 0x0, client_encoding_initial = 0x0, pgoptions = 0x7fa742753380 "",
  appname = 0x0, fbappname = 0x7fa7427533a0 "psql", dbName = 0x7fa7427532e0 "XXXX",
  replication = 0x0, pguser = 0x7fa7427532c0 "XXXXXXXXXX",
  pgpass = 0x7fa742755070 "XXXXXXXXXXX", keepalives = 0x0, keepalives_idle = 0x0,
  keepalives_interval = 0x0, keepalives_count = 0x0, sslmode = 0x7fa7427533c0 "prefer",
  sslcompression = 0x7fa7427533e0 "1", sslkey = 0x0, sslcert = 0x0, sslrootcert = 0x0,
  sslcrl = 0x0, requirepeer = 0x0, krbsrvname = 0x7fa742753400 "postgres", Pfdebug = 0x0,
  noticeHooks = {noticeRec = 0x7fa73fed3ec0 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x7fa740572f20 <NoticeProcessor>, noticeProcArg = 0x0},
  events = 0x0, nEvents = 0, eventArraySize = 0, status = CONNECTION_BAD,
  asyncStatus = PGASYNC_COPY_IN, xactStatus = PQTRANS_INTRANS, queryclass = PGQUERY_SIMPLE,
  last_query = 0x7fa7427809f0 "COPY XXXXXXXXXXXXXXXXXXXXXXXXXX (XXXXXX, XXXXXX, XXXXXX, XXXXXX, XXXXXX) FROM stdin;",
  last_sqlstate = "\000\000\000\000\000", options_valid = 1 '\001', nonblocking = 0 '\000',
  singleRowMode = 0 '\000', copy_is_binary = 0 '\000', copy_already_done = 0,
  notifyHead = 0x0, notifyTail = 0x0, sock = -1,
  laddr = {addr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 16},
  raddr = {addr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 16},
  pversion = 196608, sversion = 90302, auth_req_received = 1 '\001',
  password_needed = 1 '\001', dot_pgpass_used = 1 '\001', sigpipe_so = 0 '\000',
  sigpipe_flag = 0 '\000', addrlist = 0x0, addr_cur = 0x0, addrlist_family = 0,
  setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, send_appname = 1 '\001',
  be_pid = 25857, be_key = 393001970, md5Salt = "XXXXXX", pstatus = 0x7fa742788630,
  client_encoding = 6, std_strings = 1 '\001', verbosity = PQERRORS_DEFAULT,
  lobjfuncs = 0x0, inBuffer = 0x7fa74274a740 "G", inBufSize = 16384, inStart = 0,
  inCursor = 0, inEnd = 0,
  outBuffer = 0x7fa63a495010 "00229878919\t1", '0' <repeats 12 times>, "13462\tsupplemental\tt\nd",
  outBufSize = 2147475456, outCount = 2147475445, outMsgStart = 2147475446,
  outMsgEnd = 2147475450, rowBuf = 0x7fa742752760, rowBufLen = 32, result = 0x0,
  next_result = 0x0, allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\000', ssl = 0x0,
  peer = 0x0, engine = 0x0, gctx = 0x0, gtarg_nam = 0x0,
  ginbuf = {length = 0, value = 0x0}, goutbuf = {length = 0, value = 0x0},
  errorMessage = {data = 0x7fa742752970 "cannot allocate memory for output buffer\n", len = 41, maxlen = 256},
  workBuffer = {data = 0x7fa742752a80 "SAVEPOINT", len = 9, maxlen = 256}}

You can see asyncStatus is still PGASYNC_COPY_IN and also how outBufSize
grew up to the 2G mark, along with the 'cannot allocate' error.
The p *res looked like:

$2 = {ntups = 0, numAttributes = 0, attDescs = 0x0, tuples = 0x0, tupArrSize = 0,
  numParameters = 0, paramDescs = 0x0, resultStatus = PGRES_COPY_IN,
  cmdStatus = "\000\177\000\000@\000\000\000\000\000\000\000!\000\000\000\000\000\000\000\210ף?\247\177\000\000\000BxB\247\177\000\000`\000\000\000\000\000\000\000\060\000\000\000\000\000\000\000\300dxB\247\177\000\000\000\000\000",
  binary = 0,
  noticeHooks = {noticeRec = 0x7fa73fed3ec0 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x7fa740572f20 <NoticeProcessor>, noticeProcArg = 0x0},
  events = 0x0, nEvents = 0, client_encoding = 6, errMsg = 0x0, errFields = 0x0,
  null_field = "", curBlock = 0x0, curOffset = 0, spaceLeft = 0}

> If so, maybe we need an explicit test on the connection being good before
> we return valid PGRES_COPY_IN etc results.

Right, PQresultStatus() just checks if the PGresult is non-NULL, it's not
doing any checking of the connection state itself.  I had been trying to
work out if something else should have been responsible for changing
asyncStatus away from PGASYNC_COPY_IN when the connection failure is
detected, but having PQresultStatus() check the connection status seems
like it would work.

	Thanks,

		Stephen
I wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> I've not gotten back to it yet, but I ran into a related-seeming issue
>> where psql would happily chew up 2G of memory trying to send "COPY
>> failed" notices when it gets disconnected from a server that it's trying
>> to send data to mid-COPY.  conn->sock was -1, connection was
>> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
>> care and nothing in libpq is changing PQresultStatus():

> [ scratches head... ]  Offhand I'd have expected PQgetResult to start
> returning error indications.

After some study of the code I have a theory about this.  Note that we
don't close the socket on send failures anymore; if the status is
CONNECTION_BAD, that pretty much has to have been done by pqReadData.
Which wouldn't be called in a PQputCopyData loop, as long as everything
was going normally.

Ordinarily, while we're pumping data to the backend, we'll flush libpq's
output buffer every 8K (see pqPutMsgEnd).  Suppose that, in one of those
calls to pqSendSome, we fail to send everything in the buffer but send()
doesn't report a hard error --- maybe it says EINTR, or maybe it just
returns zero.  Since there's still data to send, now pqSendSome will call
pqReadData.  Assume that the read attempt *does* give a hard error;
whereupon pqReadData closes the socket, sets CONNECTION_BAD, and returns
-1.  pqSendSome falls out, leaving data still in the output buffer, and
then we'll return -1 to handleCopyIn, which abandons its data sending
loop and calls PQputCopyEnd.  But there's still >= 8K in the buffer, so
when PQputCopyEnd calls pqPutMsgEnd, the latter tries to send it.  And
fails.  (What's more, pqSendSome doesn't clear the output buffer when it
fails at the top because the socket's gone.)  This results in
PQputCopyEnd exiting without having changed the asyncStatus back to
ASYNC_BUSY from ASYNC_COPY_IN.
Now PQgetResult will happily return a COPY_IN result, so we loop, and the
whole thing repeats, with no state change except we keep adding data to
the buffer.

I think we need to change pqSendSome to reset outCount to zero if it
exits at the top due to no socket.  It would do that if it got a hard
error from pqsecure_write, so why's it not doing so if there's not even a
socket?  This would avoid the problem of bloating the output buffer while
we continue to queue data that can never be sent.  (It might be that some
of the other error situations in pqSendSome should also abandon unwritten
data, but that's less clear.)

Another thought is that maybe PQputCopyEnd should change asyncStatus even
if it fails to queue the copy-end message.  I think the reason why it's
coded like it is is the idea that if we've not queued the message, then
the backend still thinks the copy is active, so we should not change
state.  However, if we've already closed the socket then this is pretty
much moot, so perhaps there needs to be an exception allowing the state
change in that case.  Also, if the output buffer is already over 8K,
we're failing to distinguish "couldn't queue the message" from "couldn't
send the message"; but I'm not sure it's worth refactoring to detect that
case.

Alternatively, we could do what I suggested earlier and change
PQgetResult to not return an ordinary COPY_IN or COPY_OUT status if the
connection is known dead.  This seems probably more robust than tinkering
at the margins in PQputCopyEnd.

			regards, tom lane
I wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> I've not gotten back to it yet, but I ran into a related-seeming issue
>>> where psql would happily chew up 2G of memory trying to send "COPY
>>> failed" notices when it gets disconnected from a server that it's trying
>>> to send data to mid-COPY.  conn->sock was -1, connection was
>>> 'CONNECTION_BAD', but the loop towards the end of handleCopyIn doesn't
>>> care and nothing in libpq is changing PQresultStatus():

> After some study of the code I have a theory about this.

I was able to reproduce this misbehavior by setting a gdb breakpoint
at pqReadData and then killing the connected server process while psql's
COPY IN was stopped there.  Resetting outCount to zero in the
socket-already-gone case in pqSendSome is enough to fix the problem.
However, I think it's also prudent to hack PQgetResult so that it
won't return a "copy in progress" status if the connection is known
dead.

The error recovery behavior in pqSendSome has been like this since 8.1
or thereabouts, so I'm inclined to push something like the attached into
all branches.

			regards, tom lane

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 5f371b4..764e90c 100644
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
*************** static int PQsendQueryGuts(PGconn *conn,
*** 63,68 ****
--- 63,69 ----
  				 const int *paramFormats, int resultFormat);
  static void parseInput(PGconn *conn);
+ static PGresult *getCopyResult(PGconn *conn, ExecStatusType copytype);
  static bool PQexecStart(PGconn *conn);
  static PGresult *PQexecFinish(PGconn *conn);
  static int PQsendDescribe(PGconn *conn, char desc_type,
*************** PQgetResult(PGconn *conn)
*** 1734,1755 ****
  			conn->asyncStatus = PGASYNC_BUSY;
  			break;
  		case PGASYNC_COPY_IN:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_IN)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_IN);
  			break;
  		case PGASYNC_COPY_OUT:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_OUT)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_OUT);
  			break;
  		case PGASYNC_COPY_BOTH:
! 			if (conn->result && conn->result->resultStatus == PGRES_COPY_BOTH)
! 				res = pqPrepareAsyncResult(conn);
! 			else
! 				res = PQmakeEmptyPGresult(conn, PGRES_COPY_BOTH);
  			break;
  		default:
  			printfPQExpBuffer(&conn->errorMessage,
--- 1735,1747 ----
  			conn->asyncStatus = PGASYNC_BUSY;
  			break;
  		case PGASYNC_COPY_IN:
! 			res = getCopyResult(conn, PGRES_COPY_IN);
  			break;
  		case PGASYNC_COPY_OUT:
! 			res = getCopyResult(conn, PGRES_COPY_OUT);
  			break;
  		case PGASYNC_COPY_BOTH:
! 			res = getCopyResult(conn, PGRES_COPY_BOTH);
  			break;
  		default:
  			printfPQExpBuffer(&conn->errorMessage,
*************** PQgetResult(PGconn *conn)
*** 1786,1791 ****
--- 1778,1813 ----
  	return res;
  }

+ /*
+  * getCopyResult
+  *	  Helper for PQgetResult: generate result for COPY-in-progress cases
+  */
+ static PGresult *
+ getCopyResult(PGconn *conn, ExecStatusType copytype)
+ {
+ 	/*
+ 	 * If the server connection has been lost, don't pretend everything is
+ 	 * hunky-dory; instead return a PGRES_FATAL_ERROR result, and reset the
+ 	 * asyncStatus to idle (corresponding to what we'd do if we'd detected I/O
+ 	 * error in the earlier steps in PQgetResult).  The text returned in the
+ 	 * result is whatever is in conn->errorMessage; we expect that was filled
+ 	 * with something useful when the lost connection was detected.
+ 	 */
+ 	if (conn->status != CONNECTION_OK)
+ 	{
+ 		pqSaveErrorResult(conn);
+ 		conn->asyncStatus = PGASYNC_IDLE;
+ 		return pqPrepareAsyncResult(conn);
+ 	}
+ 
+ 	/* If we have an async result for the COPY, return that */
+ 	if (conn->result && conn->result->resultStatus == copytype)
+ 		return pqPrepareAsyncResult(conn);
+ 
+ 	/* Otherwise, invent a suitable PGresult */
+ 	return PQmakeEmptyPGresult(conn, copytype);
+ }
+ 
  /*
   * PQexec
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 1eb3ac6..a7afd42 100644
*** a/src/interfaces/libpq/fe-misc.c
--- b/src/interfaces/libpq/fe-misc.c
*************** pqSendSome(PGconn *conn, int len)
*** 804,809 ****
--- 804,811 ----
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("connection not open\n"));
+ 		/* Discard queued data; no chance it'll ever be sent */
+ 		conn->outCount = 0;
  		return -1;
  	}
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I was able to reproduce this misbehavior by setting a gdb breakpoint
> at pqReadData and then killing the connected server process while psql's
> COPY IN was stopped there.  Resetting outCount to zero in the
> socket-already-gone case in pqSendSome is enough to fix the problem.
> However, I think it's also prudent to hack PQgetResult so that it
> won't return a "copy in progress" status if the connection is known
> dead.

Agreed.

> The error recovery behavior in pqSendSome has been like this since 8.1
> or thereabouts, so I'm inclined to push something like the attached into
> all branches.

Looks good to me, thanks for running this down!

	Stephen
On Tue, Feb 11, 2014 at 03:43:08PM -0500, Tom Lane wrote:
> While looking at the pending patch to make psql report a line count
> after COPY, I came across this business in handleCopyOut():
>
>  * Check command status and return to normal libpq state.  After a
>  * client-side error, the server will remain ready to deliver data.  The
>  * cleanest thing is to fully drain and discard that data.  If the
>  * client-side error happened early in a large file, this takes a long
>  * time.  Instead, take advantage of the fact that PQexec() will silently
>  * end any ongoing PGRES_COPY_OUT state.  This does cause us to lose the
>  * results of any commands following the COPY in a single command string.
>  * It also only works for protocol version 3.  XXX should we clean up
>  * using the slow way when the connection is using protocol version 2?
>
> which git blames on commit 08146775 (committed by Alvaro on behalf of
> Noah).
>
> This does not make me happy.  In the first place, we have not dropped
> support for protocol version 2.

As you may realize, that commit helped protocol version 3 most but also
strictly improved things for protocol version 2.  I didn't feel the need
to improve them both to the same extent, a judgment that still seems
reasonable.

> In the second place, I fail to see
> what the advantage is of kluging things like this.  The main costs of
> draining the remaining COPY data are the server-side work of generating
> the data and the network transmission costs, neither of which will go
> away with this technique.

Agreed.  From your commit b8f00a4 of 2014-02-13:

> +	 * If for some reason libpq is still reporting PGRES_COPY_OUT state, we
> +	 * would like to forcibly exit that state, since our caller would be
> +	 * unable to distinguish that situation from reaching the next COPY in a
> +	 * command string that happened to contain two consecutive COPY TO STDOUT
> +	 * commands.  However, libpq provides no API for doing that, and in
> +	 * principle it's a libpq bug anyway if PQgetCopyData() returns -1 or -2
> +	 * but hasn't exited COPY_OUT state internally.  So we ignore the
> +	 * possibility here.
> 	 */

PQgetCopyData() returns -2 without exiting COPY_OUT when its malloc()
call fails.  My submission of the patch that became commit 08146775
included a test procedure for getting an infinite loop in
handleCopyOut(); that commit fixed the infinite loop.  After commit
b8f00a4, the same test procedure once again initiates an infinite loop.
To a large extent, that's bad luck on the part of commit b8f00a4.  I bet
malloc() failure in pqCheckInBufferSpace() could initiate an infinite
loop even with the longstanding 9.2/9.3 code.

> So I'm thinking we should revert this kluge
> and just drain the data straightforwardly, which would also eliminate
> the mentioned edge-case misbehavior when there were more commands in
> the query string.
>
> Is there a reason I'm overlooking not to do this?

The moral of the infinite loop test case story is that draining the data
straightforwardly has strictly more ways to fail.  Consequently, commit
b8f00a4 improved some scenarios while regressing others.  Overall, I do
think it was an improvement.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com