Thread: PQexec() hangs on OOM
Alex Shulgin's find of a missing NULL check after strdup() (87tx1or3cc.fsf@commandprompt.com) prompted me to do some testing of libpq behavior when malloc/strdup returns NULL. To simulate running out of memory, I wrote a little LD_PRELOAD library that makes malloc() return NULL after a particular number of calls, and ran src/test/examples/testlibpq with that hack.

After fixing all the missing NULL checks, I found that testlibpq sometimes just hangs. It happens when this malloc() call in PQmakeEmptyPGresult() fails:

#3  0x00007f6dc86495c0 in malloc () from /home/heikki/mallocfail.so
#4  0x00007f6dc8423b6e in PQmakeEmptyPGresult (conn=0x1bea040, status=PGRES_COMMAND_OK) at fe-exec.c:144
#5  0x00007f6dc8430b27 in pqParseInput3 (conn=0x1bea040) at fe-protocol3.c:204
#6  0x00007f6dc8426468 in parseInput (conn=0x1bea040) at fe-exec.c:1652
#7  0x00007f6dc8426583 in PQgetResult (conn=0x1bea040) at fe-exec.c:1727
#8  0x00007f6dc8426c76 in PQexecFinish (conn=0x1bea040) at fe-exec.c:2000
#9  0x00007f6dc84268ad in PQexec (conn=0x1bea040, query=0x400e32 "BEGIN") at fe-exec.c:1834
#10 0x0000000000400b18 in main (argc=1, argv=0x7fffc9b6a568) at testlibpq.c:59

When that malloc() returns NULL, parseInput returns without reading any input. PQgetResult() takes that as a sign that it needs to read more input from the server before calling parseInput() again, and that read never returns because there is no more data coming from the server.

I don't have any immediate plans to fix this, or to continue testing this. There might well be more cases like this. Patches are welcome.

Attached is the little wrapper library I used to test this. testlibpq hangs when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry about that. I'm sure there are more sophisticated tools for this kind of testing out there somewhere.

- Heikki
Attachment
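For readers without the attachment, a minimal version of such an LD_PRELOAD wrapper might look like the sketch below. This is a reconstruction from the description above, not Heikki's actual code; the only details given in the thread are the MALLOC_FAIL_AT environment variable and the call counter, so everything else here is an assumption.

```c
/* mallocfail.c - hypothetical sketch of the wrapper described above.
 * Build and run roughly as:
 *   gcc -shared -fPIC -o mallocfail.so mallocfail.c -ldl
 *   LD_PRELOAD=./mallocfail.so MALLOC_FAIL_AT=110 ./testlibpq
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

static void *(*real_malloc) (size_t) = NULL;
static long fail_at = -1;       /* which malloc() call should fail; -1 = never */
static long ncalls = 0;         /* number of malloc() calls seen so far */

void *
malloc(size_t size)
{
    if (real_malloc == NULL)
    {
        /* Lazy one-time setup; glibc's dlsym() and getenv() do not call
         * malloc() here, so this does not recurse in practice. */
        const char *s = getenv("MALLOC_FAIL_AT");

        real_malloc = (void *(*) (size_t)) dlsym(RTLD_NEXT, "malloc");
        if (s != NULL)
            fail_at = atol(s);
    }
    if (fail_at >= 0 && ++ncalls == fail_at)
        return NULL;            /* simulate OOM on the Nth call */
    return real_malloc(size);
}
```

Varying MALLOC_FAIL_AT then sweeps the failure through every allocation the client makes, which is how the hangs below were found.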
On Tue, Nov 25, 2014 at 10:15 PM, Heikki Linnakangas wrote:
> When that malloc() returns NULL, parseInput returns without reading any
> input. PQgetResult() takes that as a sign that it needs to read more input
> from the server, before calling parseInput() again, and that read never
> returns because there is no more data coming from the server.
>
> I don't have any immediate plans to fix this, or to continue testing this.
> There might well be more cases like this. Patches are welcome.
>
> Attached is the little wrapper library I used to test this. testlibpq hangs
> when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry about
> that. I'm sure there are more sophisticated tools to do similar testing out
> there somewhere..

With MALLOC_FAIL_AT=84, 86, or 92, the backtrace just before the malloc causing the OOM looks like this:

#0  0x00007f76316964d0 in __poll_nocancel () from /usr/lib/libc.so.6
#1  0x00007f7631971577 in pqSocketPoll (sock=4, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1133
#2  0x00007f7631971461 in pqSocketCheck (conn=0x1495040, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#3  0x00007f76319712f8 in pqWaitTimed (forRead=1, forWrite=0, conn=0x1495040, finish_time=-1) at fe-misc.c:1007
#4  0x00007f76319712ca in pqWait (forRead=1, forWrite=0, conn=0x1495040) at fe-misc.c:990
#5  0x00007f763196d21d in PQgetResult (conn=0x1495040) at fe-exec.c:1711
#6  0x00007f763196d913 in PQexecFinish (conn=0x1495040) at fe-exec.c:1997
#7  0x00007f763196d576 in PQexec (conn=0x1495040, query=0x400ef2 "BEGIN") at fe-exec.c:1831
#8  0x0000000000400bd8 in main (argc=1, argv=0x7ffd8a2644c8) at testlibpq.c:5

In this case, since what happens is an OOM during the allocation of a PGresult, I think that we had better store a status flag in PGconn related to this OOM - PGconn->errorMessage may not be empty - to take care of the ambiguity that PGresult == NULL does not necessarily mean "wait for more results".
Something like PGResultStatus, to avoid any API incompatibility. Thoughts?

Looking at the other malloc() calls in libpq, we do not really have this ambiguity. For example, makeEmptyPGconn() == NULL means OOM. I am guessing from the code as well that PQmakeEmptyPGresult() == NULL means OOM, so the error handling problem comes from parseInput and its underlings. Also, in pqSaveParameterStatus, shouldn't we have better OOM handling for pstatus as well?

Regards,
--
Michael
On Mon, Apr 6, 2015 at 3:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> In this case, as what happens is an OOM related to the allocation of
> PGresult, I think that we had better store a status flag in PGconn related
> to this OOM, as PGconn->errorMessage may not be empty, to take care of the
> ambiguity that PGresult == NULL does not necessarily mean wait for more
> results. Something like PGResultStatus to avoid any API incompatibility.
> Thoughts?

Second idea: return a status from parseInput, as it is not part of libpq's exposed API.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> Second idea: return a status from parseInput, as it is not part of libpq's
> exposed API.

Yeah; most subroutines in libpq have a zero-or-EOF return convention, so we can make parseInput do likewise. I'm not sure if that'd need to propagate further down, though ...

			regards, tom lane
On Mon, Apr 6, 2015 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Second idea: return a status from parseInput, as it is not part of libpq's
>> exposed API.
>
> Yeah; most subroutines in libpq have a zero-or-EOF return convention,
> so we can make parseInput do likewise. I'm not sure if that'd need to
> propagate further down, though ...

So, I have been looking at this in more detail, and ended up with the attached patch, which makes the problem go away for me with a nice "out of memory" error. I have extended parseInput() to return a status code that decides whether parsing should stop or continue. Note that I have tried to be careful to make a clear distinction between the cases where routines return EOF because not enough data has been parsed and actual OOMs. I have noticed as well that getCopyStart() in fe-protocol3.c needs to be made a little bit smarter to distinguish an OOM from the not-enough-data type of problem.

This problem has a low probability of happening in the field, and nobody has complained about it either, so I think that patching only HEAD is appropriate.

Regards,
--
Michael
Attachment
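The zero-or-EOF convention Tom describes, extended with a distinct OOM code as in this patch, can be illustrated with a toy parser. The names and the framing below are invented for illustration only; the -1/-2 values match the convention discussed later in the thread, but the actual patch is the attachment.

```c
#include <stdlib.h>
#include <string.h>

#define PARSE_OK          0     /* message fully parsed */
#define PARSE_NEED_DATA  (-1)   /* classic EOF: wait for more input */
#define PARSE_OOM        (-2)   /* malloc failed: more data will NOT help */

/* Toy "subroutine": needs at least 5 bytes, then allocates a copy.
 * The point is that its two failure modes must be distinguishable:
 * only PARSE_NEED_DATA should make the caller wait on the socket. */
static int
parse_message(const char *buf, size_t avail, char **out, int simulate_oom)
{
    if (avail < 5)
        return PARSE_NEED_DATA;
    *out = simulate_oom ? NULL : malloc(avail + 1);
    if (*out == NULL)
        return PARSE_OOM;
    memcpy(*out, buf, avail);
    (*out)[avail] = '\0';
    return PARSE_OK;
}
```

The bug being fixed is precisely that libpq collapsed PARSE_OOM into PARSE_NEED_DATA, sending the caller into a read that never returns.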
On 04/07/2015 09:18 AM, Michael Paquier wrote:
> @@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
>
>  	result = (PGresult *) malloc(sizeof(PGresult));
>  	if (!result)
> +	{
> +		printfPQExpBuffer(&conn->errorMessage,
> +						  libpq_gettext("out of memory\n"));
>  		return NULL;
> +	}
>
>  	result->ntups = 0;
>  	result->numAttributes = 0;

That's not cool, because PQmakeEmptyPGresult may be called with conn == NULL. In general, I'm a bit wary of changing PQmakeEmptyPGresult so that it sets the error message. Will have to check all the callers carefully to see if that would upset any of them. And it might be called by applications too.

> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
> be made a little bit smarter to make the difference between an OOM and
> the not-enough-data type of problem.

Yeah. getParamDescriptions still has the issue, with your patch.

> This problem has a low probability of happening in the field, and nobody
> has complained about it either, so I think that patching only HEAD is
> appropriate.

As long as the patch applies easily, I don't see much reason not to backpatch.

- Heikki
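The conn == NULL hazard Heikki points out can be modeled in a self-contained way. All names below are invented stand-ins, not libpq's actual types; the point is only that the OOM-reporting path must guard the conn dereference, because the documented contract allows a NULL conn.

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins for PGconn/PGresult, just to model the contract. */
typedef struct
{
    char        errorMessage[256];
} FakeConn;

typedef struct
{
    int         ntups;
} FakeResult;

/* Like PQmakeEmptyPGresult(), conn may legitimately be NULL here, so
 * the error-reporting path must not dereference it unconditionally. */
static FakeResult *
make_empty_result(FakeConn *conn, int simulate_oom)
{
    FakeResult *result = simulate_oom ? NULL : malloc(sizeof(FakeResult));

    if (result == NULL)
    {
        if (conn != NULL)       /* the guard missing from the first patch */
            snprintf(conn->errorMessage, sizeof(conn->errorMessage),
                     "out of memory\n");
        return NULL;
    }
    result->ntups = 0;
    return result;
}
```

Without the `conn != NULL` check, an application calling the function with a NULL conn would crash on the very OOM path that is supposed to report the error gracefully.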
On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:
> On 04/07/2015 09:18 AM, Michael Paquier wrote:
>> @@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
>> [...]
>
> That's not cool, because PQmakeEmptyPGresult may be called with conn ==
> NULL. In general, I'm a bit wary of changing PQmakeEmptyPGresult so that it
> sets the error message. Will have to check all the callers carefully to see
> if that would upset any of them. And it might be called by applications too.

Oops, my mistake. For a patch arguing not to change how libpq routines behave, this is bad, and contrary to what is mentioned in the docs as well. I moved the error message into parseInput; I guess it makes more sense there.

>> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
>> be made a little bit smarter to make the difference between an OOM and
>> the not-enough-data type of problem.
>
> Yeah. getParamDescriptions still has the issue, with your patch.

Check.

>> This problem has a low probability of happening in the field, and nobody
>> has complained about it either, so I think that patching only HEAD is
>> appropriate.
>
> As long as the patch applies easily, I don't see much reason not to
> backpatch.

It applies cleanly down to 9.1. For 9.0, some minor tweaks are apparently needed. In any case, take two is attached.

Regards,
--
Michael
Attachment
On 04/08/2015 07:27 AM, Michael Paquier wrote:
> On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:
>> On 04/07/2015 09:18 AM, Michael Paquier wrote:
>>> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
>>> be made a little bit smarter to make the difference between an OOM and
>>> the not-enough-data type of problem.
>>
>> Yeah. getParamDescriptions still has the issue, with your patch.
>
> Check.

There are still a few parseInput subroutines that have similar issues. In fe-protocol2.c:

* pqGetErrorNotice2 returns EOF if there is not enough data, but also on OOM. The caller assumes it's because there is not enough data.
* getRowDescriptions() is the same, although it sets conn->errorMessage on OOM.
* getAnotherTuple() is the same, but it also skips over the received data, so you won't get stuck. So maybe that's OK.
* getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

The corresponding functions in fe-protocol3.c are pretty much identical, with the same issues. In addition:

* getParameterStatus will act as if the parameter value was "out of memory". That'll be fun for something like client_encoding or standard_conforming_strings. It would be good to use a small char[100] variable on the stack if the value fits, and only use malloc for larger values. That would cover all the common variables that need to be machine-parsed.

- Heikki
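Heikki's small-stack-buffer suggestion would look roughly like the sketch below. The function name and signature are invented for illustration; the idea is just that short, machine-parsed values such as client_encoding never need malloc at all, so they survive even under memory pressure.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy a parameter value, using the stack for short values so that the
 * common settings can be processed even when malloc would fail. */
static int
handle_parameter_status(const char *name, const char *value)
{
    char        sbuf[100];
    char       *buf;
    size_t      len = strlen(value);

    if (len < sizeof(sbuf))
        buf = sbuf;             /* fits on the stack: no allocation */
    else
    {
        buf = malloc(len + 1);
        if (buf == NULL)
            return -1;          /* only unusually long values can fail */
    }
    memcpy(buf, value, len + 1);
    /* ... machine-parse buf here, e.g. for client_encoding ... */
    printf("%s = %s\n", name, buf);
    if (buf != sbuf)
        free(buf);
    return 0;
}
```

With this shape, an OOM can only affect values longer than the stack buffer, which none of the machine-parsed GUCs are.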
On Fri, Apr 10, 2015 at 2:42 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> There are still a few parseInput subroutines that have similar issues. In
> fe-protocol2.c:
>
> * pqGetErrorNotice2 returns EOF if there is not enough data, but also on
> OOM. The caller assumes it's because there is not enough data.
> * getRowDescriptions() is the same, although it sets conn->errorMessage on
> OOM.

OK, I extended those two with the same logic as previously.

> * getAnotherTuple() is the same, but it also skips over the received data,
> so you won't get stuck. So maybe that's OK.

I think that it should be changed for consistency with the others, so I have done so for this function; this way we only need to bail out if status == -2, a status of -1 meaning that there is not enough data.

> * getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

This one is different, though; it directly skips the messages.

> The corresponding functions in fe-protocol3.c are pretty much identical,
> with the same issues. In addition:
>
> * getParameterStatus will act as if the parameter value was "out of memory".
> That'll be fun for something like client_encoding or
> standard_conforming_strings. Would be good to use a small char[100]
> variable, on the stack, if it fits, and only use malloc for larger values.
> That would cover all the common variables that need to be machine-parsed.

Are you suggesting to replace PQExpBuffer?

So, attached is take three for all the other things above.

Regards,
--
Michael
Attachment
Michael Paquier <michael.paquier@gmail.com> writes:
> So, attached is take three for all the other things above.

There's still one call to pqGetErrorNotice3 that doesn't check the returned value for -2, in fe-connect.c.
Shouldn't we also check it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e7c7a25..330b8da 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2242,6 +2242,7 @@ keep_going:						/* We will come back to here until there is
 				int			msgLength;
 				int			avail;
 				AuthRequest areq;
+				int			res;
 
 				/*
 				 * Scan the message from current point (note that if we find
@@ -2366,11 +2367,18 @@ keep_going:						/* We will come back to here until there is
 				{
 					if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
 					{
-						if (pqGetErrorNotice3(conn, true))
+						res = pqGetErrorNotice3(conn, true);
+						if (res == -1)
 						{
 							/* We'll come back when there is more data */
 							return PGRES_POLLING_READING;
 						}
+						else if (res == -2)
+						{
+							printfPQExpBuffer(&conn->errorMessage,
+											  libpq_gettext("out of memory\n"));
+							goto error_return;
+						}
 					}
 					else
 					{

--
Alex
On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> So, attached is take three for all the other things above.
>
> There's still one call to pqGetErrorNotice3 that doesn't check the returned
> value for -2, in fe-connect.c. Shouldn't we also check it like this:
>
> [...]

Yes, you are right. Take four, attached, is updated with something similar to what you sent to cover this code path.
--
Michael
Attachment
On 05/26/2015 10:01 AM, Michael Paquier wrote:
> On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> So, attached is take three for all the other things above.
>>
>> There's still one call to pqGetErrorNotice3 that doesn't check the returned
>> value for -2, in fe-connect.c. Shouldn't we also check it like this:
>>
>> [...]
>
> Yes, you are right. Take four, attached, is updated with something similar
> to what you sent to cover this code path.

This is still a few bricks shy of a load. Before this patch, if you ran out of memory when allocating a large result set - which is probably the most common reason for OOM in libpq - you got this error:

postgres=# select generate_series(1, 10000000);
out of memory for query result

With this patch, I got:

postgres=# select generate_series(1, 10000000);
server sent data ("D" message) without prior row description ("T" message)

Looking at the patch again, I think we should actually leave getAnotherTuple() alone. It's a lot nicer if getAnotherTuple() can handle the OOM error itself than propagate it to the caller. There's only one caller of getAnotherTuple(), but for pqGetErrorNotice3() the same is even more true: it would be much nicer if it could handle OOM itself, without propagating it to the caller. And it well could do so. When it's processing an ERROR, it could just set conn->errorMessage to "out of memory" and discard the error it received from the server. When processing a NOTICE, it could likewise just send "out of memory" to the NOTICE processor and discard the message from the server. The real problem with pqGetErrorNotice3() today is that it treats OOM the same as "no data available yet", and we can fix that by reading but discarding the backend message, like getAnotherTuple() does.

In short, the idea of returning a status code from parseInput(), instead of just dealing with the error, was a bad one. Sorry I didn't have this epiphany earlier...
I came up with the attached. It fixes the few cases where we were currently returning "need more data" when OOM happened, making them deal with the OOM error instead by setting conn->errorMessage. How does this look to you?

- Heikki
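The read-but-discard approach Heikki describes can be modeled with a toy length-prefixed message consumer. The framing and names below are invented; the essential property, shared with getAnotherTuple(), is that the cursor advances past the message even when the allocation fails, so the caller never retries forever waiting for data that will not come.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invented framing: [length byte][payload]; err plays conn->errorMessage. */
typedef struct
{
    const char *buf;
    size_t      len;
    size_t      cursor;
    char        err[64];
} Conn;

/* Returns 1 when a message was consumed (successfully or as an OOM
 * error), 0 when more data is needed.  On OOM the cursor still advances
 * past the message, so "need more data" is never reported spuriously. */
static int
consume_message(Conn *conn, int simulate_oom, char **out)
{
    size_t      start = conn->cursor;
    size_t      mlen;
    char       *copy;

    if (conn->len - conn->cursor < 1)
        return 0;
    mlen = (unsigned char) conn->buf[conn->cursor];
    if (conn->len - conn->cursor < 1 + mlen)
        return 0;               /* whole message not here yet */
    copy = simulate_oom ? NULL : malloc(mlen + 1);
    conn->cursor += 1 + mlen;   /* skip the message either way */
    if (copy == NULL)
    {
        snprintf(conn->err, sizeof(conn->err), "out of memory");
        *out = NULL;
        return 1;               /* handled: reported as an error, not stuck */
    }
    memcpy(copy, conn->buf + start + 1, mlen);
    copy[mlen] = '\0';
    *out = copy;
    return 1;
}
```

This is the difference between the two return values' meanings staying honest: 0 really means "the bytes aren't here yet", and every other outcome, including OOM, terminates the wait loop.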
Attachment
On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:
> In short, the idea of returning a status code from parseInput(), instead of
> just dealing with the error, was a bad one. Sorry I didn't have this
> epiphany earlier...
>
> I came up with the attached. It fixes the few cases where we were currently
> returning "need more data" when OOM happened, to deal with the OOM error
> instead, by setting conn->errorMessage. How does this look to you?

So this relies on the fact that when errorMessage is set, subsequent messages are ignored, right? This looks neat.

At the bottom of getAnotherTuple() in fe-protocol2.c, if PQmakeEmptyPGresult returns NULL, shouldn't the error message be forced to "out of memory"? It is an error code path, so an error will be set, but perhaps the message is incorrect.

-	if (!res->errMsg)
-		goto failure;
+	if (res)
+	{
+		res->resultStatus = isError ? PGRES_FATAL_ERROR : PGRES_NONFATAL_ERROR;
+		res->errMsg = pqResultStrdup(res, workBuf.data);
+	}

If res->errMsg is NULL, we may have a crash later on when calling appendPQExpBufferStr on this field. I think that we should add an additional check on it.
--
Michael
On 07/04/2015 03:40 PM, Michael Paquier wrote:
> On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:
>> I came up with the attached. It fixes the few cases where we were currently
>> returning "need more data" when OOM happened, to deal with the OOM error
>> instead, by setting conn->errorMessage. How does this look to you?
>
> So this relies on the fact that when errorMessage is set, subsequent
> messages are ignored, right? This looks neat.
>
> At the bottom of getAnotherTuple() in fe-protocol2.c, if
> PQmakeEmptyPGresult returns NULL, shouldn't the error message be
> forced to "out of memory"? It is an error code path, so an error
> will be set, but perhaps the message is incorrect.
>
> If res->errMsg is NULL, we may have a crash later on when calling
> appendPQExpBufferStr on this field. I think that we should add an
> additional check on it.

Yeah, added.

I tested the various error paths this patch modifies with a debugger, and found out that the getCopyStart changes were not doing much good. The caller still waits for the COPY-IN result, so it still gets stuck. So I left out that change.

The getParamDescriptions() changes were slightly broken. The function didn't read the whole input message with pqGetInt() etc., so pqParseInput3() threw the "message contents do not agree with length in message type" error. I started fixing that by changing the error handling in that function to be more like that in getRowDescriptions(), but then I realized that all the EOF return cases in the pqParseInput3() subroutines are actually dead code.
pqParseInput3() always reads the whole message before passing it on to the right subroutine. That was documented for getRowDescriptions() and getAnotherTuple(), but the rest of the functions were more like the protocol version 2 code, prepared to deal with incomplete messages. I think it would be good to refactor that, removing the EOF return cases altogether. So I left out that change for now as well.

That left me with the attached patch. It doesn't handle the getParamDescriptions() case, nor the getCopyStart() case, but it's a start. We don't necessarily need to fix everything in one go. Does this look correct to you, as far as it goes?

- Heikki
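The framing guarantee described above - the dispatcher hands a message to a subroutine only once all of it has arrived - can be sketched as follows. This uses a toy two-byte header with invented names; real protocol-3 messages carry a type byte plus a four-byte length, but the control flow is the same.

```c
#include <stddef.h>

/* Toy model of the dispatcher: [type byte][length byte][payload]. */
typedef struct
{
    const unsigned char *buf;
    size_t      len;            /* bytes received so far */
    size_t      cursor;         /* next unread byte */
} Input;

/* Returns 1 if a complete message was available and handed out, 0 if we
 * must wait for more bytes.  Because the completeness check happens
 * here, the per-message subroutines can never legitimately run out of
 * data - which is why their EOF return paths are dead code. */
static int
dispatch_one(Input *in, unsigned char *type,
             const unsigned char **payload, size_t *plen)
{
    size_t      mlen;

    if (in->len - in->cursor < 2)
        return 0;               /* header not complete yet */
    mlen = in->buf[in->cursor + 1];
    if (in->len - in->cursor < 2 + mlen)
        return 0;               /* body not complete yet: don't dispatch */
    *type = in->buf[in->cursor];
    *payload = &in->buf[in->cursor + 2];
    *plen = mlen;
    in->cursor += 2 + mlen;
    return 1;
}
```

The protocol-2 subroutines predate this guarantee, which is why they still carry the incomplete-message handling that pqParseInput3()'s helpers inherited but no longer need.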
Attachment
On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:
> The getParamDescriptions() changes were slightly broken. It didn't read the
> whole input message with pqGetInt() etc., so pqParseInput3() threw the
> "message contents do not agree with length in message type" error. I started
> fixing that, by changing the error handling in that function to be more like
> that in getRowDescriptions(), but then I realized that all the EOF return
> cases in the pqParseInput3() subroutines are actually dead code.
> pqParseInput3() always reads the whole message before passing it on to the
> right subroutine. That was documented for getRowDescriptions() and
> getAnotherTuple(), but the rest of the functions were more like the protocol
> version 2 code, prepared to deal with incomplete messages. I think it would
> be good to refactor that, removing the EOF return cases altogether. So I
> left out that change for now as well.

Yes (the latter case is not actually used currently). Well, I don't mind writing the additional patch. On top of that, the refactoring should be a master-only change, perhaps?

> That left me with the attached patch. It doesn't handle the
> getParamDescriptions() case, nor the getCopyStart() case, but it's a start.
> We don't necessarily need to fix everything in one go. Does this look
> correct to you, as far as it goes?

I have been carefully through the modified routines, doing some tests at the same time, and I haven't spotted an issue.
--
Michael
On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Yes (the latter case is not actually used currently). Well, I don't
> mind writing the additional patch. On top of that, the refactoring
> should be a master-only change, perhaps?

I pushed the send button too early. I don't mind writing the additional patch for the other routines, a patch that should be backpatched, and the refactoring patch (master-only?).
--
Michael
On 07/07/2015 09:32 AM, Michael Paquier wrote:
> I pushed the send button too early. I don't mind writing the additional
> patch for the other routines, a patch that should be backpatched, and
> the refactoring patch (master-only?).

Ok, committed and backpatched the latest patch I posted. Yeah, we'll need additional patches for the refactoring and the remaining issues. I'm not sure if it's worth trying to backpatch them; let's see how big the patch is.

- Heikki
On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/07/2015 09:32 AM, Michael Paquier wrote:
>> I pushed the send button too early. I don't mind writing the additional
>> patch for the other routines, a patch that should be backpatched, and
>> the refactoring patch (master-only?).
>
> Ok, committed and backpatched the latest patch I posted. Yeah, we'll need
> additional patches for the refactoring and the remaining issues. I'm not
> sure if it's worth trying to backpatch them; let's see how big the patch is.

So, here are two patches aimed at fixing the hanging issues with getCopyStart and getParamDescriptions.
After trying a couple of approaches, I found out that the most elegant way to prevent the infinite loop in parseInput is to introduce a new PGASYNC flag to control the error and let the caller know what happened.

More refactoring could be done, and I think that the use of this new ASYNC flag could be spread to other places as well if you think that's useful. For now I have focused only on fixing the existing problems, with fixes that are rather back-patchable.
--
Michael
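The control flow of the PGASYNC-flag approach can be simulated in isolation as below. All names, including the enum values, are invented for illustration; the real patch adds a state to libpq's existing PGASYNC state machine rather than defining a new one.

```c
/* Invented three-state machine modeling the idea, not libpq's actual
 * PGAsyncStatusType. */
typedef enum
{
    ASYNC_BUSY,                 /* still expecting data from the server */
    ASYNC_READY,                /* a result is available */
    ASYNC_FATAL                 /* parsing hit an unrecoverable error */
} AsyncStatus;

typedef struct
{
    AsyncStatus status;
    int         oom_at;         /* simulate OOM on this parse step; 0 = never */
    int         step;
} SimConn;

/* Stand-in for parseInput(): on simulated OOM it flips the connection
 * to ASYNC_FATAL instead of silently returning, so the caller can stop. */
static void
parse_input(SimConn *conn)
{
    if (++conn->step == conn->oom_at)
        conn->status = ASYNC_FATAL;
    else if (conn->step >= 3)
        conn->status = ASYNC_READY;     /* "enough input parsed" */
}

/* Stand-in for PQgetResult(): without the FATAL state, this loop would
 * go back to waiting for server data that will never come. */
static const char *
get_result(SimConn *conn)
{
    while (conn->status == ASYNC_BUSY)
    {
        parse_input(conn);
        if (conn->status == ASYNC_FATAL)
            return "out of memory";
    }
    return "ok";
}
```

Because the flag lives inside the connection state rather than in a return value, none of the exported API signatures have to change, which is what makes the approach back-patchable.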
Attachment
On Thu, Jul 9, 2015 at 10:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> So, here are two patches aimed at fixing the hanging issues with
> getCopyStart and getParamDescriptions. After trying a couple of
> approaches, I found out that the most elegant way to prevent the
> infinite loop in parseInput is to introduce a new PGASYNC flag to
> control the error and let the caller know what happened.
>
> More refactoring could be done, and I think that the use of this new
> ASYNC flag could be spread to other places as well if you think that's
> useful. For now I have focused only on fixing the existing problems
> with fixes that are rather back-patchable.

As those are actually new bug fixes, I am adding an entry in CF 2015-09.
--
Michael
On Thu, Jul 9, 2015 at 6:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Ok, committed and backpatched the latest patch I posted. Yeah, we'll need
>> additional patches for the refactoring and the remaining issues. I'm not
>> sure if it's worth trying to backpatch them; let's see how big the patch is.
>
> So, here are two patches aimed at fixing the hanging issues with
> getCopyStart and getParamDescriptions. After trying a couple of
> approaches, I found out that the most elegant way to prevent the
> infinite loop in parseInput is to introduce a new PGASYNC flag to
> control the error and let the caller know what happened.

One thing that looks slightly non-elegant about this approach is that the new async status (PGASYNC_FATAL) is used to deal with errors only in a few paths in pqParseInput3() and not in the other paths, which should be okay if there is no better way. I have not spent enough time on it to suggest any better alternative, but I would like to hear what other approaches you have considered.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> One thing that looks slightly non-elegant about this approach is that the
> new async status (PGASYNC_FATAL) is used to deal with errors only in a
> few paths in pqParseInput3() and not in the other paths, which should be
> okay if there is no better way.

I considered using this logic in more code paths, but I kept focused on having a not-too-invasive, back-patchable patch, as the refactoring that would occur may be too heavy for what is usually pushed to the stable branches. Do you think it would be better to get a cleaner refactoring patch first and generalize the approach with PGASYNC_FATAL? As this is a flag internal to libpq, not exposed to the user, this is fine code-wise, but I am worried about putting more complexity than really needed into the stable branches (upthread I said the same thing, and Heikki mentioned that it is fine as long as it is easy to backpatch).
Note that I don't mind spending more time on it, or to put in other works reworking the whole patch to have something fully refactored, my goal being to have a clean fix for all supported versions for those OOM problems. I have not spent enough time on > it to suggest any better alternative, but would like to hear what other > approaches you have considered? > The other approach I have considered was to use the error string status to decide if a failure happened, and this did not finish well in PQgetResult :) It was actually a far cleaner approach to have a failure flag to decide if based on the async state process should move on to a failure code path or not. -- Michael
On Mon, Aug 31, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One thing that looks slightly non-elegant about this approach is that the
> > new async status (PGASYNC_FATAL) is used to deal with errors only in a few
> > paths in pqParseInput3() and not in the other paths, which should be okay
> > if there is no better way.
>
> I considered using this logic in more code paths, but I kept focused on
> having a not-too-invasive back-patchable patch, as the refactoring that
> would occur may be too heavy for what is usually pushed to the stable
> branches. Do you think it would be better to get a cleaner refactoring
> patch first and generalize the approach with PGASYNC_FATAL?

No, wait; let's first try to see if this is the best fix for the COPY path.
The problem with the current approach is that even if we are able to
receive the error, it doesn't completely clear the previous command
execution. As an example, if using a debugger I make getCopyStart() return
an OOM error, then the next command execution fails:

postgres=# copy t1 from stdin;
out of memory
postgres=# copy t1 from stdin;
another command is already in progress

> As this is a flag internal to libpq, not exposed to the user, this is fine
> on the code base, but I am worried about putting more complexity than
> really needed into the stable branches (upthread I said the same thing and
> Heikki mentioned that it is fine as long as it is easy to backpatch). Note
> that I don't mind spending more time on it, or reworking the whole patch
> to have something fully refactored; my goal is to have a clean fix for all
> supported versions for those OOM problems.
>
> The other approach I considered was to use the error string status to
> decide if a failure happened, and this did not finish well in PQgetResult :)

Yeah, that could be another way, and we already use that in pqParseInput3()
to distinguish the error path.

> It was actually a far cleaner approach to have a failure flag to decide,
> based on the async state, whether the process should move on to a failure
> code path or not.

I think you have already tried this, but it seems to me that if we can
handle it based on the result status, that would be better than introducing
a new state; but anyway, let's first try to sort out the problem reported
above.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 31, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> > On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > One thing that looks slightly non-elegant about this approach is that
> > > the new async status (PGASYNC_FATAL) is used to deal with errors only
> > > in a few paths in pqParseInput3() and not in the other paths, which
> > > should be okay if there is no better way.
> >
> > I considered using this logic in more code paths, but I kept focused on
> > having a not-too-invasive back-patchable patch, as the refactoring that
> > would occur may be too heavy for what is usually pushed to the stable
> > branches. Do you think it would be better to get a cleaner refactoring
> > patch first and generalize the approach with PGASYNC_FATAL?
>
> No, wait; let's first try to see if this is the best fix for the COPY path.
Sure. Thanks. (We could remove some dead code of libpq though).
> The problem with the current approach is that even if we are able to
> receive the error, it doesn't completely clear the previous command
> execution. As an example, if using a debugger I make getCopyStart()
> return an OOM error, then the next command execution fails:
>
> postgres=# copy t1 from stdin;
> out of memory
> postgres=# copy t1 from stdin;
> another command is already in progress
Oops. Indeed. Using this new flag is obviously not a good idea because the
connection remains in a state that it considers as PGASYNC_FATAL, and it
may not even finish consuming the remaining messages. Perhaps I was too
sleepy the last time I tested that, well...
> > It was actually a far cleaner approach to have a failure flag to decide,
> > based on the async state, whether the process should move on to a
> > failure code path or not.
>
> I think you have already tried this, but it seems to me that if we can
> handle it based on the result status, that would be better than
> introducing a new state; but anyway, let's first try to sort out the
> problem reported above.
So, looking at that again with a fresh eye, I noticed that
getCopyDataMessage has similar error handling when it fails in
pqCheckInBufferSpace. Looking at what I added, it seems that I am trying to
duplicate the protocol error handling even though everything is already
present, so that's really overdoing it. Hence, let's simply drop this idea
of a new flag PGASYNC_FATAL and instead treat the OOM as a synchronization
error with the server, as in the attached patch.
When emulating an OOM with this patch, I am getting this error instead of the infinite loop, which looks acceptable to me:
=# copy aa to stdout;
out of memory
lost synchronization with server: got message type "H", length 5
The connection to the server was lost. Attempting reset: Succeeded.
The extra message handling I have added at the end of getCopyStart and
getParamDescriptions still looks better adapted to me when a failure
happens, so I kept it.
Regards,
--
Michael
Attachment
On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > It was actually a far cleaner approach to have a failure flag to
> > > decide, based on the async state, whether the process should move on
> > > to a failure code path or not.
> >
> > I think you have already tried this, but it seems to me that if we can
> > handle it based on the result status, that would be better than
> > introducing a new state; but anyway, let's first try to sort out the
> > problem reported above.
>
> So, looking at that again with a fresh eye, I noticed that
> getCopyDataMessage has similar error handling when it fails in
> pqCheckInBufferSpace. Looking at what I added, it seems that I am trying
> to duplicate the protocol error handling even though everything is
> already present, so that's really overdoing it. Hence, let's simply drop
> this idea of a new flag PGASYNC_FATAL and instead treat the OOM as a
> synchronization error with the server, as in the attached patch.
I think the sync handling error drops the connection, which I feel should
only be done as a last resort for any error; and if we are able to handle
the OOM error for other kinds of messages, then we should try to handle it
for COPY-related messages as well. I have tried to do that for the COPY IN
and COPY OUT protocol messages in the attached patch, with logic similar
to what is currently used in the code. The idea is that in COPY mode, if
there is an error we just return the error, and the pending data will be
automatically discarded during the next execution, in PQexecStart().
> When emulating an OOM with this patch, I am getting this error instead of
> the infinite loop, which looks acceptable to me:
>
> =# copy aa to stdout;
> out of memory
> lost synchronization with server: got message type "H", length 5
> The connection to the server was lost. Attempting reset: Succeeded.
>
> The extra message handling I have added at the end of getCopyStart and
> getParamDescriptions still looks better adapted to me when a failure
> happens, so I kept it.
Sure, but then make the handling similar to getRowDescriptions() both for
the failure and success cases, so that the code is consistent. I have not
yet checked COPY_BOTH and getParamDescriptions(); see if you like what I
have done in the attached patch, and then we can do the same for them as
well.
Attachment
On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> > On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > It was actually a far cleaner approach to have a failure flag to
> > > > decide, based on the async state, whether the process should move
> > > > on to a failure code path or not.
> > >
> > > I think you have already tried this, but it seems to me that if we
> > > can handle it based on the result status, that would be better than
> > > introducing a new state; but anyway, let's first try to sort out the
> > > problem reported above.
> >
> > So, looking at that again with a fresh eye, I noticed that
> > getCopyDataMessage has similar error handling when it fails in
> > pqCheckInBufferSpace. Looking at what I added, it seems that I am
> > trying to duplicate the protocol error handling even though everything
> > is already present, so that's really overdoing it. Hence, let's simply
> > drop this idea of a new flag PGASYNC_FATAL and instead treat the OOM as
> > a synchronization error with the server, as in the attached patch.
>
> I think the sync handling error drops the connection, which I feel should
> only be done as a last resort for any error; and if we are able to handle
> the OOM error for other kinds of messages, then we should try to handle
> it for COPY-related messages as well. I have tried to do that for the
> COPY IN and COPY OUT protocol messages in the attached patch, with logic
> similar to what is currently used in the code. The idea is that in COPY
> mode, if there is an error we just return the error, and the pending data
> will be automatically discarded during the next execution, in
> PQexecStart().

OK. So you rely on the error state set by pqSaveErrorResult, then wrap an
error check in getCopyResult and PQexecFinish... I guess that this is
acceptable for COPY, as this code path would just keep looping infinitely
on the current HEAD.

> > When emulating an OOM with this patch, I am getting this error instead
> > of the infinite loop, which looks acceptable to me:
> > =# copy aa to stdout;
> > out of memory
> > lost synchronization with server: got message type "H", length 5
> > The connection to the server was lost. Attempting reset: Succeeded.
> >
> > The extra message handling I have added at the end of getCopyStart and
> > getParamDescriptions still looks better adapted to me when a failure
> > happens, so I kept it.
>
> Sure, but then make the handling similar to getRowDescriptions() both for
> the failure and success cases, so that the code is consistent.

No problem for me, and this actually makes refactoring easier, as those
code paths become more similar.

> Still I have not checked COPY_BOTH and getParamDescriptions(); see if you
> like what I have done in the attached patch, and then we can do the same
> for them as well.

Now, to move on to the serious things...

+       /*
+        * Advance inStart to show that the copy related message has been
+        * processed.
+        */
+       conn->inStart = conn->inCursor;
This change...

+                               /* getCopyStart() moves inStart itself */
                                conn->asyncStatus = PGASYNC_COPY_IN;
-                               break;
+                               continue;
... and this change are risky for a backpatch. They actually break the
existing replication protocol and backward compatibility, so I have no
doubt that they could break client applications that work directly with
the replication-level protocol, with commands like BASE_BACKUP,
START_REPLICATION and IDENTIFY_SYSTEM. I think that we should really try
to keep the non-error code path as close as possible to the original code.

So, I think that the right approach would be to leave pqParseInput3
immediately should an error happen, switching asyncStatus to leave the
loop in PQgetResult. This seems to work correctly with PGRES_COPY_BOTH as
well (I emulated failures with pg_basebackup and errors were caught
correctly). This brings back, of course, the introduction of a new flag
PGASYNC_FATAL, and we could replace the extra stuff you added in
getCopyResult as well. I have a WIP patch that I don't have the time to
finish now, but I'll send it once I am done. For now I am just sharing my
thoughts on the matter.
--
Michael
On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> So, I think that the right approach would be to leave pqParseInput3
> immediately should an error happen, switching asyncStatus to leave the
> loop in PQgetResult.

Sure, if you think that works, then do it that way.

> This seems to work correctly with PGRES_COPY_BOTH as well (I emulated
> failures with pg_basebackup and errors were caught correctly). This
> brings back, of course, the introduction of a new flag PGASYNC_FATAL.

I think we should try not to introduce this new flag, as it is not merely
a flag, but rather a state in the query-execution state machine. Now, if
we introduce a new error state into that state machine, then it doesn't
seem like a good idea to do that only for some of the paths; and doing it
for all the other paths calls for a somewhat larger verification cycle (to
see if everything still works in all paths).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 7, 2015 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> >
> > So, I think that the right approach would be to leave pqParseInput3
> > immediately should an error happen, switching asyncStatus to leave the
> > loop in PQgetResult.
>
> Sure, if you think that works, then do it that way.
>
> > This seems to work correctly with PGRES_COPY_BOTH as well (I emulated
> > failures with pg_basebackup and errors were caught correctly). This
> > brings back, of course, the introduction of a new flag PGASYNC_FATAL.
>
> I think we should try not to introduce this new flag, as it is not merely
> a flag, but rather a state in the query-execution state machine. Now, if
> we introduce a new error state into that state machine, then it doesn't
> seem like a good idea to do that only for some of the paths; and doing it
> for all the other paths calls for a somewhat larger verification cycle
> (to see if everything still works in all paths).

Well, the previous patch you sent added code paths to manage the failures
with the COPY protocol directly in the COPY routines, in a way similar to
what the introduction of PGASYNC_FATAL is doing, except that PGASYNC_FATAL
has the advantage of letting the end of PQgetResult know that a failure
has happened, centralizing the error check. Also, it seems to me that
switching getParamDescriptions to an error state is going to need this
flag, or at least a new flag to exit the loop when parsing the input.
Regarding the other messages that could use this flag, I just
double-checked, and it seems that all the other calls return EOF when they
do not have enough data received from the backend, or they directly switch
to another PGASYNC state, so they do not need specific fatal error
handling. Please let me know if I missed something.
In any case, attached are two patches:
- 0001 adds the OOM/failure handling for the BIND and COPY start messages.
  This time the connection is not dropped. After a failure, successive
  commands work as well; this addresses the previous issue you reported.
- 0002 is a cleanup bonus: getRowDescriptions and getAnotherTuple have
  some dead code that I think would be better removed; those are remnants
  of a copy/paste from the similar code for protocol 2.
Regards,
--
Michael
Attachment
On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
> In any case, attached are two patches:
> - 0001 adds the OOM/failure handling for the BIND and COPY start
>   messages. This time the connection is not dropped. After a failure,
>   successive commands work as well; this addresses the previous issue
>   you reported.
> - 0002 is a cleanup bonus: getRowDescriptions and getAnotherTuple have
>   some dead code that I think would be better removed; those are
>   remnants of a copy/paste from the similar code for protocol 2.

And I forgot... Attached is a simple program to test BIND messages.
--
Michael
Attachment
On Tue, Sep 8, 2015 at 12:13 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
> > In any case, attached are two patches:
> > - 0001 adds the OOM/failure handling for the BIND and COPY start
> >   messages. This time the connection is not dropped. After a failure,
> >   successive commands work as well; this addresses the previous issue
> >   you reported.
> > - 0002 is a cleanup bonus: getRowDescriptions and getAnotherTuple have
> >   some dead code that I think would be better removed; those are
> >   remnants of a copy/paste from the similar code for protocol 2.
>
> And I forgot... Attached is a simple program to test BIND messages.

And it occurred to me, after sleeping on it, that what I sent earlier is
not enough: in the case where the server has sent only a partial message,
we should not move into an error code path, but simply get back into the
loop of PQgetResult and wait for additional input from the server.
Attached are patches to fix that.
Regards,
--
Michael
Attachment
On Mon, Sep 7, 2015 at 8:43 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
> > In any case, attached are two patches:
> > - 0001 adds the OOM/failure handling for the BIND and COPY start
> >   messages. This time the connection is not dropped. After a failure,
> >   successive commands work as well; this addresses the previous issue
> >   you reported.
> > - 0002 is a cleanup bonus: getRowDescriptions and getAnotherTuple have
> >   some dead code that I think would be better removed; those are
> >   remnants of a copy/paste from the similar code for protocol 2.
>
> And I forgot... Attached is a simple program to test BIND messages.

Still the same test fails for me.

postgres=# copy t1 from stdin;
out of memory
postgres=# copy t1 from stdin;    -- hangs here

It hangs on the second statement. Basically, I think you can't change the
state to PGASYNC_BUSY on failure in the COPY case, in the code below:

+ case PGASYNC_FATAL:
+     res = NULL;
+     /*
+      * Set the state back to BUSY, allowing parsing to proceed to
+      * consume messages coming from the server.
+      */
+     conn->asyncStatus = PGASYNC_BUSY;
+     break;

I think you need to keep the state as PGASYNC_COPY_*, so that the pending
data can be discarded.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Now, to move on to the serious things...
>
> +       /*
> +        * Advance inStart to show that the copy related message has been
> +        * processed.
> +        */
> +       conn->inStart = conn->inCursor;
> This change...
>
> +                               /* getCopyStart() moves inStart itself */
>                                 conn->asyncStatus = PGASYNC_COPY_IN;
> -                               break;
> +                               continue;
> ... and this change are risky for a backpatch. They actually break the
> existing replication protocol.

Can you please explain how it will break the replication protocol? I have
done the required handling for COPY BOTH mode as well in the attached
patch, similar to what was done for the other COPY modes in the previous
patch. Check if you still find it broken for replication.

I have only kept the changes for the COPY modes, so that once we settle on
those, similar changes can be done for getParamDescriptions().
Attachment
On Sat, Sep 12, 2015 at 5:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Still the same test fails for me.
>
> postgres=# copy t1 from stdin;
> out of memory
> postgres=# copy t1 from stdin;    -- hangs here
>
> It hangs on the second statement. Basically, I think you can't change the
> state to PGASYNC_BUSY on failure in the COPY case, in the code below:
>
> + case PGASYNC_FATAL:
> +     res = NULL;
> +     /*
> +      * Set the state back to BUSY, allowing parsing to proceed to
> +      * consume messages coming from the server.
> +      */
> +     conn->asyncStatus = PGASYNC_BUSY;
> +     break;
>
> I think you need to keep the state as PGASYNC_COPY_*, so that the pending
> data can be discarded.

Yes, right. And my patch was broken with PGRES_COPY_BOTH as well: if the
replication protocol expected to wait for some data, this would have
frozen too, as it is a two-way data exchange in this case.
--
Michael
On Sat, Sep 12, 2015 at 6:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> >
> > Now, to move on to the serious things...
> >
> > +       /*
> > +        * Advance inStart to show that the copy related message has
> > +        * been processed.
> > +        */
> > +       conn->inStart = conn->inCursor;
> > This change...
> >
> > +                               /* getCopyStart() moves inStart itself */
> >                                 conn->asyncStatus = PGASYNC_COPY_IN;
> > -                               break;
> > +                               continue;
> > ... and this change are risky for a backpatch. They actually break the
> > existing replication protocol.

First, many thanks for providing your input! I am really looking forward
to fixing those problems appropriately.

> Can you please explain how it will break the replication protocol? I have
> done the required handling for COPY BOTH mode as well in the attached
> patch, similar to what was done for the other COPY modes in the previous
> patch. Check if you still find it broken for replication.

When not enough data has been received from the server, it seems that we
should PQclear the existing result and leave pqParseInput3 immediately
with the status PGASYNC_BUSY, to be able to wait for more data. Your
patch, as-is, is breaking that promise (and this comes from the first
versions of my patches). It seems also that we should not write an error
message on the connection in this case, to be consistent with the behavior
of the back-branches. For the OOM case, we definitely need to take the
error code path though, as your patch is correctly doing, and what mine
legendary failed to do.

+ if (pqGetInt(&result->numAttributes, 2, conn))
+ {
+     errmsg = libpq_gettext("extraneous data in COPY start message");
+     goto advance_and_error;
+ }
Here an error message is not needed, and this message should have been
formulated as "insufficient data in blah" either way.

> I have only kept the changes for the COPY modes, so that once we settle
> on those, similar changes can be done for getParamDescriptions().

Yeah, I looked into this one as well, resulting in patch 0002 attached. In
this case we have the advantage of only receiving data from the server, so
pqPrepareAsyncResult handles the failure just fine. Attached as well is
the test case I used (similar to the previous one, just extended a bit to
report the error message); after getting the result from
PQsendDescribePrepared, PQresultStatus is set correctly on error and
reports "out of memory" to the user. What do you think about it?
Regards,
--
Michael
Attachment
On 09/14/2015 04:36 AM, Michael Paquier wrote: > On Sat, Sep 12, 2015 at 6:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> >> wrote: >>> >>> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> >>> wrote: >>> >>> Now, to move into the serious things... >>> >>> + /* >>> + * Advance inStart to show that the copy related message has been >>> + * processed. >>> + */ >>> + conn->inStart = conn->inCursor; >>> This change... >>> >>> + /* getCopyStart() moves >>> inStart itself */ >>> conn->asyncStatus = >>> PGASYNC_COPY_IN; >>> - break; >>> + continue; >>> ... And this change are risky for a backpatch. And they actually >>> break the existing replication protocol > > First, many thanks for providing your input! I am really looking > forward into fixing those problems appropriately. > >> Can you please explain how will it break replication protocol? >> I have done the required handling for Copy Both mode as well in attached >> patch similar to what was done for other Copy modes in previous patch. >> Check if you still find it as broken for replication? > > When not enough data has been received from the server, it seems that > we should PQclear the existing result, and leave immediately > pqParseInput3 with the status PGASYNC_BUSY to be able to wait for more > data. Your patch, as-is, is breaking that promise (and this comes from > the first versions of my patches). It seems also that we should not > write an error message on connection in this case to be consistent in > the behavior of back-branches. For the OOM case, we definitely need to > take the error code path though, as your patch is correctly doing, and > what mine legendary failed to do. 
> > + if (pqGetInt(&result->numAttributes, 2, conn))
> > + {
> > +     errmsg = libpq_gettext("extraneous data in COPY start message");
> > +     goto advance_and_error;
> > + }
> > Here an error message is not needed, and this message should have been
> > formulated as "insufficient data in blah" either way.
> >
> > > I have only kept the changes for the COPY modes, so that once we
> > > settle on those, similar changes can be done for
> > > getParamDescriptions().
> >
> > Yeah, I looked into this one as well, resulting in patch 0002 attached.
> > In this case we have the advantage of only receiving data from the
> > server, so pqPrepareAsyncResult handles the failure just fine. Attached
> > as well is the test case I used (similar to the previous one, just
> > extended a bit to report the error message); after getting the result
> > from PQsendDescribePrepared, PQresultStatus is set correctly on error
> > and reports "out of memory" to the user. What do you think about it?

Patches 0001 and 0002 look good to me. Two tiny nitpicks:

> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>  		if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>  			PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>  			PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> +			PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>  			PQstatus(streamConn) == CONNECTION_BAD)
>  			break;
>  	}

Is this really needed? AFAICS, the loop will terminate on PGRES_FATAL_ERROR
too. Without this patch, it will always return the last result from the
query, whether or not it was an error result. With this patch, it will
return the first error encountered, or the last non-error result if there
is no error. In practice there is no difference, because we don't expect
multiple results from the server.

> @@ -1343,31 +1348,33 @@ getNotify(PGconn *conn)
>   * parseInput already read the message type and length.
>   */
>  static int
> -getCopyStart(PGconn *conn, ExecStatusType copytype)
> +getCopyStart(PGconn *conn, ExecStatusType copytype, int msgLength)
>  {
>  	PGresult   *result;
>  	int			nfields;
>  	int			i;
> +	const char *errmsg = NULL;
>
>  	result = PQmakeEmptyPGresult(conn, copytype);
>  	if (!result)
> -		goto failure;
> +		goto advance_and_error;
>
>  	if (pqGetc(&conn->copy_is_binary, conn))
> -		goto failure;
> +		goto not_enough_data;
>  	result->binary = conn->copy_is_binary;
> +
>  	/* the next two bytes are the number of fields */
> -	if (pqGetInt(&(result->numAttributes), 2, conn))
> -		goto failure;
> +	if (pqGetInt(&result->numAttributes, 2, conn))
> +		goto not_enough_data;
>  	nfields = result->numAttributes;
>
>  	/* allocate space for the attribute descriptors */
> -	if (nfields > 0)
> +	if (result && nfields > 0)

'result' is never NULL here, so the NULL test is still not needed.

The 0003 patch also looks OK to me as far as it goes. I feel that some
more refactoring would be in order, though. There are still many different
styles for handling a command: sometimes it's done inline in the case
statement, sometimes it calls out to a separate function. Sometimes the
function moves inStart, sometimes it does not. On error (e.g. a short
message), sometimes the function returns 'false', sometimes it sets an
error message itself.

Patches 0001 and 0002 look small enough that we probably could backpatch
them, so that would leave 0003 as an optional master-only cleanup.

I spent some time writing a regression test for this. This is pretty
difficult to test otherwise, and it's a good candidate for automated,
exhaustive testing. The only difficulty is injecting the malloc() failures
in a cross-platform way. I ended up with a glibc-only solution that uses
glibc's malloc() hooks. Other alternatives I considered were a
Linux-specific LD_PRELOAD library and the GNU ld specific -wrap option.
One cross-platform solution would be to #define malloc() and friends to
wrapper functions, and compile a special version of libpq. Anyway, I came
up with the attached test program. It hangs without your fixes, and works
with the patches applied. Thoughts?

- Heikki
Attachment
On Fri, Sep 18, 2015 at 12:55 PM, Heikki Linnakangas wrote:
> On 09/14/2015 04:36 AM, Michael Paquier wrote:
>
> Patches 0001 and 0002 look good to me. Two tiny nitpicks:
>
> > --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> > +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> > @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> >  		if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> >  			PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> >  			PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> > +			PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> >  			PQstatus(streamConn) == CONNECTION_BAD)
> >  			break;
> >  	}
>
> Is this really needed? AFAICS, the loop will terminate on
> PGRES_FATAL_ERROR too. Without this patch, it will always return the
> last result from the query, whether or not it was an error result. With
> this patch, it will return the first error encountered, or the last
> non-error result if there is no error. In practice there is no
> difference, because we don't expect multiple results from the server.

Yes, you are right. PQgetResult does the necessary processing here. This
seems to be some remnant of previous testing around the replication
protocol, and while it has no real impact on the server, it may have one
for applications that rely on libpqwalreceiver.so... Hence let's remove
it. Thanks for pointing that out.

> > @@ -1343,31 +1348,33 @@ getNotify(PGconn *conn)
> > [...]
> > + if (result && nfields > 0)
>
> 'result' is never NULL here, so the NULL test is still not needed.

I thought I had removed all of them.

> The 0003 patch also looks OK to me as far as it goes. I feel that some
> more refactoring would be in order, though. There are still many
> different styles for handling a command: sometimes it's done inline in
> the case statement, sometimes it calls out to a separate function.
> Sometimes the function moves inStart, sometimes it does not. On error
> (e.g. a short message), sometimes the function returns 'false', sometimes
> it sets an error message itself.

Sure. I had in mind a backpatch as the main goal of this set of patches,
and there is surely more room than simply removing all the dead code
paths, which is what 0003 is currently doing. Why not simply fix the holes
in the current logic to begin with, and keep the refactoring patch as
something to work on for the next CF? I don't mind working on it, FWIW.
That's just a larger topic than what is discussed on this thread.

> Patches 0001 and 0002 look small enough that we probably could backpatch
> them, so that would leave 0003 as an optional master-only cleanup.

That makes sense, and is what I had in mind as well. Attached are updated
patches; feel free to ignore 0003, which is the same as before, only
removing the dead code.

> I spent some time writing a regression test for this. This is pretty
> difficult to test otherwise, and it's a good candidate for automated,
> exhaustive testing. The only difficulty is injecting the malloc()
> failures in a cross-platform way. I ended up with a glibc-only solution
> that uses glibc's malloc() hooks. Other alternatives I considered were a
> Linux-specific LD_PRELOAD library and the GNU ld specific -wrap option.
> One cross-platform solution would be to #define malloc() and friends to
> wrapper functions, and compile a special version of libpq. Anyway, I
> came up with the attached test program. It hangs without your fixes, and
> works with the patches applied. Thoughts?

That's not going to work on OSX/BSD (as those callbacks are not present)
or Windows, but testing that on Linux only looks fine to me, and locating
it where it is, i.e. in a non-default compile path, is definitely better.
You should add a .gitignore in src/interfaces/libpq/test/ (uri-regress is
missing as well, btw).
Regards,
--
Michael
Attachment
On Fri, Sep 18, 2015 at 11:25 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 09/14/2015 04:36 AM, Michael Paquier wrote: > > Patches 0001 and 0002 look good to me. Two tiny nitpicks: > > --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || >> PQresultStatus(lastResult) == PGRES_COPY_OUT || >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR || >> PQstatus(streamConn) == CONNECTION_BAD) >> break; >> } >> > > Is this really needed? AFAICS, the loop will terminate on > PGRES_FATAL_ERROR too. IIRC, this is required to sanely report the "out of memory" error in the case of the replication protocol (master-standby setup). This loop, and in particular this check, is quite similar to the check and loop in PQexecFinish() where we return the last result. I think it is better to keep both places in sync, and I also think this is required to report the error appropriately. I have tried manual debugging of the out-of-memory error for this case, and it works well with this check; without the check it doesn't report the error in an appropriate way (I don't remember exactly what the difference was). If required, I can again try to reproduce the scenario and share the exact report. > Patches 0001 and 0002 look small enough that we probably could backpatch > them, so that would leave 0003 as an optional master-only cleanup. > > + +advance_and_error: + /* Discard unsaved result, if any */ This part seems to be common in both patches and is also used in getRowDescriptions(); we could have a common routine for that part, if you see any benefit in doing so. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote: > IIRC, this is required to sanely report "out of memory" error in case > of replication protocol (master-standby setup). This loop and in-particular > this check is quite similar to PQexecFinish() functions check and loop > where we return last result. I think it is better to keep both the places > in-sync > and also I think this is required to report the error appropriately. I have > tried manual debugging for the out of memory error for this case and > it works well with this check and without the check it doesn't report > the error in an appropriate way(don't remember exactly what was > the difference). If required, I can again try to reproduce the scenario > and share the exact report. I just had a look at that again... Put for example a call to pg_usleep in libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling PQresultStatus, then take a breakpoint on the WAL receiver process when starting up a standby. This gives plenty of room to emulate the OOM failure in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it can start up strrep but fails afterwards. Here is the failure seen from the standby: LOG: started streaming WAL from primary at 0/3000000 on timeline 1 FATAL: could not send data to WAL stream: server closed the connection unexpectedly And from the master: LOG: unexpected EOF on standby connection The WAL receiver process ultimately restarts after. When the check on PGRES_FATAL_ERROR is added, strrep fails to start appropriately and the error is fetched correctly by the WAL receiver: FATAL: could not start WAL streaming: out of memory In short: Amit seems right to have added this check. Note for later: looking at patches during conferences is really a bad habit. -- Michael
On Sat, Sep 19, 2015 at 10:44 AM, Michael Paquier <michael.paquier@gmail.com > wrote: > > > On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote: > > IIRC, this is required to sanely report "out of memory" error in case > > of replication protocol (master-standby setup). This loop and in-particular > > this check is quite similar to PQexecFinish() functions check and loop > > where we return last result. I think it is better to keep both the places > > in-sync > > and also I think this is required to report the error appropriately. I have > > tried manual debugging for the out of memory error for this case and > > it works well with this check and without the check it doesn't report > > the error in an appropriate way(don't remember exactly what was > > the difference). If required, I can again try to reproduce the scenario > > and share the exact report. > > I just had a look at that again... Put for example a call to pg_usleep in libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling PQresultStatus, then take a breakpoint on the WAL receiver process when starting up a standby. This gives plenty of room to emulate the OOM failure in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it can start up strrep but fails afterwards. Here is the failure seen from the standby: > LOG: started streaming WAL from primary at 0/3000000 on timeline 1 > FATAL: could not send data to WAL stream: server closed the connection unexpectedly > And from the master: > LOG: unexpected EOF on standby connection > The WAL receiver process ultimately restarts after. > > When the check on PGRES_FATAL_ERROR is added, strrep fails to start appropriately and the error is fetched correctly by the WAL receiver: > FATAL: could not start WAL streaming: out of memory > > In short: Amit seems right to have added this check. 
> Okay, in that case, you can revert that change in your first patch and then that patch will be Ready for committer. In second patch [1], the handling is to continue on error as below: - if (getParamDescriptions(conn)) + if (getParamDescriptions(conn, msgLength)) return; - break; + /* getParamDescriptions() moves inStart itself */ + continue; Now, assume there is "out of memory" error in getParamDescription(), the next message is 'T' which leads to a call to getRowDescriptions() and in that function if there is an error in below part of code, then I think it will just overwrite the previous "out of memory" error (I have tried by manual debugging and forcefully generating the below error): getRowDescriptions() { .. if (pqGetInt(&(result->numAttributes), 2, conn)) { /* We should not run out of data here, so complain */ errmsg = libpq_gettext("insufficient data in \"T\" message"); goto advance_and_error; } .. } Here, the point to note is that for similar handling of error from getRowDescriptions(), we just skip the consecutive 'D' messages by checking resultStatus. I think overwriting of error for this case is not the right behaviour. [1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 10, 2015 at 9:06 PM, Amit Kapila wrote: > Okay, in that case, you can revert that change in your first patch and > then that patch will be Ready for committer. Yep. Done. > In second patch [1], the handling is to continue on error as below: > > - if (getParamDescriptions(conn)) > + if (getParamDescriptions(conn, msgLength)) > return; > - break; > + /* getParamDescriptions() moves inStart itself */ > + continue; > > Now, assume there is "out of memory" error in getParamDescription(), > the next message is 'T' which leads to a call to getRowDescriptions() > and in that function if there is an error in below part of code, then I > think it will just overwrite the previous "out of memory" error (I have > tried by manual debugging and forcefully generating the below error): > Here, the point to note is that for similar handling of error from > getRowDescriptions(), we just skip the consecutive 'D' messages > by checking resultStatus. > I think overwriting of error for this case is not the right behaviour. > [1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch Yeah, this behavior is caused by this piece of code: @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength) advance_and_error: /* Discard unsaved result, if any */ if (result && result != conn->result) PQclear(result); An idea may be to check for conn->result->resultStatus != PGRES_FATAL_ERROR to concatenate correctly the error messages using pqCatenateResultError. But just returning the first error encountered as you mention would be more natural. So I updated the patch this way. Regards, -- Michael
Attachment
On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > > > Yeah, this behavior is caused by this piece of code: > @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength) > advance_and_error: > /* Discard unsaved result, if any */ > if (result && result != conn->result) > PQclear(result); > An idea may be to check for conn->result->resultStatus != > PGRES_FATAL_ERROR to concatenate correctly the error messages using > pqCatenateResultError. But just returning the first error encountered > as you mention would be more natural. So I updated the patch this way. > Minor comment: +getParamDescriptions(PGconn *conn, int msgLength) { .. + + /* + * Advance inStart to show that the copy related message has been + * processed. + */ + conn->inStart = conn->inCursor; + .. } The reference to 'copy' in the above comment seems to be wrong. Other than that, both patches look good to me; I will mark this as Ready for Committer. Please see if you can update the patch soonish, if you find the above comment valid. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> >> Yeah, this behavior is caused by this piece of code: >> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength) >> advance_and_error: >> /* Discard unsaved result, if any */ >> if (result && result != conn->result) >> PQclear(result); >> An idea may be to check for conn->result->resultStatus != >> PGRES_FATAL_ERROR to concatenate correctly the error messages using >> pqCatenateResultError. But just returning the first error encountered >> as you mention would be more natural. So I updated the patch this way. >> > > Minor comment: > +getParamDescriptions(PGconn *conn, int msgLength) > { > .. > + > + /* > + * Advance inStart to show that the copy related message has been > + * processed. > + */ > + conn->inStart = conn->inCursor; > + > .. > } > > Reference to 'copy' in above comment seems to be wrong. Oh, yes, it is wrong. This should refer to the "t" message. > Other than that both the patches looks good to me, I will mark this as > Ready for committer. Please see if you can update the patch soonish, > if you find above comment valid. Done. Thanks for the review! -- Michael
Attachment
On Fri, Oct 16, 2015 at 11:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Other than that both the patches looks good to me, I will mark this as >> Ready for committer. Please see if you can update the patch soonish, >> if you find above comment valid. > > Done. Thanks for the review! Could it be possible to get some committer attention for this patch and get it committed for this CF? This is clearly a bug fix, and it has been present for months. At least Amit and I have reached a consensus on a way to do things. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > Could it be possible to get some committer attention for this patch > and get that committed for this CF? FWIW, I've been assuming that Heikki would handle it, since he was involved earlier. If he's not got time/interest for some reason, I could pick it up. regards, tom lane
On Tue, Dec 1, 2015 at 11:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> Could it be possible to get some committer attention for this patch >> and get that committed for this CF? > > FWIW, I've been assuming that Heikki would handle it, since he was > involved earlier. If he's not got time/interest for some reason, > I could pick it up. Thanks. -- Michael
On 16/10/15 05:00, Michael Paquier wrote: > On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com> >> wrote: >>> >>> Yeah, this behavior is caused by this piece of code: >>> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength) >>> advance_and_error: >>> /* Discard unsaved result, if any */ >>> if (result && result != conn->result) >>> PQclear(result); >>> An idea may be to check for conn->result->resultStatus != >>> PGRES_FATAL_ERROR to concatenate correctly the error messages using >>> pqCatenateResultError. But just returning the first error encountered >>> as you mention would be more natural. So I updated the patch this way. I pushed the ParamDescription fix. I'm not quite sure on the details of the COPY patch. Do you remember what this change in PQexecFinish is for: > @@ -1991,6 +1995,9 @@ PQexecFinish(PGconn *conn) > * We have to stop if we see copy in/out/both, however. We will resume > * parsing after application performs the data transfer. > * > + * Stop if we are in copy mode and error has occurred, the pending results > + * will be discarded during next execution in PQexecStart. > + * > * Also stop if the connection is lost (else we'll loop infinitely). > */ > lastResult = NULL; > @@ -2020,6 +2027,11 @@ PQexecFinish(PGconn *conn) > result->resultStatus == PGRES_COPY_BOTH || > conn->status == CONNECTION_BAD) > break; > + else if ((conn->asyncStatus == PGASYNC_COPY_IN || > + conn->asyncStatus == PGASYNC_COPY_OUT || > + conn->asyncStatus == PGASYNC_COPY_BOTH) && > + result->resultStatus == PGRES_FATAL_ERROR) > + break; > } > > return lastResult; I don't immediately see why that's needed. I ran the little tester program I wrote earlier (http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and without that change, and it behaved the same. 
Also, when I modify my tester program so that when malloc fails, all subsequent mallocs fail too, I'm still getting the "another command is already in progress" error. The problem scenario is: 1. Application calls PQexec("COPY foo FROM STDIN") 2. getCopyStart sets conn->asyncStatus = PGASYNC_COPY_IN. 3. PQgetResult calls getCopyResult() but it fails with OOM. Now, libpq is already in PGASYNC_COPY_IN mode, but PQexec() returned NULL because of the OOM, so the application doesn't know that it's in copy mode. When the application calls PQexec() again, it fails with the "another command is already in progress" error. Let's see if we can fix that too, or if we should put that off to yet another patch. - Heikki
On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 16/10/15 05:00, Michael Paquier wrote: > >> On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> >>> On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier < >>> michael.paquier@gmail.com> >>> wrote: >>> >>>> >>>> Yeah, this behavior is caused by this piece of code: >>>> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength) >>>> advance_and_error: >>>> /* Discard unsaved result, if any */ >>>> if (result && result != conn->result) >>>> PQclear(result); >>>> An idea may be to check for conn->result->resultStatus != >>>> PGRES_FATAL_ERROR to concatenate correctly the error messages using >>>> pqCatenateResultError. But just returning the first error encountered >>>> as you mention would be more natural. So I updated the patch this way. >>>> >>> > I pushed the ParamDescription fix. I'm not quite sure on the details of > the COPY patch. Do you remember what this change in PQexecFinish is for: > > @@ -1991,6 +1995,9 @@ PQexecFinish(PGconn *conn) >> * We have to stop if we see copy in/out/both, however. We will >> resume >> * parsing after application performs the data transfer. >> * >> + * Stop if we are in copy mode and error has occurred, the >> pending results >> + * will be discarded during next execution in PQexecStart. >> + * >> * Also stop if the connection is lost (else we'll loop >> infinitely). >> */ >> lastResult = NULL; >> @@ -2020,6 +2027,11 @@ PQexecFinish(PGconn *conn) >> result->resultStatus == PGRES_COPY_BOTH || >> conn->status == CONNECTION_BAD) >> break; >> + else if ((conn->asyncStatus == PGASYNC_COPY_IN || >> + conn->asyncStatus == PGASYNC_COPY_OUT >> || >> + conn->asyncStatus == PGASYNC_COPY_BOTH) >> && >> + result->resultStatus == >> PGRES_FATAL_ERROR) >> + break; >> } >> >> return lastResult; >> > > I don't immediately see why that's needed. 
I ran the little tester program > I wrote earlier ( > http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and > without that change, and it behaved the same. > > Without this change, it can ignore the OOM error for the COPY command. As an example, for a command like "copy t1 from stdin;", when the flow reaches getCopyStart() in the debugger, I have manually ensured that PQmakeEmptyPGresult() returns NULL by overwriting the return value of malloc with zero, and then it enters the error path (advance_and_error). Then I just press continue, and what I observed is that without the above change it just doesn't show the OOM error, while with the above change it properly reports the OOM error. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 15, 2015 at 2:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> On 16/10/15 05:00, Michael Paquier wrote: >> I don't immediately see why that's needed. I ran the little tester program I wrote earlier ( http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and without that change, and it behaved the same. >> > > Without this change, it can ignore the OOM error for copy command. > As an example, for command like "copy t1 from stdin;", when the > flow reaches getCopyStart() in debugger, I have manually ensured > that PQmakeEmptyPGresult() return NULL by overwriting the return > value of malloc to zero and then it enters the error path > (advance_and_error) and then I just press continue and what I observed > is without above change it just doesn't show OOM error and with the > above change it properly reports OOM error. Yeah, that's basically this kind of trace: * frame #0: 0x0000000100e8edad libpq.5.dylib`getCopyStart(conn=0x00007fb238404090, copytype=PGRES_COPY_OUT, msgLength=5) + 29 at fe-protocol3.c:1400 frame #1: 0x0000000100e8d3d6 libpq.5.dylib`pqParseInput3(conn=0x00007fb238404090) + 2374 at fe-protocol3.c:383 frame #2: 0x0000000100e7f5ad libpq.5.dylib`parseInput(conn=0x00007fb238404090) + 45 at fe-exec.c:1652 frame #3: 0x0000000100e7f86b libpq.5.dylib`PQgetResult(conn=0x00007fb238404090) + 251 at fe-exec.c:1727 frame #4: 0x0000000100e7fdad libpq.5.dylib`PQexecFinish(conn=0x00007fb238404090) + 29 at fe-exec.c:2007 frame #5: 0x0000000100e7fbdc libpq.5.dylib`PQexec(conn=0x00007fb238404090, query=0x00007fb23841b760) + 92 at fe-exec.c:1838 frame #6: 0x0000000100dd2f35 psql`SendQuery(query=0x00007fb23841b760) + 965 at common.c:1064 frame #7: 0x0000000100dd99f4 psql`MainLoop(source=0x00007fff74944a90) + 2388 at mainloop.c:292 frame #8: 0x0000000100de1988 psql`main(argc=1, argv=0x00007fff5ee36510) + 3304 at startup.c:390 If an error occurs in the COPY start 
message, this allows the error generated by getCopyResult() to be fetched correctly, and ensures that what has been written to conn->errorMessage does not get overwritten by subsequent messages, so that the correct report is output to the user. -- Michael
On Tue, Dec 15, 2015 at 3:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > > > On Tue, Dec 15, 2015 at 2:22 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi> >> wrote: >>> On 16/10/15 05:00, Michael Paquier wrote: >>> I don't immediately see why that's needed. I ran the little tester >>> program I wrote earlier >>> (http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and >>> without that change, and it behaved the same. >>> >> >> Without this change, it can ignore the OOM error for copy command. >> As an example, for command like "copy t1 from stdin;", when the >> flow reaches getCopyStart() in debugger, I have manually ensured >> that PQmakeEmptyPGresult() return NULL by overwriting the return >> value of malloc to zero and then it enters the error path >> (advance_and_error) and then I just press continue and what I observed >> is without above change it just doesn't show OOM error and with the >> above change it properly reports OOM error. 
> > Yeah, that's basically this kind of trace: > * frame #0: 0x0000000100e8edad > libpq.5.dylib`getCopyStart(conn=0x00007fb238404090, copytype=PGRES_COPY_OUT, > msgLength=5) + 29 at fe-protocol3.c:1400 > frame #1: 0x0000000100e8d3d6 > libpq.5.dylib`pqParseInput3(conn=0x00007fb238404090) + 2374 at > fe-protocol3.c:383 > frame #2: 0x0000000100e7f5ad > libpq.5.dylib`parseInput(conn=0x00007fb238404090) + 45 at fe-exec.c:1652 > frame #3: 0x0000000100e7f86b > libpq.5.dylib`PQgetResult(conn=0x00007fb238404090) + 251 at fe-exec.c:1727 > frame #4: 0x0000000100e7fdad > libpq.5.dylib`PQexecFinish(conn=0x00007fb238404090) + 29 at fe-exec.c:2007 > frame #5: 0x0000000100e7fbdc > libpq.5.dylib`PQexec(conn=0x00007fb238404090, query=0x00007fb23841b760) + 92 > at fe-exec.c:1838 > frame #6: 0x0000000100dd2f35 psql`SendQuery(query=0x00007fb23841b760) + > 965 at common.c:1064 > frame #7: 0x0000000100dd99f4 psql`MainLoop(source=0x00007fff74944a90) + > 2388 at mainloop.c:292 > frame #8: 0x0000000100de1988 psql`main(argc=1, argv=0x00007fff5ee36510) > + 3304 at startup.c:390 > > If an error occurs in the COPY start message, this allows to fetch correctly > the error that is generated by getCopyResult(), and that what has been > written to conn->errorMessage does not get overwritten by subsequent > messages, allowing to output the correct report to the user. Heikki and I had a short talk about this issue in Moscow, and Heikki mentioned to me that he cannot convince himself that this would be a sane change for the COPY protocol. Based on his experience with those things, I'd rather put everything back on the table and look again at this issue with fresh eyes, with more testing involving in-core and external tools that use the COPY protocol with libpq routines. I recall having done some testing with pg_receivexlog and pg_recvlogical, but I'll look at that again. Also, I know one tool which would be interesting to test: pg_bulkload. Do you know of any other things worth looking at?
We want things that link with libpq directly. So, in short, I am planning to do more testing, even if it has proved rather clear that this change has the advantage of making the error stack clearer. But we may be missing something, so let's see. -- Michael