Thread: PQexec() hangs on OOM

PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
Alex Shulgin's find of a missing NULL check after strdup()
(87tx1or3cc.fsf@commandprompt.com) prompted me to do some testing of
libpq, when malloc/strdup returns NULL. To simulate running out of
memory, I wrote a little LD_PRELOAD library that allows causing malloc()
to return NULL, after a particular number of calls. and ran
src/test/examples/testlibpq with that hack.

After fixing all the missing NULL-checks, I found that testlibpq
sometimes just hangs. It happens when this malloc() call in
PQmakeEmptyPGResult() fails:

#3  0x00007f6dc86495c0 in malloc () from /home/heikki/mallocfail.so
#4  0x00007f6dc8423b6e in PQmakeEmptyPGresult (conn=0x1bea040,
status=PGRES_COMMAND_OK) at fe-exec.c:144
#5  0x00007f6dc8430b27 in pqParseInput3 (conn=0x1bea040) at
fe-protocol3.c:204
#6  0x00007f6dc8426468 in parseInput (conn=0x1bea040) at fe-exec.c:1652
#7  0x00007f6dc8426583 in PQgetResult (conn=0x1bea040) at fe-exec.c:1727
#8  0x00007f6dc8426c76 in PQexecFinish (conn=0x1bea040) at fe-exec.c:2000
#9  0x00007f6dc84268ad in PQexec (conn=0x1bea040, query=0x400e32
"BEGIN") at fe-exec.c:1834
#10 0x0000000000400b18 in main (argc=1, argv=0x7fffc9b6a568) at
testlibpq.c:59

When that malloc() returns NULL, parseInput returns without reading any
input. PQgetResult() takes that as a sign that it needs to read more
input from the server, before calling parseInput() again, and that read
never returns because there is no more data coming from the server.

I don't have any immediate plans to fix this, or to continue testing
this. There might well be more cases like this. Patches are welcome.

Attached is the little wrapper library I used to test this. testlibpq
hangs when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry
about that. I'm sure there are more sophisticated tools to do similar
testing out there somewhere..

- Heikki

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Nov 25, 2014 at 10:15 PM, Heikki Linnakangas wrote:
> When that malloc() returns NULL, parseInput returns without reading any
> input. PQgetResult() takes that as a sign that it needs to read more input
> from the server, before calling parseInput() again, and that read never
> returns because there is no more data coming from the server.
>
> I don't have any immediate plans to fix this, or to continue testing this.
> There might well be more cases like this. Patches are welcome.
>
> Attached is the little wrapper library I used to test this. testlibpq
hangs
> when run with MALLOC_FAIL_AT=110. It's really quick & dirty, sorry about
> that. I'm sure there are more sophisticated tools to do similar testing
out
> there somewhere..

With MALLOC_FAIL_AT=84, 86, 92, the backtrace just before the malloc
creating the OOM looks like that:
#0  0x00007f76316964d0 in __poll_nocancel () from /usr/lib/libc.so.6
#1  0x00007f7631971577 in pqSocketPoll (sock=4, forRead=1, forWrite=0,
end_time=-1) at fe-misc.c:1133
#2  0x00007f7631971461 in pqSocketCheck (conn=0x1495040, forRead=1,
forWrite=0, end_time=-1) at fe-misc.c:1075
#3  0x00007f76319712f8 in pqWaitTimed (forRead=1, forWrite=0,
conn=0x1495040, finish_time=-1) at fe-misc.c:1007
#4  0x00007f76319712ca in pqWait (forRead=1, forWrite=0, conn=0x1495040) at
fe-misc.c:990
#5  0x00007f763196d21d in PQgetResult (conn=0x1495040) at fe-exec.c:1711
#6  0x00007f763196d913 in PQexecFinish (conn=0x1495040) at fe-exec.c:1997
#7  0x00007f763196d576 in PQexec (conn=0x1495040, query=0x400ef2 "BEGIN")
at fe-exec.c:1831
#8  0x0000000000400bd8 in main (argc=1, argv=0x7ffd8a2644c8) at
testlibpq.c:5

In this case, as what happens is an OOM related to the allocation of
PGResult, I think that we had better store a status flag in PGConn related
to this OOM, as PGConn->errorMessage may not be empty to take care of the
ambiguity that PGResult == NULL does not necessarily mean wait for more
results. Something like PGResultStatus to avoid any API incompatibility.
Thoughts?

Looking at the other malloc() calls of llibpq, we do not really have this
ambiguity. For example if makeEmptyPGconn() == NULL means OOM. I am
guessing from the code as well that PQmakeEmptyPGresult() == NULL means
OOM, so the error handling problem comes from parseInput and its underlings.

Also in pqSaveParameterStatus, shouldn't we have a better OOM handling
there as well for pstatus?
Regards,
--
Michael

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Mon, Apr 6, 2015 at 3:54 PM, Michael Paquier  <michael.paquier@gmail.com>
wrote:

> In this case, as what happens is an OOM related to the allocation of
> PGResult, I think that we had better store a status flag in PGConn related
> to this OOM, as PGConn->errorMessage may not be empty to take care of the
> ambiguity that PGResult == NULL does not necessarily mean wait for more
> results. Something like PGResultStatus to avoid any API incompatibility.
> Thoughts?
>

Second idea: return a status with parseInput as it is not part of the APIs
of libpq.
--
Michael

Re: PQexec() hangs on OOM

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> Second idea: return a status with parseInput as it is not part of the APIs
> of libpq.

Yeah; most subroutines in libpq have a zero-or-EOF return convention,
we can make parseInput do likewise.  I'm not sure if that'd need to
propagate further down though ...

            regards, tom lane

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Mon, Apr 6, 2015 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Second idea: return a status with parseInput as it is not part of the APIs
>> of libpq.
>
> Yeah; most subroutines in libpq have a zero-or-EOF return convention,
> we can make parseInput do likewise.  I'm not sure if that'd need to
> propagate further down though ...

So, I have been looking at that in more details, and finished with the
patch attached that makes the problem go away for me with a nice "out
of memory" error. I have extended parseInput() to have it return a
status code to decide if parsing should be stopped or should continue.
Note that I have tried to be careful to make a clear distinction
between cases where routines return EOF because of not enough data
parsed and actual OOMs.

I have noticed as well that getCopyStart() in fe-protocol3.c needs to
be made a little bit smarter to make the difference between an OOM and
the not-enough-data type of problem.

This problem has a low probability to happen in the field, and no
people complained about that as well, so I think that patching only
HEAD is adapted.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 04/07/2015 09:18 AM, Michael Paquier wrote:
> @@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
>
>      result = (PGresult *) malloc(sizeof(PGresult));
>      if (!result)
> +    {
> +        printfPQExpBuffer(&conn->errorMessage,
> +                          libpq_gettext("out of memory\n"));
>          return NULL;
> +    }
>
>      result->ntups = 0;
>      result->numAttributes = 0;

That's not cool, because PQmakeEmptyPGresult may be called with conn ==
NULL. In general, I'm a bit wary of changing PQmakeEmptyResult so that
it sets the error message. Will have to check all the callers carefully
to see if that would upset any of them. And it might be called by
applications too.

> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
> be made a little bit smarter to make the difference between an OOM and
> the not-enough-data type of problem.

Yeah. getParamDescription still has the issue, with your patch.

> This problem has a low probability to happen in the field, and no
> people complained about that as well, so I think that patching only
> HEAD is adapted.

As long as the patch applies easily, I don't see much reason to not
backpatch.

- Heikki

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:
> On 04/07/2015 09:18 AM, Michael Paquier wrote:
>>
>> @@ -143,7 +143,11 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType
>> status)
>>
>> [...]
>
> That's not cool, because PQmakeEmptyPGresult may be called with conn ==
> NULL. In general, I'm a bit wary of changing PQmakeEmptyResult so that it
> sets the error message. Will have to check all the callers carefully to see
> if that would upset any of them. And it might be called by applications too.

Oops, my mistake. For a patch arguing to not change how libpq routines
behave this is bad, and contrary to what is mentioned in the docs as
well. I moved the error message into parseInput. I guess it makes more
sense there.

>> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
>> be made a little bit smarter to make the difference between an OOM and
>> the not-enough-data type of problem.
>
> Yeah. getParamDescription still has the issue, with your patch.

Check.

>> This problem has a low probability to happen in the field, and no
>> people complained about that as well, so I think that patching only
>> HEAD is adapted.
>
>
> As long as the patch applies easily, I don't see much reason to not
> backpatch.

It applies cleanly down to 9.1. For 9.0 some minor tweaks are visibly needed.
In any case, take two is attached.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 04/08/2015 07:27 AM, Michael Paquier wrote:
> On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:
>> On 04/07/2015 09:18 AM, Michael Paquier wrote:
>>> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
>>> be made a little bit smarter to make the difference between an OOM and
>>> the not-enough-data type of problem.
>>
>> Yeah. getParamDescription still has the issue, with your patch.
>
> Check.

There are still a few parseInput subroutines that have similar issues.
In fe-protocol2.c:

* pqGetErrorNotice2 returns EOF if there is not enough data, but also on
OOM. The caller assumes it's because not enough data.

* getRowDescriptions() is the same, although it sets conn->errorMessage
on OOM.

* getAnotherTuple() is the same, but it also skips over the received
data, so you won't get stuck. So maybe that's OK.

* getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

The corresponding functions in fe-protocol3.c are pretty much identical,
with the same issues. In addition:

* getParameterStatus will act as if the parameter value was "out of
memory". That'll be fun for something like client_encoding or
standard_conforming_strings. Would be good to use a small char[100]
variable, in stack, if it fits, and only use malloc for larger values.
That would cover all the common variables that need to be machine-parsed.

- Heikki

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Fri, Apr 10, 2015 at 2:42 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/08/2015 07:27 AM, Michael Paquier wrote:
>>
>> On Wed, Apr 8, 2015 at 3:54 AM, Heikki Linnakangas wrote:
>>>
>>> On 04/07/2015 09:18 AM, Michael Paquier wrote:
>>>>
>>>> I have noticed as well that getCopyStart() in fe-protocol3.c needs to
>>>> be made a little bit smarter to make the difference between an OOM and
>>>> the not-enough-data type of problem.
>>>
>>>
>>> Yeah. getParamDescription still has the issue, with your patch.
>>
>>
>> Check.
>
>
> There are still a few parseInput subroutines that have similar issues. In
> fe-protocol2.c:
>
> * pqGetErrorNotice2 returns EOF if there is not enough data, but also on
> OOM. The caller assumes it's because not enough data.
> * getRowDescriptions() is the same, although it sets conn->errorMessage on
> OOM.

OK, I extended those two with the same logic as previously.

> * getAnotherTuple() is the same, but it also skips over the received data,
> so you won't get stuck. So maybe that's OK.

I think that it should be changed for consistency with the others, so
done this way for this function, and this way we only need to bail-out
if status == -2, a status of -1 meaning that there is not enough data.

> * getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..

This one is different though, it directly skips the messages.

> The corresponding functions in fe-protocol3.c are pretty much identical,
> with the same issues. In addition:
>
> * getParameterStatus will act as if the parameter value was "out of memory".
> That'll be fun for something like client_encoding or
> standard_conforming_strings. Would be good to use a small char[100]
> variable, in stack, if it fits, and only use malloc for larger values. That
> would cover all the common variables that need to be machine-parsed.

Are you suggesting to replace PQExpBuffer?

So, attached is take three for all the other things above.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Oleksandr Shulgin
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
>
>> There are still a few parseInput subroutines that have similar issues. In
>> fe-protocol2.c:
>>
>> * pqGetErrorNotice2 returns EOF if there is not enough data, but also on
>> OOM. The caller assumes it's because not enough data.
>> * getRowDescriptions() is the same, although it sets conn->errorMessage on
>> OOM.
>
> OK, I extended those two with the same logic as previously.
>
>> * getAnotherTuple() is the same, but it also skips over the received data,
>> so you won't get stuck. So maybe that's OK.
>
> I think that it should be changed for consistency with the others, so
> done this way for this function, and this way we only need to bail-out
> if status == -2, a status of -1 meaning that there is not enough data.
>
>> * getNotify() just drops any NOTIFY messages on OOM. Perhaps that's OK..
>
> This one is different though, it directly skips the messages.
>
>> The corresponding functions in fe-protocol3.c are pretty much identical,
>> with the same issues. In addition:
>>
>> * getParameterStatus will act as if the parameter value was "out of memory".
>> That'll be fun for something like client_encoding or
>> standard_conforming_strings. Would be good to use a small char[100]
>> variable, in stack, if it fits, and only use malloc for larger values. That
>> would cover all the common variables that need to be machine-parsed.
>
> Are you suggesting to replace PQExpBuffer?
>
> So, attached is take three for all the other things above.

There's still one call to pqGetErrorNotice3 that doesn't check returned
value for -2, in fe-connect.c.  Shouldn't we also check it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e7c7a25..330b8da 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2242,6 +2242,7 @@ keep_going:                        /* We will come back to here until there is
                 int            msgLength;
                 int            avail;
                 AuthRequest areq;
+                int            res;

                 /*
                  * Scan the message from current point (note that if we find
@@ -2366,11 +2367,18 @@ keep_going:                        /* We will come back to here until there is
                 {
                     if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
                     {
-                        if (pqGetErrorNotice3(conn, true))
+                        res = pqGetErrorNotice3(conn, true);
+                        if (res == -1)
                         {
                             /* We'll come back when there is more data */
                             return PGRES_POLLING_READING;
                         }
+                        else if (res == -2)
+                        {
+                            printfPQExpBuffer(&conn->errorMessage,
+                                              libpq_gettext("out of memory\n"));
+                            goto error_return;
+                        }
                     }
                     else
                     {


--
Alex

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> So, attached is take three for all the other things above.
>
> There's still one call to pqGetErrorNotice3 that doesn't check returned
> value for -2, in fe-connect.c.  Shouldn't we also check it like this:
>
> [...]

Yes, you are right. Take 4 attached is updated with something similar
to what you sent to cover this code path.
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 05/26/2015 10:01 AM, Michael Paquier wrote:
> On Thu, May 21, 2015 at 1:31 AM, Oleksandr Shulgin wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> So, attached is take three for all the other things above.
>>
>> There's still one call to pqGetErrorNotice3 that doesn't check returned
>> value for -2, in fe-connect.c.  Shouldn't we also check it like this:
>>
>> [...]
>
> Yes, you are right. Take 4 attached is updated with something similar
> to what you sent to cover this code path.

This is still a few bricks shy of a load. Before this patch, if you run
out of memory when allocating a large result set - which is probably the
most common reason for OOM in libpq - you got this error:

postgres=# select generate_series(1, 10000000);
out of memory for query result

With this patch, I got:

postgres=# select generate_series(1, 10000000);
server sent data ("D" message) without prior row description ("T" message)


Looking at the patch again, I think we should actually leave
getAnotherTuple() alone. It's a lot nicer if getAnotherTuple() can
handle the OOM error itself than propagating it to the caller.

There's only one caller for getAnotherTuple(), but for
pqGetErrorNotice3() the same is even more true: it would be much nicer
if it could handle OOM itself, without propagating it to the caller. And
it well could do so. When it's processing an ERROR, it could just set
conn->errorMessage to "out of memory", and discard the error it received
from the server. When processing a NOTICE, it could likewise just send
"out of memory" to the NOTICE processsor, and discard the message from
the server. The real problem with pqGetErrorNotice3() today is that it
treats OOM the same as "no data available yet", and we can fix that by
reading but discarding the backend message. Like getAnotherTuple() does.

In short, the idea of returning a status code from parseInput(), instead
of just dealing with the error, was a bad one. Sorry I didn't have this
epiphany earlier...

I came up with the attached. It fixes the few cases where we were
currently returning "need more data" when OOM happened, to deal with the
OOM error instead, by setting conn->errorMessage. How does this look to you?

- Heikki


--
- Heikki

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:
> In short, the idea of returning a status code from parseInput(), instead of
> just dealing with the error, was a bad one. Sorry I didn't have this
> epiphany earlier...
>
> I came up with the attached. It fixes the few cases where we were currently
> returning "need more data" when OOM happened, to deal with the OOM error
> instead, by setting conn->errorMessage. How does this look to you?

So this relies on the fact that when errorMessage is set subsequent
messages are ignored, right? This looks neat.

At the bottom of getAnotherTuple() in fe-protocol2.c if
PQmakeEmptyPGresult returns NULL, shouldn't the error message be
enforced to "out of memory"? It is an error code path, so an error
will be set, but perhaps the message is incorrect.

-       if (!res->errMsg)
-               goto failure;
+       if (res)
+       {
+               res->resultStatus = isError ? PGRES_FATAL_ERROR :
PGRES_NONFATAL_ERROR;
+               res->errMsg = pqResultStrdup(res, workBuf.data);
+       }
If res->errMsg is NULL, we may have a crash later on when calling
appendPQExpBufferStr on this field. I think that we should add an
additional check on it.
--
Michael

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 07/04/2015 03:40 PM, Michael Paquier wrote:
> On Sat, Jul 4, 2015 at 1:32 AM, Heikki Linnakangas wrote:
>> In short, the idea of returning a status code from parseInput(), instead of
>> just dealing with the error, was a bad one. Sorry I didn't have this
>> epiphany earlier...
>>
>> I came up with the attached. It fixes the few cases where we were currently
>> returning "need more data" when OOM happened, to deal with the OOM error
>> instead, by setting conn->errorMessage. How does this look to you?
>
> So this relies on the fact that when errorMessage is set subsequent
> messages are ignored, right? This looks neat.
>
> At the bottom of getAnotherTuple() in fe-protocol2.c if
> PQmakeEmptyPGresult returns NULL, shouldn't the error message be
> enforced to "out of memory"? It is an error code path, so an error
> will be set, but perhaps the message is incorrect.
>
> -       if (!res->errMsg)
> -               goto failure;
> +       if (res)
> +       {
> +               res->resultStatus = isError ? PGRES_FATAL_ERROR :
> PGRES_NONFATAL_ERROR;
> +               res->errMsg = pqResultStrdup(res, workBuf.data);
> +       }
> If res->errMsg is NULL, we may have a crash later on when calling
> appendPQExpBufferStr on this field. I think that we should add an
> additional check on it.

Yeah, added.

I tested the various error paths this patch modifies with a debugger,
and found out that the getCopyStart changes were not doing much good.
The caller still waits for the COPY-IN result, so it still gets stuck.
So I left out that change.

The getParamDescriptions() changes were slightly broken. It didn't read
the whole input message with pqGetInt() etc., so pqParseInput3() threw
the "message contents do not agree with length in message type" error. I
started fixing that, by changing the error handling in that function to
be more like that in getRowDescriptions(), but then I realized that all
the EOF return cases in the pqParseInput3() subroutines are actually
dead code. pqParseInput3() always reads the whole message, before
passing it on to the right subroutine. That was documented for
getRowDescriptions() and getAnotherTuple(), but the rest of the
functions were more like the protocol version 2 code, prepared to deal
with incomplete messages. I think it would be good to refactor that,
removing the EOF return cases altogether. So I left out that change for
now as well.

That left me with the attached patch. It doesn't handle the
getParamDescription() case, nor the getCopyStart() case, but it's a
start. We don't necessarily need to fix everything in one go. Does this
look correct to you, as far as it goes?

- Heikki


Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:
> The getParamDescriptions() changes were slightly broken. It didn't read the
> whole input message with pqGetInt() etc., so pqParseInput3() threw the
> "message contents do not agree with length in message type" error. I started
> fixing that, by changing the error handling in that function to be more like
> that in getRowDescriptions(), but then I realized that all the EOF return
> cases in the pqParseInput3() subroutines are actually dead code.
> pqParseInput3() always reads the whole message, before passing it on to the
> right subroutine. That was documented for getRowDescriptions() and
> getAnotherTuple(), but the rest of the functions were more like the protocol
> version 2 code, prepared to deal with incomplete messages. I think it would
> be good to refactor that, removing the EOF return cases altogether. So I
> left out that change for now as well.

Yes, (the latter case is not actually used currently). Well, I don't
mind writing the additional patch to update . On top of that The
refactoring should be a master-only change, perhaps?

> That left me with the attached patch. It doesn't handle the
> getParamDescription() case, nor the getCopyStart() case, but it's a start.
> We don't necessarily need to fix everything in one go. Does this look
> correct to you, as far as it goes?

I have been carefully through the routines modified, doing some tests
at the same time and I haven't spotted an issue.
--
Michael

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:
>> The getParamDescriptions() changes were slightly broken. It didn't read the
>> whole input message with pqGetInt() etc., so pqParseInput3() threw the
>> "message contents do not agree with length in message type" error. I started
>> fixing that, by changing the error handling in that function to be more like
>> that in getRowDescriptions(), but then I realized that all the EOF return
>> cases in the pqParseInput3() subroutines are actually dead code.
>> pqParseInput3() always reads the whole message, before passing it on to the
>> right subroutine. That was documented for getRowDescriptions() and
>> getAnotherTuple(), but the rest of the functions were more like the protocol
>> version 2 code, prepared to deal with incomplete messages. I think it would
>> be good to refactor that, removing the EOF return cases altogether. So I
>> left out that change for now as well.
>
> Yes, (the latter case is not actually used currently). Well, I don't
> mind writing the additional patch to update . On top of that The
> refactoring should be a master-only change, perhaps?

I pushed the send button too early.
I don't mind writing the additional patch for the other routines,
patch that should be backpatched, and the refactoring patch
(master-only?).
--
Michael

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 07/07/2015 09:32 AM, Michael Paquier wrote:
> On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:
>>> The getParamDescriptions() changes were slightly broken. It didn't read the
>>> whole input message with pqGetInt() etc., so pqParseInput3() threw the
>>> "message contents do not agree with length in message type" error. I started
>>> fixing that, by changing the error handling in that function to be more like
>>> that in getRowDescriptions(), but then I realized that all the EOF return
>>> cases in the pqParseInput3() subroutines are actually dead code.
>>> pqParseInput3() always reads the whole message, before passing it on to the
>>> right subroutine. That was documented for getRowDescriptions() and
>>> getAnotherTuple(), but the rest of the functions were more like the protocol
>>> version 2 code, prepared to deal with incomplete messages. I think it would
>>> be good to refactor that, removing the EOF return cases altogether. So I
>>> left out that change for now as well.
>>
>> Yes, (the latter case is not actually used currently). Well, I don't
>> mind writing the additional patch to update . On top of that The
>> refactoring should be a master-only change, perhaps?
>
> I pushed the send button too early.
> I don't mind writing the additional patch for the other routines,
> patch that should be backpatched, and the refactoring patch
> (master-only?).

Ok, committed and backpatched the latest patch I posted. Yeah, we'll
need additional patches for the refactoring and the remaining issues.
I'm not sure if it's worth trying to backpatch them; let's see how big
the patch is.

- Heikki

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/07/2015 09:32 AM, Michael Paquier wrote:
>>
>> On Tue, Jul 7, 2015 at 3:31 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>> On Tue, Jul 7, 2015 at 2:13 AM, Heikki Linnakangas wrote:
>>>>
>>>> The getParamDescriptions() changes were slightly broken. It didn't read
>>>> the
>>>> whole input message with pqGetInt() etc., so pqParseInput3() threw the
>>>> "message contents do not agree with length in message type" error. I
>>>> started
>>>> fixing that, by changing the error handling in that function to be more
>>>> like
>>>> that in getRowDescriptions(), but then I realized that all the EOF
>>>> return
>>>> cases in the pqParseInput3() subroutines are actually dead code.
>>>> pqParseInput3() always reads the whole message, before passing it on to
>>>> the
>>>> right subroutine. That was documented for getRowDescriptions() and
>>>> getAnotherTuple(), but the rest of the functions were more like the
>>>> protocol
>>>> version 2 code, prepared to deal with incomplete messages. I think it
>>>> would
>>>> be good to refactor that, removing the EOF return cases altogether. So I
>>>> left out that change for now as well.
>>>
>>>
>>> Yes, (the latter case is not actually used currently). Well, I don't
>>> mind writing the additional patch to update . On top of that The
>>> refactoring should be a master-only change, perhaps?
>>
>>
>> I pushed the send button too early.
>> I don't mind writing the additional patch for the other routines,
>> patch that should be backpatched, and the refactoring patch
>> (master-only?).
>
>
> Ok, committed and backpatched the latest patch I posted. Yeah, we'll need
> additional patches for the refactoring and the remaining issues. I'm not
> sure if it's worth trying to backpatch them; let's see how big the patch is.

So, here are two patches aimed at fixing the hangling issues with
getStartCopy and getParamDescriptions. After trying a couple of
approaches, I found out that the most elegant way to prevent the
infinite loop in parseInput is to introduce a new PGASYNC flag to
control the error and let the caller know what happened.

More refactoring could be done, and I think that the use of this new
ASYNC flag could be spread to other places as well if you think that's
a useful. For now I have focused only on fixing the existing problems
with fixes that are rather back-patchable.
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Thu, Jul 9, 2015 at 10:04 PM, Michael Paquier
<michael.paquier@gmail.com>wrote:

> So, here are two patches aimed at fixing the hangling issues with
> getStartCopy and getParamDescriptions. After trying a couple of
> approaches, I found out that the most elegant way to prevent the
> infinite loop in parseInput is to introduce a new PGASYNC flag to
> control the error and let the caller know what happened.
>
> More refactoring could be done, and I think that the use of this new
> ASYNC flag could be spread to other places as well if you think that's
> a useful. For now I have focused only on fixing the existing problems
> with fixes that are rather back-patchable.
>

As those are actually new bug fixes, I am adding an entry in CF 2015-09.
--
Michael

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Thu, Jul 9, 2015 at 6:34 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
>
> On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
> > On 07/07/2015 09:32 AM, Michael Paquier wrote:
> >
> >
> > Ok, committed and backpatched the latest patch I posted. Yeah, we'll
need
> > additional patches for the refactoring and the remaining issues. I'm not
> > sure if it's worth trying to backpatch them; let's see how big the
patch is.
>
> So, here are two patches aimed at fixing the hangling issues with
> getStartCopy and getParamDescriptions. After trying a couple of
> approaches, I found out that the most elegant way to prevent the
> infinite loop in parseInput is to introduce a new PGASYNC flag to
> control the error and let the caller know what happened.

One thing that looks slightly non-elegant about this approach is that
new async status (PGASYNC_FATAL) is used to deal with errors only
in few paths in function pqParseInput3() and not-other paths which should
be okay if there is no other better way.  I have not spent enough time on
it to suggest any better alternative, but would like to hear what other
approaches you have considered?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

> On Thu, Jul 9, 2015 at 6:34 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
> >
> > On Wed, Jul 8, 2015 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi>
> wrote:
> > > On 07/07/2015 09:32 AM, Michael Paquier wrote:
> > >
> > >
> > > Ok, committed and backpatched the latest patch I posted. Yeah, we'll
> need
> > > additional patches for the refactoring and the remaining issues. I'm
> not
> > > sure if it's worth trying to backpatch them; let's see how big the
> patch is.
> >
> > So, here are two patches aimed at fixing the hangling issues with
> > getStartCopy and getParamDescriptions. After trying a couple of
> > approaches, I found out that the most elegant way to prevent the
> > infinite loop in parseInput is to introduce a new PGASYNC flag to
> > control the error and let the caller know what happened.
>
> One thing that looks slightly non-elegant about this approach is that
> new async status (PGASYNC_FATAL) is used to deal with errors only
> in few paths in function pqParseInput3() and not-other paths which should
> be okay if there is no other better way.
>

I considered using this logic in more code paths, but I kept focused on
having a not-too-invasive back-patchable patch as the refactoring that
would occur may be too heavy for what is usually pushed to the stable
branches. Do you think it would be better to get a cleaner refactoring
patch first and globalize the approach with PGASYNC_FATAL? As this is a
flag internal to libpq not exposed to the user this is fine on a code base,
but I am worrying about putting in stable branches more complexity than
really needed (upthread I said the same thing and Heikki mentioned that it
is fine as long as it is easy to backpatch). Note that I don't mind
spending more time on it, or to put in other works reworking the whole
patch to have something fully refactored, my goal being to have a clean fix
for all supported versions for those OOM problems.

I have not spent enough time on
> it to suggest any better alternative, but would like to hear what other
> approaches you have considered?
>

The other approach I have considered was to use the error string status to
decide if a failure happened, and this did not finish well in PQgetResult
:) It was actually a far cleaner approach to have a failure flag to decide
if based on the async state process should move on to a failure code path
or not.
--
Michael

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Mon, Aug 31, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com
> wrote:
>
> On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> One thing that looks slightly non-elegant about this approach is that
>> new async status (PGASYNC_FATAL) is used to deal with errors only
>> in few paths in function pqParseInput3() and not-other paths which should
>> be okay if there is no other better way.
>>
>
> I considered using this logic in more code paths, but I kept focused on
> having a not-too-invasive back-patchable patch as the refactoring that
> would occur may be too heavy for what is usually pushed to the stable
> branches. Do you think it would be better to get a cleaner refactoring
> patch first and globalize the approach with PGASYNC_FATAL?
>

No wait, lets first try to see if this the best fix for Copy path.

The problem with the current approach is that even if we are able to
receive error, it doesn't completely clear the previous command
execution.  As an example, if using debugger, I make getCopyStart()
return OOM error, then the next command execution fails.

postgres=# copy t1 from stdin;
out of memory
postgres=# copy t1 from stdin;
another command is already in progress



> As this is a flag internal to libpq not exposed to the user this is fine
> on a code base, but I am worrying about putting in stable branches more
> complexity than really needed (upthread I said the same thing and Heikki
> mentioned that it is fine as long as it is easy to backpatch). Note that I
> don't mind spending more time on it, or to put in other works reworking the
> whole patch to have something fully refactored, my goal being to have a
> clean fix for all supported versions for those OOM problems.
>
> I have not spent enough time on
>> it to suggest any better alternative, but would like to hear what other
>> approaches you have considered?
>>
>
> The other approach I have considered was to use the error string status to
> decide if a failure happened, and this did not finish well in PQgetResult :)
>

Yeah, that could be another way and we already use that in
pqParseInput3() to distinguish the error path.


> It was actually a far cleaner approach to have a failure flag to decide if
> based on the async state process should move on to a failure code path or
> not.
>

I think you have already tried, but it seems to me that if we can handle
it based on result status, that would be better, rather than introducing
new state, but anyway lets first try to sort out above reported problem.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:


On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that looks slightly non-elegant about this approach is that
new async status (PGASYNC_FATAL) is used to deal with errors only
in few paths in function pqParseInput3() and not-other paths which should
be okay if there is no other better way.

I considered using this logic in more code paths, but I kept focused on having a not-too-invasive back-patchable patch as the refactoring that would occur may be too heavy for what is usually pushed to the stable branches. Do you think it would be better to get a cleaner refactoring patch first and globalize the approach with PGASYNC_FATAL?

No wait, lets first try to see if this the best fix for Copy path.

Sure. Thanks. (We could remove some dead code of libpq though).
 
The problem with the current approach is that even if we are able to
receive error, it doesn't completely clear the previous command
execution.  As an example, if using debugger, I make getCopyStart()
return OOM error, then the next command execution fails.

postgres=# copy t1 from stdin;
out of memory
postgres=# copy t1 from stdin;
another command is already in progress

Oops. Indeed. Using this new flag is obviously not a good idea because the connection remains in a state that it considers as PGASYNC_FATAL and actually it may not even finish to consume the remaining messages. Perhaps I was too sleepy last time I tested that, well...


It was actually a far cleaner approach to have a failure flag to decide if based on the async state process should move on to a failure code path or not.

I think you have already tried, but it seems to me that if we can handle
it based on result status, that would be better, rather than introducing
new state, but anyway lets first try to sort out above reported problem.

So, looking at that again with a fresh eye, I noticed that getCopyDataMessage has a similar error handling when it fails on pqCheckInBufferSpace. So looking at what I added, it seems that I am trying to duplicate the protocol error handling even if everything is already present, so that's really overdoing it. Hence, let's simply drop this idea of new flag PGASYNC_FATAL and instead treat the OOM as a sync handling error with the server, like in the patch attached.

When emulating an OOM with this patch, I am getting this error instead of the infinite loop, which looks acceptable to me:
=# copy aa to stdout;
out of memory
lost synchronization with server: got message type "H", length 5
The connection to the server was lost. Attempting reset: Succeeded.

The extra message handling I have added at the end of getCopyStart and getParamDescriptions still looks more adapted to me when a failure happens, so I kept it.
Regards,
--
Michael
Attachment

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It was actually a far cleaner approach to have a failure flag to decide if based on the async state process should move on to a failure code path or not.

I think you have already tried, but it seems to me that if we can handle
it based on result status, that would be better, rather than introducing
new state, but anyway lets first try to sort out above reported problem.

So, looking at that again with a fresh eye, I noticed that getCopyDataMessage has a similar error handling when it fails on pqCheckInBufferSpace. So looking at what I added, it seems that I am trying to duplicate the protocol error handling even if everything is already present, so that's really overdoing it. Hence, let's simply drop this idea of new flag PGASYNC_FATAL and instead treat the OOM as a sync handling error with the server, like in the patch attached.


I think sync handling error, drops the connection which I feel should
be only done as a last resort for any error and if we are able to handle
OOM error for other kind of messages, then we should try to handle
it for Copy related messages as well.  I have tried to do it for Copy
In and Copy Out protocol messages in the attached patch with logic
similar to what is currently used in code.  The idea is that in Copy
mode if there is error we just return the error and the pending data
will be automatically discarded in next execution during PQexecStart().

 
When emulating an OOM with this patch, I am getting this error instead of the infinite loop, which looks acceptable to me:
=# copy aa to stdout;
out of memory
lost synchronization with server: got message type "H", length 5
The connection to the server was lost. Attempting reset: Succeeded.

The extra message handling I have added at the end of getCopyStart and getParamDescriptions still looks more adapted to me when a failure happens, so I kept it.

Sure, but then make the handling similar to getRowDescriptions() both for
failure and success case, that way code will be consistent.

Still I have not checked about COPY_BOTH and getParamDescriptions(),
see if you like what I have done in attached patch, then we can do the
similar for them as well.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote=
:
>
>
> On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.c=
om> wrote:
>>
>>
>> On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wr=
ote:
>>>>
>>>> It was actually a far cleaner approach to have a failure flag to decid=
e if based on the async state process should move on to a failure code path=
 or not.
>>>
>>>
>>> I think you have already tried, but it seems to me that if we can handl=
e
>>> it based on result status, that would be better, rather than introducin=
g
>>> new state, but anyway lets first try to sort out above reported problem=
.
>>
>>
>> So, looking at that again with a fresh eye, I noticed that getCopyDataMe=
ssage has a similar error handling when it fails on pqCheckInBufferSpace. S=
o looking at what I added, it seems that I am trying to duplicate the proto=
col error handling even if everything is already present, so that's really =
overdoing it. Hence, let's simply drop this idea of new flag PGASYNC_FATAL =
and instead treat the OOM as a sync handling error with the server, like in=
 the patch attached.
>>
>
> I think sync handling error, drops the connection which I feel should
> be only done as a last resort for any error and if we are able to handle
> OOM error for other kind of messages, then we should try to handle
> it for Copy related messages as well.  I have tried to do it for Copy
> In and Copy Out protocol messages in the attached patch with logic
> similar to what is currently used in code.  The idea is that in Copy
> mode if there is error we just return the error and the pending data
> will be automatically discarded in next execution during PQexecStart().

OK. So you rely on the error state set by pqSaveErrorResult, then wrap
an error check in getCopyResult and PQexecFinish... I guess that this
is acceptable for COPY as this code path would just kept looping
infinitely on the current HEAD.

>
>
>> When emulating an OOM with this patch, I am getting this error instead o=
f the infinite loop, which looks acceptable to me:
>> =3D# copy aa to stdout;
>> out of memory
>> lost synchronization with server: got message type "H", length 5
>> The connection to the server was lost. Attempting reset: Succeeded.
>>
>> The extra message handling I have added at the end of getCopyStart and g=
etParamDescriptions still looks more adapted to me when a failure happens, =
so I kept it.
>
>
> Sure, but then make the handling similar to getRowDescriptions() both for
> failure and success case, that way code will be consistent.

No problem for me, and this makes actually refactoring easier as those
code paths are more similar.

> Still I have not checked about COPY_BOTH and getParamDescriptions(),
> see if you like what I have done in attached patch, then we can do the
> similar for them as well.

Now, to move into the serious things...

+       /*
+        * Advance inStart to show that the copy related message has been
+        * processed.
+        */
+       conn->inStart =3D conn->inCursor;
This change...

+                                       /* getCopyStart() moves
inStart itself */
                                        conn->asyncStatus =3D PGASYNC_COPY_=
IN;
-                                       break;
+                                       continue;
...  And this change are risky for a backpatch. And they actually
break the existing replication protocol and backward compatibility so
I have no doubt that they could break any client applications that
work directly with the replication level protocol with commands like
BASE_BACKUP, START_REPLICATION and IDENTITY_SYSTEM. I think that we
should really try to keep the non-error code path as close as possible
to the original code.

So, I think that the right approach would be to leave immediately
pqParseInput3 should an error happen, switching asyncStatus to leave
the loop in PQgetResult. This seems as well to work correctly with
PGRES_COPY_BOTH (I emulated failures with pg_basebackup and errors
were caught up correctly. This brings back of course the introduction
of a new flag PGASYNC_FATAL and we could replace the extra stuff you
added in getCopyResult as well. I have a WIP patch that I don't have
the time to finish now, but I'll send it once I am done. For now I am
just sharing my thoughts on the matter.
--=20
Michael

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
>
> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
> >
>
> So, I think that the right approach would be to leave immediately
> pqParseInput3 should an error happen, switching asyncStatus to leave
> the loop in PQgetResult.

Sure, if you think that works, then do it that way.

> This seems as well to work correctly with
> PGRES_COPY_BOTH (I emulated failures with pg_basebackup and errors
> were caught up correctly. This brings back of course the introduction
> of a new flag PGASYNC_FATAL

I think we should try not to introduce this new flag, as that is not merely
a
flag, but rather a state in query-execution state machine.  Now if we
introduce
a new error state into that state-machine, then it doesn't seem like a good
idea to do that only for some of the paths and doing it for all other paths
is a call for somewhat larger verification cycle (to see if it works in all
paths
as previously).



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Mon, Sep 7, 2015 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>>
>> So, I think that the right approach would be to leave immediately
>> pqParseInput3 should an error happen, switching asyncStatus to leave
>> the loop in PQgetResult.
>
> Sure, if you think that works, then do it that way.
>
>> This seems as well to work correctly with
>> PGRES_COPY_BOTH (I emulated failures with pg_basebackup and errors
>> were caught up correctly. This brings back of course the introduction
>> of a new flag PGASYNC_FATAL
>
> I think we should try not to introduce this new flag, as that is not merely
> a flag, but rather a state in query-execution state machine.  Now if we
> introduce a new error state into that state-machine, then it doesn't seem
> like a good idea to do that only for some of the paths and doing it for all
> other paths is a call for somewhat larger verification cycle (to see if it
> works in all paths as previously).

Well, the previous patch you sent added code paths to manage the
failures with COPY protocol directly in the COPY routines in a way
similar to what the introduction of PGASYNC_FATAL is doing, except
that PGASYNC_FATAL has the advantage to let the end of PQgetResult
know that a failure has happened, centralizing the error check. Also,
it seems to me that switching getParamDescriptions to an error state
is going to need this flag, or at least a new flag to exit the loop
when parsing the input.

Regarding the other messages that could use this flag, I just
double-checked and it seems that all the other calls return an EOF
when they do not have enough data received from the backend or they
directly switch to another PGASYNC state, so they do not need a
specific fatal error handling. Please let me know if I missed
something.

In any case, attached are two patches:
- 0001 adds the OOM/failure handling for the BIND and COPY start
messages. This time the connection is not dropped. After a failure,
successive commands work as well, this addresses the previous issue
you reported.
- 0002 is a cleanup bonus, getRowDescriptions and getAnotherTuple have
some dead code that I think would be better removed, those are
remnants from a copy/paste from the similar code of protocol 2.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
> In any case, attached are two patches:
> - 0001 adds the OOM/failure handling for the BIND and COPY start
> messages. This time the connection is not dropped. After a failure,
> successive commands work as well, this addresses the previous issue
> you reported.
> - 0002 is a cleanup bonus, getRowDescriptions and getAnotherTuple have
> some dead code that I think would be better removed, those are
> remnants from a copy/paste from the similar code of protocol 2.

And I forgot... Attached is a simple program to test BIND messages.
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Sep 8, 2015 at 12:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
>> In any case, attached are two patches:
>> - 0001 adds the OOM/failure handling for the BIND and COPY start
>> messages. This time the connection is not dropped. After a failure,
>> successive commands work as well, this addresses the previous issue
>> you reported.
>> - 0002 is a cleanup bonus, getRowDescriptions and getAnotherTuple have
>> some dead code that I think would be better removed, those are
>> remnants from a copy/paste from the similar code of protocol 2.
>
> And I forgot... Attached is a simple program to test BIND messages.

And it occurred to me after sleeping on it that what I sent earlier is
not enough: in the case where the server has sent only a partial
message, we should not move into an error code path, but simply get
back into the loop of pqGetResult and wait for additional input from
the server. Attached are patches to fix that.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Mon, Sep 7, 2015 at 8:43 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
> > In any case, attached are two patches:
> > - 0001 adds the OOM/failure handling for the BIND and COPY start
> > messages. This time the connection is not dropped. After a failure,
> > successive commands work as well, this addresses the previous issue
> > you reported.
> > - 0002 is a cleanup bonus, getRowDescriptions and getAnotherTuple have
> > some dead code that I think would be better removed, those are
> > remnants from a copy/paste from the similar code of protocol 2.
>
> And I forgot... Attached is a simple program to test BIND messages.
>

Still the same test fails for me.

postgres=# copy t1 from stdin;
out of memory
postgres=# copy t1 from stdin; --hangs here

It hangs in second statement, basically I think you can't change the
state to PGASYNC_BUSY on failure in case of copy in below code.

+ case PGASYNC_FATAL:
+ res = NULL;
+ /*
+ * Set the state back to BUSY, allowing parsing to proceed to
+ * consume messages coming from the server.
+ */
+ conn->asyncStatus = PGASYNC_BUSY;
+ break;

I think you need to keep the state as PGASYNC_COPY_*, so that
the pending data can be discarded.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Now, to move into the serious things...

+       /*
+        * Advance inStart to show that the copy related message has been
+        * processed.
+        */
+       conn->inStart = conn->inCursor;
This change...

+                                       /* getCopyStart() moves
inStart itself */
                                        conn->asyncStatus = PGASYNC_COPY_IN;
-                                       break;
+                                       continue;
...  And this change are risky for a backpatch. And they actually
break the existing replication protocol

Can you please explain how will it break replication protocol? 
 
I have done the required handling for Copy Both mode as well in attached
patch similar to what was done for other Copy modes in previous patch.
Check if you still find it as broken for replication?

I have only kept the changes for COPY modes, so that once we settle on
those, I think similar changes could be done for getParamDescriptions()


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Sep 12, 2015 at 5:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 8:43 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Mon, Sep 7, 2015 at 10:41 PM, Michael Paquier wrote:
>> > In any case, attached are two patches:
>> > - 0001 adds the OOM/failure handling for the BIND and COPY start
>> > messages. This time the connection is not dropped. After a failure,
>> > successive commands work as well, this addresses the previous issue
>> > you reported.
>> > - 0002 is a cleanup bonus, getRowDescriptions and getAnotherTuple have
>> > some dead code that I think would be better removed, those are
>> > remnants from a copy/paste from the similar code of protocol 2.
>>
>> And I forgot... Attached is a simple program to test BIND messages.
>
>
> Still the same test fails for me.
>
> postgres=# copy t1 from stdin;
> out of memory
> postgres=# copy t1 from stdin; --hangs here
>
> It hangs in second statement, basically I think you can't change the
> state to PGASYNC_BUSY on failure in case of copy in below code.
>
> + case PGASYNC_FATAL:
> + res = NULL;
> + /*
> + * Set the state back to BUSY, allowing parsing to proceed to
> + * consume messages coming from the server.
> + */
> + conn->asyncStatus = PGASYNC_BUSY;
> + break;
>
> I think you need to keep the state as PGASYNC_COPY_*, so that
> the pending data can be discarded.

Yes right. And my patch was as well broken with PGRES_COPY_BOTH. If
the replication protocol expected to wait for some data, this would
have frozen as well as it's a two-way data exchange in this case.
--
Michael

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Sep 12, 2015 at 6:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>
>> Now, to move into the serious things...
>>
>> +       /*
>> +        * Advance inStart to show that the copy related message has been
>> +        * processed.
>> +        */
>> +       conn->inStart = conn->inCursor;
>> This change...
>>
>> +                                       /* getCopyStart() moves
>> inStart itself */
>>                                         conn->asyncStatus =
>> PGASYNC_COPY_IN;
>> -                                       break;
>> +                                       continue;
>> ...  And this change are risky for a backpatch. And they actually
>> break the existing replication protocol

First, many thanks for providing your input! I am really looking
forward into fixing those problems appropriately.

> Can you please explain how will it break replication protocol?
> I have done the required handling for Copy Both mode as well in attached
> patch similar to what was done for other Copy modes in previous patch.
> Check if you still find it as broken for replication?

When not enough data has been received from the server, it seems that
we should PQclear the existing result, and leave immediately
pqParseInput3 with the status PGASYNC_BUSY to be able to wait for more
data. Your patch, as-is, is breaking that promise (and this comes from
the first versions of my patches). It seems also that we should not
write an error message on connection in this case to be consistent in
the behavior of back-branches. For the OOM case, we definitely need to
take the error code path though, as your patch is correctly doing, and
what mine legendary failed to do.

+       if (pqGetInt(&result->numAttributes, 2, conn))
+       {
+               errmsg = libpq_gettext("extraneous data in COPY start message");
+               goto advance_and_error;
+       }
Here an error message is not needed, and this message should have been
formulated as "insufficient data in blah" either way.

> I have only kept the changes for COPY modes, so that once we settle on
> those, I think similar changes could be done for getParamDescriptions()

Yeah, I looked into this one as well, resulting in patch 0002
attached. In this case we have the advantage to only receive data from
the server, so pqPrepareAsyncResult is handling the failure just fine.
Attached as well is the test case I used (similar to previous one,
just extended a bit to report the error message), after getting the
result from PQsendDescribePrepared, PQresultStatus is set correctly on
error and reports "out of memory" to the user. What do you think about
it?
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 09/14/2015 04:36 AM, Michael Paquier wrote:
> On Sat, Sep 12, 2015 at 6:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Sep 7, 2015 at 1:40 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>>
>>> On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com>
>>> wrote:
>>>
>>> Now, to move into the serious things...
>>>
>>> +       /*
>>> +        * Advance inStart to show that the copy related message has been
>>> +        * processed.
>>> +        */
>>> +       conn->inStart = conn->inCursor;
>>> This change...
>>>
>>> +                                       /* getCopyStart() moves
>>> inStart itself */
>>>                                          conn->asyncStatus =
>>> PGASYNC_COPY_IN;
>>> -                                       break;
>>> +                                       continue;
>>> ...  And this change are risky for a backpatch. And they actually
>>> break the existing replication protocol
>
> First, many thanks for providing your input! I am really looking
> forward into fixing those problems appropriately.
>
>> Can you please explain how will it break replication protocol?
>> I have done the required handling for Copy Both mode as well in attached
>> patch similar to what was done for other Copy modes in previous patch.
>> Check if you still find it as broken for replication?
>
> When not enough data has been received from the server, it seems that
> we should PQclear the existing result, and leave immediately
> pqParseInput3 with the status PGASYNC_BUSY to be able to wait for more
> data. Your patch, as-is, is breaking that promise (and this comes from
> the first versions of my patches). It seems also that we should not
> write an error message on connection in this case to be consistent in
> the behavior of back-branches. For the OOM case, we definitely need to
> take the error code path though, as your patch is correctly doing, and
> what mine legendary failed to do.
>
> +       if (pqGetInt(&result->numAttributes, 2, conn))
> +       {
> +               errmsg = libpq_gettext("extraneous data in COPY start message");
> +               goto advance_and_error;
> +       }
> Here an error message is not needed, and this message should have been
> formulated as "insufficient data in blah" either way.
>
>> I have only kept the changes for COPY modes, so that once we settle on
>> those, I think similar changes could be done for getParamDescriptions()
>
> Yeah, I looked into this one as well, resulting in patch 0002
> attached. In this case we have the advantage to only receive data from
> the server, so pqPrepareAsyncResult is handling the failure just fine.
> Attached as well is the test case I used (similar to previous one,
> just extended a bit to report the error message), after getting the
> result from PQsendDescribePrepared, PQresultStatus is set correctly on
> error and reports "out of memory" to the user. What do you think about
> it?

Patches 0001 and 0002 look good to me. Two tiny nitpicks:

> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>          if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>              PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>              PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> +            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>              PQstatus(streamConn) == CONNECTION_BAD)
>              break;
>      }

Is this really needed? AFAICS, the loop will terminate on
PGRES_FATAL_ERROR too. Without this patch, it will always return the
last result from the query, whether or not it was an error result. With
this patch, it will return the first error encountered, or last
non-error result if there is no error. In practice, there is no
difference because we don't expect multiple results from the server.

> @@ -1343,31 +1348,33 @@ getNotify(PGconn *conn)
>   * parseInput already read the message type and length.
>   */
>  static int
> -getCopyStart(PGconn *conn, ExecStatusType copytype)
> +getCopyStart(PGconn *conn, ExecStatusType copytype, int msgLength)
>  {
>      PGresult   *result;
>      int            nfields;
>      int            i;
> +    const char *errmsg = NULL;
>
>      result = PQmakeEmptyPGresult(conn, copytype);
>      if (!result)
> -        goto failure;
> +        goto advance_and_error;
>
>      if (pqGetc(&conn->copy_is_binary, conn))
> -        goto failure;
> +        goto not_enough_data;
>      result->binary = conn->copy_is_binary;
> +
>      /* the next two bytes are the number of fields    */
> -    if (pqGetInt(&(result->numAttributes), 2, conn))
> -        goto failure;
> +    if (pqGetInt(&result->numAttributes, 2, conn))
> +        goto not_enough_data;
>      nfields = result->numAttributes;
>
>      /* allocate space for the attribute descriptors */
> -    if (nfields > 0)
> +    if (result && nfields > 0)

'result' is never NULL here, so the NULL test is still not needed.

The 0003 patch also looks OK to me as far as it goes. I feel that some
more refactoring would be in order, though. There are still many
different styles for handling a command: sometimes it's done inline in
the case-statement, sometimes it calls out to a separate function.
Sometimes the function moves inStart, sometimes it does not. On error
(e.g. short message), sometimes the function returns 'false', sometimes
it sets an error message itself.

Patches 0001 and 0002 look small enough that we probably could backpatch
them, so that would leave 0003 as a optional master-only cleanup.

I spent some time writing a regression test for this. This is pretty
difficult to test otherwise, and it's a good candidate for automated,
exchaustive testing. The only difficulty is injecting the malloc()
failures in a cross-platform way. I ended up with a glibc-only solution
that uses glibc's malloc() hooks. Other alternatives I considered were
Linux-specific LD_PRELOAD, GNU ld specific -wrap option. One
cross-platform solution would be to #define malloc() and friends to
wrapper functions, and compile a special version of libpq. Anyway, I
came up with the attached test program. It hangs without your fixes, and
works with the patches applied. Thoughts?

- Heikki


Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Fri, Sep 18, 2015 at 12:55 PM, Heikki Linnakangas wrote:
> On 09/14/2015 04:36 AM, Michael Paquier wrote:
> Patches 0001 and 0002 look good to me. Two tiny nitpicks:
>
>> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>>                 if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>>                         PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>>                         PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> +                       PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>>                         PQstatus(streamConn) == CONNECTION_BAD)
>>                         break;
>>         }
>
>
> Is this really needed? AFAICS, the loop will terminate on PGRES_FATAL_ERROR
> too. Without this patch, it will always return the last result from the
> query, whether or not it was an error result. With this patch, it will
> return the first error encountered, or last non-error result if there is no
> error. In practice, there is no difference because we don't expect multiple
> results from the server.

Yes, you are right. PQgetResult is doing the necessary processing
here. This seems to be some remnant of previous testing around the
replication protocol, and while it has no real incidence for the
server, it may for applications that rely on libpqwalreceiver.so...
Hence let's remove it. Thanks for pointing out that.

>> @@ -1343,31 +1348,33 @@ getNotify(PGconn *conn)
>> [...]
>> +       if (result && nfields > 0)
>
> 'result' is never NULL here, so the NULL test is still not needed.

I thought I removed all of them.

> The 0003 patch also looks OK to me as far as it goes. I feel that some more
> refactoring would be in order, though. There are still many different styles
> for handling a command: sometimes it's done inline in the case-statement,
> sometimes it calls out to a separate function. Sometimes the function moves
> inStart, sometimes it does not. On error (e.g. short message), sometimes the
> function returns 'false', sometimes it sets an error message itself.

Sure. I had in mind a backpatch as the main goal of this set of
patches, and there is surely more room than simply removing all the
dead code paths that 0003 is currently doing. Why not simply fixing
the holes in the current logic to begin, and have the refactoring
patch as something to work on for the next CF? I don't mind working on
it FWIW. That's just a larger topic than what is discussed on this
thread.

> Patches 0001 and 0002 look small enough that we probably could backpatch
> them, so that would leave 0003 as a optional master-only cleanup.

That makes sense, and what I had in mind as well. Attached are updated
patches, feel free to ignore 0003, that's the same as before only
removing the dead code.

> I spent some time writing a regression test for this. This is pretty
> difficult to test otherwise, and it's a good candidate for automated,
> exchaustive testing. The only difficulty is injecting the malloc() failures
> in a cross-platform way. I ended up with a glibc-only solution that uses
> glibc's malloc() hooks. Other alternatives I considered were Linux-specific
> LD_PRELOAD, GNU ld specific -wrap option. One cross-platform solution would
> be to #define malloc() and friends to wrapper functions, and compile a
> special version of libpq. Anyway, I came up with the attached test program.
> It hangs without your fixes, and works with the patches applied. Thoughts?

That's not going to work on OSX/BSD (as those callbacks are not
present) and Windows, but testing that on Linux only looks fine to me,
and locating it where it is, aka in a non-default compile path is
definitely better. You should add .gitignore in
src/interfaces/libpq/test/ (uri-regress is missing as well btw).
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Fri, Sep 18, 2015 at 11:25 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

> On 09/14/2015 04:36 AM, Michael Paquier wrote:
>
> Patches 0001 and 0002 look good to me. Two tiny nitpicks:
>
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>>                 if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>>                         PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>>                         PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> +                       PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>>                         PQstatus(streamConn) == CONNECTION_BAD)
>>                         break;
>>         }
>>
>
> Is this really needed? AFAICS, the loop will terminate on
> PGRES_FATAL_ERROR too.



IIRC, this is required to sanely report "out of memory" error in case
of replication protocol (master-standby setup).  This loop and in-particular
this check is quite similar to PQexecFinish() functions check and loop
where we return last result.  I think it is better to keep both the places
in-sync
and also I think this is required to report the error appropriately.  I have
tried manual debugging for the out of memory error for this case and
it works well with this check and without the check it doesn't report
the error in an appropriate way(don't remember exactly what was
the difference).  If required, I can again try to reproduce the scenario
and share the exact report.


> Patches 0001 and 0002 look small enough that we probably could backpatch
> them, so that would leave 0003 as a optional master-only cleanup.
>
>
+
+advance_and_error:
+ /* Discard unsaved result, if any */

This part seems to be common in both patches and is also being
used in getRowDescriptions(), we can have a common routine for
that part, if you see any benefit in same.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote:
> IIRC, this is required to sanely report "out of memory" error in case
> of replication protocol (master-standby setup).  This loop and
in-particular
> this check is quite similar to PQexecFinish() functions check and loop
> where we return last result.  I think it is better to keep both the places
> in-sync
> and also I think this is required to report the error appropriately.  I
have
> tried manual debugging for the out of memory error for this case and
> it works well with this check and without the check it doesn't report
> the error in an appropriate way(don't remember exactly what was
> the difference).  If required, I can again try to reproduce the scenario
> and share the exact report.

I just had a look at that again... Put for example a call to pg_usleep in
libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling
PQresultStatus, then take a breakpoint on the WAL receiver process when
starting up a standby. This gives plenty of room to emulate the OOM failure
in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when
emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it
can start up strrep but fails afterwards. Here is the failure seen from the
standby:
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
FATAL:  could not send data to WAL stream: server closed the connection
unexpectedly
And from the master:
LOG:  unexpected EOF on standby connection
The WAL receiver process ultimately restarts after.

When the check on PGRES_FATAL_ERROR is added, strrep fails to start
appropriately and the error is fetched correctly by the WAL receiver:
FATAL:  could not start WAL streaming: out of memory

In short: Amit seems right to have added this check.

Note for later: looking at patches during conferences is really a bad habit.
--
Michael

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Sat, Sep 19, 2015 at 10:44 AM, Michael Paquier <michael.paquier@gmail.com
> wrote:
>
>
> On Fri, Sep 18, 2015 at 11:32 PM, Amit Kapila wrote:
> > IIRC, this is required to sanely report "out of memory" error in case
> > of replication protocol (master-standby setup).  This loop and
in-particular
> > this check is quite similar to PQexecFinish() functions check and loop
> > where we return last result.  I think it is better to keep both the
places
> > in-sync
> > and also I think this is required to report the error appropriately.  I
have
> > tried manual debugging for the out of memory error for this case and
> > it works well with this check and without the check it doesn't report
> > the error in an appropriate way(don't remember exactly what was
> > the difference).  If required, I can again try to reproduce the scenario
> > and share the exact report.
>
> I just had a look at that again... Put for example a call to pg_usleep in
libpqrcv_identify_system after executing IDENTIFY_SYSTEM and before calling
PQresultStatus, then take a breakpoint on the WAL receiver process when
starting up a standby. This gives plenty of room to emulate the OOM failure
in getCopyStart. When the check on PGRES_FATAL_ERROR is not added and when
emulating the OOM immediately, libpqrcv_PQexec loops twice and thinks it
can start up strrep but fails afterwards. Here is the failure seen from the
standby:
> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
> FATAL:  could not send data to WAL stream: server closed the connection
unexpectedly
> And from the master:
> LOG:  unexpected EOF on standby connection
> The WAL receiver process ultimately restarts after.
>
> When the check on PGRES_FATAL_ERROR is added, strrep fails to start
appropriately and the error is fetched correctly by the WAL receiver:
> FATAL:  could not start WAL streaming: out of memory
>
> In short: Amit seems right to have added this check.
>

Okay, in that case, you can revert that change in your first patch and
then that patch will be Ready for committer.

In second patch [1], the handling is to continue on error as below:

- if (getParamDescriptions(conn))

+ if (getParamDescriptions(conn, msgLength))

  return;

- break;

+ /* getParamDescriptions() moves inStart itself */

+ continue;


Now, assume there is "out of memory" error in getParamDescription(),
the next message is 'T' which leads to a call to getRowDescriptions()
and in that function if there is an error in below part of code, then I
think it will just overwrite the previous "out of memory" error (I have
tried by manual debugging and forcefully generating the below error):

getRowDescriptions()

{

..

if (pqGetInt(&(result->numAttributes), 2, conn))

{

/* We should not run out of data here, so complain */

errmsg = libpq_gettext("insufficient data in \"T\" message");

goto advance_and_error;

}
..
}

Here, the point to note is that for similar handling of error from
getRowDescriptions(), we just skip the consecutive 'D' messages
by checking resultStatus.

I think overwriting of error for this case is not the right  behaviour.



[1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Sat, Oct 10, 2015 at 9:06 PM, Amit Kapila wrote:
> Okay, in that case, you can revert that change in your first patch and
> then that patch will be Ready for committer.

Yep. Done.

> In second patch [1], the handling is to continue on error as below:
>
> - if (getParamDescriptions(conn))
> + if (getParamDescriptions(conn, msgLength))
>   return;
> - break;
> + /* getParamDescriptions() moves inStart itself */
> + continue;
>
> Now, assume there is "out of memory" error in getParamDescription(),
> the next message is 'T' which leads to a call to getRowDescriptions()
> and in that function if there is an error in below part of code, then I
> think it will just overwrite the previous "out of memory" error (I have
> tried by manual debugging and forcefully generating the below error):

> Here, the point to note is that for similar handling of error from
> getRowDescriptions(), we just skip the consecutive 'D' messages
> by checking resultStatus.
> I think overwriting of error for this case is not the right  behaviour.
> [1] - 0002-Fix-OOM-error-handling-in-BIND-protocol-of-libpq-2.patch

Yeah, this behavior is caused by this piece of code:
@@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength)
 advance_and_error:
        /* Discard unsaved result, if any */
        if (result && result != conn->result)
                PQclear(result);
An idea may be to check for conn->result->resultStatus !=
PGRES_FATAL_ERROR to concatenate correctly the error messages using
pqCatenateResultError. But just returning the first error encountered
as you mention would be more natural. So I updated the patch this way.
Regards,
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
>
>
> Yeah, this behavior is caused by this piece of code:
> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength)
>  advance_and_error:
>         /* Discard unsaved result, if any */
>         if (result && result != conn->result)
>                 PQclear(result);
> An idea may be to check for conn->result->resultStatus !=
> PGRES_FATAL_ERROR to concatenate correctly the error messages using
> pqCatenateResultError. But just returning the first error encountered
> as you mention would be more natural. So I updated the patch this way.
>

Minor comment:
+getParamDescriptions(PGconn *conn, int msgLength)
{
..
+
+ /*
+ * Advance inStart to show that the copy related message has been
+ * processed.
+ */
+ conn->inStart = conn->inCursor;
+
..
}

Reference to 'copy' in above comment seems to be wrong.


Other than that both the patches looks good to me, I will mark this as
Ready for committer.  Please see if you can update the patch soonish,
if you find above comment valid.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>>
>> Yeah, this behavior is caused by this piece of code:
>> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength)
>>  advance_and_error:
>>         /* Discard unsaved result, if any */
>>         if (result && result != conn->result)
>>                 PQclear(result);
>> An idea may be to check for conn->result->resultStatus !=
>> PGRES_FATAL_ERROR to concatenate correctly the error messages using
>> pqCatenateResultError. But just returning the first error encountered
>> as you mention would be more natural. So I updated the patch this way.
>>
>
> Minor comment:
> +getParamDescriptions(PGconn *conn, int msgLength)
> {
> ..
> +
> + /*
> + * Advance inStart to show that the copy related message has been
> + * processed.
> + */
> + conn->inStart = conn->inCursor;
> +
> ..
> }
>
> Reference to 'copy' in above comment seems to be wrong.

Oh, yes it wrong. This should refer to the "t" message.

> Other than that both the patches looks good to me, I will mark this as
> Ready for committer.  Please see if you can update the patch soonish,
> if you find above comment valid.

Done. Thanks for the review!
--
Michael

Attachment

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Fri, Oct 16, 2015 at 11:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Other than that both the patches looks good to me, I will mark this as
>> Ready for committer.  Please see if you can update the patch soonish,
>> if you find above comment valid.
>
> Done. Thanks for the review!

Could it be possible to get some committer attention for this patch
and get that committed for this CF? This is a clearly bug fix and it
is present for months. At least Amit and I have reached a consensus
about a way to do things.
--
Michael

Re: PQexec() hangs on OOM

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> Could it be possible to get some committer attention for this patch
> and get that committed for this CF?

FWIW, I've been assuming that Heikki would handle it, since he was
involved earlier.  If he's not got time/interest for some reason,
I could pick it up.

            regards, tom lane

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Dec 1, 2015 at 11:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Could it be possible to get some committer attention for this patch
>> and get that committed for this CF?
>
> FWIW, I've been assuming that Heikki would handle it, since he was
> involved earlier.  If he's not got time/interest for some reason,
> I could pick it up.

Thanks.
--
Michael

Re: PQexec() hangs on OOM

From
Heikki Linnakangas
Date:
On 16/10/15 05:00, Michael Paquier wrote:
> On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>>
>>> Yeah, this behavior is caused by this piece of code:
>>> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength)
>>>   advance_and_error:
>>>          /* Discard unsaved result, if any */
>>>          if (result && result != conn->result)
>>>                  PQclear(result);
>>> An idea may be to check for conn->result->resultStatus !=
>>> PGRES_FATAL_ERROR to concatenate correctly the error messages using
>>> pqCatenateResultError. But just returning the first error encountered
>>> as you mention would be more natural. So I updated the patch this way.

I pushed the ParamDescription fix. I'm not quite sure on the details of
the COPY patch. Do you remember what this change in PQexecFinish is for:

> @@ -1991,6 +1995,9 @@ PQexecFinish(PGconn *conn)
>       * We have to stop if we see copy in/out/both, however. We will resume
>       * parsing after application performs the data transfer.
>       *
> +     * Stop if we are in copy mode and error has occurred, the pending results
> +     * will be discarded during next execution in PQexecStart.
> +     *
>       * Also stop if the connection is lost (else we'll loop infinitely).
>       */
>      lastResult = NULL;
> @@ -2020,6 +2027,11 @@ PQexecFinish(PGconn *conn)
>              result->resultStatus == PGRES_COPY_BOTH ||
>              conn->status == CONNECTION_BAD)
>              break;
> +        else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
> +                  conn->asyncStatus == PGASYNC_COPY_OUT  ||
> +                  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
> +                 result->resultStatus == PGRES_FATAL_ERROR)
> +            break;
>      }
>
>      return lastResult;

I don't immediately see why that's needed. I ran the little tester
program I wrote earlier
(http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and
without that change, and it behaved the same.

Also, when I modify my tester program so that when malloc fails, all
subsequent mallocs fail too, I'm still getting the "another command is
already in progress" error. The problem scenario is:

1. Application calls PQexec("COPY foo FROM STDIN")
2. getCopyStart sets conn->asyncStatus = PGASYNC_COPY_IN.
3. PQgetResult calls getCopyResult() but it fails with OOM.

Now, libpq is already in PGASYNC_COPY_IN mode, but PQexec() returned
NULL because of the OOM, so the application doesn't know that it's in
copy mode. When the application calls PQexec() again, it fails with the
"another command is already in progress" error.

Let's see if we can fix that too, or if we should put that off to yet
another patch.

- Heikki

Re: PQexec() hangs on OOM

From
Amit Kapila
Date:
On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

> On 16/10/15 05:00, Michael Paquier wrote:
>
>> On Thu, Oct 15, 2015 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>
>>> On Sun, Oct 11, 2015 at 6:31 PM, Michael Paquier <
>>> michael.paquier@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Yeah, this behavior is caused by this piece of code:
>>>> @@ -600,7 +601,9 @@ getRowDescriptions(PGconn *conn, int msgLength)
>>>>   advance_and_error:
>>>>          /* Discard unsaved result, if any */
>>>>          if (result && result != conn->result)
>>>>                  PQclear(result);
>>>> An idea may be to check for conn->result->resultStatus !=
>>>> PGRES_FATAL_ERROR to concatenate correctly the error messages using
>>>> pqCatenateResultError. But just returning the first error encountered
>>>> as you mention would be more natural. So I updated the patch this way.
>>>>
>>>
> I pushed the ParamDescription fix. I'm not quite sure on the details of
> the COPY patch. Do you remember what this change in PQexecFinish is for:
>
> @@ -1991,6 +1995,9 @@ PQexecFinish(PGconn *conn)
>>          * We have to stop if we see copy in/out/both, however. We will
>> resume
>>          * parsing after application performs the data transfer.
>>          *
>> +        * Stop if we are in copy mode and error has occurred, the
>> pending results
>> +        * will be discarded during next execution in PQexecStart.
>> +        *
>>          * Also stop if the connection is lost (else we'll loop
>> infinitely).
>>          */
>>         lastResult = NULL;
>> @@ -2020,6 +2027,11 @@ PQexecFinish(PGconn *conn)
>>                         result->resultStatus == PGRES_COPY_BOTH ||
>>                         conn->status == CONNECTION_BAD)
>>                         break;
>> +               else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
>> +                                 conn->asyncStatus == PGASYNC_COPY_OUT
>> ||
>> +                                 conn->asyncStatus == PGASYNC_COPY_BOTH)
>> &&
>> +                                result->resultStatus ==
>> PGRES_FATAL_ERROR)
>> +                       break;
>>         }
>>
>>         return lastResult;
>>
>
> I don't immediately see why that's needed. I ran the little tester program
> I wrote earlier (
> http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and
> without that change, and it behaved the same.
>
>
Without this change, it can ignore the OOM error for copy command.
As an example, for command like "copy t1 from stdin;", when the
flow reaches getCopyStart() in debugger, I have manually ensured
that PQmakeEmptyPGresult() return NULL by overwriting the return
value of malloc to zero and then it enters the error path
(advance_and_error) and then I just press continue and what I observed
is without above change it just doesn't show OOM error and with the
above change it properly reports OOM error.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Dec 15, 2015 at 2:22 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
> On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
>> On 16/10/15 05:00, Michael Paquier wrote:
>> I don't immediately see why that's needed. I ran the little tester
program I wrote earlier (
http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and
without that change, and it behaved the same.
>>
>
> Without this change, it can ignore the OOM error for copy command.
> As an example, for command like "copy t1 from stdin;", when the
> flow reaches getCopyStart() in debugger, I have manually ensured
> that PQmakeEmptyPGresult() return NULL by overwriting the return
> value of malloc to zero and then it enters the error path
> (advance_and_error) and then I just press continue and what I observed
> is without above change it just doesn't show OOM error and with the
> above change it properly reports OOM error.

Yeah, that's basically this kind of trace:
  * frame #0: 0x0000000100e8edad
libpq.5.dylib`getCopyStart(conn=0x00007fb238404090,
copytype=PGRES_COPY_OUT, msgLength=5) + 29 at fe-protocol3.c:1400
    frame #1: 0x0000000100e8d3d6
libpq.5.dylib`pqParseInput3(conn=0x00007fb238404090) + 2374 at
fe-protocol3.c:383
    frame #2: 0x0000000100e7f5ad
libpq.5.dylib`parseInput(conn=0x00007fb238404090) + 45 at fe-exec.c:1652
    frame #3: 0x0000000100e7f86b
libpq.5.dylib`PQgetResult(conn=0x00007fb238404090) + 251 at fe-exec.c:1727
    frame #4: 0x0000000100e7fdad
libpq.5.dylib`PQexecFinish(conn=0x00007fb238404090) + 29 at fe-exec.c:2007
    frame #5: 0x0000000100e7fbdc
libpq.5.dylib`PQexec(conn=0x00007fb238404090, query=0x00007fb23841b760) +
92 at fe-exec.c:1838
    frame #6: 0x0000000100dd2f35 psql`SendQuery(query=0x00007fb23841b760) +
965 at common.c:1064
    frame #7: 0x0000000100dd99f4 psql`MainLoop(source=0x00007fff74944a90) +
2388 at mainloop.c:292
    frame #8: 0x0000000100de1988 psql`main(argc=1, argv=0x00007fff5ee36510)
+ 3304 at startup.c:390

If an error occurs in the COPY start message, this allows to fetch
correctly the error that is generated by getCopyResult(), and that what has
been written to conn->errorMessage does not get overwritten by subsequent
messages, allowing to output the correct report to the user.
--
Michael

Re: PQexec() hangs on OOM

From
Michael Paquier
Date:
On Tue, Dec 15, 2015 at 3:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Tue, Dec 15, 2015 at 2:22 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>> On Mon, Dec 14, 2015 at 10:41 PM, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>>> On 16/10/15 05:00, Michael Paquier wrote:
>>> I don't immediately see why that's needed. I ran the little tester
>>> program I wrote earlier
>>> (http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi), with and
>>> without that change, and it behaved the same.
>>>
>>
>> Without this change, it can ignore the OOM error for copy command.
>> As an example, for command like "copy t1 from stdin;", when the
>> flow reaches getCopyStart() in debugger, I have manually ensured
>> that PQmakeEmptyPGresult() return NULL by overwriting the return
>> value of malloc to zero and then it enters the error path
>> (advance_and_error) and then I just press continue and what I observed
>> is without above change it just doesn't show OOM error and with the
>> above change it properly reports OOM error.
>
> Yeah, that's basically this kind of trace:
>   * frame #0: 0x0000000100e8edad
> libpq.5.dylib`getCopyStart(conn=0x00007fb238404090, copytype=PGRES_COPY_OUT,
> msgLength=5) + 29 at fe-protocol3.c:1400
>     frame #1: 0x0000000100e8d3d6
> libpq.5.dylib`pqParseInput3(conn=0x00007fb238404090) + 2374 at
> fe-protocol3.c:383
>     frame #2: 0x0000000100e7f5ad
> libpq.5.dylib`parseInput(conn=0x00007fb238404090) + 45 at fe-exec.c:1652
>     frame #3: 0x0000000100e7f86b
> libpq.5.dylib`PQgetResult(conn=0x00007fb238404090) + 251 at fe-exec.c:1727
>     frame #4: 0x0000000100e7fdad
> libpq.5.dylib`PQexecFinish(conn=0x00007fb238404090) + 29 at fe-exec.c:2007
>     frame #5: 0x0000000100e7fbdc
> libpq.5.dylib`PQexec(conn=0x00007fb238404090, query=0x00007fb23841b760) + 92
> at fe-exec.c:1838
>     frame #6: 0x0000000100dd2f35 psql`SendQuery(query=0x00007fb23841b760) +
> 965 at common.c:1064
>     frame #7: 0x0000000100dd99f4 psql`MainLoop(source=0x00007fff74944a90) +
> 2388 at mainloop.c:292
>     frame #8: 0x0000000100de1988 psql`main(argc=1, argv=0x00007fff5ee36510)
> + 3304 at startup.c:390
>
> If an error occurs in the COPY start message, this allows to fetch correctly
> the error that is generated by getCopyResult(), and that what has been
> written to conn->errorMessage does not get overwritten by subsequent
> messages, allowing to output the correct report to the user.

Heikki and I had a short talk about this issue in Moscow, and Heikki
has mentioned me that he cannot convince himself that this would be a
sane change for the COPY protocol. Based on his experience with those
things, I'd rather put every back on the table and look again at this
issue with fresh eyes with more testing involving in-core and external
tools that use the COPY protocol with libpq routines. I recall having
doing some testing with pg_receivexlog and pg_recvlogical but I'll
look at that again. Also, I know one tool which would be interesting
to test: pg_bulkload. Do you know of any other things worth looking
at? We want things that link with libpq directly.

So, in short, I am planning to do more testing, even if it has proved
to be rather clear that this change has the advantage to make the
error stack to be clearer. But we may be missing something, so let's
see.
--
Michael