Re: PQexec() hangs on OOM - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: PQexec() hangs on OOM
Date
Msg-id CAB7nPqQY=eXpgkNLaf+xQ0u3ZYAw6wHs72TfEVtiJQtT3HE7oA@mail.gmail.com
Whole thread Raw
In response to Re: PQexec() hangs on OOM  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: PQexec() hangs on OOM  (Amit Kapila <amit.kapila16@gmail.com>)
Re: PQexec() hangs on OOM  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Sat, Sep 5, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote=
:
>
>
> On Fri, Sep 4, 2015 at 12:55 PM, Michael Paquier <michael.paquier@gmail.c=
om> wrote:
>>
>>
>> On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit.kapila16@gmail.com> wr=
ote:
>>>>
>>>> It was actually a far cleaner approach to have a failure flag to decid=
e if based on the async state process should move on to a failure code path=
 or not.
>>>
>>>
>>> I think you have already tried, but it seems to me that if we can handl=
e
>>> it based on result status, that would be better, rather than introducin=
g
>>> new state, but anyway lets first try to sort out above reported problem=
.
>>
>>
>> So, looking at that again with a fresh eye, I noticed that getCopyDataMe=
ssage has a similar error handling when it fails on pqCheckInBufferSpace. S=
o looking at what I added, it seems that I am trying to duplicate the proto=
col error handling even if everything is already present, so that's really =
overdoing it. Hence, let's simply drop this idea of new flag PGASYNC_FATAL =
and instead treat the OOM as a sync handling error with the server, like in=
 the patch attached.
>>
>
> I think sync handling error, drops the connection which I feel should
> be only done as a last resort for any error and if we are able to handle
> OOM error for other kind of messages, then we should try to handle
> it for Copy related messages as well.  I have tried to do it for Copy
> In and Copy Out protocol messages in the attached patch with logic
> similar to what is currently used in code.  The idea is that in Copy
> mode if there is error we just return the error and the pending data
> will be automatically discarded in next execution during PQexecStart().

OK. So you rely on the error state set by pqSaveErrorResult, then wrap
an error check in getCopyResult and PQexecFinish... I guess that this
is acceptable for COPY as this code path would just kept looping
infinitely on the current HEAD.

>
>
>> When emulating an OOM with this patch, I am getting this error instead o=
f the infinite loop, which looks acceptable to me:
>> =3D# copy aa to stdout;
>> out of memory
>> lost synchronization with server: got message type "H", length 5
>> The connection to the server was lost. Attempting reset: Succeeded.
>>
>> The extra message handling I have added at the end of getCopyStart and g=
etParamDescriptions still looks more adapted to me when a failure happens, =
so I kept it.
>
>
> Sure, but then make the handling similar to getRowDescriptions() both for
> failure and success case, that way code will be consistent.

No problem for me, and this makes actually refactoring easier as those
code paths are more similar.

> Still I have not checked about COPY_BOTH and getParamDescriptions(),
> see if you like what I have done in attached patch, then we can do the
> similar for them as well.

Now, to move into the serious things...

+       /*
+        * Advance inStart to show that the copy related message has been
+        * processed.
+        */
+       conn->inStart =3D conn->inCursor;
This change...

+                                       /* getCopyStart() moves
inStart itself */
                                        conn->asyncStatus =3D PGASYNC_COPY_=
IN;
-                                       break;
+                                       continue;
...  And this change are risky for a backpatch. And they actually
break the existing replication protocol and backward compatibility so
I have no doubt that they could break any client applications that
work directly with the replication level protocol with commands like
BASE_BACKUP, START_REPLICATION and IDENTITY_SYSTEM. I think that we
should really try to keep the non-error code path as close as possible
to the original code.

So, I think that the right approach would be to leave immediately
pqParseInput3 should an error happen, switching asyncStatus to leave
the loop in PQgetResult. This seems as well to work correctly with
PGRES_COPY_BOTH (I emulated failures with pg_basebackup and errors
were caught up correctly. This brings back of course the introduction
of a new flag PGASYNC_FATAL and we could replace the extra stuff you
added in getCopyResult as well. I have a WIP patch that I don't have
the time to finish now, but I'll send it once I am done. For now I am
just sharing my thoughts on the matter.
--=20
Michael

pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: PQexec() hangs on OOM
Next
From: Amit Kapila
Date:
Subject: Re: PQexec() hangs on OOM