Re: OOM in libpq and infinite loop with getCopyStart() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: OOM in libpq and infinite loop with getCopyStart()
Date
Msg-id 23387.1459522980@sss.pgh.pa.us
Whole thread Raw
In response to Re: OOM in libpq and infinite loop with getCopyStart()  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: OOM in libpq and infinite loop with getCopyStart()
List pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, the short of my review is that we need more clarity of thought
>> about what state libpq is in after a failure like this, and what that
>> state looks like to the application, and how that should affect how
>> libpq reacts to application calls.

> Hm. I would have thought that returning to the caller a result
> generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
> string marked "out of memory" would be the best thing to do instead of
> NULL...

Sure, we're trying to do that, but the question is what happens afterward.
An app that is following the documented usage pattern for PQgetResult
will call it again, expecting to get a NULL, but as the patch stands it
might get a PGRES_COPY_XXX instead.  What drove me to decide the patch
wasn't committable was realizing that Robert was right upthread: the
change you propose in libpqwalreceiver.c is not a bug fix.  That code
is doing exactly what it's supposed to do per the libpq.sgml docs, namely
iterate till it gets a NULL.[1]  If we have to change it, that means
(a) we've changed the API spec for PQgetResult and (b) every other
existing caller of PQgetResult would need to change similarly, or else
risk corner-case bugs about as bad as the one we're trying to fix.

So the core of my complaint is that we need to fix things so that, whether
or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
better consider the behavior when we cannot), we need to leave libpq
in a state where subsequent calls to PQgetResult will return NULL.

> If getCopyStart() complains because of a lack of data, we loop
> back into pqParseInput3 to try again. If an OOM happens, we still loop
> into pqParseInput3 but all the pending data received from client needs
> to be discarded so as we don't rely on the next calls of PQexecStart
> to do the cleanup, once all the data is received we get out of the
> loop and generate the result with PGRES_FATAL_ERROR. Does that match
> what you have in mind?

If we could, that would be OK, but (a) you're only considering the
COPY OUT case not the COPY IN case, and (b) a persistent OOM condition
could very well prevent us from ever completing the copy.  For example,
some COPY DATA messages might be too big for our current buffer, but
we won't be able to enlarge the buffer.  I'm inclined to think that
reporting the OOM to the client is the highest-priority goal, and
that attempting to clean up during the next PQexec/PQsendQuery is a
reasonable design.  If we fail to do that, the client won't really
understand that it's a cleanup failure and not a failure to issue the
new query, but it's not clear why it needs to know the difference.
        regards, tom lane

[1] Well, almost.  Really, after dealing with a PGRES_COPY_XXX result
it ought to go back to pumping PQgetResult.  It doesn't, but that's
an expected client behavior, and cleaning up after that is exactly what
PQexecStart is intended for.



pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Logical Decoding - Execute join query
Next
From: Tom Lane
Date:
Subject: Re: OOM in libpq and infinite loop with getCopyStart()