Thread: OOM in libpq and infinite loop with getCopyStart()
Hi all,

One and a half years ago Heikki reported that libpq could run into an infinite loop in a couple of code paths should an OOM happen in PQexec(): http://www.postgresql.org/message-id/547480DE.4040408@vmware.com

The code paths are:
- BIND messages for getRowDescriptions(), or 'T' messages, fixed per 7b96bf44
- Code paths handling some calls to PQmakeEmptyPGresult(), fixed per 414bef3
- Start message for COPY command, not fixed yet.

The first two problems have been addressed, but not the last one. As the last thread became long and confusing, here is a revival of the topic on a new thread, with a fresh patch and a fresh study of the problem, submitted for this CF because it would be nice to get things fixed, or at least improved in some cases, but see below...

Here is a summary of what this patch does and improves in some cases:

1) COPY IN and COPY OUT: without the patch, those two remain stuck in an infinite loop... With the patch an "out of memory" error is reported to the caller (a small caller sketch illustrating this follows below). I had a look as well at some community utilities, like pgloader and pg_bulkload, though none of them are using getCopyStart(). Do other people have ideas of things to look at? pgadmin for example?

Now, for the case of COPY_BOTH, let's take a couple of examples:

2) pg_receivexlog: with the patch, libpq correctly throws an error back to the caller, and pg_receivexlog then tries to send START_REPLICATION again at its next loop:
pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
Without the patch pg_receivexlog would just ignore the error, and at the next iteration of ParseInput() the message obtained from the server would be processed anyway, because the message cursor does not move forward in this case.

3) pg_recvlogical: similarly, pg_recvlogical fails with the following error with the patch attached:
pg_recvlogical: could not send replication command "START_REPLICATION SLOT "toto" LOGICAL 0/0": out of memory
After the failure it loops back to the next attempt and is able to perform its streaming. Without the patch, similarly to pg_receivexlog, COPY_BOTH would take the code path of ParseInput() once again and process the message, ignoring any OOM problems on the way.

4) libpqwalreceiver: upon noticing the OOM error with the patch attached, a standby fails after sending START_REPLICATION, logs the OOM error properly, and later starts a new WAL receiver process to try again. Without this patch, the WAL receiver still fails to start, but logs the following, unhelpful message:
2016-03-01 14:07:00 JST [84058]: [22-1] db=,user=,app=,client= LOG: invalid record length at 0/3000970
That's not really informative for the user :( Note though that in this case the WAL receiver does not remain in an infinite loop; it fails with this "invalid record length" error and then starts streaming at its next attempt, when the WAL receiver is started again.

So, that's how things stand. I still think it would be a gain in terms of failure visibility to log this OOM failure properly in all cases. It is true that some applications may expect libpq to *not* return an error and to process the COPY start message at the next loop of ParseInput(), but judging by what is in core, things can be handled properly.

Thoughts? I have registered this in the CF app, and a patch is attached.
-- Michael
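To make the caller-visible difference in 1) concrete, here is a minimal, hypothetical libpq caller; it is not part of the patch, and it assumes a server reachable through the default connection parameters. Without the fix, the PQexec() call below can spin forever inside libpq when getCopyStart() hits an OOM; with it, the caller gets back a PGRES_FATAL_ERROR result carrying "out of memory".

```
/*
 * oom_copy_test.c -- sketch of a caller hitting the COPY OUT start path.
 * Build roughly with:
 *   cc oom_copy_test.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* default connection parameters */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* Without the fix this call never returns if getCopyStart() hits OOM. */
	res = PQexec(conn, "COPY (SELECT 1) TO STDOUT");

	switch (PQresultStatus(res))
	{
		case PGRES_COPY_OUT:
			/* normal case: a real caller would now drain PQgetCopyData() */
			printf("copy started\n");
			break;
		case PGRES_FATAL_ERROR:
			/* with the patch, an OOM in getCopyStart() is reported here */
			fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
			break;
		default:
			fprintf(stderr, "unexpected status: %s\n",
					PQresStatus(PQresultStatus(res)));
			break;
	}

	PQclear(res);
	PQfinish(conn);
	return 0;
}
```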
Attachment
Hello, Michael

I haven't checked your patch in detail yet, but here is a thought I would like to share. In my experience it usually takes a number of rewrites before a patch is accepted. To make sure that after every rewrite your patch still solves the issue you described, you should probably provide a few test programs too. If it's possible to make these programs part of the regression test suite, that would be just great. Without such test programs I personally find it difficult to verify that your patch fixes something. If such examples are provided, your patch is much more likely to be accepted. "See, this program crashes everything. Now we apply a patch and everything works! Cool, heh?"

Best regards, Aleksander
On Thu, Mar 3, 2016 at 10:18 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> I didn't checked your patch in detail yet but here is a thought I would
> like to share.
>
> In my experience usually it takes number of rewrites before patch will
> be accepted. To make sure that after every rewrite your patch still
> solves an issue you described you should probably provide a few test
> programs too. If it's possible to make these programs part of regression
> tests suite it would be just great.

Yes, that's usually the case; this patch is not at its first version.

> Without such test programs I personally find it difficult to verify
> that your patch fixes something. If such examples are provided your
> patch would be much more likely to accepted. "See, this program crashes
> everything. Now we apply a patch and everything works! Cool, heh?"

The easiest way to perform tests with this patch is to take a debugger and force the malloc'd pointers to NULL in the code paths. That's wild but efficient, and that's actually what we did for the other two libpq patches treating those OOM failures. In the case of the standby code path, what I did was to put a sleep in the WAL receiver code and then trigger the OOM manually, leading to the errors listed upthread.

Regarding automated tests, it is largely feasible to have tests on Linux at least, for example by using LD_PRELOAD or glibc-specific malloc hooks, though those would be platform-restricted. Here is for example one for COPY IN/OUT that this patch also passes (there has already been some discussion about it):
http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi
Applying it to recovery is doable as well by manipulating the recovery protocol with libpq, though it really covers only a narrow case.
-- Michael
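For reference, the LD_PRELOAD route mentioned above can be as small as the following shim. This is a rough, glibc-only sketch and not the test from the linked thread; the FAILMALLOC_AFTER knob and the __libc_malloc fallback are assumptions tied to glibc, only malloc itself is intercepted, it is not thread-safe, and the right threshold has to be found by trial.

```
/*
 * failmalloc.c -- crude allocation-failure injector (glibc only, sketch).
 *
 * Build and use, roughly:
 *   gcc -shared -fPIC -o failmalloc.so failmalloc.c
 *   FAILMALLOC_AFTER=2000 LD_PRELOAD=./failmalloc.so \
 *       psql -c 'COPY (SELECT 1) TO stdout'
 */
#include <stdlib.h>
#include <errno.h>

extern void *__libc_malloc(size_t size);	/* glibc-specific entry point */

void *
malloc(size_t size)
{
	static int	initialized = 0;
	static long remaining = 0;

	if (!initialized)
	{
		const char *s = getenv("FAILMALLOC_AFTER");

		/* let this many allocations succeed before failing everything */
		remaining = s ? atol(s) : 1000000;
		initialized = 1;
	}

	if (remaining <= 0)
	{
		errno = ENOMEM;
		return NULL;			/* simulated out-of-memory */
	}

	remaining--;
	return __libc_malloc(size);
}
```

In practice the breakpoint-based recipe given later in the thread is more surgical, since it fails exactly the allocation inside getCopyStart() rather than everything past a threshold.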
Hello, Michael

> The easiest way to perform tests with this patch is to take a debugger
> and enforce the malloc'd pointers to NULL in the code paths.

I see. Still, I don't think that's an excuse not to provide clear steps to reproduce the issue. As I see it, anyone should be able to easily check your patch locally without having a deep understanding of how libpq is implemented or reading a thread of 48 e-mails. For instance you could provide step-by-step instructions like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you could automate these steps using a gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2 ...

where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional, but it would simplify both understanding and reviewing this patch, increase the chances of it being accepted, and significantly reduce the time required.

Best regards, Aleksander
On Fri, Mar 4, 2016 at 12:59 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>> The easiest way to perform tests with this patch is to take a debugger
>> and enforce the malloc'd pointers to NULL in the code paths.
>
> I see. Still I don't think it's an excuse to not provide clear steps to
> reproduce an issue. As I see it anyone should be able to easily check
> your patch locally without having deep understanding of how libpq is
> implemented or reading thread which contains 48 e-mails.

OK, here they are if that helps. Following those steps and having some knowledge of lldb or gdb is enough. The key point is that getCopyStart is the routine to breakpoint and make fail.

A) psql and COPY.
1) gdb --args psql -c 'COPY (SELECT 1) TO stdout'
2) Then take a breakpoint at getCopyStart()
3) run
4) Enforce the return result of PQmakeEmptyPGresult to NULL: p result = 0
5) Enjoy the infinite loop with HEAD, and the error with the patch

B) pg_receivexlog:
1) mkdir data
   gdb --args pg_receivexlog --verbose -D data/
2) Take a breakpoint at getCopyStart
3) run
4) enforce result = 0 after the call to PQmakeEmptyPGresult
5) And enjoy what has been reported

C) pg_recvlogical, similar to B.
1) Create a replication slot on a server that accepts them:
   select pg_create_logical_replication_slot('foo', 'test_decoding');
2) Same breakpoint as B) with this start command or similar (depends on where the logical slot has been created):
   pg_recvlogical --start --slot foo -f - -d postgres

D) Triggering the standby problem:
1) Patch upstream as follows, adding a sleep in libpqrcv_startstreaming of libpqwalreceiver.c. This is a simple method to get the time to take a breakpoint on the WAL receiver process that has been started, with streaming that has not begun yet.

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr startpoint, char *slotname)
 	char		cmd[256];
 	PGresult   *res;
 
+	/* ZZZzzz... */
+	pg_usleep(10000L);
+
 	/* Start streaming from the point requested by startup process */
 	if (slotname != NULL)
 		snprintf(cmd, sizeof(cmd),

Trying to design a test that takes into account the context of a WAL receiver does not seem worth it to me when a simple method like this one is available...
2) Start the standby and attach a debugger to the WAL receiver process, send SIGSTOP to it, whatever...
3) breakpoint at getCopyStart
4) Then move on with the same process as previously, and enjoy the errors.
-- Michael
Attachment
Hello, Michael

Thanks a lot for the steps to reproduce you provided. I tested your patch on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD 10.2 RELEASE (Clang 3.4.1). In both cases the patch applies cleanly, there are no warnings during compilation, and all regression tests pass. A few files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)
 			conn->status == CONNECTION_BAD)
 			break;
 		else if((conn->asyncStatus == PGASYNC_COPY_IN ||
-				 conn->asyncStatus == PGASYNC_COPY_OUT ||
+				  conn->asyncStatus == PGASYNC_COPY_OUT ||
 				  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
 				 result->resultStatus == PGRES_FATAL_ERROR)
 			break;
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static int	getParamDescriptions(PGconn *conn, int msgLength);
 static int	getAnotherTuple(PGconn *conn, int msgLength);
 static int	getParameterStatus(PGconn *conn);
 static int	getNotify(PGconn *conn);
-static int getCopyStart(PGconn *conn,
-			ExecStatusType copytype,
-			int msgLength);
+static int	getCopyStart(PGconn *conn,
+				ExecStatusType copytype,
+				int msgLength);
 static int	getReadyForQuery(PGconn *conn);
 static void reportErrorPosition(PQExpBuffer msg, const char *query,
 					int loc, int encoding);
```

Indeed your patch solves the issues you described. Still, here is something that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/
(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose -D ./temp_data/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398            const char *errmsg = NULL;
(gdb) n
1400            result = PQmakeEmptyPGresult(conn, copytype);
(gdb)
1401            if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
pg_receivexlog: disconnected; waiting 5 seconds to try again
pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398            const char *errmsg = NULL;
(gdb) n
1400            result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401            if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
pg_receivexlog: disconnected; waiting 5 seconds to try again
pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398            const char *errmsg = NULL;
```

Granted, this behaviour is a bit better than the current one. But basically it's the same infinite loop, only with pauses and warnings. I wonder if this is the behaviour we really want. For instance, wouldn't it be better just to terminate the application in the out-of-memory case? "Let it crash" as Erlang programmers say.

Best regards, Aleksander
Aleksander Alekseev wrote: > pg_receivexlog: could not send replication command "START_REPLICATION": > out of memory pg_receivexlog: disconnected; waiting 5 seconds to try > again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1) > > Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, > msgLength=3) at fe-protocol3.c:1398 1398 const char > *errmsg = NULL; > ``` > > Granted this behaviour is a bit better then the current one. But > basically it's the same infinite loop only with pauses and warnings. I > wonder if this is a behaviour we really want. For instance wouldn't it > be better just to terminate an application in out-of-memory case? "Let > it crash" as Erlang programmers say. Hmm. It would be useful to retry in the case that there is a chance that the program releases memory and can continue later. But if it will only stay there doing nothing other than retrying, then that obviously will not happen. One situation where this might help is if the overall *system* is short on memory and we expect that situation to resolve itself after a while -- after all, if the system is so loaded that it can't allocate a few more bytes for the COPY message, then odds are that other things are also crashing and eventually enough memory will be released that pg_receivexlog can continue. On the other hand, if the system is so loaded, perhaps it's better to "let it crash" and have it restart later -- presumably once the admin notices the problem and restarts it manually after cleaning up the mess. If all programs are well behaved and nothing crashes when OOM but they all retry instead, then everything will continue to retry infinitely and make no progress. That cannot be good. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Aleksander Alekseev wrote: >> pg_receivexlog: could not send replication command "START_REPLICATION": >> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try >> again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1) >> >> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, >> msgLength=3) at fe-protocol3.c:1398 1398 const char >> *errmsg = NULL; >> ``` >> >> Granted this behaviour is a bit better then the current one. But >> basically it's the same infinite loop only with pauses and warnings. I >> wonder if this is a behaviour we really want. For instance wouldn't it >> be better just to terminate an application in out-of-memory case? "Let >> it crash" as Erlang programmers say. > > Hmm. It would be useful to retry in the case that there is a chance > that the program releases memory and can continue later. But if it will > only stay there doing nothing other than retrying, then that obviously > will not happen. One situation where this might help is if the overall > *system* is short on memory and we expect that situation to resolve > itself after a while -- after all, if the system is so loaded that it > can't allocate a few more bytes for the COPY message, then odds are that > other things are also crashing and eventually enough memory will be > released that pg_receivexlog can continue. Yep, that's my assumption regarding that, at some point the system may succeed, and I don't think that we should break the current behaviors of pg_receivexlog and pg_recvlogical regarding that in the back-branches. Now, note that without the patch we actually have the same problem. Say if OOMs happen continuously in getCopyStart, with COPY_BOTH libpq would attempt to read the next message continuously and would keep failing. Except that in this case the caller has no idea what is happening as things keep running in libpq itself. > On the other hand, if the system is so loaded, perhaps it's better to > "let it crash" and have it restart later -- presumably once the admin > notices the problem and restarts it manually after cleaning up the mess. > > If all programs are well behaved and nothing crashes when OOM but they > all retry instead, then everything will continue to retry infinitely and > make no progress. That cannot be good. That's something we could take care of in those client utilities I think with a new option like --maximum-retries or similar, but anyway I think that's a different discussion. The patch I am proposing here allows a client application to be made aware of OOM errors that happen. If we don't do something about that first, something like --maximum-retries would be useless for COPY_BOTH as the client will never be made aware of the OOM that happened in libpq and would keep looping inside libpq itself until some memory is freed. -- Michael
On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Thoughts? I have registered that in the CF app, and a patch is attached.

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
 		if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
 			PQresultStatus(lastResult) == PGRES_COPY_OUT ||
 			PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+			PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
 			PQstatus(streamConn) == CONNECTION_BAD)
 			break;
 	}

I mean, why would it be a good idea to blindly skip over fatal errors?
-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Thoughts? I have registered that in the CF app, and a patch is attached.
>
> It is very difficult to believe that this is a good idea:
>
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> PQstatus(streamConn) == CONNECTION_BAD)
> break;
> }
>
> I mean, why would it be a good idea to blindly skip over fatal errors?
>
I think it is not about skipping the FATAL error, but rather about stopping trying to get further results on a FATAL error. This is to ensure that the OOM error is reported rather than ignored. There has been discussion about this previously as well [1].
Amit Kapila <amit.kapila16@gmail.com> writes: > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> It is very difficult to believe that this is a good idea: >> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || >> PQresultStatus(lastResult) == PGRES_COPY_OUT || >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR || >> PQstatus(streamConn) == CONNECTION_BAD) >> break; >> >> I mean, why would it be a good idea to blindly skip over fatal errors? > I think it is not about skipping the FATAL error, rather to stop trying to > get further results on FATAL error. If the code already includes "lost the connection" as a case to break on, I'm not quite sure why "got a query error" is not. Maybe that PQstatus check is broken too, but it doesn't seem like this patch makes it more so. regards, tom lane
On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> It is very difficult to believe that this is a good idea:
> >>
> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> >> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> >> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> >> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> >> PQstatus(streamConn) == CONNECTION_BAD)
> >> break;
> >>
> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>
> > I think it is not about skipping the FATAL error, rather to stop trying to
> > get further results on FATAL error.
>
> If the code already includes "lost the connection" as a case to break on,
> I'm not quite sure why "got a query error" is not.
>
This error check is exactly the same as in PQexecFinish(), and the comments there give some explanation hinting at the reason for continuing on error: both PQexecFinish() and libpqrcv_PQexec() return the last result if there are many, and PQexecFinish() concatenates the errors as well in some cases. Do we see any use in continuing to get results after getting a PGRES_FATAL_ERROR?
On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> >> It is very difficult to believe that this is a good idea:
>> >>
>> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>> >> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>> >> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>> >> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>> >> PQstatus(streamConn) == CONNECTION_BAD)
>> >> break;
>> >>
>> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>>
>> > I think it is not about skipping the FATAL error, rather to stop trying
>> > to
>> > get further results on FATAL error.
>>
>> If the code already includes "lost the connection" as a case to break on,
>> I'm not quite sure why "got a query error" is not.
>>
>
> This error check is exactly same as PQexecFinish() and there some
> explanation is given in comments which hints towards the reason for
> continuing on error, basically both the functions PQexecFinish() and
> libpqrcv_PQexec() returns the last result if there are many and
> PQexecFinish() concatenates the errors as well in some cases. Do we see any
> use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a couple of hours looking at that, and I still fail to see how it would be an advantage to stack errors. We reduce the error message visibility for frontend clients, and this does not prevent clients from actually seeing error messages generated by a backend, take a WAL sender. And actually, depending on the message type received, we may end up simply ignoring those error messages and have libpq discard them. So it seems like an oversight of 03a571a4. One could always say that this change breaks the case where multiple error messages are sent in a row from a client with COPY_* (IN, OUT and BOTH), because we'd get back the last error message, while with this change we let the client know the first one, though that would mean that the client is missing something when using COPY_BOTH.

Also...

@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)
 			result->resultStatus == PGRES_COPY_BOTH ||
 			conn->status == CONNECTION_BAD)
 			break;
+		else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+				  conn->asyncStatus == PGASYNC_COPY_OUT ||
+				  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+				 result->resultStatus == PGRES_FATAL_ERROR)
+			break;

The reason behind this check in PQexecFinish is that we need to make the difference between an error where there is not enough data, which means that we want to retry the message fetching through parseInput3(), and the case where an actual OOM happened, where we want to exit immediately and let the caller know that there has been a frontend error.
Now, the other reason why we went with PGRES_FATAL_ERROR here is that it was discussed that it would be overkill to assign a special PGASYNC status to detect a frontend-side failure, because we already treat other frontend-side errors with PGRES_FATAL_ERROR and assign an error message to them (see for example when an allocation fails for 'C', introduced in another patch of the OOM failure series). We still need to know that we are in PGASYNC_COPY_* mode in this code path as well, because it means that the start message has been processed, but we had a failure in deparsing it, so we bypassed it and need to fail immediately. So using PGRES_FATAL_ERROR simplifies the code paths creating an error stack. Hope this brings some light in. -- Michael
Hi Amit, On 3/22/16 3:37 AM, Michael Paquier wrote: > Hope this brings some light in. Do you know when you'll have time to respond to Michael's last email? I've marked this patch "waiting on author" in the meantime. Thanks, -- -David david@pgmasters.net
On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote: > Hi Amit, > > On 3/22/16 3:37 AM, Michael Paquier wrote: > >> Hope this brings some light in. > > > Do you know when you'll have time to respond to Michael's last email? I've > marked this patch "waiting on author" in the meantime. Shouldn't it be "needs review" instead? I am marked as the author of this patch. -- Michael
On 3/29/16 8:21 PM, Michael Paquier wrote: > On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote: >> Hi Amit, >> >> On 3/22/16 3:37 AM, Michael Paquier wrote: >> >>> Hope this brings some light in. >> >> >> Do you know when you'll have time to respond to Michael's last email? I've >> marked this patch "waiting on author" in the meantime. > > Shouldn't it be "needs review" instead? I am marked as the author of this patch. Whoops! I got the author and reviewer backwards. Set back to "needs review". -- -David david@pgmasters.net
On Tue, Mar 29, 2016 at 8:31 PM, David Steele <david@pgmasters.net> wrote:
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>> Hope this brings some light in.
>
> Do you know when you'll have time to respond to Michael's last email? I've marked this patch "waiting on author" in the meantime.
I think this patch needs to be looked at by a committer now. I did a review and added some code to this patch quite a while back (see the e-mail [1]); the patch is just the same as it was in October 2015. I think Michael and I are in agreement that this patch solves the reported problem. There is one similar problem [2] reported by Heikki which I think can be fixed separately.
I think the main reason for moving this thread from bugs to hackers was to gain some more traction, which I see it has achieved to some extent, but more or less we are in the same situation as we were back in October.
Let me know if you think anything more from my side can help in moving the patch forward.

[2] - http://www.postgresql.org/message-id/566EF84F.1030206@iki.fi
> I think this patch needs to be looked upon by committer now. I have
> done review and added some code in this patch as well long back, just
> see the e-mail [1], patch is just same as it was in October 2015. I
> think myself and Michael are in agreement that this patch solves the
> reported problem. There is one similar problem [2] reported by Heikki
> which I think can be fixed separately.
> [...]
> Let me know if you think anything more from myside can help in moving
> patch.

+1, same here. Let's see what the committer's opinion is.

-- Best regards, Aleksander Alekseev http://eax.me/
Aleksander Alekseev <a.alekseev@postgrespro.ru> writes: > +1, same here. Lets see whats committer's opinion. I fooled around with this patch quite a bit but could not bring myself to pull the trigger, because I think it fundamentally breaks applications that follow the "repeat PQgetResult until NULL" rule. The only reason that psql manages not to fail is that it doesn't do that, it just calls PQexec; and you've hacked PQexecFinish so that it falls out without pumping PQgetResult till dry. But that's not a good solution, it's just a hack that makes the behavior unprincipled and unlike direct use of PQgetResult. The key problem is that, assuming that the changes in getCopyStart() succeed in returning a PGRES_FATAL_ERROR PGresult, the application may just follow the rule of doing nothing with it unless it's the last non-null result from PQgetResult. And it won't be, because you've switched libpq's asyncStatus into one or another COPY status, which will cause PQgetResult to continually create and return PGRES_COPY_XXX results, which is what it's supposed to do in that situation (cf the last step in getCopyResult). Now, this will accidentally fail to fail if PQgetResult's attempt to manufacture a PGRES_COPY_XXX result fails and returns null, which is certainly possible if we're up against an OOM situation. But what if it doesn't fail --- which is also possible, because a PGRES_COPY_XXX with no attached data will not need as much space as a PGRES_FATAL_ERROR with attached error message. The app probably throws away the PGRES_FATAL_ERROR and tries to use the PGRES_COPY_XXX result, which is almost okay, except that it will lack the copy format information which will be fatal if the application is relying on that. So AFAICT, this patch kinda sorta works for psql but it is not going to make things better for other apps. The other problem is that I don't have a lot of faith in the theory that getCopyStart is going to be able to make an error PGresult when it couldn't make a COPY PGresult. The existing message-receipt routines that this is modeled on are dealing with PGresults that are expected to grow large, so throwing away the accumulated PGresult is highly likely to free enough memory to let us allocate an error PGresult. Not so here. I have to run, but the bottom line is I don't feel comfortable with this. It's trying to fix what's a very corner-case problem (since COPY PGresults aren't large) but it's introducing new corner case behaviors of its own. regards, tom lane
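For readers not steeped in libpq, the "repeat PQgetResult until NULL" rule Tom refers to looks roughly like the hypothetical caller below; the helper name and error handling are illustrative, not from the patch. The trouble described above is that, with the patch as posted, the PGRES_FATAL_ERROR result is followed not by NULL but by freshly manufactured PGRES_COPY_XXX results (or further OOM failures), so a loop written this way never terminates cleanly, or the application ends up acting on a COPY result that lacks the format information.

```
#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical helper following the documented PQgetResult() pattern. */
static void
run_command(PGconn *conn, const char *command)
{
	PGresult   *res;

	if (!PQsendQuery(conn, command))
	{
		fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
		return;
	}

	/* Documented usage: keep calling PQgetResult() until it returns NULL. */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_FATAL_ERROR)
			fprintf(stderr, "command failed: %s", PQresultErrorMessage(res));

		/*
		 * A PGRES_COPY_IN/OUT/BOTH result means the application must switch
		 * to the COPY functions here; if it does not, PQgetResult() keeps
		 * handing back COPY results, which is the documented behavior.
		 */
		PQclear(res);
	}
}
```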
I thought about this patch a bit more... When I first looked at the patch, I didn't believe that it worked at all: it failed to return a PGRES_COPY_XXX result to the application, and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX? Wouldn't things be hopelessly confused? I tried it out and saw that indeed it seemed to work in psql, and after tracing through that found that psql has no idea what's going on, but when psql issues its next command PQexecStart manages to get us out of the copy state (barring more OOM failures, anyway). That seems a bit accidental, though, and for sure it wasn't documented in the patch. In any case, having libpq in that state would have side-effects on how it responds to application actions, which is the core of my complaint about PQgetResult behavior. Those side effects only make sense if the app knows (or should have known) that the connection is in COPY state. I think it might be possible to get saner behavior if we invent a set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the semantics that we got the relevant CopyStart message from the backend but were unable to report it to the application. PQexecStart would treat these the same as the corresponding normal PGASYNC_COPY_XXX states and do what's needful to get out of the copy mode. But in PQgetResult, we would *not* treat those states as a reason to return a PGRES_COPY_XXX result to the application. Probably we'd just return NULL, with the implication being "we're done so far as you're concerned, please give a new command". I might be underthinking it though. Anyway, the short of my review is that we need more clarity of thought about what state libpq is in after a failure like this, and what that state looks like to the application, and how that should affect how libpq reacts to application calls. I'll set this back to Waiting on Author ... regards, tom lane
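To visualize the suggestion above, its shape might be something like the following; this is a sketch only, and the member list of the real PGAsyncStatusType enum in libpq-int.h as well as the new names are assumptions for illustration, not actual PostgreSQL code.

```
/* Sketch of libpq-int.h: "copy started but never reported" states. */
typedef enum
{
	PGASYNC_IDLE,				/* nothing's happening, ready for a query */
	PGASYNC_BUSY,				/* query in progress */
	PGASYNC_READY,				/* result ready for PQgetResult */
	PGASYNC_COPY_IN,			/* Copy In data transfer in progress */
	PGASYNC_COPY_OUT,			/* Copy Out data transfer in progress */
	PGASYNC_COPY_BOTH,			/* Copy In/Out data transfer in progress */
	PGASYNC_COPY_IN_FAILED,		/* CopyInResponse consumed, never reported */
	PGASYNC_COPY_OUT_FAILED,	/* CopyOutResponse consumed, never reported */
	PGASYNC_COPY_BOTH_FAILED	/* CopyBothResponse consumed, never reported */
} PGAsyncStatusType;
```

Under that scheme, PQexecStart would treat the *_FAILED values exactly like their normal counterparts when getting out of copy mode, while PQgetResult would return NULL for them instead of manufacturing a PGRES_COPY_XXX result.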
On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I thought about this patch a bit more... > > When I first looked at the patch, I didn't believe that it worked at > all: it failed to return a PGRES_COPY_XXX result to the application, > and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX? > Wouldn't things be hopelessly confused? I tried it out and saw that > indeed it seemed to work in psql, and after tracing through that found > that psql has no idea what's going on, but when psql issues its next > command PQexecStart manages to get us out of the copy state (barring > more OOM failures, anyway). That seems a bit accidental, though, > and for sure it wasn't documented in the patch. From the patch, that's mentioned: + * Stop if we are in copy mode and error has occurred, the pending results + * will be discarded during next execution in PQexecStart. > In any case, having > libpq in that state would have side-effects on how it responds to > application actions, which is the core of my complaint about > PQgetResult behavior. Those side effects only make sense if the app > knows (or should have known) that the connection is in COPY state. OK. So we'd want to avoid relying on subsequent PQ* calls and return into a clean state once the OOM shows up. > I think it might be possible to get saner behavior if we invent a > set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the > semantics that we got the relevant CopyStart message from the backend > but were unable to report it to the application. PQexecStart would > treat these the same as the corresponding normal PGASYNC_COPY_XXX > states and do what's needful to get out of the copy mode. But in > PQgetResult, we would *not* treat those states as a reason to return a > PGRES_COPY_XXX result to the application. Probably we'd just return > NULL, with the implication being "we're done so far as you're concerned, > please give a new command". I might be underthinking it though. I raised something like that a couple of months back, based on the fact that PGASYNC_* are flags internal to libpq on frontend: http://www.postgresql.org/message-id/CAB7nPqTaeDwpRCqAk-cZZxwWaPdEYr1gUg0vwvG7RhzhmAJy3Q@mail.gmail.com In this case what was debated is PGASYNC_FATAL... Something that is quite different with your proposal is that we return NULL or something else, and it did not discard any pending data from the client. For applications running PQgetResult in a loop that would work. > Anyway, the short of my review is that we need more clarity of thought > about what state libpq is in after a failure like this, and what that > state looks like to the application, and how that should affect how > libpq reacts to application calls. Hm. I would have thought that returning to the caller a result generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error string marked "out of memory" would be the best thing to do instead of NULL... If getCopyStart() complains because of a lack of data, we loop back into pqParseInput3 to try again. If an OOM happens, we still loop into pqParseInput3 but all the pending data received from client needs to be discarded so as we don't rely on the next calls of PQexecStart to do the cleanup, once all the data is received we get out of the loop and generate the result with PGRES_FATAL_ERROR. Does that match what you have in mind? -- Michael
On Fri, Apr 1, 2016 at 10:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I thought about this patch a bit more...
> >
> > When I first looked at the patch, I didn't believe that it worked at
> > all: it failed to return a PGRES_COPY_XXX result to the application,
> > and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> > Wouldn't things be hopelessly confused? I tried it out and saw that
> > indeed it seemed to work in psql, and after tracing through that found
> > that psql has no idea what's going on, but when psql issues its next
> > command PQexecStart manages to get us out of the copy state (barring
> > more OOM failures, anyway). That seems a bit accidental, though,
> > and for sure it wasn't documented in the patch.
>
> From the patch, that's mentioned:
> + * Stop if we are in copy mode and error has occurred, the pending results
> + * will be discarded during next execution in PQexecStart.
>
Yeah, and the same is mentioned in the PQexecStart function as well, which indicates that the above assumption is right.
>
> > Anyway, the short of my review is that we need more clarity of thought
> > about what state libpq is in after a failure like this, and what that
> > state looks like to the application, and how that should affect how
> > libpq reacts to application calls.
>
Very valid point. So, if we look at the patch, libpq will be in a PGASYNC_COPY_XXX state after such a failure, and the next time it tries to execute a statement it will end copy mode and then allow things to proceed from there. This design is required for other purposes, like silently discarding leftover results. Now, I think the other option (if possible) would be to put libpq in PGASYNC_IDLE when we get such an error, as we already do on error at some other places (as below), and return the OOM failure as we do in the patch.
PQgetResult()
{
..
	if (flushResult ||
		pqWait(TRUE, FALSE, conn) ||
		pqReadData(conn) < 0)
	{
		/*
		 * conn->errorMessage has been set by pqWait or pqReadData. We
		 * want to append it to any already-received error message.
		 */
		pqSaveErrorResult(conn);
		conn->asyncStatus = PGASYNC_IDLE;
		return pqPrepareAsyncResult(conn);
	}
..
}
Michael Paquier <michael.paquier@gmail.com> writes: > On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Anyway, the short of my review is that we need more clarity of thought >> about what state libpq is in after a failure like this, and what that >> state looks like to the application, and how that should affect how >> libpq reacts to application calls. > Hm. I would have thought that returning to the caller a result > generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error > string marked "out of memory" would be the best thing to do instead of > NULL... Sure, we're trying to do that, but the question is what happens afterward. An app that is following the documented usage pattern for PQgetResult will call it again, expecting to get a NULL, but as the patch stands it might get a PGRES_COPY_XXX instead. What drove me to decide the patch wasn't committable was realizing that Robert was right upthread: the change you propose in libpqwalreceiver.c is not a bug fix. That code is doing exactly what it's supposed to do per the libpq.sgml docs, namely iterate till it gets a NULL.[1] If we have to change it, that means (a) we've changed the API spec for PQgetResult and (b) every other existing caller of PQgetResult would need to change similarly, or else risk corner-case bugs about as bad as the one we're trying to fix. So the core of my complaint is that we need to fix things so that, whether or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd better consider the behavior when we cannot), we need to leave libpq in a state where subsequent calls to PQgetResult will return NULL. > If getCopyStart() complains because of a lack of data, we loop > back into pqParseInput3 to try again. If an OOM happens, we still loop > into pqParseInput3 but all the pending data received from client needs > to be discarded so as we don't rely on the next calls of PQexecStart > to do the cleanup, once all the data is received we get out of the > loop and generate the result with PGRES_FATAL_ERROR. Does that match > what you have in mind? If we could, that would be OK, but (a) you're only considering the COPY OUT case not the COPY IN case, and (b) a persistent OOM condition could very well prevent us from ever completing the copy. For example, some COPY DATA messages might be too big for our current buffer, but we won't be able to enlarge the buffer. I'm inclined to think that reporting the OOM to the client is the highest-priority goal, and that attempting to clean up during the next PQexec/PQsendQuery is a reasonable design. If we fail to do that, the client won't really understand that it's a cleanup failure and not a failure to issue the new query, but it's not clear why it needs to know the difference. regards, tom lane [1] Well, almost. Really, after dealing with a PGRES_COPY_XXX result it ought to go back to pumping PQgetResult. It doesn't, but that's an expected client behavior, and cleaning up after that is exactly what PQexecStart is intended for.
Amit Kapila <amit.kapila16@gmail.com> writes: > Very valid point. So, if we see with patch, I think libpq will be > in PGASYNC_COPY_XXX state after such a failure and next time when it tries > to again execute statement, it will end copy mode and then allow to proceed > from there. This design is required for other purposes like silently > discarding left over result. Now, I think the other option(if possible) > seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error > as we do at some other places in case of error as below and return OOM > failure as we do in patch. If we transition to PGASYNC_IDLE then the next PQexecStart will not realize that it needs to do something to get out of the COPY state; but it does, since the backend thinks that we're doing COPY. That was the reasoning behind my proposal to invent new PGASYNC states. Obviously there's more than one way to represent this new state, eg you could have a separate error flag instead --- but it is a new state. regards, tom lane
I wrote: > So the core of my complaint is that we need to fix things so that, whether > or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd > better consider the behavior when we cannot), ... BTW, the real Achilles' heel of any attempt to ensure sane behavior at the OOM limit is this possibility of being unable to create a PGresult with which to inform the client that we failed. I wonder if we could make things better by keeping around an emergency backup PGresult struct. Something along these lines: 1. Add a field "PGresult *emergency_result" to PGconn. 2. At the very beginning of any PGresult-returning libpq function, check to see if we have an emergency_result, and if not make one, ensuring there's room in it for a reasonable-size error message; or maybe even preload it with "out of memory" if we assume that's the only condition it'll ever be used for. If malloc fails at this point, just return NULL without doing anything or changing any libpq state. (Since a NULL result is documented as possibly caused by OOM, this isn't violating any API.) 3. Subsequent operations never touch the emergency_result unless we're up against an OOM, but it can be used to return a failure indication to the client so long as we leave libpq in a state where additional calls to PQgetResult would return NULL. Basically this shifts the point where an unreportable OOM could happen from somewhere in the depths of libpq to the very start of an operation, where we're presumably in a clean state and OOM failure doesn't leave us with a mess we can't clean up. regards, tom lane
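Concretely, the proposal might look something like the sketch below. All names other than PQmakeEmptyPGresult() are invented for illustration, the real struct pg_conn has no emergency_result field today, and a real version would also have to reserve (or preload) the error text, which PQmakeEmptyPGresult() alone does not do; treat this as a shape, not an implementation.

```
/* Sketch: a pre-reserved PGresult kept inside the connection object. */

/* 1. libpq-int.h, in struct pg_conn (hypothetical new field):
 *        PGresult   *emergency_result;
 */

/* 2. Called first thing in every PGresult-returning entry point. */
static int
pqEnsureEmergencyResult(PGconn *conn)
{
	if (conn->emergency_result == NULL)
		conn->emergency_result = PQmakeEmptyPGresult(conn, PGRES_FATAL_ERROR);

	/* If even this fails, the entry point just returns NULL right away. */
	return conn->emergency_result != NULL;
}

/* 3. Deep inside message parsing, on OOM: hand back the reserved result. */
static PGresult *
pqUseEmergencyResult(PGconn *conn)
{
	PGresult   *res = conn->emergency_result;

	conn->emergency_result = NULL;	/* consumed; rebuilt on the next call */
	/* ...and put asyncStatus somewhere that makes PQgetResult return NULL */
	return res;
}
```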
On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> So the core of my complaint is that we need to fix things so that, whether >> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd >> better consider the behavior when we cannot), ... > > BTW, the real Achilles' heel of any attempt to ensure sane behavior at > the OOM limit is this possibility of being unable to create a PGresult > with which to inform the client that we failed. > > I wonder if we could make things better by keeping around an emergency > backup PGresult struct. Something along these lines: > > 1. Add a field "PGresult *emergency_result" to PGconn. > > 2. At the very beginning of any PGresult-returning libpq function, check > to see if we have an emergency_result, and if not make one, ensuring > there's room in it for a reasonable-size error message; or maybe even > preload it with "out of memory" if we assume that's the only condition > it'll ever be used for. If malloc fails at this point, just return NULL > without doing anything or changing any libpq state. (Since a NULL result > is documented as possibly caused by OOM, this isn't violating any API.) > > 3. Subsequent operations never touch the emergency_result unless we're > up against an OOM, but it can be used to return a failure indication > to the client so long as we leave libpq in a state where additional > calls to PQgetResult would return NULL. > > Basically this shifts the point where an unreportable OOM could happen > from somewhere in the depths of libpq to the very start of an operation, > where we're presumably in a clean state and OOM failure doesn't leave > us with a mess we can't clean up. I have moved this patch to next CF for the time being. As that's a legit bug and not a feature, that should be fine to pursue work on this item even if this CF ends. -- Michael
On Wed, Apr 6, 2016 at 3:09 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> So the core of my complaint is that we need to fix things so that, whether
>>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>>> better consider the behavior when we cannot), ...
>>
>> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
>> the OOM limit is this possibility of being unable to create a PGresult
>> with which to inform the client that we failed.
>>
>> I wonder if we could make things better by keeping around an emergency
>> backup PGresult struct. Something along these lines:
>>
>> 1. Add a field "PGresult *emergency_result" to PGconn.
>>
>> 2. At the very beginning of any PGresult-returning libpq function, check
>> to see if we have an emergency_result, and if not make one, ensuring
>> there's room in it for a reasonable-size error message; or maybe even
>> preload it with "out of memory" if we assume that's the only condition
>> it'll ever be used for. If malloc fails at this point, just return NULL
>> without doing anything or changing any libpq state. (Since a NULL result
>> is documented as possibly caused by OOM, this isn't violating any API.)
>>
>> 3. Subsequent operations never touch the emergency_result unless we're
>> up against an OOM, but it can be used to return a failure indication
>> to the client so long as we leave libpq in a state where additional
>> calls to PQgetResult would return NULL.
>>
>> Basically this shifts the point where an unreportable OOM could happen
>> from somewhere in the depths of libpq to the very start of an operation,
>> where we're presumably in a clean state and OOM failure doesn't leave
>> us with a mess we can't clean up.

Sorry for the late reply here. I am looking back at this thing more seriously. So, if I understand correctly: as the emergency result is pre-allocated, it covers any future OOM errors because we can always fall back to this error in case of problems, so this would indeed help. And as this is independent of the rest of the status of pg_conn, any subsequent calls of any PQ* routines returning a PGresult would remain in the same state.

On top of this emergency code path, don't you think that we also need a flag that is switched on once an OOM error is faced? In consequence, on a code path where an OOM happens, this flag is activated. Then any subsequent call of a routine returning a PGresult checks for this flag, resets it, and returns the emergency result. At the end, it seems to me that the code paths where we need to check whether the emergency flag is activated are the beginning of PQgetResult and PQexecFinish. In the case of PQexecFinish(), pending results would be cleared the next time PQexecStart is called. For PQgetResult, this gives the application the possibility to report the OOM properly. However, successive calls to PQgetResult() would still make the application wait indefinitely for more data if it doesn't catch the error. Another thing that I think needs to be patched, by the way, is libpqrcv_PQexec, so that we check whether any errors are caught on the way when calling PQgetResult multiple times, or this code path could remain stuck in an infinite wait.
As a result, it seems to me that this offers no real way to completely fix applications doing something like this:

PQsendQuery(conn);
for (;;)
{
	while (PQisBusy(conn))
	{
		//wait here for some event
	}
	result = PQgetResult(conn);
	if (!result)
		break;
}

The issue is that we'd wait for a NULL result to be received from PQgetResult; however, even with this patch, asyncStatus remaining set to PGASYNC_BUSY would make libpq wait forever in pqWait for data that will never show up. We could have a new status for asyncStatus to be able to properly skip the loops where asyncStatus == PGASYNC_BUSY and make PQisBusy smarter, but this would actually break the semantics of calling PQgetResult() successively, if I am following the last posts of this thread correctly.

Another, perhaps better, idea is to add some extra logic to pg_conn as follows:

bool is_emergency;
PGresult *emergency_result;
bool is_emergency_consumed;

So when an OOM shows up, is_emergency is set to true. Then a first call of PQgetResult returns the PGresult with the OOM in emergency_result, setting is_emergency_consumed to true and switching is_emergency back to false. Then a second call to PQgetResult enforces the consumption of any remaining results without waiting for them, at the end returning NULL to the caller and resetting is_emergency_consumed to false. That's different from the extra failure state PGASYNC_COPY_FAILED in that it does not break PQisBusy().

Thoughts?
-- Michael
On Mon, Apr 18, 2016 at 4:48 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Another, perhaps, better idea is to add some more extra logic to
> pg_conn as follows:
> bool is_emergency;
> PGresult *emergency_result;
> bool is_emergency_consumed;
> So as when an OOM shows up, is_emergency is set to true. Then a first
> call of PQgetResult returns the PGresult with the OOM in
> emergency_result, setting is_emergency_consumed to true and switching
> is_emergency back to false. Then a second call to PQgetResult enforces
> the consumption of any results remaining without waiting for them, at
> the end returning NULL to the caller, resetting is_emergency_consumed
> to false. That's different compared to the extra failure
> PGASYNC_COPY_FAILED in the fact that it does not break PQisBusy().

So, in terms of code, it gives more or less the attached. I have coupled the emergency_result with an enum flag emergency_status that has 3 states:
- NONE, which is the default.
- ENABLE, set when an OOM or other error has been found in libpq, making the next call of PQgetResult return the emergency_result.
- CONSUMED, set after PQgetResult has consumed emergency_result, so that the next call returns NULL to the caller and it can get out of any PQgetResult loop expecting a result at the end. Once NULL is returned, the flag is switched back to NONE.

This works in the case of getCopyStart because the OOM failures happen before any messages are sent to the server. The checks for the emergency result are done in PQgetResult, so that any upper-level routine gets the message. Tom, others, what do you think?
-- Michael
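Roughly, and without looking at the attachment itself, that state machine could be sketched like this; the names and the exact placement inside PQgetResult() are assumptions for illustration, not necessarily what the attached patch does.

```
/* Sketch of the proposed emergency-result state machine. */
typedef enum
{
	PGEMERGENCY_NONE,			/* normal operation */
	PGEMERGENCY_ENABLE,			/* OOM recorded, not reported yet */
	PGEMERGENCY_CONSUMED		/* OOM reported, next call returns NULL */
} PGEmergencyStatus;

/*
 * What the check at the top of PQgetResult() might look like.  Returns 1
 * when it has fully handled the call and *result holds what the caller
 * should get back (possibly NULL), 0 to continue with normal processing.
 */
static int
pqHandleEmergency(PGconn *conn, PGresult **result)
{
	switch (conn->emergency_status)
	{
		case PGEMERGENCY_ENABLE:
			conn->emergency_status = PGEMERGENCY_CONSUMED;
			*result = conn->emergency_result;	/* the stored OOM error */
			return 1;

		case PGEMERGENCY_CONSUMED:
			/*
			 * Here the real code would discard any pending protocol data
			 * without waiting for more, then forget the emergency.
			 */
			conn->emergency_status = PGEMERGENCY_NONE;
			*result = NULL;		/* lets the caller's loop terminate */
			return 1;

		default:
			return 0;			/* PGEMERGENCY_NONE: proceed as usual */
	}
}
```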