libpq bug - Mailing list pgsql-bugs
From | Kirby Bohling (TRSi) |
---|---|
Subject | libpq bug |
Date | |
Msg-id | Pine.GSO.4.21.0009151001590.14400-100000@oasis.novia.net |
Responses | Re: libpq bug (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
Your name               : Kirby C. Bohling
Your email address      : kbohling@oasis.novia.net

System Configuration
---------------------
Architecture (example: Intel Pentium)         : Intel PII 550
Operating System (example: Linux 2.0.26 ELF)  : FreeBSD 4.0 Release
PostgreSQL version (example: PostgreSQL-7.0)  : PostgreSQL-7.0.2
Compiler used (example: gcc 2.8.0)            : gcc 2.95.2 19991024 (release)

Please enter a FULL description of your problem:
------------------------------------------------

I have a C++ application that runs for extended periods of time and keeps the same postgres connection open the whole time. After running for a while, the code hangs. When I attach with gdb, it is always hung in the same spot: fe-misc.c:739, which is a call to select. I haven't compiled with debugging information, so I can't tell what it is waiting on. Reviewing the logs, I see a SIGPIPE and "PQsendQuery -- There is no connection to the back end". I believe that the backend has died, and this is the symptom of that.

The one thing I noticed is that the code only hangs when I try to start a transaction. On close examination, I realized that the only difference is that I didn't call PQstatus() before calling PQexec() there.

I have investigated the code in libpq. This is my guess at the stack trace; I don't have the code compiled with debugging, and I haven't got the time to do that and wait around for the bug to happen again.

#0 0xXXXXX in pqWait at fe-misc.c:739
#1 0xXXXXX in PQgetResult at fe-exec.c:1126
#2 0xXXXXX in PQexec at fe-exec.c:1204
#3 0xXXXXX in myFuncThatCallsPQexec() myFuncs.c:1234

If you follow the code from the entry into PQexec all the way into pqWait, and then down into the select call, nowhere in that path of execution does it check conn->status to see whether the status is CONNECTION_OK; it only checks that the socket is non-negative. This was by visual inspection, not with a debugger, so double-check that.
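The hang described above is a select() call with no bound on how long it may wait. As an illustration only, here is a minimal sketch of a bounded wait, assuming a POSIX system. The helper name `wait_readable` and the millisecond timeout are hypothetical and are not part of libpq. The descriptor is also placed in the exception set, but what that set reports varies by platform, so the timeout is what actually bounds the wait. A pipe stands in for the backend socket in the usage function.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper, not part of libpq.
 * Returns 1 if fd is readable or has an exceptional condition,
 * 0 if the timeout expired, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    /* FD_SET is undefined for descriptors outside [0, FD_SETSIZE). */
    assert(fd >= 0 && fd < FD_SETSIZE);
    assert(timeout_ms >= 0);

    for (;;) {
        fd_set readfds, exceptfds;
        struct timeval tv;
        int rc;

        /* select() may modify the sets and the timeout, so rebuild them
         * on every attempt. */
        FD_ZERO(&readfds);
        FD_ZERO(&exceptfds);
        FD_SET(fd, &readfds);
        FD_SET(fd, &exceptfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        rc = select(fd + 1, &readfds, NULL, &exceptfds, &tv);
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry with a fresh timeout */
        if (rc < 0)
            return -1;
        return rc > 0 ? 1 : 0;
    }
}

/* Usage sketch: returns 0 if the wait times out on an empty pipe and
 * reports readiness once a byte has been written. */
static int demo_timeout_then_ready(void)
{
    int p[2];
    int before, after;
    ssize_t n;

    if (pipe(p) != 0)
        return -1;
    before = wait_readable(p[0], 50);            /* nothing written yet */
    n = write(p[1], "x", 1);
    after = (n == 1) ? wait_readable(p[0], 50) : -1;
    close(p[0]);
    close(p[1]);
    return (before == 0 && after == 1) ? 0 : 1;
}
```

With a bound like this, a caller gets control back when nothing arrives and can decide whether to check the connection status or give up, rather than blocking forever.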
If my guess is correct, the backend has gone away, and select can't tell that you are never going to be able to read or write on that socket. It might break out of the hang if the select call also passed the file descriptor in the exception fd list. (NOTE: Not all select()s are the same. I ran into serious problems with code that depended on the way AIX handled exception fds versus the way Solaris 2.6 did; that discussion is way, way beyond the scope of this email.)

My guess is that the connection status is CONNECTION_BAD. I can't tell, and the debugger won't help me out, because libpq-fe.h:86 has

typedef struct pg_conn PGConn;

Nice opaque typedef, but no way for me to print the structure in a debugger short of printing the raw memory.

I believe that I have written a work-around for my project: a wrapper around PQexec that always calls PQstatus first and fakes the error codes if the status is bad. My problems seem to magically disappear. The program resets its connection if the connection goes south, and life is great.

My guess is that somewhere along the way PQstatus() should be called, or conn->status should be checked. I am not sure of the most appropriate place to put the fix. There might also be some very good reason that it isn't there.

All that being said, I believe that the bug is my fault, for failing to check the connection status before calling PQexec(). But removing an infinite wait seems pretty valuable to me, even if the infinite wait is due to a lazy programmer. Hence I took the time to figure this much out.

Please describe a way to repeat the problem.
Please try to provide a concise reproducible example, if at all possible:
----------------------------------------------------------------------

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

My best guess is the following. Add these lines to pqWait():

/* I am not sure if any other cases should be or'ed with this, but I know
 * that looking for != CONNECTION_OK is a bad idea: while initiating a
 * connection, the state is not CONNECTION_OK, but pqWait is called. */
if (conn->status == CONNECTION_BAD)
{
    printfPQExpBuffer(&conn->errorMessage,
                      "pqWait() -- bailing, connection is bad\n");
    return EOF;
}

Thanks if you read this far, I would have given up by now...

Kirby
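The guard proposed above can be modeled outside libpq to show why it tests for `== CONNECTION_BAD` rather than `!= CONNECTION_OK`. The sketch below is self-contained and uses stand-in types: `demo_conn`, `demo_status`, and `demo_wait` are hypothetical names and do not reflect libpq's real internal structures. It only illustrates the control flow, assuming a connection can be in an in-progress state while waits are still issued.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in types for illustration only; these are NOT libpq's definitions. */
typedef enum {
    DEMO_CONNECTION_OK,
    DEMO_CONNECTION_BAD,
    DEMO_CONNECTION_STARTED     /* connection setup still in progress */
} demo_status;

typedef struct {
    int         sock;           /* socket descriptor, -1 if none */
    demo_status status;
    char        error[128];     /* last error message, empty if none */
} demo_conn;

/* Hypothetical model of a wait routine with the proposed guard.
 * Returns 0 if the caller may go on to block in select(), EOF otherwise. */
static int demo_wait(demo_conn *conn)
{
    assert(conn != NULL);

    /* The check the report says is already made: a non-negative socket. */
    if (conn->sock < 0) {
        snprintf(conn->error, sizeof conn->error,
                 "demo_wait() -- connection not open\n");
        return EOF;
    }

    /* Proposed guard: test for == BAD rather than != OK, so that waits
     * issued while a connection is being established still go through. */
    if (conn->status == DEMO_CONNECTION_BAD) {
        snprintf(conn->error, sizeof conn->error,
                 "demo_wait() -- bailing, connection is bad\n");
        return EOF;
    }

    /* A real implementation would call select() at this point. */
    return 0;
}
```

In this model a connection in the in-progress state passes the guard, while one marked bad fails immediately with an error message instead of reaching the blocking call.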