Re: Cryptic error message in low-memory conditions - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Cryptic error message in low-memory conditions |
Date | |
Msg-id | 8040.1314403175@sss.pgh.pa.us Whole thread Raw |
In response to | Cryptic error message in low-memory conditions (Daniel Farina <daniel@heroku.com>) |
Responses |
Re: Cryptic error message in low-memory conditions
Re: Cryptic error message in low-memory conditions |
List | pgsql-hackers |
Daniel Farina <daniel@heroku.com> writes: > Some Postgres servers will error out for a while with the following > error message: > "expected authentication request from server, but received c" > [ and this seems to be triggered by fork failures in the server ] <spock>Fascinating.</spock> I poked at this for awhile by the expedient of hot-wiring postmaster.c to always fail the fork request: *** src/backend/postmaster/postmaster.c.orig Tue Aug 23 17:13:09 2011 --- src/backend/postmaster/postmaster.c Fri Aug 26 19:08:25 2011 *************** *** 3182,3188 **** #ifdef EXEC_BACKEND pid = backend_forkexec(port); #else /* !EXEC_BACKEND*/ ! pid = fork_process(); if (pid == 0) /* child */ { free(bn); --- 3182,3189 ---- #ifdef EXEC_BACKEND pid = backend_forkexec(port); #else /* !EXEC_BACKEND*/ ! pid = -1; // fork_process(); ! errno = ENOMEM; if (pid == 0) /* child */ { free(bn); The basic case did what I was expecting: $ psql "dbname=regression" psql: could not fork new process for connection: Cannot allocate memory which was unsurprising since I'm quite sure that code path got tested in basically this fashion, back in the dark ages. But sure enough, over a TCP connection with SSL support turned on, it fails as described. strace shows the postmaster is sending what it's supposed to. I eventually figured out the problem: libpq sees the "E" response, thinks it must be talking to a postmaster too old to understand NEGOTIATE_SSL_CODE, and closes the socket and tries again. But *it forgets to clear out its input buffer*, so the body of the error message is still there waiting to be read, and the next connection sees the "c" as the first byte of the first postmaster response. The fix for this is *** src/interfaces/libpq/fe-connect.c.orig Thu Jul 28 10:39:57 2011 --- src/interfaces/libpq/fe-connect.c Fri Aug 26 18:52:18 2011 *************** *** 2064,2069 **** --- 2064,2072 ---- conn->allow_ssl_try = false; /* Assume it ain't gonnahandle protocol 3, either */ conn->pversion = PG_PROTOCOL(2, 0); + /* Discard any unread/unsent data */ + conn->inStart = conn->inCursor = conn->inEnd = 0; + conn->outCount = 0; /* Must drop the old connection */ closesocket(conn->sock); conn->sock = -1; and similarly in the other places in fe-connect.c where we're abandoning an open connection (not sure every one of those has to have it, but it seems a good safety practice). That gets us to the right place in the default SSL mode: $ psql "host=localhost sslmode=prefer dbname=regression" psql: could not fork new process for connection: Cannot allocate memory but we're still not there if you're trying to force SSL: $ psql "host=localhost sslmode=require dbname=regression" psql: server does not support SSL, but SSL was required The reason for this is that that same bit of code supposes that any "E" response must mean that the postmaster didn't recognize NEGOTIATE_SSL_CODE. It doesn't (and of course shouldn't) pay any attention to the actual textual error message. Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build since PG 7.0, I believe that this is dead code and we could remove it; it seems exceedingly unlikely that any modern build of libpq will ever be used to talk to a server that responds to that with "E". In fact it's worse than just delivering a misleading error message in the "require" case, because if you're not doing "require" then what happens next is that the code forces the protocol level down to 2 and tries again, supposing that it must be dealing with an ancient server. In the normal case where you are talking to a server under load rather than a deliberately-broken one, it's entirely possible that the second connection attempt succeeds. And what you've got then is that the connection is operating in protocol 2 rather than what the user probably expected, disabling assorted functionality that he may well be depending on. I don't recall having seen reports that could match that syndrome, but then again Daniel's complaint is a new one on me too. So I'm thinking we'd be well advised to eliminate the assumption that an "E" response could be fixed by downgrading to protocol 2, and instead just make this code report the error message it got from the postmaster. That's more than a one-liner so I don't have a patch for it yet. Lastly, I noticed that if I tried this repeatedly on a Unix socket, I sometimes got psql: server closed the connection unexpectedly This probably means the server terminated abnormally before orwhile processing the request. could not send startup packet: Broken pipe rather than the expected results. I think what is happening here is a race condition, such that if the postmaster closes the socket without having read the startup packet, the client might not have actually gotten to its send() yet, and then it will get EPIPE from send() before it gets to the point of reading the error response. I tried to fix this by having report_fork_failure_to_client eat any pending data before responding: *** src/backend/postmaster/postmaster.c.orig Tue Aug 23 17:13:09 2011 --- src/backend/postmaster/postmaster.c Fri Aug 26 19:08:25 2011 *************** *** 3262,3276 **** char buffer[1000]; int rc; /* Format the error message packet (always V2protocol) */ snprintf(buffer, sizeof(buffer), "E%s%s\n", _("could not fork new process for connection:"), strerror(errnum)); - /* Set port to non-blocking. Don't do send() if this fails */ - if (!pg_set_noblock(port->sock)) - return; - /* We'll retry after EINTR, but ignore all other failures */ do { --- 3262,3291 ---- char buffer[1000]; int rc; + /* Set port to non-blocking. Don't do anything if this fails */ + if (!pg_set_noblock(port->sock)) + return; + + /* + * Try to read the startup packet, if it's available which it likely is. + * The reason for doing this is that when the client is connected via Unix + * socket, if we close the socket without having read data it already + * sent, many kernels will report EPIPE instead of letting the client + * read whatever we sent. That will likely result in an unhelpful + * "server closed the connection unexpectedly" report instead of what + * we are trying to tell them. So read and discard any already-available + * data. We already set nonblock mode, so we won't block here. + */ + do + { + rc = recv(port->sock, buffer, sizeof(buffer), 0); + } while (rc > 0 || (rc < 0 && errno == EINTR)); + /* Format the error message packet (always V2 protocol) */ snprintf(buffer, sizeof(buffer), "E%s%s\n", _("could not fork new process for connection: "), strerror(errnum)); /* We'll retry after EINTR,but ignore all other failures */ do { This seems to get rid of the problem on my Fedora box, but my old HPUX box still does it some of the time. I'm inclined to recommend applying this patch anyway --- the only alternative I can see is to try to rejigger the libpq code to temporarily ignore EPIPE during startup, which seems both messy and fragile. Comments? regards, tom lane
pgsql-hackers by date: