Thread: Strange error (Socket command option unknown)
Hello, I hope this is the right list to post this question. I'm at a loss trying to figure out a strange error message I'm getting (at an excessively high rate -- around 200 times per day, out of a total of several hundred thousands statements executed per day). The error message reads: "FATAL 1: Socket command option e unknown" And the server closes the connection. Actually, the complete error message contains several lines, and reads: FATAL 1: Socket command option e unknown server closed the connection unexpectedly This probably means that the server terminated abnormally before or while processing the request. Except that the server does not shutdown (more statements continue to arrive, and they're executed). I checked the docs, but I can't figure out what may be happening, so I'm at a loss at trying to determine what to do. (the one thing I did notice is that it is related to one specific statement -- however, it's not that the statement is unconditionally wrong: it does execute several thousand times per day, and it succeeds most of the time) Any ideas? Pointers where to look for more information would also be highly appreciated! Thanks, Carlos --
Hi Carlos, Just to give everyone a bit more background information, which version of PostgreSQL are you having the troubles with, and which operating system is it running on? :-) Regards and best wishes, Justin Clift Carlos Moreno wrote: > > Hello, > > I hope this is the right list to post this question. > > I'm at a loss trying to figure out a strange error message > I'm getting (at an excessively high rate -- around 200 times > per day, out of a total of several hundred thousands statements > executed per day). > > The error message reads: > > "FATAL 1: Socket command option e unknown" > > > And the server closes the connection. Actually, the > complete error message contains several lines, and reads: > > FATAL 1: Socket command option e unknown > server closed the connection unexpectedly > This probably means that the server terminated abnormally > before or while processing the request. > > > Except that the server does not shutdown (more statements > continue to arrive, and they're executed). > > I checked the docs, but I can't figure out what may be > happening, so I'm at a loss at trying to determine what to > do. (the one thing I did notice is that it is related to > one specific statement -- however, it's not that the > statement is unconditionally wrong: it does execute > several thousand times per day, and it succeeds most of > the time) > > Any ideas? Pointers where to look for more information > would also be highly appreciated! > > Thanks, > > Carlos > -- > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi
Oops! That was pretty irresponsible from me, not providing that information! :-( Anyway, it's postgres 7.2.3 running on a RedHat 7.3 Linux system, on a dual Athlon CPU (presumably installed properly -- i.e., with the SMP kernel), with 1GB of memory. I installed postgres from the source, and changed the parameters related to the shared buffers; I put something around 250MB dedicated to shared buffers. At peak times, we estimate that the system may do something around 10 to 20 SQL statements per second; maybe 70% to 80% of those are insert statements. The largest table has around a million records (we clean it up weekly, since it would otherwise keep growing to infinity). Sorry for the lack of information in my initial post! I hope this will increase the chances that someone will be familiar with the problem and can guide me to find a solution! Thanks, Carlos -- Justin Clift wrote: > Hi Carlos, > > Just to give everyone a bit more background information, which version > of PostgreSQL are you having the troubles with, and which operating > system is it running on? > > > [...] > >> >> I'm at a loss trying to figure out a strange error message >> I'm getting (at an excessively high rate -- around 200 times >> per day, out of a total of several hundred thousands statements >> executed per day). >> >> The error message reads: >> >> "FATAL 1: Socket command option e unknown" >> >> >> And the server closes the connection. Actually, the >> complete error message contains several lines, and reads: >> >> FATAL 1: Socket command option e unknown >> server closed the connection unexpectedly >> This probably means that the server terminated abnormally >> before or while processing the request. >> >> >> Except that the server does not shutdown (more statements >> continue to arrive, and they're executed). >> >> I checked the docs, but I can't figure out what may be >> happening, so I'm at a loss at trying to determine what to >> do. (the one thing I did notice is that it is related to >> one specific statement -- however, it's not that the >> statement is unconditionally wrong: it does execute >> several thousand times per day, and it succeeds most of >> the time) >
Carlos Moreno <moreno@mochima.com> writes: > "FATAL 1: Socket command option e unknown" You sure it doesn't say Socket command type e unknown ? Because that's how it's spelled in tcop/postgres.c, and I don't see any other messages like this. This message is basically saying that the backend got an unexpected start-of-message character from the frontend. Usually this happens because the frontend and backend have gotten out of sync, and the most common cause of that is a failed COPY IN operation. COPY IN has no error-recovery ability worthy of the name :-(. Does your app use COPY IN a lot? What shows up in the postmaster log right before these messages? regards, tom lane
Tom Lane wrote: >Carlos Moreno <moreno@mochima.com> writes: > >>"FATAL 1: Socket command option e unknown" >> > >You sure it doesn't say > Socket command type e unknown > No, I'm not sure :-) In fact, I *am sure* that it is indeed the way you say :-) BTW, some times the type is p, instead of e. > >? Because that's how it's spelled in tcop/postgres.c, and I don't see >any other messages like this. > >This message is basically saying that the backend got an unexpected >start-of-message character from the frontend. Usually this happens >because the frontend and backend have gotten out of sync, and the most >common cause of that is a failed COPY IN operation. COPY IN has no >error-recovery ability worthy of the name :-(. > What is COPY IN?? I didn't find it in the documentation (it talks about copy, copy from, copy to, copy binary, but I didn't find copy in). If COPY IN is an explicit SQL command, then I can tell you for sure that my app. does not use a single one of those (the fact that I still have no idea what it is may be offered as evidence of that :-)). If it is something that may be one of the parts involved in some other operation(s), then maybe I'm using them (it's just that I wouldn't know) Now, what I understand from your explanation is that the *communication* between server and client get out of sync?? Is that possible even when both are running on the same machine, and not even using TCP/IP to communicate?? (the server does have TCP/IP enabled, since other applications, running on different machines use it, but this application, the one logging these errors, is running locally, and connect to the server with a connection string that specifies only dbname and user) I wonder if the fact that my app. is multi-threaded (logging to database is something that has kind of low-priority, and I don't want it to interfere with the "main loop" of my app. -- the main loop is basically the communications loop, which is critical for the responsiveness of my server to the multiple clients). So, I wonder if I (or postgres??) is doing something wrong with the multiple requests?? There are only two threads that make any use of the database: one that logs to db (i.e., exclusively insert, update, and delete commands), and one to process the login authentications (users' names and passwords are stored in the database, along with related information that has to be retrieved at login-time). So, is it possible that these two threads are "stepping on each other's tails"? The curious thing is that it is always one specific insert statement the one that causes the problem. I'm pretty intrigued, since I don't see a single difference in the way that statement is produced or processed... :-( >Does your app use COPY IN a lot? What shows up in the postmaster log >right before these messages? > Errr.... Where would I find that?? Should I explicitly redirect the output of postmaster when I run it?? (I didn't :-( ). So I guess I know the first thing I have to do...?? :-) Thanks for your message! Carlos --
Carlos Moreno <moreno@mochima.com> writes: > I wonder if the fact that my app. is multi-threaded > (logging to database is something that has kind of > low-priority, and I don't want it to interfere with the > "main loop" of my app. -- the main loop is basically > the communications loop, which is critical for the > responsiveness of my server to the multiple clients). > > So, I wonder if I (or postgres??) is doing something > wrong with the multiple requests?? There are only > two threads that make any use of the database: > one that logs to db (i.e., exclusively insert, update, > and delete commands), and one to process the > login authentications (users' names and passwords > are stored in the database, along with related > information that has to be retrieved at login-time). If you're using the database from two threads, you should either be using one DB connection per thread, or protecting the single DB connection with a mutex lock for the duration of each query. One connection per thread is probably simplest. > So, is it possible that these two threads are "stepping > on each other's tails"? Quite possible, unless you're taking steps to prevent it. Sharing a single connection between independent threads with no locking is probably not going to work. -Doug
Doug McNaught wrote: >Carlos Moreno <moreno@mochima.com> writes: > > >>I wonder if the fact that my app. is multi-threaded >>(logging to database is something that has kind of >>low-priority, and I don't want it to interfere with the >>"main loop" of my app. -- the main loop is basically >>the communications loop, which is critical for the >>responsiveness of my server to the multiple clients). >> >>So, I wonder if I (or postgres??) is doing something >>wrong with the multiple requests?? There are only >>two threads that make any use of the database: >>one that logs to db (i.e., exclusively insert, update, >>and delete commands), and one to process the >>login authentications (users' names and passwords >>are stored in the database, along with related >>information that has to be retrieved at login-time). >> > >If you're using the database from two threads, you should either be >using one DB connection per thread, or protecting the single DB >connection with a mutex lock for the duration of each query. One >connection per thread is probably simplest. > Hmm, are you referring to the PgDatabase object that I use to connect to the database? (I'm using libpq++) If that is what you're referring to, then yes, I'm perfectly fine there -- the PgDatabase object is a local variable in each of the two threads. Each thread loops checking a "jobs queue" (properly synchronized -- honest! :-)), and whenever it finds one or several SQL's to be executed (or a login to be processed), *then* it creates the PgDatabase object, which lives for the duration of that loop -- i.e., the same PgDatabase object will be used to execute the one or several jobs in the queue. >>So, is it possible that these two threads are "stepping >>on each other's tails"? >> > >Quite possible, unless you're taking steps to prevent it. Sharing a >single connection between independent threads with no locking is >probably not going to work. > No, that's not what is happening. But I was wondering if there is some information that indirectly is being shared by the two threads (that it is because of my mistake or postgres' mistake -- far less likely, of course) BTW, after reading Tom's message, I restarted the system to change the startup script and send all output from postmaster to a log file. Nothing relevant; there are about a dozen logged messages "Socket command type e unknown", and the line before is either another identical message, or a failed insert (a foreign key referential integrity violation). These failed inserts are not supposed to happen, but right now, it is happening because of a minor bug recently fixed, that was causing certain things not to be inserted to the DB, and so now the subsequent inserts are making reference to something that is not present. Still, this should be normal operation from the point of view of the server; it is simply an insert that failed and was properly reported by the backend, right? Thanks, Carlos --
Carlos Moreno <moreno@mochima.com> writes: > Nothing relevant; there are about a dozen logged messages > "Socket command type e unknown", and the line before is > either another identical message, or a failed insert (a > foreign key referential integrity violation). You should look very hard at the part of your code that deals with error reports from the server. It might be worth logging the whole session between client and server with tcpdump --- that'd be tedious to wade through, but it might give some evidence of what's happening. It seems pretty clear that the client is sending an incomplete message, but knowing only one character of the broken message isn't much to help you debug it. With a tcpdump trace you could see the whole packet ... regards, tom lane
Carlos Moreno <moreno@mochima.com> writes: > Doug McNaught wrote: > > >If you're using the database from two threads, you should either be > >using one DB connection per thread, or protecting the single DB > >connection with a mutex lock for the duration of each query. One > >connection per thread is probably simplest. > > > > Hmm, are you referring to the PgDatabase object that I > use to connect to the database? (I'm using libpq++) I don't know libpq++ but I would assume that each PgDatabase object has its own underlying database connection. If this is true than you shouldn't see any thread-related problems. You should read the libpq++ docs and/or the source code to make sure of this, though. > If that is what you're referring to, then yes, I'm perfectly > fine there -- the PgDatabase object is a local variable in > each of the two threads. Each thread loops checking > a "jobs queue" (properly synchronized -- honest! :-)), > and whenever it finds one or several SQL's to be > executed (or a login to be processed), *then* it creates > the PgDatabase object, which lives for the duration of > that loop -- i.e., the same PgDatabase object will be > used to execute the one or several jobs in the queue. Creating and destroying database connections on the fly is kind of inefficient, but shouldn't cause the problems you're seeing. I'm wondering if it might not be a libpq++ bug, just because you don't seem to be doing anything nonstandard otherwise, and there haven't been any reports that I've seen of people running into this using regular libpq, especially since you're not using COPY... > No, that's not what is happening. But I was wondering > if there is some information that indirectly is being shared > by the two threads (that it is because of my mistake or > postgres' mistake -- far less likely, of course) As I said, it *might* be a libpq++ bug. What you say below kind of smells like it: > BTW, after reading Tom's message, I restarted the system > to change the startup script and send all output from > postmaster to a log file. > > Nothing relevant; there are about a dozen logged messages > "Socket command type e unknown", and the line before is > either another identical message, or a failed insert (a > foreign key referential integrity violation). These failed > inserts are not supposed to happen, but right now, it is > happening because of a minor bug recently fixed, that > was causing certain things not to be inserted to the DB, > and so now the subsequent inserts are making reference > to something that is not present. Still, this should be > normal operation from the point of view of the server; > it is simply an insert that failed and was properly > reported by the backend, right? Right. Perhaps the client-side library isn't handling it properly. Might be time to dig in with a debugger. If you can trigger the referential integrity violation at will, do so and step through the libpq++ code with gdb (or whatever)--see if it handles the error from the server properly (you'll need to read up on the wire protocol). Good luck... -Doug
Hi Doug, Doug McNaught wrote: > >Creating and destroying database connections on the fly is kind of >inefficient, but shouldn't cause the problems you're seeing. > But doing otherwise has the problem that if for some reason the connection fails (or drops at some point), then it's gone forever (in libpq++, there is a way to check if the connection is alive, but I see no method to reconnect; only when creating the PgDatabase or PgConnection object does it connect to the database -- in fact, this was precisely the bug I was referring to, the one that led to the referential integrity violations right now). > >Right. Perhaps the client-side library isn't handling it properly. > >Might be time to dig in with a debugger. If you can trigger the >referential integrity violation at will, do so and step through the >libpq++ code with gdb (or whatever)--see if it handles the error from >the server properly (you'll need to read up on the wire protocol). > Ouff!! This sounds sooo scary!! :-) But I guess if that is the problem, then I might completely eliminate it with a mutex for access to the database. I.e., any time I execute *any* SQL statement, then I lock the mutex. Hmmm, however, that doesn't fit too well the fact that it is one specific insert statement that *always* causes the error. Also, before the most recent changes, I had very few of these errors (about 20 in a month and a half of activity, versus a couple hundred in just two days), and before, I was doing the login in the main thread, and the logs to database in a separate thread... Hmm, intriguing... I guess I'm going to try using a mutex for any access to the database as my first step, to *determine* if the problem is related to concurrent database accesses from the two threads. (hmmm, at this point, I wonder if it is about time to upgrade to 7.3) Thanks a lot -- and thanks to Tom -- for your comments and suggestions! Cheers, Carlos --
Tom Lane wrote: > >You should look very hard at the part of your code that deals with >error reports from the server. > >It might be worth logging the whole session between client and >server with tcpdump --- that'd be tedious to wade through, but >it might give some evidence of what's happening. It seems pretty >clear that the client is sending an incomplete message, but knowing >only one character of the broken message isn't much to help you >debug it. With a tcpdump trace you could see the whole packet ... > Ok, it would look like I am going to have to face the terror of doing this! :-) However, I have zero experience on this; if I may, I'd like to ask for some pointers on this tcpdump; checking the man pages, I came up with: tcpdump -i local port 5432 I then connected using psql -h localhost, as a test; it seems to be showing only timing and length information. What could I do? The other thing is that this is only a quick test to first see that I can figure out the information that tcpdump gives me... But the connection is not being done through TCP/IP, since the application and postmaster are running on the same physical machine. So, how can I listen on local (i.e., AF_UNIX) sockets? (which I assume is the way my application and the postgres server communicate) Thanks! Carlos --
Carlos Moreno <moreno@mochima.com> writes: > Tom Lane wrote: > > >It might be worth logging the whole session between client and > >server with tcpdump --- that'd be tedious to wade through, but > >it might give some evidence of what's happening. It seems pretty > >clear that the client is sending an incomplete message, but knowing > >only one character of the broken message isn't much to help you > >debug it. With a tcpdump trace you could see the whole packet ... > > > > Ok, it would look like I am going to have to face the > terror of doing this! :-) > > However, I have zero experience on this; if I may, > I'd like to ask for some pointers on this tcpdump; > > checking the man pages, I came up with: > > tcpdump -i local port 5432 > > I then connected using psql -h localhost, as a test; > it seems to be showing only timing and length information. > What could I do? You'll need to use '-s 1500' (or whatever the MTU is) to get all of each packet, and '-xX' to dump the whole packet in hex/ascii. > The other thing is that this is only a quick test to first > see that I can figure out the information that tcpdump > gives me... But the connection is not being done through > TCP/IP, since the application and postmaster are running > on the same physical machine. > > So, how can I listen on local (i.e., AF_UNIX) sockets? > (which I assume is the way my application and the postgres > server communicate) No can do with tcpdump. Either switch your app (temporarily) to use TCP/IP or look into using strace(1) to log the read/write/send/rcv calls made by the app (this may be tricky, and will slow down your app). The other thing you could do (which would work very well but take some savvy) is reconfigure PG to use a different port (5433, say), then write a little daemon that connects to '/var/run/postgresql/.s.PGSQL.5433' and listens on 's.PGSQL.5432', passing data both ways and logging it. This would be a great diagnostic tool but takes some sockets knowledge to write (not anything super-arcane, though). Switching to TCP temporarily is probably less work. :) Hope this helps! -Doug
Carlos Moreno <moreno@mochima.com> writes: > checking the man pages, I came up with: > tcpdump -i local port 5432 > I then connected using psql -h localhost, as a test; > it seems to be showing only timing and length information. I get useful results with /usr/sbin/tcpdump -x -X -s 0 -i lo port 5432 You do need to use "-h localhost" when connecting with psql. regards, tom lane
Doug McNaught wrote: > >No can do with tcpdump. Either switch your app (temporarily) to use >TCP/IP or look into using strace(1) > D'oh!! Of course!! How could I forget about the good old strace!!! I used it soo much during the debugging period of my application, and now I'm so ungrateful that I forgot about it? :-) Anyway, thanks for the advice! I guess I'll try first with strace (because I can do that right now -- the app is running as we speak), and if it is too complicated, I would temporarily change my app so that it connects through TCP/IP (to localhost) to be able to tcpdump it. Thanks! Carlos --
Tom Lane wrote: > >I get useful results with > >/usr/sbin/tcpdump -x -X -s 0 -i lo port 5432 > >You do need to use "-h localhost" when connecting with psql. > Yes, the -h localhost would be necessary; notice that I want to eavesdrop on my application (the test with psql was just to try and see the protocol just to make sure that I understand the basics before jumping to some serious debugging on my application). But yes, the above translates to the fact that I'll have to add host=localhost to the connection string in my code. Thanks, Carlos --
May I suggest looking at tcpflow? The syntax is similar to tcpdump but it will create a file for each connection (actually two files, one for inbound and one for outbound data). The files contain the data stream for that connection - extremely useful for debugging. The only thing I've found frustrating is that since one file is everything from the client and the other is everything from the server you are left to "synchronize" which parts of the inbound stream correspond to which parts of the outbound. If you want a few more bells/whistles and a GUI, try Ethereal. Cheers, Steve On Monday 24 February 2003 2:31 pm, Carlos Moreno wrote: > Tom Lane wrote: > >You should look very hard at the part of your code that deals with > >error reports from the server. > > > >It might be worth logging the whole session between client and > >server with tcpdump --- that'd be tedious to wade through, but > >it might give some evidence of what's happening. It seems pretty > >clear that the client is sending an incomplete message, but knowing > >only one character of the broken message isn't much to help you > >debug it. With a tcpdump trace you could see the whole packet ... > > Ok, it would look like I am going to have to face the > terror of doing this! :-) > > However, I have zero experience on this; if I may, > I'd like to ask for some pointers on this tcpdump; > > checking the man pages, I came up with: > > tcpdump -i local port 5432 > > I then connected using psql -h localhost, as a test; > it seems to be showing only timing and length information. > What could I do? > > The other thing is that this is only a quick test to first > see that I can figure out the information that tcpdump > gives me... But the connection is not being done through > TCP/IP, since the application and postmaster are running > on the same physical machine. > > So, how can I listen on local (i.e., AF_UNIX) sockets? > (which I assume is the way my application and the postgres > server communicate) > > Thanks! > > Carlos