Thread: Strange message followed by server crash
I'm trying to figure out what exactly caused PostgreSQL to crash.
PostgreSQL 8.0.7
OS: Red Hat Linux EL 4.0
Log messages:
<1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR: invalid message format
< 3670 2006-08-24 17:48:19 BST > LOG: server process (PID 10709) was terminated by signal 11
< 3670 2006-08-24 17:48:19 BST > LOG: terminating any other active server processes
Looking into the PostgreSQL code, I note that the message "invalid message format" is generated by the function pq_getmsgend, which is called when processing incoming messages from clients.
So am I correct in assuming that there was an error in the frontend/backend data sent to PostgreSQL?
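For reference, the check behind that error is tiny; roughly the following (paraphrased from the backend's src/backend/libpq/pqformat.c, so the exact wording may differ between versions):

    /* Paraphrase of pq_getmsgend() in src/backend/libpq/pqformat.c:
     * once every field of a protocol message has been consumed, the read
     * cursor must sit exactly at the end of the message buffer; otherwise
     * the client sent a malformed message. */
    void
    pq_getmsgend(StringInfo msg)
    {
        if (msg->cursor != msg->len)
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg("invalid message format")));
    }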
I asked the client named above what they were doing when this event occurred, and the answer was: logging out of the database.
At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.
Therefore, can I assume this crash was caused by some form of corrupt or incomplete data sent to PostgreSQL?
Thanks in advance
regards
Donald Fraser
"Donald Fraser" <postgres@kiwi-fraser.net> writes: > Log messages: > <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR: invalid message format > < 3670 2006-08-24 17:48:19 BST > LOG: server process (PID 10709) was terminated by signal 11 > At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a networkswitch. Hm, the "invalid message format" is unsurprising if the network suffered a glitch and delivered bad data, but the backend crash is not good. And what do you make of the apparent junk on the log line? (What's your log_line_prefix exactly?) Given that this sounds like it won't be easy to reproduce, I'm hoping you have a core file left from that and can get a stack trace from it. That would give us some clue where to look. regards, tom lane
>"Donald Fraser" <postgres@kiwi-fraser.net> writes:
>> Log messages:
>> <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR: invalid message format
>> < 3670 2006-08-24 17:48:19 BST > LOG: server process (PID 10709) was terminated by signal 11
>> At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.
"Tom Lane" writes:
>And what do you make of the apparent junk on the log line? (What's
>your log_line_prefix exactly?)
The log messages are good; the log_line_prefix is: <%u %p %t %x>
"1Tmenteshvili" is actually a user login name... don't ask.
>Given that this sounds like it won't be easy to reproduce, I'm hoping
>you have a core file left from that and can get a stack trace from it.
>That would give us some clue where to look.
Where would I look for a core file?
There's nothing unusual in the data directory, nothing in the /tmp directory?
I had my email turned off yesterday and PostgreSQL restarted itself, so I didn't find out about the crash until this morning when I logged in.
Everything seems to have been running smoothly since then...
If there isn't a core file, how would I go about ensuring one is created next time? (Touch wood there won't be one.)
Regards
Donald Fraser
"Donald Fraser" <postgres@kiwi-fraser.net> writes: > "Tom Lane" writes: >> Given that this sounds like it won't be easy to reproduce, I'm hoping >> you have a core file left from that and can get a stack trace from it. > Where would I look for a core file? > There's nothing unusual in the data directory, nothing in the /tmp = > directory? Normally it would go into the $PGDATA/base/NNN subdirectory for the database the backend was connected to. A very few systems (OS X for instance) drop cores into a dedicated system-wide directory called /cores or some such. If you're not finding anything, it's probably because the postmaster was launched under "ulimit -c 0" or local equivalent spelling --- a lot of systems have a policy of launching daemons that way. > If there isn't a core file, how would I go about ensuring one is created = > next time? (touch wood there wont be one). Put "ulimit -c unlimited" into the script that launches the postmaster, and restart it. regards, tom lane
Tom Lane wrote:
> "Donald Fraser" <postgres@kiwi-fraser.net> writes:
>> "Tom Lane" writes:
>>> Given that this sounds like it won't be easy to reproduce, I'm hoping
>>> you have a core file left from that and can get a stack trace from it.
>
>> Where would I look for a core file?
>> There's nothing unusual in the data directory, nothing in the /tmp directory?
>
> Normally it would go into the $PGDATA/base/NNN subdirectory for the
> database the backend was connected to. A very few systems (OS X for
> instance) drop cores into a dedicated system-wide directory called
> /cores or some such.

Note that on recent versions of Linux you can modify both the file name and the directory where it's saved, by modifying /proc/sys/kernel/core_pattern (or the deprecated /proc/sys/kernel/core_uses_pid). See core(5) for details.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
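As a concrete illustration of that core(5) mechanism (the target directory here is only an example):

    # name cores /var/tmp/core.<executable>.<pid>; %e and %p are
    # standard core_pattern escapes described in core(5)
    echo "/var/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern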
"Donald Fraser" <postgres@kiwi-fraser.net> writes: > Log messages: > <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR: invalid = > message format > < 3670 2006-08-24 17:48:19 BST > LOG: server process (PID 10709) was = > terminated by signal 11 > At the same time, an engineer, who thought nobody was in the = > building, was working on the network and changing a network switch. I tried to reproduce this by using gdb to force the "invalid message format" failure --- that is, deliberately changing one of the values pq_getmsgend compares. No luck, the backend issued the error and continued just as it should. My best guess now that the crash was not directly related to that error, but was caused by insufficiently robust processing of whatever garbage was received following that. However, with no clue what that garbage might have looked like, there's too many possible code paths to try to chase through with no leads. Anyone care to do some random-data stress testing, ie, connect to a backend and squirt random data at it to see if you can make it crash rather than just complain and disconnect? Be prepared to show a backend stack trace and the exact data sent if you succeed. regards, tom lane