Thread: Strange message followed by server crash

Strange message followed by server crash

From
"Donald Fraser"
Date:
I'm trying to figure out what exactly caused PostgreSQL to crash.
 
PostgreSQL 8.0.7
OS: Red Hat Linux EL 4.0
 
Log messages:
<1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR:  invalid message format
< 3670 2006-08-24 17:48:19 BST > LOG:  server process (PID 10709) was terminated by signal 11
< 3670 2006-08-24 17:48:19 BST > LOG:  terminating any other active server processes
 
Looking into the PostgreSQL code, I note that the message "invalid message format" is generated by the function pq_getmsgend, which is called when processing incoming messages from clients.
So am I correct in assuming that there was an error in the frontend/backend data sent to PostgreSQL?
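For reference, pq_getmsgend itself (in src/backend/libpq/pqformat.c, if I'm reading the tree correctly) does nothing more than check that the message was consumed exactly up to its declared length --- roughly this (paraphrased, so treat it as a sketch rather than an exact copy of the source):

void
pq_getmsgend(StringInfo msg)
{
    /* complain if the declared message length doesn't match what was read */
    if (msg->cursor != msg->len)
        ereport(ERROR,
                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                 errmsg("invalid message format")));
}

So the error on its own just means the backend was handed a message whose length field didn't agree with its contents.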
 
I asked the client above what they were doing when this event occurred and the answer was logging out of the database.
 
At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.
Therefore can I assume this crash was caused by some form of corrupt / incomplete data sent to PostgreSQL?
 
Thanks in advance
regards
Donald Fraser
 
 

Re: Strange message followed by server crash

From
Tom Lane
Date:
"Donald Fraser" <postgres@kiwi-fraser.net> writes:
> Log messages:
> <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR:  invalid message format
> < 3670 2006-08-24 17:48:19 BST > LOG:  server process (PID 10709) was terminated by signal 11

> At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.

Hm, the "invalid message format" is unsurprising if the network suffered
a glitch and delivered bad data, but the backend crash is not good.
And what do you make of the apparent junk on the log line?  (What's
your log_line_prefix exactly?)

Given that this sounds like it won't be easy to reproduce, I'm hoping
you have a core file left from that and can get a stack trace from it.
That would give us some clue where to look.

            regards, tom lane

Re: Strange message followed by server crash

From
"Donald Fraser"
Date:
>"Donald Fraser" <postgres@kiwi-fraser.net> writes:
>> Log messages:
>> <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR:  invalid message format
>> < 3670 2006-08-24 17:48:19 BST > LOG:  server process (PID 10709) was terminated by signal 11

>> At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.
 
"Tom Lane" writes:
>And what do you make of the apparent junk on the log line?  (What's
>your log_line_prefix exactly?)
The log messages are fine as shown; log_line_prefix is: <%u %p %t %x>
"1Tmenteshvili" is actually a user login name... don't ask.

>Given that this sounds like it won't be easy to reproduce, I'm hoping
>you have a core file left from that and can get a stack trace from it.
>That would give us some clue where to look.

Where would I look for a core file?
There's nothing unusual in the data directory, and nothing in the /tmp directory either.
 
I had my email turned off yesterday, and PostgreSQL restarted itself, so I didn't find out about the crash till this morning when I logged in.
Everything seems to have been running smoothly since then...
If there isn't a core file, how would I go about ensuring one is created next time? (Touch wood there won't be one.)
 
Regards
Donald Fraser
 

Re: Strange message followed by server crash

From
Tom Lane
Date:
"Donald Fraser" <postgres@kiwi-fraser.net> writes:
> "Tom Lane" writes:
>> Given that this sounds like it won't be easy to reproduce, I'm hoping
>> you have a core file left from that and can get a stack trace from it.

> Where would I look for a core file?
> There's nothing unusual in the data directory, and nothing in the /tmp directory either.

Normally it would go into the $PGDATA/base/NNN subdirectory for the
database the backend was connected to.  A very few systems (OS X for
instance) drop cores into a dedicated system-wide directory called
/cores or some such.  If you're not finding anything, it's probably
because the postmaster was launched under "ulimit -c 0" or local
equivalent spelling --- a lot of systems have a policy of launching
daemons that way.

> If there isn't a core file, how would I go about ensuring one is created next time? (Touch wood there won't be one.)

Put "ulimit -c unlimited" into the script that launches the postmaster,
and restart it.
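
For example (sketch only --- the exact start script and its layout are
installation-specific):

    # near the top of the postmaster start script, before the server starts
    ulimit -c unlimited        # allow core files of unlimited size
    ulimit -c                  # sanity check: should now print "unlimited"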

            regards, tom lane

Re: Strange message followed by server crash

From
Alvaro Herrera
Date:
Tom Lane wrote:
> "Donald Fraser" <postgres@kiwi-fraser.net> writes:
> > "Tom Lane" writes:
> >> Given that this sounds like it won't be easy to reproduce, I'm hoping
> >> you have a core file left from that and can get a stack trace from it.
>
> > Where would I look for a core file?
> > There's nothing unusual in the data directory, and nothing in the /tmp directory either.
>
> Normally it would go into the $PGDATA/base/NNN subdirectory for the
> database the backend was connected to.  A very few systems (OS X for
> instance) drop cores into a dedicated system-wide directory called
> /cores or some such.

Note that on recent versions of Linux you can modify both the file name
and the directory where it's saved, by modifying
/proc/sys/kernel/core_pattern (or the deprecated
/proc/sys/kernel/core_uses_pid).  See core(5) for details.
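
For example (run as root; the target directory and pattern are just an
illustration --- core(5) lists the available % escapes):

    # name cores /tmp/core.<program name>.<pid> instead of plain "core"
    echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern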

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Strange message followed by server crash

From
Tom Lane
Date:
"Donald Fraser" <postgres@kiwi-fraser.net> writes:
> Log messages:
> <1Tmenteshvili 10709 2006-08-24 17:48:19 BST 0> ERROR:  invalid message format
> < 3670 2006-08-24 17:48:19 BST > LOG:  server process (PID 10709) was terminated by signal 11

> At the same time, an engineer, who thought nobody was in the building, was working on the network and changing a network switch.

I tried to reproduce this by using gdb to force the "invalid message
format" failure --- that is, deliberately changing one of the values
pq_getmsgend compares.  No luck: the backend issued the error and
continued just as it should.  My best guess now is that the crash was not
directly related to that error, but was caused by insufficiently robust
processing of whatever garbage was received after it.  However, with no
clue what that garbage might have looked like, there are too many
possible code paths to chase through without some lead.

Anyone care to do some random-data stress testing, ie, connect to a
backend and squirt random data at it to see if you can make it crash
rather than just complain and disconnect?  Be prepared to show a backend
stack trace and the exact data sent if you succeed.
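
A minimal starting point might be something like the sketch below (plain C,
raw socket; the host and port are assumptions --- adjust to taste).  As
written it only exercises startup-packet parsing; to poke at the
post-authentication message handling, which is where this error came from,
you'd have to complete the startup/authentication exchange first and only
then start injecting garbage.

/*
 * fuzz.c --- connect to a backend port and write random bytes at it.
 * Sketch only: host/port are assumptions, and no protocol handshake is
 * performed, so this only tests startup-packet handling.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int
main(void)
{
    struct sockaddr_in addr;
    unsigned char buf[8192];
    int     sock;
    size_t  i;

    srand((unsigned int) time(NULL));

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
    {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5432);                    /* assumed port */
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");  /* assumed host */

    if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        perror("connect");
        return 1;
    }

    /* fill the buffer with random garbage and send it */
    for (i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char) (rand() & 0xFF);
    if (write(sock, buf, sizeof(buf)) < 0)
        perror("write");

    close(sock);
    return 0;
}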

            regards, tom lane