Thread: Backend Crash v8.4.2

Backend Crash v8.4.2

From: Kelly Burkhart
Date: 29 June 2010, 11:17:35

Hello,

We had a backend crash this morning.  Version is PostgreSQL 8.4.2
running on openSuSE 11.2.  This machine is connected via iSCSI to a
Dell Equallogic array.  We've been running 8.4.2 since February (I
believe) without issue, although we've recently upgraded this machine
from 24 GB to 72 GB of RAM.  I don't see anything alarming in
/var/log/messages; this is the line from the crash:

2010-06-29T07:59:07.912030-05:00 db01-primary kernel: [839991.262273]
postmaster[11109]: segfault at 10f ip 000000000068d884 sp
00007fff78fa9dc0 error 4 in postgres[400000+44b000]

The crash left a core file, does the stack trace indicate anything crucial?

(gdb) where
#0  0x000000000068d884 in SearchCatCacheList ()
#1  0x0000000100000000 in ?? ()
#2  0x0000000000bbcbe0 in ?? ()
#3  0x00007f3b3a86a580 in ?? ()
#4  0x72ddbea20068dae0 in ?? ()
#5  0x00007fff78faa720 in ?? ()
#6  0x0000000000000000 in ?? ()
Current language:  auto
The current source language is "auto; currently asm".

Can anyone provide some guidance on how I can go about discovering the
cause?  There are some indications in the 8.4.3 and 8.4.4 release
notes that some possible crashes are fixed.  Do those issues
correspond with the stack trace above?

Thanks in advance for any advice,

-Kelly
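
The kernel line quoted in this message already allows a little arithmetic before gdb is opened. The sketch below uses only the values printed in that log line: it subtracts the mapping base from the instruction pointer to get the offset into the postgres image, and decodes the low bits of the x86 page-fault error code. The helper names `fault_offset` and `decode_fault` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the inputs come from the kernel segfault line.

# Offset of the faulting instruction inside the mapped binary: ip - base.
fault_offset() {
    printf '0x%x\n' $(( $1 - $2 ))
}

# Decode the low three bits of an x86 page-fault error code:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read,             1 = write
#   bit 2: 0 = kernel mode,      1 = user mode
decode_fault() {
    if [ $(( $1 & 4 )) -ne 0 ]; then mode=user; else mode=kernel; fi
    if [ $(( $1 & 2 )) -ne 0 ]; then access=write; else access=read; fi
    if [ $(( $1 & 1 )) -ne 0 ]; then cause='protection violation'; else cause='page not present'; fi
    printf '%s-mode %s, %s\n' "$mode" "$access" "$cause"
}

fault_offset 0x000000000068d884 0x400000   # ip and base from "postgres[400000+44b000]"
decode_fault 4                             # from "error 4"
```

With these inputs the script reports offset 0x28d884 and a user-mode read of an unmapped page. The faulting address 10f is so close to zero that a NULL or near-NULL pointer dereference is a plausible reading, though that is an inference and not something the log states.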

Re: Backend Crash v8.4.2

From: Tom Lane
Date: 29 June 2010, 11:35:08

Kelly Burkhart <kelly.burkhart@gmail.com> writes:
> The crash left a core file, does the stack trace indicate anything crucial?

> (gdb) where
> #0  0x000000000068d884 in SearchCatCacheList ()
> #1  0x0000000100000000 in ?? ()
> #2  0x0000000000bbcbe0 in ?? ()
> #3  0x00007f3b3a86a580 in ?? ()
> #4  0x72ddbea20068dae0 in ?? ()
> #5  0x00007fff78faa720 in ?? ()
> #6  0x0000000000000000 in ?? ()
> Current language:  auto
> The current source language is "auto; currently asm".

That's pretty much useless unless you can install debug symbols and
try again.  I will say though that this is probably a new bug ---
I don't recall seeing anything crashing in SearchCatCacheList recently.

> Can anyone provide some guidance on how I can go about discovering the
> cause?

Please try to create a reproducible test case.  One thing you can get to
start from is the query that was being executed --- try this in gdb:

    p debug_query_string

If that just gives you a number and not the text of a SQL query, try

    p (char *) debug_query_string

            regards, tom lane
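
The gdb steps suggested in this reply can also be run non-interactively, which may be convenient when the core file has to be inspected on a production machine. A minimal sketch, assuming gdb is on the PATH; the function name `core_summary`, the `GDB` override variable, and both paths are placeholders and do not come from this thread.

```shell
#!/bin/sh
# core_summary BINARY CORE
# Print a backtrace and the query text from a postgres core file.
# GDB may be set to point at a different debugger binary (an assumption
# made for this sketch, not a gdb convention).
core_summary() {
    "${GDB:-gdb}" --batch \
        -ex 'bt' \
        -ex 'p (char *) debug_query_string' \
        "$1" "$2"
}

# Example (placeholder paths):
#   core_summary /usr/local/pgsql/bin/postgres /path/to/core
```

The cast form of the print command is used directly, since it yields the query text whether or not gdb knows the variable's type.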

Re: Backend Crash v8.4.2

From: Kelly Burkhart
Date: 30 June 2010, 12:06:44

On Tue, Jun 29, 2010 at 9:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kelly Burkhart <kelly.burkhart@gmail.com> writes:
>> The crash left a core file, does the stack trace indicate anything crucial?
>
>> (gdb) where
>> #0  0x000000000068d884 in SearchCatCacheList ()
>> #1  0x0000000100000000 in ?? ()
>> #2  0x0000000000bbcbe0 in ?? ()
>> #3  0x00007f3b3a86a580 in ?? ()
>> #4  0x72ddbea20068dae0 in ?? ()
>> #5  0x00007fff78faa720 in ?? ()
>> #6  0x0000000000000000 in ?? ()
>> Current language:  auto
>> The current source language is "auto; currently asm".
>
> That's pretty much useless unless you can install debug symbols and
> try again.  I will say though that this is probably a new bug ---
> I don't recall seeing anything crashing in SearchCatCacheList recently.

I had our system people install the debug symbols and I get the same
stack trace.  I believe the symbols are indeed installed, yesterday
when I started gdb I saw a bunch of lines like this:

Missing separate debuginfo for /usr/lib64/libssl.so.0.9.8
Try: zypper install -C
"debuginfo(build-id)=c1d9e2a7e013149b5acc4d3580724d4827f5827c"

I don't see that now.

>
>> Can anyone provide some guidance on how I can go about discovering the
>> cause?
>
> Please try to create a reproducible test case.  One thing you can get to
> start from is the query that was being executed --- try this in gdb:
>
>        p debug_query_string

I was able to see the query:

select sd.close, s.minimum_trade_increment
from symbol_daily sd, symbol s
where s.symbol_name = sd.symbol_name
  and s.exchange_name = sd.exchange_name
  and sd.symbol_name = $1
  and sd.trading_dt = last_trading_dt()

It's a well established query done probably several times each
morning.  I don't know how to create a reproducible test case as I
can't determine anything that we did yesterday that was any different
from any other day.

-K

Re: Backend Crash v8.4.2

From: Tom Lane
Date: 30 June 2010, 13:07:25

Kelly Burkhart <kelly.burkhart@gmail.com> writes:
> I had our system people install the debug symbols and I get the same
> stack trace.  I believe the symbols are indeed installed, yesterday
> when I started gdb I saw a bunch of lines like this:

> Missing separate debuginfo for /usr/lib64/libssl.so.0.9.8
> Try: zypper install -C
> "debuginfo(build-id)=c1d9e2a7e013149b5acc4d3580724d4827f5827c"

> I don't see that now.

That sounds like you have symbols now for the system libraries, but not
postgresql itself.

> It's a well established query done probably several times each
> morning.  I don't know how to create a reproducible test case as I
> can't determine anything that we did yesterday that was any different
> from any other day.

Best guess from here is that you managed to run into some sort of
cache-reload bug; those are very sensitive to concurrent operations
since you only see them when a shared cache inval event happens at
just the wrong time.  I would recommend an update to 8.4.4 since we
did stomp two or three critters of that ilk in the last few months,
but I can't really guarantee that we found the one that bit you.

While you're at it, please try to make sure you install a non-symbol-
stripped version of 8.4.4.  If it does happen again, at least you'll
be prepared to collect more data.

            regards, tom lane

Re: Backend Crash v8.4.2

From: Kelly Burkhart
Date: 30 June 2010, 14:35:07

On Wed, Jun 30, 2010 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Best guess from here is that you managed to run into some sort of
> cache-reload bug; those are very sensitive to concurrent operations
> since you only see them when a shared cache inval event happens at
> just the wrong time.  I would recommend an update to 8.4.4 since we
> did stomp two or three critters of that ilk in the last few months,
> but I can't really guarantee that we found the one that bit you.
>
> While you're at it, please try to make sure you install a non-symbol-
> stripped version of 8.4.4.  If it does happen again, at least you'll
> be prepared to collect more data.
>

We'll plan on upgrading.

RE: stripped symbols, I assume you mean configuring with
--enable-debug specified, I see from my config.log that I did not
specify that flag.  I just built with debug symbols on a
non-production machine and the stack trace is different.  I assume
it's completely invalid because symbol addresses from different builds
are not guaranteed to line up.  Correct?  Or is this helpful?

Program terminated with signal 11, Segmentation fault.
#0  0x000000000068d884 in RelationCacheInitializePhase2 () at relcache.c:2588
2588            LOAD_CRIT_INDEX(IndexRelidIndexId);
(gdb) where
#0  0x000000000068d884 in RelationCacheInitializePhase2 () at relcache.c:2588
#1  0x0000000000000000 in ?? ()
(gdb)

Thanks,

-K

Re: Backend Crash v8.4.2

From: Tom Lane
Date: 30 June 2010, 15:10:15

Kelly Burkhart <kelly.burkhart@gmail.com> writes:
> RE: stripped symbols, I assume you mean configuring with
> --enable-debug specified, I see from my config.log that I did not
> specify that flag.

Ah, if you built it yourself, that explains why your sysadmins'
installation of symbol packages didn't help.  If you're building
with gcc, --enable-debug is pretty much always a good idea: it
doesn't cost anything but some extra disk space.  With some other
compilers --enable-debug disables optimization and hence isn't
a good idea for production builds.

> I just built with debug symbols on a
> non-production machine and the stack trace is different.  I assume
> it's completely invalid because symbol addresses from different builds
> are not guaranteed to line up.  Correct?  Or is this helpful?

Again, depends if it's gcc.  If so, and everything is identical between
this machine and the one where you did the original build, this'd
probably work.

> Program terminated with signal 11, Segmentation fault.
> #0  0x000000000068d884 in RelationCacheInitializePhase2 () at relcache.c:2588
> 2588            LOAD_CRIT_INDEX(IndexRelidIndexId);

That looks interesting, indeed.  I don't think I want to trust it
entirely because of the likelihood that there's some difference between
this build and the original; but if it's not too far off from reality
then it places the failure in relcache.c rather than SearchCatCacheList.
And that makes sense because we have indeed fixed several cache-related
bugs in relcache over the past six months or so.  At this point I'd
*strongly* encourage you to update to 8.4.4.

And please do build with --enable-debug in future, if you're using gcc.

            regards, tom lane
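
Before the next crash it may be worth confirming that the installed binary really carries debug information. One rough check, assuming a Linux machine with binutils installed, is to look for a `.debug_info` section in the ELF file. The helper name `has_debug_info` and the install path below are illustrative only.

```shell
#!/bin/sh
# has_debug_info FILE
# Succeeds if FILE is an ELF object containing a .debug_info section,
# which gcc emits when compiling with -g (what --enable-debug turns on).
has_debug_info() {
    readelf -S "$1" 2>/dev/null | grep -q '\.debug_info'
}

# Placeholder path; adjust to the actual installation prefix.
postgres_bin=/usr/local/pgsql/bin/postgres

if has_debug_info "$postgres_bin"; then
    echo "$postgres_bin: debug info present"
else
    echo "$postgres_bin: no debug info found; rebuild with ./configure --enable-debug"
fi
```

A binary can still show function names in a backtrace without this section, as the first trace in this thread did, because the plain symbol table is separate from the debug information that supplies file names and line numbers.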