Thread: Postgres 8.4.20 seqfault on RHEL 6.4

Postgres 8.4.20 seqfault on RHEL 6.4

From
Dave Johansen
Date:
I'm running Postgres 8.4.20 on RHEL 6.4 and it will occasionally crash. The postgres.log file just says that a PID was terminated. The output from dmesg has a message like this one:
postmaster[22905]: segfault at 686 ip 0000000000000686 sp 00007fff83d72e88 error 14 in postgres[400000+463000]

What can I do to try and figure out what is causing the crash and fix it?

Thanks,
Dave
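
The dmesg line in the message above already carries some information. As a rough sketch (not part of the original exchange), the helper below decodes it, assuming an x86 kernel, where the "error" field is printed in hexadecimal and holds the page-fault error code bits (bit 0: protection violation, bit 1: write, bit 2: user mode, bit 4: instruction fetch). The function name decode_segfault is made up for illustration.

```shell
# Decode a kernel "segfault at ADDR ip IP sp SP error CODE" line.
# Assumes x86, where CODE is the page-fault error code printed in hex.
decode_segfault() {
    line=$1
    addr=$(printf '%s\n' "$line" | sed -n 's/.*segfault at \([0-9a-f]*\) .*/\1/p')
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    err=$(printf '%s\n' "$line" | sed -n 's/.* error \([0-9a-f]*\).*/\1/p')
    if [ -z "$addr" ] || [ -z "$ip" ] || [ -z "$err" ]; then
        echo "not a segfault line" >&2
        return 1
    fi
    code=$((0x$err))

    echo "fault address: 0x$addr"
    # Bit 0: set means protection violation, clear means the page was not mapped.
    if [ $((code & 1)) -ne 0 ]; then
        echo "cause: protection violation"
    else
        echo "cause: page not present"
    fi
    # Bit 4: instruction fetch; otherwise bit 1 separates write from read.
    if [ $((code & 16)) -ne 0 ]; then
        echo "access: instruction fetch"
    elif [ $((code & 2)) -ne 0 ]; then
        echo "access: write"
    else
        echo "access: read"
    fi
    # Bit 2: the fault happened in user mode rather than kernel mode.
    if [ $((code & 4)) -ne 0 ]; then
        echo "mode: user"
    else
        echo "mode: kernel"
    fi
    # A fault address equal to the instruction pointer means the process
    # tried to execute code at an address that is not mapped.
    if [ $((0x$addr)) -eq $((0x$ip)) ]; then
        echo "note: fault address equals ip (jump to an invalid address)"
    fi
}
```

For the line quoted above, error 14 is 0x14, so this reports an instruction fetch from a page that is not present, in user mode, with the fault address equal to ip. That pattern usually points to a call through a bad function pointer rather than a bad data access.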

Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Tom Lane
Date:
Dave Johansen <davejohansen@gmail.com> writes:
> I'm running Postgres 8.4.20 on RHEL 6.4 and it will occasionally crash. The
> postgres.log file just says that a PID was terminated. The output from
> dmesg has a message like this one:
> postmaster[22905]: segfault at 686 ip 0000000000000686 sp 00007fff83d72e88
> error 14 in postgres[400000+463000]

> What can I do to try and figure out what is causing the crash and fix it?

(1) install relevant postgresql-debuginfo package (assuming we're talking
about a Red Hat-originated postgres package)

(2) run postmaster under "ulimit -c unlimited" (easiest way is probably
to add such a command to /etc/rc.d/init.d/postgresql and restart the
service)

(3) wait for crash

(4) gdb the resulting corefile (should be under your $PGDATA directory)

(5) send in a stack trace.

Keep in mind that 8.4.x is out of support so far as the PG community is
concerned, so we're unlikely to expend any great amount of effort on
this; but we can take a quick look at a stack trace to see if it looks
like a known problem.

            regards, tom lane
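
The five steps in the message above could be scripted roughly as follows. This is a sketch under several assumptions that are not stated in the thread: the Red Hat package is named postgresql, the server binary is /usr/bin/postgres, the data directory is /var/lib/pgsql/data, and the kernel's core_pattern leaves files named core or core.PID in the crashing process's working directory (if abrt or another handler intercepts core dumps, the file will be somewhere else). The function names are illustrative only.

```shell
# (1) Install debug symbols. debuginfo-install comes from the yum-utils
#     package; run this as root.
install_debuginfo() {
    debuginfo-install -y postgresql
}

# (2) is a manual edit: add the line
#         ulimit -c unlimited
#     near the top of /etc/rc.d/init.d/postgresql, then restart with
#         service postgresql restart
# (3) Wait for the next crash.

# (4) Print the newest core file found directly under the data directory
#     given as $1 (prints nothing if there is none).
latest_core() {
    ls -t "$1"/core* 2>/dev/null | head -n 1
}

# (4)+(5) Print a backtrace without starting an interactive gdb session.
#     $1 is the server binary, $2 is the core file.
core_backtrace() {
    gdb -batch -ex 'bt' "$1" "$2"
}
```

After a crash, something like `core_backtrace /usr/bin/postgres "$(latest_core /var/lib/pgsql/data)" > backtrace.txt` would produce text that can be pasted into a reply.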


Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Dave Johansen
Date:
On Thu, Feb 12, 2015 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dave Johansen <davejohansen@gmail.com> writes:
> > I'm running Postgres 8.4.20 on RHEL 6.4 and it will occasionally crash. The
> > postgres.log file just says that a PID was terminated. The output from
> > dmesg has a message like this one:
> > postmaster[22905]: segfault at 686 ip 0000000000000686 sp 00007fff83d72e88
> > error 14 in postgres[400000+463000]
>
> > What can I do to try and figure out what is causing the crash and fix it?
>
> (1) install relevant postgresql-debuginfo package (assuming we're talking
> about a Red Hat-originated postgres package)
>
> (2) run postmaster under "ulimit -c unlimited" (easiest way is probably
> to add such a command to /etc/rc.d/init.d/postgresql and restart the
> service)
>
> (3) wait for crash
>
> (4) gdb the resulting corefile (should be under your $PGDATA directory)
>
> (5) send in a stack trace.

Thanks for the info. It will take a little while for me to get this all approved and set up, but it should be helpful.

Also, could this be caused by an issue with, or misuse of, libpq/libpqxx in our application? If so, is there some way that we could turn on some sort of logging to see the queries, or at least the connections, that we were opening when or just before the crash happened?
 
> Keep in mind that 8.4.x is out of support so far as the PG community is
> concerned, so we're unlikely to expend any great amount of effort on
> this; but we can take a quick look at a stack trace to see if it looks
> like a known problem.

Yes, we're in the process of upgrading to 9.2, but it will probably still be a while. If we are able to track down some issue, then we'd bring it up with Red Hat and request that they resolve it.

Thanks,
Dave

Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Tom Lane
Date:
Dave Johansen <davejohansen@gmail.com> writes:
> Also, could this be caused by an issue with or misuse of libpq/libpqxx in our
> application? If so, is there some way that we could turn on some sort of
> logging to see the queries or at least connections that we were opening
> when or just before the crash happened?

A client should not be able to crash the server.  (Ideally, anyway...)

As for logging, log_statement = all ought to help, though it might be
verbose.

            regards, tom lane
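
For the logging suggestion above, the parameter is spelled log_statement in postgresql.conf. A minimal sketch of switching it on from the shell follows. It assumes GNU sed (for -i) and takes the path to postgresql.conf as an argument; the function name set_log_statement_all is made up for illustration.

```shell
# Set log_statement = 'all' in the postgresql.conf given as $1.
# Rewrites an existing (possibly commented-out) log_statement line,
# or appends the setting if no such line exists. Other parameters whose
# names merely start with "log_statement" are left untouched.
set_log_statement_all() {
    conf=$1
    if grep -q '^[#[:space:]]*log_statement[[:space:]]*=' "$conf"; then
        sed -i "s/^[#[:space:]]*log_statement[[:space:]]*=.*/log_statement = 'all'/" "$conf"
    else
        printf "log_statement = 'all'\n" >> "$conf"
    fi
}

# The server must re-read its configuration afterwards, for example
# (as the postgres user):
#     pg_ctl reload -D "$PGDATA"
```

Once the server has reloaded, each statement appears in the server log, so the last statements logged before a crash are the ones to look at.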


Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Dave Johansen
Date:
On Thu, Feb 12, 2015 at 7:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dave Johansen <davejohansen@gmail.com> writes:
> > Also, could this be caused by an issue with or misuse of libpq/libpqxx in our
> > application? If so, is there some way that we could turn on some sort of
> > logging to see the queries or at least connections that we were opening
> > when or just before the crash happened?
>
> A client should not be able to crash the server.  (Ideally, anyway...)

That's what I figured, but wanted to check to be sure.

> As for logging, log_statement = all ought to help, though it might be
> verbose.

Thankfully the crash is only happening on our development server and not the operational one, so the traffic level isn't too high and logging all the statements shouldn't be too insane.

Thanks,
Dave

Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Dave Johansen
Date:
On Thu, Feb 12, 2015 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dave Johansen <davejohansen@gmail.com> writes:
> > I'm running Postgres 8.4.20 on RHEL 6.4 and it will occasionally crash. The
> > postgres.log file just says that a PID was terminated. The output from
> > dmesg has a message like this one:
> > postmaster[22905]: segfault at 686 ip 0000000000000686 sp 00007fff83d72e88
> > error 14 in postgres[400000+463000]
>
> > What can I do to try and figure out what is causing the crash and fix it?
>
> (1) install relevant postgresql-debuginfo package (assuming we're talking
> about a Red Hat-originated postgres package)
>
> (2) run postmaster under "ulimit -c unlimited" (easiest way is probably
> to add such a command to /etc/rc.d/init.d/postgresql and restart the
> service)
>
> (3) wait for crash
>
> (4) gdb the resulting corefile (should be under your $PGDATA directory)
>
> (5) send in a stack trace.

Here's the stacktrace from gdb (if it matters, the package version from RHEL is postgresql-8.4.18-1.el6_4.x86_64):
#0  0x0000000000000686 in ?? ()
#1  0x00007f76ae551801 in ?? ()
#2  0x00000000019f7793 in ?? ()
#3  0x00007fff06ad6be0 in ?? ()
#4  0x00007fff06ad6be0 in ?? ()
#5  0x0000000000545e35 in ExecMakeFunctionResult (fcache=0x19f5680, econtext=0x19f37e8, isNull=0x19f7793 "", isDone=0x19f7b8c) at execQual.c:1870
#6  0x0000000000541096 in ExecTargetList (projInfo=<value optimized out>, isDone=0x7fff06ad704c) at execQual.c:5212
#7  ExecProject (projInfo=<value optimized out>, isDone=0x7fff06ad704c) at execQual.c:5427
#8  0x0000000000553c5b in ExecResult (node=0x1999a68) at nodeResult.c:155
#9  0x00000000005406c8 in ExecProcNode (node=0x1999a68) at execProcnode.c:344
#10 0x000000000053e942 in ExecutePlan (queryDesc=0x1990c60, direction=<value optimized out>, count=0) at execMain.c:1542
#11 standard_ExecutorRun (queryDesc=0x1990c60, direction=<value optimized out>, count=0) at execMain.c:310
... (I can include the rest, if it's needed)

Any insight?
Thanks,
Dave

Re: Postgres 8.4.20 seqfault on RHEL 6.4

From
Dave Johansen
Date:
So from looking at the stack trace, it looked like the issue was happening in one of our C functions. I did some digging, and what had happened was that the permissions on the folder that holds those functions had been set wide open, so whenever someone built our software it overwrote the .so files. Normally, that's only done by the postgres user when a new "version" is rolled out, but that restriction was being bypassed because of the incorrect permissions.

So that brings up a different question that I will start a new thread for.

Thanks for the help,
Dave
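
A check along the following lines could catch the permissions problem described in the message above. It is only a sketch: the function name list_world_writable is made up, and the directory to scan is an assumption (for functions installed in the server's library directory, `pg_config --pkglibdir` prints that path).

```shell
# List directories and .so files under $1 that any user may write to.
# -perm -0002 matches entries whose "other" write bit is set.
list_world_writable() {
    find "$1" \( -type d -o -name '*.so' \) -perm -0002
}

# Anything printed is a candidate for tightening, for example:
#     chmod o-w /path/printed/above
```

Running it periodically against the directory that holds the C function libraries, and treating any output as an error, would flag the problem before a stray build overwrites a library that a running backend has loaded.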