Thread: BUG #6222: Segmentation fault on unlogged table

BUG #6222: Segmentation fault on unlogged table

From

"Fujii Masao"

Date:

26 September 2011, 01:16:45

The following bug has been logged online:

Bug reference:      6222
Logged by:          Fujii Masao
Email address:      masao.fujii@gmail.com
PostgreSQL version: 9.2dev
Operating system:   Linux hermes 2.6.38-11-generic #50-Ubuntu SMP Mon Sep 12
21:18:14 UTC 2011 i686 i686 i386 GNU/Linux
Description:        Segmentation fault on unlogged table
Details:

When I made pgbench tables unlogged and ran pgbench, I encountered a
segmentation fault. Here is the test case which reproduces a segmentation
fault:

$ pgbench -i
$ pg_dump > pgbench.dump
$ sed s/"CREATE TABLE"/"CREATE UNLOGGED TABLE"/g pgbench.dump >
pgbench_unlogged.dump
$ createdb test
$ psql -f pgbench_unlogged.dump test
$ pgbench -c2 -T60 -Mprepared

LOG:  server process (PID 23154) was terminated by signal 11: Segmentation
fault
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
<snip>

The backtrace from core file shows that SEGV comes from ExecQual():

$ gdb -c data/core bin/postgres
<snip>
Core was generated by `postgres: postgres test [local] UPDATE      '.
Program terminated with signal 11, Segmentation fault.
#0  0x0821a6fa in ExecQual ()
(gdb) bt
#0  0x0821a6fa in ExecQual ()
#1  0x0821aff1 in ExecScan ()
#2  0x0823035e in ExecSeqScan ()
#3  0x082113fe in ExecProcNode ()
#4  0x0822e1f0 in ExecModifyTable ()
#5  0x082113b2 in ExecProcNode ()
#6  0x0820f273 in ExecutePlan ()
#7  0x0820d93b in standard_ExecutorRun ()
#8  0x0820d7af in ExecutorRun ()
#9  0x0832886f in ProcessQuery ()
#10 0x08329e7c in PortalRunMulti ()
#11 0x083295b1 in PortalRun ()
#12 0x083251a8 in exec_execute_message ()
#13 0x08327c67 in PostgresMain ()
#14 0x082d0fb2 in BackendRun ()
#15 0x082d05c0 in BackendStartup ()
#16 0x082cd5c8 in ServerLoop ()
#17 0x082ccf48 in PostmasterMain ()
#18 0x0824d748 in main ()

BTW, when I did the same test on MacOS, I got a bus error instead of SEGV.

Re: BUG #6222: Segmentation fault on unlogged table

From

Robert Haas

Date:

26 September 2011, 15:01:10

On Sun, Sep 25, 2011 at 9:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> When I made pgbench tables unlogged and ran pgbench, I encountered a
> segmentation fault. Here is the test case which reproduces a segmentation
> fault:
>
> $ pgbench -i
> $ pg_dump > pgbench.dump
> $ sed s/"CREATE TABLE"/"CREATE UNLOGGED TABLE"/g pgbench.dump >
> pgbench_unlogged.dump
> $ createdb test
> $ psql -f pgbench_unlogged.dump test
> $ pgbench -c2 -T60 -Mprepared

The first time I did this (on MacOS X), I got a bus error, just as you
did.  But I followed the exact series of commands above, which I
believe would end up with the test running against the default
database (with regular tables), rather than the test database with
unlogged tables.  The second time, it worked OK against the regular
tables, so I then tried it again against the unlogged tables (by
running the last command in the series again with "test" added to the
end) and it then crashed with the following backtrace:

#0  0x0000000101153560 in ?? ()
#1  0x00000001001540fe in ExecScan (node=0x7fff5fbfde4f,
accessMtd=0x100167150 <SeqNext>, recheckMtd=0x100166ef0 <SeqRecheck>)
at execScan.c:195
#2  0x000000010014bcb2 in ExecProcNode (node=0x101151f30) at execProcnode.c:393
#3  0x0000000100164691 in ExecModifyTable (node=0x101151c20) at
nodeModifyTable.c:766
#4  0x000000010014bcf2 in ExecProcNode (node=0x101151c20) at execProcnode.c:370
#5  0x000000010014a10e in ExecutePlan [inlined] () at
/Users/rhaas/pgsql/src/backend/executor/execMain.c:1438
#6  0x000000010014a10e in standard_ExecutorRun (queryDesc=0x10113f388,
direction=ForwardScanDirection, count=0) at execMain.c:312
#7  0x000000010023f714 in ProcessQuery (plan=0x101140e98,
sourceText=0x10113f238 "UPDATE pgbench_branches SET bbalance =
bbalance + $1 WHERE bid = $2;", params=<value temporarily unavailable,
due to optimizations>, dest=0x100520420, completionTag=0x7fff5fbfe450
"") at pquery.c:185
#8  0x000000010023fe77 in PortalRunMulti (portal=0x10113d238,
isTopLevel=<value temporarily unavailable, due to optimizations>,
dest=0x100520420, altdest=0x100520420, completionTag=0x7fff5fbfe450
"") at pquery.c:1274
#9  0x0000000100240888 in PortalRun (portal=0x10113d238,
count=9223372036854775807, isTopLevel=<value temporarily unavailable,
due to optimizations>, dest=0x101055a50, altdest=0x101055a50,
completionTag=0x7fff5fbfe450 "") at pquery.c:8&
#11 0x00000001001ee7ec in ServerLoop () at postmaster.c:3512
#12 0x00000001001ef6d7 in PostmasterMain (argc=1, argv=0x100c08c40) at
postmaster.c:1093
#13 0x00000001001824b5 in main (argc=1, argv=0x100c08c40) at main.c:199

Unfortunately, this is thoroughly unhelpful, because the top of the
stack has apparently been clobbered, and I can't see where inside
ExecQual the crash is happening.

(gdb) fr 1
#1  0x00000001001540fe in ExecScan (node=0x7fff5fbfde4f,
accessMtd=0x100167150 <SeqNext>, recheckMtd=0x100166ef0 <SeqRecheck>)
at execScan.c:195
195            if (!qual || ExecQual(qual, econtext, false))

The whole thing is a bit mysterious because ExecQual() doesn't really
do much that seems like it could generate an invalid memory reference.

I'll poke at this some more...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6222: Segmentation fault on unlogged table

From

Robert Haas

Date:

26 September 2011, 15:41:33

On Mon, Sep 26, 2011 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The whole thing is a bit mysterious because ExecQual() doesn't really
> do much that seems like it could generate an invalid memory reference.
>
> I'll poke at this some more...

I added some debugging code which sets a global variable to various
values as it executes this code.  I was still able to reproduce the
crash (but only with unlogged tables, as in your original report) and
the crash appears to be happening here:

+               where_did_we_crash = 104;
                expr_value = ExecEvalExpr(clause, econtext, &isNull, NULL);
+               where_did_we_crash = 105;

I end up with where_did_we_crash = 104 in the core dump.

ExecEvalExpr is a macro which does this:

#define ExecEvalExpr(expr, econtext, isNull, isDone) \
        ((*(expr)->evalfunc) (expr, econtext, isNull, isDone))

I can't print out the value of "clause" directly, because the
invocation of ExecQual() doesn't even show up in the stack trace.  But
I can see from the backtrace that it's getting called by ExecScan()
with an argument of qual.  That qual is a one-element list, and it's
only element is also a List.  That List contains a FuncExprState node
with an evalfunc of ExecEvalOper.  But unless I'm missing something,
that's no good, because ExecQual is only walking the outer list, not
the inner one.  And certainly if it tries to use a List object as an
ExprState, that's not going to work.

To check my work, I did this:

--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool
resultForNull)
                Datum           expr_value;
                bool            isNull;

+               Assert(!IsA(clause, List));
                expr_value = ExecEvalExpr(clause, econtext, &isNull, NULL);

                if (isNull)

And in fact the test case (when run against the unlogged tables) fails
that assertion:

TRAP: FailedAssertion("!(!((((Node*)(clause))->type) == T_List))",
File: "execQual.c", Line: 5006)

Now I'm not too sure why that is happening yet, but I'm leaning toward
the idea that the bug here is timing-related and that unlogged tables
aren't the cause, but rather just make it easier to hit whatever the
race condition is by removing some overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6222: Segmentation fault on unlogged table

From

Robert Haas

Date:

26 September 2011, 16:08:30

On Mon, Sep 26, 2011 at 11:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To check my work, I did this:
>
> --- a/src/backend/executor/execQual.c
> +++ b/src/backend/executor/execQual.c
> @@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool
> resultForNull)
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Datum =A0 =A0 =A0 =A0 =A0 expr_value;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0bool =A0 =A0 =A0 =A0 =A0 =A0isNull;
>
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 Assert(!IsA(clause, List));
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0expr_value =3D ExecEvalExpr(clause, econte=
xt, &isNull, NULL);
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (isNull)
>
> And in fact the test case (when run against the unlogged tables) fails
> that assertion:
>
> TRAP: FailedAssertion("!(!((((Node*)(clause))->type) =3D=3D T_List))",
> File: "execQual.c", Line: 5006)
>
> Now I'm not too sure why that is happening yet, but I'm leaning toward
> the idea that the bug here is timing-related and that unlogged tables
> aren't the cause, but rather just make it easier to hit whatever the
> race condition is by removing some overhead.

I cannot reproduce this on commit
e6faf910d75027bdce7cd0f2033db4e912592bcc.  But on the very next commit
I can:

commit e6faf910d75027bdce7cd0f2033db4e912592bcc
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Sep 16 00:42:53 2011 -0400

    Redesign the plancache mechanism for more flexibility and efficiency.

Tom, any thoughts?

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6222: Segmentation fault on unlogged table

From

Merlin Moncure

Date:

26 September 2011, 16:17:18

On Mon, Sep 26, 2011 at 11:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 11:40 AM, Robert Haas <robertmhaas@gmail.com> wro=
te:
>> To check my work, I did this:
>>
>> --- a/src/backend/executor/execQual.c
>> +++ b/src/backend/executor/execQual.c
>> @@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool
>> resultForNull)
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Datum =A0 =A0 =A0 =A0 =A0 expr_value;
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0bool =A0 =A0 =A0 =A0 =A0 =A0isNull;
>>
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 Assert(!IsA(clause, List));
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0expr_value =3D ExecEvalExpr(clause, econt=
ext, &isNull, NULL);
>>
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (isNull)
>>
>> And in fact the test case (when run against the unlogged tables) fails
>> that assertion:
>>
>> TRAP: FailedAssertion("!(!((((Node*)(clause))->type) =3D=3D T_List))",
>> File: "execQual.c", Line: 5006)
>>
>> Now I'm not too sure why that is happening yet, but I'm leaning toward
>> the idea that the bug here is timing-related and that unlogged tables
>> aren't the cause, but rather just make it easier to hit whatever the
>> race condition is by removing some overhead.
>
> I cannot reproduce this on commit
> e6faf910d75027bdce7cd0f2033db4e912592bcc. =A0But on the very next commit
> I can:
>
> commit e6faf910d75027bdce7cd0f2033db4e912592bcc
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: =A0 Fri Sep 16 00:42:53 2011 -0400
>
> =A0 =A0Redesign the plancache mechanism for more flexibility and efficien=
cy.
>
> Tom, any thoughts?

hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in
plancache.c, which hasn't hit the archives yet?

> "GetCachedPlan can pass the 'qlist' to the planner twice.
if i understand the code correctly, it's unsafe because the planner is
destructive wrt the input tree.  for my application, it often causes
a crash in executor."

merlin

Re: BUG #6222: Segmentation fault on unlogged table

From

Robert Haas

Date:

26 September 2011, 16:21:01

On Mon, Sep 26, 2011 at 12:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in
> plancache.c, which hasn't hit the archives yet?
>
>> "GetCachedPlan can pass the 'qlist' to the planner twice.
> if i understand the code correctly, it's unsafe because the planner is
> destructive wrt the input tree. =A0for my application, it often causes
> a crash in executor."

I was just wondering about that.  It seems like it could very well be
the same issue, but I have not tested it yet.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6222: Segmentation fault on unlogged table

From

Robert Haas

Date:

26 September 2011, 16:32:54

On Mon, Sep 26, 2011 at 12:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 12:17 PM, Merlin Moncure <mmoncure@gmail.com> wro=
te:
>> hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in
>> plancache.c, which hasn't hit the archives yet?
>>
>>> "GetCachedPlan can pass the 'qlist' to the planner twice.
>> if i understand the code correctly, it's unsafe because the planner is
>> destructive wrt the input tree. =A0for my application, it often causes
>> a crash in executor."
>
> I was just wondering about that. =A0It seems like it could very well be
> the same issue, but I have not tested it yet.

OK, I tested it.  On my system, that patch appears to fix this problem.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6222: Segmentation fault on unlogged table

From

Tom Lane

Date:

26 September 2011, 16:39:35

Robert Haas <robertmhaas@gmail.com> writes:
>> I was just wondering about that.  It seems like it could very well be
>> the same issue, but I have not tested it yet.

> OK, I tested it.  On my system, that patch appears to fix this problem.

Yeah, I was just going to suggest that that might be related, but
I see you beat me to the deduction.  Will commit it (with more than
zero comments) in a moment.

            regards, tom lane