Re: BUG #15032: Segmentation fault when running a particular query - Mailing list pgsql-bugs

From Guo Xiang Tan
Subject Re: BUG #15032: Segmentation fault when running a particular query
Date
Msg-id CAEL+R-C0N+hM2OtE_s9xezy15w+18LOxgnYkjHh63SAi+3NxRA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #15032: Segmentation fault when running a particular query  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #15032: Segmentation fault when running a particular query
List pgsql-bugs
Unsurprisingly, the given info is not enough to reproduce the crash.

We could only reproduce this on our production PostgreSQL cluster. I took a full pg_dump and restored into a PostgreSQL cluster (10.1) locally but could not reproduce the seg fault. If it helps, the data in the cluster was restored via pg_dump (10.1) from a PostgreSQL 9.5 cluster.

On Sat, Jan 27, 2018 at 2:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> ## Query that results in segmentation fault

Unsurprisingly, the given info is not enough to reproduce the crash.
However, looking at the stack trace:

> (gdb) bt
> #0  index_markpos (scan=0x0) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/index/indexam.c:373
> #1  0x000055a812746c68 in ExecMergeJoin (pstate=0x55a8131bc778) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeMergejoin.c:1188
> #2  0x000055a81272cf3f in ExecProcNode (node=0x55a8131bc778) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
> #3  EvalPlanQualNext (epqstate=epqstate@entry=0x55a81318c518) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:3005
> #4  0x000055a81272d342 in EvalPlanQual (estate=estate@entry=0x55a81318c018,
> epqstate=epqstate@entry=0x55a81318c518,
> relation=relation@entry=0x7f4e4e25ab68, rti=1, lockmode=<optimized out>,
> tid=tid@entry=0x7ffd54492330, priorXmax=8959603) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:2521
> #5  0x000055a812747af7 in ExecUpdate (mtstate=mtstate@entry=0x55a81318c468,
> tupleid=tupleid@entry=0x7ffd54492450, oldtuple=oldtuple@entry=0x0,
> slot=<optimized out>, slot@entry=0x55a8131a2f08,
> planSlot=planSlot@entry=0x55a81319db60,
> epqstate=epqstate@entry=0x55a81318c518, estate=0x55a81318c018, canSetTag=1
> '\001') at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeModifyTable.c:1113

it seems fairly clear that somebody passed a NULL scandesc pointer
to index_markpos.  Looking at the only two callers of that function,
this must mean that either an IndexScan's node->iss_ScanDesc or an
IndexOnlyScan's node->ioss_ScanDesc was null.  (We don't see
ExecIndexMarkPos in the trace because the compiler optimized the tail
call.)  And that leads me to commit 09529a70b, which changed the logic
in those node types to allow initialization of the index scandesc to be
delayed to the first tuple fetch, rather than necessarily performed during
ExecInitNode.

Because this is happening inside an EvalPlanQual, it's unsurprising
that we'd be taking an unusual code path.  I believe what happened
was that the IndexScan node returned a jammed-in EPQ tuple on its
first call, and so hadn't opened the scandesc at all, while
ExecMergeJoin would do an ExecMarkPos if the tuple matched (which
it typically would if we'd gotten to EPQ), whereupon kaboom.

It's tempting to think that this is an oversight in commit 09529a70b
and we need to rectify it by something along the lines of teaching
ExecIndexMarkPos and ExecIndexOnlyMarkPos to initialize the scandesc
if needed before calling index_markpos.

However, on further reflection, it seems like this is a bug of far
older standing, to wit that ExecIndexMarkPos/ExecIndexRestrPos
are doing entirely the wrong thing when EPQ is active.  It's not
meaningful, or at least not correct, to be messing with the index
scan state at all in that case.  Rather, what the scan is supposed
to do is return the single jammed-in EPQ tuple, and what "restore"
ought to mean is "clear my es_epqScanDone flag so that that tuple
can be returned again".

It's not clear to me whether the failure to do that has any real
consequences though.  It would only matter if there's more than
one tuple available on the outer side of the mergejoin, which
I think there never would be in an EPQ situation.  Still, if there
ever were more outer tuples, the mergejoin would misbehave and maybe
even crash itself (because it probably assumes that restoring to
a point where there had been a tuple would allow it to re-fetch
that tuple successfully).

So what I'm inclined to do is teach the mark/restore infrastructure
to do the right thing with EPQ state when EPQ is active.  But I'm not
clear on whether that needs to be back-patched earlier than v10.

                        regards, tom lane

pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop
Next
From: Andres Freund
Date:
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop