Thread: BUG #16219: EvalPlanQualFetchRowMark segfaults on Updates

BUG #16219: EvalPlanQualFetchRowMark segfaults on Updates

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16219
Logged by:          Julian Schauder
Email address:      julian.schauder@gmx.de
PostgreSQL version: 12.1
Operating system:   Ubuntu 18.04
Description:

Howdy Hackers,

for a week now (upgrade pg11=>12) we're experiencing a bug that causes
segfaults on a daily basis. 
I've yet to reproduce this exact behaviour as the querys do not fail all the
time but rather now-and-then 
when run concurrently. ( ~4 Crashes so far )

Outputs(identifiers) are anonymised.

> 2020-01-20 01:39:24.246 CET [2112]  LOG:  server process (PID 31599) was
terminated by signal 11: Segmentation fault
> 2020-01-20 01:39:24.246 CET [2112]  DETAIL:  Failed process was running:
UPDATE partitionedtable ia SET A=A, B=now() FROM temptable tt WHERE
ia.id=tt.ia_id AND A::text IS DISTINCT FROM tt.A::text;
> 2020-01-20 01:39:24.246 CET [2112]  LOG:  terminating any other active
server processes


> #0  EvalPlanQualFetchRowMark (epqstate=0x5576bfbccfc0, rti=2,
slot=slot@entry=0x5576c888b6c0) at
./build/../src/backend/executor/execMain.c:2590
> #1  0x00005576bdf449ce in ExecScanFetch (recheckMtd=0x5576bdf65740
<SeqRecheck>, accessMtd=0x5576bdf656b0 <SeqNext>, node=0x5576c888b500) at
./build/../src/backend/executor/execScan.c:115
> #2  ExecScan (node=0x5576c888b500, accessMtd=0x5576bdf656b0 <SeqNext>,
recheckMtd=0x5576bdf65740 <SeqRecheck>) at
./build/../src/backend/executor/execScan.c:200
> #3  0x00005576bdf63f7d in ExecProcNode (node=0x5576c888b500) at
./build/../src/include/executor/executor.h:239
> #4  ExecNestLoop (pstate=<optimized out>) at
./build/../src/backend/executor/nodeNestloop.c:109
> #5  0x00005576bdf3e9a9 in ExecProcNode (node=0x5576c888b310) at
./build/../src/include/executor/executor.h:239
> #6  EvalPlanQualNext (epqstate=epqstate@entry=0x5576bfbccfc0) at
./build/../src/backend/executor/execMain.c:2695
> #7  0x00005576bdf3ee77 in EvalPlanQual
(epqstate=epqstate@entry=0x5576bfbccfc0,
relation=relation@entry=0x7fbfd57df168, rti=42,
inputslot=inputslot@entry=0x5576cfb66578) at ./build/../src/backend/>
executor/execMain.c:2467
> #8  0x00005576bdf61f5e in ExecUpdate
(mtstate=mtstate@entry=0x5576bfbccec8, tupleid=0x7ffdd552f89a, oldtuple=0x0,
slot=<optimized out>, planSlot=planSlot@entry=0x5576c153f428,
epqstate=epqstate@entry=0x5576bfbccfc0, estate=0x5576bf5df080, 
>     canSetTag=true) at
./build/../src/backend/executor/nodeModifyTable.c:1387
> #9  0x00005576bdf62df9 in ExecModifyTable (pstate=0x5576bfbccec8) at
./build/../src/backend/executor/nodeModifyTable.c:2223
> #10 0x00005576bdf3c23d in ExecProcNode (node=0x5576bfbccec8) at
./build/../src/include/executor/executor.h:239
> #11 ExecutePlan (execute_once=<optimized out>, dest=0x5576be63ec00
<donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_UPDATE,
use_parallel_mode=<optimized out>, planstate=0x5576bfbccec8, 
>     estate=0x5576bf5df080) at
./build/../src/backend/executor/execMain.c:1646
> #12 standard_ExecutorRun (queryDesc=0x5576bed15ae0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:364
> #13 0x00007fc89b1a4125 in pgss_ExecutorRun (queryDesc=0x5576bed15ae0,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
> #14 0x00005576be0993b5 in ProcessQuery (plan=<optimized out>, 
>     sourceText=0x5576bed159d0 "UPDATE partitionedtable ia SET A=A, B=now()
FROM temptable tt WHERE ia.id=tt.ia_id AND A::text IS DISTINCT FROM
tt.A::text"..., 
>     params=0x0, queryEnv=0x0, dest=0x5576be63ec00 <donothingDR>,
completionTag=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:161
> #15 0x00005576be0995f3 in PortalRunMulti
(portal=portal@entry=0x5576bea978f0, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x5576be63ec00
<donothingDR>, dest@entry=0x5576bea00890, 
>     altdest=0x5576be63ec00 <donothingDR>, altdest@entry=0x5576bea00890,
completionTag=completionTag@entry=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:1283
> #16 0x00005576be09a1ff in PortalRun (portal=portal@entry=0x5576bea978f0,
count=count@entry=1, isTopLevel=isTopLevel@entry=true, run_once=<optimized
out>, dest=dest@entry=0x5576bea00890, altdest=altdest@entry=0x5576bea00890,

>     completionTag=0x7ffdd552ff40 "") at
./build/../src/backend/tcop/pquery.c:796
> #17 0x00005576be097a31 in exec_execute_message (max_rows=1,
portal_name=0x5576bea00480 "") at
./build/../src/backend/tcop/postgres.c:2090
> #18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x5576bea2fbd0,
dbname=<optimized out>, username=<optimized out>) at
./build/../src/backend/tcop/postgres.c:4297
> #19 0x00005576be01e53b in BackendRun (port=0x5576bea27dc0,
port=0x5576bea27dc0) at
./build/../src/backend/postmaster/postmaster.c:4437
> #20 BackendStartup (port=0x5576bea27dc0) at
./build/../src/backend/postmaster/postmaster.c:4128
> #21 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
> #22 0x00005576be01f513 in PostmasterMain (argc=5, argv=0x5576be9f9090) at
./build/../src/backend/postmaster/postmaster.c:1377
> #23 0x00005576bdd95cf6 in main (argc=5, argv=0x5576be9f9090) at
./build/../src/backend/main/main.c:228


> #0  EvalPlanQualFetchRowMark (epqstate=0x5576bfbccfc0, rti=2,
slot=slot@entry=0x5576c888b6c0) at
./build/../src/backend/executor/execMain.c:2590
>         earm = 0x5576c7176258
>         erm = 0x24
>         datum = <optimized out>
>         isNull = false
>         __func__ = "EvalPlanQualFetchRowMark"
> #1  0x00005576bdf449ce in ExecScanFetch (recheckMtd=0x5576bdf65740
<SeqRecheck>, accessMtd=0x5576bdf656b0 <SeqNext>, node=0x5576c888b500) at
./build/../src/backend/executor/execScan.c:115
>         slot = 0x5576c888b6c0
>         epqstate = <optimized out>
>         scanrelid = <optimized out>
>         slot = <optimized out>
>         estate = 0x5576c888b100
>         estate = <optimized out>
>         epqstate = <optimized out>
>         scanrelid = <optimized out>
>         slot = <optimized out>
>         slot = <optimized out>
>         slot = <optimized out>
>         slot = <optimized out>
> #2  ExecScan (node=0x5576c888b500,

Excerpt shows the pointer 0x24 of erm to be uninitialized'ish. Once i can
capture a second 
stacktrace i might be able to confirm if this is indeed a random value or
always 0x24. However,
we're workarounding this by simply not running this query concurrently
anymore hence i might
lack new stacktraces soon.

> UPDATE partitionedtable ia SET A=A, B=now() FROM temptable tt WHERE
ia.id=tt.ia_id AND A::text IS DISTINCT FROM tt.A::text;
>                                                           QUERY PLAN
                                                    
>

-------------------------------------------------------------------------------------------------------------------------------
>  Update on partitionedtable ia  (cost=0.43..1202173.35 rows=465660
width=143)
>    Update on partitionedtable ia
>    Update on partitionedtable_0062 ia_1
>    Update on partitionedtable_0061 ia_2
>    Update on partitionedtable_0060 ia_3
> [repeats for another ~400 Partitions]
> [..]
>    ->  Nested Loop  (cost=0.43..3386.78 rows=1194 width=145)
>          ->  Seq Scan on temptable tt  (cost=0.00..22.00 rows=1200
width=46)
>          ->  Index Scan using partitionedtable_pk on partitionedtable ia
(cost=0.43..2.79 rows=1 width=560)
>                Index Cond: (id = tt.ia_id)
>                Filter: ((document)::text IS DISTINCT FROM
(tt.document_json)::text)
>    ->  Nested Loop  (cost=0.42..3108.99 rows=1194 width=159)
>          ->  Seq Scan on temptable tt  (cost=0.00..22.00 rows=1200
width=46)
>          ->  Index Scan using partitionedtable_062_pk on
partitionedtable_0062 ia_1  (cost=0.42..2.56 rows=1 width=518)
>                Index Cond: (id = tt.ia_id)
>                Filter: ((document)::text IS DISTINCT FROM
(tt.document_json)::text)
> [repeats for another ~400 Partitions]
> [..]
> 
>  JIT:
>    Functions: 3510
>    Options: Inlining true, Optimization true, Expressions true, Deforming
true
> (2344 rows)

The Query basically updates a massively partitioned* Table from the
contents
of a temptable.

*Inheritance-Partitioning on ID(not modified within the query)
*Inheritance has triggers. ISTM that they are not of any concern.

System is a current Ubuntu18.04 with a freshly pg_upgrade'd postgresql-12.1
pgdg-build,
including new statistics, indices and vacuum.

I'm currently trying to get this to a minimized and reproducible form as i
do not see an 
obvious error in the code yet. For the time being i fail to actually
recreate as accurate as 
production does :)

regards,
julian


Re: BUG #16219: EvalPlanQualFetchRowMark segfaults on Updates

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> for a week now (upgrade pg11=>12) we're experiencing a bug that causes
> segfaults on a daily basis. 
> I've yet to reproduce this exact behaviour as the querys do not fail all the
> time but rather now-and-then 
> when run concurrently. ( ~4 Crashes so far )

Not sure if you realize that the EvalPlanQual stuff is only called when
there's an uncommitted conflicting update on a target row.  So the way
I'd try to reproduce something like this is

session 1                            session 2

begin;
issue update query;
                                     issue same update query;
commit;

which should take the race-condition issue out of the equation,
since session 2 will certainly be trying to update the same
rows that session 1 did but hasn't yet committed.

There may be other hard-to-reproduce factors at work, of course.
If you can test on a debug build (with --enable-cassert), that'd
help remove a few more variables.

>> earm = 0x5576c7176258
>> erm = 0x24

> Excerpt shows the pointer 0x24 of erm to be uninitialized'ish.

Yeah.  If you could do "p *earm" here, that might offer some clues;
I'm wondering if that entire struct has gotten trashed (likely because
somebody freed it too soon).

            regards, tom lane