Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-committers

From Andres Freund
Subject Re: pgsql: Add parallel-aware hash joins.
Date
Msg-id 20171221104225.w46zgr3w4f2mowls@alap3.anarazel.de
Whole thread Raw
In response to Re: pgsql: Add parallel-aware hash joins.  (Andres Freund <andres@anarazel.de>)
List pgsql-committers
On 2017-12-21 01:55:50 -0800, Andres Freund wrote:
> On 2017-12-21 01:29:40 -0800, Andres Freund wrote:
> > On 2017-12-21 08:49:46 +0000, Andres Freund wrote:
> > > Add parallel-aware hash joins.
> >
> > There's to relatively mundane failures:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08
> >
> > but also one that's a lot more interesting:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08
> >
> > which shows an assert failure:
> >
> > #2  0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98
"!(!accessor->sts->participants[i].writing)",errorType=errorType@entry=0x8b2c49 "FailedAssertion",
fileName=fileName@entry=0xa76991"sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
 
> > #3  0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
> > #4  0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
> > #5  ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
> > #6  0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40,
shared=shared@entry=0x7ffee26a8868)at nodeHash.c:2778
 
> > #7  0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18,
slot=slot@entry=0xfa76f8,hashvalue=<optimized out>) at nodeHash.c:1696
 
> > #8  0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
> > #9  MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112
> >
> > which seems to suggest that something in the state machine logic is
> > borked. ExecParallelHashIncreaseNumBatches() should've ensured that
> > everyone has called sts_end_write()...

> Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
> updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
> while other backends also access ->nbatch in
> ExecParallelHashCloseBatchAccessors(). Both happens after waiting for
> the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.
>
> That'd lead to ExecParallelHashCloseBatchAccessors() likely not finish
> writing all batches (because nbatch < nbatch_old), which seems like it'd
> explain this?

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0  0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at
/home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1  0x000055bfbac42b4d in ConditionVariableCancelSleep () at
/home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2  0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3  0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at
/home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
#4  0x000055bfba91ed97 in RemoveTempRelationsCallback (code=1, arg=0) at
/home/andres/src/postgresql/src/backend/catalog/namespace.c:4034
#5  0x000055bfbac1bc90 in shmem_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:228
#6  0x000055bfbac1bb67 in proc_exit_prepare (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:185
#7  0x000055bfbac1bacf in proc_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:102
#8  0x000055bfbadbccf0 in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:543
#9  0x000055bfbac4eda3 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2917
#10 0x000055bfbac42a63 in ConditionVariableSleep (cv=0x7fcd82ae14ac, wait_event_info=134217742) at
/home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:129
#11 0x000055bfbac18405 in BarrierArriveAndWait (barrier=0x7fcd82ae1494, wait_event_info=134217742) at
/home/andres/src/postgresql/src/backend/storage/ipc/barrier.c:191
#12 0x000055bfbaa9361e in ExecParallelHashIncreaseNumBatches (hashtable=0x55bfbd0e11d0) at
/home/andres/src/postgresql/src/backend/executor/nodeHash.c:1191
#13 0x000055bfbaa962ef in ExecParallelHashTupleAlloc (hashtable=0x55bfbd0e11d0, size=40, shared=0x7ffda8967050) at
/home/andres/src/postgresql/src/backend/executor/nodeHash.c:2781
#14 0x000055bfbaa946e8 in ExecParallelHashTableInsert (hashtable=0x55bfbd0e11d0, slot=0x55bfbd089a80,
hashvalue=3825063138)at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1699
 
#15 0x000055bfbaa91d90 in MultiExecParallelHash (node=0x55bfbd089610) at
/home/andres/src/postgresql/src/backend/executor/nodeHash.c:288
#16 0x000055bfbaa919b9 in MultiExecHash (node=0x55bfbd089610) at
/home/andres/src/postgresql/src/backend/executor/nodeHash.c:112
#17 0x000055bfbaa7a500 in MultiExecProcNode (node=0x55bfbd089610) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:502
#18 0x000055bfbaa98515 in ExecHashJoinImpl (parallel=1 '\001', pstate=0x55bfbd053d50) at
/home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:291
#19 ExecParallelHashJoin (pstate=0x55bfbd053d50) at
/home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:582
#20 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053d50) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#21 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053d50) at
/home/andres/src/postgresql/src/include/executor/executor.h:242
#22 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd053698) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#23 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd053698) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#24 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd053698) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#25 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053698) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#26 0x000055bfbaa90a4e in ExecProcNode (node=0x55bfbd053698) at
/home/andres/src/postgresql/src/include/executor/executor.h:242
#27 0x000055bfbaa910d5 in gather_getnext (gatherstate=0x55bfbd053340) at
/home/andres/src/postgresql/src/backend/executor/nodeGather.c:285
#28 0x000055bfbaa90f5f in ExecGather (pstate=0x55bfbd053340) at
/home/andres/src/postgresql/src/backend/executor/nodeGather.c:216
#29 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053340) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#30 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053340) at
/home/andres/src/postgresql/src/include/executor/executor.h:242
#31 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd052c18) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#32 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd052c18) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#33 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd052c18) at
/home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#34 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd052c18) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#35 0x000055bfbaa716c6 in ExecProcNode (node=0x55bfbd052c18) at
/home/andres/src/postgresql/src/include/executor/executor.h:242
#36 0x000055bfbaa7404e in ExecutePlan (estate=0x55bfbd0529c8, planstate=0x55bfbd052c18, use_parallel_mode=1 '\001',
operation=CMD_SELECT,sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x55bfbd121ae0,
execute_once=1'\001')
 
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:1718
#37 0x000055bfbaa71cba in standard_ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0,
execute_once=1'\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:361
 
#38 0x000055bfbaa71ad4 in ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0,
execute_once=1'\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:304 
#39 0x000055bfbac52725 in PortalRunSelect (portal=0x55bfbcf56a48, forward=1 '\001', count=0, dest=0x55bfbd121ae0) at
/home/andres/src/postgresql/src/backend/tcop/pquery.c:932
#40 0x000055bfbac523b8 in PortalRun (portal=0x55bfbcf56a48, count=9223372036854775807, isTopLevel=1 '\001', run_once=1
'\001',dest=0x55bfbd121ae0, altdest=0x55bfbd121ae0, completionTag=0x7ffda8967840 "") at
/home/andres/src/postgresql/src/backend/tcop/pquery.c:773
#41 0x000055bfbac4c0a1 in exec_simple_query (query_string=0x55bfbceefaf8 "select count(*) from simple r join
bigger_than_it_lookss using (id);") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1120
 
#42 0x000055bfbac505e4 in PostgresMain (argc=1, argv=0x55bfbcf1d178, dbname=0x55bfbcf1cf30 "regression",
username=0x55bfbceec588"andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4139
 
#43 0x000055bfbabac375 in BackendRun (port=0x55bfbcf120f0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4412
#44 0x000055bfbababa74 in BackendStartup (port=0x55bfbcf120f0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4084
#45 0x000055bfbaba7d49 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1757
#46 0x000055bfbaba72d2 in PostmasterMain (argc=39, argv=0x55bfbcee9e90) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1365
#47 0x000055bfbaadb49d in main (argc=39, argv=0x55bfbcee9e90) at
/home/andres/src/postgresql/src/backend/main/main.c:228

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails, because it's
memory has since been freed.   Looking at how that can happen.

Greetings,

Andres Freund


pgsql-committers by date:

Previous
From: Andres Freund
Date:
Subject: Re: pgsql: Add parallel-aware hash joins.
Next
From: Thomas Munro
Date:
Subject: Re: pgsql: Add parallel-aware hash joins.