Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-committers
From | Andres Freund |
---|---|
Subject | Re: pgsql: Add parallel-aware hash joins. |
Date | |
Msg-id | 20171221104225.w46zgr3w4f2mowls@alap3.anarazel.de |
In response to | Re: pgsql: Add parallel-aware hash joins. (Andres Freund <andres@anarazel.de>) |
List | pgsql-committers |
On 2017-12-21 01:55:50 -0800, Andres Freund wrote:
> On 2017-12-21 01:29:40 -0800, Andres Freund wrote:
> > On 2017-12-21 08:49:46 +0000, Andres Freund wrote:
> > > Add parallel-aware hash joins.
> >
> > There's two relatively mundane failures:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08
> >
> > but also one that's a lot more interesting:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08
> >
> > which shows an assert failure:
> >
> > #2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
> > #3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
> > #4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
> > #5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
> > #6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
> > #7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
> > #8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
> > #9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112
> >
> > which seems to suggest that something in the state machine logic is
> > borked. ExecParallelHashIncreaseNumBatches() should've ensured that
> > everyone has called sts_end_write()...
>
> Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
> updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
> while other backends also access ->nbatch in
> ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
> the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.
>
> That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
> writing all batches (because nbatch < nbatch_old), which seems like it'd
> explain this?

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
#4 0x000055bfba91ed97 in RemoveTempRelationsCallback (code=1, arg=0) at /home/andres/src/postgresql/src/backend/catalog/namespace.c:4034
#5 0x000055bfbac1bc90 in shmem_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:228
#6 0x000055bfbac1bb67 in proc_exit_prepare (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:185
#7 0x000055bfbac1bacf in proc_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:102
#8 0x000055bfbadbccf0 in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:543
#9 0x000055bfbac4eda3 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2917
#10 0x000055bfbac42a63 in ConditionVariableSleep (cv=0x7fcd82ae14ac, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:129
#11 0x000055bfbac18405 in BarrierArriveAndWait (barrier=0x7fcd82ae1494, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/ipc/barrier.c:191
#12 0x000055bfbaa9361e in ExecParallelHashIncreaseNumBatches (hashtable=0x55bfbd0e11d0) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1191
#13 0x000055bfbaa962ef in ExecParallelHashTupleAlloc (hashtable=0x55bfbd0e11d0, size=40, shared=0x7ffda8967050) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:2781
#14 0x000055bfbaa946e8 in ExecParallelHashTableInsert (hashtable=0x55bfbd0e11d0, slot=0x55bfbd089a80, hashvalue=3825063138) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1699
#15 0x000055bfbaa91d90 in MultiExecParallelHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:288
#16 0x000055bfbaa919b9 in MultiExecHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:112
#17 0x000055bfbaa7a500 in MultiExecProcNode (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:502
#18 0x000055bfbaa98515 in ExecHashJoinImpl (parallel=1 '\001', pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:291
#19 ExecParallelHashJoin (pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:582
#20 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#21 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#22 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#23 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#24 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#25 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#26 0x000055bfbaa90a4e in ExecProcNode (node=0x55bfbd053698) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#27 0x000055bfbaa910d5 in gather_getnext (gatherstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:285
#28 0x000055bfbaa90f5f in ExecGather (pstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:216
#29 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#30 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053340) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#31 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#32 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#33 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#34 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#35 0x000055bfbaa716c6 in ExecProcNode (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#36 0x000055bfbaa7404e in ExecutePlan (estate=0x55bfbd0529c8, planstate=0x55bfbd052c18, use_parallel_mode=1 '\001', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x55bfbd121ae0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:1718
#37 0x000055bfbaa71cba in standard_ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:361
#38 0x000055bfbaa71ad4 in ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:304
#39 0x000055bfbac52725 in PortalRunSelect (portal=0x55bfbcf56a48, forward=1 '\001', count=0, dest=0x55bfbd121ae0) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:932
#40 0x000055bfbac523b8 in PortalRun (portal=0x55bfbcf56a48, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x55bfbd121ae0, altdest=0x55bfbd121ae0, completionTag=0x7ffda8967840 "") at /home/andres/src/postgresql/src/backend/tcop/pquery.c:773
#41 0x000055bfbac4c0a1 in exec_simple_query (query_string=0x55bfbceefaf8 "select count(*) from simple r join bigger_than_it_lookss using (id);") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1120
#42 0x000055bfbac505e4 in PostgresMain (argc=1, argv=0x55bfbcf1d178, dbname=0x55bfbcf1cf30 "regression", username=0x55bfbceec588 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4139
#43 0x000055bfbabac375 in BackendRun (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4412
#44 0x000055bfbababa74 in BackendStartup (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4084
#45 0x000055bfbaba7d49 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1757
#46 0x000055bfbaba72d2 in PostmasterMain (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1365
#47 0x000055bfbaadb49d in main (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/main/main.c:228

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails, because its
memory has since been freed.

Looking at how that can happen.

Greetings,

Andres Freund
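
For reference, a minimal standalone sketch of the failure mode described above: a process-local pointer to a condition variable that lives in a shared segment, and cleanup that touches it after the segment is gone. This is a toy, not PostgreSQL code. DemoSegment, DemoCV and every demo_* function are hypothetical names, and the "mapped" flag only stands in for the segment having been destroyed. That cancelling the sleep before detaching would be the appropriate fix is an assumption for illustration; the message above does not establish the actual cause or remedy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a condition variable stored in shared memory. */
typedef struct DemoCV
{
	int			mutex;			/* stands in for the CV's spinlock */
} DemoCV;

/* Hypothetical stand-in for a DSM segment that contains the CV. */
typedef struct DemoSegment
{
	bool		mapped;			/* false once the segment is "destroyed" */
	DemoCV		cv;
} DemoSegment;

/* Process-local note of where we are sleeping, in the spirit of cv_sleep_target. */
static DemoSegment *demo_sleep_segment = NULL;

/* Remember which segment holds the CV we are about to sleep on. */
void
demo_prepare_to_sleep(DemoSegment *seg)
{
	assert(seg != NULL);
	demo_sleep_segment = seg;
}

/* Model destroying the segment: its contents must not be touched afterwards. */
void
demo_detach(DemoSegment *seg)
{
	assert(seg != NULL);
	seg->mapped = false;
}

/*
 * Cancel a pending sleep. Returns false if doing so would have required
 * touching a segment that is already gone (the real code would crash there,
 * the toy just reports it).
 */
bool
demo_cancel_sleep(void)
{
	bool		ok = true;

	if (demo_sleep_segment != NULL)
	{
		ok = demo_sleep_segment->mapped;
		if (ok)
			demo_sleep_segment->cv.mutex = 0;	/* release our claim on the CV */
		demo_sleep_segment = NULL;
	}
	return ok;
}

/* Abort path with a selectable cleanup order. */
bool
demo_abort(DemoSegment *seg, bool cancel_first)
{
	bool		ok;

	if (cancel_first)
	{
		ok = demo_cancel_sleep();
		demo_detach(seg);
	}
	else
	{
		demo_detach(seg);
		ok = demo_cancel_sleep();
	}
	return ok;
}
```

Calling demo_prepare_to_sleep() and then demo_abort(seg, false) reports the problem (the cancel step finds the segment already gone), while demo_abort(seg, true) completes cleanly because the sleep is cancelled while the segment is still mapped.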