Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pgsql: Add parallel-aware hash joins.
Date
Msg-id 20180104200647.wna6yoin7nav5qqo@alap3.anarazel.de
Whole thread Raw
In response to Re: pgsql: Add parallel-aware hash joins.  (Andres Freund <andres@anarazel.de>)
Responses Re: pgsql: Add parallel-aware hash joins.
List pgsql-hackers
On 2018-01-04 11:20:33 -0800, Andres Freund wrote:
> On 2018-01-04 12:11:37 -0500, Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Also, what the devil is happening on skink?
> > 
> > > So, skink is apparently dying during shutdown of a user-connected
> > > backend, and specifically the one that executed the 'tablespace' test.
> > 
> > Well, yeah, valgrind is burping: the postmaster log is full of
> > 
> > ==10544== VALGRINDERROR-BEGIN
> > ==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> > ==10544==    at 0x7011490: epoll_pwait (epoll_pwait.c:42)
> > ==10544==    by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048)
> > ==10544==    by 0x4BF40B: WaitEventSetWait (latch.c:1000)
> > ==10544==    by 0x3C0B3B: secure_read (be-secure.c:166)
> > ==10544==    by 0x3CCD9E: pq_recvbuf (pqcomm.c:963)
> > ==10544==    by 0x3CDA07: pq_getbyte (pqcomm.c:1006)
> > ==10544==    by 0x4E2A2D: SocketBackend (postgres.c:339)
> > ==10544==    by 0x4E444E: ReadCommand (postgres.c:512)
> > ==10544==    by 0x4E7588: PostgresMain (postgres.c:4085)
> > ==10544==    by 0x4641D0: BackendRun (postmaster.c:4412)
> > ==10544==    by 0x467308: BackendStartup (postmaster.c:4084)
> > ==10544==    by 0x4675F7: ServerLoop (postmaster.c:1757)
> > ==10544==    by 0x4689D4: PostmasterMain (postmaster.c:1365)
> > ==10544==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> > ==10544== 
> > ==10544== VALGRINDERROR-END
> > 
> > But (a) this is happening in multiple branches, and (b) we've not
> > changed anything near that code in awhile.  I think something broke
> > in valgrind itself.
> 
> Some packages on skink have been upgraded. It appears that there either
> was a libc or valgrind change that made valgrind not recognize that a
> pointer of 0 might not point anywhere :(

It looks like recent glibc implements epoll_wait() as epoll_pwait() with
a NULL sigmask argument, and valgrind isn't happy with it.


> Let me check whether valgrind accept multiple suppression files, in
> which case I could add a suppression for this error to all
> branches.

Done that. I've added a valgrind-global.supp that has:

{
   null-sigmask-to-pwait-means-no-sigmask
   Memcheck:Param
   epoll_pwait(sigmask)
   fun:epoll_pwait
   fun:WaitEventSetWaitBlock
}


> Will also check whether I can reproduce locally.

Locally the above suppression allowed me to continue a bit further. But
I promptly hit other new blurbs:

assert build:
2018-01-04 12:03:45.927 PST [5716][3/189] STATEMENT:  select format('%1$s %4$s', 1, 2, 3);
==5718== Invalid write of size 8
==5718==    at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718==    by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718==    by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718==    by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718==    by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718==    by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718==    by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718==    by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718==    by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718==    by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718==    by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718==    by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718==    by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718==    by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718==    by 0x5F8E26: ProcessUtility (utility.c:357)
==5718==    by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718==    by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718==    by 0x5F74DE: PortalRun (pquery.c:799)
==5718==    by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718==    by 0x5F5640: PostgresMain (postgres.c:4143)
==5718==  Address 0x20 is not stack'd, malloc'd or (recently) free'd
==5718== 
{
   <insert_a_suppression_name_here>
   Memcheck:Addr8
   fun:ExecInterpExpr
   fun:ExecInterpExprStillValid
   fun:ExecEvalExpr
   fun:exec_eval_simple_expr
   fun:exec_eval_expr
   fun:exec_stmt_raise
   fun:exec_stmt
   fun:exec_stmts
   fun:exec_stmt_block
   fun:plpgsql_exec_function
   fun:plpgsql_inline_handler
   fun:OidFunctionCall1Coll
   fun:ExecuteDoStmt
   fun:standard_ProcessUtility
   fun:ProcessUtility
   fun:PortalRunUtility
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
==5718== 
==5718== Process terminating with default action of signal 11 (SIGSEGV)
==5718==  Access not within mapped region at address 0x20
==5718==    at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718==    by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718==    by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718==    by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718==    by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718==    by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718==    by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718==    by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718==    by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718==    by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718==    by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718==    by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718==    by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718==    by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718==    by 0x5F8E26: ProcessUtility (utility.c:357)
==5718==    by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718==    by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718==    by 0x5F74DE: PortalRun (pquery.c:799)
==5718==    by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718==    by 0x5F5640: PostgresMain (postgres.c:4143)
==5718==  If you believe this happened as a result of a stack
==5718==  overflow in your program's main thread (unlikely but
==5718==  possible), you can try to increase the size of the
==5718==  main thread stack using the --main-stacksize= flag.
==5718==  The main thread stack size used in this run was 8388608.

This looks like it might be related to recent plpgsql changes. Looks
like some pointer in:
            /* call input function (similar to InputFunctionCall) */
            if (!op->d.iocoerce.finfo_in->fn_strict || str != NULL)
is NULL.


==4286== Conditional jump or move depends on uninitialised value(s)
==4286==    at 0x6BA8A17: __wcsnlen_avx2 (strlen-avx2.S:261)
==4286==    by 0x6AF2FF1: wcsrtombs (wcsrtombs.c:104)
==4286==    by 0x6A88A40: wcstombs (wcstombs.c:34)
==4286==    by 0x514772: lowerstr_with_len.part.1 (ts_locale.c:284)
==4286==    by 0x523A63: readstoplist (ts_utils.c:119)
==4286==    by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286==    by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286==    by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286==    by 0x514E02: LexizeExec (ts_parse.c:203)
==4286==    by 0x5153ED: parsetext (ts_parse.c:405)
==4286==    by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286==    by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286==    by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286==    by 0x456069: evaluate_expr (clauses.c:4873)
==4286==    by 0x4588B5: evaluate_function (clauses.c:4424)
==4286==    by 0x4588B5: simplify_function (clauses.c:4064)
==4286==    by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286==    by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286==    by 0x458E43: simplify_function (clauses.c:4055)
==4286==    by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286==    by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286==    by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286==  Uninitialised value was created by a heap allocation
==4286==    at 0x4C2BB76: malloc (vg_replace_malloc.c:299)
==4286==    by 0x6A9667: AllocSetAlloc (aset.c:945)
==4286==    by 0x6B5AC7: palloc (mcxt.c:848)
==4286==    by 0x5238EA: get_tsearch_config_filename (ts_utils.c:55)
==4286==    by 0x5239AF: readstoplist (ts_utils.c:75)
==4286==    by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286==    by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286==    by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286==    by 0x514E02: LexizeExec (ts_parse.c:203)
==4286==    by 0x5153ED: parsetext (ts_parse.c:405)
==4286==    by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286==    by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286==    by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286==    by 0x456069: evaluate_expr (clauses.c:4873)
==4286==    by 0x4588B5: evaluate_function (clauses.c:4424)
==4286==    by 0x4588B5: simplify_function (clauses.c:4064)
==4286==    by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286==    by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286==    by 0x458E43: simplify_function (clauses.c:4055)
==4286==    by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286==    by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286==    by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:__wcsnlen_avx2
   fun:wcsrtombs
   fun:wcstombs
   fun:lowerstr_with_len.part.1
   fun:readstoplist
   fun:dsnowball_init
   fun:OidFunctionCall1Coll
   fun:lookup_ts_dictionary_cache
   fun:LexizeExec
   fun:parsetext
   fun:to_tsvector_byid
   fun:ExecInterpExpr
   fun:ExecEvalExprSwitchContext
   fun:evaluate_expr
   fun:evaluate_function
   fun:simplify_function
   fun:eval_const_expressions_mutator
   fun:expression_tree_mutator
   fun:simplify_function
   fun:eval_const_expressions_mutator
   fun:expression_tree_mutator
   fun:eval_const_expressions_mutator
}

I don't recall any recent changes in the area, so this might be a glibc
internal issue.

Greetings,

Andres Freund


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] wrong t_bits alignment in pageinspect
Next
From: Tom Lane
Date:
Subject: Re: pgsql: Add parallel-aware hash joins.