Re: stress test for parallel workers - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: stress test for parallel workers
Date
Msg-id 20190723230440.GU22387@telsasoft.com
Whole thread Raw
In response to Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
Re: stress test for parallel workers  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > >         edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw..unless someone you know a trick, it's "optimized out".
> 
> How about something like this:
> 
> print errorData[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false,
hide_ctx= false, filename = 0x27b3790 "< %m %u >", lineno = 41745456, 
 
  funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0, context_domain = 0x27cff90 "postgres",
sqlerrcode= 0, message = 0xe8800000001 <Address 0xe8800000001 out of bounds>, 
 
  detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context
=0x297a <Address 0x297a out of bounds>, message_id = 0x0, schema_name = 0x0, 
 
  table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0,
internalquery= 0x0, saved_errno = 0, assoc_context = 0x0}
 
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false,
hide_ctx= false, filename = 0x9c4030 "origin.c", lineno = 591, 
 
    funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810
"postgres-11",sqlerrcode = 4293, 
 
    message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device",
detail= 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, 
 
    message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I ought to have remembered that it *was* in fact out of space this AM when this
core was dumped (due to having not touched it since scheduling transition to
this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being
held opened by worker processes ?  Including by parallel workers?  Probably
WALs, even after they're rotated ?  If there were worker processes holding
opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
obvious after they die, since the space would then be freed.

Justin



pgsql-hackers by date:

Previous
From: Nikita Glukhov
Date:
Subject: Re: Support for jsonpath .datetime() method
Next
From: Tom Lane
Date:
Subject: Re: pgbench tests vs Windows