Re: stress test for parallel workers - Mailing list pgsql-hackers
| From | Justin Pryzby |
|---|---|
| Subject | Re: stress test for parallel workers |
| Date | |
| Msg-id | 20190723230440.GU22387@telsasoft.com |
| In response to | Re: stress test for parallel workers (Thomas Munro <thomas.munro@gmail.com>) |
| Responses | Re: stress test for parallel workers ; Re: stress test for parallel workers ; Re: stress test for parallel workers |
| List | pgsql-hackers |
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > > edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw.. unless someone knows a trick, it's "optimized out".
>
> How about something like this:
>
> print errorData[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false,
  hide_ctx = false, filename = 0x27b3790 "< %m %u >", lineno = 41745456,
  funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0, context_domain = 0x27cff90 "postgres",
  sqlerrcode = 0, message = 0xe8800000001 <Address 0xe8800000001 out of bounds>,
  detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>,
  context = 0x297a <Address 0x297a out of bounds>, message_id = 0x0, schema_name = 0x0, table_name = 0x0,
  column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0,
  internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}

(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false,
  hide_ctx = false, filename = 0x9c4030 "origin.c", lineno = 591, funcname = 0x9c46e0 "CheckPointReplicationOrigin",
  domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
  message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device",
  detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
  message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I ought to have remembered that it *was* in fact out of space this AM when this core was dumped (because we hadn't touched it since scheduling the transition to this VM last week).  I want to say I'm almost certain it wasn't ENOSPC in the other cases, since, failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being held open by worker processes?  Including by parallel workers?  Probably WALs, even after they're rotated?  If worker processes were holding open lots of rotated WALs, that could cause ENOSPC, but it wouldn't be obvious after they die, since the space would then be freed.

Justin
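As an aside on why $2 prints garbage while $3 looks sane: a guess, based on the declarations in src/backend/utils/error/elog.c, sketched below from a PG11-era tree (reproduced from memory, so treat as approximate).  By the time errfinish reaches the abort() for a PANIC it has presumably already popped its error frame, so errordata_stack_depth is back at -1 and indexing errordata with it reads before the start of the array; errordata[0], as printed in $3, still holds the last error that was actually reported.

    /* Sketch of the relevant declarations in src/backend/utils/error/elog.c
     * (PG11-era; approximate, for context only). */
    #define ERRORDATA_STACK_SIZE  5

    static ErrorData errordata[ERRORDATA_STACK_SIZE];

    static int errordata_stack_depth = -1;   /* index of topmost active frame */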
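And to spell out the last point about the space only coming back after the workers die, here is a hypothetical standalone C demo (generic Unix behaviour, not PostgreSQL code): blocks belonging to a removed file stay allocated until the last open descriptor on it is closed, so df shows the space as gone while ls/du show nothing, and it reappears the moment the holder exits.

    /* Hypothetical demo: disk space for an unlinked file is only released
     * when the last open file descriptor on it is closed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    static unsigned long long free_bytes(const char *path)
    {
        struct statvfs sv;
        if (statvfs(path, &sv) != 0) { perror("statvfs"); exit(1); }
        return (unsigned long long) sv.f_bavail * sv.f_frsize;
    }

    int main(void)
    {
        const char *name = "space_demo.tmp";
        int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Write 64MB of zeroes so the effect is visible in the numbers. */
        static char buf[1 << 20];
        memset(buf, 0, sizeof(buf));
        for (int i = 0; i < 64; i++)
            if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); return 1; }
        fsync(fd);

        unlink(name);               /* file vanishes from ls/du ... */
        printf("after unlink, fd still open: %llu bytes free\n", free_bytes("."));

        close(fd);                  /* ... but the space comes back only here */
        printf("after close:                 %llu bytes free\n", free_bytes("."));
        return 0;
    }

On most filesystems the free-space figure should only jump back up after the close(), which matches the idea that an ENOSPC caused by leaked descriptors would no longer be visible by the time df is run after the processes have exited.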