Re: stress test for parallel workers - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: stress test for parallel workers
Date
Msg-id CA+hUKG+mZ=FjC3jyGK94vjJjL+SgO7ocFcF2Pm7MBWo6nuSKzg@mail.gmail.com
In response to Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different.  Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log?  You mentioned that you couldn't see any signs of the OOM
killer.  Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system?  You might try typing
this into Python:

x = [42]
for i in range(1000):
  x = x + x

On my non-Linux system, it ran for a while and then was killed, and
dmesg showed:

pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard to distinguish between a web browser
and a program designed to eat memory as fast as possible...  Anyway on
Linux you should see stuff about killed processes and/or OOM in one of
dmesg, syslog, messages.
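
If it helps, here is a minimal sketch of scanning saved log text for
that kind of message.  The helper name find_oom_lines and the pattern
list are assumptions made for illustration, not a complete list of what
any kernel prints, so adjust them to whatever your system actually logs.

```python
import re

# Hypothetical pattern list: phrases that tend to appear when a kernel
# kills a process for lack of memory.  An assumption, not exhaustive.
OOM_PATTERN = re.compile(
    r"oom-killer|out of memory|out of swap space|killed process",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines (newline stripped) that look like OOM-kill reports."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

You could feed it the saved output of dmesg, or an open log file, and
print whatever comes back.  An empty result only means none of these
phrases matched, not that nothing was killed.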

> But that gives me an idea: is it possible there's an issue with files being
> held opened by worker processes ?  Including by parallel workers?  Probably
> WALs, even after they're rotated ?  If there were worker processes holding
> opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create
temporary files.  If you're building humongous indexes with parallel
workers, you'll get some of those, but I don't think it'd be more than
you'd get without parallelism.  If you were using up all of your disk
space with temporary files, wouldn't this be reproducible?  I think
you said you were testing this repeatedly, so if that were the problem
I'd expect to see some non-panicky out-of-space errors when the temp
files blow out your disk space, and only rarely a panic if a
checkpoint happens to run exactly at a moment where the create index
hasn't yet written the byte that breaks the camel's back, but the
checkpoint pushes it over the edge in one of these places where it panics
on failure.
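
One way to catch a transient shortage that running df after the fact
would miss is to sample free space while the test runs.  A minimal
sketch, assuming Python 3; the helper name min_free_bytes and its
default sample count and interval are placeholders, not anything from
your setup.

```python
import shutil
import time


def min_free_bytes(path, samples=60, interval=1.0):
    """Sample free space on the filesystem holding `path`.

    Returns the lowest number of free bytes observed across the samples.
    """
    lowest = None
    for i in range(samples):
        free = shutil.disk_usage(path).free
        if lowest is None or free < lowest:
            lowest = free
        # Sleep between samples, but not after the last one.
        if i + 1 < samples:
            time.sleep(interval)
    return lowest
```

Run it against the data directory (or wherever temp files land) for the
duration of a test run.  If the lowest value is nowhere near zero, temp
files are probably not what is filling the disk.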

-- 
Thomas Munro
https://enterprisedb.com


