Re: stress test for parallel workers - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: stress test for parallel workers
Msg-id 20190724003343.GV22387@telsasoft.com
In response to Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part, i.e. failing to drop the
previously loaded DB instance.  It occurred to me to check inodes, since running
out of them can also cause ENOSPC.  This filesystem was made with
mkfs -T largefile, so running out of inodes is not an impossibility.  But it
seems an unlikely culprit, unless something made tens of thousands of (small)
files.

[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem           Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres
                      65536  5605 59931    9% /var/lib/pgsql
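
For anyone who wants to watch both limits while a restore is running, the same
numbers that df and df -i print are available from statvfs(3).  This is only a
minimal sketch, not something I ran during the failures; the helper name
fs_headroom is made up for illustration, and the default path "." is a
placeholder (pass /var/lib/pgsql or similar on the command line).

```python
import os
import sys


def fs_headroom(path):
    """Return (free_block_fraction, free_inode_fraction) for the
    filesystem that holds `path`, as seen by an unprivileged process."""
    st = os.statvfs(path)
    # f_bavail / f_favail exclude root-reserved space, which is what matters
    # for a postgres process that is not running as root.
    blocks = st.f_bavail / st.f_blocks if st.f_blocks else 1.0
    # Some filesystems report f_files == 0 (no fixed inode table); treat
    # that as "not inode-limited" rather than dividing by zero.
    inodes = st.f_favail / st.f_files if st.f_files else 1.0
    return blocks, inodes


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    free_blocks, free_inodes = fs_headroom(target)
    print(f"{target}: {free_blocks:.1%} blocks free, {free_inodes:.1%} inodes free")
```

Either fraction reaching zero would surface as ENOSPC, which is why both are
reported.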

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different.  Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log?  You mentioned that you couldn't see any signs of the OOM
> killer.  Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
        Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB
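
To check saved kernel logs for the same signature after a postmaster death, a
small filter over dmesg output could look like the sketch below.  It assumes
the "Killed process <pid> ... (<name>)" wording shown above; the message format
differs between kernel versions, so the pattern may need adjusting.  The names
OOM_KILLED and oom_victims are hypothetical.

```python
import re

# Matches the "Killed process ..." line the OOM killer logs for its victim.
# The text between the pid and the parenthesised command name is skipped,
# since it varies (here it is ", UID 503, ").
OOM_KILLED = re.compile(r"Killed process (?P<pid>\d+)\b.*?\((?P<comm>[^)]*)\)")


def oom_victims(log_text):
    """Return a list of (pid, command) pairs for OOM kills found in log_text."""
    victims = []
    for line in log_text.splitlines():
        m = OOM_KILLED.search(line)
        if m:
            victims.append((int(m.group("pid")), m.group("comm")))
    return victims


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python this_script.py
    for pid, comm in oom_victims(sys.stdin.read()):
        print(f"OOM killer killed pid {pid} ({comm})")
```

An empty result from a log that covers the time of the failure would be
consistent with the OOM killer not being involved, though it would not prove
it if the log had already wrapped.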

I wouldn't burn too much more time on it until I can reproduce it.  The
failures were all during pg_restore, so checkpointer would've been very busy.
It seems possible for it to notice ENOSPC before the workers do, since they
would be fsyncing WAL, whereas checkpointer is fsyncing data.

> Admittedly it is quite hard to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here but still clearly 2nd place.

Justin


