On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.
I meant it wasn't a trivial error on my part of failing to drop the previously
loaded DB instance. It occurred to me to check inodes, since exhausting them
can also cause ENOSPC. This is mkfs -T largefile, so running out of inodes is
not an impossibility. But it seems an unlikely culprit, unless something made
tens of thousands of (small) files.
[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/data-postgres
                       65536    5605   59931    9% /var/lib/pgsql
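
In case it's useful for catching this next time, here is a rough sketch of
checking blocks and inodes together rather than remembering to run both df
and df -i. It assumes Python's os.statvfs(); the helper name fs_usage is
made up for illustration and isn't anything I'm actually running here.

```python
import os


def fs_usage(path):
    """Return (block_use_pct, inode_use_pct) for the filesystem holding path."""
    st = os.statvfs(path)

    def pct(total, free):
        # Some filesystems report zero total inodes; avoid dividing by zero.
        return 0.0 if total == 0 else 100.0 * (total - free) / total

    # f_blocks/f_bfree count data blocks, f_files/f_ffree count inodes.
    return pct(st.f_blocks, st.f_bfree), pct(st.f_files, st.f_ffree)


if __name__ == "__main__":
    blocks, inodes = fs_usage(".")
    print("blocks %.0f%% used, inodes %.0f%% used" % (blocks, inodes))
```

Either number reaching 100% would produce ENOSPC, so logging both during a
pg_restore would rule inodes in or out.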
> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different. Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log? You mentioned that you couldn't see any signs of the OOM
> killer. Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?
$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
Maximum resident set size (kbytes): 4276
$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB
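
So that's what an OOM kill looks like here. A sketch of scanning kernel log
lines for that signature follows; the regex is written against the "Killed
process" line above only, and other kernel versions word it differently, so
treat the pattern (and the name find_oom_kills) as assumptions.

```python
import re

# Assumed format, taken from the single "Killed process" line shown above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+),? (?:UID \d+,? )?\((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Yield (pid, command, anon_rss_kb) for each OOM-kill line in lines."""
    for line in lines:
        m = KILLED_RE.search(line)
        if m:
            yield (int(m.group("pid")), m.group("comm"),
                   int(m.group("anon_rss_kb")))
```

Feeding it the lines of dmesg output and looking for "postmaster" or
"postgres" in the command field would confirm whether the OOM killer was
ever involved.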
I wouldn't burn too much more time on it until I can reproduce it. The
failures were all during pg_restore, so checkpointer would've been very busy.
It seems possible for it to notice ENOSPC before the workers, which would be
fsyncing WAL, whereas the checkpointer is fsyncing data.
> Admittedly it is quite hard to distinguish between a web browser
> and a program designed to eat memory as fast as possible...
Browsers making lots of progress here but still clearly 2nd place.
Justin