Re: stress test for parallel workers - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: stress test for parallel workers |
Date | |
Msg-id | CA+hUKG+mZ=FjC3jyGK94vjJjL+SgO7ocFcF2Pm7MBWo6nuSKzg@mail.gmail.com |
In response to | Re: stress test for parallel workers (Justin Pryzby <pryzby@telsasoft.com>) |
Responses | Re: stress test for parallel workers |
List | pgsql-hackers |
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different. Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log? You mentioned that you couldn't see any signs of the OOM
killer. Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system? You might try typing
this into Python:

    x = [42]
    for i in range(1000):
        x = x + x

On my non-Linux system, it ran for a while and then was killed, and
dmesg showed:

    pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
    pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard to distinguish between a web browser
and a program designed to eat memory as fast as possible... Anyway on
Linux you should see stuff about killed processes and/or OOM in one of
dmesg, syslog, messages.

> But that gives me an idea: is it possible there's an issue with files being
> held opened by worker processes ? Including by parallel workers? Probably
> WALs, even after they're rotated ? If there were worker processes holding
> opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create
temporary files. If you're building humongous indexes with parallel
workers, you'll get some of those, but I don't think it'd be more than
you'd get without parallelism. If you were using up all of your disk
space with temporary files, wouldn't this be reproducible? I think
you said you were testing this repeatedly, so if that were the problem
I'd expect to see some non-panicky out-of-space errors when the temp
files blow out your disk space, and only rarely a panic if a
checkpoint happens to run exactly at a moment where the create index
hasn't yet written the byte that breaks the camel's back, but the
checkpoint pushes it over the edge in one of these places where it
panics on failure.

-- 
Thomas Munro
https://enterprisedb.com
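
For the log check mentioned above (dmesg, syslog, messages), here is a minimal sketch of a scanner that picks out lines that look like OOM kills. The function name `find_oom_lines` and the pattern list are hypothetical. The first two patterns are assumptions about what Linux kernel logs commonly contain, and the third is taken from the dmesg output quoted in this message. The exact wording varies by kernel version and operating system, so treat the list as a starting point and not as a complete set.

```python
import re
import sys

# Assumed OOM signatures (not exhaustive):
#  - the first two are wordings commonly seen in Linux kernel logs
#  - the last one comes from the non-Linux dmesg output quoted above
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer", re.IGNORECASE),
    re.compile(r"Out of memory: Kill(ed)? process", re.IGNORECASE),
    re.compile(r"was killed: out of swap space", re.IGNORECASE),
]


def find_oom_lines(lines):
    """Return the lines (newline stripped) matching any assumed OOM signature."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in OOM_PATTERNS)
    ]


if __name__ == "__main__":
    # Each argument is a path to a saved log file, e.g. a copy of dmesg output.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log_file:
            for hit in find_oom_lines(log_file):
                print(f"{path}: {hit}")
```

Running the memory-eating loop first and then pointing this at a saved copy of the logs should show whether an OOM kill leaves a recognizable trace on the system in question. If nothing matches, it is worth reading the logs by hand before concluding that the patterns are right and no kill happened.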