On Mon, Nov 20, 2017 at 10:34 AM, Andres Freund <andres@anarazel.de> wrote:
Hi, On 2017-11-20 10:13:57 -0800, David Pacheco wrote: > I expect what happened is that the syslogger process attempted to allocate > memory, failed because the system was low, and explicitly exited. That's > consistent with an exited process, no core file generated, and the "FATAL" > "out > of memory" entries in the log I posted.
Well unfortunately we're presumably not going to see output from syslogger's death itself, chicken and egg etc.
If that's true, I think that's a separate bug. If a process receives ENOMEM from an attempt to allocate memory, it can still write out a log message to the already-open file descriptor indicating that it's exiting because of that failure. For all I know, that's what happened here, as there are FATAL "out of memory" entries in the log. (I'd be able to tell for sure if we had configured logging to include pids in the log messages.)
> Tom Lane wrote: > > Hm, so that's another angle David didn't report on: is it possible that > > his workload could have resulted in a very large volume of incomplete > > in-progress log messages? > > Yes. I mentioned in my Nov 6 mail that large log messages over a > short period appear to have been a major contributing factor:
So it's actually quite possible that syslogger did use a fair amount of memory. Could you show its memory usage, just via ps?
I don't have information from the problematic case because the process had already exited. Right now, for the same database (actually a different instance, because this incident triggered a failover):
I'm not sure what we can infer from that, as this is a different system, and the workload that generates the very large query strings only runs occasionally. Those strings are also not logged unless something's gone wrong.
> This is a critical point. As far as I can tell, all that's necessary > for this deadlock to occur is: > > - the syslogger is unable to make forward progress (e.g., because it > segfaulted)
This specific case I don't care that much about, but ...
I understand if the community isn't interested in fixing this case if other users aren't seeing it much, but surely it's still a bug that this unusual case can result in a deadlock?