Hi all,
a few days ago I set up new buildfarm animals (addax and mite), running
the tests with CLOBBER_CACHE_RECURSIVELY. As those tests take a very long
time, reporting the results back to the server fails because of a
safeguard limit in the buildfarm server. Anyway, that's being discussed
in a different thread - here it's mentioned merely as a 'don't bother
looking for addax or mite on the buildfarm website' warning.
I've been checking the progress of the recursive tests today, and I found
that they actually failed in the 'make check' step. The logs are
available here:
buildfarm logs: http://www.fuzzy.cz/tmp/buildfarm/recursive-oom.tgz
kernel logs: http://www.fuzzy.cz/tmp/buildfarm/messages
The tests run within an LXC container (operated through libvirt), so
whenever I say 'VM' I actually mean an LXC container. It might be some
VM/LXC misconfiguration, but as this happens only in the single VM
running the tests with recursive clobber, I find that unlikely.
================== An example of the failure ==================
parallel group (20 tests):  pg_lsn regproc oid name char money float4 txid
                            text int2 varchar int4 float8 boolean int8 uuid
                            rangetypes bit numeric enum
     ...
     float4       ... ok
     float8       ... ok
     bit          ... FAILED (test process exited with exit code 2)
     numeric      ... FAILED (test process exited with exit code 2)
     txid         ... ok
     ...
===============================================================
and then of course the usual 'terminating connection because of crash of
another server process' warning. Apparently, it's getting killed by the
OOM killer, because it exhausts all the memory assigned to that VM (2GB).
May 15 19:44:53 postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
May 15 19:44:53 cspug kernel: postgres cpuset=recursive-builds mems_allowed=0
May 15 19:44:53 cspug kernel: Pid: 17159, comm: postgres Not tainted 2.6.32-431.17.1.el6.centos.plus.x86_64 #1
AFAIK 2GB is more than enough for a buildfarm machine (after all,
chipmunk has just 512MB). Also, this only happens on this VM
(cpuset=recursive-builds); the other two VMs, with exactly the same
limits and running other buildfarm animals (regular or with
CLOBBER_CACHE_ALWAYS), are perfectly happy - see magpie or markhor for
example. And I don't see any reason why a build with recursive clobber
should require more memory than a regular build, so this looks like a
memory leak somewhere in the cache invalidation code.
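For context - this is paraphrased from memory, so treat it as a rough
sketch of the logic in src/backend/utils/cache/inval.c rather than an
exact quote - the difference between the two clobber modes is roughly:

    /* simplified sketch of AcceptInvalidationMessages(), not verbatim */
    #if defined(CLOBBER_CACHE_ALWAYS)
    {
        static bool in_recursion = false;

        /* reset the caches, but refuse to recurse into another reset */
        if (!in_recursion)
        {
            in_recursion = true;
            InvalidateSystemCaches();
            in_recursion = false;
        }
    }
    #elif defined(CLOBBER_CACHE_RECURSIVELY)
        /* no recursion guard, so cache resets nest arbitrarily deep */
        InvalidateSystemCaches();
    #endif

So anything that leaks even a tiny bit per InvalidateSystemCaches() call
would add up much faster in the recursive case, which would be consistent
with the CLOBBER_CACHE_ALWAYS animals being fine and only the recursive
ones hitting OOM.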
I thought it might be fixed by commit b23b0f5588 (Code review for recent
changes in relcache.c), but mite is currently testing 7894ac5004 and it
has already failed with OOM.
The failures apparently happen within a few hours of the tests starting.
For example on addax (gcc), the build started at 02:50 and the first OOM
failure happened at 05:19; on mite (clang), it's 03:20 vs. 06:50. So it's
roughly 3-4 hours after the tests start.
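If it would help to narrow this down, one option (just a suggestion, I
haven't done it here yet) is to watch one of the bloating backends and
dump its memory context statistics from gdb before the OOM killer gets
to it - MemoryContextStats() prints a per-context breakdown to the
backend's stderr, i.e. the server log:

    # attach to the bloating backend (the PID is just an example)
    gdb -p 17159
    (gdb) call MemoryContextStats(TopMemoryContext)
    (gdb) detach

That should at least show whether the growth happens in
CacheMemoryContext (pointing at relcache/catcache) or somewhere else.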
regards
Tomas