> On Thu, Sep 26, 2024 at 08:46:17AM GMT, Dmitry Dolgov wrote:
> > On Thu, Sep 26, 2024 at 07:57:12AM GMT, Gabriele Bartolini wrote:
> > Hi Dmitry,
> >
> > I've been attempting to replicate this issue directly in Kubernetes, but I
> > haven't been successful so far. I've been using EKS nodes, and it seems
> > that they all run cgroup v2 now. Do you have anything that could help me
> > get started on this more quickly?
>
> Thanks for testing. I can check if I can get some EKS clusters to
> experiment with. In the meantime, what about the reproducing script for
> cgroup v2 (the plain one that I've attached with the patch, that doesn't
> require any k8s cluster), doesn't it work for you?

Looks like there is a plot twist. After talking to Gabriele off list and
testing on an EKS cluster, I've discovered that since 5.7 the Linux
kernel supports limiting hugetlb reservations through the hugetlb cgroup
controller [1]. That means that, in addition to the original limit
enforced at page fault time, there is now one enforced at reservation
time, which has a separate knob in cgroupfs:

    # hugetlb controller (cgroup v1 file names; on cgroup v2 the
    # equivalents are hugetlb.2MB.max and hugetlb.2MB.rsvd.max)
    #
    # original limit, page fault level
    hugetlb.2MB.limit_in_bytes
    #
    # new one, reservation level
    hugetlb.2MB.rsvd.limit_in_bytes

This means that there could still be people facing the original issue
the patch is trying to address: for that, one needs to either run an
older kernel or use a container orchestration tool that does not set the
rsvd value (there seem to be such examples). But in the long term I
would expect everyone to converge on using reservation limits correctly,
so maybe the patch is not needed after all.
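
To check which of the two limits a given cgroup actually has, something
along these lines could be used. This is only a minimal sketch and not
part of the patch: the helper names are made up for illustration, it
looks for both the limit_in_bytes names and the cgroup v2 .max names,
and it assumes that an unset limit reads as "max" (cgroup v1 reports
"unlimited" as a very large number instead, which is not handled here).

```python
from pathlib import Path

# Candidate file names per limit kind: cgroup v1 name first, then cgroup v2.
LIMIT_FILES = {
    "fault": ("hugetlb.{size}.limit_in_bytes", "hugetlb.{size}.max"),
    "reservation": ("hugetlb.{size}.rsvd.limit_in_bytes", "hugetlb.{size}.rsvd.max"),
}


def read_hugetlb_limits(cgroup_dir, size="2MB"):
    """Return {"fault": value, "reservation": value} for one cgroup directory.

    Each value is the raw file content as a string, or None when neither
    candidate file exists (e.g. a kernel older than 5.7 has no rsvd files).
    """
    limits = {}
    for kind, templates in LIMIT_FILES.items():
        limits[kind] = None
        for template in templates:
            path = Path(cgroup_dir) / template.format(size=size)
            if path.is_file():
                limits[kind] = path.read_text().strip()
                break
    return limits


def reservation_limit_missing(limits):
    """True when a page-fault limit is set without a reservation limit.

    Assumption: "max" (the cgroup v2 spelling) or a missing file means
    that no limit is configured.
    """
    unlimited = (None, "max")
    return limits["fault"] not in unlimited and limits["reservation"] in unlimited
```

Calling read_hugetlb_limits() on the cgroup directory of the container
(the exact path under /sys/fs/cgroup depends on the setup) and passing
the result to reservation_limit_missing() tells whether the container is
in the "fault limit only" situation described above.
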
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdc2fcfea79b9873bb63159f8ed973f4046018c8