Re: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes - Mailing list pgsql-bugs

From	Andres Freund
Subject	Re: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes
Date	January 22, 2023 03:27:04
Msg-id	20230122002704.yoskrrfkbgi7xcfs@awork3.anarazel.de
In response to	Re: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes (Andres Freund <andres@anarazel.de>)
List	pgsql-bugs

Hi,

On 2023-01-21 15:29:22 -0800, Andres Freund wrote:
> On 2023-01-22 00:10:29 +0100, Tomas Vondra wrote:
> > On 1/20/23 23:48, PG Bug reporting form wrote:
> > > In these cases, the initdb phase will attempt to allocate huge pages that
> > > are available in the OS, but it will be denied access by Kubernetes and
> > > fail.
> >
> > Well, so how exactly this fails? Does that mean Kubernetes broke mmap()
> > with MAP_HUGETLB so that it doesn't return MAP_FAILED when hugepages are
> > not available, or what? Because that's the only explanation I can see,
> > looking at the code.
>
> Yea, that's what I was wondering about as well.
>
>
> > Or it just does not realize there are no hugepages, returns something
> > and then crashes with SIGBUS later when trying to access it?
>
> I assume that that's the case. There's references to bus errors in a bunch of
> the linked issues. E.g.
> https://github.com/CrunchyData/postgres-operator/issues/413
>
> selecting default max_connections ... sh: line 1:    60 Bus error               (core dumped)
> "/usr/pgsql-10/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

>
> It's possible that the problem would go away if we used MAP_POPULATE for the
> allocation.

> I'd guess that this is annoying cgroups stuff :(

Ah, the fun:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/hugetlb.html

  The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
  control group and enforces the limit during page fault. Since HugeTLB
  doesn't support page reclaim, enforcing the limit at page fault time implies
  that, the application will get SIGBUS signal if it tries to fault in HugeTLB
  pages beyond its limit. Therefore the application needs to know exactly how many
  HugeTLB pages it uses before hand, and the sysadmin needs to make sure that
  there are enough available on the machine for all the users to avoid processes
  getting SIGBUS.

but there's also

      Reservation accounting

  hugetlb.<hugepagesize>.rsvd.limit_in_bytes
  hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
  hugetlb.<hugepagesize>.rsvd.usage_in_bytes
  hugetlb.<hugepagesize>.rsvd.failcnt

  The HugeTLB controller allows to limit the HugeTLB reservations per control
  group and enforces the controller limit at reservation time and at the fault
  of HugeTLB memory for which no reservation exists. Since reservation limits
  are enforced at reservation time (on mmap or shget), reservation limits
  never causes the application to get SIGBUS signal if the memory was reserved
  before hand. For MAP_NORESERVE allocations, the reservation limit behaves
  the same as the fault limit, enforcing memory usage at fault time and
  causing the application to receive a SIGBUS if it’s crossing its limit.

  Reservation limits are superior to page fault limits described above, since
  reservation limits are enforced at reservation time (on mmap or shget), and
  never causes the application to get SIGBUS signal if the memory was reserved
  before hand. This allows for easier fallback to alternatives such as
  non-HugeTLB memory for example. In the case of page fault accounting, it’s
  very hard to avoid processes getting SIGBUS since the sysadmin needs
  precisely know the HugeTLB usage of all the tasks in the system and make
  sure there is enough pages to satisfy all requests. Avoiding tasks getting
  SIGBUS on overcommited systems is practically impossible with page fault
  accounting.

So the problem is that the wrong type of cgroup limit is used: page fault
limits rather than reservation limits. I don't know if that's a Kubernetes
or a postgres-operator issue.
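
To make the two cases concrete, here is a minimal C sketch of a huge-page
mapping attempt with a fallback to regular pages. The helper name
alloc_shared is hypothetical and this is not PostgreSQL's actual allocation
code; it only assumes Linux mmap() with MAP_HUGETLB. Under reservation
limits a refused request shows up as MAP_FAILED and the fallback runs. Under
page fault limits the mmap() can succeed and the SIGBUS only arrives when
the memory is first touched, which no check of the return value can catch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper for illustration; not PostgreSQL's actual code.
 * Tries a huge-page mapping first, then falls back to regular pages.
 * size is assumed to be a multiple of the huge page size.
 * Returns NULL if both attempts fail; *used_huge reports which one won.
 */
void *
alloc_shared(size_t size, int *used_huge)
{
    int   flags = MAP_SHARED | MAP_ANONYMOUS;
    void *ptr = MAP_FAILED;

    assert(used_huge != NULL);

#ifdef MAP_HUGETLB
    /* With reservation limits, a refused request fails right here. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags | MAP_HUGETLB, -1, 0);
#endif
    *used_huge = (ptr != MAP_FAILED);

    /* Fall back to regular pages if the huge-page attempt was refused. */
    if (ptr == MAP_FAILED)
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);

    /*
     * With page fault limits, a successful huge-page mmap() proves nothing:
     * the SIGBUS can still arrive when the memory is first touched.
     */
    return (ptr == MAP_FAILED) ? NULL : ptr;
}
```

A caller would pass the segment size and check used_huge afterwards; the
sketch deliberately never touches the mapping, since that first access is
where the bus error would occur.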

Greetings,

Andres Freund
