Re: failed NUMA pages inquiry status: Operation not permitted - Mailing list pgsql-hackers

From Christoph Berg
Subject Re: failed NUMA pages inquiry status: Operation not permitted
Msg-id aUFbrmKrYPBuTZ1c@msg.df7cb.de
In response to Re: failed NUMA pages inquiry status: Operation not permitted  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
> 
>        -ENOENT
>               The page is not present.
> 
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.
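
To make the "-2 is ENOENT" reading concrete, here is a minimal sketch of
how a single entry of the move_pages(2) status array could be decoded.
The helper name page_status_error_name() is hypothetical (it is not part
of PostgreSQL or libnuma); the list of error codes is the one documented
for the status array in the move_pages(2) man page.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: name the negated errno
 * stored in one entry of the move_pages(2) status array.
 * Returns NULL for non-negative values, which are NUMA node ids.
 */
static const char *
page_status_error_name(int status)
{
    if (status >= 0)
        return NULL;            /* a node id, not an error */

    switch (-status)
    {
        case EACCES: return "EACCES";
        case EBUSY:  return "EBUSY";
        case EFAULT: return "EFAULT";
        case EIO:    return "EIO";
        case EINVAL: return "EINVAL";
        case ENOENT: return "ENOENT";   /* "The page is not present." */
        case ENOMEM: return "ENOMEM";
        default:     return "unknown";
    }
}
```

With this, page_status_error_name(-ENOENT) yields "ENOENT", matching the
-2 seen in the error message above on Linux.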

I tried reading the kernel source and it sounds related:

 * If the source virtual memory range has any unmapped holes, or if
 * the destination virtual memory range is not a whole unmapped hole,
 * move_pages() will fail respectively with -ENOENT or -EEXIST. This
 * provides a very strict behavior to avoid any chance of memory
 * corruption going unnoticed if there are userland race conditions.
 * Only one thread should resolve the userland page fault at any given
 * time for any given faulting address. This means that if two threads
 * try to both call move_pages() on the same destination address at the
 * same time, the second thread will get an explicit error from this
 * command.
...
 * The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
 * prevent -ENOENT errors to materialize if there are holes in the
 * source virtual range that is being remapped. The holes will be
 * accounted as successfully remapped in the retval of the
 * command. This is mostly useful to remap hugepage naturally aligned
 * virtual regions without knowing if there are transparent hugepage
 * in the regions or not, but preventing the risk of having to split
 * the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
                   unsigned long src_start, unsigned long len, __u64 mode)
...
                        if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
                                err = -ENOENT;
                                break;

What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):

int numa_move_pages(int pid, unsigned long count,
        void **pages, const int *nodes, int *status, int flags)
{
        return move_pages(pid, count, pages, nodes, status, flags);
}

I guess the answer is somewhere in that gap.

> ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2

Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
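
A rough sketch of the "negative numbers become NULL" idea, just to show
the shape of it. This is not the current PostgreSQL code; the function
name numa_status_to_node() and the bool/out-parameter convention are
assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: turn one kernel-reported
 * page status into a nullable node id without range-checking it.
 *
 * Returns true and stores the node id in *node when status >= 0.
 * Returns false when status is a negated errno (for example -ENOENT),
 * in which case the caller would report SQL NULL for that page.
 */
static bool
numa_status_to_node(int status, int *node)
{
    if (status < 0)
        return false;           /* no node known for this page */

    *node = status;             /* pass the kernel's answer through */
    return true;
}
```

A caller would set the column's null flag whenever this returns false,
instead of raising an ERROR for values such as -2.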

Christoph


