On Wed, Jul 9, 2025 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
> > Yes, and we are discussing if it is worth getting into smaller pages
> > for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
> > what's more, the even more wasteful 1GB hugetlb if we don't request 2MB for
> > some small structs; btw, we have the ability to select MAP_HUGE_2MB vs
> > MAP_HUGE_1GB). I'm thinking about two problems:
> > - 4kB pages are swappable and mlock() potentially (?) disarms NUMA autobalancing
>
> I'm not really bought into this being a problem. If your system has enough
> pressure to swap out the PGPROC array, you're so hosed that this won't make a
> difference.
OK, I need to bend here, yet part of me still believes that it is
unhealthy to have hugepages (for 'Buffer Blocks') while some much
smaller, but far more critical, structs are more likely to be swapped
out under pressure from the random malloc()s of some backend gone wild
(especially given that the OS might prefer to swap per node rather
than looking at the global picture).
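To make the mlock() idea concrete, here is a minimal sketch, assuming a small
anonymous shared mapping on regular pages stands in for one of those critical
structs. The helper names are hypothetical and this is not what the patch
does; whether pinning also changes NUMA autobalancing behaviour is left open
here, as above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a multiple of the base page size (typically 4kB). */
static size_t
round_up_to_page(size_t len)
{
    size_t      page = (size_t) sysconf(_SC_PAGESIZE);

    return (len + page - 1) / page * page;
}

/*
 * Hypothetical helper: map a small shared region on regular pages and try to
 * pin it so it cannot be swapped out. Returns the mapping, or NULL if mmap()
 * failed. *locked is set to 1 if mlock() succeeded, 0 otherwise (for example
 * when RLIMIT_MEMLOCK is too low); the mapping stays usable either way.
 */
static void *
alloc_pinned_small(size_t len, int *locked)
{
    size_t      sz = round_up_to_page(len);
    void       *p;

    p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *locked = (mlock(p, sz) == 0);
    if (!*locked)
        perror("mlock");

    return p;
}
```

A caller would do something like alloc_pinned_small(size_of_struct, &locked)
and treat locked == 0 as "carry on, but this region is still swappable".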
> I'm rather doubtful that it's a good idea to combine numa awareness with numa
> balancing. Numa balancing adds latency and makes it much more expensive for
> userspace to act in a numa aware way, since it needs to regularly update its
> knowledge about where memory resides.
Well, the problem is that backends come and go, and often land on
random CPUs (migrated++ with very high backend counts and workloads
that are non-uniform in terms of per-backend CPU usage), but
autobalancing doesn't need to be on or off for everything. It could be
active only for a certain memory region, and it would not affect the
app in any way (well, other than those minor page faults, literally).
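To illustrate the "backends land on random CPUs" part, a rough sketch of how a
process could sample which CPU and NUMA node it is running on, using the raw
getcpu syscall. The helper names are hypothetical, and the answer may already
be stale by the time it is returned, since the scheduler can migrate the
process at any point.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: report the CPU and NUMA node the caller is running on
 * right now. Returns 0 on success, -1 on failure (errno set by the syscall).
 */
static int
current_cpu_and_node(unsigned *cpu, unsigned *node)
{
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}

/*
 * Sample the current CPU 'samples' times and count how often it changed
 * between consecutive samples. Returns the number of observed changes, or -1
 * if the syscall failed.
 */
static int
count_cpu_changes(int samples)
{
    unsigned    prev_cpu = 0;
    unsigned    cpu = 0;
    unsigned    node = 0;
    int         changes = 0;

    for (int i = 0; i < samples; i++)
    {
        if (current_cpu_and_node(&cpu, &node) != 0)
            return -1;
        /* The first sample only establishes the baseline. */
        if (i > 0 && cpu != prev_cpu)
            changes++;
        prev_cpu = cpu;
    }
    return changes;
}
```

This only shows where the process runs, not where its memory resides; keeping
track of the latter is the extra cost mentioned in the quoted text.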
> If we used 4k pages for the procarray we would just have ~4 procs on one page,
> if that range were marked as interleaved, it'd probably suffice.
OK, this sounds like the best and simplest proposal to me, yet the
patch doesn't do OS-based interleaving for those today. Gonna try that
mlock() sooner or later... ;)
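For reference, a minimal sketch of what OS-based interleaving of such a range
could look like, assuming a plain 4kB-page mapping and the raw mbind syscall
(used here only to keep the sketch free of libnuma; the patch may well do it
differently). The helper names and the 1024-byte entry size in the usage note
are hypothetical stand-ins, not the real PGPROC size.

```c
#define _GNU_SOURCE
#include <linux/mempolicy.h>    /* MPOL_INTERLEAVE */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* How many fixed-size entries fit entirely on one page. */
static size_t
entries_per_page(size_t page_size, size_t entry_size)
{
    return entry_size == 0 ? 0 : page_size / entry_size;
}

/*
 * Ask the kernel to interleave the pages of [addr, addr + len) across the
 * nodes set in nodemask (bit N = node N). addr must be page-aligned. The
 * policy applies to pages faulted in afterwards, so call this before the
 * range is touched. Returns 0 on success, -1 on failure (errno set).
 */
static int
interleave_range(void *addr, size_t len, unsigned long nodemask)
{
    unsigned long maxnode = sizeof(nodemask) * 8;

    return (int) syscall(SYS_mbind, addr, len,
                         (unsigned long) MPOL_INTERLEAVE,
                         &nodemask, maxnode, 0UL);
}

/*
 * Hypothetical helper: map a shared region on regular pages and try to mark
 * it interleaved. Returns the mapping, or NULL if mmap() failed.
 * *interleaved is 1 if the policy was applied, 0 otherwise (no NUMA support,
 * invalid node mask, ...); the mapping stays usable either way.
 */
static void *
alloc_interleaved(size_t len, unsigned long nodemask, int *interleaved)
{
    void       *p;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *interleaved = (interleave_range(p, len, nodemask) == 0);
    if (!*interleaved)
        perror("mbind");

    return p;
}
```

With a 4096-byte page and an assumed 1024-byte entry, entries_per_page()
gives 4, which matches the "~4 procs on one page" estimate above; a node mask
of 0x3 would request interleaving across nodes 0 and 1.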
-J.