Re: Better shared data structure management and resizable shared data structures - Mailing list pgsql-hackers
| From | Ashutosh Bapat |
|---|---|
| Subject | Re: Better shared data structure management and resizable shared data structures |
| Date | |
| Msg-id | CAExHW5s9Vp+-vJi020UJ+otyccBBo7eT1g6bttdRKL6HAvscyQ@mail.gmail.com |
| In response to | Re: Better shared data structure management and resizable shared data structures (Heikki Linnakangas <hlinnaka@iki.fi>) |
| Responses |
Re: Better shared data structure management and resizable shared data structures
Re: Better shared data structure management and resizable shared data structures |
| List | pgsql-hackers |
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > `man madvise` has this
> > MADV_REMOVE (since Linux 2.6.16)
> > Free up a given range of pages and its associated
> > backing store. This is equivalent to punching a
> > hole in the corresponding byte range of the backing
> > store (see fallocate(2)). Subsequent accesses
> > in the specified address range will see bytes containing zero.
> >
> > The specified address range must be mapped shared
> > and writable. This flag cannot be applied to
> > locked pages, Huge TLB pages, or VM_PFNMAP pages.
> >
> > In the initial implementation, only tmpfs(5) was
> > supported MADV_REMOVE; but since Linux 3.5, any
> > filesystem which supports the fallocate(2)
> > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > Hugetlbfs fails with the error EINVAL and other
> > filesystems fail with the error EOPNOTSUPP.
> >
> > It says the flag can not be applied to Huge TLB pages. We won't be
> > able to make resizable shared memory structures allocated with huge
> > pages. That seems like a serious restriction.
>
> Per https://man7.org/linux/man-pages/man2/madvise.2.html:
>
> MADV_REMOVE (since Linux 2.6.16)
> ...
>
> Support for the Huge TLB filesystem was added in Linux
> v4.3.
>
> > I may be misunderstanding something, but it seems like this is useful
> > to free already allocated memory, not necessarily allocate more
> > memory. I don't understand how a user would start with a larger
> > reserved address space with only small portions of that space being
> > backed by memory.
>
> Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call.
> to reserve address space for the maximum size, and then
> madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> again.

Thank you for the hint. Also thanks to Andres's idea, the resizable structure patch is quite small now.
Actually, after experimenting with madvise(), memfd_create() and ftruncate(), I see that MADV_POPULATE_WRITE is not required at all. We don't have to do anything to expand a structure. Memory will be allocated as and when the program writes to it. I also discovered a few things that I didn't know before:

1. ftruncate() sets the size of the file but it doesn't allocate the memory pages.
2. To use madvise(), the address needs to be backed by a file, so memfd_create() is a must.
3. We can't write to file-backed memory at a location beyond the size of the file. Hence we have to set the size of the file to the maximum size at the beginning.
4. The address and length passed to madvise() need to be page aligned, but those passed to fallocate() needn't be. `man fallocate` says "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte range starting at offset and continuing for len bytes. Within the specified range, partial filesystem blocks are zeroed, and whole filesystem blocks are removed from the file." It seems to take care of the page size automatically, so using fallocate() simplifies the logic. Further, `man madvise` says "but since Linux 3.5, any filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE." So fallocate() with FALLOC_FL_PUNCH_HOLE is guaranteed to be available on a system which supports MADV_REMOVE.

Using fallocate() (or madvise()) to free memory, we don't need multiple segments, so there is much less code churn compared to the multiple mappings approach.

However, there is one drawback. In the multiple mappings approach, an access beyond the current size of the structure would result in a segfault or bus error. In the fallocate()/madvise() approach such an access does not cause a crash. A write beyond the pages that fit the current size of the structure silently causes more memory to be allocated. A read returns 0s.
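To make the sequence above concrete, here is a minimal sketch of it in C. It is not taken from the attached patches: the `ResizableArea` struct and the `resizable_area_create()` / `resizable_area_shrink()` names are hypothetical, made up for illustration, and it assumes Linux with a glibc that exposes memfd_create() and fallocate(). Growing needs no call at all; shrinking punches a hole from the new size up to the maximum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stddef.h>
#include <sys/mman.h>   /* memfd_create(), mmap() */
#include <sys/types.h>
#include <unistd.h>     /* ftruncate(), close() */

/* Hypothetical descriptor, for illustration only (not from the patch). */
typedef struct ResizableArea
{
    int     fd;         /* memfd backing the mapping */
    char   *base;       /* start of the shared mapping */
    size_t  max_size;   /* address space reserved up front */
} ResizableArea;

/*
 * Reserve max_size bytes of file-backed shared memory.  ftruncate() only
 * sets the file size; pages are allocated when they are first written.
 * Returns 0 on success, -1 on failure.
 */
int
resizable_area_create(ResizableArea *area, size_t max_size)
{
    area->fd = memfd_create("resizable_demo", 0);
    if (area->fd < 0)
        return -1;

    /* The file must already cover the maximum size we may ever touch. */
    if (ftruncate(area->fd, (off_t) max_size) != 0)
    {
        close(area->fd);
        return -1;
    }

    area->base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, area->fd, 0);
    if (area->base == MAP_FAILED)
    {
        close(area->fd);
        return -1;
    }

    area->max_size = max_size;
    return 0;
}

/*
 * Give back the memory beyond new_size.  The file size stays at max_size
 * (FALLOC_FL_KEEP_SIZE is mandatory together with FALLOC_FL_PUNCH_HOLE),
 * so later writes beyond new_size simply allocate pages again.
 */
int
resizable_area_shrink(ResizableArea *area, size_t new_size)
{
    assert(new_size <= area->max_size);

    if (new_size == area->max_size)
        return 0;

    return fallocate(area->fd,
                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     (off_t) new_size,
                     (off_t) (area->max_size - new_size));
}
```

A caller would create the area once with the maximum size, write into whatever portion it currently needs, and call `resizable_area_shrink()` when the structure gets smaller. Reading an offset past the new size after the shrink returns zeros rather than faulting, which is the silent behaviour described above.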
So, there's a possibility that bugs in size calculations might go unnoticed. I think that's how it works even today: an access in the yet un-allocated part of the shared memory simply goes unnoticed.

PFA the patches, with 0003 implementing resizable structures using fallocate(). There are TODOs, and I also need to make sure that resizable structures are disabled where memfd_create(), fallocate() and anonymous memory mappings are not available. Also, the test is unstable since it prints the memory consumption numbers obtained from /proc/self/status. But it demonstrates the allocation and freeing of shared memory as the shared structures undergo resizing. I don't think there is a stable way to use the numbers though, so we might have to remove them ultimately.

--
Best Wishes,
Ashutosh Bapat