On Wed, Jun 21, 2023 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> So I am wondering if you're encountering a different kind of problem. As I
> mentioned, I have observed that the pages need to be clean for this to
> work. For me adding a "sync path/to/postgres" makes it work on 6.3.8. Without
> the sync it starts to work a while later (presumably when the kernel got
> around to writing the data back).
Hmm, then after rebooting today, it shouldn't have that problem until a build links again, but I'll make sure to run sync when building. Still the same failure, though. Looking more closely at the manpage for madvise, it has this under MADV_HUGEPAGE:
"The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations are available only if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."
Earlier, I only checked the first config option but didn't know about the second...
$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set
Apparently, it's experimental. That could be the explanation, but now I'm wondering why the fallback
madvise(addr, advlen, MADV_HUGEPAGE);
didn't also give an error. I wonder if we could mremap to some anonymous region and call madvise on that. That would be more similar to the hack I shared last year, which may be more fragile, but now it wouldn't need explicit huge pages.
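To make the mremap idea concrete, here is a rough sketch of what I have in mind. It is not taken from any patch: remap_to_anon() is a made-up name, and it works on an arbitrary readable mapping rather than the text segment. It copies the range into a scratch anonymous region, moves that region on top of the original address with MREMAP_FIXED, and then calls madvise on what is now anonymous memory. I'm assuming the caller passes a page-aligned range (for huge pages to actually appear it would presumably need to be aligned to the huge page size as well).

```c
#define _GNU_SOURCE				/* for mremap() and MREMAP_FIXED */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: replace the mapping at [addr, addr + len) with an
 * anonymous copy of its contents, then request huge pages for the copy.
 *
 * Returns 0 if the remap succeeded and -1 otherwise. On success,
 * *advise_errno is 0 if madvise() succeeded, else the errno it reported.
 * Assumes addr and len are page-aligned and the range is readable.
 */
static int
remap_to_anon(void *addr, size_t len, int *advise_errno)
{
	void	   *tmp;
	void	   *moved;

	assert(len > 0);

	/* scratch anonymous region that will hold the copy */
	tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED)
		return -1;

	memcpy(tmp, addr, len);

	/*
	 * Move the copy on top of the original range. With MREMAP_FIXED the
	 * kernel unmaps whatever was previously mapped at addr.
	 */
	moved = mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr);
	if (moved == MAP_FAILED)
	{
		munmap(tmp, len);
		return -1;
	}

	/* the range is anonymous now, so this no longer goes down the file path */
	*advise_errno = (madvise(addr, len, MADV_HUGEPAGE) == 0) ? 0 : errno;
	return 0;
}
```

For the real text segment this would additionally need an mprotect() back to PROT_READ | PROT_EXEC afterwards, and the code doing the remap must not itself live inside the range being replaced, which is where I'd expect the fragility to come from.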