Re: remap the .text segment into huge pages at run time - Mailing list pgsql-hackers

From Andres Freund
Subject Re: remap the .text segment into huge pages at run time
Date
Msg-id 20221105082748.dgb57maldyvvpv6n@awork3.anarazel.de
In response to Re: remap the .text segment into huge pages at run time  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: remap the .text segment into huge pages at run time
Re: remap the .text segment into huge pages at run time
List pgsql-hackers
Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> > the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on fs one needs to make sure that the executable
> > doesn't reflinks to reuse parts of other files, and that the mold linker and
> > cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> > binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.


> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > > push the end of the .text segment over the next aligned boundary, or to
> > > > ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are aligned to
> > > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > > relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach; there were plenty of
things placed after them when I played with them.



> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.


> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment.  Here's my hacky code, with a bunch of
comments added.

       void *addr = (void*) 0x555555800000;
       void *end = (void *) 0x555555e09000;
       size_t advlen = (uintptr_t) end - (uintptr_t) addr;

       /* round the length up to the next 2MiB huge page boundary */
       const size_t hugepage_size = 1024*1024*2;
       size_t advlen_up = (advlen + hugepage_size - 1) & ~(hugepage_size - 1);
       void *r2;
       int r;

       /*
        * Increase size of mapping to cover the trailing padding to the next
        * segment. Otherwise all the code in that range can't be put into
        * a huge page (access in the non-mapped range needs to cause a fault,
        * hence can't be in the huge page).
        * XXX: Should probably assert that the space is actually zeroes.
        */
       r2 = mremap(addr, advlen, advlen_up, 0);
       if (r2 == MAP_FAILED)
           fprintf(stderr, "mremap failed: %m\n");
       else if (r2 != addr)
           fprintf(stderr, "mremap returned unexpected address %p\n", r2);
       else
           advlen = advlen_up;

       /*
        * The docs for MADV_COLLAPSE say there should be at least one page
        * in the mapped space "for every eligible hugepage-aligned/sized
        * region to be collapsed". I just forced that. But probably not
        * necessary.
        */
       r = madvise(addr, advlen, MADV_WILLNEED);
       if (r != 0)
           fprintf(stderr, "MADV_WILLNEED failed: %m\n");

       r = madvise(addr, advlen, MADV_POPULATE_READ);
       if (r != 0)
           fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

       /*
        * Make huge pages out of it. Requires at least Linux 6.1.  We could
        * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
        * much in older kernels.
        */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE    25
#endif
       r = madvise(addr, advlen, MADV_COLLAPSE);
       if (r != 0)
           fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

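The length rounding in the snippet above is ordinary power-of-two alignment
arithmetic. As a minimal sketch only: the helper names align_up / align_down
and the HUGEPAGE_SIZE macro are made up for illustration and are not part of
any patch, and a 2MiB huge page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2MiB huge pages (the common x86-64 size). */
#define HUGEPAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Round v down to a multiple of align; align must be a power of two. */
static inline uintptr_t
align_down(uintptr_t v, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return v & ~(align - 1);
}

/* Round v up to a multiple of align; align must be a power of two. */
static inline uintptr_t
align_up(uintptr_t v, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (v + align - 1) & ~(align - 1);
}
```

With the hardcoded addresses above the mapping is 0x609000 bytes long, and
align_up(0x609000, HUGEPAGE_SIZE) yields 0x800000, i.e. four 2MiB pages.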

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're suitably
aligned (both in memory and on-disk).
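
To illustrate what "open /proc/self/maps" could look like, here is a rough
sketch, not a proposed implementation. The names MapsEntry, parse_maps_line,
is_exec_mapping_of and find_text_mappings are hypothetical; the field layout
(address range, perms, offset, dev, inode, pathname) is the one documented in
proc(5). The caller is assumed to already know the executable's path, e.g.
from readlink() on /proc/self/exe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps. */
typedef struct MapsEntry
{
    uintptr_t   start;
    uintptr_t   end;
    char        perms[5];       /* e.g. "r-xp" */
    char        path[4096];     /* empty for anonymous mappings */
} MapsEntry;

/* Parse a single maps line; returns false if it doesn't look like one. */
bool
parse_maps_line(const char *line, MapsEntry *e)
{
    unsigned long start, end;
    int         n;

    assert(line != NULL && e != NULL);
    e->path[0] = '\0';

    /* offset, dev and inode are skipped; the pathname is optional */
    n = sscanf(line, "%lx-%lx %4s %*s %*s %*s %4095[^\n]",
               &start, &end, e->perms, e->path);
    if (n < 3)
        return false;
    e->start = (uintptr_t) start;
    e->end = (uintptr_t) end;
    return true;
}

/* Is this a private, read+execute mapping backed by exe_path? */
bool
is_exec_mapping_of(const MapsEntry *e, const char *exe_path)
{
    return strcmp(e->perms, "r-xp") == 0 && strcmp(e->path, exe_path) == 0;
}

/*
 * Collect up to max r-xp mappings of exe_path from /proc/self/maps.
 * Returns the number found, or -1 if the file can't be opened.
 */
int
find_text_mappings(const char *exe_path, MapsEntry *out, int max)
{
    FILE       *f = fopen("/proc/self/maps", "r");
    char        line[8192];
    int         found = 0;

    if (f == NULL)
        return -1;
    while (found < max && fgets(line, sizeof(line), f) != NULL)
    {
        MapsEntry   e;

        if (parse_maps_line(line, &e) && is_exec_mapping_of(&e, exe_path))
            out[found++] = e;
    }
    fclose(f);
    return found;
}
```

Each entry's start/end would then take the place of the hardcoded addr/end
in the hack above. Handling libraries would mean matching paths other than
the executable's, which this sketch does not attempt.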

Greetings,

Andres Freund


