Re: remap the .text segment into huge pages at run time - Mailing list pgsql-hackers

From John Naylor
Subject Re: remap the .text segment into huge pages at run time
Msg-id CAFBsxsGmMGKv9eg-ESKCJ2FjqBZw_kHj3fRs5rdHkYoB=+XkkQ@mail.gmail.com
In response to Re: remap the .text segment into huge pages at run time  (Andres Freund <andres@anarazel.de>)
Responses Re: remap the .text segment into huge pages at run time
List pgsql-hackers
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped before it does so. At least, that issue is mentioned in that paper linked upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.
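For anyone following along, here is a minimal sketch of what such a call might look like. It is not taken from any patch in this thread: the helper name try_collapse() is hypothetical, and the fallback definition of MADV_COLLAPSE assumes the value from the generic Linux 6.1 uapi headers, since older libc headers may not define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
/* Assumption: value from the generic Linux 6.1 uapi header. */
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: ask the kernel to collapse [addr, addr + len)
 * into huge pages.  Returns 0 on success, otherwise the errno value
 * from madvise() (for example EINVAL on kernels that predate the hint).
 */
static int
try_collapse(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    if (madvise(addr, len, MADV_COLLAPSE) != 0)
        return errno;
    return 0;
}
```

A caller would presumably treat any nonzero return as "keep running with regular pages", since the hint is only an optimization.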

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on fs one needs to make sure that the executable
> doesn't reflinks to reuse parts of other files, and that the mold linker and
> cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something devs should consider when microbenchmarking, to reduce noise from context switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > push the end of the .text segment over the next aligned boundary, or to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you either observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello world" program, which links with 8 filler objects, each with 32768 __attribute((used)) dummy functions. I just cargo-culted that idea and simplified it. Interestingly enough, looking through the commit history, they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried aligning the segments with the linker and then, in my patch, rounding the address for mmap() *down* from the .text start to the beginning of that segment. It refused to start, and did so without logging an error.
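To make the rounding concrete, here is a small sketch of the address arithmetic involved. The names align_down(), align_up() and HUGE_PAGE_SIZE are illustrative only, not identifiers from my patch, and the 2MB size is an assumption that holds for x86-64 but not for every platform.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2MB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Round addr down to a multiple of align, which must be a power of two. */
static inline uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Round addr up to a multiple of align, which must be a power of two. */
static inline uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}
```

Rounding the start *up* and the end *down* stays inside .text but leaves the edges on regular pages; rounding the start *down* reaches into whatever precedes .text in the segment, which is the case that failed for me.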

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text > 8MB (I have yet to enable core dumps on this new OS I have...), whereas without it I did see the failures in the log, and successful fallback.
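The "successful fallback" behavior can be sketched roughly as below. This is a simplified illustration with an anonymous mapping, not the code from the patch: map_anon_prefer_huge() is a hypothetical name, and the caller is assumed to pass a length that is a multiple of the huge page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try an anonymous mapping backed by explicit huge
 * pages first, and fall back to regular pages if that fails (for example
 * because no huge pages are reserved in the kernel).  *used_huge is set
 * to 1 if the huge page attempt succeeded, else 0.  Returns MAP_FAILED
 * only if both attempts fail.
 */
static void *
map_anon_prefer_huge(size_t len, int *used_huge)
{
    void *p = MAP_FAILED;

    assert(used_huge != NULL && len > 0);

#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    *used_huge = (p != MAP_FAILED);

    if (p == MAP_FAILED)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p;
}
```

The point is only that a failed MAP_HUGETLB attempt is recoverable as long as the process gets a chance to see the error and retry.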

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and mremap()? The beginning of the segment (which for me contains .init/.plt) or an aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com
