Re: remap the .text segment into huge pages at run time - Mailing list pgsql-hackers

From Andres Freund
Subject Re: remap the .text segment into huge pages at run time
Date
Msg-id 20221106181614.c5by4ijbjmtq5h3v@awork3.anarazel.de
Whole thread Raw
In response to Re: remap the .text segment into huge pages at run time  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think the dummy functions are a good approach, there were plenty
> > things after it when I played with them.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch, doesn't necessary
mean it's entirely cold overall. I saw hits to things after the dummy function
to have a perf effect.


> > > > With these flags the "R E" segments all start on a 0x200000/2MiB
> boundary
> > > and
> > > > are padded to the next 2MiB boundary. However the OS / dynamic loader
> only
> > > > maps the necessary part, not all the zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can before it
> do
> > > an
> > > > mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
>
> >        /*
> >         * Make huge pages out of it. Requires at least linux 6.1.  We
> could
> >         * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all
> that
> >         * much in older kernels.
> >         */
>
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
to always or madvise.  My understanding is that MADV_COLLAPSE will work even
if /sys/kernel/mm/transparent_hugepage/enabled is set to never.


> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 'madvise'.


> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.


> > A real version would have to open /proc/self/maps and do this for at least
>
> I can try and generalize your above sketch into a v2 patch.

Cool.


> > postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably
> > aligned (both in memory and on-disk).
>
> It looks like plpgsql is only 27 standard pages in size...
>
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual. So
I'm loathe to go there.


> > > I quickly tried to align the segments with the linker and then in my
> patch
> > > have the address for mmap() rounded *down* from the .text start to the
> > > beginning of that segment. It refused to start without logging an error.
> >
> > Hm, what linker was that? I did note that you need some additional flags
> for
> > some of the linkers.
>
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
>
> On the subject of linkers, though, one thing that tripped me up was trying
> to change the linker with Meson. First I tried
>
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...


> but that led to warnings like this when :
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
>
> When using this in the top level meson.build
>
> elif host_system == 'linux'
>   sema_kind = 'unnamed_posix'
>   cppflags += '-D_GNU_SOURCE'
>   # Align the loadable segments to 2MB boundaries to support remapping to
>   # huge pages.
>   ldflags += cc.get_supported_link_arguments([
>     '-Wl,-zmax-page-size=0x200000',
>     '-Wl,-zcommon-page-size=0x200000',
>     '-Wl,-zseparate-loadable-segments'
>   ])
>
>
> According to
>
> https://mesonbuild.com/howtox.html#set-linker
>
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in
>
> https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's cases
with -d keeprsp, because the commandline ends up being long enough for a
response file being used).


> it says I can run this and it should show “Linker: LLD”, but that doesn't
> appear for me:
>
> $ readelf --string-dump .comment inst-perf/bin/postgres
>
> String dump of section '.comment':
>   [     0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

That's added by the compiler, not the linker. See e.g.:

$ readelf --string-dump .comment src/backend/postgres_lib.a.p/storage_ipc_procarray.c.o

String dump of section '.comment':
  [     1]  GCC: (Debian 12.2.0-9) 12.2.0

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
Next
From: Peter Geoghegan
Date:
Subject: Re: Allow single table VACUUM in transaction block