Re: remap the .text segment into huge pages at run time - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: remap the .text segment into huge pages at run time |
Date | |
Msg-id | CAFBsxsH7ryBmTzAo7Ot36G+2xZ=0MV6NnbVVgzs6m78wzetsCA@mail.gmail.com |
In response to | Re: remap the .text segment into huge pages at run time (Andres Freund <andres@anarazel.de>) |
Responses |
Re: remap the .text segment into huge pages at run time
|
List | pgsql-hackers |
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:
> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.
Oops, the commit I was referring to pointed to that discussion, but I should have shown it instead:
--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES= \
filler16.o \
OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152
But from what you're saying, this flag wouldn't have been enough anyway...
> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.
To be technical, the point wasn't to have no code after it, but to have no *hot* code *before* it, since with the iodlr approach the first 1.99MB of .text is below the first aligned boundary within that section. But yeah, I'm happy to ditch that hack entirely.
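To make the "first 1.99MB" arithmetic concrete, here is a minimal sketch of the rounding involved. It assumes a 2MiB huge page size; the helper names and the addresses in the usage note are hypothetical and do not come from iodlr or from any patch in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2MiB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Round addr down to a multiple of align, which must be a power of two. */
uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Round addr up to a multiple of align, which must be a power of two. */
uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
	return align_down(addr + align - 1, align);
}

/*
 * Bytes of [start, end) that sit below the first huge page boundary at or
 * after start. Code in this prefix cannot be part of a huge page when the
 * remapping begins at an aligned boundary inside the range.
 */
uintptr_t
bytes_below_first_boundary(uintptr_t start, uintptr_t end)
{
	uintptr_t	first = align_up(start, HUGE_PAGE_SIZE);

	return (first < end ? first : end) - start;
}

/* Bytes of [start, end) that are covered by whole huge pages. */
uintptr_t
huge_page_coverable(uintptr_t start, uintptr_t end)
{
	uintptr_t	lo = align_up(start, HUGE_PAGE_SIZE);
	uintptr_t	hi = align_down(end, HUGE_PAGE_SIZE);

	return hi > lo ? hi - lo : 0;
}
```

For a hypothetical .text spanning 0x401000 to 0xC00000, bytes_below_first_boundary() gives 0x1FF000 (just under 2MiB) and huge_page_coverable() gives 0x600000, i.e. three huge pages.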
> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > > maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > > mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?
> /*
> * Make huge pages out of it. Requires at least linux 6.1. We could
> * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> * much in older kernels.
> */
About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for THP? The man page seems to indicate that.
In the support work I've done, the standard recommendation is to turn THP off, especially if they report sudden performance problems. If explicit HP's are used for shared mem, maybe THP is less of a risk? I need to look back at the tests that led to that advice...
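Since that advice depends on how THP is configured on the machine, a small sketch for reading the current mode may be useful when comparing test systems. It assumes the usual Linux sysfs file /sys/kernel/mm/transparent_hugepage/enabled, whose contents look like "always [madvise] never" with the active mode in brackets. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a string such as
 * "always [madvise] never" into out. Returns 0 on success, or -1 if no
 * bracketed token is found or it does not fit in out.
 */
int
thp_active_setting(const char *contents, char *out, size_t outlen)
{
	const char *open;
	const char *close;
	size_t		len;

	assert(contents != NULL && out != NULL);

	open = strchr(contents, '[');
	if (open == NULL)
		return -1;
	close = strchr(open, ']');
	if (close == NULL)
		return -1;

	len = (size_t) (close - open - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Read the system-wide THP mode. Returns -1 if the file cannot be read,
 * for example on a kernel built without THP support.
 */
int
read_thp_setting(char *out, size_t outlen)
{
	char		buf[128];
	int			rc = -1;
	FILE	   *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		rc = thp_active_setting(buf, out, outlen);
	fclose(f);
	return rc;
}
```

This only reports the mode. Whether a given madvise() advice value is honored under that mode is a separate question that still needs checking against the man page and kernel documentation.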
> A real version would have to open /proc/self/maps and do this for at least
I can try and generalize your above sketch into a v2 patch.
> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).
It looks like plpgsql is only 27 standard pages in size...
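As a rough sketch of the /proc/self/maps scan (not the v2 patch itself), the following parses one maps line and tests whether it is an r-xp mapping whose start address and file offset are both 2MiB-aligned. Treating "aligned on-disk" as "file offset is a multiple of 2MiB" is my interpretation, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 2MiB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

typedef struct MapEntry
{
	unsigned long start;		/* first address of the mapping */
	unsigned long end;			/* one past the last address */
	unsigned long offset;		/* offset into the backing file */
	char		perms[5];		/* e.g. "r-xp" */
} MapEntry;

/*
 * Parse the leading "start-end perms offset" fields of one line of
 * /proc/self/maps. Returns 0 on success, -1 if the line does not match.
 */
int
parse_maps_line(const char *line, MapEntry *e)
{
	assert(line != NULL && e != NULL);

	if (sscanf(line, "%lx-%lx %4s %lx",
			   &e->start, &e->end, e->perms, &e->offset) != 4)
		return -1;
	return 0;
}

/*
 * True if the mapping is private executable text whose start address and
 * file offset are both aligned to the huge page size.
 */
int
is_huge_page_candidate(const MapEntry *e)
{
	return strcmp(e->perms, "r-xp") == 0 &&
		e->start % HUGE_PAGE_SIZE == 0 &&
		e->offset % HUGE_PAGE_SIZE == 0;
}

/* Count candidate mappings in the current process, or -1 on error. */
int
count_huge_page_candidates(void)
{
	char		line[512];
	int			n = 0;
	FILE	   *f = fopen("/proc/self/maps", "r");

	if (f == NULL)
		return -1;
	while (fgets(line, sizeof(line), f) != NULL)
	{
		MapEntry	e;

		/* Lines that do not parse are skipped rather than treated as errors. */
		if (parse_maps_line(line, &e) == 0 && is_huge_page_candidate(&e))
			n++;
	}
	fclose(f);
	return n;
}
```

A 27-page library mapping is 27 * 4096 = 110592 bytes, far below 2MiB, so it would only qualify if it were linked with the alignment flags and extended in the same way as the main binary.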
Regarding glibc, we could try moving a couple of the hotter functions into PG, using smaller and simpler coding, if that has better frontend cache behavior. The paper "Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers" talks about this, particularly section 4.4 regarding memcmp().
> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.
BFD, but I wouldn't worry about that failure too much, since the mremap()/madvise() strategy has a lot fewer moving parts.
On the subject of linkers, though, one thing that tripped me up was trying to change the linker with Meson. First I tried
-Dc_args='-fuse-ld=lld'
but that led to warnings like this:
/usr/bin/ld: warning: -z separate-loadable-segments ignored
when using this in the top-level meson.build:
elif host_system == 'linux'
sema_kind = 'unnamed_posix'
cppflags += '-D_GNU_SOURCE'
# Align the loadable segments to 2MB boundaries to support remapping to
# huge pages.
ldflags += cc.get_supported_link_arguments([
'-Wl,-zmax-page-size=0x200000',
'-Wl,-zcommon-page-size=0x200000',
'-Wl,-zseparate-loadable-segments'
])
According to
https://mesonbuild.com/howtox.html#set-linker
I need to add CC_LD=lld to the env vars before invoking, which got rid of the warning. Then I wanted to verify that lld was actually used, and in
https://releases.llvm.org/14.0.0/tools/lld/docs/index.html
it says I can run this and it should show “Linker: LLD”, but that doesn't appear for me:
$ readelf --string-dump .comment inst-perf/bin/postgres
String dump of section '.comment':
[ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)
--
John Naylor
EDB: http://www.enterprisedb.com