Re: remap the .text segment into huge pages at run time - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: remap the .text segment into huge pages at run time |
Date | |
Msg-id | 20221106181614.c5by4ijbjmtq5h3v@awork3.anarazel.de Whole thread Raw |
In response to | Re: remap the .text segment into huge pages at run time (John Naylor <john.naylor@enterprisedb.com>) |
List | pgsql-hackers |
Hi, On 2022-11-06 13:56:10 +0700, John Naylor wrote: > On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote: > > I don't think the dummy functions are a good approach, there were plenty > > things after it when I played with them. > > To be technical, the point wasn't to have no code after it, but to have no > *hot* code *before* it, since with the iodlr approach the first 1.99MB of > .text is below the first aligned boundary within that section. But yeah, > I'm happy to ditch that hack entirely. Just because code is colder than the alternative branch, doesn't necessary mean it's entirely cold overall. I saw hits to things after the dummy function to have a perf effect. > > > > With these flags the "R E" segments all start on a 0x200000/2MiB > boundary > > > and > > > > are padded to the next 2MiB boundary. However the OS / dynamic loader > only > > > > maps the necessary part, not all the zero padding. > > > > > > > > This means that if we were to issue a MADV_COLLAPSE, we can before it > do > > > an > > > > mremap() to increase the length of the mapping. > > > > > > I see, interesting. What location are you passing for madvise() and > > > mremap()? The beginning of the segment (for me has .init/.plt) or an > > > aligned boundary within .text? > > > /* > > * Make huge pages out of it. Requires at least linux 6.1. We > could > > * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all > that > > * much in older kernels. > > */ > > About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for > THP? The man page seems to indicate that. MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is to always or madvise. My understanding is that MADV_COLLAPSE will work even if /sys/kernel/mm/transparent_hugepage/enabled is set to never. > In the support work I've done, the standard recommendation is to turn THP > off, especially if they report sudden performance problems. I think that's pretty much an outdated suggestion FWIW. Largely caused by Red Hat extremely aggressively backpatching transparent hugepages into RHEL 6 (IIRC). Lots of improvements have been made to THP since then. I've tried to see negative effects maybe 2-3 years back, without success. I really don't see a reason to ever set /sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 'madvise'. > If explicit HP's are used for shared mem, maybe THP is less of a risk? I > need to look back at the tests that led to that advice... I wouldn't give that advice to customers anymore, unless they use extremely old platforms or unless there's very concrete evidence. > > A real version would have to open /proc/self/maps and do this for at least > > I can try and generalize your above sketch into a v2 patch. Cool. > > postgres' r-xp mapping. We could do it for libraries too, if they're > suitably > > aligned (both in memory and on-disk). > > It looks like plpgsql is only 27 standard pages in size... > > Regarding glibc, we could try moving a couple of the hotter functions into > PG, using smaller and simpler coding, if that has better frontend cache > behavior. The paper "Understanding and Mitigating Front-End Stalls in > Warehouse-Scale Computers" talks about this, particularly section 4.4 > regarding memcmp(). I think the amount of work necessary for that is nontrivial and continual. So I'm loathe to go there. > > > I quickly tried to align the segments with the linker and then in my > patch > > > have the address for mmap() rounded *down* from the .text start to the > > > beginning of that segment. It refused to start without logging an error. > > > > Hm, what linker was that? I did note that you need some additional flags > for > > some of the linkers. > > BFD, but I wouldn't worry about that failure too much, since the > mremap()/madvise() strategy has a lot fewer moving parts. > > On the subject of linkers, though, one thing that tripped me up was trying > to change the linker with Meson. First I tried > > -Dc_args='-fuse-ld=lld' It's -Dc_link_args=... > but that led to warnings like this when : > /usr/bin/ld: warning: -z separate-loadable-segments ignored > > When using this in the top level meson.build > > elif host_system == 'linux' > sema_kind = 'unnamed_posix' > cppflags += '-D_GNU_SOURCE' > # Align the loadable segments to 2MB boundaries to support remapping to > # huge pages. > ldflags += cc.get_supported_link_arguments([ > '-Wl,-zmax-page-size=0x200000', > '-Wl,-zcommon-page-size=0x200000', > '-Wl,-zseparate-loadable-segments' > ]) > > > According to > > https://mesonbuild.com/howtox.html#set-linker > > I need to add CC_LD=lld to the env vars before invoking, which got rid of > the warning. Then I wanted to verify that lld was actually used, and in > > https://releases.llvm.org/14.0.0/tools/lld/docs/index.html You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's cases with -d keeprsp, because the commandline ends up being long enough for a response file being used). > it says I can run this and it should show “Linker: LLD”, but that doesn't > appear for me: > > $ readelf --string-dump .comment inst-perf/bin/postgres > > String dump of section '.comment': > [ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2) That's added by the compiler, not the linker. See e.g.: $ readelf --string-dump .comment src/backend/postgres_lib.a.p/storage_ipc_procarray.c.o String dump of section '.comment': [ 1] GCC: (Debian 12.2.0-9) 12.2.0 Greetings, Andres Freund
pgsql-hackers by date: