Thread: remap the .text segment into huge pages at run time

remap the .text segment into huge pages at run time

From
John Naylor
Date:
It's been known for a while that Postgres spends a lot of time translating instruction addresses, and using huge pages in the text segment yields a substantial performance boost in OLTP workloads [1][2]. The difficulty is that this normally requires a lot of painstaking work (unless your OS does superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to remap the .text segment to huge pages at program start. Attached is a hackish, Meson-only, "works on my machine" patchset to experiment with this idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap aligned start address to a second region with huge pages and MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit

The reason this doesn't "saw off the branch you're standing on" is that the remapping is done in a function that's forced to live in a different segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end:   0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end:   0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of 5MB of Postgres text, only one huge page can be used, but that still saves 512 entries in the TLB and might bring a small improvement. The un-remapped region below 0x600000 contains the ~600kB of "cold" code, since the linker puts the cold section first, at least with recent versions of ld and lld.

0002 is my attempt to force the linker's hand and get the entire text segment mapped to huge pages. It's quite a finicky hack, and easily broken (see below). That said, it still builds easily within our normal build process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's done for predictability, but that means the next 2MB boundary is very nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to push the end of the .text segment over the next aligned boundary, or to ~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE    
 --------------  --------------
  53.7%  4.90Mi  58.7%  4.90Mi    .text
...
 100.0%  9.12Mi 100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000486000 086000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000486020 086020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000487540 087540 4e59d2 00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE    
 --------------  --------------
  46.9%  8.00Mi  69.9%  8.00Mi    .text
...
 100.0%  17.1Mi 100.0%  11.4Mi    TOTAL


$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000600000 200000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000600020 200020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000601540 201540 7ff512 00  AX  0   0 16
...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end:   0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end:   0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end, practically all hot pages are now remapped. The biggest problem with the hackish filler function (in addition to maintainability) is that, if explicit huge pages are turned off in the kernel and the .text segment is larger than 8MB, attempting mmap() with MAP_HUGETLB causes complete startup failure. I haven't looked into what's happening there yet, but I didn't want to get too far into the weeds before getting feedback on whether the entire approach in this thread is sound enough to justify working on further.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
    (paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

--
Attachment

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> It's been known for a while that Postgres spends a lot of time translating
> instruction addresses, and using huge pages in the text segment yields a
> substantial performance boost in OLTP workloads [1][2].

Indeed. Some of that we eventually should address by making our code less
"jumpy", but that's a large amount of work and only going to go so far.


> The difficulty is,
> this normally requires a lot of painstaking work (unless your OS does
> superpage promotion, like FreeBSD).

I still am confused by FreeBSD being able to do this without changing the
section alignment to be big enough. Or is the default alignment on FreeBSD
large enough already?


> I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> remap the .text segment to huge pages at program start. Attached is a
> hackish, Meson-only, "works on my machine" patchset to experiment with this
> idea.

I wonder how far we can get with just using the linker hints to align
sections. I know that the linux folks are working on promoting sufficiently
aligned executable pages to huge pages too, and might have succeeded already.

IOW, adding the linker flags might be a good first step.


> 0001 adapts the library to our error logging and GUC system. The overview:
> 
> - read ELF info to get the start/end addresses of the .text segment
> - calculate addresses therein aligned at huge page boundaries
> - mmap a temporary region and memcpy the aligned portion of the .text
> segment
> - mmap aligned start address to a second region with huge pages and
> MAP_FIXED
> - memcpy over from the temp region and revoke the PROT_WRITE bit

Would mremap()'ing the temporary region also work? That might be simpler and
more robust (you'd see the MAP_HUGETLB failure before doing anything
irreversible). And you then might not even need this:

> The reason this doesn't "saw off the branch you're standing on" is that the
> remapping is done in a function that's forced to live in a different
> segment, and doesn't call any non-libc functions living elsewhere:
> 
> static void
> __attribute__((__section__("lpstub")))
> __attribute__((__noinline__))
> MoveRegionToLargePages(const mem_range * r, int mmap_flags)


This would likely need a bunch more gating than the patch, understandably,
has. I think it'd fail horribly if there were .text relocations, for example?
I think there are some architectures that do that by default...


> 0002 is my attempt to force the linker's hand and get the entire text
> segment mapped to huge pages. It's quite a finicky hack, and easily broken
> (see below). That said, it still builds easily within our normal build
> process, and maybe there is a better way to get the effect.
> 
> It does two things:
> 
> - Pass the linker -Wl,-zcommon-page-size=2097152
> -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
> done for predictability, but that means the next 2MB boundary is very
> nearly 2MB away.

Yep. FWIW, my notes say

# align sections to 2MB boundaries for hugepage support
# bfd and gold linkers:
# -Wl,-zmax-page-size=0x200000 -Wl,-zcommon-page-size=0x200000
# lld:
# -Wl,-zmax-page-size=0x200000 -Wl,-z,separate-loadable-segments
# then copy binary to tmpfs mounted with -o huge=always

I.e. with lld you need slightly different flags -Wl,-z,separate-loadable-segments

The meson bit should probably just use
cc.get_supported_link_arguments([
  '-Wl,-zmax-page-size=0x200000',
  '-Wl,-zcommon-page-size=0x200000',
  '-Wl,-zseparate-loadable-segments'])

Afaict there's really no reason to not do that by default, allowing kernels
that can promote to huge pages to do so.


My approach to forcing huge pages to be used was to then:

# copy binary to tmpfs mounted with -o huge=always


> - Add a "cold" __asm__ filler function that just takes up space, enough to
> push the end of the .text segment over the next aligned boundary, or to
> ~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to
2MB, why do we need to fill things up on disk? The in-memory contents are the
relevant bit, no?


> Since the front is all-cold, and there is very little at the end,
> practically all hot pages are now remapped. The biggest problem with the
> hackish filler function (in addition to maintainability) is, if explicit
> huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> causes complete startup failure if the .text segment is larger than 8MB.

I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
independent of the .text segment size?



> +/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
> +static int
> +FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
> +{
> +    ElfW(Shdr) text_section;
> +    FindParams *find_params = (FindParams *) data;
> +
> +    /*
> +     * We are only interested in the mapping matching the main executable.
> +     * This has the empty string for a name.
> +     */
> +    if (hdr->dlpi_name[0] != '\0')
> +        return 0;
> +

It's not entirely clear we'd only ever want to do this for the main
executable. E.g. plpgsql could also benefit.


> diff --git a/meson.build b/meson.build
> index bfacbdc0af..450946370c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -239,6 +239,9 @@ elif host_system == 'freebsd'
>  elif host_system == 'linux'
>    sema_kind = 'unnamed_posix'
>    cppflags += '-D_GNU_SOURCE'
> +  # WIP: debug builds are huge
> +  # TODO: add portability check
> +  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']

What's that WIP about?


>  elif host_system == 'netbsd'
>    # We must resolve all dynamic linking in the core server at program start.
> diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
> new file mode 100644
> index 0000000000..de4e33bb05
> --- /dev/null
> +++ b/src/backend/port/filler.c
> @@ -0,0 +1,29 @@
> +/*
> + * Add enough padding to .text segment to bring the end just
> + * past a 2MB alignment boundary. In practice, this means .text needs
> + * to be at least 8MB. It shouldn't be much larger than this,
> + * because then more hot pages will remain in 4kB pages.
> + *
> + * FIXME: With this filler added, if explicit huge pages are turned off
> + * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
> + * instead of reporting failure if the .text segment is larger than 8MB.
> + *
> + * See MapStaticCodeToLargePages() in large_page.c
> + *
> + * XXX: The exact amount of filler must be determined experimentally
> + * on platforms of interest, in non-assert builds.
> + *
> + */
> +static void
> +__attribute__((used))
> +__attribute__((cold))
> +fill_function(int x)
> +{
> +    /* TODO: More architectures */
> +#ifdef __x86_64__
> +__asm__(
> +    ".fill 3251000"
> +);
> +#endif
> +    (void) x;
> +}
> \ No newline at end of file
> diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
> index 5ab65115e9..d876712e0c 100644
> --- a/src/backend/port/meson.build
> +++ b/src/backend/port/meson.build
> @@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
>  endif
>  
>  if cdata.has('USE_SYSV_SHARED_MEMORY')
> +  if host_system == 'linux'
> +    backend_sources += files('filler.c')
> +  endif
>    backend_sources += files('large_page.c')
>    backend_sources += files('sysv_shmem.c')
>  endif
> -- 
> 2.37.3
> 

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.

With the weird caveat that on some filesystems one needs to make sure that the
executable doesn't use reflinks to share parts of other files, which the mold
linker and cp do create... Not a concern on ext4, but it is on xfs. I took to
copying the postgres binary with cp --reflink=never


FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x555555800,0x555556122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2


Perf results:

c=150; psql -f ~/tmp/prewarm.sql; perf stat -a \
  -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g \
  pgbench -n -M prepared -S -P1 -c$c -j$c -T10


without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

 1,184,344,476,152      cycles                                                               (71.41%)
     2,846,146,710      iTLB-loads                                                           (71.43%)
     2,021,885,782      iTLB-load-misses                 #   71.04% of all iTLB cache accesses  (71.44%)
    75,633,850,933      itlb_misses.walk_active                                              (71.44%)
     2,020,962,930      itlb_misses.walk_completed_4k                                        (71.44%)
         1,213,368      itlb_misses.walk_completed_2m_4m                                     (57.12%)
             2,293      itlb_misses.walk_completed_1g                                        (57.11%)

      10.064352587 seconds time elapsed



with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,049,140,611      cycles                                                               (71.42%)
     1,059,224,678      iTLB-loads                                                           (71.44%)
       653,603,712      iTLB-load-misses                 #   61.71% of all iTLB cache accesses  (71.44%)
    26,135,902,949      itlb_misses.walk_active                                              (71.44%)
       628,314,285      itlb_misses.walk_completed_4k                                        (71.44%)
        25,462,916      itlb_misses.walk_completed_2m_4m                                     (57.13%)
             2,228      itlb_misses.walk_completed_1g                                        (57.13%)

Note that while the rate of itlb-misses stays roughly the same, the total
number of iTLB loads reduced substantially, and the number of cycles in which
an itlb miss was in progress is 1/3 of what it was before.


A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 100000 / -t
10000 to compare the same amount of work) I get:


without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

 1,691,411,678,007      cycles                                                               (62.48%)
         8,856,107      itlb.itlb_flush                                                      (62.48%)
     4,600,041,062      iTLB-loads                                                           (62.48%)
     2,598,218,236      iTLB-load-misses                 #   56.48% of all iTLB cache accesses  (62.50%)
   100,095,862,126      itlb_misses.walk_active                                              (62.53%)
     2,595,376,025      itlb_misses.walk_completed_4k                                        (50.02%)
         2,558,713      itlb_misses.walk_completed_2m_4m                                     (50.00%)
             2,146      itlb_misses.walk_completed_1g                                        (49.98%)

      14.582927646 seconds time elapsed


pipelined:

tps = 161947.008995 (without initial connection time)

 Performance counter stats for 'system wide':

 1,095,948,341,745      cycles                                                               (62.46%)
           877,556      itlb.itlb_flush                                                      (62.46%)
     4,576,237,561      iTLB-loads                                                           (62.48%)
       307,971,166      iTLB-load-misses                 #    6.73% of all iTLB cache accesses  (62.52%)
    15,565,279,213      itlb_misses.walk_active                                              (62.55%)
       306,240,104      itlb_misses.walk_completed_4k                                        (50.03%)
         1,753,560      itlb_misses.walk_completed_2m_4m                                     (50.00%)
             2,189      itlb_misses.walk_completed_1g                                        (49.96%)

       9.374687885 seconds time elapsed



with MADV_COLLAPSE:

not pipelined:
tps = 1112040.859643 (without initial connection time)

 Performance counter stats for 'system wide':

 1,569,546,236,696      cycles                                                               (62.50%)
         7,094,291      itlb.itlb_flush                                                      (62.51%)
     1,599,845,097      iTLB-loads                                                           (62.51%)
       692,042,864      iTLB-load-misses                 #   43.26% of all iTLB cache accesses  (62.51%)
    31,529,641,124      itlb_misses.walk_active                                              (62.51%)
       669,849,177      itlb_misses.walk_completed_4k                                        (49.99%)
        22,708,146      itlb_misses.walk_completed_2m_4m                                     (49.99%)
             2,752      itlb_misses.walk_completed_1g                                        (49.99%)

      13.611206182 seconds time elapsed


pipelined:

tps = 162484.443469 (without initial connection time)

 Performance counter stats for 'system wide':

 1,092,897,514,658      cycles                                                               (62.48%)
           942,351      itlb.itlb_flush                                                      (62.48%)
       233,996,092      iTLB-loads                                                           (62.48%)
       102,155,575      iTLB-load-misses                 #   43.66% of all iTLB cache accesses  (62.49%)
     6,419,597,286      itlb_misses.walk_active                                              (62.52%)
        98,758,409      itlb_misses.walk_completed_4k                                        (50.03%)
         3,342,332      itlb_misses.walk_completed_2m_4m                                     (50.02%)
             2,190      itlb_misses.walk_completed_1g                                        (49.98%)

       9.355239897 seconds time elapsed

The difference in itlb.itlb_flush between pipelined / non-pipelined cases
unsurprisingly is stark.

While the pipelined case still sees a good bit reduced itlb traffic, the total
amount of cycles in which a walk is active is just not large enough to matter,
by the looks of it.

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > push the end of the .text segment over the next aligned boundary, or to
> > ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?

I now assume it's because you observed that the mappings set up by the loader
don't include the space between the segments?

With sufficient linker flags the segments are sufficiently aligned both on
disk and in memory to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000c7f58 0x00000000000c7f58  R      0x200000
  LOAD           0x0000000000200000 0x0000000000200000 0x0000000000200000
                 0x0000000000921d39 0x0000000000921d39  R E    0x200000
  LOAD           0x0000000000c00000 0x0000000000c00000 0x0000000000c00000
                 0x00000000002626b8 0x00000000002626b8  R      0x200000
  LOAD           0x0000000000fdf510 0x00000000011df510 0x00000000011df510
                 0x0000000000037fd6 0x000000000006a310  RW     0x200000

gold -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000009230f9 0x00000000009230f9  R E    0x200000
  LOAD           0x0000000000a00000 0x0000000000a00000 0x0000000000a00000
                 0x000000000033a738 0x000000000033a738  R      0x200000
  LOAD           0x0000000000ddf4e0 0x0000000000fdf4e0 0x0000000000fdf4e0
                 0x000000000003800a 0x000000000006a340  RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000033710c 0x000000000033710c  R      0x200000
  LOAD           0x0000000000400000 0x0000000000400000 0x0000000000400000
                 0x0000000000921cb0 0x0000000000921cb0  R E    0x200000
  LOAD           0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
                 0x0000000000020ae0 0x0000000000020ae0  RW     0x200000
  LOAD           0x0000000001000000 0x0000000001000000 0x0000000001000000
                 0x00000000000174ea 0x0000000000049820  RW     0x200000

mold -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000032dde9 0x000000000032dde9  R      0x200000
  LOAD           0x0000000000400000 0x0000000000400000 0x0000000000400000
                 0x0000000000921cbe 0x0000000000921cbe  R E    0x200000
  LOAD           0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
                 0x00000000002174e8 0x0000000000249820  RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that before issuing MADV_COLLAPSE, we can do an mremap() to
increase the length of the mapping.


MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070      cycles                                                               (55.53%)
   729,146,640,019      instructions                     #    0.62  insn per cycle           (66.65%)
         7,062,923      itlb.itlb_flush                                                      (66.65%)
     1,041,825,587      iTLB-loads                                                           (66.65%)
       634,272,420      iTLB-load-misses                 #   60.88% of all iTLB cache accesses  (66.66%)
    27,018,254,873      itlb_misses.walk_active                                              (66.68%)
       610,639,252      itlb_misses.walk_completed_4k                                        (44.47%)
        24,262,549      itlb_misses.walk_completed_2m_4m                                     (44.46%)
             2,948      itlb_misses.walk_completed_1g                                        (44.43%)

      10.039217004 seconds time elapsed


MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934      cycles                                                               (55.53%)
   746,008,850,147      instructions                     #    0.64  insn per cycle           (66.65%)
         7,538,962      itlb.itlb_flush                                                      (66.65%)
       799,861,088      iTLB-loads                                                           (66.65%)
       254,347,048      iTLB-load-misses                 #   31.80% of all iTLB cache accesses  (66.66%)
    14,427,296,885      itlb_misses.walk_active                                              (66.69%)
       221,811,835      itlb_misses.walk_completed_4k                                        (44.47%)
        32,881,405      itlb_misses.walk_completed_2m_4m                                     (44.46%)
             3,043      itlb_misses.walk_completed_1g                                        (44.43%)

      10.038517778 seconds time elapsed


compared to a run without any huge pages (via THP or MADV_COLLAPSE):

tps = 1034960.102843 (without initial connection time)

 Performance counter stats for 'system wide':

 1,183,743,785,066      cycles                                                               (55.54%)
   678,525,810,443      instructions                     #    0.57  insn per cycle           (66.65%)
         7,163,304      itlb.itlb_flush                                                      (66.65%)
     2,952,660,798      iTLB-loads                                                           (66.65%)
     2,105,431,590      iTLB-load-misses                 #   71.31% of all iTLB cache accesses  (66.66%)
    80,593,535,910      itlb_misses.walk_active                                              (66.68%)
     2,105,377,810      itlb_misses.walk_completed_4k                                        (44.46%)
         1,254,156      itlb_misses.walk_completed_2m_4m                                     (44.46%)
             3,366      itlb_misses.walk_completed_1g                                        (44.44%)

      10.039821650 seconds time elapsed


So a 7.96% win from no-huge-pages to MADV_COLLAPSE and a further 2.11% win
from there to also using mremap(), yielding a total of 10.23%. It's similar
across runs.


On my system the other libraries unfortunately aren't aligned properly. It'd
be nice to also remap at least libc. The majority of the remaining misses are
from the vdso (too small for a huge page), libc (not aligned properly),
returning from system calls (which flush the itlb) and pgbench / libpq (I
didn't add the mremap there, there's not enough code for a huge page without
it).

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped before it does so. At least, that issue is mentioned in that paper linked upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on fs one needs to make sure that the executable
> doesn't reflinks to reuse parts of other files, and that the mold linker and
> cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something devs should consider when microbenchmarking, to reduce noise from context switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > push the end of the .text segment over the next aligned boundary, or to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you either observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello world" program, which links with 8 filler objects, each with 32768 __attribute((used)) dummy functions. I just cargo-culted that idea and simplified it. Interestingly enough, looking through the commit history, they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried to align the segments with the linker and then in my patch have the address for mmap() rounded *down* from the .text start to the beginning of that segment. It refused to start without logging an error.

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text > 8MB (I have yet to enable core dumps on this new OS I have...), whereas without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and mremap()? The beginning of the segment (which for me contains .init/.plt) or an aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just
> > hardcode the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on fs one needs to make sure that the
> > executable doesn't use reflinks to reuse parts of other files, and that
> > the mold linker and cp do... Not a concern on ext4, but on xfs. I took
> > to copying the postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.


> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space,
> > > > enough to push the end of the .text segment over the next aligned
> > > > boundary, or to ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are
> > > aligned to 2MB, why do we need to fill things up on disk? The
> > > in-memory contents are the relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach, there were plenty of
things after it when I played with them.



> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.


> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > only maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > an mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment.  Here's my hacky code, with a bunch of
comments added.

       void *addr = (void*) 0x555555800000;
       void *end = (void *) 0x555555e09000;
       size_t advlen = (uintptr_t) end - (uintptr_t) addr;

       const size_t bound = 1024 * 1024 * 2;
       size_t advlen_up = (advlen + bound - 1) & ~(bound - 1);
       void *r2;
       int r;

       /*
        * Increase size of mapping to cover the trailing padding to the next
        * segment. Otherwise all the code in that range can't be put into
        * a huge page (access in the non-mapped range needs to cause a fault,
        * hence can't be in the huge page).
        * XXX: Should probably assert that that space is actually zeroes.
        */
       r2 = mremap(addr, advlen, advlen_up, 0);
       if (r2 == MAP_FAILED)
           fprintf(stderr, "mremap failed: %m\n");
       else if (r2 != addr)
           fprintf(stderr, "mremap wrong addr: %m\n");
       else
           advlen = advlen_up;

       /*
        * The docs for MADV_COLLAPSE say there should be at least one page
        * in the mapped space "for every eligible hugepage-aligned/sized
        * region to be collapsed". I just forced that. But probably not
        * necessary.
        */
       r = madvise(addr, advlen, MADV_WILLNEED);
       if (r != 0)
           fprintf(stderr, "MADV_WILLNEED failed: %m\n");

       r = madvise(addr, advlen, MADV_POPULATE_READ);
       if (r != 0)
           fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

       /*
        * Make huge pages out of it. Requires at least linux 6.1.  We could
        * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
        * much in older kernels.
        */
#define MADV_COLLAPSE    25
       r = madvise(addr, advlen, MADV_COLLAPSE);
       if (r != 0)
           fprintf(stderr, "MADV_COLLAPSE failed: %m\n");


A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're suitably
aligned (both in memory and on-disk).

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES=              \
   filler16.o           \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty of
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no *hot* code *before* it, since with the iodlr approach the first 1.99MB of .text is below the first aligned boundary within that section. But yeah, I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB
> > > boundary and are padded to the next 2MiB boundary. However the OS /
> > > dynamic loader only maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it
> > > do an mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

>        /*
>         * Make huge pages out of it. Requires at least linux 6.1.  We could
>         * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>         * much in older kernels.
>         */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP off, especially if they report sudden performance problems. If explicit HP's are used for shared mem, maybe THP is less of a risk? I need to look back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into PG, using smaller and simpler coding, if that has better frontend cache behavior. The paper "Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers" talks about this, particularly section 4.4 regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this when linking:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])


According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [     0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)


--
John Naylor
EDB: http://www.enterprisedb.com

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think the dummy functions are a good approach, there were plenty
> > of things after it when I played with them.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch doesn't necessarily
mean it's entirely cold overall. I saw enough hits to things after the dummy
function to have a perf effect.


> > > > With these flags the "R E" segments all start on a 0x200000/2MiB
> > > > boundary and are padded to the next 2MiB boundary. However the OS /
> > > > dynamic loader only maps the necessary part, not all the zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can before it
> > > > do an mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
>
> >        /*
> >         * Make huge pages out of it. Requires at least linux 6.1.  We
> >         * could fall back to MADV_HUGEPAGE if it fails, but it doesn't
> >         * do all that much in older kernels.
> >         */
>
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
set to always or madvise.  My understanding is that MADV_COLLAPSE will work
even if /sys/kernel/mm/transparent_hugepage/enabled is set to never.


> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 'madvise'.


> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.


> > A real version would have to open /proc/self/maps and do this for at least
>
> I can try and generalize your above sketch into a v2 patch.

Cool.


> > postgres' r-xp mapping. We could do it for libraries too, if they're
> > suitably aligned (both in memory and on-disk).
>
> It looks like plpgsql is only 27 standard pages in size...
>
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual. So
I'm loath to go there.


> > > I quickly tried to align the segments with the linker and then in my
> > > patch have the address for mmap() rounded *down* from the .text start
> > > to the beginning of that segment. It refused to start without logging
> > > an error.
> >
> > Hm, what linker was that? I did note that you need some additional
> > flags for some of the linkers.
>
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
>
> On the subject of linkers, though, one thing that tripped me up was trying
> to change the linker with Meson. First I tried
>
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...


> but that led to warnings like this when :
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
>
> When using this in the top level meson.build
>
> elif host_system == 'linux'
>   sema_kind = 'unnamed_posix'
>   cppflags += '-D_GNU_SOURCE'
>   # Align the loadable segments to 2MB boundaries to support remapping to
>   # huge pages.
>   ldflags += cc.get_supported_link_arguments([
>     '-Wl,-zmax-page-size=0x200000',
>     '-Wl,-zcommon-page-size=0x200000',
>     '-Wl,-zseparate-loadable-segments'
>   ])
>
>
> According to
>
> https://mesonbuild.com/howtox.html#set-linker
>
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in
>
> https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's case
with -d keeprsp, because the command line ends up being long enough for a
response file to be used).


> it says I can run this and it should show “Linker: LLD”, but that doesn't
> appear for me:
>
> $ readelf --string-dump .comment inst-perf/bin/postgres
>
> String dump of section '.comment':
>   [     0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

That's added by the compiler, not the linker. See e.g.:

$ readelf --string-dump .comment src/backend/postgres_lib.a.p/storage_ipc_procarray.c.o

String dump of section '.comment':
  [     1]  GCC: (Debian 12.2.0-9) 12.2.0

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

>        /*
>         * Make huge pages out of it. Requires at least linux 6.1.  We could
>         * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>         * much in older kernels.
>         */
> #define MADV_COLLAPSE    25
>        r = madvise(addr, advlen, MADV_COLLAPSE);
>        if (r != 0)
>            fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
>
>
> A real version would have to open /proc/self/maps and do this for at least
> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

Hi Andres, my kernel has been new enough for a while now, and since TLBs and context switches came up in the thread on... threads, I'm swapping this back in my head.

For the postmaster, it should be simple to have a function that just takes the address of itself, then parses /proc/self/maps to find the boundaries within which it lies. I haven't thought about libraries much. Though with just the postmaster it seems that would give us the biggest bang for the buck?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2023-06-14 12:40:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:
> 
> >        /*
> >         * Make huge pages out of it. Requires at least linux 6.1.  We
> >         * could fall back to MADV_HUGEPAGE if it fails, but it doesn't
> >         * do all that much in older kernels.
> >         */
> > #define MADV_COLLAPSE    25
> >        r = madvise(addr, advlen, MADV_COLLAPSE);
> >        if (r != 0)
> >            fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
> >
> >
> > A real version would have to open /proc/self/maps and do this for at least
> > postgres' r-xp mapping. We could do it for libraries too, if they're
> > suitably aligned (both in memory and on-disk).
> 
> Hi Andres, my kernel has been new enough for a while now, and since TLBs
> and context switches came up in the thread on... threads, I'm swapping this
> back in my head.

Cool - I think we have some real potential for substantial wins around this.


> For the postmaster, it should be simple to have a function that just takes
> the address of itself, then parses /proc/self/maps to find the boundaries
> within which it lies. I haven't thought about libraries much. Though with
> just the postmaster it seems that would give us the biggest bang for the
> buck?

I think that is the main bit, yes. We could just try to do this for the
libraries, but accept failure to do so?

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:

On Wed, Jun 14, 2023 at 12:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

> > A real version would have to open /proc/self/maps and do this for at least
> > postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> > aligned (both in memory and on-disk).

> For the postmaster, it should be simple to have a function that just takes the address of itself, then parses /proc/self/maps to find the boundaries within which it lies. I haven't thought about libraries much. Though with just the postmaster it seems that would give us the biggest bang for the buck?

Here's a start at that, trying with postmaster only. Unfortunately, I get "MADV_COLLAPSE failed: Invalid argument". I tried different addresses with no luck, and also got the same result with a small standalone program. I'm on ext4, so I gather I don't need "cp --reflink=never" but tried it anyway. Configuration looks normal according to "grep HUGEPAGE /boot/config-$(uname -r)".  Maybe there's something obvious I'm missing?

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> Here's a start at that, trying with postmaster only. Unfortunately, I get
> "MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
  MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > Here's a start at that, trying with postmaster only. Unfortunately, I get
> > "MADV_COLLAPSE failed: Invalid argument".
> 
> I also see that. But depending on the steps, I also see
>   MADV_COLLAPSE failed: Resource temporarily unavailable
> 
> I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting might
be specific to the in-development 6.4 kernel.

One thing I now remember, after trying older kernels, is that it looks like
one sometimes needs to call 'sync' to ensure the page cache data for the
executable is clean, before executing postgres.

Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:

On Wed, Jun 21, 2023 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > Here's a start at that, trying with postmaster only. Unfortunately, I get
> > > "MADV_COLLAPSE failed: Invalid argument".
> >
> > I also see that. But depending on the steps, I also see
> >   MADV_COLLAPSE failed: Resource temporarily unavailable
> >
> > I suspect there's some kernel issue. I'll try to ping somebody.
>
> Which kernel version are you using? It looks like the issue I am hitting might
> be specific to the in-development 6.4 kernel.

(Fedora 38) uname -r shows 

6.3.7-200.fc38.x86_64

--
John Naylor
EDB: http://www.enterprisedb.com

Re: remap the .text segment into huge pages at run time

From
Andres Freund
Date:
Hi,

On 2023-06-21 09:35:36 +0700, John Naylor wrote:
> On Wed, Jun 21, 2023 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > > Here's a start at that, trying with postmaster only. Unfortunately,
> > > > I get "MADV_COLLAPSE failed: Invalid argument".
> > >
> > > I also see that. But depending on the steps, I also see
> > >   MADV_COLLAPSE failed: Resource temporarily unavailable
> > >
> > > I suspect there's some kernel issue. I'll try to ping somebody.
> >
> > Which kernel version are you using? It looks like the issue I am
> > hitting might be specific to the in-development 6.4 kernel.
> 
> (Fedora 38) uname -r shows
> 
> 6.3.7-200.fc38.x86_64

FWIW, I bisected the bug I was encountering.

As far as I understand, it should not affect you, it was only merged into
6.4-rc1 and a fix is scheduled to be merged into 6.4 before its release. See
https://lore.kernel.org/all/ZJIWAvTczl0rHJBv@x1n/

So I am wondering if you're encountering a different kind of problem. As I
mentioned, I have observed that the pages need to be clean for this to
work. For me adding a "sync path/to/postgres" makes it work on 6.3.8. Without
the sync it starts to work a while later (presumably when the kernel got
around to writing the data back).


without sync:

self: 0x563b2abf0a72 start: 563b2a800000 end: 563b2afe3000
old advlen: 7e3000
new advlen: 800000
MADV_COLLAPSE failed: Invalid argument

with sync:
self: 0x555c947f0a72 start: 555c94400000 end: 555c94be3000
old advlen: 7e3000
new advlen: 800000


Greetings,

Andres Freund



Re: remap the .text segment into huge pages at run time

From
John Naylor
Date:

On Wed, Jun 21, 2023 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:

> So I am wondering if you're encountering a different kind of problem. As I
> mentioned, I have observed that the pages need to be clean for this to
> work. For me adding a "sync path/to/postgres" makes it work on 6.3.8. Without
> the sync it starts to work a while later (presumably when the kernel got
> around to writing the data back).

Hmm, then after rebooting today it shouldn't have had that problem until a build linked again, but I'll make sure to sync when building. Still the same failure, though. Looking more closely at the madvise man page, it has this under MADV_HUGEPAGE:

"The  MADV_HUGEPAGE,  MADV_NOHUGEPAGE,  and  MADV_COLLAPSE  operations  are available only if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."

Earlier, I only checked the first config option but didn't know about the second...

$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set

Apparently, it's experimental. That could be the explanation, but now I'm wondering why the fallback

madvise(addr, advlen, MADV_HUGEPAGE);

didn't also give an error. I wonder if we could mremap to some anonymous region and call madvise on that. That would be more similar to the hack I shared last year, which may be more fragile, but now it wouldn't need explicit huge pages.

--
John Naylor
EDB: http://www.enterprisedb.com