
From: John Naylor
Subject: remap the .text segment into huge pages at run time
Date:
Msg-id: CAFBsxsHx9z45MfsAjELFiPv_kcgCcH_P5jNa=WaeGxO7HU3mag@mail.gmail.com
It's been known for a while that Postgres spends a lot of time translating instruction addresses, and that using huge pages for the text segment yields a substantial performance boost in OLTP workloads [1][2]. The difficulty is that this normally requires a lot of painstaking work (unless your OS does superpage promotion, as FreeBSD does).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to remap the .text segment to huge pages at program start. Attached is a hackish, Meson-only, "works on my machine" patchset to experiment with this idea.

0001 adapts the library to our error logging and GUC system. The overview (a condensed sketch follows the list):

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap aligned start address to a second region with huge pages and MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit
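
To make the sequence concrete, here is a condensed sketch of what those steps boil down to (an illustration only, not the iodlr code itself; the function name, the hard-coded 2MB huge page size, and the error handling are assumptions):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/*
 * Sketch of the remapping steps.  In the real thing this function must
 * live outside the range being remapped (see below).
 */
static void
remap_text_to_huge_pages(char *text_start, char *text_end)
{
    /* round the start up and the end down to huge page boundaries */
    char   *start = (char *) (((uintptr_t) text_start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
    char   *end = (char *) ((uintptr_t) text_end & ~(HPAGE_SIZE - 1));
    size_t  len;
    char   *tmp;

    if (end <= start)
        return;                 /* no fully aligned huge page range */
    len = end - start;

    /* stash a copy of the aligned region in a temporary anonymous mapping */
    tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED)
        return;
    memcpy(tmp, start, len);

    /* replace the aligned region with a huge page mapping at the same address */
    if (mmap(start, len, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB,
             -1, 0) == MAP_FAILED)
    {
        munmap(tmp, len);
        return;
    }

    /* copy the code back and revoke the write permission again */
    memcpy(start, tmp, len);
    mprotect(start, len, PROT_READ | PROT_EXEC);
    munmap(tmp, len);
}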

The reason this doesn't "saw off the branch you're standing on" is that the remapping is done in a function that's forced to live in a different segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show:

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end:   0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end:   0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of 5MB of Postgres text, only one huge page can be used, but that still saves 512 TLB entries (one 2MB huge page covers the same range as 512 4kB pages) and might bring a small improvement. The un-remapped region below 0x600000 contains the ~600kB of "cold" code, since the linker puts the cold section first, at least with recent versions of ld and lld.

0002 is my attempt to force the linker's hand and get the entire text segment mapped to huge pages. It's quite a finicky hack, and easily broken (see below). That said, it still builds easily within our normal build process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's done for predictability, but it means the next 2MB boundary is very nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to push the end of the .text segment over the next aligned boundary, or to ~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE    
 --------------  --------------
  53.7%  4.90Mi  58.7%  4.90Mi    .text
...
 100.0%  9.12Mi 100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000486000 086000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000486020 086020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000487540 087540 4e59d2 00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE    
 --------------  --------------
  46.9%  8.00Mi  69.9%  8.00Mi    .text
...
 100.0%  17.1Mi 100.0%  11.4Mi    TOTAL


$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000600000 200000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000600020 200020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000601540 201540 7ff512 00  AX  0   0 16
...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end:   0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end:   0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end, practically all hot pages are now remapped. The biggest problem with the hackish filler function (in addition to maintainability) is that, if explicit huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB causes a complete startup failure when the .text segment is larger than 8MB. I haven't looked into what's happening there yet, but I didn't want to get too far into the weeds before getting feedback on whether the entire approach in this thread is sound enough to be worth working on further.
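
One possible guard, as a minimal untested sketch (it assumes the failure comes from the MAP_HUGETLB mmap() itself failing when explicit huge pages are unavailable, which hasn't been verified): probe with a small throwaway mapping first and skip the remapping entirely if the kernel can't satisfy it.

#include <stdbool.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2 * 1024 * 1024)

static bool
huge_pages_usable(void)
{
    /* try to create a single throwaway huge page mapping */
    void   *probe = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (probe == MAP_FAILED)
        return false;
    munmap(probe, HPAGE_SIZE);
    return true;
}

If something like that held up, the remapping could bail out before touching any live mappings instead of failing at startup.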

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
    (paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

