Thread: Patch für MAP_HUGETLB for mmap() shared memory

Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hey,

this is my first post to the -hackers lists, so be merciful ;-)

I created a patch which implements MAP_HUGETLB for sysv shared memory segments
(PGSharedMemoryCreate). It is based on tests of Tom Lane and Andres Freund, I
added error handling, huge page size detection and a GUC variable.

Performance improvements differ from about 1% in the worst case to about 13% in
the best case. Benchmarking results are as follows:

pgbench -i -s 100 test

Patched:
pgbench -n -S -j 64 -c 64 -T 10 -M prepared test
tps avg: 51879.2


Unpatched:
pgbench -n -S -j 64 -c 64 -T 10 -M prepared test
tps avg: 45321.6

tps increase: 6557.6, 12.6%


Patched:
pgbench -n -S -j 64 -c 64 -T 180 -M prepared test (patched)

number of transactions actually processed: 8767510
tps = 48705.159196 (including connections establishing)
tps = 48749.761241 (excluding connections establishing)


Unpatched:
mit pgbench -n -S -j 64 -c 64 -T 120 -M prepared test (unpatched)

number of transactions actually processed: 8295439
tps = 46083.559187 (including connections establishing)
tps = 46097.763939 (excluding connections establishing)

tps diff: 2652, 5%



create table large (a int, b int);
insert into large (a, b) select s, s + 10 from generate_series(1, 10000000) s;

5 times executed, with \timing on:
SELECT sum(a), sum(b) from large;
Time: 1143.880 ms unpatched
Time: 1125.644 ms patched
about 1% difference


The patch ist attached. Any comments?

Greetings,
 CK

Attachment

Re: Patch für MAP_HUGETLB for mmap() shared memory

From
Tom Lane
Date:
Christian Kruse <cjk+postgres@defunct.ch> writes:
> I created a patch which implements MAP_HUGETLB for sysv shared memory segments
> (PGSharedMemoryCreate). It is based on tests of Tom Lane and Andres Freund, I
> added error handling, huge page size detection and a GUC variable.

My recollection is we'd decided not to pursue that because of fears that
it could make performance horribly worse on some platforms.

            regards, tom lane



Re: Patch für MAP_HUGETLB for mmap() sharedmemory

From
Andres Freund
Date:
On Monday, October 29, 2012 09:33:25 PM Tom Lane wrote:
> Christian Kruse <cjk+postgres@defunct.ch> writes:
> > I created a patch which implements MAP_HUGETLB for sysv shared memory
> > segments (PGSharedMemoryCreate). It is based on tests of Tom Lane and
> > Andres Freund, I added error handling, huge page size detection and a
> > GUC variable.
>
> My recollection is we'd decided not to pursue that because of fears that
> it could make performance horribly worse on some platforms.

Hm. I don't so. Maybe youre remembering our discussion about mlock?

MAP_HUGETLB is linux specific anyway, there are other OSes doing this, but I
don't think a standard exists.

http://archives.postgresql.org/message-
id/201206292152.40311.andres%402ndquadrant.com and
http://archives.postgresql.org/message-
id/CA%2BTgmoZGX0Pi0rw5sDH0Uz%3D03WkQ%3DmnoAW3TXVEfvUpyW%2BfMjw%40mail.gmail.com
seem to be the most recent discussions arround this.


The parts I like most about having usable hugepages arround is usable 'ps'
output and way smaller pagetables...

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Peter Geoghegan
Date:
On 29 October 2012 20:14, Christian Kruse <cjk+postgres@defunct.ch> wrote:
> created a patch which implements MAP_HUGETLB for sysv shared memory segments
> (PGSharedMemoryCreate). It is based on tests of Tom Lane and Andres Freund, I
> added error handling, huge page size detection and a GUC variable.

I have a few initial observations on this.

* I think you should be making the new GUC PGC_INTERNAL on platforms
where MAP_HUGETLB is not defined or available. See also,
effective_io_concurrency. This gives sane error handling.

* Why aren't you failing when HUGE_TLB_ON and the mmap fails? You do
the same thing as HUGE_TLB_TRY, contrary to your documentation:

 +        if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)

* I think you should add MAP_HUGETLB to PG_MMAP_FLAGS directly, rather
than XOR'ing after the fact. We already build the flags that comprise
PG_MMAP_FLAGS using another set of intermediary flags, based on
platform considerations, so what you're doing here is just
inconsistent. Don't wrap the mmap() code in ifdefs, and instead rely
on the GUC being available on all platforms, with setting the GUC
causing an error on unsupported platforms, in the style of
effective_io_concurrency, as established in commit
3b07182e613a0e63ff662e3dc7f820f010923ec7 . Build a PG_MAP_HUGETLB
intermediary flag on all platforms.

* Apparently you're supposed to do this for the benefit of Itanic [1]:

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
#endif

* You aren't following our style guide with respect to error messages [2].

[1] https://www.kernel.org/doc/Documentation/vm/map_hugetlb.c

[2] http://www.postgresql.org/docs/devel/static/error-style-guide.html
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hi,

On 29/10/12 21:14, Peter Geoghegan wrote:
> I have a few initial observations on this.

Thanks for your feedback.

>
> * I think you should be making the new GUC PGC_INTERNAL on platforms
> where MAP_HUGETLB is not defined or available. See also,
> effective_io_concurrency. This gives sane error handling.

Ok, sounds good for me.

> * Why aren't you failing when HUGE_TLB_ON and the mmap fails? You do
> the same thing as HUGE_TLB_TRY, contrary to your documentation:
>
>  +        if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)

No, what I did was mmap()ing the memory with MAP_HUGETLB and falling
back to mmap() w/o MAP_HUGETLB when mmap() failed. But there was
another bug, I didn't mmap() at all when huge_tlb_pages == off.

> * I think you should add MAP_HUGETLB to PG_MMAP_FLAGS directly, rather
> than XOR'ing after the fact. We already build the flags that comprise
> PG_MMAP_FLAGS using another set of intermediary flags, based on
> platform considerations, so what you're doing here is just
> inconsistent.

This would mean that I have to disable the bit when huge_tlb_pages ==
off. I think it is better to enable it if wanted and to just leave the
flags as they were when this feature has been turned off, do you
disagree?

> Don't wrap the mmap() code in ifdefs, and instead rely
> on the GUC being available on all platforms, with setting the GUC
> causing an error on unsupported platforms, in the style of
> effective_io_concurrency, as established in commit
> 3b07182e613a0e63ff662e3dc7f820f010923ec7 . Build a PG_MAP_HUGETLB
> intermediary flag on all platforms.

Ok, this sounds good. It will remove complexity from the code.

> * Apparently you're supposed to do this for the benefit of Itanic [1]:
>
> /* Only ia64 requires this */
> #ifdef __ia64__
> #define ADDR (void *)(0x8000000000000000UL)
> #define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
> #else
> #define ADDR (void *)(0x0UL)
> #define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
> #endif

Hm… is this enough? Or do we have to do more work for ia64?

> * You aren't following our style guide with respect to error messages [2].

Thanks, I wasn't aware of this. I attached a new version of the patch.

Greetings,
 CK

Re: Patch fürMAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hi,

On 29/10/12 16:33, Tom Lane wrote:
> > I created a patch which implements MAP_HUGETLB for sysv shared memory segments
> > (PGSharedMemoryCreate). It is based on tests of Tom Lane and Andres Freund, I
> > added error handling, huge page size detection and a GUC variable.
>
> My recollection is we'd decided not to pursue that because of fears that
> it could make performance horribly worse on some platforms.

This is why I created a GUC for turning this feature off.

Greetings,
 CK

Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hey,

Oh man, first I didn't sent the email to the list and now I forgot the
attachment. I should really get some sleep, sorry for any
inconveniences :(

Greetings,
 CK

Attachment

Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Andres Freund
Date:
On Tuesday, October 30, 2012 08:20:33 PM Christian Kruse wrote:
> Hey,
>
> Oh man, first I didn't sent the email to the list and now I forgot the
> attachment. I should really get some sleep, sorry for any
> inconveniences :(

+#ifdef MAP_HUGETLB
+#  ifdef __ia64__
+#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
+#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
+#  else

Not your fault, but that looks rather strange to me. The level of documentation
around the requirement of using MAP_FIXED is subpar...

+long
+InternalGetHugepageSize()
+{
+...
+            if (biggest_size < size && InternalGetFreeHugepagesCount(ent-
>d_name) > 0)
+            {
+                biggest_size = size;
+            }

I dislike automatically using the biggest size here. There are platforms with
16GB hugepages... I think we should rather always use the smallest size and
leave it to the kernel to choose whatever pagesize he wants to use.

+#ifdef MAP_HUGETLB
+        long    pagesize = InternalGetHugepageSize();
+
+        if (pagesize <= 0)
+            pagesize = sysconf(_SC_PAGE_SIZE);
+#else
         long    pagesize = sysconf(_SC_PAGE_SIZE);
+#endif

Thats going to log two warnings if the current system doesn't have hugepage
support compiled in and thus no /sys/kernel/mm/hugepages directory.
I think you should 1. only test this if TRY or ON is configured, 2. make the
messages in InternalGetHugepageSize DEBUG1 at least for the TRY case.

+    {
+        {"huge_tlb_pages",
+#ifdef MAP_HUGETLB
+            PGC_SUSET,
+#else
+            PGC_INTERNAL,
+#endif
+            RESOURCES_MEM,
+            gettext_noop("Enable/disable the use of the hugepages feature"),
+            NULL
+        },
+        &huge_tlb_pages,
+        HUGE_TLB_TRY, huge_tlb_options,
+        NULL, NULL, NULL
+    },

you use HUGE_TLB_TRY with ifdef here.


Looking forward to having this...

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hey,

On 30/10/12 20:33, Andres Freund wrote:
> +#ifdef MAP_HUGETLB
> +#  ifdef __ia64__
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
> +#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
> +#  else
>
> Not your fault, but that looks rather strange to me. The level of documentation
> around the requirement of using MAP_FIXED is subpar...

Yeah, looks strange to me, too. This is why I asked if this is really
all we have to do.

> I dislike automatically using the biggest size here. There are platforms with
> 16GB hugepages... I think we should rather always use the smallest size and
> leave it to the kernel to choose whatever pagesize he wants to use.

Good point. I will change this to the smallest available page size.

> Thats going to log two warnings if the current system doesn't have hugepage
> support compiled in and thus no /sys/kernel/mm/hugepages directory.
> I think you should 1. only test this if TRY or ON is configured, 2. make the
> messages in InternalGetHugepageSize DEBUG1 at least for the TRY case.

Ok, point taken.

> […]
> +             HUGE_TLB_TRY, huge_tlb_options,
> +             NULL, NULL, NULL
> +     },
>
> you use HUGE_TLB_TRY with ifdef here.

Ah, right. Setting it to HUGE_TLB_OFF when no MAP_HUGETLB is available.

Greetings,
 CK

Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hey,

ok, I think I implemented all of the changes you requested. All but
the ia64 dependent, I have to do more research for this one.


Greetings,
 CK

Attachment

Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap()shared memory

From
Christian Kruse
Date:
Hey,

On Tuesday 30 October 2012 20:33:18 Andres Freund wrote:
> +#ifdef MAP_HUGETLB
> +#  ifdef __ia64__
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
> +#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
> +#  else
>
> Not your fault, but that looks rather strange to me. The level of
> documentation around the requirement of using MAP_FIXED is subpar...

Ok, after further reading I think with MAP_FIXED one has to „generate“ SHM
segment addresses hisself. So 0x8000000000000000UL would just be one possible
location for a hugepage.

Since this feature is not used in PG for now, the code should work. Could
someone with access to a ia64 architecture verify it?

Greetings,
 CK




Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hey,

On 30/10/12 20:33, Andres Freund wrote:
> +#ifdef MAP_HUGETLB
> +#  ifdef __ia64__
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
> +#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
> +#  else
>
> Not your fault, but that looks rather strange to me. The level of documentation
> around the requirement of using MAP_FIXED is subpar...

Yeah, looks strange to me, too. This is why I asked if this is really
all we have to do.

> I dislike automatically using the biggest size here. There are platforms with
> 16GB hugepages... I think we should rather always use the smallest size and
> leave it to the kernel to choose whatever pagesize he wants to use.

Good point. I will change this to the smallest available page size.

> Thats going to log two warnings if the current system doesn't have hugepage
> support compiled in and thus no /sys/kernel/mm/hugepages directory.
> I think you should 1. only test this if TRY or ON is configured, 2. make the
> messages in InternalGetHugepageSize DEBUG1 at least for the TRY case.

Ok, point taken.

> […]
> +        HUGE_TLB_TRY, huge_tlb_options,
> +        NULL, NULL, NULL
> +    },
>
> you use HUGE_TLB_TRY with ifdef here.

Ah, right. Setting it to HUGE_TLB_OFF when no MAP_HUGETLB is available.

Greetings,
 CK

Re: Patch für MAP_HUGETLB for mmap() shared memory

From
Christian Kruse
Date:
Hi,

On Monday 29 October 2012 16:33:25 Tom Lane wrote:
> Christian Kruse <cjk+postgres@defunct.ch> writes:
> > I created a patch which implements MAP_HUGETLB for sysv shared memory
> > segments (PGSharedMemoryCreate). It is based on tests of Tom Lane and
> > Andres Freund, I added error handling, huge page size detection and a GUC
> > variable.
> My recollection is we'd decided not to pursue that because of fears that
> it could make performance horribly worse on some platforms.

This is why I created a GUC variable huge_tlb_pages = on|off|try to
enable/disable this feature.

A performance improvement up to 10% seems worth the trouble of setting another
config value, IMO.

Greetings,
 CK




Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Andres Freund
Date:
Hi CK,

On 2012-10-30 21:16:07 +0100, Christian Kruse wrote:
> index b4fcbaf..66ed10f 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

I think a short introduction or at least a reference on how to configure
hugepages would be a good thing.

>       <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
>        <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
>        <indexterm>
> diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
> index df06312..f9de239 100644
> --- a/src/backend/port/sysv_shmem.c
> +++ b/src/backend/port/sysv_shmem.c
> @@ -27,10 +27,14 @@
>  #ifdef HAVE_SYS_SHM_H
>  #include <sys/shm.h>
>  #endif
> +#ifdef MAP_HUGETLB
> +#include <dirent.h>
> +#endif

I think a central #define for the MAP_HUGETLB capability would be a good
idea, akin to HAVE_SYS_SHM_H.

E.g. this:
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -22,6 +22,7 @@
>  #include <limits.h>
>  #include <unistd.h>
>  #include <sys/stat.h>
> +#include <sys/mman.h>
>  #ifdef HAVE_SYSLOG
>  #include <syslog.h>
>  #endif

is unlikely to fly on windows.


> +/*
> + *    static long InternalGetHugepageSize()
> + *
> + * Attempt to get a valid hugepage size from /sys/kernel/mm/hugepages/ by
> + * reading directory contents
> + * Will fail (return -1) if the directory could not be opened or no valid
> + * page sizes are available. Will return the biggest hugepage size on
> + * success.
> + *
> + */

The "biggest" remark is out of date.


> +static long
> +InternalGetHugepageSize()
> +{
> ...
> +            if ((smallest_size == -1 || size < smallest_size)
> +                && InternalGetFreeHugepagesCount(ent->d_name) > 0)
> +            {
> +                smallest_size = size;
> +            }
> ...
> +
> +    if (smallest_size == -1)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Could not find a valid hugepage size"),
> +                 errhint("This error usually means that either CONFIG_HUGETLB_PAGE "
> +                         "is not in kernel or that your architecture does not "
> +                         "support hugepages or you did not configure hugepages")));
> +    }

I think differentiating the error message between no hugepages found and
InternalGetFreeHugepagesCount(ent->d_name) always beeing zero would be a
good idea. Failing this way if
InternalGetFreeHugepagesCount(ent->d_name) < 0 seems fine.


Greetings,

Andres Freund



Re: Patch fürMAP_HUGETLB for mmap() shared memory

From
Andres Freund
Date:
Oh, one more thing...

On 2012-10-30 21:16:07 +0100, Christian Kruse wrote:
> ok, I think I implemented all of the changes you requested. All but
> the ia64 dependent, I have to do more research for this one.

I vote for simply not caring about ia64.

This is:

> +#ifdef MAP_HUGETLB
> +#  ifdef __ia64__
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
> +#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
> +#  else
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x0UL)
> +#    define PG_MAP_HUGETLB MAP_HUGETLB
> +#  endif
> +#else
> +#  define PG_MAP_HUGETLB 0
> +#endif

too much underdocumented crazyness for a very minor platform. Should
somebody with the approprate harware want to submit an additional patch,
fine....


Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Patch fürMAP_HUGETLB for mmap() shared memory

From
Andres Freund
Date:
On 2012-10-30 21:16:07 +0100, Christian Kruse wrote:
> +static long
> +InternalGetHugepageSize()
> +{
> +    DIR *dir = opendir(HUGE_PAGE_INFO_DIR);
> +    long smallest_size = -1, size;
> +    char *ptr;
> ...
> +    while((ent = readdir(dir)) != NULL)
> +    {

This should be (Allocate|Read)Dir btw.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Bruce Momjian
Date:
Did we decide against specifying huge pages in Postgres?

---------------------------------------------------------------------------

On Tue, Oct 30, 2012 at 09:16:07PM +0100, Christian Kruse wrote:
> Hey,
>
> ok, I think I implemented all of the changes you requested. All but
> the ia64 dependent, I have to do more research for this one.
>
>
> Greetings,
>  CK

> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index b4fcbaf..66ed10f 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -1049,6 +1049,37 @@ include 'filename'
>        </listitem>
>       </varlistentry>
>
> +     <varlistentry id="guc-huge-tlb-pages" xreflabel="huge_tlb_pages">
> +      <term><varname>huge_tlb_pages</varname> (<type>enum</type>)</term>
> +      <indexterm>
> +       <primary><varname>huge_tlb_pages</> configuration parameter</primary>
> +      </indexterm>
> +      <listitem>
> +       <para>
> +        Enables/disables the use of huge tlb pages. Valid values are
> +        <literal>on</literal>, <literal>off</literal> and <literal>try</literal>.
> +        The default value is <literal>try</literal>.
> +       </para>
> +
> +       <para>
> +        With <varname>huge_tlb_pages</varname> set to <literal>on</literal>
> +        <symbol>mmap()</symbol> will be called with <symbol>MAP_HUGETLB</symbol>.
> +        If the call fails the server will fail fatally.
> +       </para>
> +
> +       <para>
> +        With <varname>huge_tlb_pages</varname> set to <literal>off</literal> we
> +        will not use <symbol>MAP_HUGETLB</symbol> at all.
> +       </para>
> +
> +       <para>
> +        With <varname>huge_tlb_pages</varname> set to <literal>try</literal>
> +        we will try to use <symbol>MAP_HUGETLB</symbol> and fall back to
> +        <symbol>mmap()</symbol> without <symbol>MAP_HUGETLB</symbol>.
> +       </para>
> +      </listitem>
> +     </varlistentry>
> +
>       <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
>        <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
>        <indexterm>
> diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
> index df06312..f9de239 100644
> --- a/src/backend/port/sysv_shmem.c
> +++ b/src/backend/port/sysv_shmem.c
> @@ -27,10 +27,14 @@
>  #ifdef HAVE_SYS_SHM_H
>  #include <sys/shm.h>
>  #endif
> +#ifdef MAP_HUGETLB
> +#include <dirent.h>
> +#endif
>
>  #include "miscadmin.h"
>  #include "storage/ipc.h"
>  #include "storage/pg_shmem.h"
> +#include "utils/guc.h"
>
>
>  typedef key_t IpcMemoryKey;        /* shared memory key passed to shmget(2) */
> @@ -61,6 +65,19 @@ typedef int IpcMemoryId;        /* shared memory ID returned by shmget(2) */
>  #define MAP_FAILED ((void *) -1)
>  #endif
>
> +#ifdef MAP_HUGETLB
> +#  ifdef __ia64__
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x8000000000000000UL)
> +#    define PG_MAP_HUGETLB (MAP_HUGETLB|MAP_FIXED)
> +#  else
> +#    define PG_HUGETLB_BASE_ADDR (void *)(0x0UL)
> +#    define PG_MAP_HUGETLB MAP_HUGETLB
> +#  endif
> +#else
> +#  define PG_MAP_HUGETLB 0
> +#endif
> +
> +
>
>  unsigned long UsedShmemSegID = 0;
>  void       *UsedShmemSegAddr = NULL;
> @@ -73,7 +90,6 @@ static void IpcMemoryDelete(int status, Datum shmId);
>  static PGShmemHeader *PGSharedMemoryAttach(IpcMemoryKey key,
>                       IpcMemoryId *shmid);
>
> -
>  /*
>   *    InternalIpcMemoryCreate(memKey, size)
>   *
> @@ -342,6 +358,155 @@ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
>  }
>
>
> +#ifdef MAP_HUGETLB
> +#define HUGE_PAGE_INFO_DIR  "/sys/kernel/mm/hugepages"
> +
> +/*
> + *    static long InternalGetFreeHugepagesCount(const char *name)
> + *
> + * Attempt to read the number of available hugepages from
> + * /sys/kernel/mm/hugepages/hugepages-<size>/free_hugepages
> + * Will fail (return -1) if file could not be opened, 0 if no pages are available
> + * and > 0 if there are free pages
> + *
> + */
> +static long
> +InternalGetFreeHugepagesCount(const char *name)
> +{
> +    int fd;
> +    char buff[1024];
> +    size_t len;
> +    long result;
> +    char *ptr;
> +
> +    len = snprintf(buff, 1024, "%s/%s/free_hugepages", HUGE_PAGE_INFO_DIR, name);
> +    if (len == 1024) /* I don't think that this will happen ever */
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Filename %s/%s/free_hugepages is too long", HUGE_PAGE_INFO_DIR, name),
> +                 errcontext("while checking hugepage size")));
> +        return -1;
> +    }
> +
> +    fd = open(buff, O_RDONLY);
> +    if (fd <= 0)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Could not open file %s: %s", buff, strerror(errno)),
> +                 errcontext("while checking hugepage size")));
> +        return -1;
> +    }
> +
> +    len = read(fd, buff, 1024);
> +    if (len <= 0)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Error reading from file %s: %s", buff, strerror(errno)),
> +                 errcontext("while checking hugepage size")));
> +        close(fd);
> +        return -1;
> +    }
> +
> +    /*
> +     * If the content of free_hugepages is longer than or equal to 1024 bytes
> +     * the rest is irrelevant; we simply want to know if there are any
> +     * hugepages left
> +     */
> +    if (len == 1024)
> +    {
> +        buff[1023] = 0;
> +    }
> +    else
> +    {
> +        buff[len] = 0;
> +    }
> +
> +    close(fd);
> +
> +    result = strtol(buff, &ptr, 10);
> +
> +    if (ptr == NULL)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Could not convert contents of file %s/%s/free_hugepages to number", HUGE_PAGE_INFO_DIR,
name),
> +                 errcontext("while checking hugepage size")));
> +        return -1;
> +    }
> +
> +    return result;
> +}
> +
> +/*
> + *    static long InternalGetHugepageSize()
> + *
> + * Attempt to get a valid hugepage size from /sys/kernel/mm/hugepages/ by
> + * reading directory contents
> + * Will fail (return -1) if the directory could not be opened or no valid
> + * page sizes are available. Will return the biggest hugepage size on
> + * success.
> + *
> + */
> +static long
> +InternalGetHugepageSize()
> +{
> +    struct dirent *ent;
> +    DIR *dir = opendir(HUGE_PAGE_INFO_DIR);
> +    long smallest_size = -1, size;
> +    char *ptr;
> +
> +    if (dir == NULL)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Could not open directory %s: %s", HUGE_PAGE_INFO_DIR, strerror(errno)),
> +                 errcontext("while checking hugepage size")));
> +        return -1;
> +    }
> +
> +    /*
> +     * Linux supports multiple hugepage sizes if the hardware
> +     * supports it; for each possible size there will be a
> +     * directory in /sys/kernel/mm/hugepages consisting of the
> +     * string hugepages- and the size of the page, e.g. on x86_64:
> +     * hugepages-2048kB
> +     */
> +    while((ent = readdir(dir)) != NULL)
> +    {
> +        if (strncmp(ent->d_name, "hugepages-", 10) == 0)
> +        {
> +            size = strtol(ent->d_name + 10, &ptr, 10);
> +            if (ptr == NULL)
> +            {
> +                continue;
> +            }
> +
> +            if (strcmp(ptr, "kB") == 0)
> +            {
> +                size *= 1024;
> +            }
> +
> +            if ((smallest_size == -1 || size < smallest_size)
> +                && InternalGetFreeHugepagesCount(ent->d_name) > 0)
> +            {
> +                smallest_size = size;
> +            }
> +        }
> +    }
> +
> +    closedir(dir);
> +
> +    if (smallest_size == -1)
> +    {
> +        ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
> +                (errmsg("Could not find a valid hugepage size"),
> +                 errhint("This error usually means that either CONFIG_HUGETLB_PAGE "
> +                         "is not in kernel or that your architecture does not "
> +                         "support hugepages or you did not configure hugepages")));
> +    }
> +
> +    return smallest_size;
> +}
> +#endif
> +
>  /*
>   * PGSharedMemoryCreate
>   *
> @@ -391,7 +556,17 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
>       */
>  #ifndef EXEC_BACKEND
>      {
> +#ifdef MAP_HUGETLB
> +        long    pagesize = 0;
> +
> +        if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)
> +            pagesize = InternalGetHugepageSize();
> +
> +        if (pagesize <= 0)
> +            pagesize = sysconf(_SC_PAGE_SIZE);
> +#else
>          long    pagesize = sysconf(_SC_PAGE_SIZE);
> +#endif
>
>          /*
>           * Ensure request size is a multiple of pagesize.
> @@ -410,8 +585,22 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
>           * to be false, we might need to add a run-time test here and do this
>           * only if the running kernel supports it.
>           */
> -        AnonymousShmem = mmap(NULL, size, PROT_READ|PROT_WRITE, PG_MMAP_FLAGS,
> -                              -1, 0);
> +
> +        if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)
> +        {
> +            AnonymousShmem = mmap(PG_HUGETLB_BASE_ADDR, size, PROT_READ|PROT_WRITE,
> +                                  PG_MMAP_FLAGS|PG_MAP_HUGETLB, -1, 0);
> +
> +            elog(DEBUG3, "mmap() tried with MAP_HUGEPAGE: %p", AnonymousShmem);
> +        }
> +
> +        if ((AnonymousShmem == MAP_FAILED && huge_tlb_pages == HUGE_TLB_TRY)
> +            || huge_tlb_pages == HUGE_TLB_OFF)
> +        {
> +            AnonymousShmem = mmap(NULL, size, PROT_READ|PROT_WRITE, PG_MMAP_FLAGS,
> +                                  -1, 0);
> +        }
> +
>          if (AnonymousShmem == MAP_FAILED)
>              ereport(FATAL,
>               (errmsg("could not map anonymous shared memory: %m"),
> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index 745e7be..28b6191 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -22,6 +22,7 @@
>  #include <limits.h>
>  #include <unistd.h>
>  #include <sys/stat.h>
> +#include <sys/mman.h>
>  #ifdef HAVE_SYSLOG
>  #include <syslog.h>
>  #endif
> @@ -389,6 +390,22 @@ static const struct config_enum_entry synchronous_commit_options[] = {
>  };
>
>  /*
> + * huge_tlb_pages may be on|off|try, where try is the default
> + * on: try to mmap() with MAP_HUGETLB and fail when mmap() fails
> + * off: do not try tp mmap() with MAP_HUGETLB
> + * try: try to mmap() with MAP_HUGETLB and fallback to mmap()
> + *      w/o MAP_HUGETLB
> + */
> +static const struct config_enum_entry huge_tlb_options[] = {
> +#ifdef MAP_HUGETLB
> +    {"on", HUGE_TLB_ON, false},
> +    {"try", HUGE_TLB_TRY, false},
> +#endif
> +    {"off", HUGE_TLB_OFF, false},
> +    {NULL, 0, false}
> +};
> +
> +/*
>   * Options for enum values stored in other modules
>   */
>  extern const struct config_enum_entry wal_level_options[];
> @@ -447,6 +464,12 @@ int            tcp_keepalives_idle;
>  int            tcp_keepalives_interval;
>  int            tcp_keepalives_count;
>
> +#ifdef MAP_HUGETLB
> +int huge_tlb_pages = HUGE_TLB_TRY;
> +#else
> +int huge_tlb_pages = HUGE_TLB_OFF;
> +#endif
> +
>  /*
>   * These variables are all dummies that don't do anything, except in some
>   * cases provide the value for SHOW to display.  The real state is elsewhere
> @@ -3301,6 +3324,26 @@ static struct config_enum ConfigureNamesEnum[] =
>          NULL, NULL, NULL
>      },
>
> +    {
> +        {"huge_tlb_pages",
> +#ifdef MAP_HUGETLB
> +            PGC_SUSET,
> +#else
> +            PGC_INTERNAL,
> +#endif
> +            RESOURCES_MEM,
> +            gettext_noop("Enable/disable the use of the hugepages feature"),
> +            NULL
> +        },
> +        &huge_tlb_pages,
> +#ifdef MAP_HUGETLB
> +        HUGE_TLB_TRY,
> +#else
> +        HUGE_TLB_OFF,
> +#endif
> +        huge_tlb_options,
> +        NULL, NULL, NULL
> +    },
>
>      /* End-of-list marker */
>      {
> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index eeb9b82..e5bafec 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -113,6 +113,7 @@
>
>  #shared_buffers = 32MB            # min 128kB
>                      # (change requires restart)
> +#huge_tlb_pages = try            # try to map memory with MAP_HUGETLB (on, off, try)
>  #temp_buffers = 8MB            # min 800kB
>  #max_prepared_transactions = 0        # zero disables the feature
>                      # (change requires restart)
> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index 06f797c..17f5870 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -230,6 +230,24 @@ extern int    tcp_keepalives_idle;
>  extern int    tcp_keepalives_interval;
>  extern int    tcp_keepalives_count;
>
> +
> +/*
> + * Possible values for huge_tlb_pages; default is HUGE_TLB_TRY
> + */
> +typedef enum
> +{
> +    HUGE_TLB_OFF,
> +    HUGE_TLB_ON,
> +    HUGE_TLB_TRY
> +} HugeTlbType;
> +
> +
> +/*
> + * configure the use of huge TLB pages
> + */
> +extern int huge_tlb_pages;
> +
> +
>  /*
>   * Functions exported by guc.c
>   */




--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Re: [HACKERS] Patch für MAP_HUGETLB for mmap() shared memory

From
Andres Freund
Date:
Hi,

On 2013-06-28 23:03:22 -0400, Bruce Momjian wrote:
> Did we decide against specifying huge pages in Postgres?

I don't think so. We need somebody to make some last modifications to
the patch though and Christian doesn't seem to have the time atm.

I think the bigger memory (size of the per process page table) and #cpus
(number of pg backends => number of page tables allocated) the more
imporant it gets to implement this. We already can spend gigabytes of
memory on those on big systems.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services