Re: patch: add MAP_HUGETLB to mmap() where supported (WIP) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)
Date
Msg-id 20130916131850.GB5249@awork2.anarazel.de
Whole thread Raw
In response to Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)
List pgsql-hackers
On 2013-09-16 16:13:57 +0300, Heikki Linnakangas wrote:
> On 16.09.2013 13:15, Andres Freund wrote:
> >On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
> >>On 14.09.2013 02:41, Richard Poole wrote:
> >>>The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
> >>>on systems that support it. It's based on Christian Kruse's patch from
> >>>last year, incorporating suggestions from Andres Freund.
> >>
> >>I don't understand the logic in figuring out the pagesize, and the smallest
> >>supported hugepage size. First of all, even without the patch, why do we
> >>round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
> >>will round up the request all by itself. The mmap() man page doesn't say
> >>anything about length having to be a multiple of pages size.
> >
> >I think it does:
> >        EINVAL We don't like addr, length, or offset (e.g., they are  too
> >               large,  or  not aligned on a page boundary).
> 
> That doesn't mean that they *all* have to be aligned on a page boundary.
> It's understandable that 'addr' and 'offset' have to be, but it doesn't make
> much sense for 'length'.
> 
> >and
> >        A file is mapped in multiples of the page size.  For a file that is not a multiple
> >        of  the  page size, the remaining memory is zeroed when mapped, and writes to that
> >        region are not written out to the file.  The effect of changing the  size  of  the
> >        underlying  file  of  a  mapping  on the pages that correspond to added or removed
> >        regions of the file is unspecified.
> >
> >And no, according to my past experience, the kernel does *not* do any
> >such rounding up. It will just fail.
> 
> I wrote a little test program to play with different values (attached). I
> tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd6), and on a
> VM with a fresh Centos 6.4 install with 2.6.32 kernel
> (2.6.32-358.18.1.el6.x86_64), and they both work the same:
> 
> $ ./mmaptest 100 # mmap 100 bytes
> 
> in a different terminal:
> $ cat /proc/meminfo  | grep HugePages_Rsvd
> HugePages_Rsvd:        1
> 
> So even a tiny allocation, much smaller than any page size, succeeds, and it
> reserves a huge page. I tried the same with larger values; the kernel always
> uses huge pages, and rounds up the allocation to a multiple of the huge page
> size.

When developing the prototype I am pretty sure I had to add the rounding
up - but I am not sure why now, because after chatting with Heikki about
it, I've looked around and the initial MAP_HUGETLB support in the kernel
(commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for
rounding up.

> So, let's just get rid of the /sys scanning code.

Alternatively we could round up NBuffers to actually use the
additionally allocated space. Not sure if that's worth the amount of
code, but wasting several megabytes - or even gigabytes - of memory
isn't nice either.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)
Next
From: Andres Freund
Date:
Subject: Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)