Thread: patch: add MAP_HUGETLB to mmap() where supported (WIP)
The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory on systems that support it. It's based on Christian Kruse's patch from last year, incorporating suggestions from Andres Freund. On a system with 4GB shared_buffers, doing pgbench runs long enough for each backend to touch most of the buffers, this patch saves nearly 8MB of memory per backend and improves performance by just over 2% on average. It is still WIP as there are a couple of points that Andres has pointed out to me that haven't been addressed yet; also, the documentation is incomplete. Richard -- Richard Poole http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
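The "nearly 8MB per backend" figure is about what a back-of-the-envelope page-table calculation gives. The sketch below assumes x86-64 with 8-byte page-table entries, 4kB base pages and 2MB huge pages (none of which is stated in the mail), and counts only the lowest-level entries needed once a backend has touched the whole segment. leaf_entry_bytes is a hypothetical helper name, not something from the patch.

```c
#include <stdint.h>

/* Assumption: x86-64, where each page-table entry occupies 8 bytes. */
#define PTE_BYTES 8ULL

/*
 * Bytes of lowest-level page-table entries needed to map mapped_bytes
 * with pages of page_bytes each. Upper-level tables are ignored.
 */
uint64_t
leaf_entry_bytes(uint64_t mapped_bytes, uint64_t page_bytes)
{
    uint64_t pages = (mapped_bytes + page_bytes - 1) / page_bytes;

    return pages * PTE_BYTES;
}
```

Under those assumptions, 4GB mapped with 4kB pages needs 8MB of entries per process, while the same 4GB mapped with 2MB pages needs 16kB, so the difference is just under 8MB.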
On Sat, 2013-09-14 at 00:41 +0100, Richard Poole wrote: > The attached patch adds the MAP_HUGETLB flag to mmap() for shared > memory on systems that support it. Please fix the tabs in the SGML files.
On 14.09.2013 02:41, Richard Poole wrote: > The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory > on systems that support it. It's based on Christian Kruse's patch from > last year, incorporating suggestions from Andres Freund. I don't understand the logic in figuring out the pagesize, and the smallest supported hugepage size. First of all, even without the patch, why do we round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel will round up the request all by itself. The mmap() man page doesn't say anything about length having to be a multiple of the page size. And with the patch, why do you bother detecting the minimum supported hugepage size? Surely the kernel will choose the appropriate hugepage size just fine on its own, no? > It is still WIP as there are a couple of points that Andres has pointed > out to me that haven't been addressed yet; Which points are those? I wonder if it would be better to allow setting huge_tlb_pages=try even on platforms that don't have hugepages. It would simply mean the same as 'off' on such platforms. - Heikki
On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote: > On 14.09.2013 02:41, Richard Poole wrote: > >The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory > >on systems that support it. It's based on Christian Kruse's patch from > >last year, incorporating suggestions from Andres Freund. > > I don't understand the logic in figuring out the pagesize, and the smallest > supported hugepage size. First of all, even without the patch, why do we > round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel > will round up the request all by itself. The mmap() man page doesn't say > anything about length having to be a multiple of pages size. I think it does: EINVAL We don't like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary). and A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining memory is zeroed when mapped, and writes to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified. And no, according to my past experience, the kernel does *not* do any such rounding up. It will just fail. > And with the patch, why do you bother detecting the minimum supported > hugepage size? Surely the kernel will choose the appropriate hugepage size > just fine on its own, no? It will fail if it's not a multiple. > >It is still WIP as there are a couple of points that Andres has pointed > >out to me that haven't been addressed yet; > > Which points are those? I don't know which point Richard already has fixed, so I'll let him comment on that. > I wonder if it would be better to allow setting huge_tlb_pages=try even on > platforms that don't have hugepages. It would simply mean the same as 'off' > on such platforms. I wouldn't argue against that.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16.09.2013 13:15, Andres Freund wrote: > On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote: >> On 14.09.2013 02:41, Richard Poole wrote: >>> The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory >>> on systems that support it. It's based on Christian Kruse's patch from >>> last year, incorporating suggestions from Andres Freund. >> >> I don't understand the logic in figuring out the pagesize, and the smallest >> supported hugepage size. First of all, even without the patch, why do we >> round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel >> will round up the request all by itself. The mmap() man page doesn't say >> anything about length having to be a multiple of pages size. > > I think it does: > EINVAL We don't like addr, length, or offset (e.g., they are too > large, or not aligned on a page boundary). That doesn't mean that they *all* have to be aligned on a page boundary. It's understandable that 'addr' and 'offset' have to be, but it doesn't make much sense for 'length'. > and > A file is mapped in multiples of the page size. For a file that is not a multiple > of the page size, the remaining memory is zeroed when mapped, and writes to that > region are not written out to the file. The effect of changing the size of the > underlying file of a mapping on the pages that correspond to added or removed > regions of the file is unspecified. > > And no, according to my past experience, the kernel does *not* do any > such rounding up. It will just fail. I wrote a little test program to play with different values (attached). 
I tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd6), and on a VM with a fresh Centos 6.4 install with 2.6.32 kernel (2.6.32-358.18.1.el6.x86_64), and they both work the same: $ ./mmaptest 100 # mmap 100 bytes in a different terminal: $ cat /proc/meminfo | grep HugePages_Rsvd HugePages_Rsvd: 1 So even a tiny allocation, much smaller than any page size, succeeds, and it reserves a huge page. I tried the same with larger values; the kernel always uses huge pages, and rounds up the allocation to a multiple of the huge page size. So, let's just get rid of the /sys scanning code. Robert, do you remember why you put the "pagesize = sysconf(_SC_PAGE_SIZE);" call in the new mmap() shared memory allocator? - Heikki
Attachment
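The attached mmaptest program is not reproduced in this archive. A minimal sketch of the kind of call such a test would exercise, assuming Linux and an anonymous shared mapping, could look like this. try_hugetlb_mmap is a hypothetical name, and the length is deliberately passed through without any rounding so that the kernel's own behaviour can be observed.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not the attached mmaptest: request len bytes of
 * anonymous shared memory backed by huge pages, without rounding len.
 * Returns the mapping, or MAP_FAILED with errno set.
 */
void *
try_hugetlb_mmap(size_t len)
{
#if defined(MAP_HUGETLB) && defined(MAP_ANONYMOUS)
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#else
    /* Platform has no MAP_HUGETLB: report the request as unsupported. */
    errno = ENOSYS;
    return MAP_FAILED;
#endif
}
```

Whether a call such as try_hugetlb_mmap(100) succeeds depends on the kernel version and on whether any huge pages are configured; while the mapping is held, HugePages_Rsvd in /proc/meminfo shows what was reserved.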
On 2013-09-16 16:13:57 +0300, Heikki Linnakangas wrote: > On 16.09.2013 13:15, Andres Freund wrote: > >On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote: > >>On 14.09.2013 02:41, Richard Poole wrote: > >>>The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory > >>>on systems that support it. It's based on Christian Kruse's patch from > >>>last year, incorporating suggestions from Andres Freund. > >> > >>I don't understand the logic in figuring out the pagesize, and the smallest > >>supported hugepage size. First of all, even without the patch, why do we > >>round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel > >>will round up the request all by itself. The mmap() man page doesn't say > >>anything about length having to be a multiple of pages size. > > > >I think it does: > > EINVAL We don't like addr, length, or offset (e.g., they are too > > large, or not aligned on a page boundary). > > That doesn't mean that they *all* have to be aligned on a page boundary. > It's understandable that 'addr' and 'offset' have to be, but it doesn't make > much sense for 'length'. > > >and > > A file is mapped in multiples of the page size. For a file that is not a multiple > > of the page size, the remaining memory is zeroed when mapped, and writes to that > > region are not written out to the file. The effect of changing the size of the > > underlying file of a mapping on the pages that correspond to added or removed > > regions of the file is unspecified. > > > >And no, according to my past experience, the kernel does *not* do any > >such rounding up. It will just fail. > > I wrote a little test program to play with different values (attached). 
I > tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd6), and on a > VM with a fresh Centos 6.4 install with 2.6.32 kernel > (2.6.32-358.18.1.el6.x86_64), and they both work the same: > > $ ./mmaptest 100 # mmap 100 bytes > > in a different terminal: > $ cat /proc/meminfo | grep HugePages_Rsvd > HugePages_Rsvd: 1 > > So even a tiny allocation, much smaller than any page size, succeeds, and it > reserves a huge page. I tried the same with larger values; the kernel always > uses huge pages, and rounds up the allocation to a multiple of the huge page > size. When developing the prototype I am pretty sure I had to add the rounding up - but I am not sure why now, because after chatting with Heikki about it, I've looked around and the initial MAP_HUGETLB support in the kernel (commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for rounding up. > So, let's just get rid of the /sys scanning code. Alternatively we could round up NBuffers to actually use the additionally allocated space. Not sure if that's worth the amount of code, but wasting several megabytes - or even gigabytes - of memory isn't nice either. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-16 15:18:50 +0200, Andres Freund wrote: > > So even a tiny allocation, much smaller than any page size, succeeds, and it > > reserves a huge page. I tried the same with larger values; the kernel always > > uses huge pages, and rounds up the allocation to a multiple of the huge page > > size. > > When developing the prototype I am pretty sure I had to add the rounding > up - but I am not sure why now, because after chatting with Heikki about > it, I've looked around and the initial MAP_HUGETLB support in the kernel > (commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for > rounding up. Ok, the reason for that seems to have been the following bug https://bugzilla.kernel.org/show_bug.cgi?id=56881 Greetings, Andres Freund
On Mon, Sep 16, 2013 at 9:13 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Robert, do you remember why you put the "pagesize = sysconf(_SC_PAGE_SIZE);" > call in the new mmap() shared memory allocator? Hmm, no. Unfortunately, I don't. We could try ripping it out and see if the buildfarm breaks. If it is needed, then the dynamic shared memory patch I posted probably needs it as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi. This is a slightly reworked version of the patch submitted by Richard Poole last month, which was based on Christian Kruse's earlier patch. Apart from doing various minor cleanups and documentation fixes, I also tested this patch against HEAD on a machine with 256GB of RAM. Here's an overview of the results. I set nr_hugepages to 32768 (== 64GB), which (took a very long time and) allowed me to set shared_buffers to 60GB. I then ran pgbench -s 1000 -i, and did some runs of "pgbench -c 100 -j 10 -t 1000" with huge_tlb_pages set to off and on respectively. With huge_tlb_pages=off, this is the best result I got: tps = 8680.771068 (including connections establishing) tps = 8721.504838 (excluding connections establishing) With huge_tlb_pages=on, this is the best result I got: tps = 9932.245203 (including connections establishing) tps = 9983.190304 (excluding connections establishing) (Even the worst result I got in the latter case was a smidgen faster than the best with huge_tlb_pages=off: 8796.344078 vs. 8721.504838.) From /proc/$pid/status, VmPTE was 2880kb with huge_tlb_pages=off, and 56kb with it turned on. One open question is what to do about rounding up the size. It should not be necessary, but for the fairly recent bug described at the link in the comment (https://bugzilla.kernel.org/show_bug.cgi?id=56881). I tried it without the rounding-up, and it fails on Ubuntu's 3.5.0-28 kernel (mmap returns EINVAL). Any thoughts? -- Abhijit
Attachment
At 2013-10-24 11:33:13 +0530, ams@2ndquadrant.com wrote: > > From /proc/$pid/status, VmPTE was 2880kb with huge_tlb_pages=off, and > 56kb with it turned on. (VmPTE is the size of the process's page tables.) -- Abhijit
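For anyone wanting to repeat the VmPTE measurement from inside a process, here is a rough sketch that reads the same field from /proc/self/status. It assumes the usual Linux layout of that file ("VmPTE:" followed by a value in kB); parse_vmpte_kb and read_self_vmpte_kb are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/status. Returns the VmPTE value in kB,
 * or -1 if the line is not a well-formed VmPTE line.
 */
long
parse_vmpte_kb(const char *line)
{
    long kb;

    if (strncmp(line, "VmPTE:", 6) != 0)
        return -1;
    /* %ld skips the tabs and spaces that precede the number. */
    if (sscanf(line + 6, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Page-table size of the calling process in kB, or -1 if unavailable. */
long
read_self_vmpte_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long result = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
    {
        long kb = parse_vmpte_kb(line);

        if (kb >= 0)
        {
            result = kb;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Comparing the value for a backend that has touched most of shared_buffers, with the setting off and then on, gives the same kind of comparison as the 2880kB vs. 56kB numbers quoted above.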
On 24.10.2013 09:03, Abhijit Menon-Sen wrote: > This is a slightly reworked version of the patch submitted by Richard > Poole last month, which was based on Christian Kruse's earlier patch. Thanks. > With huge_tlb_pages=off, this is the best result I got: > > tps = 8680.771068 (including connections establishing) > tps = 8721.504838 (excluding connections establishing) > > With huge_tlb_pages=on, this is the best result I got: > > tps = 9932.245203 (including connections establishing) > tps = 9983.190304 (excluding connections establishing) > > (Even the worst result I got in the latter case was a smidgen faster > than the best with huge_tlb_pages=off: 8796.344078 vs. 8721.504838.) That's really impressive. > One open question is what to do about rounding up the size. It should > not be necessary, but for the fairly recent bug described at the link > in the comment (https://bugzilla.kernel.org/show_bug.cgi?id=56881). I > tried it without the rounding-up, and it fails on Ubuntu's 3.5.0-28 > kernel (mmap returns EINVAL). Let's get rid of the rounding. It's clearly a kernel bug, and it shouldn't be our business to add workarounds for any kernel bug out there. And the worst that will happen if you're running a buggy kernel version is that you fall back to not using huge pages (assuming huge_tlb_pages=try). Other comments: * guc.c doesn't actually need sys/mman.h for anything. Getting rid of the #include also lets you remove the configure test. * the documentation should perhaps mention that the setting only has an effect if POSIX shared memory is used. That's the default on Linux, but we will try to fall back to SystemV shared memory if it fails. - Heikki
On Thu, Oct 24, 2013 at 9:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > * the documentation should perhaps mention that the setting only has an > effect if POSIX shared memory is used. That's the default on Linux, but we > will try to fall back to SystemV shared memory if it fails. This is true for dynamic shared memory, but not for the main shared memory segment. The main shared memory segment is always the combination of a small, fixed-size System V shared memory chunk and an anonymous shared memory region created by mmap(NULL, ..., MAP_SHARED). POSIX shared memory is not used. (Exceptions: Anonymous shared memory isn't used on Windows, which has its own mechanism, or when compiling with EXEC_BACKEND, when the whole chunk is allocated as System V shared memory.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-24 16:06:19 +0300, Heikki Linnakangas wrote: > On 24.10.2013 09:03, Abhijit Menon-Sen wrote: > >One open question is what to do about rounding up the size. It should > >not be necessary, but for the fairly recent bug described at the link > >in the comment (https://bugzilla.kernel.org/show_bug.cgi?id=56881). I > >tried it without the rounding-up, and it fails on Ubuntu's 3.5.0-28 > >kernel (mmap returns EINVAL). > > Let's get rid of the rounding. It's clearly a kernel bug, and it shouldn't > be our business to add workarounds for any kernel bug out there. And the > worst that will happen if you're running a buggy kernel version is that you > fall back to not using huge pages (assuming huge_tlb_pages=try). But it's a range of relatively popular kernels, that will stay around for a good while. So I am hesitant to just not do anything about it. The directory scanning code isn't that bad imo. Either way: I think we should log when we tried to use hugepages but fell back to plain mmap, currently it's hard to see whether they are used. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 24, 2013 at 1:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-10-24 16:06:19 +0300, Heikki Linnakangas wrote: >> On 24.10.2013 09:03, Abhijit Menon-Sen wrote: >> >One open question is what to do about rounding up the size. It should >> >not be necessary, but for the fairly recent bug described at the link >> >in the comment (https://bugzilla.kernel.org/show_bug.cgi?id=56881). I >> >tried it without the rounding-up, and it fails on Ubuntu's 3.5.0-28 >> >kernel (mmap returns EINVAL). >> >> Let's get rid of the rounding. It's clearly a kernel bug, and it shouldn't >> be our business to add workarounds for any kernel bug out there. And the >> worst that will happen if you're running a buggy kernel version is that you >> fall back to not using huge pages (assuming huge_tlb_pages=try). > > But it's a range of relatively popular kernels, that will stay around > for a good while. So I am hesitant to just not do anything about it. The > directory scanning code isn't that bad imo. > > Either way: > I think we should log when we tried to use hugepages but fell back to > plain mmap, currently it's hard to see whether they are used. Logging it might be a good idea, but suppose the system's been running for 6 months and you don't have the startup logs. It might be good to have an easy way to discover later what happened back then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On Wed, Oct 23, 2013 at 11:03 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > This is a slightly reworked version of the patch submitted by Richard > Poole last month, which was based on Christian Kruse's earlier patch. Is it possible that this patch will be included in a minor version of 9.3? IMHO hugepages is a very important ability that postgres lost in 9.3, and it would be great to have it back ASAP. Thank you. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
Sergey Konoplev <gray.ru@gmail.com> writes: > On Wed, Oct 23, 2013 at 11:03 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: >> This is a slightly reworked version of the patch submitted by Richard >> Poole last month, which was based on Christian Kruse's earlier patch. > Is it possible that this patch will be included in a minor version of > 9.3? IMHO hugepages is a very important ability that postgres lost in > 9.3, and it would be great to have it back ASAP. Say what? There's never been any hugepages support in Postgres. regards, tom lane
At 2013-10-24 16:06:19 +0300, hlinnakangas@vmware.com wrote: > > Let's get rid of the rounding. I share Andres's concern that the bug is present in various recent kernels that are going to stick around for quite some time. Given the rather significant performance gain, I think it's worth doing something, though I'm not a big fan of the directory-scanning code myself. As a compromise, perhaps we can unconditionally round the size up to be a multiple of 2MB? That way, we can use huge pages more often, but also avoid putting in a lot of code and effort into the workaround and waste only a little space (if any at all). > Other comments: > > * guc.c doesn't actually need sys/mman.h for anything. Getting rid > of the #include also lets you remove the configure test. You're right, guc.c doesn't use it any more; I've removed the #include. sysv_shmem.c does use it (MAP_*, PROT_*), however, so I've left the test in configure alone. I see that sys/mman.h is included elsewhere with an #ifdef WIN32 or HAVE_SHM_OPEN guard, but HAVE_SYS_MMAN_H seems better. > * the documentation should perhaps mention that the setting only has > an effect if POSIX shared memory is used. As Robert said, this is not correct, so I haven't changed anything. -- Abhijit
At 2013-10-24 19:00:28 +0200, andres@2ndquadrant.com wrote: > > I think we should log when we tried to use hugepages but fell back to > plain mmap, currently it's hard to see whether they are used. Good idea, thanks. I'll do this in the next patch I post (which will be after we reach some consensus about how to handle the rounding problem). -- Abhijit
On Tue, Oct 29, 2013 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Sergey Konoplev <gray.ru@gmail.com> writes: >> On Wed, Oct 23, 2013 at 11:03 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: >>> This is a slightly reworked version of the patch submitted by Richard >>> Poole last month, which was based on Christian Kruse's earlier patch. > >> Is it possible that this patch will be included in a minor version of >> 9.3? IMHO hugepages is a very important ability that postgres lost in >> 9.3, and it would be great to have it back ASAP. > > Say what? There's never been any hugepages support in Postgres. There was an ability to back shared memory with hugepages when using <=9.2. I have used it on ~30 servers for several years and it brings an 8-17% performance gain depending on the memory size. Here you will find several paragraphs describing how to do it: https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. Just search for the word 'hugepages' on the page. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
On Tue, Oct 29, 2013 at 11:08:05PM -0700, Sergey Konoplev wrote: > On Tue, Oct 29, 2013 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Sergey Konoplev <gray.ru@gmail.com> writes: > >> On Wed, Oct 23, 2013 at 11:03 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > >>> This is a slightly reworked version of the patch submitted by Richard > >>> Poole last month, which was based on Christian Kruse's earlier patch. > > > >> Is it possible that this patch will be included in a minor version of > >> 9.3? IMHO hugepages is a very important ability that postgres lost in > >> 9.3, and it would be great to have it back ASAP. > > > > Say what? There's never been any hugepages support in Postgres. > > There were an ability to back shared memory with hugepages when using > <=9.2. I use it on ~30 servers for several years and it brings 8-17% > of performance depending on the memory size. Here you will find > several paragraphs of the description about how to do it > https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. > Just search for the 'hugepages' word on the page. For better or worse, we add new features exactly and only in .0 releases. It's what's made it possible for people to plan deployments, given us a deserved reputation for stability, etc., etc. I guess what I'm saying here is that awesome as any particular feature might be to back-patch, that benefit is overwhelmed by the cost of having unstable releases. -infinity from me to any proposal that gets us into "are you using PostgreSQL x.y.z or x.y.w?" when it comes to features. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Oct 30, 2013 at 10:16:57AM +0530, Abhijit Menon-Sen wrote: > At 2013-10-24 16:06:19 +0300, hlinnakangas@vmware.com wrote: > > > > Let's get rid of the rounding. > > I share Andres's concern that the bug is present in various recent > kernels that are going to stick around for quite some time. Given > the rather significant performance gain, I think it's worth doing > something, though I'm not a big fan of the directory-scanning code > myself. > > As a compromise, perhaps we can unconditionally round the size up to be > a multiple of 2MB? How about documenting that 2MB is the quantum (OK, we'll say "indivisible unit" or "smallest division" or something) and failing with a message to that effect if someone tries to set it otherwise? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
At 2013-10-30 00:10:39 -0700, david@fetter.org wrote: > > How about documenting that 2MB is the quantum (OK, we'll say > "indivisible unit" or "smallest division" or something) and failing > with a message to that effect if someone tries to set it otherwise? I don't think you understand the problem. We're not discussing a user setting here. The size that is passed to PGSharedMemoryCreate is based on shared_buffers and our estimates of how much memory we need for other things like ProcArray (see ipci.c:CreateSharedMemoryAndSemaphores). If this calculated size is not a multiple of a page size supported by the hardware (usually 2/4/16MB etc.), the allocation will fail under some commonly-used kernels. We can either ignore the problem and let the allocation fail, or try to discover the smallest supported huge page size (what the patch does now), or assume that 2MB pages can be used if any huge pages can be used and align accordingly. We could use a larger size, e.g. if we aligned to 16MB then it would work on hardware that supported 2/4/8/16MB pages, but we'd waste the extra memory unless we also increased NBuffers after the rounding up (which is also something Andres suggested earlier). I don't have a strong opinion on the available options, other than not liking the "do nothing" approach. -- Abhijit
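The alignment options described above come down to one rounding step applied to the computed segment size. A minimal sketch, assuming the unit is either 2MB or 16MB as discussed; round_up_to_multiple is a hypothetical name and the sketch does not guard against overflow for sizes near SIZE_MAX.

```c
#include <stddef.h>

/* Assumption from the discussion: 2MB is the smallest huge page size. */
#define ASSUMED_HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)

/*
 * Round size up to the next multiple of unit (unit must be non-zero).
 * A size that is already a multiple is returned unchanged.
 */
size_t
round_up_to_multiple(size_t size, size_t unit)
{
    size_t rem = size % unit;

    return rem == 0 ? size : size + (unit - rem);
}
```

The space wasted by rounding is always less than one unit: under 2MB with 2MB alignment, under 16MB with 16MB alignment, unless NBuffers is increased afterwards to make use of it.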
Abhijit Menon-Sen <ams@2ndquadrant.com> writes: > As a compromise, perhaps we can unconditionally round the size up to be > a multiple of 2MB? That way, we can use huge pages more often, but also > avoid putting in a lot of code and effort into the workaround and waste > only a little space (if any at all). That sounds reasonably painless to me. Note that at least in our main shmem segment, "extra" space is not useless, because it allows slop for the main hash tables, notably the locks table. regards, tom lane
Sergey Konoplev <gray.ru@gmail.com> writes: > On Tue, Oct 29, 2013 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Say what? There's never been any hugepages support in Postgres. > There were an ability to back shared memory with hugepages when using > <=9.2. I use it on ~30 servers for several years and it brings 8-17% > of performance depending on the memory size. Here you will find > several paragraphs of the description about how to do it > https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. What this describes is how to modify Postgres to request huge pages. That's hardly built-in support. In any case, as David already explained, we don't do feature additions in minor releases. We'd be especially unlikely to make an exception for this, since it has uncertain portability and benefits. Anything that carries portability risks has got to go through a beta testing cycle before we'll unleash it on the masses. regards, tom lane
At 2013-10-30 11:04:36 -0400, tgl@sss.pgh.pa.us wrote: > > > As a compromise, perhaps we can unconditionally round the size up to be > > a multiple of 2MB? […] > > That sounds reasonably painless to me. Here's a patch that does that and adds a DEBUG1 log message when we try with MAP_HUGETLB, fail, and fall back to ordinary mmap. -- Abhijit
Attachment
On 2013-10-30 22:39:20 +0530, Abhijit Menon-Sen wrote: > At 2013-10-30 11:04:36 -0400, tgl@sss.pgh.pa.us wrote: > > > > > As a compromise, perhaps we can unconditionally round the size up to be > > > a multiple of 2MB? […] > > > > That sounds reasonably painless to me. > > Here's a patch that does that and adds a DEBUG1 log message when we try > with MAP_HUGETLB and fail and fallback to ordinary mmap. But it's in no way guaranteed that the smallest hugepage size is 2MB. It'll be on current x86 hardware, but not on any other platform... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
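For platforms where 2MB cannot be assumed, the smallest supported size can be discovered at run time, which is what the /sys scanning code mentioned earlier in the thread is for. The sketch below is not that code. It assumes the Linux sysfs layout in which /sys/kernel/mm/hugepages contains one directory per supported size, named like hugepages-2048kB, and both function names are hypothetical.

```c
#include <dirent.h>
#include <stdio.h>

/*
 * Convert a directory name such as "hugepages-2048kB" to a size in
 * bytes. Returns 0 if the name does not have exactly that form.
 */
unsigned long
hugepage_dir_to_bytes(const char *name)
{
    unsigned long kb;
    int consumed = 0;

    /* %n is only stored if the literal "kB" after the number matched. */
    if (sscanf(name, "hugepages-%lukB%n", &kb, &consumed) != 1)
        return 0;
    if (consumed == 0 || name[consumed] != '\0')
        return 0;
    return kb * 1024UL;
}

/* Smallest huge page size in bytes, or 0 if none could be found. */
unsigned long
smallest_hugepage_bytes(void)
{
    DIR *dir = opendir("/sys/kernel/mm/hugepages");
    struct dirent *ent;
    unsigned long smallest = 0;

    if (dir == NULL)
        return 0;
    while ((ent = readdir(dir)) != NULL)
    {
        unsigned long sz = hugepage_dir_to_bytes(ent->d_name);

        if (sz != 0 && (smallest == 0 || sz < smallest))
            smallest = sz;
    }
    closedir(dir);
    return smallest;
}
```

A return value of 0 would mean falling back to a fixed assumption such as 2MB, or not attempting huge pages at all.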
On Wed, Oct 30, 2013 at 8:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Sergey Konoplev <gray.ru@gmail.com> writes: >> On Tue, Oct 29, 2013 at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Say what? There's never been any hugepages support in Postgres. > >> There were an ability to back shared memory with hugepages when using >> <=9.2. I use it on ~30 servers for several years and it brings 8-17% >> of performance depending on the memory size. Here you will find >> several paragraphs of the description about how to do it >> https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. > > What this describes is how to modify Postgres to request huge pages. > That's hardly built-in support. I wasn't talking about a built-in support. It was about an ability (a way) to back sh_buf with hugepages. > In any case, as David already explained, we don't do feature additions > in minor releases. We'd be especially unlikely to make an exception > for this, since it has uncertain portability and benefits. Anything > that carries portability risks has got to go through a beta testing > cycle before we'll unleash it on the masses. Yes, I got the idea. Thanks both of you for clarification. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
Sergey Konoplev escribió: > On Wed, Oct 30, 2013 at 8:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Sergey Konoplev <gray.ru@gmail.com> writes: > >> There were an ability to back shared memory with hugepages when using > >> <=9.2. I use it on ~30 servers for several years and it brings 8-17% > >> of performance depending on the memory size. Here you will find > >> several paragraphs of the description about how to do it > >> https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. > > > > What this describes is how to modify Postgres to request huge pages. > > That's hardly built-in support. > > I wasn't talking about a built-in support. It was about an ability (a > way) to back sh_buf with hugepages. Then what you need is to set dynamic_shared_memory_type = sysv in postgresql.conf. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 30, 2013 at 11:50 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> >> There were an ability to back shared memory with hugepages when using >> >> <=9.2. I use it on ~30 servers for several years and it brings 8-17% >> >> of performance depending on the memory size. Here you will find >> >> several paragraphs of the description about how to do it >> >> https://github.com/grayhemp/pgcookbook/blob/master/database_server_configuration.md. >> > >> > What this describes is how to modify Postgres to request huge pages. >> > That's hardly built-in support. >> >> I wasn't talking about a built-in support. It was about an ability (a >> way) to back sh_buf with hugepages. > > Then what you need is to set > dynamic_shared_memory_type = sysv > in postgresql.conf. I could not find this parameter in the docs, and it does not work when I specify it in postgresql.conf. LOG: unrecognized configuration parameter "dynamic_shared_memory_type" in file "/etc/postgresql/9.3/main/postgresql.conf" line 114 FATAL: configuration file "/etc/postgresql/9.3/main/postgresql.conf" contains errors -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
Alvaro Herrera escribió: > Sergey Konoplev escribió: > > I wasn't talking about a built-in support. It was about an ability (a > > way) to back sh_buf with hugepages. > > Then what you need is to set > dynamic_shared_memory_type = sysv > in postgresql.conf. The above is mistaken -- there's no way to disable the mmap() segment in 9.3, other than recompiling with EXEC_BACKEND which is probably undesirable for other reasons. I don't think I had ever heard of that recipe to use huge pages in previous versions; since the win is probably significant in some systems, we could have made this configurable. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 30, 2013 at 12:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> > I wasn't talking about a built-in support. It was about an ability (a >> > way) to back sh_buf with hugepages. >> >> Then what you need is to set >> dynamic_shared_memory_type = sysv >> in postgresql.conf. > > The above is mistaken -- there's no way to disable the mmap() segment in > 9.3, other than recompiling with EXEC_BACKEND which is probably > undesirable for other reasons. Alternatively, I assume it could be linked with libhugetlbfs, and you wouldn't need any source modifications in that case. However, I am not sure it will work with shared memory. > I don't think I had ever heard of that recipe to use huge pages in > previous versions; since the win is probably significant in some > systems, we could have made this configurable. There are several articles on the web describing how to do this, besides mine. And the win becomes most significant when you have 64GB or more on your server. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
On Wed, Oct 30, 2013 at 12:51 PM, Sergey Konoplev <gray.ru@gmail.com> wrote: > On Wed, Oct 30, 2013 at 12:17 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >>> > I wasn't talking about a built-in support. It was about an ability (a >>> > way) to back sh_buf with hugepages. >>> >>> Then what you need is to set >>> dynamic_shared_memory_type = sysv >>> in postgresql.conf. >> >> The above is mistaken -- there's no way to disable the mmap() segment in >> 9.3, other than recompiling with EXEC_BACKEND which is probably >> undesirable for other reasons. > > Alternatively, I assume it could be linked with libhugetlbfs and you > don't need any source modifications in this case. However I am not > sure it will work with shared memory. BTW, I managed to run 9.3 backed with hugepages after I put HUGETLB_MORECORE (see man libhugetlbfs) into the environment yesterday, but after working for some time it failed with the messages shown below. syslog: Oct 29 17:53:13 grayhemp kernel: [150579.903875] PID 7584 killed due to inadequate hugepage pool postgres: libhugetlbfslibhugetlbfs 2013-10-29 17:53:21 PDT LOG: server process (PID 7584) was terminated by signal 7: Bus error 2013-10-29 17:53:21 PDT LOG: terminating any other active server processes 2013-10-29 17:53:21 PDT WARNING: terminating connection because of crash of another server process 2013-10-29 17:53:21 PDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. My theory is that it happened after the number of huge pages (vm.nr_overcommit_hugepages + vm.nr_hugepages) was exceeded, but I might be wrong. Does anybody have any thoughts on why it happened and how to work around it? -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
On 30.10.2013 19:11, Andres Freund wrote: > On 2013-10-30 22:39:20 +0530, Abhijit Menon-Sen wrote: >> At 2013-10-30 11:04:36 -0400, tgl@sss.pgh.pa.us wrote: >>> >>>> As a compromise, perhaps we can unconditionally round the size up to be >>>> a multiple of 2MB? […] >>> >>> That sounds reasonably painless to me. >> >> Here's a patch that does that and adds a DEBUG1 log message when we try >> with MAP_HUGETLB and fail and fallback to ordinary mmap. > > But it's in no way guaranteed that the smallest hugepage size is > 2MB. It'll be on current x86 hardware, but not on any other platform... Sure, but there's no big harm done. We're just trying to avoid hitting a kernel bug, and as a bonus, we avoid wasting some memory that would otherwise be lost due to the kernel rounding the allocation. If the smallest hugepage size is smaller than 2MB, we round up the allocation unnecessarily, but that doesn't seem serious. I spent some time whacking this around, new patch version attached. I moved the mmap() code into a new function, that leaves the PGSharedMemoryCreate more readable. I modified the patch so that it throws an error if you set huge_tlb_pages=on, and the platform doesn't support MAP_HUGETLB (ie. non-Linux, or EXEC_BACKEND). 'try' is the default, so this only affects you if you explicitly set it to 'on'. I think that's the right behavior; if you explicitly ask for it, and you don't get it, that should be an error. But I'm not wedded to the idea if someone objects; a log message might also be reasonable: "LOG: huge TLB pages are not supported on this platform, but huge_tlb_pages was 'on'" The error message on failed allocation, if huge_tlb_pages=on, needs updating: $ bin/postmaster -D data FATAL: could not map anonymous shared memory: Cannot allocate memory HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space. 
To reduce the request size (currently 189390848 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections. The reason the allocation failed in this case was that I used huge_tlb_pages=on, but had not configured the kernel for huge pages. The hint is quite misleading in that case, it should advise to configure the kernel, or turn off huge_tlb_pages. The documentation needs some work. I think it's pretty user-unfriendly to link to https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. It gives a lot of details, and although it explains stuff that is relevant, like setting the nr_hugepages sysctl, it also contains a lot of stuff that is not relevant to us, like how to mount hugetlbfs. Can we do better than that? Is there a better guide somewhere on how to set the kernel settings. If not, we should include step-by-step instructions in our manual. The "Managing Kernel Resources" section in the user manual should also be updated to mention how to enable huge pages. Also, now that I changed huge_tlb_pages='on' to fail on platforms where it's not supported at all, the docs need to be updated to reflect it. - Heikki
Attachment
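The compromise discussed in the message above (unconditionally rounding the request up to a multiple of 2MB) can be sketched roughly as follows. This is only an illustration, not code from the patch: round_up_to_multiple and ASSUMED_HUGE_PAGE_SIZE are hypothetical names, and the 2MB figure is the x86 assumption mentioned in the thread.

```c
#include <stddef.h>

/* Assumption from the thread: the smallest huge page is 2MB on current x86. */
#define ASSUMED_HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)

/*
 * Hypothetical helper (not from the patch): round "size" up to the next
 * multiple of "multiple". A size that is already a multiple is unchanged.
 */
size_t
round_up_to_multiple(size_t size, size_t multiple)
{
    size_t rem = size % multiple;

    return rem == 0 ? size : size + (multiple - rem);
}
```

With these assumptions, the 189390848-byte request quoted in the error hint would be rounded up to 190840832 bytes, that is, 91 pages of 2MB. If the platform's smallest huge page were smaller than 2MB, the request would merely be rounded up more than necessary.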
I was recently running some tests with huge page tables. I ran them on two different architectures: x86 and PPC64.
I saw some discussion going on over here so thought of sharing.
I was using 3 Cores, 8GB RAM, 2 LUN for filesystem (1 for dbfiles and 1 for logfiles) for these tests...
I had dedicated (shared_buffers + 400 bytes * max_connections + wal_buffers) / Pagesize [from /proc/meminfo] pages for huge pages. I kept some overcommit_hugepages to be used by work_mem: (max_connections * work_mem) / Pagesize.
x86_64 gave me a benefit of 2-5% for a TPC-C workload (I scaled from 1 to 100 users). PPC64, which uses 16MB and 64MB huge pages, did not give me any benefit; in fact, performance degraded as the concurrency of the system increased.
my 2 cents, hope it helps.
At 2013-11-15 15:17:32 +0200, hlinnakangas@vmware.com wrote: > > I spent some time whacking this around, new patch version attached. Thanks. > But I'm not wedded to the idea if someone objects; a log message might > also be reasonable: "LOG: huge TLB pages are not supported on this > platform, but huge_tlb_pages was 'on'" Put that way, I have to wonder if the right thing to do is just to have a "try_huge_pages=on|off" setting, and log a warning if the attempt did not succeed. It would be easier to document, and I don't think there's much point in making it an error if the allocation fails. -- Abhijit P.S. I'd be happy to do the followup work for this patch (updating documentation, etc.), but it'll have to wait until I recover from this !#$&@! stomach bug.
Abhijit Menon-Sen wrote: > At 2013-11-15 15:17:32 +0200, hlinnakangas@vmware.com wrote: > > But I'm not wedded to the idea if someone objects; a log message might > > also be reasonable: "LOG: huge TLB pages are not supported on this > > platform, but huge_tlb_pages was 'on'" > > Put that way, I have to wonder if the right thing to do is just to have > a "try_huge_pages=on|off" setting, and log a warning if the attempt did > not succeed. It would be easier to document, and I don't think there's > much point in making it an error if the allocation fails. What about huge_tlb_pages={off,try} Or maybe huge_tlb_pages={off,try,require} -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-21 18:09:38 -0300, Alvaro Herrera wrote: > Abhijit Menon-Sen wrote: > > At 2013-11-15 15:17:32 +0200, hlinnakangas@vmware.com wrote: > > > > But I'm not wedded to the idea if someone objects; a log message might > > > also be reasonable: "LOG: huge TLB pages are not supported on this > > > platform, but huge_tlb_pages was 'on'" > > > > Put that way, I have to wonder if the right thing to do is just to have > > a "try_huge_pages=on|off" setting, and log a warning if the attempt did > > not succeed. It would be easier to document, and I don't think there's > > much point in making it an error if the allocation fails. > > What about > huge_tlb_pages={off,try} > > Or maybe > huge_tlb_pages={off,try,require} I'd certainly want a setting that errors out if it cannot get the memory using hugetables. If you rely on the reduction in memory (which can be significant on large s_b, large max_connections), it's rather annoying not to know whether it succeeded in using it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 4:09 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Abhijit Menon-Sen wrote: >> At 2013-11-15 15:17:32 +0200, hlinnakangas@vmware.com wrote: > >> > But I'm not wedded to the idea if someone objects; a log message might >> > also be reasonable: "LOG: huge TLB pages are not supported on this >> > platform, but huge_tlb_pages was 'on'" >> >> Put that way, I have to wonder if the right thing to do is just to have >> a "try_huge_pages=on|off" setting, and log a warning if the attempt did >> not succeed. It would be easier to document, and I don't think there's >> much point in making it an error if the allocation fails. > > What about > huge_tlb_pages={off,try} > > Or maybe > huge_tlb_pages={off,try,require} I'd spell "require" as "on", or at least accept that as a synonym. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-11-21 16:24:56 -0500, Robert Haas wrote: > > What about > > huge_tlb_pages={off,try} > > > > Or maybe > > huge_tlb_pages={off,try,require} > > I'd spell "require" as "on", or at least accept that as a synonym. That (off, try, on) is what the patch currently implements; Abhijit was just arguing for dropping the error-out option. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
At 2013-11-21 22:14:35 +0100, andres@2ndquadrant.com wrote: > > I'd certainly want a setting that errors out if it cannot get the > memory using hugetables. OK, then the current try/on/off settings are fine. I'm better today, so I'll read the patch Heikki posted and see what more needs to be done there. -- Abhijit
Heikki Linnakangas wrote: > I spent some time whacking this around, new patch version attached. > I moved the mmap() code into a new function, that leaves the > PGSharedMemoryCreate more readable. Did this patch go anywhere? Someone just pinged me about a kernel scalability problem in Linux with huge pages; if someone did performance measurements with this patch, perhaps it'd be good to measure again with the kernel patch in place. https://lkml.org/lkml/2014/1/26/227 -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 01/27/2014 09:20 PM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > >> I spent some time whacking this around, new patch version attached. >> I moved the mmap() code into a new function, that leaves the >> PGSharedMemoryCreate more readable. > > Did this patch go anywhere? Oh darn, I remembered we had already committed this, but clearly not. I'd love to still get this into 9.4. The latest patch (hugepages-v5.patch) was pretty much ready for commit, except for documentation. - Heikki
Hi, On 28/01/14 13:51, Heikki Linnakangas wrote: > Oh darn, I remembered we had already committed this, but clearly not. I'd > love to still get this into 9.4. The latest patch (hugepages-v5.patch) was > pretty much ready for commit, except for documentation. I'm working on it. I ported it to HEAD and currently doing some benchmarks. Next will be documentation. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 15/11/13 15:17, Heikki Linnakangas wrote: > I spent some time whacking this around, new patch version attached. I moved > the mmap() code into a new function, that leaves the PGSharedMemoryCreate > more readable. I think there's a bug in this version of the patch. Have a look at this:
+	if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)
+	{
[…]
+		ptr = mmap(NULL, *size, PROT_READ | PROT_WRITE,
+				   PG_MMAP_FLAGS | MAP_HUGETLB, -1, 0);
[…]
+	}
+#endif
+
+	if (huge_tlb_pages == HUGE_TLB_OFF || huge_tlb_pages == HUGE_TLB_TRY)
+	{
+		allocsize = *size;
+		ptr = mmap(NULL, *size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, -1, 0);
+	}
This will lead to a duplicate mmap() if hugepages work and huge_tlb_pages == HUGE_TLB_TRY, or am I missing something? I think it should be like this:
	if (huge_tlb_pages == HUGE_TLB_OFF || (huge_tlb_pages == HUGE_TLB_TRY && ptr == MAP_FAILED))
Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
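For readers following along, the try-then-fall-back logic being debated here can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the committed PostgreSQL code: HugeMode, create_anon_segment, SKETCH_MMAP_FLAGS and SKETCH_HUGE_PAGE_SIZE are hypothetical names, and the 2MB huge page size is assumed. It applies the condition Christian proposes (fall back only if the huge page attempt failed), starts ptr at MAP_FAILED so that condition is well defined, and sets the reported allocation size only for the mmap() call that actually succeeded.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes MAP_ANONYMOUS / MAP_HUGETLB on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-ins for the patch's huge_tlb_pages values. */
typedef enum { HUGE_OFF, HUGE_TRY, HUGE_ON } HugeMode;

#define SKETCH_HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)    /* assumed 2MB */
#define SKETCH_MMAP_FLAGS (MAP_SHARED | MAP_ANONYMOUS)

/*
 * Map an anonymous shared segment of at least "request" bytes.
 * On success, *allocsize holds the size that was really mapped.
 * Returns MAP_FAILED if no mapping could be created.
 */
void *
create_anon_segment(size_t request, HugeMode mode, size_t *allocsize)
{
    void *ptr = MAP_FAILED;     /* must start as "failed" for the test below */

#ifdef MAP_HUGETLB
    if (mode == HUGE_ON || mode == HUGE_TRY)
    {
        /* Round up to a whole number of (assumed) huge pages. */
        size_t rounded = request;

        if (rounded % SKETCH_HUGE_PAGE_SIZE != 0)
            rounded += SKETCH_HUGE_PAGE_SIZE - (rounded % SKETCH_HUGE_PAGE_SIZE);

        /* Pass the rounded size, not the original request. */
        ptr = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                   SKETCH_MMAP_FLAGS | MAP_HUGETLB, -1, 0);
        if (ptr != MAP_FAILED)
            *allocsize = rounded;
    }
#endif

    /* Fall back only when huge pages are off, or were tried and failed. */
    if (mode == HUGE_OFF || (mode == HUGE_TRY && ptr == MAP_FAILED))
    {
        ptr = mmap(NULL, request, PROT_READ | PROT_WRITE,
                   SKETCH_MMAP_FLAGS, -1, 0);
        if (ptr != MAP_FAILED)
            *allocsize = request;
    }

    return ptr;
}
```

In this sketch, HUGE_ON never falls back, so a caller that gets MAP_FAILED in that mode would report an error, while HUGE_TRY quietly uses ordinary pages when the kernel has no huge pages available.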
Hi, attached you will find a new version of the patch, ported to HEAD, fixed the mentioned bug and - hopefully - dealing with the remaining issues. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 01/28/2014 06:11 PM, Christian Kruse wrote: > Hi, > > attached you will find a new version of the patch, ported to HEAD, > fixed the mentioned bug and - hopefully - dealing the the remaining > issues. Thanks, I have committed this now. The documentation is still lacking. We should explain somewhere how to set nr_hugepages, for example. The "Managing Kernel Resources" section ought to mention setting. Could I ask you to work on that, please? - Heikki
On 01/29/2014 01:12 PM, Heikki Linnakangas wrote: > On 01/28/2014 06:11 PM, Christian Kruse wrote: >> Hi, >> >> attached you will find a new version of the patch, ported to HEAD, >> fixed the mentioned bug and - hopefully - dealing the the remaining >> issues. > > Thanks, I have committed this now. > > The documentation is still lacking. > The documentation is indeed lacking since it breaks the build. doc/src/sgml/config.sgml contains the line normal allocation if that fails. With <literal>on</literal, failure which doesn't correctly terminate the closing </literal> tag. Trivial patch attached. -- Vik
Attachment
On 01/29/2014 04:01 PM, Vik Fearing wrote: > On 01/29/2014 01:12 PM, Heikki Linnakangas wrote: >> The documentation is still lacking. > > The documentation is indeed lacking since it breaks the build. > > doc/src/sgml/config.sgml contains the line > > normal allocation if that fails. With <literal>on</literal, failure > > which doesn't correctly terminate the closing </literal> tag. > > Trivial patch attached. Thanks, applied! - Heikki
On Tue, Jan 28, 2014 at 5:58 AM, Christian Kruse <christian@2ndquadrant.com> wrote: > Hi, > > On 28/01/14 13:51, Heikki Linnakangas wrote: >> Oh darn, I remembered we had already committed this, but clearly not. I'd >> love to still get this into 9.4. The latest patch (hugepages-v5.patch) was >> pretty much ready for commit, except for documentation. > > I'm working on it. I ported it to HEAD and currently doing some > benchmarks. Next will be documentation. you mentioned benchmarks -- do you happen to have the results handy? (curious) merlin
On Wed, Jan 29, 2014 at 4:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 01/28/2014 06:11 PM, Christian Kruse wrote:
>> Hi,
>> attached you will find a new version of the patch, ported to HEAD,
>> fixed the mentioned bug and - hopefully - dealing the the remaining
>> issues.
> Thanks, I have committed this now.
I'm getting this warning now with gcc (GCC) 4.4.7:
pg_shmem.c: In function 'PGSharedMemoryCreate':
pg_shmem.c:332: warning: 'allocsize' may be used uninitialized in this function
pg_shmem.c:332: note: 'allocsize' was declared here
Cheers,
Jeff
Hi, On 29/01/14 14:12, Heikki Linnakangas wrote: > The documentation is still lacking. We should explain somewhere how to set > nr.hugepages, for example. The "Managing Kernel Resources" section ought to > mention setting. Could I ask you to work on that, please? Of course! Attached you will find a patch for better documentation. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 29/01/14 10:11, Jeff Janes wrote: > I'm getting this warning now with gcc (GCC) 4.4.7: Interesting. I don't get that warning. But the compiler is (formally) right. > pg_shmem.c: In function 'PGSharedMemoryCreate': > pg_shmem.c:332: warning: 'allocsize' may be used uninitialized in this > function > pg_shmem.c:332: note: 'allocsize' was declared here Attached patch should fix that. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 01/29/2014 09:18 PM, Christian Kruse wrote: > Hi, > > On 29/01/14 10:11, Jeff Janes wrote: >> I'm getting this warning now with gcc (GCC) 4.4.7: > > Interesting. I don't get that warning. But the compiler is (formally) > right. > >> pg_shmem.c: In function 'PGSharedMemoryCreate': >> pg_shmem.c:332: warning: 'allocsize' may be used uninitialized in this >> function >> pg_shmem.c:332: note: 'allocsize' was declared here Hmm, I didn't get that warning either. > Attached patch should fix that. That's not quite right. If the first mmap() fails, allocsize is set to the rounded-up size, but the second mmap() uses the original size for the allocation. So it returns a too high value to the caller. Ugh, it's actually broken anyway :-(. The first allocation also passes *size to mmap(), so the calculated rounded-up allocsize value is not used for anything. Fix pushed. - Heikki
Hi, On 29/01/14 21:36, Heikki Linnakangas wrote: > […] > Fix pushed. You are right. Thanks. But there is another bug, see <20140128154307.GC24091@defunct.ch> ff. Attached you will find a patch fixing that. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 01/29/2014 09:59 PM, Christian Kruse wrote: > Hi, > > On 29/01/14 21:36, Heikki Linnakangas wrote: >> […] >> Fix pushed. > > You are right. Thanks. But there is another bug, see > > <20140128154307.GC24091@defunct.ch> > > ff. Attached you will find a patch fixing that. Thanks. There are more cases of that in InternalIpcMemoryCreate, they ought to be fixed as well. And should also grep the rest of the codebase for more instances of that. And this needs to be back-patched. - Heikki
Hi, On 29/01/14 22:17, Heikki Linnakangas wrote: > Thanks. There are more cases of that in InternalIpcMemoryCreate, they ought > to be fixed as well. And should also grep the rest of the codebase for more > instances of that. And this needs to be back-patched. I'm way ahead of you ;-) Working on it. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, after I finally got documentation compilation working I updated the patch to be syntactically correct. You will find it attached. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 1/30/14, 2:28 AM, Christian Kruse wrote: > after I finally got documentation compilation working I updated the > patch to be syntactically correct. You will find it attached. I don't think we should be explaining the basics of OS memory management in our documentation. And if we did, we shouldn't copy it verbatim from the Debian wiki without attribution. I think this patch should be cut down to the paragraphs that cover the actual configuration. On a technical note, use <xref> instead of <link> for linking. doc/src/sgml/README.links contains some information.
On Tue, Feb 25, 2014 at 10:29 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > And if we did, we shouldn't copy it verbatim from > the Debian wiki without attribution. That is seriously not cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-02-25 10:29:32 -0500, Peter Eisentraut wrote: > On 1/30/14, 2:28 AM, Christian Kruse wrote: > > after I finally got documentation compilation working I updated the > > patch to be syntactically correct. You will find it attached. > > I don't think we should be explaining the basics of OS memory management > in our documentation. Agreed. > And if we did, we shouldn't copy it verbatim from the Debian wiki > without attribution. Is it actually? A quick comparison doesn't show that many similarities? Christian? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 25/02/14 10:29, Peter Eisentraut wrote: > I don't think we should be explaining the basics of OS memory management > in our documentation. Well, I'm confused. I thought that's exactly what has been asked. > And if we did, we shouldn't copy it verbatim from the Debian wiki > without attribution. I didn't. This is a write-up of several articles, blog posts and documentation I read about this topic. However, if you think the texts are too similar, then we should add a note, yes. Didn't mean to copy w/o referring to a source. > I think this patch should be cut down to the paragraphs that cover the > actual configuration. I tried to cover the issues Heikki brought up in <52861EEC.2090702@vmware.com>. > On a technical note, use <xref> instead of <link> for linking. > doc/src/sgml/README.links contains some information. OK, I will post an updated patch later this evening. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 25/02/14 17:01, Andres Freund wrote: > > And if we did, we shouldn't copy it verbatim from the Debian wiki > > without attribution. > > Is it actually? A quick comparison doesn't show that many similarities? > Christian? Not as far as I know. But of course, as I wrote the text I _also_ (that's not my only source) read the Debian article and I was influenced by it. It may be that the texts are more similar than I thought, although I still don't see it. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2/25/14, 11:08 AM, Christian Kruse wrote: > Hi, > > On 25/02/14 17:01, Andres Freund wrote: >>> And if we did, we shouldn't copy it verbatim from the Debian wiki >>> without attribution. >> >> Is it actually? A quick comparison doesn't show that many similarities? >> Christian? > > Not as far as I know. But of course, as I wrote the text I _also_ > (that's not my only source) read the Debian article and I was > influenced by it. It may be that the texts are more similar then I > thought, although I still don't see it. I suspect that it was done subconsciously. But I did notice it right away, so there is something to it. As I mentioned, I would just cut those introductory parts out.
On Tue, Feb 25, 2014 at 12:18:02PM -0500, Peter Eisentraut wrote: > On 2/25/14, 11:08 AM, Christian Kruse wrote: > > Hi, > > > > On 25/02/14 17:01, Andres Freund wrote: > >>> And if we did, we shouldn't copy it verbatim from the Debian wiki > >>> without attribution. > >> > >> Is it actually? A quick comparison doesn't show that many similarities? > >> Christian? > > > > Not as far as I know. But of course, as I wrote the text I _also_ > > (that's not my only source) read the Debian article and I was > > influenced by it. It may be that the texts are more similar then I > > thought, although I still don't see it. > > I suspect that it was done subconsciously. But I did notice it right > away, so there is something to it. > > As I mentioned, I would just cut those introductory parts out. Should we link to the Debian wiki content? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes: > On Tue, Feb 25, 2014 at 12:18:02PM -0500, Peter Eisentraut wrote: >> As I mentioned, I would just cut those introductory parts out. > Should we link to the Debian wiki content? -1. We generally don't link to our *own* wiki in our SGML docs, let alone things that aren't even under our project's control. Moreover, Debian is not going to be explaining these things in a way that accounts for non-Linux operating systems. regards, tom lane
On 2014-02-25 13:21:46 -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Tue, Feb 25, 2014 at 12:18:02PM -0500, Peter Eisentraut wrote: > >> As I mentioned, I would just cut those introductory parts out. > > > Should we link to the Debian wiki content? > > -1. We generally don't link to our *own* wiki in our SGML docs, let alone > things that aren't even under our project's control. Agreed. Especially as the interesting bit is the postgres specific logic, not the rest. I think all that's needed is to cut the first paragraphs that generally explain what huge pages are in some detail from the text and make sure the later paragraphs don't refer to the earlier ones. > Moreover, Debian > is not going to be explaining these things in a way that accounts for > non-Linux operating systems. It's a linux only feature so far, so that alone wouldn't be a problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 25/02/14 19:28, Andres Freund wrote: > I think all that's needed is to cut the first paragraphs that generally > explain what huge pages are in some detail from the text and make sure > the later paragraphs don't refer to the earlier ones. Attached you will find a new version of the patch. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Peter, after a night of sleep I'm still not able to swallow the pill. To be honest I'm a little bit angry about this accusation. I didn't mean to copy from the Debian wiki and after re-checking the text again I'm still convinced that I didn't. Of course the text SAYS something similar, but this is in the nature of things. Structure, diction and focus are different. Also the information transferred is different and gathered from various articles, including the Debian wiki, the huge page docs of the kernel, the Wikipedia and some old IBM and Oracle docs. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02/26/2014 10:35 AM, Christian Kruse wrote: > On 25/02/14 19:28, Andres Freund wrote: >> I think all that's needed is to cut the first paragraphs that generally >> explain what huge pages are in some detail from the text and make sure >> the later paragraphs don't refer to the earlier ones. > > Attached you will find a new version of the patch. Thanks! > huge_tlb_pages (enum) > > Enables/disables the use of huge TLB pages. Valid values are try (the default), on, and off. > > At present, this feature is supported only on Linux. The setting is ignored on other systems. > > The use of huge TLB pages results in smaller page tables and less CPU time spent on memory management, increasing performance. For more details, see Section 17.4.4. > > With huge_tlb_pages set to try, the server will try to use huge pages, but fall back to using normal allocation if that fails. With on, failure to use huge pages will prevent the server from starting up. With off, huge pages will not be used. That still says "The setting is ignored on other systems". That's not quite true: as explained later in the section, if you set huge_tlb_pages=on and the platform doesn't support it, the server will refuse to start. > 17.4.4. Linux huge TLB pages This section looks good to me. I'm OK with the level of detail, although maybe just a sentence or two about what huge TLB pages are and what benefits they have would still be in order. How about adding something like this as the first sentence: "Using huge TLB pages reduces overhead when using large contiguous chunks of memory, like PostgreSQL does." > To enable this feature in PostgreSQL you need a kernel with CONFIG_HUGETLBFS=y and CONFIG_HUGETLB_PAGE=y. You also have to tune the system setting vm.nr_hugepages. 
To calculate the number of necessary huge pages start PostgreSQL without huge pages enabled and check the VmPeak value from the proc filesystem: > > $ head -1 /path/to/data/directory/postmaster.pid > 4170 > $ grep ^VmPeak /proc/4170/status > VmPeak: 6490428 kB > > 6490428 / 2048 (PAGE_SIZE is 2MB in this case) are roughly 3169.154 huge pages, so you will need at least 3170 huge pages: > > $ sysctl -w vm.nr_hugepages=3170 That's good advice, but perhaps s/calculate/estimate/. It's just an approximation, after all. - Heikki
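The estimate quoted from the draft documentation is a ceiling division of VmPeak by the huge page size. A rough sketch of that arithmetic is below; parse_kb_line and huge_pages_needed are hypothetical helper names, and the sketch assumes the "Key:   value kB" line layout used by /proc/<pid>/status and /proc/meminfo (with the huge page size taken from the Hugepagesize line).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: if "line" starts with "key:", store the number that
 * follows in *value_kb and return 1; otherwise return 0.
 */
int
parse_kb_line(const char *line, const char *key, long *value_kb)
{
    size_t keylen = strlen(key);

    if (strncmp(line, key, keylen) != 0 || line[keylen] != ':')
        return 0;
    /* %ld skips the blanks or tabs between the colon and the number. */
    return sscanf(line + keylen + 1, "%ld", value_kb) == 1;
}

/* Hypothetical helper: huge pages needed to cover vmpeak_kb, rounded up. */
long
huge_pages_needed(long vmpeak_kb, long hugepage_kb)
{
    return (vmpeak_kb + hugepage_kb - 1) / hugepage_kb;
}
```

Feeding it the figures from the message (VmPeak of 6490428 kB and a huge page size of 2048 kB) yields 3170, matching the value passed to sysctl above. As Heikki notes, this is an estimate rather than an exact requirement.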
Hi, On 26/02/14 14:34, Heikki Linnakangas wrote: > That still says "The setting is ignored on other systems". That's not quite > true: as explained later in the section, if you set huge_tlb_pages=on and > the platform doesn't support it, the server will refuse to start. I added a sentence about it. > "Using huge TLB pages reduces overhead when using large contiguous chunks of > memory, like PostgreSQL does." Sentence added. > That's good advice, but perhaps s/calculate/estimate/. It's just an > approximation, after all. Fixed. New patch version is attached. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
There's one thing that rubs me the wrong way about all this functionality, which is that we've named it "huge TLB pages". That is wrong -- the TLB pages are not huge. In fact, as far as I understand, the TLB doesn't have pages at all. It's the pages that are huge, but those pages are not TLB pages, they are just memory pages. I think we have named it this way only because Linux for some reason named the mmap() flag MAP_HUGETLB. The TLB is not huge either (in fact you can't alter the size of the TLB at all; it's a hardware thing.) I think this flag means "use the TLB entries reserved for huge pages for the memory I'm requesting". Since we haven't released any of this, should we discuss renaming it to just "huge pages"? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 02/26/2014 06:13 PM, Alvaro Herrera wrote: > > There's one thing that rubs me the wrong way about all this > functionality, which is that we've named it "huge TLB pages". That is > wrong -- the TLB pages are not huge. In fact, as far as I understand, > the TLB doesn't have pages at all. It's the pages that are huge, but > those pages are not TLB pages, they are just memory pages. > > I think we have named it this way only because Linux for some reason > named the mmap() flag MAP_HUGETLB for some reason. The TLB is not huge > either (in fact you can't alter the size of the TLB at all; it's a > hardware thing.) I think this flag means "use the TLB entries reserved > for huge pages for the memory I'm requesting". > > Since we haven't released any of this, should we discuss renaming it to > just "huge pages"? Linux calls it "huge tlb pages" in many places, not just MAP_HUGETLB. Like in CONFIG_HUGETLB_PAGE and hugetlbfs. I agree it's a bit weird. Linux also calls it just "huge pages" in many other places, like in /proc/meminfo output. FreeBSD calls them superpages and Windows calls them "large pages". Yeah, it would seem better to call them just "huge pages", so that it's more reminiscent of those names, if we ever implement support for super/huge/large pages on other platforms. - Heikki
Christian, Thanks for working on all of this and dealing with the requests for updates and changes, as well as for dealing very professionally with an inappropriate and incorrect remark. Unfortunately, mailing lists can make communication difficult and someone's knee-jerk reaction (not referring to your reaction here) can end up causing much frustration. Remind me when we're at a conference somewhere and I'll gladly buy you a beer (or whatever your choice is). Seriously, thanks for working on the 'huge pages' changes and documentation- it's often a thankless job and clearly one which can be extremely frustrating. Thanks again, Stephen
Hi, On 26/02/14 13:13, Alvaro Herrera wrote: > > There's one thing that rubs me the wrong way about all this > functionality, which is that we've named it "huge TLB pages". That is > wrong -- the TLB pages are not huge. In fact, as far as I understand, > the TLB doesn't have pages at all. It's the pages that are huge, but > those pages are not TLB pages, they are just memory pages. I didn't think about this, yet, but you are totally right. > Since we haven't released any of this, should we discuss renaming it to > just "huge pages"? Attached is a patch with the updated documentation (now uses consistently huge pages) as well as a renamed GUC, consistent wording (always use huge pages) as well as renamed variables. Should I create a new commit fest entry for this and delete the old one? Or should this be done in two patches? Locally in my repo this is done with two commits, so it would be easy to split that. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Peter, thank you for your nice words, much appreciated. I'm sorry that I was so whiny about this in the last post. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 27/02/14 08:35, Christian Kruse wrote: > Hi Peter, Sorry, Stephen of course – it was definitely too early. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 02/27/2014 09:34 AM, Christian Kruse wrote: > Hi, > > On 26/02/14 13:13, Alvaro Herrera wrote: >> >> There's one thing that rubs me the wrong way about all this >> functionality, which is that we've named it "huge TLB pages". That is >> wrong -- the TLB pages are not huge. In fact, as far as I understand, >> the TLB doesn't have pages at all. It's the pages that are huge, but >> those pages are not TLB pages, they are just memory pages. > > I didn't think about this, yet, but you are totally right. > >> Since we haven't released any of this, should we discuss renaming it to >> just "huge pages"? > > Attached is a patch with the updated documentation (now uses > consistently huge pages) as well as a renamed GUC, consistent wording > (always use huge pages) as well as renamed variables. Hmm, I wonder if that could now be misunderstood to have something to do with the PostgreSQL page size? Maybe add the word "memory" or "operating system" in the first sentence in the docs, like this: "Enables/disables the use of huge memory pages". > <para> > At present, this feature is supported only on Linux. The setting is > ignored on other systems when set to <literal>try</literal>. > <productname>PostgreSQL</productname> will > refuse to start when set to <literal>on</literal>. > </para> Is it clear enough that PostgreSQL will only refuse to start up when it's set to on, *if the feature's not supported on the platform*? Perhaps just leave that last sentence out. It's mentioned later that " With <literal>on</literal>, failure to use huge pages will prevent the server from starting up.", that's probably enough. - Heikki
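[Editorial sketch] The off/try/on behaviour discussed in this message can be modelled in a few lines. This is only an illustration of the semantics as the quoted documentation describes them, not PostgreSQL's actual C code: `HugePagesSetting`, `choose_mapping` and the `try_huge` callback are hypothetical names, and `try_huge` merely stands in for an mmap() attempt with MAP_HUGETLB.

```python
from enum import Enum
from typing import Callable


class HugePagesSetting(Enum):
    """Hypothetical model of the three values of the setting."""
    OFF = "off"
    TRY = "try"
    ON = "on"


def choose_mapping(setting: HugePagesSetting, try_huge: Callable[[], bool]) -> str:
    """Return "huge" or "regular", or raise if huge pages are required but unavailable.

    try_huge is a hypothetical callback that returns True when a huge-page
    mapping succeeded. On a platform without huge page support it would
    simply always return False.
    """
    if setting is HugePagesSetting.OFF:
        return "regular"  # never attempt a huge-page mapping
    if try_huge():
        return "huge"
    if setting is HugePagesSetting.ON:
        # "on": failure to use huge pages prevents the server from starting
        raise RuntimeError("could not map shared memory with huge pages")
    return "regular"  # "try": fall back to regular pages
```

Under this model an unsupported platform needs no special case: with `try` the attempt fails and the server falls back, and with `on` the same failure stops startup, which is why the later sentence in the docs already covers the behaviour.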
On Fri, Feb 28, 2014 at 9:43 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Hmm, I wonder if that could now be misunderstood to have something to do > with the PostgreSQL page size? Maybe add the word "memory" or "operating > system" in the first sentence in the docs, like this: "Enables/disables the > use of huge memory pages". Whenever I wish to emphasize that distinction, I tend to use the term "MMU pages". -- Peter Geoghegan
Hi, > >Attached is a patch with the updated documentation (now uses > >consistently huge pages) as well as a renamed GUC, consistent wording > >(always use huge pages) as well as renamed variables. > > Hmm, I wonder if that could now be misunderstood to have something to do > with the PostgreSQL page size? Maybe add the word "memory" or "operating > system" in the first sentence in the docs, like this: "Enables/disables the > use of huge memory pages". Accepted, see attached patch. > > <para> > > At present, this feature is supported only on Linux. The setting is > > ignored on other systems when set to <literal>try</literal>. > > <productname>PostgreSQL</productname> will > > refuse to start when set to <literal>on</literal>. > > </para> > > Is it clear enough that PostgreSQL will only refuse to start up when it's > set to on, *if the feature's not supported on the platform*? Perhaps just > leave that last sentence out. It's mentioned later that " With > <literal>on</literal>, failure to use huge pages will prevent the server > from starting up.", that's probably enough. Fixed. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 28/02/14 17:58, Peter Geoghegan wrote: > On Fri, Feb 28, 2014 at 9:43 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > > Hmm, I wonder if that could now be misunderstood to have something to do > > with the PostgreSQL page size? Maybe add the word "memory" or "operating > > system" in the first sentence in the docs, like this: "Enables/disables the > > use of huge memory pages". > > Whenever I wish to emphasize that distinction, I tend to use the term > "MMU pages". I don't like to deviate that much from Linux terminology; this may lead to confusion. And using this term in only one place doesn't seem to make sense either – naming would then be inconsistent and thus lead to confusion, too. Do you agree? Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 03/03/2014 11:34 AM, Christian Kruse wrote: > Hi, > >>> Attached is a patch with the updated documentation (now uses >>> consistently huge pages) as well as a renamed GUC, consistent wording >>> (always use huge pages) as well as renamed variables. >> >> Hmm, I wonder if that could now be misunderstood to have something to do >> with the PostgreSQL page size? Maybe add the word "memory" or "operating >> system" in the first sentence in the docs, like this: "Enables/disables the >> use of huge memory pages". > > Accepted, see attached patch. Thanks, committed! I spotted this in section "17.4.1 Shared Memory and Semaphores": > Linux > > The default maximum segment size is 32 MB, and the default maximum total size is 2097152 pages. A page is almost always 4096 bytes except in unusual kernel configurations with "huge pages" (use getconf PAGE_SIZE to verify). It's not any more wrong now than it's always been, but I don't think huge pages ever affect PAGE_SIZE... Could I cajole you into rephrasing that, too? - Heikki
Hi, On 03/03/14 21:03, Heikki Linnakangas wrote: > I spotted this in section "17.4.1 Shared Memory and Semaphores": > > >Linux > > > > The default maximum segment size is 32 MB, and the default maximum total size is 2097152 pages. A page is almost always 4096 bytes except in unusual kernel configurations with "huge pages" (use getconf PAGE_SIZE to verify). > > It's not any more wrong now than it's always been, but I don't think huge > pages ever affect PAGE_SIZE... Could I cajole you into rephrasing that, too? Hm… to be honest, I'm not sure how to change that. What about this? The default maximum segment size is 32 MB, and the default maximum total size is 2097152 pages. A page is almost always 4096 bytes except in kernel configurations with <quote>huge pages</quote> (use <literal>cat /proc/meminfo | grep Hugepagesize</literal> to verify), but they have to be enabled explicitly via <xref linkend="guc-huge-pages">. See <xref linkend="linux-huge-pages"> for details. I attached a patch doing this change. Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
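[Editorial sketch] The two values contrasted above, the base page size reported by `getconf PAGE_SIZE` and the `Hugepagesize` line in /proc/meminfo, can be read side by side. The helper names `parse_hugepagesize` and `base_page_size` are hypothetical, and the sketch assumes the usual Linux /proc/meminfo layout in which the field is given in kB.

```python
import os
import re
from typing import Optional


def parse_hugepagesize(meminfo_text: str) -> Optional[int]:
    """Return the Hugepagesize value in bytes, or None if the field is absent."""
    match = re.search(r"^Hugepagesize:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) * 1024 if match else None


def base_page_size() -> int:
    """Base page size in bytes, the value `getconf PAGE_SIZE` reports."""
    return os.sysconf("SC_PAGE_SIZE")
```

On a Linux machine one would call `parse_hugepagesize(open("/proc/meminfo").read())` and compare the result with `base_page_size()`. The two are reported independently, which fits Heikki's remark that huge pages do not seem to change PAGE_SIZE.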