Thread: Report: Linux huge pages with Postgres
We've gotten a few inquiries about whether Postgres can use "huge pages" under Linux.  In principle that should be more efficient for large shmem regions, since fewer TLB entries are needed to support the address space.  I spent a bit of time today looking into what that would take.  My testing was done with current Fedora 13, kernel version 2.6.34.7-61.fc13.x86_64 --- it's possible some of these details vary across other kernel versions.

You can test this with fairly minimal code changes, as illustrated in the attached not-production-grade patch.  To select huge pages we have to include SHM_HUGETLB in the flags for shmget(), and we have to be prepared for failure (due to permissions or lack of allocated hugepages).  I made the code just fall back to a normal shmget on failure.  A bigger problem is that the shmem request size must be a multiple of the system's hugepage size, which is *not* a constant even though the test patch just uses 2MB as the assumed value.  For a production-grade patch we'd have to scrounge the active value out of someplace in the /proc filesystem (ick).

In addition to the code changes there are a couple of sysadmin requirements to make huge pages available to Postgres:

1. You have to configure the Postgres user as a member of the group that's permitted to allocate hugepage shared memory.  I did this:
	sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
For production use you'd need to put this in the PG initscript, probably, to ensure it gets re-set after every reboot and before PG is started.

2. You have to manually allocate some huge pages --- there doesn't seem to be any setting that says "just give them out on demand".  I did this:
	sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
which gave me a bit over 1GB of space reserved as huge pages.  Again, this'd have to be done over again at each system boot.

For testing purposes, I figured that what I wanted to stress was postgres process swapping and shmem access.  I built current git HEAD with --enable-debug and no other options, and tested with these non-default settings:
	shared_buffers		1GB
	checkpoint_segments	50
	fsync			off
(fsync intentionally off since I'm not trying to measure disk speed).
The test machine has two dual-core Nehalem CPUs.  Test case is pgbench at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench" in each configuration.

And the bottom line is: if there's any performance benefit at all, it's on the order of 1%.  The best result I got was about 3200 TPS with hugepages, and about 3160 without.  The noise in these numbers is more than 1% though.

This is discouraging; it certainly doesn't make me want to expend the effort to develop a production patch.  However, perhaps someone else can try to show a greater benefit under some other test conditions.
			regards, tom lane

*** src/backend/port/sysv_shmem.c.orig	Wed Sep 22 18:57:31 2010
--- src/backend/port/sysv_shmem.c	Sat Nov 27 13:39:46 2010
***************
*** 33,38 ****
--- 33,39 ----
  #include "miscadmin.h"
  #include "storage/ipc.h"
  #include "storage/pg_shmem.h"
+ #include "storage/shmem.h"


  typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
***************
*** 75,80 ****
--- 76,92 ----
  	IpcMemoryId shmid;
  	void	   *memAddress;

+ #ifdef SHM_HUGETLB
+ 	/* request must be multiple of page size, else shmat() will fail */
+ #define HUGE_PAGE_SIZE	(2 * 1024 * 1024)
+ 	size = add_size(size, HUGE_PAGE_SIZE - (size % HUGE_PAGE_SIZE));
+
+ 	shmid = shmget(memKey, size,
+ 				   SHM_HUGETLB | IPC_CREAT | IPC_EXCL | IPCProtection);
+ 	if (shmid >= 0)
+ 		elog(LOG, "shmget with SHM_HUGETLB succeeded");
+ 	else
+ #endif
  	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);

  	if (shmid < 0)
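As an aside, one way to make the two sysadmin settings above survive a reboot is via sysctl configuration rather than the PG initscript.  This is only a sketch, assuming the sysctl names vm.nr_hugepages and vm.hugetlb_shm_group; the numeric group id is a placeholder for whatever "id -g postgres" returns on the machine in question:

	# sketch of /etc/sysctl.conf entries (values are the ones used in the test above;
	# the gid shown is a placeholder, not something taken from this thread)
	vm.nr_hugepages = 600
	vm.hugetlb_shm_group = 26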
On Sat, Nov 27, 2010 at 2:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> For testing purposes, I figured that what I wanted to stress was
> postgres process swapping and shmem access.  I built current git HEAD
> with --enable-debug and no other options, and tested with these
> non-default settings:
> 	shared_buffers		1GB
> 	checkpoint_segments	50
> 	fsync			off
> (fsync intentionally off since I'm not trying to measure disk speed).
> The test machine has two dual-core Nehalem CPUs.  Test case is pgbench
> at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
> in each configuration.
>
> And the bottom line is: if there's any performance benefit at all,
> it's on the order of 1%.  The best result I got was about 3200 TPS
> with hugepages, and about 3160 without.  The noise in these numbers
> is more than 1% though.
>
> This is discouraging; it certainly doesn't make me want to expend the
> effort to develop a production patch.  However, perhaps someone else
> can try to show a greater benefit under some other test conditions.

Hmm.  Presumably in order to see a large benefit, you would need to have shared_buffers set large enough to thrash the TLB.  I have no idea how big TLBs on modern systems are, but it'd be interesting to test this on a big machine with 8GB of shared buffers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:
> This is discouraging; it certainly doesn't make me want to expend the
> effort to develop a production patch.

Perhaps.

Why do this only for shared memory?  Surely the majority of memory accesses are to private memory, so being able to allocate private memory in a single huge page would be better for avoiding TLB cache misses.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:
>> This is discouraging; it certainly doesn't make me want to expend the
>> effort to develop a production patch.

> Perhaps.

> Why do this only for shared memory?

There's no exposed API for causing a process's regular memory to become hugepages.

> Surely the majority of memory accesses are to private memory, so being
> able to allocate private memory in a single huge page would be better
> for avoiding TLB cache misses.

It's not really about the number of memory accesses, it's about the number of TLB entries needed.  Private memory is generally a lot smaller than shared, in a tuned PG installation.

			regards, tom lane
On Sun, 2010-11-28 at 12:04 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:
> >> This is discouraging; it certainly doesn't make me want to expend the
> >> effort to develop a production patch.
>
> > Perhaps.
>
> > Why do this only for shared memory?
>
> There's no exposed API for causing a process's regular memory to become
> hugepages.

We could make all the palloc stuff into shared memory also ("private" shared memory that is).  We're not likely to run out of 64-bit memory addresses any time soon.

> > Surely the majority of memory accesses are to private memory, so being
> > able to allocate private memory in a single huge page would be better
> > for avoiding TLB cache misses.
>
> It's not really about the number of memory accesses, it's about the
> number of TLB entries needed.  Private memory is generally a lot smaller
> than shared, in a tuned PG installation.

Sure, but 4MB of memory is enough to require 1000 TLB entries, which is more than enough to blow the TLB even on a Nehalem.  So the size of the memory we access is already big enough to blow the cache, even without shared buffers.  If the majority of accesses are from private memory then the TLB cache will already be thrashed by the time we access shared buffers again.  That is at least one possible explanation for the lack of benefit.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
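For concreteness, the arithmetic behind that 1000-entry figure, as a back-of-the-envelope check assuming 4 KB base pages (only the 4MB figure comes from the message above):

	4 MB / 4 KB per page       = 1024 mappings, i.e. roughly 1000 TLB entries
	4 MB / 2 MB per huge page  = 2 TLB entries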
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2010-11-28 at 12:04 -0500, Tom Lane wrote:
>> There's no exposed API for causing a process's regular memory to become
>> hugepages.

> We could make all the palloc stuff into shared memory also ("private"
> shared memory that is).  We're not likely to run out of 64-bit memory
> addresses any time soon.

Mph.  It's still not going to work well enough to be useful, because the kernel design for hugepages assumes a pretty static number of them.  That maps well to our use of shared memory, not at all well to process local memory.

> Sure, but 4MB of memory is enough to require 1000 TLB entries, which is
> more than enough to blow the TLB even on a Nehalem.

That can't possibly be right.  I'm sure the chip designers have heard of programs using more than 4MB.

			regards, tom lane
On Sun, Nov 28, 2010 at 02:32:04PM -0500, Tom Lane wrote:
> > Sure, but 4MB of memory is enough to require 1000 TLB entries, which is
> > more than enough to blow the TLB even on a Nehalem.
>
> That can't possibly be right.  I'm sure the chip designers have heard of
> programs using more than 4MB.

According to http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=8 on the Core 2 chip there wasn't even enough TLB to cover the entire onboard cache.  With Nehalem there are 2304 TLB entries on the chip, which cover at least the whole onboard cache, but only just.

Memory access is expensive.  I think if you got good statistics on how much time your CPU is waiting for memory it'd be pretty depressing.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle
On Sat, Nov 27, 2010 at 02:27:12PM -0500, Tom Lane wrote:
> We've gotten a few inquiries about whether Postgres can use "huge pages"
> under Linux.  In principle that should be more efficient for large shmem
> regions, since fewer TLB entries are needed to support the address
> space.  I spent a bit of time today looking into what that would take.
> My testing was done with current Fedora 13, kernel version
> 2.6.34.7-61.fc13.x86_64 --- it's possible some of these details vary
> across other kernel versions.
>
> You can test this with fairly minimal code changes, as illustrated in
> the attached not-production-grade patch.  To select huge pages we have
> to include SHM_HUGETLB in the flags for shmget(), and we have to be
> prepared for failure (due to permissions or lack of allocated
> hugepages).  I made the code just fall back to a normal shmget on
> failure.  A bigger problem is that the shmem request size must be a
> multiple of the system's hugepage size, which is *not* a constant
> even though the test patch just uses 2MB as the assumed value.  For a
> production-grade patch we'd have to scrounge the active value out of
> someplace in the /proc filesystem (ick).
>

I would expect that you can just iterate through the size possibilities pretty quickly and just use the first one that works -- no /proc groveling.

> In addition to the code changes there are a couple of sysadmin
> requirements to make huge pages available to Postgres:
>
> 1. You have to configure the Postgres user as a member of the group
> that's permitted to allocate hugepage shared memory.  I did this:
> 	sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
> For production use you'd need to put this in the PG initscript,
> probably, to ensure it gets re-set after every reboot and before PG
> is started.
>

Since it would take advantage of them automatically, this would be just a normal DBA/admin task.

> 2. You have to manually allocate some huge pages --- there doesn't
> seem to be any setting that says "just give them out on demand".
> I did this:
> 	sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
> which gave me a bit over 1GB of space reserved as huge pages.
> Again, this'd have to be done over again at each system boot.
>

Same.

> For testing purposes, I figured that what I wanted to stress was
> postgres process swapping and shmem access.  I built current git HEAD
> with --enable-debug and no other options, and tested with these
> non-default settings:
> 	shared_buffers		1GB
> 	checkpoint_segments	50
> 	fsync			off
> (fsync intentionally off since I'm not trying to measure disk speed).
> The test machine has two dual-core Nehalem CPUs.  Test case is pgbench
> at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
> in each configuration.
>
> And the bottom line is: if there's any performance benefit at all,
> it's on the order of 1%.  The best result I got was about 3200 TPS
> with hugepages, and about 3160 without.  The noise in these numbers
> is more than 1% though.
>
> This is discouraging; it certainly doesn't make me want to expend the
> effort to develop a production patch.  However, perhaps someone else
> can try to show a greater benefit under some other test conditions.
>
> 			regards, tom lane
>

I would not really expect to see much benefit in the region that the normal TLB page size would cover with the typical number of TLB entries.  1GB of shared buffers would not be enough to cause TLB thrashing with most processors.  Bump it to 8-32GB or more and if the queries use up TLB entries with local work_mem you should see some more value in the patch.

Regards,
Ken
Kenneth Marshall <ktm@rice.edu> writes:
> On Sat, Nov 27, 2010 at 02:27:12PM -0500, Tom Lane wrote:
>> ... A bigger problem is that the shmem request size must be a
>> multiple of the system's hugepage size, which is *not* a constant
>> even though the test patch just uses 2MB as the assumed value.  For a
>> production-grade patch we'd have to scrounge the active value out of
>> someplace in the /proc filesystem (ick).

> I would expect that you can just iterate through the size possibilities
> pretty quickly and just use the first one that works -- no /proc
> groveling.

It's not really that easy, because (at least on the kernel version I tested) it's not the shmget that fails, it's the later shmat.  Releasing and reacquiring the shm segment would require significant code restructuring, and at least on some platforms could produce weird failure cases --- I seem to recall having heard of kernels where the release isn't instantaneous, so that you could run up against SHMMAX for no apparent reason.  Really you do want to scrape the value.

>> 2. You have to manually allocate some huge pages --- there doesn't
>> seem to be any setting that says "just give them out on demand".
>> I did this:
>> 	sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
>> which gave me a bit over 1GB of space reserved as huge pages.
>> Again, this'd have to be done over again at each system boot.

> Same.

The fact that hugepages have to be manually managed, and that any unused ones represent completely wasted RAM, seems like a pretty large PITA to me.  I don't see anybody buying into that for gains measured in single-digit percentages.

> 1GB of shared buffers would not be enough to cause TLB thrashing with
> most processors.

Well, bigger cases would be useful to try, although Simon was claiming that the TLB starts to fall over at 4MB of working set.  I don't have a large enough machine to try the sort of test you're suggesting, so if anyone thinks this is worth pursuing, there's the patch ... go test it.

			regards, tom lane
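For what it's worth, scraping that value needn't be much code.  The following is only a rough, untested sketch of what a production version might do in place of the patch's hard-wired 2MB, assuming the "Hugepagesize:" line that /proc/meminfo reports in kB; it is not part of the posted patch:

	#include <stdio.h>

	/*
	 * Sketch only: read the active hugepage size from /proc/meminfo.
	 * Returns 0 if it can't be determined, in which case the caller would
	 * fall back to ordinary pages, as the test patch already does on
	 * shmget() failure.
	 */
	static size_t
	get_huge_page_size(void)
	{
		FILE	   *f = fopen("/proc/meminfo", "r");
		char		buf[128];
		unsigned long kb = 0;

		if (f == NULL)
			return 0;
		while (fgets(buf, sizeof(buf), f) != NULL)
		{
			/* matches e.g. "Hugepagesize:    2048 kB" */
			if (sscanf(buf, "Hugepagesize: %lu kB", &kb) == 1)
				break;
		}
		fclose(f);
		return (size_t) kb * 1024;
	}

	/*
	 * In the patch this would replace the hard-wired HUGE_PAGE_SIZE, e.g.
	 *
	 *		hps = get_huge_page_size();
	 *		if (hps > 0 && size % hps != 0)
	 *			size = add_size(size, hps - (size % hps));
	 */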
On Mon, Nov 29, 2010 at 12:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would expect that you can just iterate through the size possibilities
>> pretty quickly and just use the first one that works -- no /proc
>> groveling.
>
> It's not really that easy, because (at least on the kernel version I
> tested) it's not the shmget that fails, it's the later shmat.  Releasing
> and reacquiring the shm segment would require significant code
> restructuring, and at least on some platforms could produce weird
> failure cases --- I seem to recall having heard of kernels where the
> release isn't instantaneous, so that you could run up against SHMMAX
> for no apparent reason.  Really you do want to scrape the value.
>

Couldn't we just round the shared memory allocation down to a multiple of 4MB?  That would handle all older architectures where the size is 2MB or 4MB.

I see online that IA64 supports larger page sizes, up to 256MB; but then, if users change their hugepagesize to a larger value, could we make it their problem to pick a value of shared_buffers that will fit cleanly?  For that to work we might need to rejigger things so that the shared memory segment is exactly the size of shared_buffers and any other shared data structures are in a separate segment, though.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> On Mon, Nov 29, 2010 at 12:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Really you do want to scrape the value.

> Couldn't we just round the shared memory allocation down to a multiple
> of 4MB?  That would handle all older architectures where the size is
> 2MB or 4MB.

Rounding *down* will not work, at least not without extremely invasive changes to the shmem allocation code.  Rounding up is okay, as long as you don't mind some possibly-wasted space.

> I see online that IA64 supports larger page sizes, up to 256MB; but then,
> if users change their hugepagesize to a larger value, could we make it
> their problem to pick a value of shared_buffers that will fit cleanly?
> For that to work we might need to rejigger things so that the shared
> memory segment is exactly the size of shared_buffers and any other
> shared data structures are in a separate segment, though.

Two shmem segments would be a pretty serious PITA too, certainly a lot more so than a few lines to read a magic number from /proc.

But this is all premature pending a demonstration that there's enough potential gain here to be worth taking any trouble at all.  The one set of numbers we have says otherwise.

			regards, tom lane
On Sat, 27 Nov 2010 14:27:12 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> And the bottom line is: if there's any performance benefit at all,
> it's on the order of 1%.  The best result I got was about 3200 TPS
> with hugepages, and about 3160 without.  The noise in these numbers
> is more than 1% though.
>
> This is discouraging; it certainly doesn't make me want to expend the
> effort to develop a production patch.  However, perhaps someone else
> can try to show a greater benefit under some other test conditions.

Just a quick note: I can't hazard a guess as to why you're not getting better results than you are, but I *can* say that putting together a production-quality patch may not be worth your effort regardless.  There is a nice "transparent hugepages" patch set out there which makes hugepages "just happen" when it seems to make sense and the system can support it.  It eliminates the need for all administrative fiddling and for any support at the application level.

This patch is invasive and has proved to be hard to merge.  RHEL6 has it, though, and I believe it will get in eventually.  I can point you at the developer involved if you'd like to experiment with this feature and see what it can do for you.

jon

Jonathan Corbet / LWN.net / corbet@lwn.net
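For anyone wanting to experiment: in the form that eventually went into mainline, transparent hugepages are toggled through sysfs rather than by the application.  The path and values below are a sketch of that mainline interface, not something stated in this thread:

	cat /sys/kernel/mm/transparent_hugepage/enabled     # shows e.g. "[always] madvise never"
	echo always >/sys/kernel/mm/transparent_hugepage/enabled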
Jonathan Corbet <corbet@lwn.net> writes:
> Just a quick note: I can't hazard a guess as to why you're not getting
> better results than you are, but I *can* say that putting together a
> production-quality patch may not be worth your effort regardless.  There
> is a nice "transparent hugepages" patch set out there which makes
> hugepages "just happen" when it seems to make sense and the system can
> support it.  It eliminates the need for all administrative fiddling and
> for any support at the application level.

That would be cool, because the current kernel feature is about as unfriendly to use as it could possibly be ...

			regards, tom lane
On Mon, Nov 29, 2010 at 10:30 AM, Jonathan Corbet <corbet@lwn.net> wrote:
> On Sat, 27 Nov 2010 14:27:12 -0500
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> And the bottom line is: if there's any performance benefit at all,
>> it's on the order of 1%.  The best result I got was about 3200 TPS
>> with hugepages, and about 3160 without.  The noise in these numbers
>> is more than 1% though.
>>
>> This is discouraging; it certainly doesn't make me want to expend the
>> effort to develop a production patch.  However, perhaps someone else
>> can try to show a greater benefit under some other test conditions.
>
> Just a quick note: I can't hazard a guess as to why you're not getting
> better results than you are, but I *can* say that putting together a
> production-quality patch may not be worth your effort regardless.  There
> is a nice "transparent hugepages" patch set out there which makes
> hugepages "just happen" when it seems to make sense and the system can
> support it.  It eliminates the need for all administrative fiddling and
> for any support at the application level.

Neat!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company