Re: Report: Linux huge pages with Postgres - Mailing list pgsql-hackers

From: Kenneth Marshall
Subject: Re: Report: Linux huge pages with Postgres
Msg-id: 20101128223038.GA13313@aart.is.rice.edu
In response to: Report: Linux huge pages with Postgres (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Sat, Nov 27, 2010 at 02:27:12PM -0500, Tom Lane wrote:
> We've gotten a few inquiries about whether Postgres can use "huge pages"
> under Linux.  In principle that should be more efficient for large shmem
> regions, since fewer TLB entries are needed to support the address
> space.  I spent a bit of time today looking into what that would take.
> My testing was done with current Fedora 13, kernel version
> 2.6.34.7-61.fc13.x86_64 --- it's possible some of these details vary
> across other kernel versions.
> 
> You can test this with fairly minimal code changes, as illustrated in
> the attached not-production-grade patch.  To select huge pages we have
> to include SHM_HUGETLB in the flags for shmget(), and we have to be
> prepared for failure (due to permissions or lack of allocated
> hugepages).  I made the code just fall back to a normal shmget on
> failure.  A bigger problem is that the shmem request size must be a
> multiple of the system's hugepage size, which is *not* a constant
> even though the test patch just uses 2MB as the assumed value.  For a
> production-grade patch we'd have to scrounge the active value out of
> someplace in the /proc filesystem (ick).
> 

I would expect that you could just iterate through the possible hugepage
sizes quickly and use the first one that works -- no /proc groveling.

> In addition to the code changes there are a couple of sysadmin
> requirements to make huge pages available to Postgres:
> 
> 1. You have to configure the Postgres user as a member of the group
> that's permitted to allocate hugepage shared memory.  I did this:
> sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
> For production use you'd need to put this in the PG initscript,
> probably, to ensure it gets re-set after every reboot and before PG
> is started.
> 
Since Postgres would take advantage of huge pages automatically, this
would just be a normal DBA/admin task.

> 2. You have to manually allocate some huge pages --- there doesn't
> seem to be any setting that says "just give them out on demand".
> I did this:
> sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
> which gave me a bit over 1GB of space reserved as huge pages.
> Again, this'd have to be done over again at each system boot.
> 
Same.
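For what it's worth, both settings can also be persisted across reboots
through /etc/sysctl.conf instead of an initscript; vm.nr_hugepages and
vm.hugetlb_shm_group are the sysctl names for the /proc/sys/vm files used
above. A sketch, using the example values from this thread (the gid is a
placeholder -- substitute the output of "id -g postgres"):

```
# /etc/sysctl.conf -- applied at boot
vm.nr_hugepages = 600
# gid permitted to allocate hugepage shared memory (placeholder value)
vm.hugetlb_shm_group = 1000
```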

> For testing purposes, I figured that what I wanted to stress was
> postgres process swapping and shmem access.  I built current git HEAD
> with --enable-debug and no other options, and tested with these
> non-default settings:
>  shared_buffers        1GB
>  checkpoint_segments    50
>  fsync            off
> (fsync intentionally off since I'm not trying to measure disk speed).
> The test machine has two dual-core Nehalem CPUs.  Test case is pgbench
> at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
> in each configuration.
> 
> And the bottom line is: if there's any performance benefit at all,
> it's on the order of 1%.  The best result I got was about 3200 TPS
> with hugepages, and about 3160 without.  The noise in these numbers
> is more than 1% though.
> 
> This is discouraging; it certainly doesn't make me want to expend the
> effort to develop a production patch.  However, perhaps someone else
> can try to show a greater benefit under some other test conditions.
> 
>             regards, tom lane
> 
I would not really expect to see much benefit in a region that the
normal TLB page size can already cover with the typical number of TLB
entries. 1GB of shared buffers is not enough to cause TLB thrashing on
most processors. Bump it to 8-32GB or more, and if the queries also
consume TLB entries through local work_mem, you should see more value
from the patch.
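Rough numbers to illustrate the scale (assuming a hypothetical 512-entry
TLB as a round figure; actual TLB geometry varies by processor):

```c
/* TLB "reach": how much address space the TLB can map without misses.
 * With 512 entries, 4kB pages cover only 2MB, while 2MB huge pages
 * cover 1GB -- about the shared_buffers size in the test above, so the
 * huge-page TLB there is barely being stressed. */
static long
tlb_reach_mb(long entries, long page_bytes)
{
    return entries * page_bytes / (1024 * 1024);
}
```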

Regards,
Ken

