Re: [PATCH] Add support for choosing huge page size - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [PATCH] Add support for choosing huge page size
Date
Msg-id CA+hUKG+gdWThHi0v6TmiLgUE_rqqQ+PKw2t+kT6w08H36qzxpw@mail.gmail.com
In response to Re: [PATCH] Add support for choosing huge page size  (Odin Ugedal <odin@ugedal.com>)
List pgsql-hackers
Hi Odin,

The documentation has a syntax error: "<literal>2MB<literal>" is missing
its closing tag, which shows up as:

config.sgml:1605: parser error : Opening and ending tag mismatch:
literal line 1602 and para
       </para>
              ^

Please install the documentation tools
https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun
configure and "make docs" to see these kinds of errors.

The build is currently failing on Windows:

undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h
line 143 at src/tools/msvc/Mkvcbuild.pm line 851.

I think that's telling us that you need to add HAVE_DECL_MAP_HUGE_MASK
to src/tools/msvc/Solution.pm, so that the Windows build can say that
it doesn't have it.  I don't have a Windows machine, but whenever you
post a new version we'll see if Windows likes it here:

http://cfbot.cputube.org/odin-ugedal.html

With huge_pages=on and huge_page_size=1GB, but the default
shared_buffers, I noticed that the hint reports the wrong (unrounded)
request size:

2020-06-18 02:06:30.407 UTC [73552] HINT:  This error usually means
that PostgreSQL's request for a shared memory segment exceeded
available memory, swap space, or huge pages. To reduce the request
size (currently 149069824 bytes), reduce PostgreSQL's shared memory
usage, perhaps by reducing shared_buffers or max_connections.

The request size was actually:

mmap(NULL, 1073741824, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30<<MAP_HUGE_SHIFT, -1, 0) = -1
ENOMEM (Cannot allocate memory)
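
To illustrate the mismatch, here is a small Python sketch of the
arithmetic (not the patch's C code).  It assumes the request is rounded
up to a multiple of the huge page size, and that the page size is
encoded as log2(size) << MAP_HUGE_SHIFT with the Linux value of 26.  The
helper names are made up for this example.

```python
# Sketch only: mirrors the arithmetic, not PostgreSQL's actual C code.
MAP_HUGE_SHIFT = 26  # value used by Linux headers; assumed here


def round_up(size, page_size):
    """Round size up to the next multiple of page_size."""
    return -(-size // page_size) * page_size


def map_huge_flag(page_size):
    """Encode a power-of-two huge page size as log2(size) << MAP_HUGE_SHIFT."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("huge page size must be a power of two")
    return (page_size.bit_length() - 1) << MAP_HUGE_SHIFT


GB = 1 << 30
reported = 149069824                 # size quoted in the HINT
requested = round_up(reported, GB)   # size that mmap() was asked for
print(reported, requested, map_huge_flag(GB) == 30 << MAP_HUGE_SHIFT)
```

The hint would be less confusing if it printed the rounded value, since
that is what the kernel failed to provide.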

1GB pages are so big that it becomes a little tricky to set shared
buffers large enough without wasting RAM.  What I mean is, if I want
to use shared_buffers=16GB, I need to have at least 17 huge pages
available, but the 17th page is nearly entirely wasted!  Imagine that
with POWER's 16GB pages.  That makes me wonder if we should actually
redefine these GUCs so that you state the total, or at least use the
rounded-up memory for buffers...  I think we could consider that to be
a separate problem with a separate patch though.
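
A rough Python sketch of that waste, assuming the total request is
shared_buffers plus some fixed overhead for the other shared memory
structures.  The 140MB overhead is a hypothetical figure chosen only
for illustration, and pages_and_waste is a made-up helper.

```python
MB = 1 << 20
GB = 1 << 30


def pages_and_waste(request_bytes, page_size):
    """Return (huge pages needed, bytes left unused in the last page)."""
    pages = -(-request_bytes // page_size)  # ceiling division
    return pages, pages * page_size - request_bytes


overhead = 140 * MB            # hypothetical non-buffer shared memory
request = 16 * GB + overhead   # shared_buffers=16GB plus overhead

for label, page_size in (("2MB", 2 * MB), ("1GB", GB), ("16GB", 16 * GB)):
    pages, waste = pages_and_waste(request, page_size)
    print(f"{label} pages: need {pages}, waste {waste // MB} MB")
```

Under these assumptions the waste is always less than one page, which is
negligible at 2MB but most of a gigabyte (or most of 16GB) at the larger
sizes.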

Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
3.5GB table against itself.  Hash joins are the perfect way to
exercise the TLB because they're very likely to miss.  I also applied
my patch[1] to allow parallel queries to use shared memory from the
main shared memory area, so that they benefit from the configured page
size, using pages that are allocated once at start up.  (Without that,
you'd have to mess around with /dev/shm mount options, and then hope
that pages were available at query time, and it'd also be slower for
other stupid implementation reasons).

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

shared_buffers=8GB
dynamic_shared_memory_main_size=8GB

create table t as select generate_series(1, 100000000)::int i;
alter table t set (parallel_workers = 7);
create extension pg_prewarm;
select pg_prewarm('t');
set max_parallel_workers_per_gather=7;
set work_mem='1GB';

select count(*) from t t1 join t t2 using (i);

4KB pages: 12.42 seconds
2MB pages:  9.12 seconds
1GB pages:  9.07 seconds

Unfortunately I can't access the TLB miss counters on this system due
to virtualisation restrictions, and the systems where I can don't have
1GB pages.  According to cpuid(1) this system has a fairly typical
setup:

   cache and TLB information (2):
      0x63: data TLB: 2M/4M pages, 4-way, 32 entries
            data TLB: 1G pages, 4-way, 4 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries

This operation is touching about 8GB of data (scanning 3.5GB of table,
building a 4.5GB hash table) so 4 x 1GB is not enough to do this without
TLB misses.
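
The reasoning can be sketched as a quick "TLB reach" calculation
(entries times page size) in Python.  This only uses the data TLB
entries shown in the cpuid output, treats the "2M/4M" entry as 2MB
pages, and ignores any second-level TLB, so it is a rough model rather
than a description of the real hardware behaviour.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# (page size, data TLB entries) taken from the cpuid output above;
# the "2M/4M" entry is assumed to hold 2MB pages here.
DATA_TLB = {"4KB": (4 * KB, 64), "2MB": (2 * MB, 32), "1GB": (GB, 4)}


def tlb_reach(page_size, entries):
    """Bytes of memory the TLB can map without a miss."""
    return page_size * entries


working_set = 8 * GB  # roughly 3.5GB table plus 4.5GB hash table

for label, (page_size, entries) in DATA_TLB.items():
    reach = tlb_reach(page_size, entries)
    verdict = "covers" if reach >= working_set else "does not cover"
    print(f"{label} pages: reach {reach // KB} KB, {verdict} the working set")
```

In this model none of the page sizes covers 8GB, and 1GB pages only
reach 4GB, which is why a smaller working set is the natural next test.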

Let's try that again, except this time with shared_buffers=4GB,
dynamic_shared_memory_main_size=4GB, and only half as many tuples in
t, so it ought to fit:

4KB pages:  6.37 seconds
2MB pages:  4.96 seconds
1GB pages:  5.07 seconds

Well that's disappointing.  I wondered if this was something to do
with NUMA effects on this two node box, so I tried running that again
with postgres under numactl --cpunodebind 0 --membind 0 and I got:

4KB pages:  5.43 seconds
2MB pages:  4.05 seconds
1GB pages:  4.00 seconds

From this I can't really conclude that it's terribly useful to use
larger page sizes, but it's certainly useful to have the ability to do
further testing using the proposed GUC.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com


