From: Thomas Munro
Subject: Fast DSM segments
Date:
Msg-id: CA+hUKGLAE2QBv-WgGp+D9P_J-=yne3zof9nfMaqq1h3EGHFXYQ@mail.gmail.com
Hello PostgreSQL 14 hackers,

FreeBSD is much faster than Linux (and probably Windows) at parallel
hash joins on the same hardware, primarily because its DSM segments
run in huge pages out of the box.  There are various ways to convince
recent-ish Linux to put our DSMs on huge pages (see below for one),
but that's not the only problem I wanted to attack.

The attached highly experimental patch adds a new GUC,
dynamic_shared_memory_main_size.  If you set it > 0, it creates a
fixed-size shared memory region that supplies memory for "fast" DSM
segments.  When there isn't enough free space, dsm_create() falls back
to the traditional approach using e.g. shm_open().  This allows
parallel queries to run faster, because:

* no more expensive system calls
* no repeated VM allocation (whether explicit posix_fallocate() or first-touch)
* can be in huge pages on Linux and Windows
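
For illustration, enabling it could look like the following.  This is
just a sketch: I'm assuming the GUC takes the usual memory-unit
syntax and, since the region is created at startup, requires a
restart; the huge_pages line is the existing GUC, included because
the new region can live in huge pages:

  alter system set dynamic_shared_memory_main_size = '4GB';
  alter system set huge_pages = 'on';
  -- restart so the fixed region is created with the new size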

This makes lots of parallel queries measurably faster, especially
parallel hash join.  To demonstrate with a very simple query:

  create extension if not exists pg_prewarm;  -- pg_prewarm() lives in this extension
  create table t (i int);
  insert into t select generate_series(1, 10000000);
  select pg_prewarm('t');
  set work_mem = '1GB';

  select count(*) from t t1 join t t2 using (i);
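
To confirm that this exercises the parallel path, explain should show
a Parallel Hash Join under a Gather node (with the default of 2
workers):

  explain (costs off)
  select count(*) from t t1 join t t2 using (i);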

Here are some quick and dirty results from a Linux 4.19 laptop.  The
first column is the new GUC, and the last column is from "perf stat -e
dTLB-load-misses -p <backend>".

  size  huge_pages time   speedup  TLB misses
  0     off        2.595s           9,131,285
  0     on         2.571s      1%   8,951,595
  1GB   off        2.398s      8%   9,082,803
  1GB   on         1.898s     37%     169,867

You can get some of this speedup unpatched on a Linux 4.7+ system by
putting "huge=always" in your /etc/fstab options for /dev/shm (= where
shm_open() lives).  For comparison, that gives me:

  size  huge_pages time   speedup  TLB misses
  0     on         2.007s     29%     221,910
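
Concretely, the /etc/fstab entry looks something like this
(huge=always is the tmpfs mount option that appeared in Linux 4.7;
the other fields are ordinary tmpfs defaults):

  tmpfs  /dev/shm  tmpfs  defaults,huge=always  0  0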

That still leaves the other 8% on the table, and in fact that 8%
explodes to a much larger number as you throw more cores at the
problem (here I was using defaults, 2 workers).  Unfortunately, dsa.c
-- used by parallel hash join to allocate vast amounts of memory
really fast during the build phase -- holds a lock while creating new
segments, as you'll soon discover if you test very large hash join
builds on a 72-way box.  I considered allowing concurrent segment
creation, but as far as I could see that would lead to terrible
fragmentation problems, especially in combination with our geometric
growth policy for segment sizes, which we use because slots are
limited.  I think this is the main factor that causes parallel hash
join scalability to fall off around 8 cores.  The present patch
should really help with that (more digging in that area needed; there
are other ways to improve that situation, possibly including
something smarter than a stream of dsa_allocate(32kB) calls).

A competing idea would be to keep freelists of lingering DSM segments
for reuse.  Among other problems, you'd probably run into
fragmentation due to their differing sizes.  Perhaps there could be a
hybrid of these two ideas, putting a region for "fast" DSM segments
inside many OS-supplied segments, though that's obviously much more
complicated.

As for what a reasonable setting would be for this patch, well, erm,
it depends.  Obviously that's RAM that the system can't use for other
purposes while you're not running parallel queries, and if it's huge
pages, it can't be swapped out; if it's not huge pages, then it can be
swapped out, and that'd be terrible for performance next time you need
it.  So you wouldn't want to set it too large.  If you set it too
small, it falls back to the traditional behaviour.

One argument I've heard in favour of creating fresh segments every
time is that NUMA systems configured to prefer local memory allocation
(as opposed to interleaved allocation) probably avoid cross-node
traffic.  I haven't looked into that topic yet; I suppose one way to
deal with it in this scheme would be to have one such region per node,
and prefer to allocate from the local one.

