Experimenting with hash join prefetch - Mailing list pgsql-hackers

From Thomas Munro
Subject Experimenting with hash join prefetch
Msg-id CAEepm=2y9HM9QP+HhRZdQ3pU6FShSMyu=V1uHXhQ5gG-dketHg@mail.gmail.com
Responses Re: Experimenting with hash join prefetch  (Andrey Borodin <x4mmm@yandex-team.ru>)
List pgsql-hackers
Hello hackers,

I have a list of micro-optimisations and things to look into for hash
joins, which I've updated on the wiki[1].  Here's one that I was
inspired to poke at with a stick in a spare hour today.

Cache-oblivious hash joins cause a lot of TLB and cache misses.
Researchers tell us to reduce those using huge/super VM pages[2] and
cache prefetch instructions[3].  (There is another class of
cache-aware hash join algorithms that partition carefully up front to
avoid them; that's not us.)

Here is a totally contrived experiment that shows the effect quite
reproducibly here:

shared_buffers = '1GB'

create table t as select generate_series(1, 20000000)::int i;
set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('t'::regclass);

select count(*) from t t1 join t t2 using (i);

-> 00:12.683

First, let's try to prefetch the hash bucket for the next tuple while
computing the hash for the current tuple, since we can see into the
future quite easily: we know the keys are sequential integers in this
contrived experiment.  In ExecHashGetHashValue():

+               /* Prefetch the bucket for the next key */
+               uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+               uint32 next_bucket = next_hash % hashtable->nbuckets;
+               __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

select count(*) from t t1 join t t2 using (i);

-> 00:09.901

Huzzah!  Next up, let's try a two-stage prefetch pipeline for the
probing phase, seeing two steps ahead:

+               /* Prefetch the bucket for the tuple after next */
+               uint32 next_next_hash = hash_uint32(DatumGetInt32(keyval) + 2);
+               uint32 next_next_bucket = next_next_hash % hashtable->nbuckets;
+               __builtin_prefetch(&hashtable->buckets.unshared[next_next_bucket]);
+               if (outer_tuple)
+               {
+                       /* Prefetch the first tuple in the next bucket */
+                       uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+                       uint32 next_bucket = next_hash % hashtable->nbuckets;
+                       __builtin_prefetch(hashtable->buckets.unshared[next_bucket]);
+               }

-> 00:09.138

It's nice to see this effect, but it isn't really surprising: there is
no doubt about the value of prefetching randomly accessed data.

I think we could probably do this for the build phase with the
existing tuple-at-a-time executor interface by doing the bucket
insertions from a queue that runs N tuples behind the one we're
currently loading and hashing.  Or something like that.  For the probe
phase (probably much more interesting) I think it'd involve extra
tuple copying, so that it could still access the last tuple while
pulling the next tuple to hash its key.  To avoid that we'd need a
facility for peeking at future tuples, or a proper first class batch
mode.

[1] https://wiki.postgresql.org/wiki/Parallel_Hash
[2] https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf
(table VI)
[3] http://www.cs.cmu.edu/~chensm/papers/hashjoin_icde04.pdf

-- 
Thomas Munro
http://www.enterprisedb.com

