Re: Experimenting with hash join prefetch - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Experimenting with hash join prefetch
Date:
Msg-id: CA+hUKGL1HQaf15cU=D1nRPOhzfA4K-s=9ZsyLYPYc5tN+MykuA@mail.gmail.com
In response to: Re: Experimenting with hash join prefetch (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Feb 4, 2020 at 2:31 PM Andres Freund <andres@anarazel.de> wrote:
> How much of the benefit here comes from the prefetching, and how much
> just from writing the code in a manner that allows for more
> out-of-order execution? Because there's no dependencies preventing
> execution of the next queued tuple anymore, I'd assume that this is a
> good part of what helps?

A small part of the speed-up does indeed seem to come from that sort
of thing.

> Code like this really should look something roughly like:
>
> while (true)
>     have_skew = False
>
>     # get tuples
>     for i in 0..batchsize:
>         tuples[i] = ExecProcNode(outerNode)
>         if tuples[i] == NULL:
>             # have slowpath handle this
>             break
>
>     # compute their hash values
>     for i in 0..batchsize:
>         hashvalues[i] = ExecHashGetHashValue(tuples[i])
>
>     # check whether any go into skew buckets
>     if have_skew_table:
>         for i in 0..batchsize:
>             skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
>             if skewbuckets[i] != INVALID_SKEW_BUCKET_NO:
>                 have_skew = True
>         if have_skew:
>             # handle everything here
>             continue
>
>     # assume there are no skew tuples going forward, all handled above
>
>     # compute bucket/batch for all tuples
>     have_into_batchfile = False
>     for i in 0..batchsize:
>         hashbuckets[i], hashbatches[i] = ExecHashGetBucketAndBatch(hashvalues[i])
>         if hashbatches[i] != hashtable->curbatch:
>             have_into_batchfile = True
>
>     if have_into_batchfile:
>         # slowpath
>         continue
>
>     # allocate all tuples
>     for i in 0..batchsize:
>         hjtuples[i] = alloc_mintuple(tuples[i])
>
>     # and finally insert them
>     for i in 0..batchsize:
>         hjtuples[i].next = buckets[hashbuckets[i]]
>         buckets[hashbuckets[i]] = hjtuples[i]

Hmm. I see your point: don't branch on the batch number immediately
after computing it, and so on. I thought about a multi-pass scheme a
bit like that too, just for prefetching purposes, though I haven't
tested it due to lack of the required infrastructure: I guess you need
a way to get the next N tuples and make them all available
simultaneously, without copying them yet. For this experiment, I
speculated that it might be better to insert continually a short
distance behind the prefetch, so there are no batch-boundary stalls at
all. Admittedly, it's pretty hard to choose the right queue depth if
your loop includes ExecProcNode(), because you have no idea what that
actually does; on the other hand, you do need to put enough cycles
between prefetch and fetch to see any benefit, so maybe that's not so
crazy. Perhaps to get more speed-up I'd need to consider dependencies
along the lines you're describing, but also find a way to keep
prefetch and insertion far enough apart to win. Hmm.

> I would bet this helps significantly even if there's no prefetch
> instruction - but using explicit prefetching might help further. Also
> allows us to increase branch costs, because we can amortize them over
> a few tuples.

Yes, I already observe that performance improves a little even with my
simple insert-queue patch if you comment out the pg_prefetch_mem()
call, and I figured that was an execution-ordering effect at work
(though I didn't study it up close with perf etc, due to lack of PMC
access on this host), but the prefetch apparently supplies most of the
speed-up I saw. It stands to reason that hash joins should benefit
from explicit prefetching (even though lots of pixels have been
expended explaining that explicit prefetching is often a mistake, cf
the Linux experience), since hash joins are basically cache-miss
machines par excellence, at least in the build phase with unique keys.
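In case it helps to visualise, the insert queue amounts to roughly the
following (a simplified, self-contained sketch, not the patch itself:
the InsertQueue type, the queue depth of 8 and the helper names are
made up here, and pg_prefetch_mem() is assumed to just wrap
__builtin_prefetch()):

#include <stddef.h>

/* Assumption: the patch's pg_prefetch_mem() wraps the GCC builtin. */
#define pg_prefetch_mem(addr) __builtin_prefetch(addr)

#define PREFETCH_QUEUE_DEPTH 8	/* tuning knob: cycles between prefetch and use */

typedef struct HashTuple
{
	struct HashTuple *next;
	/* ... minimal tuple data would follow ... */
} HashTuple;

typedef struct InsertQueueEntry
{
	HashTuple  *tuple;			/* tuple waiting to be inserted */
	size_t		bucketno;		/* its destination bucket */
} InsertQueueEntry;

typedef struct InsertQueue
{
	InsertQueueEntry entries[PREFETCH_QUEUE_DEPTH];
	size_t		head;			/* oldest entry, next to be inserted */
	size_t		count;			/* occupied slots */
} InsertQueue;

/* Pop the oldest entry and link its tuple into the bucket chain. */
static void
insert_queue_pop(InsertQueue *q, HashTuple **buckets)
{
	InsertQueueEntry *e = &q->entries[q->head];

	e->tuple->next = buckets[e->bucketno];
	buckets[e->bucketno] = e->tuple;
	q->head = (q->head + 1) % PREFETCH_QUEUE_DEPTH;
	q->count--;
}

/*
 * Queue a tuple for insertion, prefetching its bucket slot now so that
 * the cache line has (hopefully) arrived by the time the entry reaches
 * the head of the queue, PREFETCH_QUEUE_DEPTH insertions later.
 */
static void
insert_queue_push(InsertQueue *q, HashTuple **buckets,
				  HashTuple *tuple, size_t bucketno)
{
	size_t		tail;

	pg_prefetch_mem(&buckets[bucketno]);

	if (q->count == PREFETCH_QUEUE_DEPTH)
		insert_queue_pop(q, buckets);	/* oldest bucket should be cached */

	tail = (q->head + q->count) % PREFETCH_QUEUE_DEPTH;
	q->entries[tail].tuple = tuple;
	q->entries[tail].bucketno = bucketno;
	q->count++;
}

At the end of the build phase you'd drain whatever is left in the
queue with insert_queue_pop() until count reaches zero.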
> > create unlogged table t as select generate_series(1, 100000000)::int i;
> > select pg_prewarm('t');
> > set work_mem='8GB';
> >
> > select count(*) from t t1 join t t2 using (i);
> >
> >             master     patched/N=1      patched/N=4
> > workers=0   89.808s    80.556s (+11%)   76.864s (+16%)
> > workers=2   27.010s    22.679s (+19%)   23.503s (+15%)
> > workers=4   16.728s    14.896s (+12%)   14.167s (+18%)
> >
> > Just an early experiment, but I thought it looked promising enough
> > to share.
>
> Nice!

It's starting to look like prefetching of the build side + prefetching
of the probe side + reordering-friendly code + icache-friendly tight
loops could add up to some serious gains, but some architectural work
is needed for much of that, hence my lower aim :-)

Other things I noticed on that hacking escapade: the patch generates
PREFETCHT0 instructions on my compiler, but there are also "write" and
"locality" flags you can pass to __builtin_prefetch() to get
PREFETCHW, plus variants encoding predictions about how valuable the
data will be after the next access. write=1 slowed down my initial
tests for reasons I don't fully understand, but I didn't look further
once I realised you need -march=<broadwell or later> anyway. I didn't
look into the locality/eviction stuff.
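For anyone following along, those optional arguments look like this
(my summary of the GCC builtin, not anything from the patch;
prefetch_variants() is just an illustrative name):

void
prefetch_variants(const void *p)
{
	__builtin_prefetch(p, 0, 3);	/* read, high locality: PREFETCHT0 on x86 */
	__builtin_prefetch(p, 0, 0);	/* read, non-temporal: PREFETCHNTA */

	/*
	 * Write intent: PREFETCHW, but only if the target ISA supports it
	 * (e.g. -march=broadwell or -mprfchw with gcc); otherwise the
	 * compiler typically falls back to a plain read prefetch.
	 */
	__builtin_prefetch(p, 1, 3);
}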