Thread: merging HashJoin and Hash nodes
Hi, I've groused about this a few times, but to me it seems wrong that HashJoin and Hash are separate nodes. They're so tightly bound together that keeping them separate just doesn't architecturally makes sense, imo. So I wrote a prototype. Evidence of being tightly bound together: - functions in nodeHash.h that take a HashJoinState etc - how many functions in nodeHash.h and nodeHashjoin.h are purely exposed so the other side can call them - there's basically no meaningful separation of concerns between code in nodeHash.c and nodeHashjoin.c - the Hash node doesn't really exist during most of the planning, it's kind of faked up in create_hashjoin_plan(). - HashJoin knows that the inner node is always going to be a Hash node. - HashJoinState and HashState both have pointers to HashJoinTable, etc Besides violating some aesthetical concerns, I think it also causes practical issues: - code being in different translation code prevents the compiler from inlining etc. There's a lot of HOT calls going between both. For each new outer tuple we e.g. call, from nodeHashjoin.c separately into nodeHash.c for ExecHashGetHashValue(), ExecHashGetBucketAndBatch(), ExecHashGetSkewBucket(), ExecScanHashBucket(). They each have to do memory loads from HashJoinState/HashJoinTable, even though previous code *just* has done so. - a separate executor node, and all the ancillary data (slots, expression context, target lists etc) is far from free - instead of just applying an "empty outer" style optimization to both sides of the HashJoin, we have to choose. Once unified it's fairly easy to just use it on both. - generally, a lot of improvements are harder to develop because of the odd separation. Does anybody have good arguments for keeping them separate? The only real one I can see is that it's not a small change, and will make bugfixes etc a bit harder. Personally I think that's outweighed by the disadvantages. Attached is a quick prototype that unifies them. It's not actually that hard, I think? Obviously this is far from ready, but I thought it'd be a good basis to get a discussion started? Comments on the prototype: - I've hacked EXPLAIN to still show the Hash node, to reduce the size of the diffs. I'm doubtful that that's the right approach (and I'm sure it's not the right approach to do so with the code I injected) - I think the Hash node in the explain doesn't really help users, and just makes the explain bigger (except for making it clearer which side is hashed) - currently I applied a very ugly hack to distinguish the parallel shm_toc key for the data previously in hash from the data previously in HashJoin. Clearly that'd need to be done properly. - obviously we'd have to work a lot more on comments, function ordering, docs etc. if we wanted to actually apply this. FWIW, it's much easier to look at the patch if you use --color-moved --color-moved-ws=allow-indentation-change as parameters, as that will color code that's moved without any changes (except for indentation), differently from modified code. One thing I noticed is that create_hashjoin_plan() currently says: /* * Set Hash node's startup & total costs equal to total cost of input * plan; this only affects EXPLAIN display not decisions. */ copy_plan_costsize(&hash_plan->plan, inner_plan); hash_plan->plan.startup_cost = hash_plan->plan.total_cost; which I don't think is actually true? We use that for: else if (HJ_FILL_OUTER(node) || (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost && !node->hj_OuterNotEmpty)) Leaving the inaccurate (outdated?) comment aside, it's not clear to me why we should ignore the cost of hashing? It also seems like we ought actually charge the cost of hashing to the hash node, given that we actually apply some hashing cost (c.f. initial_cost_hashjoin). Greetings, Andres Freund
Attachment
On Tue, Oct 29, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote: > I've groused about this a few times, but to me it seems wrong that > HashJoin and Hash are separate nodes. They're so tightly bound together > that keeping them separate just doesn't architecturally makes sense, > imo. So I wrote a prototype. > > Evidence of being tightly bound together: > - functions in nodeHash.h that take a HashJoinState etc > - how many functions in nodeHash.h and nodeHashjoin.h are purely exposed > so the other side can call them > - there's basically no meaningful separation of concerns between code in > nodeHash.c and nodeHashjoin.c > - the Hash node doesn't really exist during most of the planning, it's > kind of faked up in create_hashjoin_plan(). > - HashJoin knows that the inner node is always going to be a Hash node. > - HashJoinState and HashState both have pointers to HashJoinTable, etc > > Besides violating some aesthetical concerns, I think it also causes > practical issues: > > - code being in different translation code prevents the compiler from > inlining etc. There's a lot of HOT calls going between both. For each > new outer tuple we e.g. call, from nodeHashjoin.c separately into > nodeHash.c for ExecHashGetHashValue(), ExecHashGetBucketAndBatch(), > ExecHashGetSkewBucket(), ExecScanHashBucket(). They each have to > do memory loads from HashJoinState/HashJoinTable, even though previous > code *just* has done so. > - a separate executor node, and all the ancillary data (slots, > expression context, target lists etc) is far from free > - instead of just applying an "empty outer" style optimization to both > sides of the HashJoin, we have to choose. Once unified it's fairly > easy to just use it on both. > - generally, a lot of improvements are harder to develop because of the > odd separation. I agree with all of that. > Does anybody have good arguments for keeping them separate? The only > real one I can see is that it's not a small change, and will make > bugfixes etc a bit harder. Personally I think that's outweighed by the > disadvantages. Yeah, the ~260KB of churn you came up with is probably the reason I didn't even think of suggesting something along these lines while working on PHJ, though it did occur to me that the division was entirely artificial as I carefully smashed more holes in both directions through that wall. Trying to think of a reason to keep Hash, I remembered Kohei KaiGai's speculation about Hash nodes that are shared by different Hash Join nodes (in the context of a partition-wise join where each partition is joined against one table). But even if we were to try to do that, a Hash node isn't necessary to share the hash table, so that's not an argument. > Attached is a quick prototype that unifies them. It's not actually that hard, > I think? Obviously this is far from ready, but I thought it'd be a good > basis to get a discussion started? I haven't looked at the patch yet but this yet but it sounds like a good start from the description. > Comments on the prototype: > > - I've hacked EXPLAIN to still show the Hash node, to reduce the size of > the diffs. I'm doubtful that that's the right approach (and I'm sure > it's not the right approach to do so with the code I injected) - I > think the Hash node in the explain doesn't really help users, and just > makes the explain bigger (except for making it clearer which side is > hashed) Yeah, I'm not sure why you'd want to show a Hash node to users if there is no way to use it in any other context than a Hash Join. FWIW, Oracle, DB2 and SQL Server don't show an intermediate Hash node in their plans, and generally you just have to know which way around the input relations are shown (from a quick a glance at some examples found on the web, Oracle and SQL Server show hash relation above probe relation, while DB2, PostgreSQL and MySQL show probe relation above hash relation). Curiously, MySQL 8 just added hash joins, and they do show a Hash node (at least in FORMAT=TREE mode, which looks a bit like our EXPLAIN). The fact that EXPLAIN doesn't label relations seems to be a separate concern that applies equally to nestloop joins, and could perhaps be addressed with some more optional verbosity, not a fake node?
On Tue, Oct 29, 2019 at 02:00:00PM +1300, Thomas Munro wrote: >On Tue, Oct 29, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote: >> I've groused about this a few times, but to me it seems wrong that >> HashJoin and Hash are separate nodes. They're so tightly bound together >> that keeping them separate just doesn't architecturally makes sense, >> imo. So I wrote a prototype. >> >> Evidence of being tightly bound together: >> - functions in nodeHash.h that take a HashJoinState etc >> - how many functions in nodeHash.h and nodeHashjoin.h are purely exposed >> so the other side can call them >> - there's basically no meaningful separation of concerns between code in >> nodeHash.c and nodeHashjoin.c >> - the Hash node doesn't really exist during most of the planning, it's >> kind of faked up in create_hashjoin_plan(). >> - HashJoin knows that the inner node is always going to be a Hash node. >> - HashJoinState and HashState both have pointers to HashJoinTable, etc >> >> Besides violating some aesthetical concerns, I think it also causes >> practical issues: >> >> - code being in different translation code prevents the compiler from >> inlining etc. There's a lot of HOT calls going between both. For each >> new outer tuple we e.g. call, from nodeHashjoin.c separately into >> nodeHash.c for ExecHashGetHashValue(), ExecHashGetBucketAndBatch(), >> ExecHashGetSkewBucket(), ExecScanHashBucket(). They each have to >> do memory loads from HashJoinState/HashJoinTable, even though previous >> code *just* has done so. I wonder how much we can gain by this. I don't expect any definitive figures from a patch at this stage, but maybe you have some guesses? >> - a separate executor node, and all the ancillary data (slots, >> expression context, target lists etc) is far from free >> - instead of just applying an "empty outer" style optimization to both >> sides of the HashJoin, we have to choose. Once unified it's fairly >> easy to just use it on both. >> - generally, a lot of improvements are harder to develop because of the >> odd separation. > >I agree with all of that. > >> Does anybody have good arguments for keeping them separate? The only >> real one I can see is that it's not a small change, and will make >> bugfixes etc a bit harder. Personally I think that's outweighed by the >> disadvantages. > >Yeah, the ~260KB of churn you came up with is probably the reason I >didn't even think of suggesting something along these lines while >working on PHJ, though it did occur to me that the division was >entirely artificial as I carefully smashed more holes in both >directions through that wall. > >Trying to think of a reason to keep Hash, I remembered Kohei KaiGai's >speculation about Hash nodes that are shared by different Hash Join >nodes (in the context of a partition-wise join where each partition is >joined against one table). But even if we were to try to do that, a >Hash node isn't necessary to share the hash table, so that's not an >argument. > >> Attached is a quick prototype that unifies them. It's not actually that hard, >> I think? Obviously this is far from ready, but I thought it'd be a good >> basis to get a discussion started? > >I haven't looked at the patch yet but this yet but it sounds like a >good start from the description. > >> Comments on the prototype: >> >> - I've hacked EXPLAIN to still show the Hash node, to reduce the size of >> the diffs. I'm doubtful that that's the right approach (and I'm sure >> it's not the right approach to do so with the code I injected) - I >> think the Hash node in the explain doesn't really help users, and just >> makes the explain bigger (except for making it clearer which side is >> hashed) > >Yeah, I'm not sure why you'd want to show a Hash node to users if >there is no way to use it in any other context than a Hash Join. > >FWIW, Oracle, DB2 and SQL Server don't show an intermediate Hash node >in their plans, and generally you just have to know which way around >the input relations are shown (from a quick a glance at some examples >found on the web, Oracle and SQL Server show hash relation above probe >relation, while DB2, PostgreSQL and MySQL show probe relation above >hash relation). Curiously, MySQL 8 just added hash joins, and they do >show a Hash node (at least in FORMAT=TREE mode, which looks a bit like >our EXPLAIN). > Not sure. Maybe we don't need to show an explicit Hash node, because that might seem to imply there's a separate executor step. And that would be misleading. But IMO we should make it obvious which side of the join is hashed, instead of relyin on users to "know which way around the relations are shown". The explain is often used by users who're learning stuff, or maybe investigating it for the first time, and we should not make it unnecessarily hard to understand. I don't think we have any other "explain node" that would not represent an actual executor node, so not sure what's the right approach. So maybe that's not the right way to do that ... OTOH we have tools visualizing execution plans, so maybe backwards compatibility of the output is a concern too (I know we don't promise anything, though). >The fact that EXPLAIN doesn't label relations seems to be a separate >concern that applies equally to nestloop joins, and could perhaps be >addressed with some more optional verbosity, not a fake node? > Yeah, that seems separate. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-10-31 23:59:19 +0100, Tomas Vondra wrote: > On Tue, Oct 29, 2019 at 02:00:00PM +1300, Thomas Munro wrote: > > On Tue, Oct 29, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote: > > > I've groused about this a few times, but to me it seems wrong that > > > HashJoin and Hash are separate nodes. They're so tightly bound together > > > that keeping them separate just doesn't architecturally makes sense, > > > imo. So I wrote a prototype. > > > > > > Evidence of being tightly bound together: > > > - functions in nodeHash.h that take a HashJoinState etc > > > - how many functions in nodeHash.h and nodeHashjoin.h are purely exposed > > > so the other side can call them > > > - there's basically no meaningful separation of concerns between code in > > > nodeHash.c and nodeHashjoin.c > > > - the Hash node doesn't really exist during most of the planning, it's > > > kind of faked up in create_hashjoin_plan(). > > > - HashJoin knows that the inner node is always going to be a Hash node. > > > - HashJoinState and HashState both have pointers to HashJoinTable, etc > > > > > > Besides violating some aesthetical concerns, I think it also causes > > > practical issues: > > > > > > - code being in different translation code prevents the compiler from > > > inlining etc. There's a lot of HOT calls going between both. For each > > > new outer tuple we e.g. call, from nodeHashjoin.c separately into > > > nodeHash.c for ExecHashGetHashValue(), ExecHashGetBucketAndBatch(), > > > ExecHashGetSkewBucket(), ExecScanHashBucket(). They each have to > > > do memory loads from HashJoinState/HashJoinTable, even though previous > > > code *just* has done so. > > I wonder how much we can gain by this. I don't expect any definitive > figures from a patch at this stage, but maybe you have some guesses? It's measureable, but not a world-changing difference. Some of the gains are limited by the compiler not realizing that it does not have to reload values across some external function calls. I saw somewhere around ~3% for a case that was bottlenecked by HJ lookups (not build). I think part of the effect size is also limited by other unnecessary inefficiencies being a larger bottleneck. E.g. 1) the fact that ExecScanHashBucket() contains branches that have roughly 50% likelihoods, making them unpredictable ( 1. on a successfull lookup we oscillate between the first hashTuple != NULL test succeeding and failing except in case of bucket conflict; 2. the while (hashTuple != NULL) oscillates similarly, because it tests for I. initial lookup succeeding, II. further tuple in previous bucket lookup III. further tuples in case of hashvalue mismatch. Quite observable by profiling for br_misp_retired.conditional. 2) The fact that there's *two* indirections for a successfull lookup that are very likely to be cache misses. First we need to look up the relevant bucket, second we need to actually fetch hashvalue from the pointer stored in the bucket. But even independent of these larger inefficiencies, I suspect we could also gain more from inlining by changing nodeHashjoin a bit. E.g. moving the HJ_SCAN_BUCKET code into an always_inline function, and also referencing it from the tail end of the HJ_NEED_NEW_OUTER code, instead of falling through, would allow to optimize away a number of loads (and I think also stores), and improve branch predictor efficiency. E.g. optimizing away store/load combinations for node->hj_CurHashValue, node->hj_CurBucketNo, node->hj_CurSkewBucketNo; loads of hj_HashTable, ...; stores of node->hj_JoinState, node->hj_MatchedOuter. And probably make the code easier to read, to boot. > But IMO we should make it obvious which side of the join is hashed, > instead of relyin on users to "know which way around the relations are > shown". The explain is often used by users who're learning stuff, or > maybe investigating it for the first time, and we should not make it > unnecessarily hard to understand. I agree. I wonder if just outputting something like 'Hashed Side: second' (or "right", or ...) could work. Not perfect, but I don't really have a better idea. We somewhat rely on users understanding inner/outer for explain output (I doubt this is good, to be clear), e.g. "Inner Unique: true ". Made worse by the fact that "inner"/"outer" is also used to describe different kinds of joins, with a mostly independent meaning. > OTOH we have tools visualizing execution plans, so maybe backwards > compatibility of the output is a concern too (I know we don't promise > anything, though). Well, execution *would* work a bit differently, so I don't feel too bad about tools having to adapt to that. E.g. graphical explain tool really shouldn't display a separate Hash nodes anymore. > > The fact that EXPLAIN doesn't label relations seems to be a separate > > concern that applies equally to nestloop joins, and could perhaps be > > addressed with some more optional verbosity, not a fake node? > Yeah, that seems separate. I'm not sure that's true. If we were labelling sub-nodes with 'inner' and 'outer' or such, we could just make that 'hashed inner' or such. But changing this seems to be a large compat break... Greetings, Andres Freund