
From Andres Freund
Subject Re: [HACKERS] Parallel Hash take II
Date
Msg-id 20170801011139.pgu3v5tk26t5xeon@alap3.anarazel.de
In response to [HACKERS] Parallel Hash take II  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: [HACKERS] Parallel Hash take II  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Parallel Hash take II  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: [HACKERS] Parallel Hash take II  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.

Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that it makes sense to size the number of parallel
workers solely based on the parallel scan node's size.  I don't think it's this patch's job to change that, but to me it
seriously amplifies the problem - I'd bet there are a lot of cases with nontrivial joins where the benefit from
parallelism at the join level is bigger than at the scan level itself.  And the number of rows in the upper nodes might
also be bigger than at the scan node level, making it more important to have a higher number of workers.
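
To illustrate the direction I mean, here's a toy sketch (invented name and heuristic, not the actual planner code, and
not a concrete proposal) where the join level's row estimate also feeds into the worker count instead of only the
driving scan's size:

/*
 * Toy sketch only -- invented name and heuristic.  The point is just that
 * the number of rows expected at the join level could drive the worker
 * count, rather than deriving it purely from the parallel scan underneath.
 */
static int
sketch_join_aware_worker_count(double scan_rows, double join_rows,
                               int max_workers)
{
    /* size off whichever level promises more work to parallelize */
    double  rows = (join_rows > scan_rows) ? join_rows : scan_rows;
    int     workers = 1;

    /* crude log-scale heuristic: one more worker per 10x above a threshold */
    while (rows > 10000.0 && workers < max_workers)
    {
        rows /= 10.0;
        workers++;
    }

    return workers;
}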
 

- If I understand the code in initial_cost_hashjoin() correctly, we count the synchronization overhead once,
independent of the number of workers.  But on the other hand we calculate the throughput by dividing by the number of
workers.  Do you think that's right?
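
To make the asymmetry concrete, here's a toy version of the two terms as I read them (invented names, not the patch's
actual costing code):

/*
 * Toy illustration only -- invented names, not initial_cost_hashjoin().
 * The build work is divided across the participants, i.e. throughput is
 * assumed to scale with the worker count, while the synchronization term
 * is charged once and stays the same no matter how many participants join.
 */
static double
sketch_parallel_hash_cost(double build_work, double sync_overhead,
                          int participants)
{
    return build_work / (double) participants   /* scales with workers... */
        + sync_overhead;                        /* ...but this term does not */
}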
 

- I haven't really grokked the deadlock issue you address.  Could you expand the comments on that?  Possibly somewhere
central, referenced by the various parts.
 

- maybe I'm overly paranoid, but it might not be bad to add some extra checks for ExecReScanHashJoin ensuring that it
doesn't get called when workers are still doing something.
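
Something along these lines is what I have in mind -- purely a sketch, the type and field names are made up since I
don't know what the shared state actually exposes:

#include <assert.h>

/* made-up stand-in for whatever the shared hash join state really exposes */
typedef struct SharedHashJoinStateSketch
{
    int         nparticipants_attached;
} SharedHashJoinStateSketch;

/*
 * Sketch only: at rescan time, assert that no other participant is still
 * attached to (i.e. potentially working on) the shared hash table.
 */
static void
sketch_check_no_active_workers(SharedHashJoinStateSketch *pstate)
{
    assert(pstate != NULL);

    /* nobody but the current process should still be attached on rescan */
    assert(pstate->nparticipants_attached <= 1);
}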
 

- seems like you're dereferencing tuple unnecessarily here - it's assigned but never used:

+    /*
+     * If we detached a chain of tuples, transfer them to the main hash table
+     * or batch storage.
+     */
+    if (regainable_space > 0)
+    {
+        HashJoinTuple tuple;
+
+        tuple = (HashJoinTuple)
+            dsa_get_address(hashtable->area, detached_chain_shared);
+        ExecHashTransferSkewTuples(hashtable, detached_chain,
+                                   detached_chain_shared);
+
+        /* Remove from the total space used. */
+        LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+        Assert(hashtable->shared->size >= regainable_space);
+        hashtable->shared->size -= regainable_space;
+        LWLockRelease(&hashtable->shared->chunk_lock);
+
+        /*
+         * If the bucket we removed is the same as the bucket the caller just
+         * overflowed, then we can forget about the overflowing part of the
+         * tuple.  It's been moved out of the skew hash table.  Otherwise, the
+         * caller will call again; eventually we'll either succeed in
+         * allocating space for the overflow or reach this case.
+         */
+        if (bucket_to_remove == bucketno)
+        {
+            hashtable->spaceUsedSkew = 0;
+            hashtable->spaceAllowedSkew = 0;
+        }
+    }
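
As far as I can see the result of that dsa_get_address() call isn't referenced afterwards, so unless it's needed for a
side effect the block could presumably shrink to something like this (same code as quoted above, unused local dropped,
rest elided):

    if (regainable_space > 0)
    {
        ExecHashTransferSkewTuples(hashtable, detached_chain,
                                   detached_chain_shared);

        /* ... rest unchanged ... */
    }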


- The names here could probably be improved some:
+        case WAIT_EVENT_HASH_SHRINKING1:
+            event_name = "Hash/Shrinking1";
+            break;
+        case WAIT_EVENT_HASH_SHRINKING2:
+            event_name = "Hash/Shrinking2";
+            break;
+        case WAIT_EVENT_HASH_SHRINKING3:
+            event_name = "Hash/Shrinking3";
+            break;
+        case WAIT_EVENT_HASH_SHRINKING4:
+            event_name = "Hash/Shrinking4";

- why are we restricting the rows_total bit to the parallel-aware case?

+    /*
+     * If parallel-aware, the executor will also need an estimate of the total
+     * number of rows expected from all participants so that it can size the
+     * shared hash table.
+     */
+    if (best_path->jpath.path.parallel_aware)
+    {
+        hash_plan->plan.parallel_aware = true;
+        hash_plan->rows_total = best_path->inner_rows_total;
+    }
+
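
i.e. is there a reason not to provide the estimate unconditionally?  A sketch of what I mean (same fields as in the
hunk above):

    /*
     * Sketch of the alternative: always pass the total row estimate through,
     * and only the parallel_aware flag depends on the path.  (Just
     * illustrating the question, not claiming the executor needs rows_total
     * in the non-parallel-aware case.)
     */
    hash_plan->plan.parallel_aware = best_path->jpath.path.parallel_aware;
    hash_plan->rows_total = best_path->inner_rows_total;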

- seems we need a few more tests - I don't think the existing tests are properly going to exercise the skew stuff,
multiple batches, etc.?  This is nontrivial code, I'd really like to see high test coverage of the new code.
 

- might not hurt to reindent before the final submission

- Unsurprisingly, please implement the FIXME ;)


Regards,

Andres


