Re: Parallel Full Hash Join - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: Parallel Full Hash Join
Date
Msg-id CALNJ-vS5GELo_=kJ7wNjKVPe5qWLXGfLABTC8hhCW1Y3Ls9BJw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Full Hash Join  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers

On Tue, Apr 6, 2021 at 11:59 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Apr 2, 2021 at 3:06 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v6-0003-Parallel-Hash-Full-Right-Outer-Join.patch
>
> +    * current_chunk_idx: index in current HashMemoryChunk
>
> The above comment seems to be better fit for ExecScanHashTableForUnmatched(), instead of ExecParallelPrepHashTableForUnmatched.
> I wonder where current_chunk_idx should belong (considering the above comment and what the code does).
>
> +       while (hashtable->current_chunk_idx < hashtable->current_chunk->used)
> ...
> +       next = hashtable->current_chunk->next.unshared;
> +       hashtable->current_chunk = next;
> +       hashtable->current_chunk_idx = 0;
>
> Each time we advance to the next chunk, current_chunk_idx is reset. It seems current_chunk_idx can be placed inside chunk.
> Maybe the consideration is that, with the current formation we save space by putting current_chunk_idx field at a higher level.
> If that is the case, a comment should be added.
>

Thank you for the review. I think that moving the current_chunk_idx into
the HashMemoryChunk would probably take up too much space.

Other places that we loop through the tuples in the chunk, we are able
to just keep a local idx, like here in
ExecParallelHashIncreaseNumBuckets():

case PHJ_GROW_BUCKETS_REINSERTING:
...
        while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_s)))
        {
            size_t        idx = 0;

            while (idx < chunk->used)

but, since we cannot do that while also emitting tuples, I thought,
let's just stash the index in the hashtable for use in serial hash join
and the batch accessor for parallel hash join. A comment to this effect
sounds good to me.

From the way HashJoinTable is used, I don't have better idea w.r.t. the location of current_chunk_idx.
It is not worth introducing another level of mapping between HashJoinTable and the chunk index.

So the current formation is fine with additional comment in ParallelHashJoinBatchAccessor (current comment doesn't explicitly mention current_chunk_idx).

Cheers

pgsql-hackers by date:

Previous
From: Dan Lynch
Date:
Subject: Re: policies with security definer option for allowing inline optimization
Next
From: Alvaro Herrera
Date:
Subject: Re: Autovacuum on partitioned table (autoanalyze)