Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Parallel bitmap heap scan
Date
Msg-id CAFiTN-uqMdXQdKDkD94WVYWoPnRCmA+sy_23--iq2ZPemniz6Q@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer?  All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode.  In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory.  So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
> think you should should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks".  "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> +    if (tbm->dsa == NULL)
> +        return pfree(pointer);
>
> Don't try to return a value of type void.  The correct spelling of
> this is { pfree(pointer); return; }.  Formatted appropriately across 4
> lines, of course.

Fixed

>
> +    /*
> +     * If TBM is in iterating phase that means pagetable is already created
> +     * and we have come here during tbm_free. By this time we are already
> +     * detached from the DSA because the GatherNode would have been shutdown.
> +     */
> +    if (tbm->iterating)
> +        return;
>
> This seems like something of a kludge, although not a real easy one to
> fix.  The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown.  It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it.  It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself.  So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
>      if (node == NULL)
>          return false;
>
> +    planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
>      switch (nodeTag(node))
>      {
>          case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
>              break;
>      }
>
> -    return planstate_tree_walker(node, ExecShutdownNode, NULL);
> +    return false;
>  }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c
> b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
>  void
>  ExecEndGather(GatherState *node)
>  {
> +    ExecEndNode(outerPlanState(node));      /* let children clean up first */
>      ExecShutdownGather(node);
>      ExecFreeExprContext(&node->ps);
>      ExecClearTuple(node->ps.ps_ResultTupleSlot);
> -    ExecEndNode(outerPlanState(node));
>  }
>
>  /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done
(gather_shutdown_children_first.patch does what you have mentioned
above and 0002 implement the BitmapHeapScanShutdown function).

>
> +    char       *ptr;
> +    dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
> +    tbm->dsa_data = dsaptr;
> +    ptr = dsa_get_address(tbm->dsa, dsaptr);
> +    memset(ptr, 0, size + sizeof(dsa_pointer));
> +    *((dsa_pointer *) ptr) = dsaptr;
>
> Hmm.  Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
> function.  This pattern seems likely to come up a lot, and I don't
> think we should require every caller to deal with it.
Done

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space.   You are already storing it in tbm->dsa_data and

Done with the help of two pointers as discussed in another mail [1].

[1] https://www.postgresql.org/message-id/CA%2BTgmoaWsCJ5L2du9i1RQaegaNLgOYF2KgRFu%2B7sUAeQc_xBFg%40mail.gmail.com

>
> It seems shaky to me that tbm->iterating can be set when we've created
> either a shared or a backend-private iterator.  For example,
> tbm_iterate() asserts that tbm->iterating is set as a way of checking
> that spages and schunks will be set.  But that's not guaranteed any
> more with these changes, because we might've built the shared version
> of the iteration arrays rather than the backend-private version.
> Maybe you ought to change it from a bool to an enum:
> TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED.  And
> then update all of the asserts and tests to check for the specific
> state they care about.

Done
>
> +        while (schunkbit < PAGES_PER_CHUNK)
> +        {
> +            int            wordnum = WORDNUM(schunkbit);
> +            int            bitnum = BITNUM(schunkbit);
> +
> +            if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
> +                break;
> +            schunkbit++;
> +        }
>
> How about refactoring this chunk of code into a static inline function
> and having both tbm_iterate() and tbm_shared_iterate() call it?
> Probably something like static inline void
> tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).

Good idea, done this way
>
> +        /* scan bitmap to extract individual offset numbers */
> +        ntuples = 0;
> +        for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
> +        {
> +            bitmapword    w = page->words[wordnum];
> +
> +            if (w != 0)
> +            {
> +                int            off = wordnum * BITS_PER_BITMAPWORD + 1;
> +
> +                while (w != 0)
> +                {
> +                    if (w & 1)
> +                        output->offsets[ntuples++] = (OffsetNumber) off;
> +                    off++;
> +                    w >>= 1;
> +                }
> +            }
> +        }
>
> Similarly, this looks like it could be refactored into a shared static
> inline function as well, instead of duplicating it.

Done.

Attached patches:

interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
which is used by 0001
gather_shutdown_childeren_first.patch : Processing the child node
before shuting down the gather, 0002 will use that part to shutdown
BitmapNode before gather.

0001: tidbimap change to provide the interfaces for shared iterator.
0002: actual parallel bitmap heap scan by using interfaces of 0001.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] duplicate "median" entry in doc
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Parallel Index-only scan