Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Parallel bitmap heap scan |
Date | |
Msg-id | CA+TgmoY6bOc1YnhcAQnMfCBDbsJzROQ3sYxSAL-SYB5tMJcTKg@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Parallel bitmap heap scan (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: [HACKERS] Parallel bitmap heap scan
Re: [HACKERS] Parallel bitmap heap scan |
List | pgsql-hackers |
On Tue, Feb 14, 2017 at 12:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Fixed. Thanks, the external interface to this looks much cleaner now. Scrutinizing the internals: What is the point of having a TBMSharedIterator contain a TIDBitmap pointer? All the values in that TIDBitmap are populated from the TBMSharedInfo, but it seems to me that the fields that are copied over unchanged (like status and nchunks) could just be used directly from the TBMSharedInfo, and the fields that are computed (like relpages and relchunks) could be stored directly in the TBMSharedIterator. tbm_shared_iterate() is already separate code from tbm_iterate(), so it can be changed to refer to whichever data structure contains the data it needs. Why is TBMSharedInfo separate from TBMSharedIteratorState? It seems like you could merge those two into a single structure. I think we can simplify things here a bit by having shared iterators not support single-page mode. In the backend-private case, tbm_begin_iterate() really doesn't need to do anything with the pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've got to anyway copy the pagetable into shared memory. So it seems to me that it would be simpler just to transform it into a standard iteration array while we're at it, instead of copying it into entry1. In other words, I suggest removing both "entry1" and "status" from TBMSharedInfo and making tbm_prepare_shared_iterate smarter to compensate. I think "dsa_data" is poorly named; it should be something like "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo. I think you should should move the "base", "relpages", and "relchunks" into TBMSharedIterator and give them better names, like "ptbase", "ptpages" and "ptchunks". "base" isn't clear that we're talking about the pagetable's base as opposed to anything else, and "relpages" and "relchunks" are named based on the fact that the pointers are relative rather than named based on what data they point at, which doesn't seem like the right choice. I suggest putting the parallel functions next to each other in the file: tbm_begin_iterate(), tbm_prepare_shared_iterate(), tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(), tbm_end_shared_iterate(). + if (tbm->dsa == NULL) + return pfree(pointer); Don't try to return a value of type void. The correct spelling of this is { pfree(pointer); return; }. Formatted appropriately across 4 lines, of course. + /* + * If TBM is in iterating phase that means pagetable is already created + * and we have come here during tbm_free. By this time we are already + * detached from the DSA because the GatherNode would have been shutdown. + */ + if (tbm->iterating) + return; This seems like something of a kludge, although not a real easy one to fix. The problem here is that tidbitmap.c ideally shouldn't have to know that it's being used by the executor or that there's a Gather node involved, and certainly not the order of operations around shutdown. It should really be the problem of 0002 to handle this kind of problem, without 0001 having to know anything about it. It strikes me that it would probably be a good idea to have Gather clean up its children before it gets cleaned up itself. So in ExecShutdownNode, we could do something like this: diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 0dd95c6..5ccc2e8 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node) if (node == NULL) return false; + planstate_tree_walker(node, ExecShutdownNode, NULL); + switch (nodeTag(node)) { case T_GatherState: @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node) break; } - return planstate_tree_walker(node, ExecShutdownNode, NULL); + return false;} Also, in ExecEndGather, something like this: diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c index a1a3561..32c97d3 100644 --- a/src/backend/executor/nodeGather.c +++ b/src/backend/executor/nodeGather.c @@ -229,10 +229,10 @@ ExecGather(GatherState *node)voidExecEndGather(GatherState *node){ + ExecEndNode(outerPlanState(node)); /* let children clean up first */ ExecShutdownGather(node); ExecFreeExprContext(&node->ps); ExecClearTuple(node->ps.ps_ResultTupleSlot); - ExecEndNode(outerPlanState(node));} /* With those changes and an ExecShutdownBitmapHeapScan() called from ExecShutdownNode(), it should be possible (I think) for us to always have the bitmap heap scan node shut down before the Gather node shuts down, which I think would let you avoid having a special case for this inside the TBM code. + char *ptr; + dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer)); + tbm->dsa_data = dsaptr; + ptr = dsa_get_address(tbm->dsa, dsaptr); + memset(ptr, 0, size + sizeof(dsa_pointer)); + *((dsa_pointer *) ptr) = dsaptr; Hmm. Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0() function. This pattern seems likely to come up a lot, and I don't think we should require every caller to deal with it. I don't see why you think you need to add sizeof(dsa_pointer) to the allocation and store an extra copy of the dsa_pointer in the additional space. You are already storing it in tbm->dsa_data and that seems good enough. pagetable_free() needs the value, but it has a pointer to the TIDBitmap and could just pass tbm->dsa_data directly instead of fishing the pointer out of the extra space. Also, this kind of thing is in general not very wise because it tends to cause alignment faults on some picky machine. dsa_allocate() will return a MAXALIGN'd pointer but ptr + sizeof(dsa_pointer) will only be MAXALIGN'd if (sizeof(dsa_pointer) % MAXIMUM_ALIGNOF) == 0, which I suspect might not be true everywhere and which is a shaky assumption for the future even if it happens to be true today. For example, consider a 32-bit platform where doubles need to be 8-byte aligned. It seems shaky to me that tbm->iterating can be set when we've created either a shared or a backend-private iterator. For example, tbm_iterate() asserts that tbm->iterating is set as a way of checking that spages and schunks will be set. But that's not guaranteed any more with these changes, because we might've built the shared version of the iteration arrays rather than the backend-private version. Maybe you ought to change it from a bool to an enum: TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED. And then update all of the asserts and tests to check for the specific state they care about. + while (schunkbit < PAGES_PER_CHUNK) + { + int wordnum = WORDNUM(schunkbit); + int bitnum = BITNUM(schunkbit); + + if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0) + break; + schunkbit++; + } How about refactoring this chunk of code into a static inline function and having both tbm_iterate() and tbm_shared_iterate() call it? Probably something like static inline void tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit). + /* scan bitmap to extract individual offset numbers */ + ntuples = 0; + for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++) + { + bitmapword w = page->words[wordnum]; + + if (w != 0) + { + int off = wordnum * BITS_PER_BITMAPWORD + 1; + + while (w != 0) + { + if (w & 1) + output->offsets[ntuples++] = (OffsetNumber) off; + off++; + w >>= 1; + } + } + } Similarly, this looks like it could be refactored into a shared static inline function as well, instead of duplicating it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: