Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Parallel bitmap heap scan
Date
Msg-id CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbnb8eb10c-6AywJDxbWrA@mail.gmail.com
In response to Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed
>
> +    dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
> +    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    dsa_pointer ptpages;        /* dsa_pointer to the page array */
> +    dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way
>
> +typedef struct TBMSharedIteratorState
> +{
> +    int            spageptr;        /* next spages index */
> +    int            schunkptr;        /* next schunks index */
> +    int            schunkbit;        /* next bit to check in current schunk */
> +    LWLock        lock;            /* lock to protect above members */
> +    dsa_pointer pagetable;        /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;            /* dsa pointer to page array */
> +    dsa_pointer schunks;        /* dsa pointer to chunk array */
> +    int            nentries;        /* number of entries in pagetable */
> +    int            maxentries;        /* limit on same to meet maxbytes */
> +    int            npages;            /* number of exact entries in
> pagetable */
> +    int            nchunks;        /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> chunkptr, schunkbit.

Done
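For reference, a sketch of the struct in that order (member comments as in
the patch):

typedef struct TBMSharedIteratorState
{
    int         nentries;       /* number of entries in pagetable */
    int         maxentries;     /* limit on same to meet maxbytes */
    int         npages;         /* number of exact entries in pagetable */
    int         nchunks;        /* number of lossy entries in pagetable */
    dsa_pointer pagetable;      /* dsa pointer to head of pagetable data */
    dsa_pointer spages;         /* dsa pointer to page array */
    dsa_pointer schunks;        /* dsa pointer to chunk array */
    LWLock      lock;           /* lock to protect the members below */
    int         spageptr;       /* next spages index */
    int         schunkptr;      /* next schunks index */
    int         schunkbit;      /* next bit to check in current schunk */
} TBMSharedIteratorState;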
>
> +struct TBMSharedIterator
> +{
> +    PagetableEntry *ptbase;        /* pointer to the pagetable element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    int           *ptpages;        /* sorted exact page index list */
> +    int           *ptchunks;        /* sorted lossy page index list */
> +    TBMSharedIteratorState *state;        /* shared members */
> +    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state?  It doesn't seem like we even use it for anything.  It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

>
> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.
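So the backend-private iterator now looks roughly like this:

struct TBMSharedIterator
{
    TBMSharedIteratorState *state;  /* shared state (lives in DSA) */
    PagetableEntry *ptbase;         /* pointer to the pagetable element array */
    int        *ptpages;            /* sorted exact page index list */
    int        *ptchunks;           /* sorted lossy page index list */
    TBMIterateResult output;        /* MUST BE LAST (because variable-size) */
};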
>
> + * total memory consumption.  If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version.  Also, you need the word
> "the" before the first one.

Fixed all such instances.

>
> +    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.
Fixed
>
> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator.  I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.
>
> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
>
> + * This will convert the pagetable hash into page and chunk array of the index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that".  It's
> basically redundant with what you already said.

Done
>
> +    dsa_pointer iterator;
> +    TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former.  They're both the same object.
> Maybe just "dp" and "istate".

Done
>
> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

Done
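In tbm_prepare_shared_iterate that is roughly:

    Assert(tbm->dsa != NULL);
    Assert(tbm->iterating != TBM_ITERATING_PRIVATE);

plus the corresponding iterating-state Assert in tbm_begin_iterate.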
>
> +    /*
> +     * If we have a hashtable, create and fill the sorted page lists, unless
> +     * we already did that for a previous iterator.  Note that the lists are
> +     * attached to the bitmap not the iterator, so they can be used by more
> +     * than one iterator.  However, we keep dsa_pointer to these in the shared
> +     * iterator so that other workers can access them directly.
> +     */
>
> This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
> really correct here now because (1) we're no longer trying to fake up
> a TIDBitmap proper in every backend and (2) this code runs even if we
> don't have a hashtable.  I think the comment should be something like
> "If we're not already iterating, create and fill the sorted page
> lists.  (If we are, the sorted page lists are already stored in the
> TIDBitmap, and we can just reuse them.)"

Done
>
> +         * Create page list and chunk list using relptr so that we can share
> +         * this information across multiple workers.
>
> No relptrs any more.

Done
>
> +            tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));
>
> Extra parentheses.
>
> +                                         tbm->nchunks * (sizeof(int)));
>
> Extra parentheses.
Fixed
>
> +         * If TBM status is TBM_HASH then iterate over the pagetable and
>
> "If the TBM status is"...
>
> +             * directly store it's index i.e. 0 in page array
>
> s/it's/its/
Done
>
> Don't you also need some code here to handle the TBM_EMPTY case?,

IMHO, we don't need to do anything for TBM_EMPTY: npages and nchunks will
both be zero, so the iterator handles that case on its own (the same way
it's handled in the non-parallel case).
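To illustrate (a simplified sketch, not the exact patch code), the
exhaustion test in the shared iterate function is along these lines:

    /*
     * With TBM_EMPTY, npages == nchunks == 0, so the very first call takes
     * this path, exactly as the private iterator does.
     */
    if (istate->spageptr >= istate->npages &&
        istate->schunkptr >= istate->nchunks)
    {
        /* nothing (more) in the bitmap */
        LWLockRelease(&istate->lock);
        return NULL;
    }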

>
> +    /*
> +     * Store the TBM member in the shared state so that we can share them
> +     * across multiple workers.
> +     */
> +    iterator_state->maxentries = tbm->maxentries;
> +    iterator_state->nchunks = tbm->nchunks;
> +    iterator_state->nentries = tbm->nentries;
> +    iterator_state->npages = tbm->npages;
> +    iterator_state->pagetable = tbm->dsapagetable;
> +    iterator_state->spages = tbm->ptpages;
> +    iterator_state->schunks = tbm->ptchunks;
> +
> +    /* Initialize the shared iterator state */
> +    iterator_state->schunkbit = 0;
> +    iterator_state->schunkptr = 0;
> +    iterator_state->spageptr = 0;
> +
> +    /* Initialize the iterator lock */
> +    LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);
>
> Set all of the structure members in the same order that you declare them.

Done.
>
> + * tbm_extract_page_tuple - extract the tuple offsets from the page
>
> s/the page/a page/
>
> + * Process the page bits to extract the tuple offsets and store them into
> + * TBMIterateResult.
>
> This duplicates the preceding, a bit.  Maybe just "The extracted
> offsets are stored into the TBMIterateResult".

Done
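For context, the helper is essentially the offset-extraction loop lifted
out of tbm_iterate (a sketch, modulo naming):

static inline int
tbm_extract_page_tuple(PagetableEntry *page, TBMIterateResult *output)
{
    int         wordnum;
    int         ntuples = 0;

    for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
    {
        bitmapword  w = page->words[wordnum];

        if (w != 0)
        {
            int         off = wordnum * BITS_PER_BITMAPWORD + 1;

            while (w != 0)
            {
                if (w & 1)
                    output->offsets[ntuples++] = (OffsetNumber) off;
                off++;
                w >>= 1;
            }
        }
    }

    return ntuples;
}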
>
> +/*
> + *    tbm_advance_schunkbit
> + *
> + *    Advance the chunkbit to get the next page entry from the chunk
> + */
>
> The formatting of this header comment is randomly different from the
> preceding and following header comments.

Different functions in this file seem to use different formatting, but
I've made it match the nearby functions.
>
> I would change the argument to schunkbitp, declare a local variable
> named schunkbit, and do int schunkbit = *schunkbitp at the top and
> *schunkbitp = schunkbit at the bottom.
>

Fixed
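With that it ends up along these lines (a sketch):

static inline void
tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
{
    int         schunkbit = *schunkbitp;

    while (schunkbit < PAGES_PER_CHUNK)
    {
        int         wordnum = WORDNUM(schunkbit);
        int         bitnum = BITNUM(schunkbit);

        if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
            break;
        schunkbit++;
    }

    *schunkbitp = schunkbit;
}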

> + *    As above, but this will iterate using shared iterator which is shared
> + *    across multiple workers.  We need to acquire the iterator LWLock, before
> + *    accessing the shared members.
>
> "using shared iterator which is shared" -> "using an iterator which is shared"
> "multiple workers" -> "multiple processes"

Done
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
>
> Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
>
> Similarly.

Done
>
> + * tbm_end_shared_iterate - finish an iteration over a TIDBitmap
>
> Maybe s/an iteration/a shared iteration/
>
> + * As above, but it frees the memory of TBMSharedIterator.
>
> Instead of this, I'd write "This doesn't free any of the shared state
> associated with the iterator, just our backend-private state."

Done
>
> +    PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
> +    PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);
>
> Again, try to use array dereference notation rather than pointer
> arithmetic where you can.  Maybe:
>
> PageTableEntry *base = (PageTableEntry *) arg;
> PageTableEntry *lpage = &arg[*(int *) left];
> etc.
Done
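FWIW, the comparator now looks roughly like this (note that it indexes off
the cast "base" pointer; the name tbm_shared_comparator is just what I'm
calling it here):

static int
tbm_shared_comparator(const void *left, const void *right, void *arg)
{
    PagetableEntry *base = (PagetableEntry *) arg;
    PagetableEntry *lpage = &base[*(int *) left];
    PagetableEntry *rpage = &base[*(int *) right];

    if (lpage->blockno < rpage->blockno)
        return -1;
    else if (lpage->blockno > rpage->blockno)
        return 1;
    return 0;
}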

>
> +tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
> +{
> +    TBMSharedIterator *shared_iterator;
> +    TBMSharedIteratorState *shared_state;
>
> Again, variable naming issues.  Let's go with dsa_pointer dp,
> TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
> try to go through and always use the same variable name for the same
> kind of object.  It's confusing when "iterator" sometimes means a
> TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
> sometimes a dsa_pointer.  Make it all consistent (and hopefully
> logical).
Done
>
> +    shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
> +                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
>
> Wrong sizeof.  (This isn't just a style issue - it's a bug!)

Right, it's a bug. Fixed.
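With the naming and sizeof fixes, tbm_attach_shared_iterate is roughly
(a sketch, not the exact patch code):

TBMSharedIterator *
tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer dp)
{
    TBMSharedIterator *iterator;
    TBMSharedIteratorState *istate;

    /* note sizeof(TBMSharedIterator) here, not sizeof(TBMIterator) */
    iterator = (TBMSharedIterator *) palloc0(sizeof(TBMSharedIterator) +
                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));

    istate = (TBMSharedIteratorState *) dsa_get_address(dsa, dp);

    iterator->state = istate;
    if (DsaPointerIsValid(istate->pagetable))
        iterator->ptbase = dsa_get_address(dsa, istate->pagetable);
    if (DsaPointerIsValid(istate->spages))
        iterator->ptpages = dsa_get_address(dsa, istate->spages);
    if (DsaPointerIsValid(istate->schunks))
        iterator->ptchunks = dsa_get_address(dsa, istate->schunks);

    return iterator;
}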
>
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
>
> Inconsistent capitalization of "DSA" again.
>
> Maybe change the whole comment to "Allocate memory for hashtable
> elements, using DSA if available."

Fixed
>
> +    if (tbm->dsapagetable)
>
> Test against InvalidDsaPointer.  Or maybe just get rid of the "if" and
> do it unconditionally; if dsapagetable is InvalidDsaPointer then
> dsapagetableold will be too, so no harm done.

Right
>
> +    LWTRANCHE_PARALLEL_TBM_ITERATOR,
>
> Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
> lwlock.c, and patch the docs with the new tranche name.

Fixed
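i.e. registered alongside the other built-in tranches in lwlock.c, along
the lines of (the display name "tbm" is my choice):

    LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");

and the new tranche name added to the wait-event table in the docs.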
>
> Despite this fairly extensive list of mostly-trivial cosmetic
> problems, I think this is in fairly good shape now.  I think the basic
> design is right, so once we get these loose ends tied up this should
> be good to go in.
>
>> 0002: actual parallel bitmap heap scan by using interfaces of 0001.
>
> This needs a lot more extensive review than I quite have the time and
> energy for right now, but here are a few comments.
>
> +         <entry><literal>ParallelBitmapScan</></entry>
>
> ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
> you're waiting for the bitmap to be populated.
Done
>
> +         <entry>Waiting for leader backend to complete the TidBitmap.</entry>
>
> Likewise, complete -> populate?
Done
>
> +    /* If any limit was set to zero, the user doesn't want a parallel scan. */
>
> If we got zero?  There aren't multiple limits mentioned anywhere
> nearby, so this reads oddly.

Removed
>
> +        /*
> +         * It may be possible to amortize some of the I/O cost, but probably
> +         * not very much, because most operating systems already do aggressive
> +         * prefetching.  For now, we assume that the disk run cost can't be
> +         * amortized at all.
> +         */
>
> This looks like it was cut-and-pasted from the sequential scan case,
> but it doesn't really apply here.  This presupposes we're relying on
> OS prefetching, but in the case of a bitmap heap scan we do our own
> prefetching.  Maybe change to "Since we do prefetching in both the
> parallel and non-parallel cases, we don't expect to get substantially
> more I/O parallelism by spreading the scan across multiple backends.
> So, assume the disk run cost can't be amortized at all."

Removed this comment.


Apart from these, there are some more changes.
In 0001:
- Improved the comments for TBMSharedIteratorState and TBMSharedIterator.
- Added error handling for the pointer returned by dsa_allocate.
- Moved function declarations in tidbitmap.h (functionally related ones
are kept together).

In 0002:
- Improved comments.
- Refactored code in BitmapHeapNext.
- Removed the local tbm creation in BitmapHeapNext; with the new tidbitmap
interface it is no longer needed.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


