Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Parallel bitmap heap scan
Date
Msg-id CAFiTN-tGOBVVLRiBs7aS3hyUh=XJKdkzbWs879nXCgUL0JQNdg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Parallel bitmap heap scan  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Here are the latest version of the patch, which address all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away.  The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed the SH_COPY and now leader performs the
tbm_begin_shared_iterate before waking up the worker. Basically, the
leader will create the page and chunk array and that is the array of
the "relptr" (offlist, suggested by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.  And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c.
Fixed

 pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this, now there is new function called tbm_shared_iterate
which will directly iterate using shared iterator. So now no need to
copy member back and forth between local and shared iterator.

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Next
From: Andrew Borodin
Date:
Subject: Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree