Re: Parallel bitmap heap scan - Mailing list pgsql-hackers
From | Amit Khandekar
Subject | Re: Parallel bitmap heap scan
Date |
Msg-id | CAJ3gD9dYD7RhKsyRBnta6=g_0+s7y6r1SpEgW_GsnSeDSJRnPQ@mail.gmail.com
In response to | Re: Parallel bitmap heap scan (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: Parallel bitmap heap scan
List | pgsql-hackers
On 19 October 2016 at 09:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund <andres@anarazel.de> wrote:
>> I don't quite understand why the bitmap has to be parallel at all. As
>> far as I understand your approach as described here, the only thing that
>> needs to be shared are the iteration arrays. Since they never need to
>> be resized and such, it seems to make a lot more sense to just add an
>> API to share those, instead of the whole underlying hash.
>
> You are right that we only share iteration arrays. But only point is
> that each entry of iteration array is just a pointer to hash entry.
> So either we need to build hash in shared memory (my current approach)
> or we need to copy each hash element at shared location (I think this
> is going to be expensive).

While the discussion about how to implement the shared tidbitmap is going
on, I have started reviewing the bitmap heap scan part, i.e.
nodeBitmapHeapscan.c, since that looks mostly independent of the tidbitmap
implementation.

> Brief design idea:
> -----------------------
> #1. Shared TIDBitmap creation and initialization
> First worker to see the state as parallel bitmap info as PBM_INITIAL
> become leader and set the state to PBM_INPROGRESS All other workers
> see the state as PBM_INPROGRESS will wait for leader to complete the
> TIDBitmap.
>
> #2 At this level TIDBitmap is ready and all workers are awake.

As far as correctness is concerned, the logic where the first worker
becomes the leader while the others wait synchronously looks good. The
workers are allocated right from the beginning even though they stay idle
for some percentage of the time (5-20%?), but I guess there is nothing we
can do about that with the current parallel query infrastructure.

In pbms_is_leader(), I didn't clearly understand the significance of the
for-loop. If it is a worker, it can call ConditionVariablePrepareToSleep()
followed by ConditionVariableSleep(). Once it comes out of
ConditionVariableSleep(), isn't it guaranteed that the leader has finished
building the bitmap? If so, it looks unnecessary to iterate again and go
back through the pbminfo->state checks; and with that, the variable
queuedSelf might not be needed either. But I might be missing something
here. I am also not sure what happens if a worker calls
ConditionVariable[Prepare]Sleep() after the leader has already called
ConditionVariableBroadcast() -- does the for-loop have something to do
with that? That can happen even with the current for-loop, it seems.
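
To make it clearer what I mean, the simplified structure I have in mind is
roughly the following. This is only a sketch, not the patch code: pbminfo,
PBM_INITIAL and PBM_INPROGRESS are from the patch, while the field names,
PBM_FINISHED and the wait event are placeholders I have made up.

static bool
pbms_is_leader(ParallelBitmapInfo *pbminfo)
{
    bool        leader = false;

    /* First worker to see PBM_INITIAL becomes the leader. */
    SpinLockAcquire(&pbminfo->state_mutex);
    if (pbminfo->state == PBM_INITIAL)
    {
        pbminfo->state = PBM_INPROGRESS;
        leader = true;
    }
    SpinLockRelease(&pbminfo->state_mutex);

    if (!leader)
    {
        /*
         * Everyone else just sleeps on the condition variable until the
         * leader broadcasts that the TIDBitmap is ready.  The state
         * re-check below is the usual condition-variable pattern, and it
         * also covers a broadcast that happened before we prepared to
         * sleep (assuming state only ever moves forward, so the unlocked
         * read is ok).  My question is whether the separate for-loop and
         * queuedSelf in the patch buy us anything beyond this.
         */
        ConditionVariablePrepareToSleep(&pbminfo->cv);
        while (pbminfo->state != PBM_FINISHED)
            ConditionVariableSleep(&pbminfo->cv,
                                   WAIT_EVENT_PARALLEL_BITMAP_SCAN);
        ConditionVariableCancelSleep();
    }

    return leader;
}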
> #3. Bitmap processing (Iterate and process the pages).
> In this phase each worker will iterate over page and chunk array and
> select heap pages one by one. If prefetch is enable then there will be
> two iterator. Since multiple worker are iterating over same page and
> chunk array we need to have a shared iterator, so we grab a spin lock
> and iterate within a lock, so that each worker get and different page
> to process.

tbm_iterate() call under SpinLock:

For the parallel tbm iteration, tbm_iterate() is called while a spinlock is
held. Generally we try to limit the code inside a spinlock to a few lines,
and without any function calls. Although the tbm_iterate() code itself
looks safe under a spinlock, I was checking whether we can squeeze
SpinLockAcquire() and SpinLockRelease() closer to each other. One thought
is: in tbm_iterate(), acquire the spinlock just before the while loop that
iterates over the lossy chunks. Then, if both chunk and per-page data
remain, release the spinlock just before returning (the first return
stmt). And just before scanning the bitmap of an exact page, i.e. just
after "if (iterator->spageptr < tbm->npages)", save the page handle,
increment iterator->spageptr, release the spinlock, and then use the saved
page handle to iterate over the page bitmap. (I have put a rough sketch of
this at the end of the mail.)

prefetch_pages() call under SpinLock:

Here again, prefetch_pages() is called while the pbminfo->prefetch_mutex
spinlock is held, so heavy functions like PrefetchBuffer() effectively get
called under the spinlock, and these can even ereport(). One option is to
use a mutex lock for this purpose, but I think that would slow things
down; moreover, the complete set of prefetch pages would then be scanned
by a single worker while the others wait for it. Instead, what I am
thinking is: grab the pbminfo->prefetch_mutex spinlock only while
incrementing pbminfo->prefetch_pages. The rest, i.e. iterating over the
prefetch pages and doing the PrefetchBuffer() calls, need not be
synchronised using pbminfo->prefetch_mutex; pbms_parallel_iterate() already
has its own iterator spinlock. The only thing is, workers may not do the
actual PrefetchBuffer() calls sequentially: one of them might shoot ahead
and prefetch 3-4 pages while another is lagging behind with a sequentially
smaller page number. But I believe this is fine, as long as between them
they prefetch all the required blocks. (A sketch of this suggestion is
also at the end of the mail.)
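
For the tbm_iterate() suggestion, what I am picturing is roughly the
following. It is only a structural sketch of where the lock calls would go:
"shared->mutex" stands for wherever the shared iterator's spinlock ends up
living, and the existing chunk/page scanning code is only indicated by
comments.

TBMIterateResult *
tbm_iterate(TBMIterator *iterator)
{
    TIDBitmap  *tbm = iterator->tbm;
    TBMIterateResult *output = &iterator->output;

    /* Take the shared iterator's spinlock here instead of in the caller. */
    SpinLockAcquire(&shared->mutex);

    /*
     * The existing while loop that advances schunkptr/schunkbit over the
     * lossy chunks runs here, still under the lock (it only adjusts
     * iterator fields, so it stays cheap).
     */

    if (iterator->schunkptr < tbm->nchunks)
    {
        PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
        BlockNumber chunk_blockno = chunk->blockno + iterator->schunkbit;

        if (iterator->spageptr >= tbm->npages ||
            chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
        {
            /* Lossy page: fill output exactly as the current code does. */
            output->blockno = chunk_blockno;
            output->ntuples = -1;
            output->recheck = true;
            iterator->schunkbit++;

            /* ... release just before the first return stmt ... */
            SpinLockRelease(&shared->mutex);
            return output;
        }
    }

    if (iterator->spageptr < tbm->npages)
    {
        /* Save the page handle and claim the page by advancing spageptr. */
        PagetableEntry *page = tbm->spages[iterator->spageptr];

        iterator->spageptr++;
        SpinLockRelease(&shared->mutex);

        /*
         * Now scan 'page' and fill output->offsets outside the lock, the
         * same way the current code does (ignoring the TBM_ONE_PAGE case
         * here for brevity).
         */
        return output;
    }

    SpinLockRelease(&shared->mutex);
    return NULL;                /* nothing more in the bitmap */
}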
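
And for the prefetching part, something along these lines. Again only a
sketch: prefetch_mutex and prefetch_pages are the patch's names, while the
function name, prefetch_target and the other fields are placeholders. The
point is just that only the counter update stays under prefetch_mutex,
while the iteration and PrefetchBuffer() run outside it, relying on
pbms_parallel_iterate()'s own spinlock for the iterator.

static void
pbms_do_prefetch(BitmapHeapScanState *node, ParallelBitmapInfo *pbminfo)
{
    for (;;)
    {
        TBMIterateResult *tbmpre;
        bool        claimed = false;

        /* Hold prefetch_mutex only while bumping the shared counter. */
        SpinLockAcquire(&pbminfo->prefetch_mutex);
        if (pbminfo->prefetch_pages < pbminfo->prefetch_target)
        {
            pbminfo->prefetch_pages++;
            claimed = true;
        }
        SpinLockRelease(&pbminfo->prefetch_mutex);

        if (!claimed)
            break;              /* we have prefetched far enough ahead */

        /*
         * The iteration itself is serialized by the iterator's own
         * spinlock, so there is no need to hold prefetch_mutex across it,
         * nor across PrefetchBuffer().
         */
        tbmpre = tbm_iterate(node->prefetch_iterator);
        if (tbmpre == NULL)
            break;              /* prefetch iterator is exhausted */

        PrefetchBuffer(node->ss.ss_currentRelation, MAIN_FORKNUM,
                       tbmpre->blockno);
    }
}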