Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers
From: Amit Khandekar
Subject: Re: [HACKERS] Parallel bitmap heap scan
Msg-id: CAJ3gD9dSzGr_f6x8fWuSbvPvNHCKbRZRi+ZJaSSYYg3ub3uYng@mail.gmail.com
In response to: Re: Parallel bitmap heap scan (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: [HACKERS] Parallel bitmap heap scan
List: pgsql-hackers
Sorry for the delay in my next response. I still haven't exhaustively
gone through all the changes, but meanwhile, below are some more points.

On 26 November 2016 at 18:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> Thanks for the review..
>
> I have worked on these comments..
>>
>>> In pbms_is_leader(), I didn't clearly understand the significance of
>>> the for-loop. If it is a worker, it can call
>>> ConditionVariablePrepareToSleep() followed by
>>> ConditionVariableSleep(). Once it comes out of
>>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>>> finished the bitmap? If yes, then it looks like it is not necessary
>>> to again iterate and go back through the pbminfo->state checking.
>>> Also, with this, the variable queuedSelf might not be needed either. But I
>>> might be missing something here.
>>
>> I have taken this logic from the example posted on the condition
>> variable thread [1]:
>>
>> for (;;)
>> {
>>     ConditionVariablePrepareToSleep(cv);
>>     if (condition for which we are waiting is satisfied)
>>         break;
>>     ConditionVariableSleep();
>> }
>> ConditionVariableCancelSleep();
>>
>> [1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

With the latest patches, this looks fine to me.

>>> tbm_iterate() call under SpinLock :
>>> For parallel tbm iteration, tbm_iterate() is called while a SpinLock is
>>> held. Generally we try to keep the code inside a spinlock limited to a
>>> few lines, and that too without any function calls.
>>> Although the tbm_iterate() code itself looks safe under a spinlock, I was
>>> checking whether we can squeeze SpinLockAcquire() and SpinLockRelease()
>>> closer to each other. One thought is: in tbm_iterate(), acquire the
>>> SpinLock before the while loop that iterates over lossy chunks.
>>> Then, if both chunk and per-page data remain, release the spinlock just
>>> before returning (the first return stmt). And then, just before scanning
>>> the bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>>> tbm->npages)", save the page handle, increment iterator->spageptr,
>>> release the SpinLock, and then use the saved page handle to iterate over
>>> the page bitmap.
>>
>> The main reason for keeping the spinlock out of this function is to avoid
>> changes inside it; also, this function takes a local iterator as input,
>> which doesn't have a spinlock reference in it. But that can be changed;
>> we can pass the shared iterator to it.
>>
>> I will think about this logic and try to update it in the next version.
>
> This issue is still not addressed.
> The logic inside tbm_iterate() uses the same variable, e.g. spageptr, in
> multiple places. IMHO this complete logic needs to be done under one
> spinlock.

I think we both agree on the part that the mutex handle can be passed to
tbm_iterate() to do this. I am yet to give more thought to how clumsy it
will be if we add SpinLockRelease() calls in intermediate places in the
function rather than at the end.

>>> prefetch_pages() call under SpinLock :
>>> Here again, prefetch_pages() is called while the pbminfo->prefetch_mutex
>>> spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>>> would get called while under the spinlock. These can even ereport().
>>> One option is to use a mutex lock for this purpose. But I think that
>>> would slow things down. Moreover, the complete set of prefetch pages
>>> would be scanned by a single worker, while the others might wait for
>>> this one. Instead, what I am thinking is: grab the
>>> pbminfo->prefetch_mutex spinlock only while incrementing
>>> pbminfo->prefetch_pages. The rest, viz. iterating over the prefetch
>>> pages and doing the PrefetchBuffer() calls, need not be synchronized
>>> using this pbminfo->prefetch_mutex spinlock.
>>> pbms_parallel_iterate() already has
>>> its own iterator spinlock. The only thing is, workers may not do the
>>> actual PrefetchBuffer() calls sequentially. One of them might shoot ahead
>>> and prefetch 3-4 pages while another is lagging with a sequentially
>>> lesser page number; but I believe this is fine, as long as they all
>>> prefetch all the required blocks.
>>
>> I agree with your point, will try to fix this as well.
>
> I have fixed this part.

This looks good now. Further points below ....

===== nodeBitmapHeapscan.c ======

In BitmapHeapNext(), the following code is relevant only for a tbm
returned from MultiExecProcNode():

    if (!tbm || !IsA(tbm, TIDBitmap))
        elog(ERROR, "unrecognized result from subplan");

So it should be moved to just below MultiExecProcNode(), so that it does
not get called when the tbm is created from tbm_create().

--------------

The BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
local variable is_leader in BitmapHeapNext() will serve the purpose,
because is_leader is used only in BitmapHeapNext().

--------------

In BitmapHeapNext(), just before tbm_restore_shared_members() is called,
we create the tbm using tbm_create(). I think tbm_create() does not make
sense for a shared tbm. Whatever fields are required will be restored in
tbm_restore_shared_members(). The other fields, which do not make sense
in a restored tbm, are not required to be initialized using tbm_create().
So I think tbm_restore_shared_members() itself can call
makeNode(TIDBitmap). (Also, it is not required to initialize
tbm->allocator; see the note below in the tidbitmap.c section.)
Similarly, tbm_restore_shared_members() itself can call
tbm_set_parallel(). Looking at all this, it looks better to rename the
function to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather
than tbm_restore_shared_members(). The function header anyway (rightly)
says: Attach worker to shared TID bitmap.
-------------

From what I understand, the leader worker does not have to create its
iterators before waking up the other workers, as long as it makes sure it
copies the tbm fields into shared memory before waking them. But in the
patch, tbm_begin_iterate() is called *before* ConditionVariableBroadcast()
is called. So I feel we can shift the code inside "if (node->is_leader)"
to a place inside the "if (pbminfo == NULL || pbms_is_leader(pbminfo))"
condition block, just after the MultiExecProcNode() call. (And we won't
even need the is_leader local variable.) This way the other workers will
start working sooner.

====== tidbitmap.c =======

The new #include's are not in alphabetical order.

--------------

ParallelTIDBitmap.inited is unused, and, I believe, not required.

--------------

For the leader worker, the new TIDBitmap fields added for parallel bitmap
*are* valid while the tid bitmap is being built. So the below comment
should be shifted accordingly:

    /* these are valid when iterating is true: */

Better still, the shared-tbm-related fields can be kept at the end, with a
comment added saying that these are for a shared tbm.

--------------

It seems the comment below the last ParallelTIDBitmap field is not
relevant:

    /* table to dsa_pointer's array. */

--------------

The tbm->allocator field does not seem to be required. A new allocator can
just be palloc'ed in tbm_create_pagetable() and passed to
pagetable_create(). SH_CREATE() stores this allocator in the
SH_TYPE->alloc field, and fetches it whenever it needs to call any of the
allocator functions. So we can remove the tbm->allocator field and shift
the "palloc(sizeof(pagetable_alloc))" call from tbm_create() to
tbm_create_pagetable().

--------------

In tbm_free(): I think we should call pagetable_destroy() even when the
tbm is shared, so that the hash implementation gets a chance to free the
hash table structure.
I understand that the hash table structure itself is not big, so it would
only be a small memory leak, but it's better not to assume that. Instead,
let SH_DESTROY() call HashFree(). Then, in tbm_free_shared(), we can skip
the dsa_free() call if tbm->iterating is false. Basically, the tbm
implementation should deduce from the bitmap state whether it should free
the shared data, rather than preventing a call to SH_DESTROY().

-----------

In tbm_begin_iterate(), for a shared tbm, internal structures from
simplehash.h are assumed to be known. For e.g., the hash entries will
always be present in one single array, and the entry status is evaluated
using pagetable_IN_USE. Is simplehash.h designed keeping in mind that
these things are supposed to be exposed? I understand that the hash table
handle is local to the leader worker, and so it is not accessible to the
other workers, which means we cannot use pagetable_iterate() to scan the
hash table. So, how about copying the SH_TYPE structure and making it
accessible to the other workers? If we had something like SH_ATTACH() or
SH_COPY(), it would copy the relevant fields that are sufficient to
restore the SH_TYPE structure, and the other workers could start using
this hash table after assigning the dsa array back to tb->data. Something
like HASH_ATTACH used in dynahash.c.

--------------

In the tbm_update_shared_members() header comments:

    * Restore leaders private tbm state to shared location. This must
    * be called before waking up the other worker.

The above can be changed to:

    * Store leader's private tbm state to shared location. This must
    * be called before waking up other workers.

--------------

To be consistent with the other function header comments in this file, a
blank line is required between these two lines:

    * tbm_estimate_parallel_tidbitmap
    *    Estimate size of shared TIDBitmap related info.

-------------

tbm->is_shared is set in both tbm_set_parallel() and
tbm_restore_shared_members(). I think it is needed only in
tbm_set_parallel().
--------------

tbm_alloc_shared(): the function header comments need a typo correction:

    * It allocated memory from DSA and also stores dsa_pointer in the
    * memory allocated

    allocated => allocates

--------------

We prepend the dsa_pointer value to the shared memory data allocated for
the pagetable entries, and we save the address of the first page table
entry in tbm->data. But the same dsa_pointer is also stored in
tbm->dsa_data, and the tbm is accessible in tbm_free_shared(). So it does
not look necessary to prepend the dsa_pointer before the page table
entries. In tbm_free_shared(), we can just do
dsa_free(tbm->area, tbm->dsa_data).

--------------

Below are my suggested changes to the algorithm comments:

@@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)
 /* --------------------------------------------------------------------
  * Parallel Bitmap Heap Scan Algorithm
  *
- * First worker to see the state as parallel bitmap info as PBM_INITIAL
- * become leader and set the state to PBM_INPROGRESS All other workers
- * see the state as PBM_INPROGRESS will wait for leader to complete the
- * TIDBitmap.
+ * The first worker to see the state of ParallelBitmapInfo as PBM_INITIAL
+ * becomes the leader and sets the state to PBM_INPROGRESS. All other
+ * workers see the state as PBM_INPROGRESS, and will wait for leader to
+ * finish building the TIDBitmap.
  *
  * Leader Processing:
  *   Create TIDBitmap using DSA memory.
  *   Restore local TIDBitmap variable information into
- *   ParallelBitmapInfo so that other worker can see those.
- *   set state to PBM_FINISHED.
+ *   ParallelBitmapInfo so that other worker can see those.
+ *   Set state to PBM_FINISHED.
  *   Wake up other workers.
  *
  * Other Worker Processing:
- *   Wait until leader create shared TIDBitmap.
- *   copy TIDBitmap from info from ParallelBitmapInfo to local TIDBitmap.
+ *   Wait until leader creates shared TIDBitmap.
+ *   Copy TIDBitmap from ParallelBitmapInfo to local TIDBitmap.
  *
  * Iterate and process the pages.
  *   a) In this phase each worker will iterate over page and chunk array
- *      and select heap pages one by one. If prefetch is enable then
- *      there will be two iterator.
- *   b) Since multiple worker are iterating over same page and chunk
+ *      and select heap pages one by one. If prefetch is enabled then
+ *      there will be two iterators.
+ *   b) Since multiple workers are iterating over same page and chunk
  *      array we need to have a shared iterator, so we grab a spin
  *      lock and iterate within a lock.