Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Parallel bitmap heap scan |
Date | |
Msg-id | CA+TgmobBP7=Tnk8cL-mz4QZupwNbN3aFcY7N2eNZbrSBxDBK0Q@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Parallel bitmap heap scan (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: [HACKERS] Parallel bitmap heap scan
Re: [HACKERS] Parallel bitmap heap scan |
List | pgsql-hackers |
On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> 0002-hash-support-alloc-free-v14.patch: >> >> >> + if (tb->alloc) >> + { >> >> The memory for tb->alloc is allocated always, is the if check still >> required? > > In parallel case, only first worker will call SH_CREATE, other worker > will only do palloc for pagetable and copy the reference from main > worker, so they will not have allocator. When Andres wrote in https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de that you should try to share only the iteration arrays, I believe that he meant the results of tbm_begin_iterate() -- that is, iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm perhaps putting words into his mouth, but what he said was that we could avoid sharing "the whole underlying hash". But the patch set you've implemented here doesn't behave that way. Instead, you've got the array containing the hash table elements shared (which is good) plus there's a sort of dummy hash table in every worker which copies some but not all of the members of the original hash table, leading to the otherwise-unnecessary if-test that Haribabu is complaining about. So the hash table is kinda-shared-kinda-not-shared, which I don't *think* is really what Andres had in mind. In short, I think SH_COPY (implemented here as pagetable_copy) needs to go away. The work of tbm_begin_iterate() should be done before we begin the shared scan and the results of that work should be shared. What happens right now (it appears) is that every backend does that work based on the same hash table and we just assume they all get the same answer. And we really do assume that, because pbms_parallel_iterate() assumes it can shuttle private state back and forth between iterator in different backends and nothing will break; but those backends aren't actually using the same iteration arrays. They are using different iteration arrays that should have the same contents because they were all derived from the same semi-shared hash table. That's pretty fragile, and a waste of CPU cycles if the hash table is large (every backend does its own sort). On a related note, I think it's unacceptable to make the internal details of TBMIterator public. You've moved it from tidbitmap.c to tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of the TBMIterator, but that's not OK. Those details need to remain private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we need some better solution. The knowledge of how a shared iterator should iterate needs to be private to tidbitmap.c, not off in the executor someplace. And I think the entries need to actually be updated in shared memory directly, not copied back and forth between a backend-private iterator and a shared iterator. Also, pbms_parallel_iterate() can't hold a spinlock around a call to tbm_iterate(). Note the coding rules for spinlocks mentioned in spin.h and src/backend/storage/lmgr/README. I think the right thing to do here is to use an LWLock in a new tranche (add an entry to BuiltinTrancheIds). In 0002, you have this: + tb->alloc = palloc(sizeof(SH_ALLOCATOR)); This should be using MemoryContextAlloc(ctx, ...) rather than palloc. Otherwise the allocator object is in a different context from everything else in the hash table. But TBH, I don't really see why we want this to be a separate object. Why not just put HashAlloc/HashFree/args into SH_TYPE directly? That avoids some pointer chasing and doesn't really seem to cost anything (except that SH_CREATE will grow a slightly longer argument sequence). I think maybe we should rename the functions to element_allocate, element_free, and element_allocator_ctx, or something like that. The current names aren't capitalized consistently with other things in this module, and putting the word "element" in there would make it more clear what the purpose of this thing is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: