Thread: Parallel bitmap heap scan

Parallel bitmap heap scan

From

Dilip Kumar

Date:

07 October 2016, 06:16:51

Hi Hackers,

I would like to propose parallel bitmap heap scan feature. After
running TPCH benchmark, It was observed that many of TPCH queries are
using bitmap scan (@TPCH_plan.tar.gz attached below). Keeping this
point in mind we thought that many query will get benefited with
parallel bitmap scan.

Robert has also pointed out the same thing in his blog related to parallel query
http://rhaas.blogspot.in/2016/04/postgresql-96-with-parallel-query-vs.html

Currently Bitmap heap plan look like this :
------------------------------------------------------
Bitmap Heap Scan
-> Bitmap Index Scan

After this patch :
---------------------
Parallel Bitmap Heap Scan
-> Bitmap Index Scan

As part of this work I have implemented parallel processing in
BitmapHeapScan node. BitmapIndexScan is still non parallel.

Brief design idea:
-----------------------
#1. Shared TIDBitmap creation and initialization
  First worker to see the state as parallel bitmap info as PBM_INITIAL
become leader and set the state to PBM_INPROGRESS All other   workers
see the state as PBM_INPROGRESS will wait for leader to complete the
TIDBitmap.

#2 At this level TIDBitmap is ready and all workers are awake.

#3. Bitmap processing (Iterate and process the pages).
In this phase each worker will iterate over page and chunk array and
select heap pages one by one. If prefetch is enable then there will be
two iterator. Since multiple worker are iterating over same page and
chunk array we need to have a shared iterator, so we grab a spin lock
and iterate within a lock, so that each worker get and different page
to process.

Note: For more detail on design, please refer comment of
BitmapHeapNext API in "parallel-bitmap-heap-scan-v1.patch" file.

Attached patch details:
------------------------------
1. parallel-bitmap-heap-scan-v1.patch: This is the main patch to make
bitmap heap scan node parallel aware.

2. dht-return-dsa-v1.patch:  This patch will provide new API, where we
can scan full DHT[1], and get the dsa_pointers (a relative pointer).
The dsa_pointer values can be shared with other processes. We need
this because, after TIDBitmap is created, only one worker will process
whole TIDBitmap and convert it to a page and chunk array. So we need
to store the generic pointer, so that later on each worker can convert
those to their local pointer before start processing.

My patch depends on following patches.
------------------------------------------------------
1. conditional_variable
https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com

2. dsa_area
https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com

3. Creating a DSA area to provide work space for parallel execution
https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com

4. Hash table in dynamic shared memory (DHT) [1]
https://www.postgresql.org/message-id/CAEepm%3D0VrMt3s_REDhQv6z1pHL7FETOD7Rt9V2MQ3r-2ss2ccA%40mail.gmail.com

Order in which patches should be applied:
--------------------------------------------------------
1. conditional_variable
2. dsa_area
3. Creating a DSA area to provide work space for parallel execution
4. Hash table in dynamic shared memory.
5. dht-return-dsa-v1.patch
6. parallel-bitmap-heap-scan-v1.patch

Performance Results:
-----------------------------
Summary :
1. After this patch, I observed currently 4 queries are getting
significant improvement (Q4, Q6, Q14, Q15).
   - Q4, is converting from parallel seqscan to parallel bitmap heap scan.
   - Other queries are converted from a regular bitmap heap scan to a
parallel bitmap heap scan.
2. Benefit is more visible at lower workers (upto 4), after that some
of the queries are selecting ParallelSeqScan over ParallelBitmapScan.
And, I think this is expected, because so far we have only made
BitmapHeap node as parallel whereas ParallelSeqScan is completely
parallel so at higher worker count ParallelSeqScan is better choice.
3. Detailed result is attached  @TPCH_PBMS.pdf
4. Explain analyse output is attached @TPCH_plan.tar.gz  (for all
changed queries at worker 2)

TPCH query plan changed  example (TPCH Q6):
----------------------------------------------------------------
On Head:
-------------

                                  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1558475.95..1558475.96 rows=1 width=32) (actual
time=40921.437..40921.438 rows=1 loops=1)
   ->  Aggregate  (cost=1558475.95..1558475.96 rows=1 width=32)
(actual time=40921.435..40921.435 rows=1 loops=1)
         ->  Bitmap Heap Scan on lineitem  (cost=291783.32..1552956.39
rows=1103911 width=12) (actual time=7032.075..38997.369 rows=1140434
loops=1)
               Recheck Cond: ((l_shipdate >= '1994-01-01'::date) AND
(l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND
(l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity <
'24'::numeric))
               Rows Removed by Index Recheck: 25284232
               Heap Blocks: exact=134904 lossy=530579
               ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..291507.35 rows=1103911 width=0) (actual
time=6951.408..6951.408 rows=1140434 loops=1)
                     Index Cond: ((l_shipdate >= '1994-01-01'::date)
AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone)
AND (l_discount >= 0.01) AND (l_discount <= 0.03) AND (l_quantity <
'24'::numeric))
 Planning time: 1.126 ms
 Execution time: 40922.569 ms
(10 rows)

After Patch:
----------------

                                         QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1541767.60..1541767.61 rows=1 width=32) (actual
time=21895.008..21895.009 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1541767.60..1541767.61 rows=1
width=32) (actual time=21895.006..21895.006 rows=1 loops=1)
         ->  Gather  (cost=1541767.38..1541767.59 rows=2 width=32)
(actual time=21894.341..21894.970 rows=3 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial Aggregate  (cost=1540767.38..1540767.39
rows=1 width=32) (actual time=21890.990..21890.990 rows=1 loops=3)
                     ->  Parallel Bitmap Heap Scan on lineitem
(cost=291783.32..1538467.56 rows=459963 width=12) (actual
time=8517.126..21215.469 rows=380145 loops=3)
                           Recheck Cond: ((l_shipdate >=
'1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp
without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03)
AND (l_quantity < '24'::numeric))
                           Rows Removed by Index Recheck: 8427921
                           Heap Blocks: exact=47761 lossy=187096
                           ->  Bitmap Index Scan on
idx_lineitem_shipdate  (cost=0.00..291507.35 rows=1103911 width=0)
(actual time=8307.291..8307.291 rows=1140434 loops=1)
                                 Index Cond: ((l_shipdate >=
'1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp
without time zone) AND (l_discount >= 0.01) AND (l_discount <= 0.03)
AND (l_quantity < '24'::numeric))
 Planning time: 1.173 ms
 Execution time: 21915.931 ms
(14 rows)


* Thanks to Robert Haas and Amit Kapila, for helping in design review
(off list) and many valuable inputs.
* Thanks to Thomas Munro for DSA and DHT work on which my patch is based on.
* Thanks to Rafia sabih for helping with performance test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Fri, Nov 25, 2016 at 6:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed the design to directly make it based on DSA instead of using DHT.
> In new version we no longer use DHT. Instead, of that I have made some
> change in simplehash[1], so that it can allow external allocator. In
> tidbitmap.c, I have register the allocator to simplehash and those
> allocator functions will allocate memory directly from DSA.
>
> simplehash is always using one single memory (during expand it copy
> from old memory to new memory). Which makes remaining processing very
> simple for us.
>
> In tbm_begin_iterate, we need not to scan complete hash table instead
> of that we can just process dsa memory and convert into page array and
> chunk array.
>
> I have tested the performance in my local machine and I observed that
> it's slightly better than older
> DHT based version (complete performance number will be published soon).
>
> Dependency on other patches:
> 1. dsa_area (dsa-v7.patch)
> https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com
>
> 2. Creating a DSA area to provide work space for parallel execution
> (dsa-area-for-executor-v3.patch)
> https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com
>
> patch details
> 1. hash_support_alloc_free_v1.patch [1].
> 2. parallel-bitmap-heap-scan-v3.patch

I just realised that, my latest patch I just sent to Andres, instead
of replying to all.
Forwarding the same mail to Hackers.

Performance reading with new patch..
TPCH-scale factor 10. work_mem 20MB, Power 4 socket machine

Query Head     Patch Improvement
Q4       4811     3290     1.5x
Q6     13136     6198     2.1x
Q14     8119     5057     1.6x
Q15   25652   20143     1.2x

Explained analyzed results are attached with the mail..

* I have also applied Andres patch from below link, for taking this
performance (both for head and for patch).
https://www.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel bitmap heap scan

From

Haribabu Kommi

Date:

05 December 2016, 02:30:37

On Wed, Nov 30, 2016 at 4:38 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 25, 2016 at 6:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed the design to directly make it based on DSA instead of using DHT.
> In new version we no longer use DHT. Instead, of that I have made some
> change in simplehash[1], so that it can allow external allocator. In
> tidbitmap.c, I have register the allocator to simplehash and those
> allocator functions will allocate memory directly from DSA.
>
> simplehash is always using one single memory (during expand it copy
> from old memory to new memory). Which makes remaining processing very
> simple for us.
>
> In tbm_begin_iterate, we need not to scan complete hash table instead
> of that we can just process dsa memory and convert into page array and
> chunk array.
>
> I have tested the performance in my local machine and I observed that
> it's slightly better than older
> DHT based version (complete performance number will be published soon).
>
> Dependency on other patches:
> 1. dsa_area (dsa-v7.patch)
> https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com
>
> 2. Creating a DSA area to provide work space for parallel execution
> (dsa-area-for-executor-v3.patch)
> https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com
>
> patch details
> 1. hash_support_alloc_free_v1.patch [1].
> 2. parallel-bitmap-heap-scan-v3.patch

I just realised that, my latest patch I just sent to Andres, instead
of replying to all.
Forwarding the same mail to Hackers.

Performance reading with new patch..
TPCH-scale factor 10. work_mem 20MB, Power 4 socket machine

Query Head Patch Improvement
Q4 4811 3290 1.5x
Q6 13136 6198 2.1x
Q14 8119 5057 1.6x
Q15 25652 20143 1.2x

Explained analyzed results are attached with the mail..

* I have also applied Andres patch from below link, for taking this
performance (both for head and for patch).
https://www.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de

Moved to next CF with "needs review" status.

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From

Amit Kapila

Date:

10 December 2016, 15:06:17

On Wed, Nov 30, 2016 at 11:08 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> patch details
>> 1. hash_support_alloc_free_v1.patch [1].
>> 2. parallel-bitmap-heap-scan-v3.patch

Few assorted comments:

1.
+ else if (needWait)
+ {
+ /* Add ourself to wait queue */
+ ConditionVariablePrepareToSleep(&pbminfo->cv);
+ queuedSelf = true;
+ }

With the committed version of condition variables, you can avoid
calling ConditionVariablePrepareToSleep().  Refer latest parallel
index scan patch [1].

2.
+         <entry><literal>ParallelBitmapScan</></entry>
+         <entry>Waiting for leader to complete the TidBitmap.</entry>
+        </row>
+        <row>

Isn't it better to write it as below:

Waiting for the leader backend to form the TidBitmap.

3.
+ * Update snpashot info in heap scan descriptor.

/snpashot/snapshot

4.#include "utils/tqual.h"
-
+#include "miscadmin.h"
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
-
+static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
 TBMIterateResult *tbmres;
-
+ ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
static int tbm_comparator(const void *left, const void *right);
-
+void * tbm_alloc_shared(Size size, void *arg);

It seems line deletes at above places are spurious.  Please check for
similar occurrences at other places in patch.

5.
+ bool is_shared; /* need to build shared tbm if set*/

space is required towards the end of the comment (set */).

6.
+ /*
+ * If we have shared TBM means we are running in parallel mode.
+ * So directly convert dsa array to page and chunk array.
+ */

I think the above comment can be simplified as "For shared TBM,
directly convert dsa array to page and chunk array"

7.
+ dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));

extra space is missing at multiple places in above line. It should be
written as below:

dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));

Similar stuff needs to be taken care at other places in the patch as
well.  I think it will be better if you run pgindent on your patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

13 December 2016, 09:36:33

On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few assorted comments:

Thanks for the review

>
> 1.
> + else if (needWait)
> + {
> + /* Add ourself to wait queue */
> + ConditionVariablePrepareToSleep(&pbminfo->cv);
> + queuedSelf = true;
> + }
>
> With the committed version of condition variables, you can avoid
> calling ConditionVariablePrepareToSleep().  Refer latest parallel
> index scan patch [1].

Oh, I see, Fixed.
>
> 2.
> +         <entry><literal>ParallelBitmapScan</></entry>
> +         <entry>Waiting for leader to complete the TidBitmap.</entry>
> +        </row>
> +        <row>
>
> Isn't it better to write it as below:
>
> Waiting for the leader backend to form the TidBitmap.
Done this way..
>
> 3.
> + * Update snpashot info in heap scan descriptor.
>
> /snpashot/snapshot
Fixed
>
> 4.
>  #include "utils/tqual.h"
> -
> +#include "miscadmin.h"
>
>  static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
> -
> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
>
>   TBMIterateResult *tbmres;
> -
> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>
>  static int tbm_comparator(const void *left, const void *right);
> -
> +void * tbm_alloc_shared(Size size, void *arg);
>
> It seems line deletes at above places are spurious.  Please check for
> similar occurrences at other places in patch.
>
Fixed
> 5.
> + bool is_shared; /* need to build shared tbm if set*/
>
> space is required towards the end of the comment (set */).
Fixed
>
> 6.
> + /*
> + * If we have shared TBM means we are running in parallel mode.
> + * So directly convert dsa array to page and chunk array.
> + */
>
> I think the above comment can be simplified as "For shared TBM,
> directly convert dsa array to page and chunk array"

Done this way..
>
> 7.
> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));
>
> extra space is missing at multiple places in above line. It should be
> written as below:
>
> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));
Fixed..

>
> Similar stuff needs to be taken care at other places in the patch as
> well.  I think it will be better if you run pgindent on your patch.

I have run the pgindent, and taken all the changes whichever was
applicable for added code.

There are some cleanup for "hash-support-alloc-free" also, so
attaching a new patch for this as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

21 December 2016, 08:06:13

Rebased on the current head.

On Tue, Dec 13, 2016 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Few assorted comments:
>
> Thanks for the review
>
>>
>> 1.
>> + else if (needWait)
>> + {
>> + /* Add ourself to wait queue */
>> + ConditionVariablePrepareToSleep(&pbminfo->cv);
>> + queuedSelf = true;
>> + }
>>
>> With the committed version of condition variables, you can avoid
>> calling ConditionVariablePrepareToSleep().  Refer latest parallel
>> index scan patch [1].
>
> Oh, I see, Fixed.
>>
>> 2.
>> +         <entry><literal>ParallelBitmapScan</></entry>
>> +         <entry>Waiting for leader to complete the TidBitmap.</entry>
>> +        </row>
>> +        <row>
>>
>> Isn't it better to write it as below:
>>
>> Waiting for the leader backend to form the TidBitmap.
> Done this way..
>>
>> 3.
>> + * Update snpashot info in heap scan descriptor.
>>
>> /snpashot/snapshot
> Fixed
>>
>> 4.
>>  #include "utils/tqual.h"
>> -
>> +#include "miscadmin.h"
>>
>>  static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
>> -
>> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo);
>>
>>   TBMIterateResult *tbmres;
>> -
>> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
>>
>>  static int tbm_comparator(const void *left, const void *right);
>> -
>> +void * tbm_alloc_shared(Size size, void *arg);
>>
>> It seems line deletes at above places are spurious.  Please check for
>> similar occurrences at other places in patch.
>>
> Fixed
>> 5.
>> + bool is_shared; /* need to build shared tbm if set*/
>>
>> space is required towards the end of the comment (set */).
> Fixed
>>
>> 6.
>> + /*
>> + * If we have shared TBM means we are running in parallel mode.
>> + * So directly convert dsa array to page and chunk array.
>> + */
>>
>> I think the above comment can be simplified as "For shared TBM,
>> directly convert dsa array to page and chunk array"
>
> Done this way..
>>
>> 7.
>> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer));
>>
>> extra space is missing at multiple places in above line. It should be
>> written as below:
>>
>> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer));
> Fixed..
>
>>
>> Similar stuff needs to be taken care at other places in the patch as
>> well.  I think it will be better if you run pgindent on your patch.
>
> I have run the pgindent, and taken all the changes whichever was
> applicable for added code.
>
> There are some cleanup for "hash-support-alloc-free" also, so
> attaching a new patch for this as well.
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Dec 26, 2016 at 3:14 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Other the another option is, that we can always make caller to provide
> an allocator. But this way every new user for simple hash need to take
> care of having allocator.
>
> What is your opinion?

Attached is the new version of the patch which implements it the way I
described.

>
>
>>This also needs docs, including a warning that just
>> using an allocator in shared memory does *NOT* allow the hash table to be
>> used in shared memory in the general case.
>
> Make sense.
Added the Warning.

I have also fixed some bug in parallel bitmap heap scan
(path.parallel_workers was not initialised before calling
cost_bitmap_heap_scan in some cases, so it was taking the
uninitialized value). Patch attached.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Amit Khandekar

Date:

06 January 2017, 08:17:44

Sorry for the delay in my next response. I still haven't exhaustively
gone through all changes, but meanwhile, below are some more points.

On 26 November 2016 at 18:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> Thanks for the review..
>
> I have worked on these comments..
>>
>>> In pbms_is_leader() , I didn't clearly understand the significance of
>>> the for-loop. If it is a worker, it can call
>>> ConditionVariablePrepareToSleep() followed by
>>> ConditionVariableSleep(). Once it comes out of
>>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>>> finished the bitmap ? If yes, then it looks like it is not necessary
>>> to again iterate and go back through the pbminfo->state checking.
>>> Also, with this, variable queuedSelf also might not be needed. But I
>>> might be missing something here.
>>
>> I have taken this logic from example posted on conditional variable thread[1]
>>
>> for (;;)
>> {
>>     ConditionVariablePrepareToSleep(cv);
>>     if (condition for which we are waiting is satisfied)
>>         break;
>>     ConditionVariableSleep();
>> }
>> ConditionVariableCancelSleep();
>>
>> [1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

With the latest patches, this looks fine to me.

>>> tbm_iterate() call under SpinLock :
>>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
>>> held. Generally we try to keep code inside Spinlock call limited to a
>>> few lines, and that too without occurrence of a function call.
>>> Although tbm_iterate() code itself looks safe under a spinlock, I was
>>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
>>> closer to each other. One thought is :  in tbm_iterate(), acquire the
>>> SpinLock before the while loop that iterates over lossy chunks. Then,
>>> if both chunk and per-page data remain, release spinlock just before
>>> returning (the first return stmt). And then just before scanning
>>> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>>> tbm->npages)", save the page handle, increment iterator->spageptr,
>>> release Spinlock, and then use the saved page handle to iterate over
>>> the page bitmap.
>>
>> Main reason to keep Spin lock out of this function to avoid changes
>> inside this function, and also this function takes local iterator as
>> input which don't have spin lock reference to it. But that can be
>> changed, we can pass shared iterator to it.
>>
>> I will think about this logic and try to update in next version.
> Still this issue is not addressed.
> Logic inside tbm_iterate is using same variable, like spageptr,
> multiple places. IMHO this complete logic needs to be done under one
> spin lock.

I think we both agree on the part that the mutex handle can be passed
to tbm_iterate() to do this. I am yet to give more thought on how
clumsy it will be if we add SpinlockRelease() calls in intermediate
places in the function rather than in the end.

>
>>
>>>
>>> prefetch_pages() call under Spinlock :
>>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
>>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>>> would get called while under the Spinlock. These can even ereport().
>>> One option is to use mutex lock for this purpose. But I think that
>>> would slow things down. Moreover, the complete set of prefetch pages
>>> would be scanned by a single worker, and others might wait for this
>>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
>>> part viz : iterating over the prefetch pages, and doing the
>>> PrefetchBuffer() need not be synchronised using this
>>> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
>>> its own iterator spinlock. Only thing is, workers may not do the
>>> actual PrefetchBuffer() sequentially. One of them might shoot ahead
>>> and prefetch 3-4 pages while the other is lagging with the
>>> sequentially lesser page number; but I believe this is fine, as long
>>> as they all prefetch all the required blocks.
>>
>> I agree with your point, will try to fix this as well.
>
> I have fixed this part.

This looks good now.

Further points below ....

===== nodeBItmapHeapscan.c ======

In BitmapHeapNext(), the following code is relevant only for tbm
returned from MultiExecProcNode().
if (!tbm || !IsA(tbm, TIDBitmap))
elog(ERROR, "unrecognized result from subplan");

So it should be moved just below MultiExecProcNode(), so that it does
not get called when it is created from tbm_create().

--------------

BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
local variable is_leader in BitmapHeapNext() will solve the purpose.
This is because is_leader is used only in BitmapHeapNext().

--------------

In BitmapHeapNext(), just before tbm_restore_shared_members() is
called, we create tbm using tbm_create(). I think tbm_create() does
not make sense for shared tbm. Whatever fields are required, will be
restored in tbm_restore_shared_members(). The other fields which do
not make sense in a restored tbm are not required to be initialized
using tbm_create(). So I think tbm_restore_shared_members() itself can
call makeNode(TIDBitmap). (Also it is not required to initialize
tbm->allocator; see note below in tidbitmap.c section). So
tbm_restore_shared_members() itself can call tbm_set_parallel().
Looking at all this, it looks better to have the function name changed
to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
tbm_restore_shared_members(). The function header anyways (rightly)
says  : Attach worker to shared TID bitmap.

-------------

From what I understand, the leader worker does not have to create its
iterators before waking up the other workers, as long as it makes sure
it copies tbm fields into shared memory before waking workers. But in
the patch, tbm_begin_iterate() is called *before* the
ConditionVariableBroadcast() is called. So I feel, we can shift the
code inside the "if (node->is_leader)" to a place inside the "if
(pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
after MultiExecProcNode() call. (And we won't even need is_leader
local variable as well). This way now the other workers will start
working sooner.

====== tidbitmap.c =======

The new #include's are not in alphabetical order.

--------------

ParallelTIDBitmap.inited is unused, and I believe, not required.

--------------

For leader worker, the new TIDBitmap fields added for parallel bitmap
*are* valid while the tid is being built. So the below comment should
be shifted accordingly :
/* these are valid when iterating is true: */
Better still, the shared tbm-related fields can be kept in the end,
and a comment should be added that these are for shared tbm.

--------------

It seems, the comment below the last ParallelTIDBitmap field is not relevant :
/* table to dsa_pointer's array. */

--------------

tbm->allocator field does not seem to be required. A new allocator can
be just palloc'ed in tbm_create_pagetable(), and passed to
pagetable_create(). SH_CREATE() stores this allocator in the
SH_TYPE->alloc field, and fetches the same whenever it needs it for
calling any of the allocator functions. So we can remove the
tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
from tbm_create() to tbm_create_pagetable().

--------------

In tbm_free() :
I think we should call pagetable_destroy() even when tbm is shared, so
that the hash implementation gets a chance to free the hash table
structure. I understand that the hash table structure itself is not
big, so it will only be a small memory leak, but it's better to not
assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
is false. Basically, tbm bitmap implementation should deduce from the
bitmap state whether it should free the shared data, rather than
preventing a call to SH_DESTROY().

-----------

In tbm_begin_iterate(), for shared tbm, internal structures from
simplehash.h are assumed to be known. For e.g., the hash entries will
always be present in one single array, and also the entry status is
evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
mind that these things are suppose to be exposed ?

I understand that the hash table handle is local to the leader worker,
and so it is not accessible to other workers. And so, we cannot use
pagetable_iterate() to scan the hash table. So, how about copying the
SH_TYPE structure and making it accessible to other workers ? If we
have something like SH_ATTACH() or SH_COPY(), this will copy the
relevant fields that are sufficient to restore the SH_TYPE structure,
and other workers can start using this hash table after assigning dsa
array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

--------------

In the tbm_update_shared_members() header comments :* Restore leaders private tbm state to shared location. This must*
becalled before waking up the other worker.

Above can be changed to:* Store leader's private tbm state to shared location. This must* be called before waking up
otherworkers.

--------------

To be consistent with other function header comments in this file, a
blank line is required between these two lines :
* tbm_estimate_parallel_tidbitmap* Estimate size of shared TIDBitmap related info.

-------------

tbm->is_shared is set in both tbm_set_parallel() and
tbm_restore_shared_members(). I think it is needed only in
tbm_set_parallel().

--------------

tbm_alloc_shared() Function header comments need some typo correction :* It allocated memory from DSA and also stores
dsa_pointerin the memory

allocated => allocates

--------------

We prepend the dsa_pointer value into the shared memory data allocated
for the pagetable entries; and we save the address of the first  page
table entry in tbm->data. But the same dsa_pointer is also stored in
tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
not look necessary to prepend dsa_pointer before the page table
entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
tbm->dsa_data).

--------------

Below are my suggested changes in the algorithm comments :

@@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)       /*
--------------------------------------------------------------------       *      Parallel Bitmap Heap Scan Algorithm
    *

-        *      First worker to see the state as parallel bitmap info
as PBM_INITIAL
-        *      become leader and set the state to PBM_INPROGRESS All
other workers
-        *      see the state as PBM_INPROGRESS will wait for leader
to complete the
-        *      TIDBitmap.
+        *      The first worker to see the state of
ParallelBitmapInfo as PBM_INITIAL
+        *      becomes the leader and sets the state to
PBM_INPROGRESS. All other
+        *      workers see the state as PBM_INPROGRESS, and will wait
for leader to
+        *      finish building the TIDBitmap.        *        *      Leader Processing:        *        Create
TIDBitmapusing DSA memory.        *        Restore local TIDBitmap variable information into

-        *        ParallelBitmapInfo so that other worker can see those.
-        *        set state to PBM_FINISHED.
+        *           ParallelBitmapInfo so that other worker can see those.
+        *        Set state to PBM_FINISHED.        *        Wake up other workers.        *        *      Other Worker
Processing:
-        *        Wait until leader create shared TIDBitmap.
-        *        copy TIDBitmap from info from ParallelBitmapInfo to
local TIDBitmap.
+        *        Wait until leader creates shared TIDBitmap.
+        *        Copy TIDBitmap from ParallelBitmapInfo to local TIDBitmap.        *        *      Iterate and process
thepages.        *        a) In this phase each worker will iterate over page

and chunk array
-        *               and select heap pages one by one. If prefetch
is enable then
-        *               there will be two iterator.
-        *        b) Since multiple worker are iterating over same
page and chunk
+        *               and select heap pages one by one. If prefetch
is enabled then
+        *               there will be two iterators.
+        *        b) Since multiple workers are iterating over same
page and chunk        *               array we need to have a shared iterator, so
we grab a spin        *               lock and iterate within a lock.

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

09 January 2017, 10:35:24

On Fri, Jan 6, 2017 at 10:47 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> This looks good now.

Thanks..
>
>
> Further points below ....

Thanks for the review.


> ===== nodeBItmapHeapscan.c ======
>
>
> In BitmapHeapNext(), the following code is relevant only for tbm
> returned from MultiExecProcNode().
> if (!tbm || !IsA(tbm, TIDBitmap))
> elog(ERROR, "unrecognized result from subplan");

Fixed

> --------------
>
> BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
> local variable is_leader in BitmapHeapNext() will solve the purpose.
> This is because is_leader is used only in BitmapHeapNext().

Fixed
>
> --------------
>
> In BitmapHeapNext(), just before tbm_restore_shared_members() is
> called, we create tbm using tbm_create(). I think tbm_create() does
> not make sense for shared tbm. Whatever fields are required, will be
> restored in tbm_restore_shared_members(). The other fields which do
> not make sense in a restored tbm are not required to be initialized
> using tbm_create(). So I think tbm_restore_shared_members() itself can
> call makeNode(TIDBitmap). (Also it is not required to initialize
> tbm->allocator; see note below in tidbitmap.c section). So
> tbm_restore_shared_members() itself can call tbm_set_parallel().
> Looking at all this, it looks better to have the function name changed
> to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
> tbm_restore_shared_members(). The function header anyways (rightly)
> says  : Attach worker to shared TID bitmap.

Fixed

>
> -------------
>
> From what I understand, the leader worker does not have to create its
> iterators before waking up the other workers, as long as it makes sure
> it copies tbm fields into shared memory before waking workers. But in
> the patch, tbm_begin_iterate() is called *before* the
> ConditionVariableBroadcast() is called. So I feel, we can shift the
> code inside the "if (node->is_leader)" to a place inside the "if
> (pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
> after MultiExecProcNode() call. (And we won't even need is_leader
> local variable as well). This way now the other workers will start
> working sooner.

Correct, Fixed.

>
>
>
> ====== tidbitmap.c =======
>
>
> The new #include's are not in alphabetical order.

Done..
>
> --------------
>
> ParallelTIDBitmap.inited is unused, and I believe, not required.

Fixed
>
> --------------
>
> For leader worker, the new TIDBitmap fields added for parallel bitmap
> *are* valid while the tid is being built. So the below comment should
> be shifted accordingly :
> /* these are valid when iterating is true: */
> Better still, the shared tbm-related fields can be kept in the end,
> and a comment should be added that these are for shared tbm.
>

Done
> --------------
>
> It seems, the comment below the last ParallelTIDBitmap field is not relevant :
> /* table to dsa_pointer's array. */

Fixed..
>
> --------------
>
> tbm->allocator field does not seem to be required. A new allocator can
> be just palloc'ed in tbm_create_pagetable(), and passed to
> pagetable_create(). SH_CREATE() stores this allocator in the
> SH_TYPE->alloc field, and fetches the same whenever it needs it for
> calling any of the allocator functions. So we can remove the
> tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
> from tbm_create() to tbm_create_pagetable().

Done
>
> --------------
>
> In tbm_free() :
> I think we should call pagetable_destroy() even when tbm is shared, so
> that the hash implementation gets a chance to free the hash table
> structure. I understand that the hash table structure itself is not
> big, so it will only be a small memory leak, but it's better to not
> assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
> tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
> is false. Basically, tbm bitmap implementation should deduce from the
> bitmap state whether it should free the shared data, rather than
> preventing a call to SH_DESTROY().

Fixed
>
> -----------
>
> In tbm_begin_iterate(), for shared tbm, internal structures from
> simplehash.h are assumed to be known. For e.g., the hash entries will
> always be present in one single array, and also the entry status is
> evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
> mind that these things are suppose to be exposed ?
>
> I understand that the hash table handle is local to the leader worker,
> and so it is not accessible to other workers. And so, we cannot use
> pagetable_iterate() to scan the hash table. So, how about copying the
> SH_TYPE structure and making it accessible to other workers ? If we
> have something like SH_ATTACH() or SH_COPY(), this will copy the
> relevant fields that are sufficient to restore the SH_TYPE structure,
> and other workers can start using this hash table after assigning dsa
> array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

This looks cleaner, and also avoid processing the data of the hash
directly. I have changed as per the suggestion.

>
> --------------
>
> In the tbm_update_shared_members() header comments :
>  * Restore leaders private tbm state to shared location. This must
>  * be called before waking up the other worker.
>
> Above can be changed to:
>  * Store leader's private tbm state to shared location. This must
>  * be called before waking up other workers.

Done

>
> --------------
>
> To be consistent with other function header comments in this file, a
> blank line is required between these two lines :
>
>  * tbm_estimate_parallel_tidbitmap
>  * Estimate size of shared TIDBitmap related info.
>
Done
> -------------
>
> tbm->is_shared is set in both tbm_set_parallel() and
> tbm_restore_shared_members(). I think it is needed only in
> tbm_set_parallel().

Done
>
> --------------
>
> tbm_alloc_shared() Function header comments need some typo correction :
>  * It allocated memory from DSA and also stores dsa_pointer in the memory
> allocated => allocates
>
> --------------
>
> We prepend the dsa_pointer value into the shared memory data allocated
> for the pagetable entries; and we save the address of the first  page
> table entry in tbm->data. But the same dsa_pointer is also stored in
> tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
> not look necessary to prepend dsa_pointer before the page table
> entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
> tbm->dsa_data).

tbm->dsa_data allocates the latest allocated dsa_pointer. And, hash
table first allocates the new memory then free the old pointer. So
whenever it allocates the new memory we will store latest in
tbm->dsa_data, and when the hash table will call free it will call
with the local pointer and we don't have any reference to dsa pointer
for that old memory, that's the reason we store dsa pointer as header.

>
> --------------
>
> Below are my suggested changes in the algorithm comments :
>
> @@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)

Changed.

Apart from these changes, I have created a separate
patch(0001-opt-parallelcost-refactoring-v7.patch) for code refactoring
in optimizer required for parallelbitmap heap scan. (earlier this was
part of my main patch)

This patch can be used by 0003-parallel-bitmap-heap-scan-v7.patch
attached in the mail and also parallel-index-scan[2] can be rebased on
this, if this get committed,

[2]
https://www.postgresql.org/message-id/CAA4eK1KthrAvNjmB2cWuUHz%2Bp3ZTTtbD7o2KUw49PC8HK4r1Wg%40mail.gmail.com

I will send the performance data with the new patch in the separate mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Jan 9, 2017 at 5:01 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> Thanks Dilip. issue is reproducible  if  we  uses '--enable-cassert' switch
> in the configure. We are able to reproduce it only with --enable-cassert' .

Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
observed another issue related to code refactoring, Actually, there
was some code present in 0001 which supposed to be in 0003.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Jan 10, 2017 at 2:19 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> We found a regression , earlier the testcase was working fine (against the
> older patches of Parallel bitmap heap scan)  but now getting a server crash
> against v8 patches.
>
> Testcase - (one of the table of TPC-H )
>
> postgres=#explain analyze verbose
> SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <='1996-03-31'::date);

While fixing some of the review comments in v7 and v8, I had allocated
new memory for pagetable, and missed to initialize it.

Thanks for reporting, I have fixed it in v9. After fix query is
running fine for me. Please try attached patch and confirm from your
side.

postgres=# explain analyze verbose
SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <='1996-03-31'::date);

        QUERY PLAN


--------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------
 Finalize Aggregate  (cost=798002.46..798002.47 rows=1 width=32)
(actual time=15501.245..15501.245 rows=1 loops=1)
   Output: sum(l_extendedprice)
   ->  Gather  (cost=798002.24..798002.45 rows=2 width=32) (actual
time=15494.358..15498.919 rows=3 loops=1)
         Output: (PARTIAL sum(l_extendedprice))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=797002.24..797002.25 rows=1
width=32) (actual time=15492.937..15492.937 rows=1 loops=3)
               Output: PARTIAL sum(l_extendedprice)
               Worker 0: actual time=15491.218..15491.219 rows=1 loops=1
               Worker 1: actual time=15493.514..15493.514 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on public.lineitem
(cost=147461.75..791014.31 rows=2395170 width=8) (actual
time=8553.301..15061.333 rows=189294
7 loops=3)
                     Output: l_extendedprice
                     Recheck Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
                     Rows Removed by Index Recheck: 6451177
                     Heap Blocks: exact=27963 lossy=164938
                     Worker 0: actual time=8548.957..15054.511
rows=1887239 loops=1
                     Worker 1: actual time=8554.817..15050.317
rows=1902477 loops=1
                     ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..146024.65 rows=5748409 width=0) (actual
time=8533.701..8533.701 rows=5678841
loops=1)
                           Index Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
 Planning time: 2.742 ms
 Execution time: 15509.696 ms
(21 rows)



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 01/11/2017 05:33 PM, tushar wrote:

On 01/10/2017 05:16 PM, Dilip Kumar wrote:
Please try attached patch and confirm from your
side.
Thanks,issue seems to be fixed now.

Prabhat and I have done some testing for this feature and written few testcases. PFA the sql script(along with the expected .out files)

In addition we have generated the LCOV (code coverage) report and compared the files which are changed for the "Parallel Bitmap Heap scan" patch.
You can see the numbers for "with_patch" V/s "with_patch+TestCases" (.pdf file is attached)

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

13 January 2017, 16:06:16

On Thu, Jan 12, 2017 at 5:47 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Hello Dilip,
>> I was trying to test the performance of all the parallel query related
>> patches on TPC-H queries, I found that when parallel hash [1 ]and your
>> patch are applied then Q10 and Q14 were hanged, however, without any
>> one of these patches, these queries are running fine. Following is the
>> stack trace,
>
> Thanks for reporting, I will look into this.
I had setup for TPCH scale factor 5, I could reproduce the hang
reported by you with Q10. I have fixed the issue in v10 (only 0003 is
changed other patches have no change). However, after fixing this
issue it's crashing in parallel shared hash patch and call stack is
same what you have reported on parallel shared hash thread[1]

[1] https://www.postgresql.org/message-id/CAOGQiiOJji9GbLwfT0vBuixdkgAsxzZjTpPxsO6BiTLD--SzYg%40mail.gmail.com

Please verify with the new patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-parallel-bitmap-heap-scan-v10.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

18 January 2017, 08:14:58

On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Please verify with the new patch.

Patch 0001 and 0003 required to rebase on the latest head. 0002 is
still the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

18 January 2017, 21:56:35

On Wed, Jan 18, 2017 at 12:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> Please verify with the new patch.
>
> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
> still the same.

I've committed the first half of 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

23 January 2017, 07:42:44

On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
>> still the same.
>
> I've committed the first half of 0001.
Thanks.  0001 and 0003 required rebasing after this commit.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Haribabu Kommi

Date:

23 January 2017, 11:22:54

On Mon, Jan 23, 2017 at 3:42 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
>> still the same.
>
> I've committed the first half of 0001.
Thanks. 0001 and 0003 required rebasing after this commit.

I reviewed 0002-hash-support-alloc-free-v12.patch, some minor comments.

- SH_TYPE *tb;

- uint64 size;

+ SH_TYPE *tb;

+ uint64 size;

The above change may not be required.

+ if (tb->alloc)

+ {

+ tb->alloc->HashFree(tb->data, tb->alloc->args);

+ pfree(tb->alloc);

+ }

The above code tries to free the tb->alloc memory. In case if the user

has provide the alloc structure to SH_CREATE function and the same

pointer is stored in the tb structure. And in free function freeing that

memory may cause problem.

So either explicitly mentioning that the input must a palloc'ed data or

by default allocate memory and copy the input data into allocated

memory.

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

24 January 2017, 07:48:29

On Mon, Jan 23, 2017 at 1:52 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> I reviewed 0002-hash-support-alloc-free-v12.patch, some minor comments.
>
> - SH_TYPE    *tb;
> - uint64 size;
> + SH_TYPE     *tb;
> + uint64 size;
>
> The above change may not be required.
>
> + if (tb->alloc)
> + {
> + tb->alloc->HashFree(tb->data, tb->alloc->args);
> + pfree(tb->alloc);
> + }
>
> The above code tries to free the tb->alloc memory. In case if the user
> has provide the alloc structure to SH_CREATE function and the same
> pointer is stored in the tb structure. And in free function freeing that
> memory may cause problem.
>
> So either explicitly mentioning that the input must a palloc'ed data or
> by default allocate memory and copy the input data into allocated
> memory.

I have changed as per the comments. 0002 and 0003 are changed, 0001 is
still the same.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

27 January 2017, 09:32:53

On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> still the same.

2 days back my colleague Rafia, reported one issue (offlist) that
parallel bitmap node is not scaling as good as other nodes e.g
parallel sequence scan and parallel index scan.

After some perf analysis, I found that there was one unconditional
spin lock in parallel bitmap patch which we were taking for checking
the prefetch target. Basically, we were always taking the lock and
checking the prefetch_target is reached to prefetch_maximum. So even
after it will reach to prefetch_maximum we will acquire the lock and
check after every tuple. I have changed that logic, now I will check
the condition first if we need to increase the target then will take
the lock and recheck the condition.

There is just one line change in 0003 compared to older version, all
other patches are the same.

Some performance data to show that new parallel bitmap patch performs
way better than the previous version.
TPCH scale 20, work_mem 64MB, shared buffers 8GB (4 workers)
machine intel 8 socket machine

            v13(time in ms)           v14 (time in ms)
Q6        37065.841                18202.903

Q14     15229.569                 11341.121

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

28 January 2017, 00:31:22

On Fri, Jan 27, 2017 at 1:32 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There is just one line change in 0003 compared to older version, all
> other patches are the same.

I spent some time looking at 0001 (and how those changes are used in
0003) and I thought it looked good, so I committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Haribabu Kommi

Date:

31 January 2017, 05:17:52

On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> still the same.

2 days back my colleague Rafia, reported one issue (offlist) that
parallel bitmap node is not scaling as good as other nodes e.g
parallel sequence scan and parallel index scan.

After some perf analysis, I found that there was one unconditional
spin lock in parallel bitmap patch which we were taking for checking
the prefetch target. Basically, we were always taking the lock and
checking the prefetch_target is reached to prefetch_maximum. So even
after it will reach to prefetch_maximum we will acquire the lock and
check after every tuple. I have changed that logic, now I will check
the condition first if we need to increase the target then will take
the lock and recheck the condition.

There is just one line change in 0003 compared to older version, all
other patches are the same.

Some performance data to show that new parallel bitmap patch performs
way better than the previous version.
TPCH scale 20, work_mem 64MB, shared buffers 8GB (4 workers)
machine intel 8 socket machine

v13(time in ms) v14 (time in ms)
Q6 37065.841 18202.903

Q14 15229.569 11341.121

Thanks for the update. I have some comments

0002-hash-support-alloc-free-v14.patch:

+ if (tb->alloc)

+ {

The memory for tb->alloc is allocated always, is the if check still required?

0003-parallel-bitmap-heap-scan-v14.patch:

+ * and set parallel flag in lower level bitmap index scan. Later

+ * bitmap index node will use this flag to indicate tidbitmap that

+ * it needs to create an shared page table.

+ */

I feel better to mention, where this flag is used, so that it will be more clear.

+ * Mark tidbitmap as shared and also store DSA area in it.

+ * marking tidbitmap as shared is indication that this tidbitmap

+ * should be created in shared memory. DSA area will be used for

The flag of shared is set in tidbitmap structure itself, but the

comment is mentioned that tidbitmpa should be created in shared memory.

I think that is the page table that needs to be created in shared memory.

Providing some more information here will be helpful.

- node->tbmres = tbmres = tbm_iterate(tbmiterator);

+ node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,

+ pbminfo ? &pbminfo->tbmiterator : NULL);

Instead Passing both normal and paralle iterators to the functions

and checking inside again for NULL, How about just adding check

in the caller itself? Or if you prefer the current method, then try to

explain the details of input in the function header and more description.

+ /* Increase prefetch target if it's not yet at the max. */

+ if (node->prefetch_target < node->prefetch_maximum)

I didn't evaluate all scenarios, but the above code may have a problem,

In case of parallel mode the the prefetch_target is fetched from node

and not from the pbminfo. Later it gets from the pminfo and update that.

May be this code needs to rewrite.

+ TBMIterator *iterator = node->prefetch_iterator;

Why another variable? Why can't we use the prefetch_iterator itself?

Currently node->tbmiterator and node->prefetch_iterator are initialized

irrespective of whether is it a parallel one or not. But while using, there

is a check to use the parallel one in case of parallel. if it is the case,

why can't we avoid the initialization itself?

+ else

+ needWait = false;

By default needWait is false. Just set that to true only for the

case of PBM_INPROGRESS

+ * so that during free we can directly get the dsa_pointe and free it.

dsa_pointe -> dsa_pointer

+typedef struct

+ TIDBitmap *tbm; /* TIDBitmap we're iterating over */

+ int spageptr; /* next spages index */

+ int schunkptr; /* next schunks index */

+ int schunkbit; /* next bit to check in current schunk */

+ TBMIterateResult output; /* MUST BE LAST (because variable-size) */

+} TBMIterator;

I didn't find the need of moving this structure. Can you point me where it

is used.

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From

Michael Paquier

Date:

31 January 2017, 09:54:25

On Tue, Jan 31, 2017 at 11:17 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar <dilipbalaut@gmail.com>
>> wrote:
>> > I have changed as per the comments. 0002 and 0003 are changed, 0001 is
>> > still the same.
>>
>> There is just one line change in 0003 compared to older version, all
>> other patches are the same.
>
> Thanks for the update. I have some comments

This review is too fresh to be addressed, so I have moved this patch
to the next CF.
-- 
Michael

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

31 January 2017, 14:05:57

On Tue, Jan 31, 2017 at 7:47 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Thanks for the update. I have some comments
>
Thanks for the review.

>
> 0002-hash-support-alloc-free-v14.patch:
>
>
> + if (tb->alloc)
> + {
>
> The memory for tb->alloc is allocated always, is the if check still
> required?

In parallel case, only first worker will call SH_CREATE, other worker
will only do palloc for pagetable and copy the reference from main
worker, so they will not have allocator.


> 0003-parallel-bitmap-heap-scan-v14.patch:
>
>
> + * and set parallel flag in lower level bitmap index scan. Later
> + * bitmap index node will use this flag to indicate tidbitmap that
> + * it needs to create an shared page table.
> + */
>
> I feel better to mention, where this flag is used, so that it will be more
> clear.

Done
>
>
> + * Mark tidbitmap as shared and also store DSA area in it.
> + * marking tidbitmap as shared is indication that this tidbitmap
> + * should be created in shared memory. DSA area will be used for
>
> The flag of shared is set in tidbitmap structure itself, but the
> comment is mentioned that tidbitmpa should be created in shared memory.
> I think that is the page table that needs to be created in shared memory.
> Providing some more information here will be helpful.
>
Done

>
> - node->tbmres = tbmres = tbm_iterate(tbmiterator);
> + node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,
> + pbminfo ? &pbminfo->tbmiterator : NULL);
>
> Instead Passing both normal and paralle iterators to the functions
> and checking inside again for NULL, How about just adding check
> in the caller itself? Or if you prefer the current method, then try to
> explain the details of input in the function header and more description.
>
Done as you said.

>
> + /* Increase prefetch target if it's not yet at the max. */
> + if (node->prefetch_target < node->prefetch_maximum)
>
> I didn't evaluate all scenarios, but the above code may have a problem,
> In case of parallel mode the the prefetch_target is fetched from node
> and not from the pbminfo. Later it gets from the pminfo and update that.
> May be this code needs to rewrite.
>
Fixed it.

>
> + TBMIterator *iterator = node->prefetch_iterator;
>
> Why another variable? Why can't we use the prefetch_iterator itself?
> Currently node->tbmiterator and node->prefetch_iterator are initialized
> irrespective of whether is it a parallel one or not. But while using, there
> is a check to use the parallel one in case of parallel. if it is the case,
> why can't we avoid the initialization itself?

Fixed
>
>
> + else
> + needWait = false;
>
> By default needWait is false. Just set that to true only for the
> case of PBM_INPROGRESS
>

Actually inside the while loop, suppose first we set needWait = true,
if PBM_INPROGRESS and got for ConditionalSleep, After it come out of
sleep, we need to check that PBM_FINISHED is set or we need to sleep
again, So in such case we need to reset it to needWait=false. However
this can be done by directly returning if it's PBM_FINISHED. But I
want to keep below the logic common.
+ SpinLockRelease(&pbminfo->state_mutex);
+
+ /* If we are leader or leader has already created a TIDBITMAP */
+ if (leader || !needWait)
+ break;

> + * so that during free we can directly get the dsa_pointe and free it.
>
> dsa_pointe -> dsa_pointer
Done

>
>
> +typedef struct
> +{
> + TIDBitmap  *tbm; /* TIDBitmap we're iterating over */
> + int spageptr; /* next spages index */
> + int schunkptr; /* next schunks index */
> + int schunkbit; /* next bit to check in current schunk */
> + TBMIterateResult output; /* MUST BE LAST (because variable-size) */
> +} TBMIterator;
>
> I didn't find the need of moving this structure. Can you point me where it
> is used.

Because  pbms_parallel_iterate need to access this structure so I
moved it to tidbitmap.h

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-parallel-bitmap-heap-scan-v15.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

01 February 2017, 20:39:51

On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> 0002-hash-support-alloc-free-v14.patch:
>>
>>
>> + if (tb->alloc)
>> + {
>>
>> The memory for tb->alloc is allocated always, is the if check still
>> required?
>
> In parallel case, only first worker will call SH_CREATE, other worker
> will only do palloc for pagetable and copy the reference from main
> worker, so they will not have allocator.

When Andres wrote in
https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
that you should try to share only the iteration arrays, I believe that
he meant the results of tbm_begin_iterate() -- that is,
iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
perhaps putting words into his mouth, but what he said was that we
could avoid sharing "the whole underlying hash".  But the patch set
you've implemented here doesn't behave that way.  Instead, you've got
the array containing the hash table elements shared (which is good)
plus there's a sort of dummy hash table in every worker which copies
some but not all of the members of the original hash table, leading to
the otherwise-unnecessary if-test that Haribabu is complaining about.
So the hash table is kinda-shared-kinda-not-shared, which I don't
*think* is really what Andres had in mind.

In short, I think SH_COPY (implemented here as pagetable_copy) needs
to go away.  The work of tbm_begin_iterate() should be done before we
begin the shared scan and the results of that work should be shared.
What happens right now (it appears) is that every backend does that
work based on the same hash table and we just assume they all get the
same answer.  And we really do assume that, because
pbms_parallel_iterate() assumes it can shuttle private state back and
forth between iterator in different backends and nothing will break;
but those backends aren't actually using the same iteration arrays.
They are using different iteration arrays that should have the same
contents because they were all derived from the same semi-shared hash
table.  That's pretty fragile, and a waste of CPU cycles if the hash
table is large (every backend does its own sort).

On a related note, I think it's unacceptable to make the internal
details of TBMIterator public.  You've moved it from tidbitmap.c to
tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
the TBMIterator, but that's not OK.  Those details need to remain
private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
need some better solution.  The knowledge of how a shared iterator
should iterate needs to be private to tidbitmap.c, not off in the
executor someplace.  And I think the entries need to actually be
updated in shared memory directly, not copied back and forth between a
backend-private iterator and a shared iterator.

Also, pbms_parallel_iterate() can't hold a spinlock around a call to
tbm_iterate().  Note the coding rules for spinlocks mentioned in
spin.h and src/backend/storage/lmgr/README.  I think the right thing
to do here is to use an LWLock in a new tranche (add an entry to
BuiltinTrancheIds).

In 0002, you have this:

+    tb->alloc = palloc(sizeof(SH_ALLOCATOR));

This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
Otherwise the allocator object is in a different context from
everything else in the hash table.  But TBH, I don't really see why we
want this to be a separate object.  Why not just put
HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
pointer chasing and doesn't really seem to cost anything (except that
SH_CREATE will grow a slightly longer argument sequence).

I think maybe we should rename the functions to element_allocate,
element_free, and element_allocator_ctx, or something like that.  The
current names aren't capitalized consistently with other things in
this module, and putting the word "element" in there would make it
more clear what the purpose of this thing is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

02 February 2017, 07:56:21

On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

Thanks for the review.

> When Andres wrote in
> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)

Here is my analysis why we selected this instead of just sharing the iterator:
----------------------------------------------------------------------------------------------------
The problem is that each entry of iteration array is just a pointer to
hash entry. So if we only shared iterator then iterator element will
be pointing to local memory of other workers. I thought of one more
option that during tbm_begin_iterate instead of keeping pointer to
hash entry, we can make a copy of that in DSA area, but that will have
2 copies of hash table element for short duration which is not good.
And, I think what we are doing currently is better than that.
 The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.
> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.

Actually, tbm_begin_iterate is processing the each element of the hash
table and converting to chunk and page array and currently, it’s done
by each worker and it’s pretty cheap. Suppose I try to do this only by
the first worker then implementation will look something like this.

1. First, we need to create a tbm->spages array and tbm->schunks array
and both should be the array of dsa_pointers.
2. Then each worker will process these array’s and will convert them
to the array of their local pointers.
3. With the current solution where all hash elements are stored in one
large dsa chunk, then how we are going to divide them into multiple
dsa pointers.

I will work on remaining comments.
 And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.
>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

05 February 2017, 17:48:51

On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Here are the latest version of the patch, which address all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away.  The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed the SH_COPY and now leader performs the
tbm_begin_shared_iterate before waking up the worker. Basically, the
leader will create the page and chunk array and that is the array of
the "relptr" (offlist, suggested by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.  And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c.
Fixed

 pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this, now there is new function called tbm_shared_iterate
which will directly iterate using shared iterator. So now no need to
copy member back and forth between local and shared iterator.

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
>
> In 0002, you have this:
>
> +    tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 00:03:43

On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think maybe we should rename the functions to element_allocate,
>> element_free, and element_allocator_ctx, or something like that.  The
>> current names aren't capitalized consistently with other things in
>> this module, and putting the word "element" in there would make it
>> more clear what the purpose of this thing is.
>
> Fixed as per the suggestion

Eh, not really.  You changed the memory context to be called
element_allocator_ctx, rather than changing the args passed to the
element allocator to have that name, which is what I had in mind.

I did some assorted renaming and other cosmetic improvements and committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Jeff Janes

Date:

08 February 2017, 00:13:43

On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think maybe we should rename the functions to element_allocate,
>> element_free, and element_allocator_ctx, or something like that. The
>> current names aren't capitalized consistently with other things in
>> this module, and putting the word "element" in there would make it
>> more clear what the purpose of this thing is.
>
> Fixed as per the suggestion

Eh, not really. You changed the memory context to be called
element_allocator_ctx, rather than changing the args passed to the
element allocator to have that name, which is what I had in mind.

I did some assorted renaming and other cosmetic improvements and committed 0002.

I'm getting compiler errors:

In file included from execGrouping.c:47:

../../../src/include/lib/simplehash.h:91: error: redefinition of typedef 'simplehash_allocate'

../../../src/include/lib/simplehash.h:91: note: previous declaration of 'simplehash_allocate' was here

../../../src/include/lib/simplehash.h:92: error: redefinition of typedef 'simplehash_free'

../../../src/include/lib/simplehash.h:92: note: previous declaration of 'simplehash_free' was here

gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)

Cheers,

Jeff

Re: [HACKERS] Parallel bitmap heap scan

From

Andres Freund

Date:

08 February 2017, 00:15:37

On 2017-02-07 16:03:43 -0500, Robert Haas wrote:
> On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> I think maybe we should rename the functions to element_allocate,
> >> element_free, and element_allocator_ctx, or something like that.  The
> >> current names aren't capitalized consistently with other things in
> >> this module, and putting the word "element" in there would make it
> >> more clear what the purpose of this thing is.
> >
> > Fixed as per the suggestion
> 
> Eh, not really.  You changed the memory context to be called
> element_allocator_ctx, rather than changing the args passed to the
> element allocator to have that name, which is what I had in mind.
> 
> I did some assorted renaming and other cosmetic improvements and committed 0002.

FWIW, I think it'd have been better to not add the new callbacks as
parameters to *_create(), but rather have them be "templatized" like the
rest of simplehash.  That'd require that callback to check the context,
to know whether it should use shared memory or not, but that seems fine
to me.  Right now this pushes the head of simplehash above a
cacheline...

I'm also doubtful about the naming of the default callbacks:
+/* default memory allocator function */
+static void *
+SH_DEFAULT_ALLOC(Size size, void *args)
+{
+   MemoryContext context = (MemoryContext) args;
+
+   return MemoryContextAllocExtended(context, size,
+                                     MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
+}

SH_DEFAULT_ALLOC sounds like it's a name that's one of the prefixed
names (like SH_CREATE actually becomes pagetable_create and such) - but
afaics it's not.  Which afaics means that this'll generate symbol
conflicts if one translation unit uses multiple simplehash.h style
hashes.  Afaics these should either be prefixed, or static inline
functions.

- Andres

Re: [HACKERS] Parallel bitmap heap scan

From

Andres Freund

Date:

08 February 2017, 00:18:45

On 2017-02-07 13:13:43 -0800, Jeff Janes wrote:
> On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >> I think maybe we should rename the functions to element_allocate,
> > >> element_free, and element_allocator_ctx, or something like that.  The
> > >> current names aren't capitalized consistently with other things in
> > >> this module, and putting the word "element" in there would make it
> > >> more clear what the purpose of this thing is.
> > >
> > > Fixed as per the suggestion
> >
> > Eh, not really.  You changed the memory context to be called
> > element_allocator_ctx, rather than changing the args passed to the
> > element allocator to have that name, which is what I had in mind.
> >
> > I did some assorted renaming and other cosmetic improvements and committed
> > 0002.
> >
>
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Oh yea, that too - you can't just redefine a typedef like that according
to C89 :(

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 00:19:48

On Tue, Feb 7, 2017 at 4:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Thanks, I'll stick an #ifdef guard around that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 00:36:55

On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think it'd have been better to not add the new callbacks as
> parameters to *_create(), but rather have them be "templatized" like the
> rest of simplehash.  That'd require that callback to check the context,
> to know whether it should use shared memory or not, but that seems fine
> to me.  Right now this pushes the head of simplehash above a
> cacheline...

Something like the attached?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

rework-simplehash-allocator.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Andres Freund

Date:

08 February 2017, 00:45:18

On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I think it'd have been better to not add the new callbacks as
> > parameters to *_create(), but rather have them be "templatized" like the
> > rest of simplehash.  That'd require that callback to check the context,
> > to know whether it should use shared memory or not, but that seems fine
> > to me.  Right now this pushes the head of simplehash above a
> > cacheline...
> 
> Something like the attached?

Yes.

> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
> +

That should probably be documented in the file header.


Thanks!

Andres

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 01:14:19

On Tue, Feb 7, 2017 at 4:45 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
>> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
>> > FWIW, I think it'd have been better to not add the new callbacks as
>> > parameters to *_create(), but rather have them be "templatized" like the
>> > rest of simplehash.  That'd require that callback to check the context,
>> > to know whether it should use shared memory or not, but that seems fine
>> > to me.  Right now this pushes the head of simplehash above a
>> > cacheline...
>>
>> Something like the attached?
>
> Yes.
>
>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>> +
>
> That should probably be documented in the file header.

Right.  OK, did that and a few other cleanups, and committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 05:37:30

On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Here are the latest version of the patch, which address all the
> comments given by Robert.

I think it would be useful to break the remaining patch into two
parts, one that enhances the tidbitmap.c API and another that uses
that to implement Parallel Bitmap Heap Scan.  In this review I'm going
to focus on the changes to tidbitmap.c and tidbitmap.h.

+extern void tbm_restore_shared_members(TIDBitmap *tbm,
+                           ParallelTBMInfo * stbm);
+extern void tbm_update_shared_members(TIDBitmap *tbm,
+                          ParallelTBMInfo * parallel_tbm);
+void        tbm_set_parallel(TIDBitmap *tbm, void *area);
+extern Size tbm_estimate_parallel_tbminfo(Size size);
+TIDBitmap  *tbm_attach(ParallelTBMInfo * parallel_tbm, void *area);
+TBMIterator *tbm_begin_shared_iterate(TIDBitmap *tbm,
+                         ParallelTBMInfo * parallel_tbm, bool prefetch);
+TBMIterateResult *tbm_shared_iterate(TBMIterator *iterator);
+void        tbm_init_shared_iterator(ParallelTBMInfo * tbminfo);

There are several cosmetic problems here.  You may have noticed that
all function prototypes in PostgreSQL header files are explicitly
declared extern; yours should be, too.  Also, there is extra
whitespace before some of the variable names here, like
"ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
that is due to pgindent, the solution is to add the typedef names to
your local typedef list.  Also, tbm_restore_shared_members doesn't
actually exist.

More broadly, this seems like an extremely complicated API.  Even
ignoring the function that doesn't exist, that's still 7 different
functions just for shared iteration, which seems like a lot.  I
suggest the following API:

1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
NULL, we allocate a backend-private memory for the hash entries as
normal.  If it's true, we use the dsa_area to store the hash entries,
using the infrastructure added by your 0002 and revised in
c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
BitmapIndexScan and BitmapOr to decide whether to pass NULL or
es_query_dsa.)

2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
*tbm) which allocates shared iteration arrays using the DSA passed to
tbm_create() and returns a dsa_pointer to the result.  Arrange this so
that if it's called more than once, each call returns a separate
iterator, so that you can call it once to get the main iterator and a
second time for the prefetch iterator, but have both of those point to
the same underlying iteration arrays.

3. Add a new function TBMSharedIterator
*tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
once per backend and gets passed the dsa_pointer from the previous
step.

4. Add a new function TBMIterateResult
*tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.

As compared with your proposal, this involves only 3 functions instead
of 7 (plus one new argument to another function), and I think it's a
lot clearer what those functions are actually doing.  You don't need
tbm_estimate_parallel_tbminfo() any more because the data being passed
from one backend to another is precisely a dsa_pointer, and the bitmap
scan can just leave space for that in its own estimator function.  You
don't need tbm_update_shared_members() separately from
tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
because tbm_prepare_shared_iterate() can do all that stuff.  You don't
need tbm_set_parallel() because the additional argument to
tbm_create() takes care of that need.

The way you've got ParallelTBMInfo set up right now doesn't respect
the separation of concerns between nodeBitmapHeapscan.c and
tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
intends on creating two iterators, one of which is for prefetching.
The design above hopefully addresses that: each call to
tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
of shared iterator state, but tidbitmap.c does not need to know how
many times that will be called.

Apart from the above, this patch will need a rebase over today's
commits, and please make sure all functions you add have high-quality
header comments.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

08 February 2017, 09:59:25

On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

.Thanks for your input, I have few queries about these comments.

> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.

IIUC, tbm_prepare_shared_iterate will be called only by the leader,
for tbmiterator as well as for the prefetch_iterator. And,
tbm_attach_shared_iterate will be called by each backend and for both
the iterators.

IMHO, tbm_attach_shared_iterate should return TBMIterator with
reference to TBMSharedIterator inside it. The reason behind same is
that we can not keep TBMIterateResult inside TBMSharedIterator
otherwise, results will also go in shared memory but we want to have
local result memory for each worker so that other worker doesn't
disturb it.

Another option can be that we change the tbm_shared_iterate as explained below

TBMIterateResult * tbm_shared_iterate(TBMSharedIterator *,
TBMIterateResult *result).

Now, if result passed to this API is NULL then it will allocate the
memory for the result and that way we will have local result memory,
and if it's not NULL we will use this memory to store our results.
BitmapHeapNode already having a reference to the TBMIterateResult so
we should not have any problem in passing this reference to the
tbm_shared_iterate. I think this one looks better than what I
explained above.

Please suggest.

>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

08 February 2017, 12:20:16

On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

The new SH_CREATE(MemoryContext ctx, uint32 nelements) don't have any
option to supply arguments to it. Our callback functions need access
to TBM.

Is it expected that if the user of SH_CREATE who doesn't want to pass
a "MemoryContext" then we can pass arguments instead of ctx?

something like this ?
if (!tbm->dsa)      tbm->pagetable = pagetable_create(tbm->mcxt, 128);
else      tbm->pagetable = pagetable_create((MemoryContext)tbm, 128);

And, In allocation function, we can access this context and typecast to tbm?

As shown below.
static void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{   TIDBitmap  *tbm = pagetable->ctx;

So Is it expected to do like I explained above, or we missed to have
an arg parameter to SH_CREATE as well as in SH_TYPE structure or there
is some other way you have in mind?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

08 February 2017, 13:21:02

On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

I think we need to have prototype for the default allocator outside of
#ifndef SH_USE_NONDEFAULT_ALLOCATOR. Because the file e.g. tidbitmap.c
who wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the
allocator function definition but it can not have the declaration of
those function as that function take SH_TYPE as input and that will be
only defined once we include the simplehash.h.

So basically we can not declare prototype before including
simplehash.h for allocator. And, if we don't declare we will get
"implicit declaration warning" because simplehash itself is using
those functions.

The solution is simplehash.h, should always declare it, and provide
the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined.
Attached patch does that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

hash_allocate_fix.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 16:25:34

On Wed, Feb 8, 2017 at 5:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>>> +
>>>
>>> That should probably be documented in the file header.
>>
>> Right.  OK, did that and a few other cleanups, and committed.
>
> I think we need to have prototype for the default allocator outside of
> #ifndef SH_USE_NONDEFAULT_ALLOCATOR. Because the file e.g. tidbitmap.c
> who wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the
> allocator function definition but it can not have the declaration of
> those function as that function take SH_TYPE as input and that will be
> only defined once we include the simplehash.h.
>
> So basically we can not declare prototype before including
> simplehash.h for allocator. And, if we don't declare we will get
> "implicit declaration warning" because simplehash itself is using
> those functions.
>
> The solution is simplehash.h, should always declare it, and provide
> the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined.
> Attached patch does that.

Makes sense, will commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 16:31:00

On Wed, Feb 8, 2017 at 4:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> The new SH_CREATE(MemoryContext ctx, uint32 nelements) don't have any
> option to supply arguments to it. Our callback functions need access
> to TBM.
>
> Is it expected that if the user of SH_CREATE who doesn't want to pass
> a "MemoryContext" then we can pass arguments instead of ctx?

You can store whatever you want in SH_TYPE's private_data member.
SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
have access to that.  Hmm, but there's no way to get that set in
SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
private_data argument to SH_CREATE.  execGrouping.c could use that
instead of frobbing private_data directly:

-    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
-    hashtable->hashtab->private_data = hashtable;
+    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 16:44:16

On Wed, Feb 8, 2017 at 1:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> IIUC, tbm_prepare_shared_iterate will be called only by the leader,
> for tbmiterator as well as for the prefetch_iterator. And,
> tbm_attach_shared_iterate will be called by each backend and for both
> the iterators.

That's what I had in mind.

> IMHO, tbm_attach_shared_iterate should return TBMIterator with
> reference to TBMSharedIterator inside it. The reason behind same is
> that we can not keep TBMIterateResult inside TBMSharedIterator
> otherwise, results will also go in shared memory but we want to have
> local result memory for each worker so that other worker doesn't
> disturb it.

No, I don't agree.  I think TBMSharedIterator should be an unshared
structure created by tbm_attach_shared_iterate, which can internally
contain backend-private state like a TBMIterateResult, and which can
also contain a pointer to the shared-memory structure previously
created by tbm_prepare_shared_iterate.  That thing needs to be called
something other than a TBMSharedIterator, like TBMSharedIterationState
or something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

08 February 2017, 16:58:28

On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You can store whatever you want in SH_TYPE's private_data member.
> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
> have access to that.  Hmm, but there's no way to get that set in
> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
> private_data argument to SH_CREATE.  execGrouping.c could use that
> instead of frobbing private_data directly:
>
> -    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
> -    hashtable->hashtab->private_data = hashtable;
> +    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);

Okay, will go ahead as you suggested. Patch attached for the same.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

hash_create_fix.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 February 2017, 18:59:29

On Wed, Feb 8, 2017 at 8:58 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You can store whatever you want in SH_TYPE's private_data member.
>> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
>> have access to that.  Hmm, but there's no way to get that set in
>> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
>> private_data argument to SH_CREATE.  execGrouping.c could use that
>> instead of frobbing private_data directly:
>>
>> -    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
>> -    hashtable->hashtab->private_data = hashtable;
>> +    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);
>
> Okay, will go ahead as you suggested. Patch attached for the same.

Looks good to me.  If nobody has further ideas here, I'll push this
and your previous patch tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

09 February 2017, 23:05:10

On Wed, Feb 8, 2017 at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Looks good to me.  If nobody has further ideas here, I'll push this
> and your previous patch tomorrow.

Done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

11 February 2017, 09:41:30

On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the detailed review and your valuable feedback.

> I think it would be useful to break the remaining patch into two
> parts, one that enhances the tidbitmap.c API and another that uses
> that to implement Parallel Bitmap Heap Scan.

I have done that.

> There are several cosmetic problems here.  You may have noticed that
> all function prototypes in PostgreSQL header files are explicitly
> declared extern; yours should be, too.  Also, there is extra
> whitespace before some of the variable names here, like
> "ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
> that is due to pgindent, the solution is to add the typedef names to
> your local typedef list.  Also, tbm_restore_shared_members doesn't
> actually exist.

Fixed

> 1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
> NULL, we allocate a backend-private memory for the hash entries as
> normal.  If it's true, we use the dsa_area to store the hash entries,
> using the infrastructure added by your 0002 and revised in
> c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
> BitmapIndexScan and BitmapOr to decide whether to pass NULL or
> es_query_dsa.)
Done

>
> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.
>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.

I have tried to get these three API's as you explained with one
change. In tbm_attach_shared_iterate I need to pass TBM also because
each worker will create their own copy of TBM. Those workers need to
get the TBM-related information from the shared location. Even though
I try to access some of the information like chunk, npages from
directly shared location, but some other information like base pointer
to dsa element, RelptrPagetableEntry etc. should be local to each
worker (after conversion from dsa pointer to local pointer).

>
> As compared with your proposal, this involves only 3 functions instead
> of 7 (plus one new argument to another function), and I think it's a
> lot clearer what those functions are actually doing.  You don't need
> tbm_estimate_parallel_tbminfo() any more because the data being passed
> from one backend to another is precisely a dsa_pointer, and the bitmap
> scan can just leave space for that in its own estimator function.  You
> don't need tbm_update_shared_members() separately from
> tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
> because tbm_prepare_shared_iterate() can do all that stuff.  You don't
> need tbm_set_parallel() because the additional argument to
> tbm_create() takes care of that need.

Right

>
> The way you've got ParallelTBMInfo set up right now doesn't respect
> the separation of concerns between nodeBitmapHeapscan.c and
> tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
> intends on creating two iterators, one of which is for prefetching.
> The design above hopefully addresses that: each call to
> tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
> of shared iterator state, but tidbitmap.c does not need to know how
> many times that will be called.

Done
>
> Apart from the above, this patch will need a rebase over today's
> commits,

Done
and please make sure all functions you add have high-quality
> header comments.

I tried my best, please check the latest patch (0001).

Apart from those, there are some more changes.
1. I have moved the dsa_pointer and dsa_area declaration from dsa.h to
postgres .h, an independent patch is attached for the same. Because we
need to declare function headers with dsa_pointer in tidbitmap.h, but
tidbitmap.h also used in FRONTEND, therefore, this can not include
dsa.h (as it contains atomics.h).

2. I noticed there was one defect in my code related to handling the
TBM_ONE_PAGE, In initial version, there was no problem, but it got
introduced somewhere in intermediate versions.

I have fixed the same. There were two option to do that

1. Don’t switch to  TBM_ONE_PAGE in case of parallel mode (ideally
this can not happen only worst estimation can get us there) and
directly got to TBM_HASH
2. In shared information keep space for sharing a PagetableEntry.

I have implemented 2nd (in the initial versions I implemented with 1st).

Note: Patch is also rebased on top of guc_parallel_index_scan_v1.patch
from Parallel Index Scan thread[1]

[1]
https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Feb 14, 2017 at 12:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think it's acceptable (or necessary) to move the DSA
> definitions into postgres.h. Why do you think you need to do that,
> vs. just including dsa.h in a few more places?

I need to access dsa_pointer in tidbitmap.h, which is included from
FRONTEND as well. Now, problem is that dsa.h is including #include
"port/atomics.h", but atomic.h can not be included if FRONTEND is
defined.

#ifndef ATOMICS_H
#define ATOMICS_H
#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Is there any other solution to this ?

How about creating another header file with the parallel changes

and include it only in necessary places?

Following are my observations, while going through the patch.

+#if SIZEOF_DSA_POINTER == 4

+typedef uint32 dsa_pointer;

+#else

+typedef uint64 dsa_pointer;

+#endif

I feel the declaration of the above typdef can be moved into the

section above if we going with the current move into postgres.h

file.

+/*

+ * tbm_alloc_shared

+ *

+ * Callback function for allocating the memory for hashtable elements.

+ * It allocates memory from DSA if tbm holds a reference to a dsa.

+ */

+static inline void *

+pagetable_allocate(pagetable_hash *pagetable, Size size)

Function name and comments mismatch?

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

14 February 2017, 06:45:56

On Mon, Feb 13, 2017 at 8:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think it's acceptable (or necessary) to move the DSA
>> definitions into postgres.h.  Why do you think you need to do that,
>> vs. just including dsa.h in a few more places?
>
> I need to access dsa_pointer in tidbitmap.h, which is included from
> FRONTEND as well. Now, problem is that dsa.h is including #include
> "port/atomics.h", but atomic.h can not be included if FRONTEND is
> defined.
>
> #ifndef ATOMICS_H
> #define ATOMICS_H
> #ifdef FRONTEND
> #error "atomics.h may not be included from frontend code"
> #endif
>
> Is there any other solution to this ?

Well, any problem like this generally has a bunch of solutions, so
I'll say yes.  I spent a good chunk of today studying the issue and
started a new thread devoted specifically to it:

https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

14 February 2017, 08:18:09

On Tue, Feb 14, 2017 at 6:51 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> +#if SIZEOF_DSA_POINTER == 4
> +typedef uint32 dsa_pointer;
> +#else
> +typedef uint64 dsa_pointer;
> +#endif
>
> I feel the declaration of the above typdef can be moved into the
> section above if we going with the current move into postgres.h
> file.

Robert has already given the patch[1] for the same so now don't need
to do anything for this. Therefore, I included the dsa.h in
tidbitmap.h.

[1] https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com
>
> +/*
> + * tbm_alloc_shared
> + *
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
> + */
> +static inline void *
> +pagetable_allocate(pagetable_hash *pagetable, Size size)
>
> Function name and comments mismatch?

Fixed.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer?  All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode.  In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory.  So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
> think you should should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks".  "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> +    if (tbm->dsa == NULL)
> +        return pfree(pointer);
>
> Don't try to return a value of type void.  The correct spelling of
> this is { pfree(pointer); return; }.  Formatted appropriately across 4
> lines, of course.

Fixed

>
> +    /*
> +     * If TBM is in iterating phase that means pagetable is already created
> +     * and we have come here during tbm_free. By this time we are already
> +     * detached from the DSA because the GatherNode would have been shutdown.
> +     */
> +    if (tbm->iterating)
> +        return;
>
> This seems like something of a kludge, although not a real easy one to
> fix.  The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown.  It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it.  It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself.  So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
>      if (node == NULL)
>          return false;
>
> +    planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
>      switch (nodeTag(node))
>      {
>          case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
>              break;
>      }
>
> -    return planstate_tree_walker(node, ExecShutdownNode, NULL);
> +    return false;
>  }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c
> b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
>  void
>  ExecEndGather(GatherState *node)
>  {
> +    ExecEndNode(outerPlanState(node));      /* let children clean up first */
>      ExecShutdownGather(node);
>      ExecFreeExprContext(&node->ps);
>      ExecClearTuple(node->ps.ps_ResultTupleSlot);
> -    ExecEndNode(outerPlanState(node));
>  }
>
>  /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done
(gather_shutdown_children_first.patch does what you have mentioned
above and 0002 implement the BitmapHeapScanShutdown function).

>
> +    char       *ptr;
> +    dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
> +    tbm->dsa_data = dsaptr;
> +    ptr = dsa_get_address(tbm->dsa, dsaptr);
> +    memset(ptr, 0, size + sizeof(dsa_pointer));
> +    *((dsa_pointer *) ptr) = dsaptr;
>
> Hmm.  Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
> function.  This pattern seems likely to come up a lot, and I don't
> think we should require every caller to deal with it.
Done

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space.   You are already storing it in tbm->dsa_data and

Done with the help of two pointers as discussed in another mail [1].

[1] https://www.postgresql.org/message-id/CA%2BTgmoaWsCJ5L2du9i1RQaegaNLgOYF2KgRFu%2B7sUAeQc_xBFg%40mail.gmail.com

>
> It seems shaky to me that tbm->iterating can be set when we've created
> either a shared or a backend-private iterator.  For example,
> tbm_iterate() asserts that tbm->iterating is set as a way of checking
> that spages and schunks will be set.  But that's not guaranteed any
> more with these changes, because we might've built the shared version
> of the iteration arrays rather than the backend-private version.
> Maybe you ought to change it from a bool to an enum:
> TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED.  And
> then update all of the asserts and tests to check for the specific
> state they care about.

Done
>
> +        while (schunkbit < PAGES_PER_CHUNK)
> +        {
> +            int            wordnum = WORDNUM(schunkbit);
> +            int            bitnum = BITNUM(schunkbit);
> +
> +            if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
> +                break;
> +            schunkbit++;
> +        }
>
> How about refactoring this chunk of code into a static inline function
> and having both tbm_iterate() and tbm_shared_iterate() call it?
> Probably something like static inline void
> tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).

Good idea, done this way
>
> +        /* scan bitmap to extract individual offset numbers */
> +        ntuples = 0;
> +        for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
> +        {
> +            bitmapword    w = page->words[wordnum];
> +
> +            if (w != 0)
> +            {
> +                int            off = wordnum * BITS_PER_BITMAPWORD + 1;
> +
> +                while (w != 0)
> +                {
> +                    if (w & 1)
> +                        output->offsets[ntuples++] = (OffsetNumber) off;
> +                    off++;
> +                    w >>= 1;
> +                }
> +            }
> +        }
>
> Similarly, this looks like it could be refactored into a shared static
> inline function as well, instead of duplicating it.

Done.

Attached patches:

interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
which is used by 0001
gather_shutdown_childeren_first.patch : Processing the child node
before shuting down the gather, 0002 will use that part to shutdown
BitmapNode before gather.

0001: tidbimap change to provide the interfaces for shared iterator.
0002: actual parallel bitmap heap scan by using interfaces of 0001.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

16 February 2017, 23:31:07

On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001

Committed.  I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate().  Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread.  I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+    dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
+    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    dsa_pointer ptpages;        /* dsa_pointer to the page array */
+    dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+    int            spageptr;        /* next spages index */
+    int            schunkptr;        /* next schunks index */
+    int            schunkbit;        /* next bit to check in current schunk */
+    LWLock        lock;            /* lock to protect above members */
+    dsa_pointer pagetable;        /* dsa pointers to head of pagetable data */
+    dsa_pointer spages;            /* dsa pointer to page array */
+    dsa_pointer schunks;        /* dsa pointer to chunk array */
+    int            nentries;        /* number of entries in pagetable */
+    int            maxentries;        /* limit on same to meet maxbytes */
+    int            npages;            /* number of exact entries in
pagetable */
+    int            nchunks;        /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order.
Let's order it like this: nentries, maxentries, npages, nchunks,
pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
chunkptr, schunkbit.

+struct TBMSharedIterator
+{
+    PagetableEntry *ptbase;        /* pointer to the pagetable element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    int           *ptpages;        /* sorted exact page index list */
+    int           *ptchunks;        /* sorted lossy page index list */
+    TBMSharedIteratorState *state;        /* shared members */
+    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state?  It doesn't seem like we even use it for anything.  It's needed
to create each backend's TBMSharedIterator, but not afterwards, I
think.

In terms of ordering, I'd move "state" to the top of the structure,
just like "tbm" comes first in a TBMIterator.

+ * total memory consumption.  If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version.  Also, you need the word
"the" before the first one.

+    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator.  I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the
DSA passed to tbm_create, so that multiple workers can attach to it
and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that".  It's
basically redundant with what you already said.

+    dsa_pointer iterator;
+    TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the
state associated with the former.  They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

+    /*
+     * If we have a hashtable, create and fill the sorted page lists, unless
+     * we already did that for a previous iterator.  Note that the lists are
+     * attached to the bitmap not the iterator, so they can be used by more
+     * than one iterator.  However, we keep dsa_pointer to these in the shared
+     * iterator so that other workers can access them directly.
+     */

This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
really correct here now because (1) we're no longer trying to fake up
a TIDBitmap proper in every backend and (2) this code runs even if we
don't have a hashtable.  I think the comment should be something like
"If we're not already iterating, create and fill the sorted page
lists.  (If we are, the sorted page lists are already stored in the
TIDBitmap, and we can just reuse them.)"

+         * Create page list and chunk list using relptr so that we can share
+         * this information across multiple workers.

No relptrs any more.

+            tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));

Extra parentheses.

+                                         tbm->nchunks * (sizeof(int)));

Extra parentheses.

+         * If TBM status is TBM_HASH then iterate over the pagetable and

"If the TBM status is"...

+             * directly store it's index i.e. 0 in page array

s/it's/its/

Don't you also need some code here to handle the TBM_EMPTY case?

+    /*
+     * Store the TBM member in the shared state so that we can share them
+     * across multiple workers.
+     */
+    iterator_state->maxentries = tbm->maxentries;
+    iterator_state->nchunks = tbm->nchunks;
+    iterator_state->nentries = tbm->nentries;
+    iterator_state->npages = tbm->npages;
+    iterator_state->pagetable = tbm->dsapagetable;
+    iterator_state->spages = tbm->ptpages;
+    iterator_state->schunks = tbm->ptchunks;
+
+    /* Initialize the shared iterator state */
+    iterator_state->schunkbit = 0;
+    iterator_state->schunkptr = 0;
+    iterator_state->spageptr = 0;
+
+    /* Initialize the iterator lock */
+    LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);

Set all of the structure members in the same order that you declare them.

+ * tbm_extract_page_tuple - extract the tuple offsets from the page

s/the page/a page/

+ * Process the page bits to extract the tuple offsets and store them into
+ * TBMIterateResult.

This duplicates the preceding, a bit.  Maybe just "The extracted
offsets are stored into the TBMIterateResult".

+/*
+ *    tbm_advance_schunkbit
+ *
+ *    Advance the chunkbit to get the next page entry from the chunk
+ */

The formatting of this header comment is randomly different from the
preceding and following header comments.

I would change the argument to schunkbitp, declare a local variable
named schunkbit, and do int schunkbit = *schunkbitp at the top and
*schunkbitp = schunkbit at the bottom.

+ *    As above, but this will iterate using shared iterator which is shared
+ *    across multiple workers.  We need to acquire the iterator LWLock, before
+ *    accessing the shared members.

"using shared iterator which is shared" -> "using an iterator which is shared"
"multiple workers" -> "multiple processes"

+        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];

Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.

+        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
+        PagetableEntry *page = ptbase + ptpages[state->spageptr];
+        PagetableEntry *page = ptbase + ptpages[state->spageptr];

Similarly.

+ * tbm_end_shared_iterate - finish an iteration over a TIDBitmap

Maybe s/an iteration/a shared iteration/

+ * As above, but it frees the memory of TBMSharedIterator.

Instead of this, I'd write "This doesn't free any of the shared state
associated with the iterator, just our backend-private state."

+    PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
+    PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);

Again, try to use array dereference notation rather than pointer
arithmetic where you can.  Maybe:

PageTableEntry *base = (PageTableEntry *) arg;
PageTableEntry *lpage = &arg[*(int *) left];
etc.

+tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
+{
+    TBMSharedIterator *shared_iterator;
+    TBMSharedIteratorState *shared_state;

Again, variable naming issues.  Let's go with dsa_pointer dp,
TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
try to go through and always use the same variable name for the same
kind of object.  It's confusing when "iterator" sometimes means a
TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
sometimes a dsa_pointer.  Make it all consistent (and hopefully
logical).

+    shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
+                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));

Wrong sizeof.  (This isn't just a style issue - it's a bug!)

+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.

Inconsistent capitalization of "DSA" again.

Maybe change the whole comment to "Allocate memory for hashtable
elements, using DSA if available."

+    if (tbm->dsapagetable)

Test against InvalidDsaPointer.  Or maybe just get rid of the "if" and
do it unconditionally; if dsapagetable is InvalidDsaPointer then
dsapagetableold will be too, so no harm done.

+    LWTRANCHE_PARALLEL_TBM_ITERATOR,

Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
lwlock.c, and patch the docs with the new tranche name.

Despite this fairly extensive list of mostly-trivial cosmetic
problems, I think this is in fairly good shape now.  I think the basic
design is right, so once we get these loose ends tied up this should
be good to go in.

> 0002: actual parallel bitmap heap scan by using interfaces of 0001.

This needs a lot more extensive review than I quite have the time and
energy for right now, but here are a few comments.

+         <entry><literal>ParallelBitmapScan</></entry>

ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
you're waiting for the bitmap to be populated.

+         <entry>Waiting for leader backend to complete the TidBitmap.</entry>

Likewise, complete -> populate?

+    /* If any limit was set to zero, the user doesn't want a parallel scan. */

If we got zero?  There aren't multiple limits mentioned anywhere
nearby, so this reads oddly.

+        /*
+         * It may be possible to amortize some of the I/O cost, but probably
+         * not very much, because most operating systems already do aggressive
+         * prefetching.  For now, we assume that the disk run cost can't be
+         * amortized at all.
+         */

This looks like it was cut-and-pasted from the sequential scan case,
but it doesn't really apply here.  This presupposes we're relying on
OS prefetching, but in the case of a bitmap heap scan we do our own
prefetching.  Maybe change to "Since we do prefetching in both the
parallel and non-parallel cases, we don't expect to get substantially
more I/O parallelism by spreading the scan across multiple backends.
So, assume the disk run cost can't be amortized at all."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

18 February 2017, 20:15:40

On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed
>
> +    dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
> +    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    dsa_pointer ptpages;        /* dsa_pointer to the page array */
> +    dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way
>
> +typedef struct TBMSharedIteratorState
> +{
> +    int            spageptr;        /* next spages index */
> +    int            schunkptr;        /* next schunks index */
> +    int            schunkbit;        /* next bit to check in current schunk */
> +    LWLock        lock;            /* lock to protect above members */
> +    dsa_pointer pagetable;        /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;            /* dsa pointer to page array */
> +    dsa_pointer schunks;        /* dsa pointer to chunk array */
> +    int            nentries;        /* number of entries in pagetable */
> +    int            maxentries;        /* limit on same to meet maxbytes */
> +    int            npages;            /* number of exact entries in
> pagetable */
> +    int            nchunks;        /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> chunkptr, schunkbit.

Done
>
> +struct TBMSharedIterator
> +{
> +    PagetableEntry *ptbase;        /* pointer to the pagetable element array */
> +    dsa_area   *dsa;            /* reference to per-query dsa area */
> +    int           *ptpages;        /* sorted exact page index list */
> +    int           *ptchunks;        /* sorted lossy page index list */
> +    TBMSharedIteratorState *state;        /* shared members */
> +    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state?  It doesn't seem like we even use it for anything.  It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

>
> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.
>
> + * total memory consumption.  If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version.  Also, you need the word
> "the" before the first one.

Fixed, all such instances.

>
> +    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.
Fixed
>
> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator.  I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.
>
> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
>
> + * This will convert the pagetable hash into page and chunk array of the index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that".  It's
> basically redundant with what you already said.

Done
>
> +    dsa_pointer iterator;
> +    TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former.  They're both the same object.
> Maybe just "dp" and "istate".

Done
>
> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

Done
>
> +    /*
> +     * If we have a hashtable, create and fill the sorted page lists, unless
> +     * we already did that for a previous iterator.  Note that the lists are
> +     * attached to the bitmap not the iterator, so they can be used by more
> +     * than one iterator.  However, we keep dsa_pointer to these in the shared
> +     * iterator so that other workers can access them directly.
> +     */
>
> This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
> really correct here now because (1) we're no longer trying to fake up
> a TIDBitmap proper in every backend and (2) this code runs even if we
> don't have a hashtable.  I think the comment should be something like
> "If we're not already iterating, create and fill the sorted page
> lists.  (If we are, the sorted page lists are already stored in the
> TIDBitmap, and we can just reuse them.)"

Done
>
> +         * Create page list and chunk list using relptr so that we can share
> +         * this information across multiple workers.
>
> No relptrs any more.

Done
>
> +            tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));
>
> Extra parentheses.
>
> +                                         tbm->nchunks * (sizeof(int)));
>
> Extra parentheses.
Fixed
>
> +         * If TBM status is TBM_HASH then iterate over the pagetable and
>
> "If the TBM status is"...
>
> +             * directly store it's index i.e. 0 in page array
>
> s/it's/its/
Done
>
> Don't you also need some code here to handle the TBM_EMPTY case?,

IMHO, we don't need to do anything for TBM_EMPTY because npages and
nchunks will be zero so iterator will handle such cases (same as it's
done for non-parallel case.)

>
> +    /*
> +     * Store the TBM member in the shared state so that we can share them
> +     * across multiple workers.
> +     */
> +    iterator_state->maxentries = tbm->maxentries;
> +    iterator_state->nchunks = tbm->nchunks;
> +    iterator_state->nentries = tbm->nentries;
> +    iterator_state->npages = tbm->npages;
> +    iterator_state->pagetable = tbm->dsapagetable;
> +    iterator_state->spages = tbm->ptpages;
> +    iterator_state->schunks = tbm->ptchunks;
> +
> +    /* Initialize the shared iterator state */
> +    iterator_state->schunkbit = 0;
> +    iterator_state->schunkptr = 0;
> +    iterator_state->spageptr = 0;
> +
> +    /* Initialize the iterator lock */
> +    LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);
>
> Set all of the structure members in the same order that you declare them.

Done.
>
> + * tbm_extract_page_tuple - extract the tuple offsets from the page
>
> s/the page/a page/
>
> + * Process the page bits to extract the tuple offsets and store them into
> + * TBMIterateResult.
>
> This duplicates the preceding, a bit.  Maybe just "The extracted
> offsets are stored into the TBMIterateResult".

Done
>
> +/*
> + *    tbm_advance_schunkbit
> + *
> + *    Advance the chunkbit to get the next page entry from the chunk
> + */
>
> The formatting of this header comment is randomly different from the
> preceding and following header comments.

Seems like different function in the file are using different
formatting. But done as near by functions.
>
> I would change the argument to schunkbitp, declare a local variable
> named schunkbit, and do int schunkbit = *schunkbitp at the top and
> *schunkbitp = schunkbit at the bottom.
>

Fixed

> + *    As above, but this will iterate using shared iterator which is shared
> + *    across multiple workers.  We need to acquire the iterator LWLock, before
> + *    accessing the shared members.
>
> "using shared iterator which is shared" -> "using an iterator which is shared"
> "multiple workers" -> "multiple processes"

Done
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
>
> Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.
>
> +        PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
> +        PagetableEntry *page = ptbase + ptpages[state->spageptr];
>
> Similarly.

Done
>
> + * tbm_end_shared_iterate - finish an iteration over a TIDBitmap
>
> Maybe s/an iteration/a shared iteration/
>
> + * As above, but it frees the memory of TBMSharedIterator.
>
> Instead of this, I'd write "This doesn't free any of the shared state
> associated with the iterator, just our backend-private state."

Done
>
> +    PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
> +    PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);
>
> Again, try to use array dereference notation rather than pointer
> arithmetic where you can.  Maybe:
>
> PageTableEntry *base = (PageTableEntry *) arg;
> PageTableEntry *lpage = &arg[*(int *) left];
> etc.
Done

>
> +tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
> +{
> +    TBMSharedIterator *shared_iterator;
> +    TBMSharedIteratorState *shared_state;
>
> Again, variable naming issues.  Let's go with dsa_pointer dp,
> TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
> try to go through and always use the same variable name for the same
> kind of object.  It's confusing when "iterator" sometimes means a
> TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
> sometimes a dsa_pointer.  Make it all consistent (and hopefully
> logical).
Done
>
> +    shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
> +                                 MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
>
> Wrong sizeof.  (This isn't just a style issue - it's a bug!)

Right, it's a bug. Fixed.
>
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
>
> Inconsistent capitalization of "DSA" again.
>
> Maybe change the whole comment to "Allocate memory for hashtable
> elements, using DSA if available."

Fixe
>
> +    if (tbm->dsapagetable)
>
> Test against InvalidDsaPointer.  Or maybe just get rid of the "if" and
> do it unconditionally; if dsapagetable is InvalidDsaPointer then
> dsapagetableold will be too, so no harm done.

Right
>
> +    LWTRANCHE_PARALLEL_TBM_ITERATOR,
>
> Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
> lwlock.c, and patch the docs with the new tranche name.

Fixed
>
> Despite this fairly extensive list of mostly-trivial cosmetic
> problems, I think this is in fairly good shape now.  I think the basic
> design is right, so once we get these loose ends tied up this should
> be good to go in.
>
>> 0002: actual parallel bitmap heap scan by using interfaces of 0001.
>
> This needs a lot more extensive review than I quite have the time and
> energy for right now, but here are a few comments.
>
> +         <entry><literal>ParallelBitmapScan</></entry>
>
> ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
> you're waiting for the bitmap to be populated.
Done
>
> +         <entry>Waiting for leader backend to complete the TidBitmap.</entry>
>
> Likewise, complete -> populate?
Done
>
> +    /* If any limit was set to zero, the user doesn't want a parallel scan. */
>
> If we got zero?  There aren't multiple limits mentioned anywhere
> nearby, so this reads oddly.

Removed
>
> +        /*
> +         * It may be possible to amortize some of the I/O cost, but probably
> +         * not very much, because most operating systems already do aggressive
> +         * prefetching.  For now, we assume that the disk run cost can't be
> +         * amortized at all.
> +         */
>
> This looks like it was cut-and-pasted from the sequential scan case,
> but it doesn't really apply here.  This presupposes we're relying on
> OS prefetching, but in the case of a bitmap heap scan we do our own
> prefetching.  Maybe change to "Since we do prefetching in both the
> parallel and non-parallel cases, we don't expect to get substantially
> more I/O parallelism by spreading the scan across multiple backends.
> So, assume the disk run cost can't be amortized at all."

Removed this comment.


Apart from these, there are some more changes.
in 0001:
- Improved the comments for TBMSharedIteratorState and TBMSharedIterator.
- Added error handling for return pointer from dsa_allocate.
- Moved function declaration in tidbitmap.h (functionaly related are
kept together)

in 0002:
- Improved comments.
- Code refactoring in BitmapHeapNext.
- Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
it's of no use.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Feb 20, 2017 at 11:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> So basically, what I want to propose is that Only during
>>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>>> that time we can be sure that all the workers have completed there
>>> task and we are safe to free. (And we don't free any DSA memory at
>>> ExecEndBitmapHeapScan).
>>
>> I think this works.
>
> OK.

In my latest version of the patch, I have fixed it as described above
i.e only cleanup in ExecReScanBitmapHeapScan.

For getting this there is some change in both the patches.

0001:
1. I have got a new interface, "tbm_free_shared_area(dsa_area *dsa,
dsa_pointer dp)"  which will be responsible for freeing all the shared
members (pagetable, ptpage and ptchunk).  Actually, we can not do this
in tbm_free itself because the only leader will have a local tbm with
reference to all these pointers and our parallel bitmap leader may not
necessarily be the actual backend.

If we want to get this done using tbm, then we need to create a local
tbm in each worker and get the shared member reference copied into tbm
using tbm_attach_shared_iterate like we were doing in the earlier
version.

2. Now tbm_free is not freeing any of the shared members which can be
accessed by other worker so tbm_free is safe to call from
ExecEndBitmapHeapScan without any safety check or ref count.

0002:
1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
not freeing any shared member in ExecEndBitmapHeapScan.

2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
free the shared members of the TBM.
3. After that, we will free TBMSharedIteratorState what we allocated
using tbm_prepare_shared_iterate.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not entirely sure about the calling convention for
> tbm_free_shared_area() but the rest seems OK.

Changed
>
>> 2. Now tbm_free is not freeing any of the shared members which can be
>> accessed by other worker so tbm_free is safe to call from
>> ExecEndBitmapHeapScan without any safety check or ref count.
>
> That also seems fine. We ended up with something very similar in the
> Parallel Index Scan patch.
>
>> 0002:
>> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
>> not freeing any shared member in ExecEndBitmapHeapScan.
>> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
>> free the shared members of the TBM.
>> 3. After that, we will free TBMSharedIteratorState what we allocated
>> using tbm_prepare_shared_iterate.
>
> Check.  But I think tbm_free_shared_area() should also free the object
> itself, instead of making the caller do that separately.

Right, done that way.
>
> +        if (DsaPointerIsValid(node->pstate->tbmiterator))
> +        {
> +            /* First we free the shared TBM members using the shared state */
> +            tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +            dsa_free(dsa, node->pstate->tbmiterator);
> +        }
> +
> +        if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +            dsa_free(dsa, node->pstate->prefetch_iterator);
>
> The fact that these cases aren't symmetric suggests that your
> abstraction is leaky.  I'm guessing that you can't call
> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice.  You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed.  (Maybe you
> need to reference-count the iterator arrays?)

Converted iterator arrays to structure and maintained the refcount. I
had to do the same thing for pagetable also because that is also
shared across iterator.

>
> +    if (node->inited)
> +        goto start_iterate;
>
> My first programming teacher told me not to use goto.  I've
> occasionally violated that rule, but I need a better reason than you
> have here.  It looks very easy to avoid.

Changed
>
> +            pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets.  Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.
Done
>
> Have you checked whether this patch causes any regression in the
> non-parallel case?  It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

With new patch I tested in my local machine, perform multiple
executions and it doesn't show any regression.  Attached file
perf_result contains the explain analyze output for one of the query
(head, patch with 0 workers and patch with 2 workers).  I will perform
the test with all the TPC-H queries which using bitmap plan on the
bigger machine and will post the results next week.
>
> @@ -48,10 +48,11 @@
>  #include "utils/snapmgr.h"
>  #include "utils/tqual.h"
>
> -
>  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
>
> Unnecessary.
Fixed
>
> +static bool pbms_is_leader(ParallelBitmapState *pbminfo);
> +static void pbms_set_parallel(PlanState *node);
>
> I don't think this "pbms" terminology is very good.  It's dissimilar
> to the way other functions in this file are named.  Maybe
> BitmapShouldInitializeSharedState().
Changed
>
> I think that some of the bits that this function makes conditional on
> pstate should be factored out into inline functions.  Like:
>
> -            if (node->prefetch_target >= node->prefetch_maximum)
> -                 /* don't increase any further */ ;
> -            else if (node->prefetch_target >= node->prefetch_maximum / 2)
> -                node->prefetch_target = node->prefetch_maximum;
> -            else if (node->prefetch_target > 0)
> -                node->prefetch_target *= 2;
> -            else
> -                node->prefetch_target++;
> +            if (!pstate)
> +            {
> +                if (node->prefetch_target >= node->prefetch_maximum)
> +                     /* don't increase any further */ ;
> +                else if (node->prefetch_target >= node->prefetch_maximum / 2)
> +                    node->prefetch_target = node->prefetch_maximum;
> +                else if (node->prefetch_target > 0)
> +                    node->prefetch_target *= 2;
> +                else
> +                    node->prefetch_target++;
> +            }
> +            else if (pstate->prefetch_target < node->prefetch_maximum)
> +            {
> +                SpinLockAcquire(&pstate->prefetch_mutex);
> +                if (pstate->prefetch_target >= node->prefetch_maximum)
> +                     /* don't increase any further */ ;
> +                else if (pstate->prefetch_target >=
> +                         node->prefetch_maximum / 2)
> +                    pstate->prefetch_target = node->prefetch_maximum;
> +                else if (pstate->prefetch_target > 0)
> +                    pstate->prefetch_target *= 2;
> +                else
> +                    pstate->prefetch_target++;
> +                SpinLockRelease(&pstate->prefetch_mutex);
> +            }
>
> I suggest creating an inline function like BitmapAdjustPrefetch() for
> this logic, and letting the code call that.  The function can look
> something like this: if (pstate == NULL) { non-parallel stuff; return;
> } parallel stuff follows...
>
> And similarly for the other cases where you've made the logic
> conditional.  This will make it more clear what's happening
> post-patch, I think, and will also help keep the level of indentation
> from getting out-of-control in certain places.  In fact, maybe you
> should submit a preliminary refactoring patch that moves these chunks
> of logic into functions and then the main patch can apply over top of
> that.

Done.

>
> +    bool                 inited;
>
> Suggest: initialized

Done
>
> - * ----------------
> + *        pscan_len           size of the shared memory for parallel bitmap
> + *        inited               is node is ready to iterate
> + *        stbmiterator       shared iterator
> + *        sprefetch_iterator shared iterator for prefetching
> + *        pstate               shared state for parallel bitmap scan
> + *----------------
>
> No need to change number of dashes.
Fixed
>
> + *     PBMState information : Current status of the TIDBitmap creation during
> + *                            parallel bitmap heap scan.
>
> If you look for existing places where comments are formatted like
> this, I bet you won't find many.  Copy the surrounding style more.

Done as surrounding structure, also changed the name to
SharedBitmapState, I think that is better name for the purpose.

>
> +    dsa_pointer    tbmiterator;
> +    dsa_pointer    prefetch_iterator;
> +    slock_t        prefetch_mutex;
> +    int            prefetch_pages;
> +    int            prefetch_target;
> +    bool        prefetching;
> +    slock_t        state_mutex;
> +    PBMState    state;
> +    ConditionVariable cv;
> +    char        phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
>
> I think it is probably not a good idea to have two separate mutexes
> here.  They'll be in the same cache line, so it won't be much faster
> than having one mutex, and the state mutex won't be acquired very
> often so you won't really gain anything anyway.  I think you can just
> merge the mutexes into one called 'mutex'.

Done.
>
> +                    /* Allow only one process to prefetch */
>
> If this is a good idea, there should be a comment explaining why.

Done
>
> +    TBMSharedIterator    *stbmiterator;
> +    TBMSharedIterator    *sprefetch_iterator;
>
> Maybe shared_iterator and shared_prefetch_iterator.
Done


0001- same as previous with some changes for freeing the shared memory stuff.
0002- nodeBitmapHeapScan refactoring, this applies independently
0003- actual parallel bitmap stuff applies on top of 0001 and 0002


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

02 March 2017, 16:22:39

On Tue, Feb 28, 2017 at 9:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.
> 0002- nodeBitmapHeapScan refactoring, this applies independently
> 0003- actual parallel bitmap stuff applies on top of 0001 and 0002

0002 wasn't quite careful enough about the placement of #ifdef
USE_PREFETCH, but otherwise looks OK.  Committed after changing that
and getting rid of the local variable prefetch_iterator, which seemed
to be adding rather than removing complexity after this refactoring.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

06 March 2017, 08:35:45

On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 0002 wasn't quite careful enough about the placement of #ifdef
> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
> and getting rid of the local variable prefetch_iterator, which seemed
> to be adding rather than removing complexity after this refactoring.

0003 is rebased after this commit.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-parallel-bitmap-heapscan-v7.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

07 March 2017, 19:04:04

On Mon, Mar 6, 2017 at 12:35 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 0002 wasn't quite careful enough about the placement of #ifdef
>> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
>> and getting rid of the local variable prefetch_iterator, which seemed
>> to be adding rather than removing complexity after this refactoring.
>
> 0003 is rebased after this commit.

You've still got this:

+               if (DsaPointerIsValid(node->pstate->tbmiterator))
+                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+
+               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+                       dsa_free(dsa, node->pstate->prefetch_iterator);

I'm trying to get to a point where both calls use
tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

07 March 2017, 19:09:31

On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +                       dsa_free(dsa, node->pstate->prefetch_iterator);

As per latest code, both should be calling to tbm_free_shared_area
because tbm_free_shared_area is capable of handling that. My silly
mistake. I will fix it.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

07 March 2017, 19:11:19

(On Tue, Feb 28, 2017 at 10:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.

+    if (--ptbase->refcount == 0)
+        dsa_free(dsa, istate->pagetable);
+
+    if (istate->spages)
+    {
+        ptpages = dsa_get_address(dsa, istate->spages);
+        if (--ptpages->refcount == 0)
+            dsa_free(dsa, istate->spages);
+    }
+    if (istate->schunks)
+    {
+        ptchunks = dsa_get_address(dsa, istate->schunks);
+        if (--ptchunks->refcount == 0)
+            dsa_free(dsa, istate->schunks);
+    }

This doesn't involve any locking, which I think will happen to work
with the current usage pattern but doesn't seem very robust in
general.  I think you either need the refcounts to be protected by a
spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
dsa_free(...) }

Otherwise, there's no guarantee it will get freed exactly once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

07 March 2017, 19:14:48

On Tue, Mar 7, 2017 at 11:09 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
>> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
>> +
>> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
>> +                       dsa_free(dsa, node->pstate->prefetch_iterator);
>
> As per latest code, both should be calling to tbm_free_shared_area
> because tbm_free_shared_area is capable of handling that. My silly
> mistake. I will fix it.

Thanks.  I don't think I believe this rationale:

+                               /*
+                                * If one of the process has already
identified that we need
+                                * to do prefetch then let it perform
the prefetch and allow
+                                * others to proceed with the work in
hand.  Another option
+                                * could be that we allow all of them
to participate in
+                                * prefetching.  But, most of this
work done under mutex or
+                                * LWLock so ultimately we may end up
in prefetching
+                                * sequentially.
+                                */

I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
And that's the slow part.  The tiny amount of time we spend updating
the prefetch information under the mutex should be insignificant
compared to the cost of actually reading the buffer.  Unless I'm
missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

07 March 2017, 19:27:36

On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
> And that's the slow part.  The tiny amount of time we spend updating
> the prefetch information under the mutex should be insignificant
> compared to the cost of actually reading the buffer.  Unless I'm
> missing something.

Okay, but IIUC, the PrefetchBuffer is an async call to load the buffer
if it's not already in shared buffer? so If instead of one process is
making multiple async calls to PrefetchBuffer, if we make it by
multiple processes will it be any faster?  Or you are thinking that at
least we can make BufTableLookup call parallel because that is not an
async call.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

07 March 2017, 19:37:41

On Tue, Mar 7, 2017 at 11:27 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
>> And that's the slow part.  The tiny amount of time we spend updating
>> the prefetch information under the mutex should be insignificant
>> compared to the cost of actually reading the buffer.  Unless I'm
>> missing something.
>
> Okay, but IIUC, the PrefetchBuffer is an async call to load the buffer
> if it's not already in shared buffer? so If instead of one process is
> making multiple async calls to PrefetchBuffer, if we make it by
> multiple processes will it be any faster?  Or you are thinking that at
> least we can make BufTableLookup call parallel because that is not an
> async call.

It's not about speed.  It's about not forgetting to prefetch.  Suppose
that worker 1 becomes the prefetch worker but then doesn't return to
the Bitmap Heap Scan node for a long time because it's busy in some
other part of the plan tree.  Now you just stop prefetching; that's
bad.  You want prefetching to continue regardless of which workers are
busy doing what; as long as SOME worker is executing the parallel
bitmap heap scan, prefetching should continue as needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

07 March 2017, 19:40:21

On Tue, Mar 7, 2017 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It's not about speed.  It's about not forgetting to prefetch.  Suppose
> that worker 1 becomes the prefetch worker but then doesn't return to
> the Bitmap Heap Scan node for a long time because it's busy in some
> other part of the plan tree.  Now you just stop prefetching; that's
> bad.  You want prefetching to continue regardless of which workers are
> busy doing what; as long as SOME worker is executing the parallel
> bitmap heap scan, prefetching should continue as needed.

Right, I missed this part. I will fix this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Parallel bitmap heap scan

From

Dilip Kumar

Date:

08 March 2017, 09:49:24

On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> It's not about speed.  It's about not forgetting to prefetch.  Suppose
>> that worker 1 becomes the prefetch worker but then doesn't return to
>> the Bitmap Heap Scan node for a long time because it's busy in some
>> other part of the plan tree.  Now you just stop prefetching; that's
>> bad.  You want prefetching to continue regardless of which workers are
>> busy doing what; as long as SOME worker is executing the parallel
>> bitmap heap scan, prefetching should continue as needed.
>
> Right, I missed this part. I will fix this.
I have fixed this part, after doing that I realised if multiple
processes are prefetching then it may be possible that in boundary
cases (e.g. suppose prefetch_target is 3 and prefetch_pages is at 2)
there may be some extra prefetch but finally those prefetched blocks
will be used.  Another, solution to this problem is that we can
increase the prefetch_pages in advance then call tbm_shared_iterate,
this will avoid extra prefetch.  But I am not sure what will be best
here.


On Tue, Mar 7, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> +    if (--ptbase->refcount == 0)
> +        dsa_free(dsa, istate->pagetable);
> +
> +    if (istate->spages)
> +    {
> +        ptpages = dsa_get_address(dsa, istate->spages);
> +        if (--ptpages->refcount == 0)
> +            dsa_free(dsa, istate->spages);
> +    }
> +    if (istate->schunks)
> +    {
> +        ptchunks = dsa_get_address(dsa, istate->schunks);
> +        if (--ptchunks->refcount == 0)
> +            dsa_free(dsa, istate->schunks);
> +    }
>
> This doesn't involve any locking, which I think will happen to work
> with the current usage pattern but doesn't seem very robust in
> general.  I think you either need the refcounts to be protected by a
> spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
> something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
> dsa_free(...) }
>
> Otherwise, there's no guarantee it will get freed exactly once.

Fixed


On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You've still got this:
>
> +               if (DsaPointerIsValid(node->pstate->tbmiterator))
> +                       tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +               if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +                       dsa_free(dsa, node->pstate->prefetch_iterator);
>
> I'm trying to get to a point where both calls use
> tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

Fixed

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Wed, Mar 8, 2017 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> How about adding a regression test?

Added
>
> bitmap_subplan_mark_shared could use castNode(), which seems like it
> would be better style.  Maybe some other places, too.
>
> +         <entry><literal>ParallelBitmapPopulate</></entry>
> +         <entry>Waiting for the leader to populate the TidBitmap.</entry>
> +        </row>
> +        <row>
>
> If you build the documentation, you'll find that this doesn't come out
> right; you need to add 1 to the value of the nearest preceding
> "morerows".  (I fixed a similar issue with 0001 while committing.)

Fixed
>
> +        /*---------------
> +         * Check the current state
> +         * If state is
> +         * BM_INITIAL     : We become the leader and set it to BM_INPROGRESS
> +         * BM_INPROGRESS : We need to wait till leader creates bitmap
> +         * BM_FINISHED   : bitmap is ready so no need to wait
> +         *---------------
>
> The formatting of this comment is slightly off - the comment for
> BM_INITIAL isn't aligned the same as the others.  But I would just
> delete the whole comment, since more or less it recapitulates the
> function header comment anyway.

Removed.
>
> I wonder if BitmapShouldInitializeSharedState couldn't be written a
> little more compactly overall, like this:
>
> {
>     SharedBitmapState   state;
>
>     while (1)
>     {
>         SpinLockAcquire(&pstate->mutex);
>         state = pstate->state;
>         if (pstate->state == BM_INITIAL)
>             pstate->state = BM_INPROGRESS;
>         SpinLockRelease(&pstate->mutex);
>
>         /* If we are leader or leader has already created a TIDBITMAP */
>         if (state != BM_INPROGRESS)
>             break;
>
>         /* Sleep until leader finishes creating the bitmap */
>         ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
>     }
>
>     ConditionVariableCancelSleep();
>
>     return (state == BM_INITIAL);
> }

This looks good, done this way

>
> +                /*
> +                 * By this time we have already populated the TBM and
> +                 * initialized the shared iterators so set the state to
> +                 * BM_FINISHED and wake up others.
> +                 */
> +                SpinLockAcquire(&pstate->mutex);
> +                pstate->state = BM_FINISHED;
> +                SpinLockRelease(&pstate->mutex);
> +                ConditionVariableBroadcast(&pstate->cv);
>
> I think it would be good to have a function for this, like
> BitmapDoneInitializingSharedState(), and just call that function here.

Done
>
> +            SpinLockAcquire(&pstate->mutex);
> +
> +            /*
> +             * Recheck under the mutex, If some other process has already done
> +             * the enough prefetch then we need not to do anything.
> +             */
> +            if (pstate->prefetch_pages >= pstate->prefetch_target)
> +                SpinLockRelease(&pstate->mutex);
> +                return;
> +            SpinLockRelease(&pstate->mutex);
>
> I think it would be clearer to write this as:
>
> SpinLockAcquire(&pstate->mutex);
> do_prefetch = (pstate->prefetch_pages >= pstate->prefetch_target);
> SpinLockRelease(&pstate->mutex);
> if (!do_prefetch)
>    return;
>
> Then it's more obvious what's going on with the spinlock.  But
> actually what I would do is also roll in the increment to prefetch
> pages in there, so that you don't have to reacquire the spinlock after
> calling PrefetchBuffer:
>
> bool do_prefetch = false;
> SpinLockAcquire(&pstate->mutex);
> if (pstate->prefetch_pages < pstate->prefetch_target)
> {
>     pstate->prefetch_pages++;
>     do_prefetch = true;
> }
> SpinLockRelease(&pstate->mutex);
>
> That seems like it will reduce the amount of excess prefetching
> considerably, and also simplify the code and cut the spinlock
> acquisitions by 50%.
>
Right, done that way
> Overall I think this is in pretty good shape.

Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-parallel-bitmap-heapscan-v9.patch

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 March 2017, 20:08:13

On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Right, done that way

This didn't compile because you bobbled some code in
src/backend/nodes, but it was a trivial mistake so I fixed it.

Committed with that fix and a bunch of minor cosmetic changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Jeff Janes

Date:

08 March 2017, 20:28:23

On Wed, Mar 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Right, done that way

This didn't compile because you bobbled some code in
src/backend/nodes, but it was a trivial mistake so I fixed it.

Committed with that fix and a bunch of minor cosmetic changes.

I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC):

nodeBitmapHeapscan.c: In function 'BitmapHeapNext':

nodeBitmapHeapscan.c:79: warning: 'tbmiterator' may be used uninitialized in this function

nodeBitmapHeapscan.c:80: warning: 'shared_tbmiterator' may be used uninitialized in this function

Cheers,

Jeff

Re: [HACKERS] Parallel bitmap heap scan

From

Tom Lane

Date:

08 March 2017, 20:45:22

Jeff Janes <jeff.janes@gmail.com> writes:
> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-17) (GCC):

Me too.  Fix pushed.
        regards, tom lane

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 March 2017, 20:47:28

On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>> 4.4.7-17) (GCC):
>
> Me too.  Fix pushed.

Thanks.  Sorry for the hassle; my compiler isn't as picky about this
as I would like, and apparently Dilip's isn't either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Tom Lane

Date:

08 March 2017, 20:53:23

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>>> 4.4.7-17) (GCC):

>> Me too.  Fix pushed.

> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
> as I would like, and apparently Dilip's isn't either.

Might be interesting to see whether -O level affects it.  In principle,
whether you get the warning should depend on how much the compiler has
analyzed the logic flow ...
        regards, tom lane

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 March 2017, 21:00:06

On Wed, Mar 8, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
>> as I would like, and apparently Dilip's isn't either.
>
> Might be interesting to see whether -O level affects it.  In principle,
> whether you get the warning should depend on how much the compiler has
> analyzed the logic flow ...

What I'm using is:

Configured with:
--prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

While I haven't experimented with this too extensively, my general
impression is that this thing is extremely tolerant of uninitialized
variables.  I just tried compiling nodeBitmapHeapscan.c with -Wall
-Werror and each of -O0, -O1, -O2, and -O3, and none of those produced
any warnings.  I've been reluctant to go to the hassle of installing a
different compiler...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel bitmap heap scan

From

Tom Lane

Date:

08 March 2017, 21:18:34

Robert Haas <robertmhaas@gmail.com> writes:
> What I'm using is:

> Configured with:
> --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 7.0.2 (clang-700.1.81)
> Target: x86_64-apple-darwin14.5.0
> Thread model: posix

Hm.  I noticed that longfin didn't spit up on it either, despite having
-Werror turned on.  That's a slightly newer version, but still Apple's
clang:

$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
        regards, tom lane

Re: [HACKERS] Parallel bitmap heap scan

From

Robert Haas

Date:

08 March 2017, 21:22:18

On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> What I'm using is:
>
>> Configured with:
>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 7.0.2 (clang-700.1.81)
>> Target: x86_64-apple-darwin14.5.0
>> Thread model: posix
>
> Hm.  I noticed that longfin didn't spit up on it either, despite having
> -Werror turned on.  That's a slightly newer version, but still Apple's
> clang:
>
> $ gcc -v
> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 8.0.0 (clang-800.0.42.1)
> Target: x86_64-apple-darwin16.4.0
> Thread model: posix
> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Yeah.  I think on my previous MacBook Pro, you could do this without
generating a warning:

int x;
printf("%d\n", x);

The compiler on this one detects that case, but that seems to be about
as far as it goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel bitmap heap scan

From

Rafia Sabih

Date:

27 March 2017, 10:23:33

On Wed, Mar 8, 2017 at 11:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> What I'm using is:
>>
>>> Configured with:
>>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>>> Apple LLVM version 7.0.2 (clang-700.1.81)
>>> Target: x86_64-apple-darwin14.5.0
>>> Thread model: posix
>>
>> Hm.  I noticed that longfin didn't spit up on it either, despite having
>> -Werror turned on.  That's a slightly newer version, but still Apple's
>> clang:
>>
>> $ gcc -v
>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 8.0.0 (clang-800.0.42.1)
>> Target: x86_64-apple-darwin16.4.0
>> Thread model: posix
>> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> Yeah.  I think on my previous MacBook Pro, you could do this without
> generating a warning:
>
> int x;
> printf("%d\n", x);
>
> The compiler on this one detects that case, but that seems to be about
> as far as it goes.

Recently, on testing TPC-H 300 scale factor I stumbled on to a error
for Q6, the test environment is as follows,
work_mem = 1GB,
shared_buffers = 10 GB,
Effective_cache_size = 10GB
random_page_cost = seq+page_cost =0.1

The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
In case required, stack trace is as follows,

#0  errfinish (dummy=0) at elog.c:415
#1  0x0000000010738820 in elog_finish (elevel=20, fmt=0x109115d0
"invalid DSA memory alloc request size %zu") at elog.c:1378
#2  0x0000000010778824 in dsa_allocate_extended (area=0x1001b53b138,
size=1610612740, flags=4) at dsa.c:676
#3  0x00000000103a3e60 in pagetable_allocate (pagetable=0x1001b52a590,
size=1610612736) at tidbitmap.c:1521
#4  0x00000000103a0488 in pagetable_grow (tb=0x1001b52a590,
newsize=33554432) at ../../../src/include/lib/simplehash.h:379
#5  0x00000000103a07b0 in pagetable_insert (tb=0x1001b52a590,
key=34962730, found=0x3fffc705ba48 "") at
../../../src/include/lib/simplehash.h:504
#6  0x00000000103a3354 in tbm_get_pageentry (tbm=0x1001b526fa0,
pageno=34962730) at tidbitmap.c:1235
#7  0x00000000103a1614 in tbm_add_tuples (tbm=0x1001b526fa0,
tids=0x1001b51d22a, ntids=1, recheck=0 '\000') at tidbitmap.c:425
#8  0x00000000100fdf24 in btgetbitmap (scan=0x1001b51c4a0,
tbm=0x1001b526fa0) at nbtree.c:460
#9  0x00000000100f2c48 in index_getbitmap (scan=0x1001b51c4a0,
bitmap=0x1001b526fa0) at indexam.c:726
#10 0x000000001033c7d8 in MultiExecBitmapIndexScan
(node=0x1001b51c180) at nodeBitmapIndexscan.c:91
#11 0x000000001031c0b4 in MultiExecProcNode (node=0x1001b51c180) at
execProcnode.c:611
#12 0x000000001033a8a8 in BitmapHeapNext (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:143
#13 0x000000001032a094 in ExecScanFetch (node=0x1001b5187a8,
accessMtd=0x1033a6c8 <BitmapHeapNext>, recheckMtd=0x1033bab8
<BitmapHeapRecheck>) at execScan.c:95
#14 0x000000001032a194 in ExecScan (node=0x1001b5187a8,
accessMtd=0x1033a6c8 <BitmapHeapNext>, recheckMtd=0x1033bab8
<BitmapHeapRecheck>) at execScan.c:162
#15 0x000000001033bb88 in ExecBitmapHeapScan (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:667

Please feel free to ask if any more information is required for this.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Re: Parallel bitmap heap scan

From

Dilip Kumar

Date:

27 March 2017, 12:02:03

On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> Recently, on testing TPC-H 300 scale factor I stumbled on to a error
> for Q6, the test environment is as follows,
> work_mem = 1GB,
> shared_buffers = 10 GB,
> Effective_cache_size = 10GB
> random_page_cost = seq+page_cost =0.1
>
> The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
> In case required, stack trace is as follows,

Thanks for reporting.  In pagetable_allocate, DSA_ALLOC_HUGE flag were
missing while allocating the memory from the DSA.  I have also handled
the NULL pointer return from dsa_get_address.

Please verify with the attached patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

bitmap_hugepage_fix.patch

Re: Parallel bitmap heap scan

From

Robert Haas

Date:

27 March 2017, 15:28:07

On Mon, Mar 27, 2017 at 5:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> Recently, on testing TPC-H 300 scale factor I stumbled on to a error
>> for Q6, the test environment is as follows,
>> work_mem = 1GB,
>> shared_buffers = 10 GB,
>> Effective_cache_size = 10GB
>> random_page_cost = seq+page_cost =0.1
>>
>> The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
>> In case required, stack trace is as follows,
>
> Thanks for reporting.  In pagetable_allocate, DSA_ALLOC_HUGE flag were
> missing while allocating the memory from the DSA.  I have also handled
> the NULL pointer return from dsa_get_address.
>
> Please verify with the attached patch.

Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
the ptbase == NULL check shouldn't be needed, because we're not
passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
be called from SH_CREATE, which isn't prepared for a NULL return
anyway.

Am I all wet?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel bitmap heap scan

From

Dilip Kumar

Date:

27 March 2017, 16:03:57

On Mon, Mar 27, 2017 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
> the ptbase == NULL check shouldn't be needed, because we're not
> passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
> be called from SH_CREATE, which isn't prepared for a NULL return
> anyway.
>
> Am I all wet?

Yes you are right that we are not passing DSA_ALLOC_NO_OOM, so
dsa_allocate_extended will return error in case if allocation failed.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com