Re: Gather Merge - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: Gather Merge
Date
Msg-id CAGPqQf3TsdOCTSHUBqR+K4jRYuU62Y+NepwcrCm0oe=n1kP1XA@mail.gmail.com
In response to Re: Gather Merge  (Rushabh Lathia <rushabh.lathia@gmail.com>)
Responses Re: Gather Merge  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Please find attached the latest patch, which fixes the review points and
includes additional clean-up:

- Got rid of funnel_slot, as it's not needed for Gather Merge
- Renamed fill_tuple_array to form_tuple_array
- Fixed a possible infinite loop in gather_merge_init (reported by Amit Kapila)
- Fixed the tqueue.c memory leak by calling TupleQueueReaderNext() under a per-tuple context
- Cleaned up code in ExecGatherMerge
- Added a new function gather_merge_clear_slots(), which clears out all gather
merge slots and also frees the tuple arrays at end of execution

I did the performance testing again with the latest patch and I haven't
observed any regression. Some of the TPC-H queries show an additional benefit
with the latest patch, though it is just under 5%.

Do let me know if I missed anything.


On Mon, Oct 24, 2016 at 11:55 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:


On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> There is lot of common code between ExecGatherMerge and ExecGather.
>> Do you think it makes sense to have a common function to avoid the
>> duplicity?
>>
>> I see there are small discrepancies between the two, e.g. I don't see
>> the use of the single_copy flag, which is present in the gather node.
>>
>
> Yes, even I thought to centralize some things between ExecGather and
> ExecGatherMerge, but that code is really not something that is settled yet,
> and I thought it might change, particularly for Gather Merge. And as
> explained by Robert, single_copy doesn't make sense for Gather Merge. I will
> still look into whether something can be centralized.
>

Okay, I haven't thought about it, but do let me know if you can't
find any way to merge the code.


Sure, I will look into this.
 
>>
>> 3.
>> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
>> {
>> ..
>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>> +
>> + /*
>> +  * try to read more tuples in nowait mode and store them into the
>> +  * tuple array.
>> +  */
>> + if (HeapTupleIsValid(tup))
>> + fill_tuple_array(gm_state, reader);
>>
>> How is the above read tuple stored in the array?  In any case, the above
>> interface seems slightly awkward to me. Basically, I think what you
>> are trying to do here is: after reading the first tuple in wait mode, you
>> are trying to fill the array by reading more tuples.  So, can't we
>> push reading of this first tuple into that function and name it
>> form_tuple_array()?
>>
>
> Yes, you are right.
>

You have not answered my first question.  I will try to ask again: how
is the tuple read by the below code stored in the array?


The tuple is stored directly into the related TupleTableSlot. At the end of
gather_merge_readnext(), the function builds the TupleTableSlot for the given
tuple, so the tuple read by the above code is stored directly into the
TupleTableSlot.
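For illustration, the division of labour described here could be modeled by a
minimal standalone sketch (ReaderState, read_nowait, and an int standing in
for a tuple are hypothetical simplifications of the actual PostgreSQL reader
state, TupleQueueReaderNext(), and HeapTuple — this is not the patch's code):

```c
#include <stdbool.h>

#define TUPLE_ARRAY_SIZE 10

typedef int Tuple;                 /* stand-in for HeapTuple */
#define INVALID_TUPLE (-1)

typedef struct
{
    Tuple slot;                    /* stand-in for the reader's TupleTableSlot */
    Tuple array[TUPLE_ARRAY_SIZE]; /* tuples buffered by nowait reads */
    int   navail;                  /* number of buffered tuples */
    int   next;                    /* next value the simulated queue returns */
    int   limit;                   /* queue is exhausted at this value */
} ReaderState;

/* Nowait read: may fail (INVALID_TUPLE) when nothing is ready. */
static Tuple
read_nowait(ReaderState *r)
{
    return (r->next < r->limit) ? r->next++ : INVALID_TUPLE;
}

/* Buffer additional tuples without blocking (the fill/form_tuple_array idea). */
static void
form_tuple_array(ReaderState *r)
{
    while (r->navail < TUPLE_ARRAY_SIZE)
    {
        Tuple t = read_nowait(r);

        if (t == INVALID_TUPLE)
            break;
        r->array[r->navail++] = t;
    }
}

/*
 * Read one tuple "in wait mode" and store it directly in the slot, then
 * opportunistically prefetch more tuples into the array for later calls.
 */
bool
gather_merge_readnext(ReaderState *r)
{
    Tuple t = read_nowait(r);      /* simulation: wait == nowait here */

    if (t == INVALID_TUPLE)
        return false;
    r->slot = t;                   /* first tuple goes straight into the slot */
    form_tuple_array(r);           /* extra tuples go into the array */
    return true;
}
```

The point of the sketch is only the data flow being discussed: the tuple read
in wait mode never passes through the array, it lands in the slot directly.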

>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);

> First it tries to read a tuple in wait mode, and once it finds a tuple it
> tries to fill the tuple array (which basically reads tuples in nowait mode).
> The reason I keep them separate is that, while initializing gather merge, if
> we are unable to read a tuple from all the workers, then on re-read we again
> try to fill the tuple array for the workers who have already produced at
> least a single tuple (see gather_merge_init() for more details).
>

Whenever any worker has produced one tuple, you already try to fill the
array in gather_merge_readnext(), so does the above code help much?

> Also, I thought filling the tuple array (which basically reads tuples in
> nowait mode) and reading a tuple (in wait mode) are two separate tasks, and
> the code looks clearer if they are in separate functions.

To me that looks slightly confusing.

> If you have any suggestion for the function name (fill_tuple_array),
> then I am open to changing it.
>
>

form_tuple_array (form_tuple is used at many places in code, so it
should look okay). 

Ok, I will rename it in the next patch.
 
I think you might want to consider forming the array
even for the leader; although it might not be as beneficial as for
workers, OTOH the code will get simplified if we do it that way.

Yes, I tried that earlier, and as you guessed it wasn't beneficial, so to
avoid the extra memory allocation for the tuple array I am not forming the
array for the leader.

Today, I observed another issue in code:

+gather_merge_init(GatherMergeState *gm_state)
{
..
+reread:
+	for (i = 0; i < nreaders + 1; i++)
+	{
+		if (TupIsNull(gm_state->gm_slots[i]) ||
+			gm_state->gm_slots[i]->tts_isempty)
+		{
+			if (gather_merge_readnext(gm_state, i, initialize ? false : true))
+			{
+				binaryheap_add_unordered(gm_state->gm_heap,
+										 Int32GetDatum(i));
+			}
+		}
+		else
+			fill_tuple_array(gm_state, i);
+	}
+	initialize = false;
+
+	for (i = 0; i < nreaders; i++)
+		if (TupIsNull(gm_state->gm_slots[i]) || gm_state->gm_slots[i]->tts_isempty)
+			goto reread;
..
}

This code can cause an infinite loop.  Consider a case where one of the
workers doesn't get any tuples because, by the time it starts, all the
tuples have been consumed by the other workers. The above code will keep
looping to fetch a tuple from that worker, whereas that worker can't
return any tuple.  I think you can fix it by checking whether the
particular queue has been exhausted.
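The suggested fix can be illustrated with a standalone sketch: track an
explicit "exhausted" flag per reader so the reread loop permanently skips
queues that can never return a tuple (Reader, try_read, and init_readers are
hypothetical simplified names, not the patch's code):

```c
#include <stdbool.h>

typedef struct
{
    bool done;      /* queue exhausted: never retry this reader */
    bool has_tuple; /* a tuple is buffered in this reader's slot */
    int  remaining; /* tuples the simulated worker can still produce */
} Reader;

/*
 * Try to read one tuple; mark the reader done when its queue is
 * exhausted instead of leaving it eligible for endless retries.
 */
static bool
try_read(Reader *r)
{
    if (r->remaining > 0)
    {
        r->remaining--;
        r->has_tuple = true;
        return true;
    }
    r->done = true;             /* the fix: remember exhaustion */
    return false;
}

/*
 * Returns the number of passes the retry loop took.  Without the done
 * flag, a worker that produces no tuples at all would make the loop
 * spin forever; with it, the loop terminates.
 */
int
init_readers(Reader readers[], int n)
{
    int  iterations = 0;
    bool pending;

    do
    {
        pending = false;
        iterations++;
        for (int i = 0; i < n; i++)
        {
            if (!readers[i].done && !readers[i].has_tuple)
                (void) try_read(&readers[i]);
        }
        /* only readers that are neither done nor filled force a reread */
        for (int i = 0; i < n; i++)
            if (!readers[i].done && !readers[i].has_tuple)
                pending = true;
    } while (pending);

    return iterations;
}
```

A reader that returns nothing is simply dropped from consideration, which is
the essence of the "check if the queue has been exhausted" suggestion.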


Oh yes. I will work on the fix and soon submit the next version of the patch.
 
>> >
>> > Open Issue:
>> >
>> > - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
>> > tqueue.c by calling gather_readnext() in a per-tuple context. For gather
>> > merge that is not possible, as we store tuples into the tuple array and
>> > want a tuple to be freed only after it has passed through the merge sort
>> > algorithm. One idea is to also call gm_readnext_tuple() under the
>> > per-tuple context (which will fix the leak in tqueue.c) and then store a
>> > copy of the tuple in the tuple array.
>> >
>> >
>>
>> Won't an extra copy per tuple impact performance?  Was the fix in the
>> mentioned commit for record or composite types?  If so, does GatherMerge
>> support such types, and if it does, does it provide any benefit over
>> Gather?
>>
>
> I don't think it was specifically for record or composite types, but I
> might be wrong. As per my understanding, the commit fixed a leak in
> tqueue.c.
>

Hmm, in tqueue.c, I think the memory leak was in the remapping logic; refer
to mail [1] from Tom ("Just to add insult to injury, the backend's memory
consumption bloats to something over 5.5G during that last query").

> The fix was to establish the convention of calling TupleQueueReaderNext()
> under a shorter-lived memory context, so that tqueue.c doesn't leak memory.
>
> My idea to fix this is to call TupleQueueReaderNext() with the per-tuple
> context, then copy the tuple and store it in the tuple array; later, the
> next run of ExecStoreTuple() will free the earlier tuple. I will work on
> that.
>
>

Okay, if you think that is viable, then you can do it, but do check
the performance impact, because an extra copy per fetched tuple
can hurt performance.


Sure, I will check the performance impact for the same.
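The per-tuple-context idea above could be sketched as a standalone model:
Arena stands in for a resettable per-tuple MemoryContext and read_tuple_into
for TupleQueueReaderNext() (none of these are PostgreSQL APIs). Reading
happens in the short-lived arena, the one tuple that must survive is copied
out, and resetting the arena reclaims everything else:

```c
#include <stdlib.h>
#include <string.h>

/* A trivial bump allocator modeling a resettable per-tuple context. */
typedef struct
{
    char   buf[4096];
    size_t used;
} Arena;

static void *
arena_alloc(Arena *a, size_t n)
{
    void *p = a->buf + a->used;

    a->used += n;
    return p;
}

static void
arena_reset(Arena *a)
{
    a->used = 0;                    /* "frees" everything at once */
}

/*
 * Simulated tuple read: allocates some per-read scratch plus the tuple
 * itself in the arena, the way a reader might while deforming a tuple.
 */
static char *
read_tuple_into(Arena *a, const char *payload)
{
    (void) arena_alloc(a, 128);     /* scratch that would otherwise leak */
    char *tup = arena_alloc(a, strlen(payload) + 1);

    strcpy(tup, payload);
    return tup;
}

/*
 * Read one tuple under the per-tuple arena, copy it into long-lived
 * storage (the extra copy being discussed), then reset the arena so
 * the scratch allocations cannot accumulate across tuples.
 */
char *
fetch_tuple_copy(Arena *per_tuple, const char *payload)
{
    char *tmp = read_tuple_into(per_tuple, payload);
    char *copy = malloc(strlen(tmp) + 1);   /* survives the reset */

    strcpy(copy, tmp);
    arena_reset(per_tuple);
    return copy;
}
```

The trade-off Amit raises is visible here: the reset guarantees no per-read
scratch survives, but each fetched tuple pays for one extra copy.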

 

[1] - https://www.postgresql.org/message-id/32763.1469821037%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Thanks,
Rushabh Lathia



--
Rushabh Lathia
