Re: Gather Merge - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: Gather Merge
Date
Msg-id CAGPqQf2cqGb9sAmZMhxAofLCwk1vBgCs5RsBp1EbYpRN_5CKbg@mail.gmail.com
In response to Re: Gather Merge  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Gather Merge
List pgsql-hackers


On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> There is a lot of common code between ExecGatherMerge and ExecGather.
>> Do you think it makes sense to have a common function to avoid the
>> duplication?
>>
>> I see there are small discrepancies between the two, like I don't see
>> the use of the single_copy flag, which is present in the gather node.
>>
>>
>
> Yes, I also thought about centralizing some things between ExecGather and
> ExecGatherMerge, but that code is not really settled yet, and I thought it
> might change, particularly for Gather Merge. And as explained by Robert,
> single_copy doesn't make sense for Gather Merge. I will still look into
> whether something can be centralized.
>

Okay, I haven't thought about it, but do let me know if you couldn't
find any way to merge the code.


Sure, I will look into this.
 
>>
>> 3.
>> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
>> {
>> ..
>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>> +
>> + /*
>> +  * try to read more tuple into nowait mode and store it into the tuple
>> +  * array.
>> +  */
>> + if (HeapTupleIsValid(tup))
>> + fill_tuple_array(gm_state, reader);
>>
>> How is the above read tuple stored in the array?  In any case, the above
>> interface seems slightly awkward to me. Basically, I think what you
>> are trying to do here is: after reading the first tuple in wait mode, you
>> are trying to fill the array by reading more tuples.  So, can't we
>> push the reading of this first tuple into that function and name it
>> form_tuple_array()?
>>
>
> Yes, you are right.
>

You have not answered my first question.  I will try to ask again: how
is the tuple read by the below code stored in the array:


The tuple gets stored directly into the related TupleTableSlot. At the end of
gather_merge_readnext(), the function builds the TupleTableSlot for the given
tuple, so the tuple read by the above code is stored directly into the
TupleTableSlot.

>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);

> First it tries to read a tuple in wait mode, and once it finds a tuple it
> tries to fill the tuple array (which basically reads tuples in nowait mode).
> The reason I keep it separate is that, while initializing the gather merge,
> if we are unable to read a tuple from all the workers, then while re-reading
> we again try to fill the tuple array for the workers that have already
> produced at least a single tuple (see gather_merge_init() for more details).
>

Whenever any worker produces one tuple, you already try to fill the
array in gather_merge_readnext(), so does the above code help much?

> Also, I thought that filling the tuple array (which basically reads tuples
> in nowait mode) and reading a tuple (in wait mode) are two separate tasks,
> and that the code looks clearer if they are in separate functions.

To me that looks slightly confusing.

> If you have any suggestion for the function name (fill_tuple_array),
> then I am open to changing it.
>

form_tuple_array (form_tuple is used in many places in the code, so it
should look okay).

Ok, I will rename it in the next patch.
 
I think you might want to consider forming the array
even for the leader; although it might not be as beneficial as for the
workers, OTOH, the code will get simplified if we do it that way.

Yes, I did that earlier, and as you guessed it was not really beneficial,
so to avoid the extra memory allocation for the tuple array I am not
forming the array for the leader.

Today, I observed another issue in code:

+gather_merge_init(GatherMergeState *gm_state)
{
..
+reread:
+	for (i = 0; i < nreaders + 1; i++)
+	{
+		if (TupIsNull(gm_state->gm_slots[i]) ||
+			gm_state->gm_slots[i]->tts_isempty)
+		{
+			if (gather_merge_readnext(gm_state, i, initialize ? false : true))
+				binaryheap_add_unordered(gm_state->gm_heap,
+										 Int32GetDatum(i));
+		}
+		else
+			fill_tuple_array(gm_state, i);
+	}
+	initialize = false;
+
+	for (i = 0; i < nreaders; i++)
+		if (TupIsNull(gm_state->gm_slots[i]) ||
+			gm_state->gm_slots[i]->tts_isempty)
+			goto reread;
..
}

This code can cause an infinite loop.  Consider a case where one of the
workers doesn't get any tuple, because by the time it starts, all the
tuples have been consumed by the other workers. The above code will keep
looping to fetch a tuple from that worker, whereas that worker can't
return any tuple.  I think you can fix it by checking whether the
particular queue has been exhausted.


Oh yes. I will work on the fix and soon submit the next set of patches.
 
>> >
>> > Open Issue:
>> >
>> > - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
>> > tqueue.c by calling gather_readnext() in a per-tuple context. Now for
>> > gather merge that is not possible, as we store the tuple into the tuple
>> > array and we want the tuple to be freed only once it has passed through
>> > the merge sort algorithm. One idea is, we can also call
>> > gm_readnext_tuple() under the per-tuple context (which will fix the leak
>> > in tqueue.c) and then store a copy of the tuple into the tuple array.
>> >
>>
>> Won't an extra copy per tuple impact performance?  Was the fix in the
>> mentioned commit for record or composite types? If so, does GatherMerge
>> support such types, and if it does, does it provide any benefit over
>> Gather?
>>
>
> I don't think it was specifically for record or composite types, but I
> might be wrong. As per my understanding, the commit fixed a leak in
> tqueue.c.
>

Hmm, in tqueue.c, I think the memory leak was in the remapping logic;
refer to Tom's mail [1] ("Just to add insult to injury, the backend's
memory consumption bloats to something over 5.5G during that last
query").

> The fix was to establish the standard of calling TupleQueueReaderNext()
> with a shorter memory context, so that tqueue.c doesn't leak memory.
>
> My idea to fix this is to call TupleQueueReaderNext() in the per-tuple
> context, then copy the tuple and store it into the tuple array; later,
> the next call to ExecStoreTuple() will free the earlier tuple. I will
> work on that.
>

Okay, if you think that is viable, then you can do it, but do check
the performance impact of the same, because an extra copy per fetched
tuple can impact performance.


Sure, I will check the performance impact of the same.

 

[1] - https://www.postgresql.org/message-id/32763.1469821037%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Thanks,
Rushabh Lathia
