Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id CAA4eK1LNt6wQBCxKsMj_QC+GahBuwyKWsQn6UL3nWVQ2savzwg@mail.gmail.com
Whole thread Raw
In response to Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
List pgsql-hackers
On Tue, Aug 25, 2015 at 6:19 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>
> > On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > It could be possible, but let me summarize what I thought would be required
> > for above use case.  For Parallel Append, we need to push multiple
> > planned statements in contrast to one planned statement as is done for
> > current patch and then one or more parallel workers needs to work on each
> > planned statement. So if we know in advance how many planned statements
> > are we passing down (which we should), then using ParallelWorkerNumber
> > (ParallelWorkerNumber % num_planned_statements or some other similar
> > way), workers can find the the planned statement on which they need to work
> > and similarly information for PartialSeqScan (which currently is parallel heap
> > scan descriptor information).
> >
> My problem is that we have no identifier to point a particular element on
> the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can
> have multiple items.
> Please assume a situation when ExecPartialSeqScan() has to lookup
> a particular item on TOC but multiple PartialSeqScan nodes can exist.
>
> Currently, it does:
>     pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
>
> However, ExecPartialSeqScan() cannot know which is the index of mine,
> or it is not reasonable to pay attention on other node in this level.
> Even if PARALLEL_KEY_SCAN has multiple items, PartialSeqScan node also
> needs to have identifier.
>

Yes that's right and I think we can find out the same.  Basically we need to
know the planned statement number on which current worker is working and
that anyway we have to do before the worker can start the work.  One way is
as I have explained above that use ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements) to find or might need
some sophisticated way to find that out, but definitely we need to know that
before start of execution by worker and once we know that we can use it
find the PARALLEL_KEY_SCAN or whatever key for this worker (as the
the position of PARALLEL_KEY_SCAN will be same as of planned stmt
for a worker).


> > >  I think KaiGai's correct,
> > > and I pointed out the same problem to you before.  The parallel key
> > > for the Partial Seq Scan needs to be allocated on the fly and carried
> > > in the node, or we'll never be able to push multiple things below the
> > > funnel.
> >
> > Okay, immediately I don't see what is the best way to achieve this but
> > let us discuss this separately on Parallel Seq Scan thread and let me
> > know if you have something specific in your mind.  I will also give this
> > a more thought.
> >
> I want to have 'node_id' in the Plan node, then unique identifier is
> assigned on the field prior to serialization. It is a property of the
> Plan node, so we can reproduce this identifier on the background worker
> side using stringToNode(), then ExecPartialSeqScan can pull out a proper
> field from the TOC segment by this node_id.
>

Okay, this can also work, but why to introduce identifier in plan node, if it
can work without it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pg_controldata output alignment regression
Next
From: David Rowley
Date:
Subject: Re: Performance improvement for joins where outer side is unique