Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel Seq Scan
Date
Msg-id CAA4eK1KAOUVM-GqroDfzDCfbEn_RzKDLzF+UPkV2tHGAw1wrDQ@mail.gmail.com
In response to Re: Parallel Seq Scan  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> * Amit Kapila (amit.kapila16@gmail.com) wrote:
> > 1. As the patch currently stands, it just shares the relevant
> > data (like relid, target list, block range each worker should
> > perform on etc.) to the worker and then worker receives that
> > data and form the planned statement which it will execute and
> > send the results back to master backend.  So the question
> > here is do you think it is reasonable or should we try to form
> > the complete plan for each worker and then share the same
> > and may be other information as well like range table entries
> > which are required.   My personal gut feeling in this matter
> > is that for long term it might be better to form the complete
> > plan of each worker in master and share the same, however
> > I think the current way as done in patch (okay that needs
> > some improvement) is also not bad and quite easier to implement.
>
> For my 2c, I'd like to see it support exactly what the SeqScan node
> supports and then also what Foreign Scan supports.  That would mean we'd
> then be able to push filtering down to the workers which would be great.
> Even better would be figuring out how to parallelize an Append node
> (perhaps only possible when the nodes underneath are all SeqScan or
> ForeignScan nodes) since that would allow us to then parallelize the
> work across multiple tables and remote servers.
>
> One of the big reasons why I was asking about performance data is that,
> today, we can't easily split a single relation across multiple i/o
> channels.  Sure, we can use RAID and get the i/o channel that the table
> sits on faster than a single disk and possibly fast enough that a single
> CPU can't keep up, but that's not quite the same.  The historical
> recommendations for Hadoop nodes is around one CPU per drive (of course,
> it'll depend on workload, etc, etc, but still) and while there's still a
> lot of testing, etc, to be done before we can be sure about the 'right'
> answer for PG (and it'll also vary based on workload, etc), that strikes
> me as a pretty reasonable rule-of-thumb to go on.
>
> Of course, I'm aware that this won't be as easy to implement..
>
> > 2. Next question related to above is what should be the
> > output of ExplainPlan, as currently worker is responsible
> > for forming its own plan, Explain Plan is not able to show
> > the detailed plan for each worker, is that okay?
>
> I'm not entirely following this.  How can the worker be responsible for
> its own "plan" when the information passed to it (per the above
> paragraph..) is pretty minimal?

Because for a simple sequential scan that much information is sufficient:
if we have the scanrelid, target list, qual, and the RTE (primarily the
relation OID), the worker can form and execute the sequential scan on its own.
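
For illustration only, a rough sketch of the worker-side reconstruction could
look something like the following (the function name and the exact set of
arguments are assumptions for this sketch, not what the patch currently does).
The master would serialize the node trees with nodeToString() and the worker
would rebuild the SeqScan plan from those strings:

#include "postgres.h"
#include "nodes/plannodes.h"

/*
 * Sketch only: rebuild a SeqScan plan node in the worker from the
 * minimal information the master placed in dynamic shared memory.
 */
static SeqScan *
worker_rebuild_seqscan(char *targetlist_str, char *qual_str, Index scanrelid)
{
    SeqScan    *scan = makeNode(SeqScan);

    /* stringToNode() reverses the nodeToString() done in the master backend */
    scan->plan.targetlist = (List *) stringToNode(targetlist_str);
    scan->plan.qual = (List *) stringToNode(qual_str);
    scan->scanrelid = scanrelid;

    return scan;
}

The worker would then initialize the executor on this plan, using the RTE it
received, and stream the resulting tuples back to the master through its queue.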

> In general, I don't think we need to
> have specifics like "this worker is going to do exactly X" because we
> will eventually need some communication to happen between the worker and
> the master process where the worker can ask for more work because it's
> finished what it was tasked with and the master will need to give it
> another chunk of work to do.  I don't think we want exactly what each
> worker process will do to be fully formed at the outset because, even
> with the best information available, given concurrent load on the
> system, it's not going to be perfect and we'll end up starving workers.
> The plan, as formed by the master, should be more along the lines of
> "this is what I'm gonna have my workers do" along w/ how many workers,
> etc, and then it goes and does it.

I think what you are saying is that the work allocation for the workers
should be dynamic rather than fixed, which makes sense; however, we can
attempt such an optimization once we have some initial performance data.
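
For what it's worth, a rough sketch of what dynamic allocation might look
like, assuming a shared counter maintained with the atomics API (all names
and the chunk size below are made up for illustration), is:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

#define SCAN_CHUNK_BLOCKS 32            /* blocks handed out per request */

/* Sketch only: state the master would place in dynamic shared memory. */
typedef struct ParallelScanChunkState
{
    BlockNumber         nblocks;        /* total number of blocks to scan */
    pg_atomic_uint32    next_block;     /* next block not yet handed out */
} ParallelScanChunkState;

/*
 * Claim the next chunk of blocks for this worker; returns false once the
 * whole relation has been handed out.  The claimed range is [*start, *end).
 */
static bool
claim_block_chunk(ParallelScanChunkState *state,
                  BlockNumber *start, BlockNumber *end)
{
    uint32      first;

    first = pg_atomic_fetch_add_u32(&state->next_block, SCAN_CHUNK_BLOCKS);
    if (first >= state->nblocks)
        return false;

    *start = first;
    *end = Min(first + SCAN_CHUNK_BLOCKS, state->nblocks);
    return true;
}

That way, faster workers simply claim more chunks, and nobody sits idle while
another worker still has a large fixed range left to finish.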

> Perhaps for an 'explain analyze' we
> return information about what workers actually *did* what, but that's a
> whole different discussion.
>

Agreed.

> > 3. Some places where optimizations are possible:
> > - Currently after getting the tuple from heap, it is deformed by
> > worker and sent via message queue to master backend, master
> > backend then forms the tuple and send it to upper layer which
> > before sending it to frontend again deforms it via slot_getallattrs(slot).
>
> If this is done as I was proposing above, we might be able to avoid
> this, but I don't know that it's a huge issue either way..  The bigger
> issue is getting the filtering pushed down.
>
> > - Master backend currently receives the data from multiple workers
> > serially.  We can optimize in a way that it can check other queues,
> > if there is no data in current queue.
>
> Yes, this is pretty critical.  In fact, it's one of the recommendations
> I made previously about how to change the Append node to parallelize
> Foreign Scan node work.
>
> > - Master backend is just responsible for coordination among workers
> > It shares the required information to workers and then fetch the
> > data processed by each worker, by using some more logic, we might
> > be able to make master backend also fetch data from heap rather than
> > doing just co-ordination among workers.
>
> I don't think this is really necessary...
>
> > I think in all above places we can do some optimisation, however
> > we can do that later as well, unless they hit the performance badly for
> > cases which people care most.
>
> I agree that we can improve the performance through various
> optimizations later, but it's important to get the general structure and
> design right or we'll end up having to reimplement a lot of it.
>

So to summarize my understanding, below is the set of things I should work
on, in the order listed.

1. Push down qualification to the workers.
2. Gather performance data.
3. Improve the way the plan-related information is pushed down to the workers.
4. Dynamic allocation of work among the workers.
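
Also, regarding the point above that the master backend currently receives
the data from the workers serially: a rough sketch of non-blocking,
round-robin draining of the per-worker queues (assuming one shm_mq per
worker, already attached; all names below, including process_worker_tuple(),
are illustrative placeholders) could be along these lines:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shm_mq.h"

/* placeholder for whatever the master does with a tuple from a worker */
extern void process_worker_tuple(void *data, Size nbytes);

static void
drain_worker_queues(shm_mq_handle **queues, int nworkers)
{
    int         active = nworkers;

    while (active > 0)
    {
        bool        progressed = false;
        int         i;

        for (i = 0; i < nworkers; i++)
        {
            Size            nbytes;
            void           *data;
            shm_mq_result   res;

            if (queues[i] == NULL)
                continue;           /* this worker has already detached */

            /* nowait = true: move on to the next queue if this one is empty */
            res = shm_mq_receive(queues[i], &nbytes, &data, true);
            if (res == SHM_MQ_SUCCESS)
            {
                process_worker_tuple(data, nbytes);
                progressed = true;
            }
            else if (res == SHM_MQ_DETACHED)
            {
                queues[i] = NULL;   /* worker has finished and detached */
                active--;
            }
            /* SHM_MQ_WOULD_BLOCK: nothing to read right now, try the next */
        }

        /* If every queue was empty, sleep until a worker sets our latch */
        if (!progressed && active > 0)
        {
            WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
            ResetLatch(&MyProc->procLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }
}

Since shm_mq_send() sets the receiver's latch, waiting on the process latch
when all queues are empty should be enough to avoid busy-waiting.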



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
