Re: Parallel Seq Scan - Mailing list pgsql-hackers

| From | Stephen Frost |
|---|---|
| Subject | Re: Parallel Seq Scan |
| Date | |
| Msg-id | 20150111110158.GS3062@tamriel.snowman.net |
| In response to | Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: Parallel Seq Scan |
| List | pgsql-hackers |
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Jan 9, 2015 at 12:24 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Yeah, we also need to consider the i/o side of this, which will
> > definitely be tricky.  There are i/o systems out there which are faster
> > than a single CPU and ones where a single CPU can manage multiple i/o
> > channels.  There are also cases where the i/o system handles sequential
> > access nearly as fast as random and cases where sequential is much
> > faster than random.  Where we can get an idea of that distinction is
> > with seq_page_cost vs. random_page_cost, as folks running on SSDs tend
> > to lower random_page_cost from the default to indicate that.
>
> On my MacOS X system, I've already seen cases where my parallel_count
> module runs incredibly slowly some of the time.  I believe that this
> is because having multiple workers reading the relation block-by-block
> at the same time causes the OS to fail to realize that it needs to do
> aggressive readahead.  I suspect we're going to need to account for
> this somehow.

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations.  This goes back to my other thoughts
on how we should be thinking about parallelizing inbound data for bulk
data loads, but it seems appropriate to consider it here also.  One of
the issues there is that 1G still feels like an awful lot for a minimum
work size for each worker, and it would mean we don't parallelize for
relations smaller than that.

On a random VM on my personal server, an uncached 1G read takes over 10s.
Cached, it's less than half that, of course.  This is all spinning rust
(and only 7200 RPM at that) and there's a lot of other stuff going on,
but that still seems like too much of a chunk to give to one worker
unless the overall data set to go through is really large.

There are other issues in there too, of course: if we're dumping data in
like this then we have to either deal with jagged relation files somehow
or pad the file out to 1G, and that doesn't even get into the issues
around how we'd have to redesign the interfaces for relation access, or
how this thinking is an utter violation of the modularity we currently
have there.

> > Yeah, I agree that's more typical.  Robert's point that the master
> > backend should participate is interesting but, as I recall, it was
> > based on the idea that the master could finish faster than the
> > worker - but if that's the case then we've planned it out wrong from
> > the beginning.
>
> So, if the workers have been started but aren't keeping up, the master
> should do nothing until they produce tuples rather than participating?
> That doesn't seem right.

Having the master jump in and start working could screw things up too,
though.  Perhaps we need the master to start working as a fail-safe but
not plan on having things go that way?  Having more processes trying to
do X doesn't always make things better, and the master also needs to
keep up with all the tuples being thrown at it by the workers.

Thanks,

Stephen
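To make the block-by-block vs. chunked assignment distinction concrete, here is a minimal standalone sketch, not PostgreSQL code: the names (`assign_chunks`, `WorkerChunk`, `BLOCKS_PER_SEGMENT`) are made up for illustration. It shows one way each worker could be handed a contiguous range of blocks rather than blocks interleaved round-robin, so that each worker's I/O stream stays sequential and remains recognizable to OS readahead, under the assumption of 8kB blocks and 1GB segment files.

```c
/*
 * Illustrative sketch only (not PostgreSQL source): assign each parallel
 * worker a contiguous range of a relation's blocks, rather than handing
 * out blocks round-robin.  Each worker then reads strictly sequentially
 * within its own range, which is the pattern OS readahead detects;
 * interleaved assignment makes every worker's stream look random.
 */
#include <stdio.h>
#include <stdint.h>

#define BLOCKS_PER_SEGMENT  131072      /* 1GB segment / 8kB block */

typedef struct
{
    uint32_t    first_block;    /* first block this worker reads */
    uint32_t    nblocks;        /* how many consecutive blocks it reads */
} WorkerChunk;

/*
 * Split nblocks_total into nworkers contiguous ranges; the first
 * (nblocks_total % nworkers) workers get one extra block each.
 */
static void
assign_chunks(uint32_t nblocks_total, int nworkers, WorkerChunk *chunks)
{
    uint32_t    base = nblocks_total / nworkers;
    uint32_t    extra = nblocks_total % nworkers;
    uint32_t    next = 0;

    for (int i = 0; i < nworkers; i++)
    {
        chunks[i].first_block = next;
        chunks[i].nblocks = base + (i < (int) extra ? 1 : 0);
        next += chunks[i].nblocks;
    }
}

int
main(void)
{
    /* e.g. a ~3.2GB relation scanned by 4 workers */
    uint32_t    nblocks_total = 3 * BLOCKS_PER_SEGMENT + 25000;
    int         nworkers = 4;
    WorkerChunk chunks[4];

    assign_chunks(nblocks_total, nworkers, chunks);

    for (int i = 0; i < nworkers; i++)
        printf("worker %d: blocks %u..%u (%u blocks, ~%.0f MB sequential)\n",
               i,
               chunks[i].first_block,
               chunks[i].first_block + chunks[i].nblocks - 1,
               chunks[i].nblocks,
               chunks[i].nblocks * 8.0 / 1024.0);

    return 0;
}
```

The chunk boundaries here are purely arithmetic; a real design would still have to settle the granularity question raised above, since one contiguous 1GB chunk per worker means no parallelism at all for relations under that size.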