Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Parallel Seq Scan
Date 2015-01-10 18:03:36
Msg-id 20150110180336.GQ3062@tamriel.snowman.net
In response to Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
* Amit Kapila (amit.kapila16@gmail.com) wrote:
> At this moment if we can ensure that parallel plan should not be selected
> for cases where it will perform poorly is more than enough considering
> we have lots of other work left to even make any parallel operation work.

The problem with this approach is that it doesn't consider any options
between 'serial' and 'parallelize by factor X'.  If the per-process
startup cost is 1000 and the factor is 32, then a seqscan which costs
31000 won't ever be parallelized (the 32000 in startup cost alone
exceeds the scan cost), even though a factor of 8 would have
parallelized it.

You could forget about the per-process startup cost entirely, in fact,
and simply say "only parallelize if it's more than X".
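To make the arithmetic concrete, here's a toy sketch of the decision.  The cost formula, the per-worker startup charge, and the numbers are all illustrative, not PostgreSQL's actual costing code:

```python
# Toy cost model: scan work divided among workers, plus a fixed
# startup charge per worker.  Illustrative only.

def parallel_cost(scan_cost, workers, per_worker_startup=1000):
    """Estimated cost of scanning with the given number of workers."""
    return per_worker_startup * workers + scan_cost / workers

scan_cost = 31000

# With a fixed factor of 32, parallelism never wins for this scan:
# 32 * 1000 = 32000 in startup cost already exceeds the serial cost.
assert parallel_cost(scan_cost, 32) > scan_cost

# A factor of 8 would have been a clear win (8000 + 3875 < 31000).
assert parallel_cost(scan_cost, 8) < scan_cost

# Letting the planner pick the degree finds the cheapest option
# rather than the all-or-nothing choice at a single fixed factor.
best = min(range(1, 33), key=lambda w: parallel_cost(scan_cost, w))
```

Under this (made-up) formula the planner would settle on a modest degree in between, which is exactly the kind of option a single fixed factor can never pick.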

Again, I don't like the idea of designing this with the assumption that
the user dictates the right level of parallelization for each and every
query.  I'd love to go out and tell users "set the factor to the number
of CPUs you have and we'll just use what makes sense."

The same goes for max number of backends.  If we set the parallel level
to the number of CPUs and set the max backends to the same, then we end
up with only one parallel query running at a time, ever.  That's
terrible.  Now, we could set the parallel level lower or set the max
backends higher, but either way we end up under-utilizing the machine
or over-subscribing it, neither of which is good.
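The bind is simple integer arithmetic.  A hypothetical sketch (the variable names are invented, not actual GUC names):

```python
# Illustration with made-up numbers: if each parallel query grabs as
# many workers as there are CPUs, and the global worker cap equals the
# CPU count, only one parallel query can run at a time.

ncpus = 8
workers_per_query = ncpus        # "parallel level" set to the CPU count
max_background_workers = ncpus   # global cap set to the same value

concurrent_parallel_queries = max_background_workers // workers_per_query
assert concurrent_parallel_queries == 1

# Lowering workers_per_query admits more concurrent queries but
# under-uses the CPUs for any single query; raising the global cap
# instead over-subscribes the machine.
```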

I agree that this makes it a bit different from work_mem, but in this
case there's an overall max in the form of the maximum number of
background workers.  If we had something similar for work_mem, then we
could set that higher and still trust the system to only use the amount
of memory necessary (e.g., a hashjoin doesn't use all available work_mem
and neither does a sort, unless the data set is larger than the memory
available).

> > > Execution:
> > > Most other databases does partition level scan for partition on
> > > different disks by each individual parallel worker.  However,
> > > it seems amazon dynamodb [2] also works on something
> > > similar to what I have used in patch which means on fixed
> > > blocks.  I think this kind of strategy seems better than dividing
> > > the blocks at runtime because dividing randomly the blocks
> > > among workers could lead to random scan for a parallel
> > > sequential scan.
> >
> > Yeah, we also need to consider the i/o side of this, which will
> > definitely be tricky.  There are i/o systems out there which are faster
> > than a single CPU and ones where a single CPU can manage multiple i/o
> > channels.  There are also cases where the i/o system handles sequential
> > access nearly as fast as random and cases where sequential is much
> > faster than random.  Where we can get an idea of that distinction is
> > with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
> > lower random_page_cost from the default to indicate that.
> >
> I am not clear, do you expect anything different in execution strategy
> than what I have mentioned or does that sound reasonable to you?

What I'd like is a way to figure out the right amount of CPU for each
tablespace (0.25, 1, 2, 4, etc.) and then use that many.  Using a single
CPU for each tablespace is likely to starve the CPU or starve the I/O
system and I'm not sure if there's a way to address that.

Note that I intentionally said tablespace there because that's how users
can tell us what the different i/o channels are.  I realize this ends up
going beyond the current scope, but the parallel seqscan at the per
relation level will only ever be using one i/o channel.  It'd be neat if
we could work out how fast that i/o channel is vs. the CPUs and
determine how many CPUs are necessary to keep up with the i/o channel
and then use more-or-less exactly that many for the scan.
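As a back-of-the-envelope version of that calculation (the function, the bandwidth figures, and the per-CPU scan rate below are all hypothetical):

```python
import math

# How many CPUs does one i/o channel (tablespace) need so that the
# workers collectively keep up with it?  Numbers are invented for
# illustration; real values would have to be measured or configured.

def workers_for_channel(io_mb_per_s, cpu_scan_mb_per_s):
    """Smallest worker count whose combined scan rate matches the channel."""
    return max(1, math.ceil(io_mb_per_s / cpu_scan_mb_per_s))

# A fast array that a single CPU can't keep up with wants four workers:
assert workers_for_channel(2000, 500) == 4

# A slow disk where one CPU already outruns the i/o wants just one:
assert workers_for_channel(100, 500) == 1
```

Anything beyond that count starves the i/o channel's workers of useful work; anything below it leaves the channel idle, which is the mismatch described above.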

I agree that some of this can come later but I worry that starting out
with a design that expects to always be told exactly how many CPUs to
use when running a parallel query will be difficult to move away from
later.

Thanks,
    Stephen
