Re: Parallel Seq Scan - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Parallel Seq Scan
Date:
Msg-id: 20150217162231.GA21168@awork2.anarazel.de
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel Seq Scan (Amit Kapila <amit.kapila16@gmail.com>)
           Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2015-02-11 15:49:17 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > And good chunk sizes et al depend on higher layers,
> >> > selectivity estimates and such. And that's planner/executor work, not
> >> > the physical layer (which heapam.c pretty much is).
> >>
> >> If it's true that a good chunk size depends on the higher layers, then
> >> that would be a good argument for doing this differently, or at least
> >> exposing an API for the higher layers to tell heapam.c what chunk size
> >> they want.  I hadn't considered that possibility - can you elaborate
> >> on why you think we might want to vary the chunk size?
> >
> > Because things like chunk size depend on the shape of the entire
> > plan. If you have a 1TB table and want to sequentially scan it in
> > parallel with 10 workers you better use some rather large chunks. That
> > way readahead will be efficient in a cpu/socket local manner,
> > i.e. directly reading in the pages into the directly connected memory of
> > that cpu. Important for performance on a NUMA system, otherwise you'll
> > constantly have everything go over the shared bus.  But if you instead
> > have a plan where the sequential scan goes over a 1GB table, perhaps
> > with some relatively expensive filters, you'll really want a small
> > chunks size to avoid waiting.
> 
> I see.  That makes sense.
> 
> > The chunk size will also really depend on
> > what other nodes are doing, at least if they can run in the same worker.
> 
> Example?

A query whose runtime is dominated by a sequential scan (+ attached
filter) is certainly going to require a bigger prefetch size than one
that does other expensive stuff.

Imagine parallelizing
SELECT * FROM largetable WHERE col = low_cardinality_value;
and
SELECT *
FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
WHERE col = high_cardinality_value;

The first query will be a simple sequential scan, and disk reads on
largetable will be the major cost of executing it.  In contrast, the
second query might very well sensibly be planned as a parallel sequential
scan with the nested loop executing in the same worker. But the cost of
the sequential scan itself will likely be completely drowned out by the
nestloop execution - index probes are expensive/unpredictable.

My guess is that the chunk size will have to be computed based on the
fraction of the total cost that the parallelized work accounts for.
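
Just to make that concrete, a rough heuristic could look something like
the following - all names and constants below are invented for
illustration, not an actual proposal:

/*
 * Hypothetical sketch (uses BlockNumber, Cost and Max from the usual
 * headers): scale the chunk size by the fraction of the plan's total
 * cost that the scan itself accounts for.
 */
static BlockNumber
choose_chunk_size(BlockNumber nblocks, int nworkers,
                  Cost scan_cost, Cost total_cost)
{
    /* start from an even split of the relation between the workers */
    BlockNumber chunk = Max(nblocks / (nworkers * 64), 1);

    /*
     * Shrink the chunk when the scan is only a small fraction of the
     * total work, so workers don't finish at wildly different times.
     */
    if (total_cost > 0)
        chunk = Max((BlockNumber) (chunk * (scan_cost / total_cost)), 1);

    return chunk;
}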

> > Even without things like NUMA and readahead I'm pretty sure that you'll
> > want a chunk size a good bit above one page. The locks we acquire for
> > the buffercache lookup and for reading the page are already quite bad
> > for performance/scalability; even if we don't always/often hit the same
> > lock. Making 20 processes that scan pages in parallel acquire yet
> > another lock (that's shared between all of them!) for every single page
> > won't be fun, especially with no filters or only fast ones.
> 
> This is possible, but I'm skeptical.  If the amount of other work we
> have to do for that page is so little that the additional spinlock cycle
> per page causes meaningful contention, I doubt we should be
> parallelizing in the first place.

It's easy to see contention on buffer mapping locks (many workloads), and
on buffer content and buffer header locks (especially for btree roots and
small foreign key target tables). And for most of them we already avoid
acquiring the same spinlock in all backends.

Right now, to process a page in a sequential scan we acquire a
nonblocking buffer mapping lock (which doesn't use a spinlock anymore
*because* it proved to be a bottleneck), a nonblocking content lock and
the buffer header spinlock. All of those are essentially partitioned -
another spinlock shared between all workers will show up.

> > As pointed out above (moved there after reading the patch...) I don't
> > think a chunk size of 1 or any other constant size can make sense. I
> > don't even believe it'll necessarily be constant across an entire query
> > execution (big initially, small at the end).  Now, we could move
> > determining that into executor initialization, before query execution
> > proper, but then we won't yet know how many workers we're going
> > to get. We could add a function setting that at runtime, but that'd mix
> > up responsibilities quite a bit.
> 
> I still think this belongs in heapam.c somehow or other.  If the logic
> is all in the executor, then it becomes impossible for any code that
> doesn't use the executor to do a parallel heap scan, and that's
> probably bad.  It's not hard to imagine something like CLUSTER wanting
> to reuse that code, and that won't be possible if the logic is up in
> some higher layer.

Yea.

> If the logic we want is to start with a large chunk size and then
> switch to a small chunk size when there's not much of the relation
> left to scan, there's still no reason that can't be encapsulated in
> heapam.c.

I don't mind having some logic in there, but I think you put in too
much. The snapshot stuff should imo go, and the next-page logic should
be caller-provided.
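
Roughly the shape I have in mind - everything below is just an
illustration, the field and type names are invented, not an actual API
proposal:

/*
 * Hypothetical: the caller (executor, CLUSTER, ...) hands heapam.c a
 * callback that decides which chunk of blocks to scan next; heapam.c
 * only iterates over the pages inside that chunk.  Returns the first
 * block of the next chunk, or InvalidBlockNumber when the scan is done.
 */
typedef BlockNumber (*ParallelScanNextChunk) (void *callback_state,
                                              BlockNumber *chunk_length);

/* heapam.c side, very roughly: */
BlockNumber blkno;
BlockNumber len;

while ((blkno = scan->rs_next_chunk(scan->rs_chunk_state, &len)) !=
       InvalidBlockNumber)
{
    BlockNumber i;

    for (i = 0; i < len; i++)
        heapgetpage(scan, blkno + i);   /* existing per-page logic */
}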

> > Btw, using an atomic uint32 you'd end up without the spinlock and just
> > about the same amount of code... Just do an atomic_fetch_add_until32(var,
> > 1, InvalidBlockNumber)... ;)
> 
> I thought of that, but I think there's an overflow hazard.

That's why I said atomic_fetch_add_until32 - which can't overflow ;). I
now remember that it was actually pulled from the committed patch on
Heikki's request, until a user shows up, but we can easily add it
back. Luckily, compare/exchange makes such things simple.
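
For reference, on top of the committed compare/exchange primitive it's
just a short CAS loop - a sketch only, since the _until variant itself
isn't in the tree right now:

/*
 * Sketch of the pulled helper: atomically add 'add' to *ptr, but never
 * beyond 'until'; returns the old value.  Built only on the existing
 * pg_atomic_read_u32()/pg_atomic_compare_exchange_u32().
 */
static inline uint32
pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr,
                              uint32 add, uint32 until)
{
    uint32      old = pg_atomic_read_u32(ptr);

    while (old < until)
    {
        uint32      newval;

        /* clamp instead of wrapping around - that's the whole point */
        newval = (until - old < add) ? until : old + add;

        if (pg_atomic_compare_exchange_u32(ptr, &old, newval))
            break;
        /* CAS failed; 'old' now holds the current value, retry */
    }
    return old;
}

Grabbing the next block out of the shared scan state is then a single
call per page (or chunk), without any additional spinlock.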

> > To me, given the existing executor code, it seems easiest to achieve
> > that by having the ParallelismDrivingNode above have a dynamic number
> > of nestloop children in different backends, and point the coordinated
> > seqscan to some shared state.  As you point out, the number of these
> > children cannot be certainly known (just targeted for) at plan time;
> > that puts a certain limit on how independent they are.  But since a
> > large number of them can be independent between workers it seems awkward
> > to generally treat them as being the same node across workers. But maybe
> > that's just an issue with my mental model.
> 
> I think it makes sense to think of a set of tasks in which workers can
> assist.  So you have a query tree, which is just one query tree, with no
> copies of the nodes, and then there are certain places in that query
> tree where a worker can jump in and assist that node.  To do that, it
> will have a copy of the node, but that doesn't mean that all of the
> stuff inside the node becomes shared data at the code level, because
> that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - they'll be handed entire subtrees
of the executor tree.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


