Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel Seq Scan
Msg-id CA+TgmoaMT8d-Tk2yi5cjScC0UnJfib78fn0pGsutDp-4keTB-A@mail.gmail.com
In response to Re: Parallel Seq Scan  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Parallel Seq Scan  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
>> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > And good chunk sizes et al depend on higher layers,
>> > selectivity estimates and such. And that's planner/executor work, not
>> > the physical layer (which heapam.c pretty much is).
>>
>> If it's true that a good chunk size depends on the higher layers, then
>> that would be a good argument for doing this differently, or at least
>> exposing an API for the higher layers to tell heapam.c what chunk size
>> they want.  I hadn't considered that possibility - can you elaborate
>> on why you think we might want to vary the chunk size?
>
> Because things like chunk size depend on the shape of the entire
> plan. If you have a 1TB table and want to sequentially scan it in
> parallel with 10 workers you better use some rather large chunks. That
> way readahead will be efficient in a cpu/socket local manner,
> i.e. directly reading in the pages into the directly connected memory of
> that cpu. Important for performance on a NUMA system, otherwise you'll
> constantly have everything go over the shared bus.  But if you instead
> have a plan where the sequential scan goes over a 1GB table, perhaps
> with some relatively expensive filters, you'll really want a small
> chunk size to avoid waiting.

I see.  That makes sense.
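
For illustration, a heuristic along these lines is one way that trade-off
could be expressed; every name and constant below is invented, it's not
what the patch does:

#include "postgres.h"
#include "storage/block.h"

/*
 * Purely illustrative: derive a chunk size from the relation size and the
 * worker count.  Big relations get big, readahead-friendly chunks; small
 * relations get small ones so no worker sits idle while another finishes
 * an oversized chunk.  (A real version would presumably also account for
 * filter cost.)
 */
static BlockNumber
choose_chunk_size(BlockNumber nblocks, int nworkers)
{
    /* aim for each worker to receive many chunks, so the tail evens out */
    BlockNumber chunk = nblocks / (Max(nworkers, 1) * 64);

    if (chunk < 1)
        chunk = 1;          /* tiny scan: hand out single pages */
    if (chunk > 8192)
        chunk = 8192;       /* cap at 64MB worth of 8kB pages */
    return chunk;
}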

> The chunk size will also really depend on
> what other nodes are doing, at least if they can run in the same worker.

Example?

> Even without things like NUMA and readahead I'm pretty sure that you'll
> want a chunk size a good bit above one page. The locks we acquire for
> the buffercache lookup and for reading the page are already quite bad
> for performance/scalability; even if we don't always/often hit the same
> lock. Making 20 processes that scan pages in parallel acquire yet
> another lock (that's shared between all of them!) for every single page
> won't be fun, especially with no or only fast filters.

This is possible, but I'm skeptical.  If the amount of other work we
have to do for that page is so little that the additional spinlock cycle
per page causes meaningful contention, I doubt we should be
parallelizing in the first place.
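
For rough scale (back-of-the-envelope, 8kB pages): a 1TB relation is about
134 million pages, so one shared atomic or spinlock operation per page means
~134M shared-memory operations over the whole scan, versus roughly 2M if
blocks were handed out 64 at a time.  Whether the per-page figure actually
shows up as contention presumably depends on how much qual evaluation and
tuple-forming work happens in between.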

> No, and I said so upthread. I started commenting because you argued that
> architecturally parallelism belongs in heapam.c instead of upper layers,
> and I can't agree with that.  I now have, and it looks less bad than I
> had assumed, sorry.

OK, that's something.

> Unfortunately I still think it's the wrong approach, also sorry.
>
> As pointed out above (moved there after reading the patch...) I don't
> think a chunk size of 1 or any other constant size can make sense. I
> don't even believe it'll necessarily be constant across an entire query
> execution (big initially, small at the end).  Now, we could move
> determining that before the query execution into executor
> initialization, but then we won't yet know how many workers we're going
> to get. We could add a function setting that at runtime, but that'd mix
> up responsibilities quite a bit.

I still think this belongs in heapam.c somehow or other.  If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's
probably bad.  It's not hard to imagine something like CLUSTER wanting
to reuse that code, and that won't be possible if the logic is up in
some higher layer.  If the logic we want is to start with a large
chunk size and then switch to a small chunk size when there's not much
of the relation left to scan, there's still no reason that can't be
encapsulated in heapam.c.
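
To make "encapsulated in heapam.c" slightly more concrete, the interface
could have roughly this shape; all names below are invented for
illustration and are not what the patch currently does:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

/* Shared scan state, living in a dynamic shared memory segment. */
typedef struct ParallelHeapScanShared
{
    BlockNumber      phs_nblocks;    /* relation size at scan start */
    pg_atomic_uint32 phs_nextblock;  /* next block to hand out */
} ParallelHeapScanShared;

/*
 * Hand the caller its next chunk of blocks, or return false once the
 * relation is exhausted.  Because the chunk-size policy (including any
 * "start large, shrink near the end" ramp-down) sits behind this function,
 * the executor, CLUSTER, or any other caller gets it without knowing the
 * details.
 */
extern bool heap_parallelscan_nextchunk(ParallelHeapScanShared *shared,
                                        BlockNumber *startblock,
                                        BlockNumber *nblocks);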

> Btw, using an atomic uint32 you'd end up without the spinlock and just
> about the same amount of code... Just do an atomic_fetch_add_until32(var,
> 1, InvalidBlockNumber)... ;)

I thought of that, but I think there's an overflow hazard.
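
(For what it's worth, add-until semantics could presumably be built from a
compare-and-swap loop, which is what would avoid the wraparound; a rough,
untested sketch of such a helper, which does not exist in port/atomics.h
today:)

#include "postgres.h"
#include "port/atomics.h"

/*
 * Advance *var by add, but never past until, and return the previous
 * value.  Callers that arrive after the counter has reached until simply
 * get until back; the counter never wraps.
 */
static uint32
atomic_fetch_add_until_u32(pg_atomic_uint32 *var, uint32 add, uint32 until)
{
    uint32 old = pg_atomic_read_u32(var);

    while (old < until)
    {
        uint32 newval = (add >= until - old) ? until : old + add;

        /* on failure, pg_atomic_compare_exchange_u32 refreshes 'old' */
        if (pg_atomic_compare_exchange_u32(var, &old, newval))
            break;
    }
    return old;
}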

> Where the 'coordinated seqscan' scans a relation so that each tuple
> eventually gets returned once across all nodes, but the nested loop (and
> through it the index scan) will just run normally, without any
> coordination and parallelism. But everything below --- would happen in
> multiple nodes. If you agree, yes, then we're in violent agreement
> ;). The "single node that gets copied" bit above makes me a bit unsure
> whether we are though.

Yeah, I think we're talking about the same thing.

> To me, given the existing executor code, it seems easiest to achieve
> that by having the ParallelismDrivingNode above have a dynamic number
> of nestloop children in different backends and point the coordinated
> seqscan to some shared state.  As you point out, the number of these
> children cannot be certainly known (just targeted for) at plan time;
> that puts a certain limit on how independent they are.  But since a
> large number of them can be independent between workers it seems awkward
> to generally treat them as being the same node across workers. But maybe
> that's just an issue with my mental model.

I think it makes sense to think of a set of tasks in which workers can
assist.  So you have a query tree, which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node.  To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.
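
In code terms, the split might look roughly like this (invented names,
only meant to show which part is shared and which part stays private to
each backend):

#include "postgres.h"
#include "access/heapam.h"
#include "port/atomics.h"

/* The only piece in shared memory: the coordination state. */
typedef struct SharedScanCoordination
{
    pg_atomic_uint32 next_block;    /* which block gets scanned next */
} SharedScanCoordination;

/* Each backend's private copy of the node: its own scan descriptor etc. */
typedef struct WorkerSeqScanState
{
    HeapScanDesc            scan;   /* backend-local scan state */
    SharedScanCoordination *coord;  /* points into a DSM segment */
} WorkerSeqScanState;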

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


