Thread: Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> After searching through earlier mails about parallel scan, I am not
> sure whether the shared state was considered to be a potential factor
> that might reduce parallel query gains, when deciding the calculation
> for number of workers for a parallel seq scan. I mean even today if we
> allocate 10 workers as against a calculated 4 workers count for a
> parallel seq scan, they might help. I think it's just that we don't
> know if they would *always* help or it would regress sometimes.

No, that's not considered, currently.

This is actually an issue even for nodes that are not parallel-aware at all. For example, consider this:

Hash Join
-> Parallel Seq Scan
-> Hash
   -> Seq Scan

It is of course possible that the Parallel Seq Scan could run into contention problems if the number of workers is large, but in my experience there are bigger problems here. The non-parallel Seq Scan can also contend -- not of course over the shared mutex because there isn't one, but over access to the blocks themselves. Every one of those blocks has a content lock and a buffer header and so on, and having multiple processes accessing those things at the same time scales well, but not perfectly. The Hash node can also contend: if the hash join spills to disk, you've got multiple processes reading and writing to the temp directory at the same time and, of course, that can be worse than just one process doing it -- sometimes much worse. It can also be better, depending on how much I/O gets generated and how much I/O bandwidth you have.

The main thing that keeps this from being a crippling issue right now is the fact that we tend not to use that many parallel workers in the first place. We're trying to scale a query that would otherwise use 1 process out to 3 or 5 processes, and so the contention effects, in many cases, are not too bad. Multiple people (including David Rowley as well as folks here at EnterpriseDB) have demonstrated that for certain queries, we can actually use a lot more workers and everything works great. The problem is that for other queries, using a lot of workers works terribly. The planner doesn't know how to figure out which it'll be - and honestly, I don't either.

/me crosses fingers, hopes someone smarter will work on this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
> On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The main thing that keeps this from being a crippling issue right now
> is the fact that we tend not to use that many parallel workers in the
> first place. We're trying to scale a query that would otherwise use 1
> process out to 3 or 5 processes, and so the contention effects, in
> many cases, are not too bad. Multiple people (including David Rowley
> as well as folks here at EnterpriseDB) have demonstrated that for
> certain queries, we can actually use a lot more workers and everything
> works great. The problem is that for other queries, using a lot of
> workers works terribly. The planner doesn't know how to figure out
> which it'll be - and honestly, I don't either.

Have those benchmarks, even in a very informal form, been shared / collected / referenced centrally? I'd be very interested to know where the different contention points are. Possibilities:

- in non-resident workloads: too many concurrent IOs submitted, leading
  to excessive random IO
- internal contention in the parallel node, e.g. parallel seqscan
- contention on PG components like buffer mapping, procarray, clog
- contention on individual buffers, e.g. btree root pages, or small
  tables in nestloop joins
- just too many context switches, due to ineffective parallelization

Probably multiple of those are a problem, but without trying to figure them out, we're going to have a hard time developing a better costing model...

Greetings,

Andres Freund
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
> It is of course possible that the Parallel Seq Scan could run into
> contention problems if the number of workers is large, but in my
> experience there are bigger problems here. The non-parallel Seq Scan
> can also contend -- not of course over the shared mutex because there
> isn't one, but over access to the blocks themselves. Every one of
> those blocks has a content lock and a buffer header and so on, and
> having multiple processes accessing those things at the same time
> scales well, but not perfectly. The Hash node can also contend: if
> the hash join spills to disk, you've got multiple processes reading
> and writing to the temp directory at the same time and, of course,
> that can be worse than just one process doing it -- sometimes much
> worse. It can also be better, depending on how much I/O gets
> generated and how much I/O bandwidth you have.

Yeah, I did get some time to look over the contention in Parallel Seq Scan a while back, and I discovered that on the machine that I was testing on, the lock obtained in heap_parallelscan_nextpage() was causing workers to have to wait for other workers to fetch their next task to work on.

I ended up writing the attached (which I'd not intended to post until some time closer to when the doors open for PG11). At the moment it's basically just a test patch to see how it affects things when we give workers a bit more to do before they come back to look for more work. In this case, I've just given them 10 pages to work on, instead of the 1 that's allocated in 9.6 and v10.

A quick test on a pretty large table on a large machine shows:

Unpatched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 5211.485 ms (00:05.211)

Patched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 2523.983 ms (00:02.524)

So it seems worth looking into. "a" was just a table with a single int column. I'm unsure as yet if there are more gains to be had for tables with wider tuples. I do suspect the patch will be a bigger win in those cases, since there's less work to do for each page, e.g. fewer advance aggregate calls, so likely they'll be looking for their next page a bit sooner.

Now I'm not going to pretend that this patch is ready for prime time. I've not yet worked out how to properly report sync-scan locations without risking reporting later pages after reporting the end of the scan. What I have at the moment could cause a report to be missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch size. I'm also not sure how batching like this affects read-aheads, but at least the numbers above speak for something. Although none of the pages in this case came from disk.

I'd had thoughts that the 10 pages wouldn't be constant, but that the batching size would depend on the size of the relation to be scanned. I had rough ideas to just try to make about 1 million batches. Something like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so that we only take more than 1 page if there's some decent amount to process. We don't want to make the batches too big as we might end up having to wait on slow workers at the end of a scan.

Anyway, I don't want to hijack this thread with discussions on this. I just wanted to mark that I plan to work on this in order to avoid any repeat developments or analysis. I'll probably start a new thread for this sometime nearer PG11's dev cycle.
The patch is attached if in the meantime someone wants to run this on some big hardware.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
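To make the batching idea concrete, here is a minimal sketch of what a batched heap_parallelscan_nextpage() could look like. This is not the attached patch: the ParallelScanBatch struct, the function name, and the fixed batch size are invented for illustration, the scan is assumed to start at block 0 (no synchronized scans), and the unsolved sync-scan reporting problem mentioned above is ignored. phs_mutex, phs_cblock and phs_nblocks are the existing ParallelHeapScanDesc fields.

/* Hypothetical worker-local state; not part of the attached patch. */
typedef struct ParallelScanBatch
{
    BlockNumber next;       /* next block to hand to this worker */
    BlockNumber end;        /* one past the last block in the batch */
} ParallelScanBatch;

#define PARALLEL_SCAN_BATCH_SIZE 10

static BlockNumber
heap_parallelscan_nextpage_batched(ParallelHeapScanDesc parallel_scan,
                                   ParallelScanBatch *batch)
{
    /* Fast path: serve from the worker-local batch, no shared state. */
    if (batch->next < batch->end)
        return batch->next++;

    /* Batch used up: grab another chunk of pages under the mutex. */
    SpinLockAcquire(&parallel_scan->phs_mutex);
    if (parallel_scan->phs_cblock >= parallel_scan->phs_nblocks)
    {
        SpinLockRelease(&parallel_scan->phs_mutex);
        return InvalidBlockNumber;              /* scan is finished */
    }
    batch->next = parallel_scan->phs_cblock;
    batch->end = Min(batch->next + PARALLEL_SCAN_BATCH_SIZE,
                     parallel_scan->phs_nblocks);
    parallel_scan->phs_cblock = batch->end;
    SpinLockRelease(&parallel_scan->phs_mutex);

    return batch->next++;
}

The point of the exercise is that each worker touches the shared mutex once per PARALLEL_SCAN_BATCH_SIZE pages rather than once per page, which is where a speedup like the one above would come from.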
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 13:37, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally? I'd be very interested to know where
> the different contention points are. Possibilities:

I posted mine on [1], although the post does not go into much detail about the contention points. I only really briefly mention it at the end.

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark/#comment-248273

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
Hi,

On 2017-05-05 14:20:48 +1200, David Rowley wrote:
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back, and I discovered that on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.

Oh, if it's "just" that, it should be easy enough to address. Two approaches:
1) use atomic ops for increment, modulo afterwards to deal with
   wraparound in the synchronous scan
2) batching

> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.

Right.

> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

Neat!

> I'd had thoughts that the 10 pages wouldn't be constant, but that the
> batching size would depend on the size of the relation to be scanned.
> I had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.

I wonder how much doing the atomic ops approach alone can help; that doesn't have the issue that the work might be unevenly distributed between pages.

> Anyway, I don't want to hijack this thread with discussions on this.
> I just wanted to mark that I plan to work on this in order to avoid
> any repeat developments or analysis. I'll probably start a new thread
> for this sometime nearer PG11's dev cycle.

Cool. I think it might make sense to post about this soon, just to give it some more visibility and reduce the potential for duplication.

- andres
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
> Multiple people (including David Rowley
> as well as folks here at EnterpriseDB) have demonstrated that for
> certain queries, we can actually use a lot more workers and everything
> works great. The problem is that for other queries, using a lot of
> workers works terribly. The planner doesn't know how to figure out
> which it'll be - and honestly, I don't either.

For me, it seems pretty much related to the number of tuples processed on a worker vs. how many they return. As a general rule, I'd say the higher this ratio, the higher the efficiency ratio will be for the worker, although that's not taking into account contention points where workers must wait for fellow workers to complete some operation. I think parallel_tuple_cost is a good GUC to have; perhaps we can be smarter about the use of it when deciding on how many workers should be used.

By efficiency, I mean that if a query takes 10 seconds in a normal serial plan, and with 1 worker added it takes 5 seconds, it would be 100% efficient to use another worker. I charted this in [1]. It would have been interesting to chart the same in a query that returned a larger number of groups, but I ran out of time. I think it pretty much goes, without testing, that more groups == less efficiency, which'll be due to more overhead in parallel tuple communication, and more work to do in the serial portion of the plan.

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
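For a concrete reading of the efficiency measure described above, here is a small self-contained helper (purely illustrative; not from any patch). Efficiency is the observed speedup divided by the ideal linear speedup, so the 10-second serial query that drops to 5 seconds with one worker added (two processes in total) comes out at exactly 100%.

#include <stdio.h>

/*
 * Parallel efficiency as described above: observed speedup divided by
 * the ideal (linear) speedup.  1.0 means perfectly efficient scaling.
 */
static double
parallel_efficiency(double serial_secs, double parallel_secs, int nprocs)
{
    double speedup = serial_secs / parallel_secs;

    return speedup / nprocs;
}

int
main(void)
{
    /* 10s serial query drops to 5s with one extra worker (2 processes). */
    printf("%.2f\n", parallel_efficiency(10.0, 5.0, 2));   /* prints 1.00 */
    return 0;
}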
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 14:36, Andres Freund <andres@anarazel.de> wrote:
> I wonder how much doing the atomic ops approach alone can help; that
> doesn't have the issue that the work might be unevenly distributed
> between pages.

I wondered that too, since I thought the barrier for making this change would be lower by doing it that way.

I didn't manage to think of a way to get around wrapping the position back to 0 when synch-scans are involved, i.e.:

parallel_scan->phs_cblock++;
if (parallel_scan->phs_cblock >= scan->rs_nblocks)
    parallel_scan->phs_cblock = 0;

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-05 14:40:43 +1200, David Rowley wrote:
> On 5 May 2017 at 14:36, Andres Freund <andres@anarazel.de> wrote:
> > I wonder how much doing the atomic ops approach alone can help; that
> > doesn't have the issue that the work might be unevenly distributed
> > between pages.
>
> I wondered that too, since I thought the barrier for making this change
> would be lower by doing it that way.
>
> I didn't manage to think of a way to get around wrapping the
> position back to 0 when synch-scans are involved, i.e.:
>
> parallel_scan->phs_cblock++;
> if (parallel_scan->phs_cblock >= scan->rs_nblocks)
>     parallel_scan->phs_cblock = 0;

Increment phs_cblock without checking rs_nblocks, but outside of the lock do a % scan->rs_nblocks, to get the "actual" position. Finish if (phs_cblock - phs_startblock) / scan->rs_nblocks >= 1.

The difficult part seems to be the parallel_scan->phs_startblock computation, but that we probably can do via a read barrier & unlocked check, and then a spinlock & recheck if still uninitialized.

- Andres
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-04 19:45:33 -0700, Andres Freund wrote:
> Increment phs_cblock without checking rs_nblocks, but outside of the
> lock do a % scan->rs_nblocks, to get the "actual" position. Finish if
> (phs_cblock - phs_startblock) / scan->rs_nblocks >= 1.

Err, as I've been pointed to: It should be s/lock/atomic operation/
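Putting these two messages together, a minimal sketch of the atomic-counter approach might look as follows. This is illustrative only: it assumes phs_cblock has been converted to a pg_atomic_uint64 initialized to phs_startblock, and the phs_startblock initialization race mentioned above is not handled.

static BlockNumber
heap_parallelscan_nextpage_atomic(HeapScanDesc scan,
                                  ParallelHeapScanDesc parallel_scan)
{
    uint64      pos;

    /*
     * Hand out positions with a lock-free fetch-and-add.  phs_cblock is
     * assumed (for this sketch) to be a pg_atomic_uint64 initialized to
     * phs_startblock; it only ever increases and is never reset.
     */
    pos = pg_atomic_fetch_add_u64(&parallel_scan->phs_cblock, 1);

    /* Stop once every block has been handed out exactly once. */
    if (pos >= (uint64) parallel_scan->phs_startblock + scan->rs_nblocks)
        return InvalidBlockNumber;

    /* Wrap the ever-increasing position back onto a real block number. */
    return (BlockNumber) (pos % scan->rs_nblocks);
}

Since the counter only ever increases, no compare-and-swap loop is needed to handle the wraparound; the modulo maps the abstract position back onto a real block number, exactly once per block.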
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Fri, May 5, 2017 at 2:23 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 5 May 2017 at 13:37, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>>> Multiple people (including David Rowley
>>> as well as folks here at EnterpriseDB) have demonstrated that for
>>> certain queries, we can actually use a lot more workers and everything
>>> works great. The problem is that for other queries, using a lot of
>>> workers works terribly. The planner doesn't know how to figure out
>>> which it'll be - and honestly, I don't either.
>>
>> Have those benchmarks, even in a very informal form, been shared /
>> collected / referenced centrally? I'd be very interested to know where
>> the different contention points are. Possibilities:
>
> I posted mine on [1], although the post does not go into much detail
> about the contention points. I only really briefly mention it at the
> end.

Just for fun, check out pages 42 and 43 of Wei Hong's thesis. He worked on Berkeley POSTGRES parallel query and a spin-off called XPRS, and they got linear seq scan scaling up to the number of spindles:

http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

I gather from flicking through the POSTGRES 4.2 sources and this stuff about XPRS that they switched from a "launch N workers!" model to a "generate tasks and schedule them" model somewhere between these systems. Chapters 2 and 3 cover the problem of avoiding excessive parallelism that reduces performance, adjusting dynamically toward maximum throughput. I suspect we're going that way too at some point, and it would certainly fix some problems I ran into with Parallel Shared Hash.

XPRS's cost model included resource consumption, not just 'timerons'. This is something I grappled with when trying to put a price tag on Parallel Shared Hash plans where just one worker builds the hash table while the others wait. I removed that plan from the patch because it became mostly redundant, but when it was there Postgres thought it was the same cost as a plan where every worker hammers your system building the same hash table, whereas XPRS would have considered such a plan ludicrously expensive (depending on his 'w' term, see page 28, which determines whether you care more about resource usage or response time).

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Amit Khandekar
Date:
On 5 May 2017 at 07:50, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> It is of course possible that the Parallel Seq Scan could run into
>> contention problems if the number of workers is large, but in my
>> experience there are bigger problems here. The non-parallel Seq Scan
>> can also contend -- not of course over the shared mutex because there
>> isn't one, but over access to the blocks themselves. Every one of
>> those blocks has a content lock and a buffer header and so on, and
>> having multiple processes accessing those things at the same time
>> scales well, but not perfectly. The Hash node can also contend: if
>> the hash join spills to disk, you've got multiple processes reading
>> and writing to the temp directory at the same time and, of course,
>> that can be worse than just one process doing it -- sometimes much
>> worse. It can also be better, depending on how much I/O gets
>> generated and how much I/O bandwidth you have.
>
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back, and I discovered that on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.
>
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.
>
> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

The result is quite impressive.

> So it seems worth looking into. "a" was just a table with a single int
> column. I'm unsure as yet if there are more gains to be had for tables
> with wider tuples. I do suspect the patch will be a bigger win in
> those cases, since there's less work to do for each page, e.g. fewer
> advance aggregate calls, so likely they'll be looking for their next
> page a bit sooner.
>
> Now I'm not going to pretend that this patch is ready for prime
> time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affects read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.
>
> I'd had thoughts that the 10 pages wouldn't be constant, but that the
> batching size would depend on the size of the relation to be scanned.
> I had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.
I was wondering: if we keep increasing the batch size, that might lead to I/O contention. I mean, the higher the batch size, the higher the chance of causing more random I/O, because all workers would be accessing disk blocks far away from each other in parallel. So there might be a trade-off here. (It's another matter that there needs to be I/O contention testing done, in general, for many scenarios.)

I believe there are certain parallel scans (parallel bitmap heap scan?) where the logic to go to the next block consumes time, so more waits consequently.

What if we supply each worker with a sequence of blocks to be scanned, rather than a range of blocks? Each worker would have a list of the next few blocks, say:

w1 : 1, 5, 9, 13
w2 : 2, 6, 10, 14
w3 : 3, 7, 11, 15
w4 : .....

Maybe the leader worker would do the accounting and store the instructions for each of the workers at individual locations in shared memory, so there won't be any contention while accessing them.

This may be simple/applicable for a sequential scan, but not for other scans, for some of which this may not even be possible. But basically I was thinking of a way around to tackle shared memory contention as well as random I/O.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
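A minimal sketch of the block-striding idea above (illustrative only: worker_number and nworkers would come from shared memory, and synchronized-scan start offsets are ignored). Each worker derives its next block purely from worker-local state, so no shared counter is touched at all:

/*
 * Block-striding: worker i visits blocks i, i + nworkers, i + 2*nworkers,
 * and so on.  All state is worker-local, so no shared counter or lock is
 * needed; the cost is that each worker's reads are strided rather than
 * strictly sequential.
 */
static BlockNumber
strided_nextpage(BlockNumber nblocks, int worker_number, int nworkers,
                 uint64 *local_counter)
{
    uint64      candidate;

    candidate = (uint64) worker_number + *local_counter * (uint64) nworkers;
    if (candidate >= (uint64) nblocks)
        return InvalidBlockNumber;      /* this worker has no blocks left */

    (*local_counter)++;
    return (BlockNumber) candidate;
}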
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Amit Kapila
Date:
On Fri, May 5, 2017 at 7:07 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The main thing that keeps this from being a crippling issue right now
>> is the fact that we tend not to use that many parallel workers in the
>> first place. We're trying to scale a query that would otherwise use 1
>> process out to 3 or 5 processes, and so the contention effects, in
>> many cases, are not too bad. Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally?

The numbers have been posted on the parallel seq scan thread and were more formally shared in a PGCon presentation ([1], refer slide 15).

> I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too many concurrent IOs submitted, leading
>   to excessive random IO
> - internal contention in the parallel node, e.g. parallel seqscan

I think one of the points of scaling/contention is tuple communication. This is what shows up in perf profiles, and we (one of my colleagues is working on it) are already working on some ways to improve the same, but I don't think we can get anywhere near linear scaling by improving that alone.

[1] - https://www.pgcon.org/2015/schedule/events/785.en.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 9:37 PM, Andres Freund <andres@anarazel.de> wrote:
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally? I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too many concurrent IOs submitted, leading
>   to excessive random IO
> - internal contention in the parallel node, e.g. parallel seqscan
> - contention on PG components like buffer mapping, procarray, clog
> - contention on individual buffers, e.g. btree root pages, or small
>   tables in nestloop joins
> - just too many context switches, due to ineffective parallelization
>
> Probably multiple of those are a problem, but without trying to figure
> them out, we're going to have a hard time developing a better costing
> model...

It's pretty easy (but IMHO not very interesting) to measure internal contention in the Parallel Seq Scan node. As David points out downthread, that problem isn't trivial to fix, but it's not that hard, either.

I do believe that there is a problem with too much concurrent I/O on things like:

Gather
-> Parallel Seq Scan on lineitem
-> Hash Join
   -> Seq Scan on lineitem

If that goes to multiple batches, you're probably wrenching the disk head all over the place - multiple processes are now reading and writing batch files at exactly the same time.

I also strongly suspect that contention on individual buffers can turn into a problem on queries like this:

Gather (Merge)
-> Merge Join
   -> Parallel Index Scan
   -> Index Scan

The parallel index scan surely has some upper limit on concurrency, but if that is exceeded, what will tend to happen is that processes will sleep. On the inner side, though, every process is doing a full index scan, and chances are good that they are doing it more or less in lock step, hitting the same buffers at the same time.

Also consider:

Finalize HashAggregate
-> Gather
   -> Partial HashAggregate
      -> Parallel Seq Scan

Suppose that the average group contains ten items which will tend to be widely spaced across the table. As you add workers, the number of workers across which any given group gets spread increases. There's probably a "sweet spot" here. Up to a certain point, adding workers makes it faster, because the work of the Seq Scan and the Partial HashAggregate gets spread across more processes. However, adding workers also increases the number of rows that pass through the Gather node, because as you add workers more groups end up being split across workers, or across more workers. That means more and more of the aggregation starts happening in the Finalize HashAggregate rather than the Partial HashAggregate. If you had, for example, 20 workers here, almost nothing would be happening in the Partial HashAggregate, because chances are good that each of the 10 rows in each group would be encountered by a different worker, so that'd probably be counterproductive.

I think there are two separate questions here:

1. How do we reduce or eliminate contention during parallel query execution?
2. How do we make the query planner smarter about picking the optimal number of workers?

I think the second problem is both more difficult and more interesting. I think that no matter how much work we do on #1, there are always going to be cases where the amount of effective parallelism has some finite limit - and I think the limit will vary substantially from query to query.
So, without #2, we'll either leave a lot of money on the table for queries that can benefit from using a large number of workers, or we'll throw extra workers uselessly at queries where they don't help (or even make things worse).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
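A back-of-the-envelope model of the group-splitting effect described above, assuming the rows of each group land on workers uniformly at random (purely illustrative; the function names are made up). Each worker emits one partial-aggregate row per group it sees at least once, so for 10-row groups, 2 workers push about 2 rows per group through Gather, while 20 workers push about 8 -- at which point nearly all of the aggregation has moved into the Finalize step:

#include <math.h>
#include <stdio.h>

/*
 * Expected rows emitted through Gather for g groups of k rows spread
 * over w workers, with uniform random row placement: each worker emits
 * one partial-aggregate row per group it sees at least one row of, so
 * E[rows] = g * w * (1 - (1 - 1/w)^k).
 */
static double
expected_gather_rows(double g, double k, double w)
{
    return g * w * (1.0 - pow(1.0 - 1.0 / w, k));
}

int
main(void)
{
    /* 10-row groups: 2 workers split each group into ~2 partial rows... */
    printf("w=2:  %.2f rows/group\n", expected_gather_rows(1, 10, 2));
    /* ...while 20 workers split it into ~8, starving the Partial step. */
    printf("w=20: %.2f rows/group\n", expected_gather_rows(1, 10, 20));
    return 0;
}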
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 10:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Now I'm not going to pretend that this patch is ready for prime
> time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affects read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.

This kind of approach has also been advocated within EnterpriseDB, and I immediately thought of the read-ahead problem. I think we need more research into how Parallel Seq Scan interacts with OS readahead behavior on various operating systems. It seems possible that Parallel Seq Scan frustrates operating system read-ahead even without this change on at least some systems (because maybe they can only detect ascending block number requests within a single process), and even more possible that you run into problems when the block number requests are no longer precisely in order (which, at present, they should be, or very close to it). If it turns out to be a problem, either currently or with this patch, we might need to add explicit prefetching logic to Parallel Seq Scan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
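For what it's worth, a sketch of what "explicit prefetching logic" could amount to at the lowest level, assuming batched page allocation (the function name is illustrative; in PostgreSQL proper this would go through PrefetchBuffer() rather than a raw file descriptor):

#include <fcntl.h>

/*
 * Hypothetical sketch: before a worker starts on a batch of pages, ask
 * the OS to read the whole batch ahead of time, so the scan no longer
 * depends on per-process readahead heuristics detecting a sequential
 * pattern.
 */
static void
prefetch_batch(int fd, off_t start_offset, off_t batch_bytes)
{
#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(fd, start_offset, batch_bytes,
                         POSIX_FADV_WILLNEED);
#endif
}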
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 10:36 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> For me, it seems pretty much related to the number of tuples processed
> on a worker vs. how many they return. As a general rule, I'd say the
> higher this ratio, the higher the efficiency ratio will be for the
> worker, although that's not taking into account contention points
> where workers must wait for fellow workers to complete some operation.
> I think parallel_tuple_cost is a good GUC to have; perhaps we can be
> smarter about the use of it when deciding on how many workers should
> be used.

It does seem pretty clear that Gather is horrifyingly slow and therefore often the gating factor, but I think it's also clear that even if you removed the overhead of Gather completely, there will always be *something* that limits scaling -- and I bet we're quite a long way from the point where that thing is typically the amount of hardware that you have.

One idea that crossed my mind is to just have workers write all of their output tuples to a temp file and have the leader read them back in. At some cost in I/O, this would completely eliminate the overhead of workers waiting for the leader. In some cases, it might be worth it. At the least, it could be interesting to try a prototype implementation of this with different queries (TPC-H, maybe) and see what happens. It would give us some idea how much of a problem stalling on the leader is in practice. Wait event monitoring could possibly also be used to figure out an answer to that question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
Hi,

On 2017-05-05 15:29:40 -0400, Robert Haas wrote:
> On Thu, May 4, 2017 at 9:37 PM, Andres Freund <andres@anarazel.de> wrote:
> It's pretty easy (but IMHO not very interesting) to measure internal
> contention in the Parallel Seq Scan node. As David points out
> downthread, that problem isn't trivial to fix, but it's not that hard,
> either.

Well, I think it's important that we do some basic optimization before we start building a more elaborate costing model, because we'll otherwise codify assumptions that won't be true for long. Changing an established costing model once it's out there is really hard, because you always will hurt someone. I think some of the contention is easy enough to remove, and some of the IO concurrency issues

> I do believe that there is a problem with too much concurrent
> I/O on things like:
>
> Gather
> -> Parallel Seq Scan on lineitem
> -> Hash Join
>    -> Seq Scan on lineitem
>
> If that goes to multiple batches, you're probably wrenching the disk
> head all over the place - multiple processes are now reading and
> writing batch files at exactly the same time.

At least on rotating media. On decent SSDs you usually can have so many concurrent IOs out there that it's not actually easy to overwhelm a disk that way. While we obviously still need to pay some attention to spinning disks, I also think that being good on decent (not great) SSDs is more important. We shouldn't optimize for things to run most performant on hydra with its slow storage system ;)

We probably should take something like effective_io_concurrency into account, but that'd require a smarter default than we currently have. It's not that hard to estimate an upper bound of parallelism with a meaningful effective_io_concurrency - we'd probably have to move the effective_io_concurrency handling in bitmapscans to be a plan parameter though, so it's divided by the number of workers.

> I also strongly suspect
> that contention on individual buffers can turn into a problem on
> queries like this:
>
> Gather (Merge)
> -> Merge Join
>    -> Parallel Index Scan
>    -> Index Scan
>
> The parallel index scan surely has some upper limit on concurrency,
> but if that is exceeded, what will tend to happen is that processes
> will sleep. On the inner side, though, every process is doing a full
> index scan, and chances are good that they are doing it more or less in
> lock step, hitting the same buffers at the same time.

Hm. Not sure how big that problem is practically. But I wonder if it's worthwhile to sleep more if you hit contention or something like that...

> I think there are two separate questions here:
>
> 1. How do we reduce or eliminate contention during parallel query execution?
> 2. How do we make the query planner smarter about picking the optimal
>    number of workers?
>
> I think the second problem is both more difficult and more
> interesting. I think that no matter how much work we do on #1, there
> are always going to be cases where the amount of effective parallelism
> has some finite limit - and I think the limit will vary substantially
> from query to query. So, without #2, we'll either leave a lot of
> money on the table for queries that can benefit from using a large
> number of workers, or we'll throw extra workers uselessly at queries
> where they don't help (or even make things worse).

Yea, I agree that 2) is more important in the long run, but I do think that my point that we shouldn't put too much effort into modelling concurrency before doing basic optimization still stands.
Greetings,

Andres Freund
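A trivial sketch of the effective_io_concurrency point above, i.e. treating the GUC as a total budget divided among the processes participating in a scan (hypothetical; no such helper exists in the tree, and the names are invented):

/*
 * Split the effective_io_concurrency budget across the processes
 * participating in a parallel scan, so the total number of concurrently
 * submitted IOs stays roughly what the GUC asks for.
 */
static int
per_worker_io_concurrency(int effective_io_concurrency, int nparticipants)
{
    int     budget = effective_io_concurrency / nparticipants;

    return (budget < 1) ? 1 : budget;
}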
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Peter Geoghegan
Date:
On Fri, May 5, 2017 at 12:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> One idea that crossed my mind is to just have workers write all of
> their output tuples to a temp file and have the leader read them back
> in. At some cost in I/O, this would completely eliminate the overhead
> of workers waiting for the leader. In some cases, it might be worth
> it. At the least, it could be interesting to try a prototype
> implementation of this with different queries (TPC-H, maybe) and see
> what happens. It would give us some idea how much of a problem
> stalling on the leader is in practice. Wait event monitoring could
> possibly also be used to figure out an answer to that question.

The use of temp files in all cases was effective in my parallel external sort patch, relative to what I imagine an approach built on a Gather node would get you, but not because of the inherent slowness of a Gather node. I'm not so sure that Gather is actually inherently slow, given the interface it supports.

While incremental, retail processing of each tuple is flexible and composable, it will tend to be slow compared to an approach based on batch processing (for tasks where you happen to be able to get away with batch processing). This is true for all the usual reasons -- better locality of access, better branch prediction properties, lower "effective instruction count" due to having very tight inner loops, and so on.

I agree with Andres that we shouldn't put too much effort into modelling concurrency ahead of optimizing serial performance. The machine's *aggregate* memory bandwidth should be used as efficiently as possible, and parallelism is just one (very important) tool for making that happen.

--
Peter Geoghegan
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Sat, May 6, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 4, 2017 at 10:20 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Now I'm not going to pretend that this patch is ready for prime
>> time. I've not yet worked out how to properly report sync-scan
>> locations without risking reporting later pages after reporting the
>> end of the scan. What I have at the moment could cause a report to be
>> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
>> size. I'm also not sure how batching like this affects read-aheads, but
>> at least the numbers above speak for something. Although none of the
>> pages in this case came from disk.
>
> This kind of approach has also been advocated within EnterpriseDB, and
> I immediately thought of the read-ahead problem. I think we need more
> research into how Parallel Seq Scan interacts with OS readahead
> behavior on various operating systems. It seems possible that Parallel
> Seq Scan frustrates operating system read-ahead even without this
> change on at least some systems (because maybe they can only detect
> ascending block number requests within a single process), and even more
> possible that you run into problems when the block number requests are
> no longer precisely in order (which, at present, they should be, or
> very close to it). If it turns out to be a problem, either currently or
> with this patch, we might need to add explicit prefetching logic to
> Parallel Seq Scan.

I don't know much about this stuff, but I was curious to go looking at source code. I hope someone will correct me if I'm wrong, but here's what I could glean:

In Linux, each process that opens a file gets its own 'file' object[1][5]. Each of those has its own 'file_ra_state' object[2][3], used by ondemand_readahead[4] for sequential read detection. So I speculate that page-at-a-time parallel seq scan must look like random access to Linux.

In FreeBSD the situation looks similar. Each process that opens a file gets a 'file' object[8] which has members 'f_seqcount' and 'f_nextoff'[6]. These are used by the 'sequential_heuristics' function[7] which affects the ioflag which UFS/FFS uses to control read ahead (see ffs_read). So I speculate that page-at-a-time parallel seq scan must look like random access to FreeBSD too.

In both cases I suspect that if you'd inherited the file descriptor (or sent it to the other process via obscure tricks), it would actually work because they'd have the same 'file' entry, but that's clearly not workable for md.c. Experimentation required...

[1] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L837
[2] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L858
[3] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L817
[4] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/mm/readahead.c#L376
[5] http://www.makelinux.net/ldd3/chp-3-sect-3 "There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure."
[6] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/sys/file.h#L180
[7] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/kern/vfs_vnops.c#L477
[8] Page 319 of 'Design and Implementation of the FreeBSD Operating System' 2nd Edition

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 14:54, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Just for fun, check out pages 42 and 43 of Wei Hong's thesis. He
> worked on Berkeley POSTGRES parallel query and a spin-off called XPRS,
> and they got linear seq scan scaling up to the number of spindles:
>
> http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

That's interesting. I'd no idea that work was done. Actually, I didn't really know that anyone had thought to have more than one processor back then :-)

And I also now know the origins of the tenk1 table in the regression database. Those 10,000 rows were once used for benchmarking! I'm glad we're all using a couple more rows these days.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 6 May 2017 at 13:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> In Linux, each process that opens a file gets its own 'file'
> object[1][5]. Each of those has its own 'file_ra_state'
> object[2][3], used by ondemand_readahead[4] for sequential read
> detection. So I speculate that page-at-a-time parallel seq scan must
> look like random access to Linux.
>
> In FreeBSD the situation looks similar. Each process that opens a
> file gets a 'file' object[8] which has members 'f_seqcount' and
> 'f_nextoff'[6]. These are used by the 'sequential_heuristics'
> function[7] which affects the ioflag which UFS/FFS uses to control
> read ahead (see ffs_read). So I speculate that page-at-a-time
> parallel seq scan must look like random access to FreeBSD too.
>
> In both cases I suspect that if you'd inherited the file descriptor
> (or sent it to the other process via obscure tricks), it would actually
> work because they'd have the same 'file' entry, but that's clearly not
> workable for md.c.

Interesting!

> Experimentation required...

Indeed. I do remember long discussions on this before Parallel Seq Scan went in, but I don't recall if anyone checked any OS kernels to see what they did.

We really need a machine with good IO concurrency, and not too much RAM, to test these things out. It could well be that for a suitably large table we'd want to scan a whole 1GB extent per worker.

I did post a patch to have heap_parallelscan_nextpage() use atomics instead of locking over in [1], but I think doing atomics there does not rule out also adding batching later. In fact, I think it structures things so batching would be easier than it is today.

[1] https://www.postgresql.org/message-id/CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy=B=bhqgaKn7cYzm-w@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Mon, May 8, 2017 at 1:39 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 6 May 2017 at 13:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Experimentation required...
>
> Indeed. I do remember long discussions on this before Parallel Seq
> Scan went in, but I don't recall if anyone checked any OS kernels to
> see what they did.
>
> We really need a machine with good IO concurrency, and not too much
> RAM, to test these things out. It could well be that for a suitably
> large table we'd want to scan a whole 1GB extent per worker.

I did a bunch of simple experiments this morning to try to observe RA effects, using a couple of different EDB machines running Linux. I wrote a simple program to read large files sequentially using lseek + read, but rotate the reads over N file descriptors to simulate parallel workers. I was surprised to find that I couldn't change cache-cold read performance that way, up to very large numbers of N. I did manage to break it by introducing some artificial disorder, reversing/scrambling the read order of small groups of blocks, but even that required groups of over about 16 blocks before performance started to drop (possibly related to the window size, which I can't see due to permissions right now). I've also learned that RAID cards sometimes do read-ahead of their own, making matters more complicated. I hope to report more when I figure out all the moving parts...

> I did post a patch to have heap_parallelscan_nextpage() use atomics
> instead of locking over in [1], but I think doing atomics there does
> not rule out also adding batching later. In fact, I think it
> structures things so batching would be easier than it is today.

+1

--
Thomas Munro
http://www.enterprisedb.com
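A reconstruction of the kind of test program described above (not Thomas's actual code; NFDS and the 8kB block size are assumptions). The file is read perfectly sequentially overall, but consecutive reads rotate over N descriptors, so each descriptor sees only every Nth block:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ  8192    /* assumed PostgreSQL-style block size */
#define NFDS    8       /* simulated number of parallel workers */

int
main(int argc, char **argv)
{
    int     fds[NFDS];
    char    buf[BLCKSZ];
    off_t   off = 0;
    int     i;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* Open the same file once per simulated worker. */
    for (i = 0; i < NFDS; i++)
    {
        if ((fds[i] = open(argv[1], O_RDONLY)) < 0)
        {
            perror("open");
            return 1;
        }
    }

    /*
     * The access pattern is perfectly sequential overall, but each
     * descriptor only ever sees every NFDS-th block, which per-fd
     * readahead heuristics may treat as random access.
     */
    for (i = 0;; i++, off += BLCKSZ)
    {
        int     fd = fds[i % NFDS];

        if (lseek(fd, off, SEEK_SET) < 0)
            break;
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;
    }
    return 0;
}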
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Haribabu Kommi
Date:
On Mon, May 8, 2017 at 11:39 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> We really need a machine with good IO concurrency, and not too much
> RAM, to test these things out. It could well be that for a suitably
> large table we'd want to scan a whole 1GB extent per worker.
>
> I did post a patch to have heap_parallelscan_nextpage() use atomics
> instead of locking over in [1], but I think doing atomics there does
> not rule out also adding batching later. In fact, I think it
> structures things so batching would be easier than it is today.
As part of our internal PostgreSQL project, we developed parallel seq
scan with batch mode only. The problem that we faced with batch mode
is making sure that all the parallel workers finish at almost the same
time, with a proper distribution of data pages. Otherwise, it may lead
to a situation where one worker alone is still processing the last
batch while all the others have already finished their jobs. In these
cases, we cannot achieve good performance. Whereas with the current
approach, the most extra time the last worker will spend is that
needed to scan the last single page of the table.

If we go with batching of 1GB per worker, there is a chance that the
data that satisfies the query condition falls into only one extent;
in such cases also the batching may not yield good results.
Regards,
Hari Babu
Fujitsu Australia
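One textbook answer to the straggler problem described above is guided self-scheduling, where the batch size shrinks as the scan nears its end. This is not from any posted patch -- just a sketch of the idea with invented names:

#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's typedef */

/*
 * Guided self-scheduling: each request hands out a batch proportional
 * to the number of still-unassigned blocks, so batches start large
 * (cheap on the shared counter, friendly to readahead) and shrink
 * toward a single page near the end of the scan.
 */
static BlockNumber
next_batch_size(BlockNumber remaining, int nworkers, BlockNumber max_batch)
{
    BlockNumber batch = remaining / (BlockNumber) (2 * nworkers);

    if (batch < 1)
        batch = 1;              /* caller must check remaining > 0 first */
    if (batch > max_batch)
        batch = max_batch;
    return batch;
}

Early batches are large enough to keep the shared counter cool and the I/O sequential, while the final batches degrade toward the current page-at-a-time behaviour, bounding how long the other workers can be left waiting on a slow straggler.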
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Peter Geoghegan
Date:
On Thu, May 4, 2017 at 7:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.

I think that this could benefit parallel sort, beyond the obvious fact that it too must have the contention you describe.

We generally are faster at sorting presorted input for all kinds of reasons (e.g., insertion sort fallback for quicksort, merging based on replacement of heap's root item). It follows that it's to our advantage to have parallel tuplesort read multiple pages in a range into a worker at once within the parallel heap scan that feeds workers. The leader, which merges worker runs, may ultimately have to perform fewer comparisons as a result of this, which is where most of the benefit would be.

--
Peter Geoghegan
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 6:17 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, May 4, 2017 at 7:20 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I ended up writing the attached (which I'd not intended to post until
>> some time closer to when the doors open for PG11). At the moment it's
>> basically just a test patch to see how it affects things when we give
>> workers a bit more to do before they come back to look for more work.
>> In this case, I've just given them 10 pages to work on, instead of the
>> 1 that's allocated in 9.6 and v10.
>
> I think that this could benefit parallel sort, beyond the obvious fact
> that it too must have the contention you describe.
>
> We generally are faster at sorting presorted input for all kinds of
> reasons (e.g., insertion sort fallback for quicksort, merging based on
> replacement of heap's root item). It follows that it's to our
> advantage to have parallel tuplesort read multiple pages in a range
> into a worker at once within the parallel heap scan that feeds
> workers. The leader, which merges worker runs, may ultimately have to
> perform fewer comparisons as a result of this, which is where most of
> the benefit would be.

On the other hand, it could hurt Gather Merge for essentially symmetric reasons - Gather Merge works best if all the tuples are in roughly the same range of values. Otherwise the work isn't equally distributed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company