Thread: Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> After searching through earlier mails about parallel scan, I am not
> sure whether the shared state was considered to be a potential factor
> that might reduce parallel query gains, when deciding the calculation
> for number of workers for a parallel seq scan. I mean even today if we
> allocate 10 workers as against a calculated 4 workers count for a
> parallel seq scan, they might help. I think it's just that we don't
> know if they would *always* help or it would regress sometimes.

No, that's not considered, currently.

This is actually an issue even for nodes that are not parallel-aware at all. For example, consider this:

Hash Join
-> Parallel Seq Scan
-> Hash
   -> Seq Scan

It is of course possible that the Parallel Seq Scan could run into contention problems if the number of workers is large, but in my experience there are bigger problems here. The non-parallel Seq Scan can also contend -- not of course over the shared mutex because there isn't one, but over access to the blocks themselves. Every one of those blocks has a content lock and a buffer header and so on, and having multiple processes accessing those things at the same time scales well, but not perfectly. The Hash node can also contend: if the hash join spills to disk, you've got multiple processes reading and writing to the temp directory at the same time and, of course, that can be worse than just one process doing it -- sometimes much worse. It can also be better, depending on how much I/O gets generated and how much I/O bandwidth you have.

The main thing that keeps this from being a crippling issue right now is the fact that we tend not to use that many parallel workers in the first place. We're trying to scale a query that would otherwise use 1 process out to 3 or 5 processes, and so the contention effects, in many cases, are not too bad. Multiple people (including David Rowley as well as folks here at EnterpriseDB) have demonstrated that for certain queries, we can actually use a lot more workers and everything works great. The problem is that for other queries, using a lot of workers works terribly. The planner doesn't know how to figure out which it'll be - and honestly, I don't either.

/me crosses fingers, hopes someone smarter will work on this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
> On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> The main thing that keeps this from being a crippling issue right now
> is the fact that we tend not to use that many parallel workers in the
> first place. We're trying to scale a query that would otherwise use 1
> process out to 3 or 5 processes, and so the contention effects, in
> many cases, are not too bad. Multiple people (including David Rowley
> as well as folks here at EnterpriseDB) have demonstrated that for
> certain queries, we can actually use a lot more workers and everything
> works great. The problem is that for other queries, using a lot of
> workers works terribly. The planner doesn't know how to figure out
> which it'll be - and honestly, I don't either.

Have those benchmarks, even in a very informal form, been shared / collected / referenced centrally? I'd be very interested to know where the different contention points are. Possibilities:

- in non-resident workloads: too many concurrent IOs submitted, leading
  to excessive random IO
- internal contention in the parallel node, e.g. parallel seqscan
- contention on PG components like buffer mapping, procarray, clog
- contention on individual buffers, e.g. btree root pages, or small
  tables in nestloop joins
- just too many context switches, due to ineffective parallelization

Probably multiple of those are a problem, but without trying to figure them out, we're going to have a hard time developing a better costing model...

Greetings,

Andres Freund
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
> It is of course possible that the Parallel Seq Scan could run into
> contention problems if the number of workers is large, but in my
> experience there are bigger problems here. The non-parallel Seq Scan
> can also contend -- not of course over the shared mutex because there
> isn't one, but over access to the blocks themselves. Every one of
> those blocks has a content lock and a buffer header and so on, and
> having multiple processes accessing those things at the same time
> scales well, but not perfectly. The Hash node can also contend: if
> the hash join spills to disk, you've got multiple processes reading
> and writing to the temp directory at the same time and, of course,
> that can be worse than just one process doing it -- sometimes much
> worse. It can also be better, depending on how much I/O gets
> generated and how much I/O bandwidth you have.

Yeah, I did get some time to look over the contention in Parallel Seq Scan a while back, and I discovered that on the machine that I was testing on, the lock obtained in heap_parallelscan_nextpage() was causing workers to have to wait for other workers to fetch their next task to work on.

I ended up writing the attached (which I'd not intended to post until some time closer to when the doors open for PG11). At the moment it's basically just a test patch to see how it affects things when we give workers a bit more to do before they come back to look for more work. In this case, I've just given them 10 pages to work on, instead of the 1 that's allocated in 9.6 and v10.

A quick test on a pretty large table on a large machine shows:

Unpatched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 5211.485 ms (00:05.211)

Patched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 2523.983 ms (00:02.524)

So it seems worth looking into. "a" was just a table with a single int column. I'm unsure as yet if there are more gains to be had for tables with wider tuples. I do suspect the patch will be a bigger win in those cases, since there's less work to do for each page, e.g. fewer advance aggregate calls, so likely they'll be looking for their next page a bit sooner.

Now I'm not going to pretend that this patch is ready for prime time. I've not yet worked out how to properly report sync-scan locations without risking reporting later pages after reporting the end of the scan. What I have at the moment could cause a report to be missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch size. I'm also not sure how batching like this affects read-aheads, but at least the numbers above speak for something. Although none of the pages in this case came from disk.

I'd had thoughts that the 10 pages wouldn't be constant, but that the batching size would depend on the size of the relation to be scanned. I had rough ideas to just try to make about 1 million batches. Something like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so that we only take more than 1 page if there's some decent amount to process. We don't want to make the batches too big as we might end up having to wait on slow workers at the end of a scan.

Anyway, I don't want to hijack this thread with discussions on this. I just wanted to mark that I plan to work on this in order to avoid any repeat developments or analysis. I'll probably start a new thread for this sometime nearer PG11's dev cycle.
The patch is attached if in the meantime someone wants to run this on some big hardware.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
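To make the batching idea concrete, here is a minimal sketch of what a batched heap_parallelscan_nextpage() could look like. This is not the attached patch: the ParallelScanBatch struct, the function name, and the fixed batch size are invented for illustration, the scan is assumed to start at block 0 (no synchronized scans), and the unsolved sync-scan reporting problem mentioned above is ignored. phs_mutex, phs_cblock and phs_nblocks are the existing ParallelHeapScanDesc fields.

/* Hypothetical worker-local state; not part of the attached patch. */
typedef struct ParallelScanBatch
{
    BlockNumber next;       /* next block to hand to this worker */
    BlockNumber end;        /* one past the last block in the batch */
} ParallelScanBatch;

#define PARALLEL_SCAN_BATCH_SIZE 10

static BlockNumber
heap_parallelscan_nextpage_batched(ParallelHeapScanDesc parallel_scan,
                                   ParallelScanBatch *batch)
{
    /* Fast path: serve from the worker-local batch, no shared state. */
    if (batch->next < batch->end)
        return batch->next++;

    /* Batch used up: grab another chunk of pages under the mutex. */
    SpinLockAcquire(&parallel_scan->phs_mutex);
    if (parallel_scan->phs_cblock >= parallel_scan->phs_nblocks)
    {
        SpinLockRelease(&parallel_scan->phs_mutex);
        return InvalidBlockNumber;              /* scan is finished */
    }
    batch->next = parallel_scan->phs_cblock;
    batch->end = Min(batch->next + PARALLEL_SCAN_BATCH_SIZE,
                     parallel_scan->phs_nblocks);
    parallel_scan->phs_cblock = batch->end;
    SpinLockRelease(&parallel_scan->phs_mutex);

    return batch->next++;
}

The point of the exercise is that each worker touches the shared mutex once per PARALLEL_SCAN_BATCH_SIZE pages rather than once per page, which is where a speedup like the one above would come from.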
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 13:37, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally? I'd be very interested to know where
> the different contention points are. Possibilities:

I posted mine on [1], although the post does not go into much detail about the contention points. I only really briefly mention it at the end.

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark/#comment-248273

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
Hi,

On 2017-05-05 14:20:48 +1200, David Rowley wrote:
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back, and I discovered that on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.

Oh, if it's "just" that, it should be easy enough to address. Two approaches:
1) use atomic ops for increment, modulo afterwards to deal with
   wraparound in the synchronous scan
2) batching

> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.

Right.

> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

Neat!

> I'd had thoughts that the 10 pages wouldn't be constant, but that the
> batching size would depend on the size of the relation to be scanned.
> I had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.

I wonder how much doing the atomic ops approach alone can help; that doesn't have the issue that the work might be unevenly distributed between pages.

> Anyway, I don't want to hijack this thread with discussions on this.
> I just wanted to mark that I plan to work on this in order to avoid
> any repeat developments or analysis. I'll probably start a new thread
> for this sometime nearer PG11's dev cycle.

Cool. I think it might make sense to post about this soon, just to give it some more visibility and reduce the potential for duplication.

- andres
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
> Multiple people (including David Rowley
> as well as folks here at EnterpriseDB) have demonstrated that for
> certain queries, we can actually use a lot more workers and everything
> works great. The problem is that for other queries, using a lot of
> workers works terribly. The planner doesn't know how to figure out
> which it'll be - and honestly, I don't either.

For me, it seems pretty much related to the number of tuples processed on a worker vs. how many they return. As a general rule, I'd say the higher this ratio, the higher the efficiency ratio will be for the worker, although that's not taking into account contention points where workers must wait for fellow workers to complete some operation. I think parallel_tuple_cost is a good GUC to have; perhaps we can be smarter about the use of it when deciding on how many workers should be used.

By efficiency, I mean that if a query takes 10 seconds in a normal serial plan, and with 1 worker added it takes 5 seconds, it would be 100% efficient to use another worker. I charted this in [1]. It would have been interesting to chart the same in a query that returned a larger number of groups, but I ran out of time. I think it pretty much goes, without testing, that more groups == less efficiency, which'll be due to more overhead in parallel tuple communication, and more work to do in the serial portion of the plan.

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
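For a concrete reading of the efficiency measure described above, here is a small self-contained helper (purely illustrative; not from any patch). Efficiency is the observed speedup divided by the ideal linear speedup, so the 10-second serial query that drops to 5 seconds with one worker added (two processes in total) comes out at exactly 100%.

#include <stdio.h>

/*
 * Parallel efficiency as described above: observed speedup divided by
 * the ideal (linear) speedup.  1.0 means perfectly efficient scaling.
 */
static double
parallel_efficiency(double serial_secs, double parallel_secs, int nprocs)
{
    double speedup = serial_secs / parallel_secs;

    return speedup / nprocs;
}

int
main(void)
{
    /* 10s serial query drops to 5s with one extra worker (2 processes). */
    printf("%.2f\n", parallel_efficiency(10.0, 5.0, 2));   /* prints 1.00 */
    return 0;
}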
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 14:36, Andres Freund <andres@anarazel.de> wrote:
> I wonder how much doing the atomic ops approach alone can help; that
> doesn't have the issue that the work might be unevenly distributed
> between pages.

I wondered that too, since I thought the barrier for making this change would be lower by doing it that way.

I didn't manage to think of a way to get around wrapping the position back to 0 when synch-scans are involved, i.e.:

parallel_scan->phs_cblock++;
if (parallel_scan->phs_cblock >= scan->rs_nblocks)
    parallel_scan->phs_cblock = 0;

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-05 14:40:43 +1200, David Rowley wrote:
> On 5 May 2017 at 14:36, Andres Freund <andres@anarazel.de> wrote:
> > I wonder how much doing the atomic ops approach alone can help; that
> > doesn't have the issue that the work might be unevenly distributed
> > between pages.
>
> I wondered that too, since I thought the barrier for making this change
> would be lower by doing it that way.
>
> I didn't manage to think of a way to get around wrapping the
> position back to 0 when synch-scans are involved, i.e.:
>
> parallel_scan->phs_cblock++;
> if (parallel_scan->phs_cblock >= scan->rs_nblocks)
>     parallel_scan->phs_cblock = 0;

Increment phs_cblock without checking rs_nblocks, but outside of the lock do a % scan->rs_nblocks, to get the "actual" position. Finish if (phs_cblock - phs_startblock) / scan->rs_nblocks >= 1.

The difficult part seems to be the parallel_scan->phs_startblock computation, but that we probably can do via a read barrier & unlocked check, and then a spinlock & recheck if still uninitialized.

- Andres
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
On 2017-05-04 19:45:33 -0700, Andres Freund wrote:
> Increment phs_cblock without checking rs_nblocks, but outside of the
> lock do a % scan->rs_nblocks, to get the "actual" position. Finish if
> (phs_cblock - phs_startblock) / scan->rs_nblocks >= 1.

Err, as I've been pointed to: It should be s/lock/atomic operation/
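Putting these two messages together, a minimal sketch of the atomic-counter approach might look as follows. This is illustrative only: it assumes phs_cblock has been converted to a pg_atomic_uint64 initialized to phs_startblock, and the phs_startblock initialization race mentioned above is not handled.

static BlockNumber
heap_parallelscan_nextpage_atomic(HeapScanDesc scan,
                                  ParallelHeapScanDesc parallel_scan)
{
    uint64      pos;

    /*
     * Hand out positions with a lock-free fetch-and-add.  phs_cblock is
     * assumed (for this sketch) to be a pg_atomic_uint64 initialized to
     * phs_startblock; it only ever increases and is never reset.
     */
    pos = pg_atomic_fetch_add_u64(&parallel_scan->phs_cblock, 1);

    /* Stop once every block has been handed out exactly once. */
    if (pos >= (uint64) parallel_scan->phs_startblock + scan->rs_nblocks)
        return InvalidBlockNumber;

    /* Wrap the ever-increasing position back onto a real block number. */
    return (BlockNumber) (pos % scan->rs_nblocks);
}

Since the counter only ever increases, no compare-and-swap loop is needed to handle the wraparound; the modulo maps the abstract position back onto a real block number, exactly once per block.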
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Fri, May 5, 2017 at 2:23 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 5 May 2017 at 13:37, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>>> Multiple people (including David Rowley
>>> as well as folks here at EnterpriseDB) have demonstrated that for
>>> certain queries, we can actually use a lot more workers and everything
>>> works great. The problem is that for other queries, using a lot of
>>> workers works terribly. The planner doesn't know how to figure out
>>> which it'll be - and honestly, I don't either.
>>
>> Have those benchmarks, even in a very informal form, been shared /
>> collected / referenced centrally? I'd be very interested to know where
>> the different contention points are. Possibilities:
>
> I posted mine on [1], although the post does not go into much detail
> about the contention points. I only really briefly mention it at the
> end.

Just for fun, check out pages 42 and 43 of Wei Hong's thesis. He worked on Berkeley POSTGRES parallel query and a spin-off called XPRS, and they got linear seq scan scaling up to the number of spindles:

http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

I gather from flicking through the POSTGRES 4.2 sources and this stuff about XPRS that they switched from a "launch N workers!" model to a "generate tasks and schedule them" model somewhere between these systems. Chapters 2 and 3 cover the problem of avoiding excessive parallelism that reduces performance, adjusting dynamically toward maximum throughput. I suspect we're going that way too at some point, and it would certainly fix some problems I ran into with Parallel Shared Hash.

XPRS's cost model included resource consumption, not just 'timerons'. This is something I grappled with when trying to put a price tag on Parallel Shared Hash plans where just one worker builds the hash table while the others wait. I removed that plan from the patch because it became mostly redundant, but when it was there Postgres thought it was the same cost as a plan where every worker hammers your system building the same hash table, whereas XPRS would have considered such a plan ludicrously expensive (depending on his 'w' term, see page 28, which determines whether you care more about resource usage or response time).

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Amit Khandekar
Date:
On 5 May 2017 at 07:50, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> It is of course possible that the Parallel Seq Scan could run into
>> contention problems if the number of workers is large, but in my
>> experience there are bigger problems here. The non-parallel Seq Scan
>> can also contend -- not of course over the shared mutex because there
>> isn't one, but over access to the blocks themselves. Every one of
>> those blocks has a content lock and a buffer header and so on, and
>> having multiple processes accessing those things at the same time
>> scales well, but not perfectly. The Hash node can also contend: if
>> the hash join spills to disk, you've got multiple processes reading
>> and writing to the temp directory at the same time and, of course,
>> that can be worse than just one process doing it -- sometimes much
>> worse. It can also be better, depending on how much I/O gets
>> generated and how much I/O bandwidth you have.
>
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back, and I discovered that on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.
>
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.
>
> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>    count
> ------------
>  1874000000
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

The result is quite impressive.

> So it seems worth looking into. "a" was just a table with a single int
> column. I'm unsure as yet if there are more gains to be had for tables
> with wider tuples. I do suspect the patch will be a bigger win in
> those cases, since there's less work to do for each page, e.g. fewer
> advance aggregate calls, so likely they'll be looking for their next
> page a bit sooner.
>
> Now I'm not going to pretend that this patch is ready for prime
> time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affects read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.
>
> I'd had thoughts that the 10 pages wouldn't be constant, but that the
> batching size would depend on the size of the relation to be scanned.
> I had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.
I was wondering: if we keep increasing the batch size, that might lead to I/O contention. I mean, the higher the batch size, the higher the chance of causing more random I/O, because all workers would be accessing disk blocks far away from each other in parallel. So there might be a trade-off here. (It's another matter that there needs to be I/O contention testing done, in general, for many scenarios.)

I believe there are certain parallel scans (parallel bitmap heap scan?) where the logic to go to the next block consumes time, so more waits consequently.

What if we supply each worker with a sequence of blocks to be scanned, rather than a range of blocks? Each worker would have a list of the next few blocks, say:

w1 : 1, 5, 9, 13
w2 : 2, 6, 10, 14
w3 : 3, 7, 11, 15
w4 : .....

Maybe the leader worker would do the accounting and store the instructions for each of the workers at individual locations in shared memory, so there won't be any contention while accessing them.

This may be simple/applicable for a sequential scan, but not for other scans, for some of which this may not even be possible. But basically I was thinking of a way around to tackle shared memory contention as well as random I/O.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
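A minimal sketch of the block-striding idea above (illustrative only: worker_number and nworkers would come from shared memory, and synchronized-scan start offsets are ignored). Each worker derives its next block purely from worker-local state, so no shared counter is touched at all:

/*
 * Block-striding: worker i visits blocks i, i + nworkers, i + 2*nworkers,
 * and so on.  All state is worker-local, so no shared counter or lock is
 * needed; the cost is that each worker's reads are strided rather than
 * strictly sequential.
 */
static BlockNumber
strided_nextpage(BlockNumber nblocks, int worker_number, int nworkers,
                 uint64 *local_counter)
{
    uint64      candidate;

    candidate = (uint64) worker_number + *local_counter * (uint64) nworkers;
    if (candidate >= (uint64) nblocks)
        return InvalidBlockNumber;      /* this worker has no blocks left */

    (*local_counter)++;
    return (BlockNumber) candidate;
}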
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Amit Kapila
Date:
On Fri, May 5, 2017 at 7:07 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The main thing that keeps this from being a crippling issue right now
>> is the fact that we tend not to use that many parallel workers in the
>> first place. We're trying to scale a query that would otherwise use 1
>> process out to 3 or 5 processes, and so the contention effects, in
>> many cases, are not too bad. Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally?

The numbers have been posted on the parallel seq scan thread and were more formally shared in a PGCon presentation ([1], refer slide 15).

> I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too many concurrent IOs submitted, leading
>   to excessive random IO
> - internal contention in the parallel node, e.g. parallel seqscan

I think one of the points of scaling/contention is tuple communication. This is what shows up in perf profiles, and we (one of my colleagues is working on it) are already working on some ways to improve the same, but I don't think we can get anywhere near linear scaling by improving that alone.

[1] - https://www.pgcon.org/2015/schedule/events/785.en.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 9:37 PM, Andres Freund <andres@anarazel.de> wrote:
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally? I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too many concurrent IOs submitted, leading
>   to excessive random IO
> - internal contention in the parallel node, e.g. parallel seqscan
> - contention on PG components like buffer mapping, procarray, clog
> - contention on individual buffers, e.g. btree root pages, or small
>   tables in nestloop joins
> - just too many context switches, due to ineffective parallelization
>
> Probably multiple of those are a problem, but without trying to figure
> them out, we're going to have a hard time developing a better costing
> model...

It's pretty easy (but IMHO not very interesting) to measure internal contention in the Parallel Seq Scan node. As David points out downthread, that problem isn't trivial to fix, but it's not that hard, either.

I do believe that there is a problem with too much concurrent I/O on things like:

Gather
-> Parallel Seq Scan on lineitem
-> Hash Join
   -> Seq Scan on lineitem

If that goes to multiple batches, you're probably wrenching the disk head all over the place - multiple processes are now reading and writing batch files at exactly the same time.

I also strongly suspect that contention on individual buffers can turn into a problem on queries like this:

Gather (Merge)
-> Merge Join
   -> Parallel Index Scan
   -> Index Scan

The parallel index scan surely has some upper limit on concurrency, but if that is exceeded, what will tend to happen is that processes will sleep. On the inner side, though, every process is doing a full index scan, and chances are good that they are doing it more or less in lock step, hitting the same buffers at the same time.

Also consider:

Finalize HashAggregate
-> Gather
   -> Partial HashAggregate
      -> Parallel Seq Scan

Suppose that the average group contains ten items which will tend to be widely spaced across the table. As you add workers, the number of workers across which any given group gets spread increases. There's probably a "sweet spot" here. Up to a certain point, adding workers makes it faster, because the work of the Seq Scan and the Partial HashAggregate gets spread across more processes. However, adding workers also increases the number of rows that pass through the Gather node, because as you add workers more groups end up being split across workers, or across more workers. That means more and more of the aggregation starts happening in the Finalize HashAggregate rather than the Partial HashAggregate. If you had, for example, 20 workers here, almost nothing would be happening in the Partial HashAggregate, because chances are good that each of the 10 rows in each group would be encountered by a different worker, so that'd probably be counterproductive.

I think there are two separate questions here:

1. How do we reduce or eliminate contention during parallel query execution?
2. How do we make the query planner smarter about picking the optimal number of workers?

I think the second problem is both more difficult and more interesting. I think that no matter how much work we do on #1, there are always going to be cases where the amount of effective parallelism has some finite limit - and I think the limit will vary substantially from query to query.
So, without #2, we'll either leave a lot of money on the table for queries that can benefit from using a large number of workers, or we'll throw extra workers uselessly at queries where they don't help (or even make things worse).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
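A back-of-the-envelope model of the group-splitting effect described above, assuming the rows of each group land on workers uniformly at random (purely illustrative; the function names are made up). Each worker emits one partial-aggregate row per group it sees at least once, so for 10-row groups, 2 workers push about 2 rows per group through Gather, while 20 workers push about 8 -- at which point nearly all of the aggregation has moved into the Finalize step:

#include <math.h>
#include <stdio.h>

/*
 * Expected rows emitted through Gather for g groups of k rows spread
 * over w workers, with uniform random row placement: each worker emits
 * one partial-aggregate row per group it sees at least one row of, so
 * E[rows] = g * w * (1 - (1 - 1/w)^k).
 */
static double
expected_gather_rows(double g, double k, double w)
{
    return g * w * (1.0 - pow(1.0 - 1.0 / w, k));
}

int
main(void)
{
    /* 10-row groups: 2 workers split each group into ~2 partial rows... */
    printf("w=2:  %.2f rows/group\n", expected_gather_rows(1, 10, 2));
    /* ...while 20 workers split it into ~8, starving the Partial step. */
    printf("w=20: %.2f rows/group\n", expected_gather_rows(1, 10, 20));
    return 0;
}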
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 10:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Now I'm not going to pretend that this patch is ready for prime
> time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affects read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.

This kind of approach has also been advocated within EnterpriseDB, and I immediately thought of the read-ahead problem. I think we need more research into how Parallel Seq Scan interacts with OS readahead behavior on various operating systems. It seems possible that Parallel Seq Scan frustrates operating system read-ahead even without this change on at least some systems (because maybe they can only detect ascending block number requests within a single process), and even more possible that you run into problems when the block number requests are no longer precisely in order (which, at present, they should be, or very close to it). If it turns out to be a problem, either currently or with this patch, we might need to add explicit prefetching logic to Parallel Seq Scan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
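For what it's worth, a sketch of what "explicit prefetching logic" could amount to at the lowest level, assuming batched page allocation (the function name is illustrative; in PostgreSQL proper this would go through PrefetchBuffer() rather than a raw file descriptor):

#include <fcntl.h>

/*
 * Hypothetical sketch: before a worker starts on a batch of pages, ask
 * the OS to read the whole batch ahead of time, so the scan no longer
 * depends on per-process readahead heuristics detecting a sequential
 * pattern.
 */
static void
prefetch_batch(int fd, off_t start_offset, off_t batch_bytes)
{
#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(fd, start_offset, batch_bytes,
                         POSIX_FADV_WILLNEED);
#endif
}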
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Thu, May 4, 2017 at 10:36 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 3 May 2017 at 07:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great. The problem is that for other queries, using a lot of
>> workers works terribly. The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> For me, it seems pretty much related to the number of tuples processed
> on a worker vs. how many they return. As a general rule, I'd say the
> higher this ratio, the higher the efficiency ratio will be for the
> worker, although that's not taking into account contention points
> where workers must wait for fellow workers to complete some operation.
> I think parallel_tuple_cost is a good GUC to have; perhaps we can be
> smarter about the use of it when deciding on how many workers should
> be used.

It does seem pretty clear that Gather is horrifyingly slow and therefore often the gating factor, but I think it's also clear that even if you removed the overhead of Gather completely, there will always be *something* that limits scaling -- and I bet we're quite a long way from the point where that thing is typically the amount of hardware that you have.

One idea that crossed my mind is to just have workers write all of their output tuples to a temp file and have the leader read them back in. At some cost in I/O, this would completely eliminate the overhead of workers waiting for the leader. In some cases, it might be worth it. At the least, it could be interesting to try a prototype implementation of this with different queries (TPC-H, maybe) and see what happens. It would give us some idea how much of a problem stalling on the leader is in practice. Wait event monitoring could possibly also be used to figure out an answer to that question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Andres Freund
Date:
Hi,

On 2017-05-05 15:29:40 -0400, Robert Haas wrote:
> On Thu, May 4, 2017 at 9:37 PM, Andres Freund <andres@anarazel.de> wrote:
> It's pretty easy (but IMHO not very interesting) to measure internal
> contention in the Parallel Seq Scan node. As David points out
> downthread, that problem isn't trivial to fix, but it's not that hard,
> either.

Well, I think it's important that we do some basic optimization before we start building a more elaborate costing model, because we'll otherwise codify assumptions that won't be true for long. Changing an established costing model once it's out there is really hard, because you always will hurt someone. I think some of the contention is easy enough to remove, and some of the IO concurrency issues

> I do believe that there is a problem with too much concurrent
> I/O on things like:
>
> Gather
> -> Parallel Seq Scan on lineitem
> -> Hash Join
>    -> Seq Scan on lineitem
>
> If that goes to multiple batches, you're probably wrenching the disk
> head all over the place - multiple processes are now reading and
> writing batch files at exactly the same time.

At least on rotating media. On decent SSDs you usually can have so many concurrent IOs out there that it's not actually easy to overwhelm a disk that way. While we obviously still need to pay some attention to spinning disks, I also think that being good on decent (not great) SSDs is more important. We shouldn't optimize for things to run most performant on hydra with its slow storage system ;)

We probably should take something like effective_io_concurrency into account, but that'd require a smarter default than we currently have. It's not that hard to estimate an upper bound of parallelism with a meaningful effective_io_concurrency - we'd probably have to move the effective_io_concurrency handling in bitmapscans to be a plan parameter though, so it's divided by the number of workers.

> I also strongly suspect
> that contention on individual buffers can turn into a problem on
> queries like this:
>
> Gather (Merge)
> -> Merge Join
>    -> Parallel Index Scan
>    -> Index Scan
>
> The parallel index scan surely has some upper limit on concurrency,
> but if that is exceeded, what will tend to happen is that processes
> will sleep. On the inner side, though, every process is doing a full
> index scan, and chances are good that they are doing it more or less in
> lock step, hitting the same buffers at the same time.

Hm. Not sure how big that problem is practically. But I wonder if it's worthwhile to sleep more if you hit contention or something like that...

> I think there are two separate questions here:
>
> 1. How do we reduce or eliminate contention during parallel query execution?
> 2. How do we make the query planner smarter about picking the optimal
>    number of workers?
>
> I think the second problem is both more difficult and more
> interesting. I think that no matter how much work we do on #1, there
> are always going to be cases where the amount of effective parallelism
> has some finite limit - and I think the limit will vary substantially
> from query to query. So, without #2, we'll either leave a lot of
> money on the table for queries that can benefit from using a large
> number of workers, or we'll throw extra workers uselessly at queries
> where they don't help (or even make things worse).

Yea, I agree that 2) is more important in the long run, but I do think that my point that we shouldn't put too much effort into modelling concurrency before doing basic optimization still stands.
Greetings,

Andres Freund
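A trivial sketch of the effective_io_concurrency point above, i.e. treating the GUC as a total budget divided among the processes participating in a scan (hypothetical; no such helper exists in the tree, and the names are invented):

/*
 * Split the effective_io_concurrency budget across the processes
 * participating in a parallel scan, so the total number of concurrently
 * submitted IOs stays roughly what the GUC asks for.
 */
static int
per_worker_io_concurrency(int effective_io_concurrency, int nparticipants)
{
    int     budget = effective_io_concurrency / nparticipants;

    return (budget < 1) ? 1 : budget;
}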
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Peter Geoghegan
Date:
On Fri, May 5, 2017 at 12:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> One idea that crossed my mind is to just have workers write all of
> their output tuples to a temp file and have the leader read them back
> in. At some cost in I/O, this would completely eliminate the overhead
> of workers waiting for the leader. In some cases, it might be worth
> it. At the least, it could be interesting to try a prototype
> implementation of this with different queries (TPC-H, maybe) and see
> what happens. It would give us some idea how much of a problem
> stalling on the leader is in practice. Wait event monitoring could
> possibly also be used to figure out an answer to that question.

The use of temp files in all cases was effective in my parallel external sort patch, relative to what I imagine an approach built on a Gather node would get you, but not because of the inherent slowness of a Gather node. I'm not so sure that Gather is actually inherently slow, given the interface it supports.

While incremental, retail processing of each tuple is flexible and composable, it will tend to be slow compared to an approach based on batch processing (for tasks where you happen to be able to get away with batch processing). This is true for all the usual reasons -- better locality of access, better branch prediction properties, lower "effective instruction count" due to having very tight inner loops, and so on.

I agree with Andres that we shouldn't put too much effort into modelling concurrency ahead of optimizing serial performance. The machine's *aggregate* memory bandwidth should be used as efficiently as possible, and parallelism is just one (very important) tool for making that happen.

--
Peter Geoghegan
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Sat, May 6, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 4, 2017 at 10:20 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Now I'm not going to pretend that this patch is ready for prime
>> time. I've not yet worked out how to properly report sync-scan
>> locations without risking reporting later pages after reporting the
>> end of the scan. What I have at the moment could cause a report to be
>> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
>> size. I'm also not sure how batching like this affects read-aheads, but
>> at least the numbers above speak for something. Although none of the
>> pages in this case came from disk.
>
> This kind of approach has also been advocated within EnterpriseDB, and
> I immediately thought of the read-ahead problem. I think we need more
> research into how Parallel Seq Scan interacts with OS readahead
> behavior on various operating systems. It seems possible that Parallel
> Seq Scan frustrates operating system read-ahead even without this
> change on at least some systems (because maybe they can only detect
> ascending block number requests within a single process), and even more
> possible that you run into problems when the block number requests are
> no longer precisely in order (which, at present, they should be, or
> very close to it). If it turns out to be a problem, either currently or
> with this patch, we might need to add explicit prefetching logic to
> Parallel Seq Scan.

I don't know much about this stuff, but I was curious to go looking at source code. I hope someone will correct me if I'm wrong, but here's what I could glean:

In Linux, each process that opens a file gets its own 'file' object[1][5]. Each of those has its own 'file_ra_state' object[2][3], used by ondemand_readahead[4] for sequential read detection. So I speculate that page-at-a-time parallel seq scan must look like random access to Linux.

In FreeBSD the situation looks similar. Each process that opens a file gets a 'file' object[8] which has members 'f_seqcount' and 'f_nextoff'[6]. These are used by the 'sequential_heuristics' function[7] which affects the ioflag which UFS/FFS uses to control read ahead (see ffs_read). So I speculate that page-at-a-time parallel seq scan must look like random access to FreeBSD too.

In both cases I suspect that if you'd inherited the file descriptor (or sent it to the other process via obscure tricks), it would actually work because they'd have the same 'file' entry, but that's clearly not workable for md.c. Experimentation required...

[1] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L837
[2] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L858
[3] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L817
[4] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/mm/readahead.c#L376
[5] http://www.makelinux.net/ldd3/chp-3-sect-3 "There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure."
[6] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/sys/file.h#L180
[7] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/kern/vfs_vnops.c#L477
[8] Page 319 of 'Design and Implementation of the FreeBSD Operating System' 2nd Edition

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 5 May 2017 at 14:54, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Just for fun, check out pages 42 and 43 of Wei Hong's thesis. He
> worked on Berkeley POSTGRES parallel query and a spin-off called XPRS,
> and they got linear seq scan scaling up to the number of spindles:
>
> http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

That's interesting. I'd no idea that work was done. Actually, I didn't really know that anyone had thought to have more than one processor back then :-)

And I also now know the origins of the tenk1 table in the regression database. Those 10,000 rows were once used for benchmarking! I'm glad we're all using a couple more rows these days.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
David Rowley
Date:
On 6 May 2017 at 13:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> In Linux, each process that opens a file gets its own 'file'
> object[1][5]. Each of those has its own 'file_ra_state'
> object[2][3], used by ondemand_readahead[4] for sequential read
> detection. So I speculate that page-at-a-time parallel seq scan must
> look like random access to Linux.
>
> In FreeBSD the situation looks similar. Each process that opens a
> file gets a 'file' object[8] which has members 'f_seqcount' and
> 'f_nextoff'[6]. These are used by the 'sequential_heuristics'
> function[7] which affects the ioflag which UFS/FFS uses to control
> read ahead (see ffs_read). So I speculate that page-at-a-time
> parallel seq scan must look like random access to FreeBSD too.
>
> In both cases I suspect that if you'd inherited the file descriptor
> (or sent it to the other process via obscure tricks), it would actually
> work because they'd have the same 'file' entry, but that's clearly not
> workable for md.c.

Interesting!

> Experimentation required...

Indeed. I do remember long discussions on this before Parallel Seq Scan went in, but I don't recall if anyone checked any OS kernels to see what they did.

We really need a machine with good IO concurrency, and not too much RAM, to test these things out. It could well be that for a suitably large table we'd want to scan a whole 1GB extent per worker.

I did post a patch to have heap_parallelscan_nextpage() use atomics instead of locking over in [1], but I think doing atomics there does not rule out also adding batching later. In fact, I think it structures things so batching would be easier than it is today.

[1] https://www.postgresql.org/message-id/CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy=B=bhqgaKn7cYzm-w@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Thomas Munro
Date:
On Mon, May 8, 2017 at 1:39 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 6 May 2017 at 13:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Experimentation required...
>
> Indeed. I do remember long discussions on this before Parallel Seq
> Scan went in, but I don't recall if anyone checked any OS kernels to
> see what they did.
>
> We really need a machine with good IO concurrency, and not too much
> RAM, to test these things out. It could well be that for a suitably
> large table we'd want to scan a whole 1GB extent per worker.

I did a bunch of simple experiments this morning to try to observe RA effects, using a couple of different EDB machines running Linux. I wrote a simple program to read large files sequentially using lseek + read, but rotate the reads over N file descriptors to simulate parallel workers. I was surprised to find that I couldn't change cache-cold read performance that way, up to very large numbers of N. I did manage to break it by introducing some artificial disorder, reversing/scrambling the read order of small groups of blocks, but even that required groups of over about 16 blocks before performance started to drop (possibly related to the window size, which I can't see due to permissions right now). I've also learned that RAID cards sometimes do read-ahead of their own, making matters more complicated. I hope to report more when I figure out all the moving parts...

> I did post a patch to have heap_parallelscan_nextpage() use atomics
> instead of locking over in [1], but I think doing atomics there does
> not rule out also adding batching later. In fact, I think it
> structures things so batching would be easier than it is today.

+1

--
Thomas Munro
http://www.enterprisedb.com
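A reconstruction of the kind of test program described above (not Thomas's actual code; NFDS and the 8kB block size are assumptions). The file is read perfectly sequentially overall, but consecutive reads rotate over N descriptors, so each descriptor sees only every Nth block:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ  8192    /* assumed PostgreSQL-style block size */
#define NFDS    8       /* simulated number of parallel workers */

int
main(int argc, char **argv)
{
    int     fds[NFDS];
    char    buf[BLCKSZ];
    off_t   off = 0;
    int     i;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* Open the same file once per simulated worker. */
    for (i = 0; i < NFDS; i++)
    {
        if ((fds[i] = open(argv[1], O_RDONLY)) < 0)
        {
            perror("open");
            return 1;
        }
    }

    /*
     * The access pattern is perfectly sequential overall, but each
     * descriptor only ever sees every NFDS-th block, which per-fd
     * readahead heuristics may treat as random access.
     */
    for (i = 0;; i++, off += BLCKSZ)
    {
        int     fd = fds[i % NFDS];

        if (lseek(fd, off, SEEK_SET) < 0)
            break;
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;
    }
    return 0;
}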
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Haribabu Kommi
Date:
On Mon, May 8, 2017 at 11:39 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> We really need a machine with good IO concurrency, and not too much
> RAM, to test these things out. It could well be that for a suitably
> large table we'd want to scan a whole 1GB extent per worker.
>
> I did post a patch to have heap_parallelscan_nextpage() use atomics
> instead of locking over in [1], but I think doing atomics there does
> not rule out also adding batching later. In fact, I think it
> structures things so batching would be easier than it is today.
As part of our internal PostgreSQL project, we developed parallel seq
scan with batch mode only. The problem that we faced with batch mode
is making sure that all the parallel workers finish at almost the same
time, with a proper distribution of data pages. Otherwise, it may lead
to a situation where one worker alone is still processing the last
batch while all the others have already finished their jobs. In these
cases, we cannot achieve good performance. Whereas with the current
approach, the most extra time the last worker will spend is that
needed to scan the last single page of the table.

If we go with batching of 1GB per worker, there is a chance that the
data that satisfies the query condition falls into only one extent;
in such cases also the batching may not yield good results.
Regards,
Hari Babu
Fujitsu Australia
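One textbook answer to the straggler problem described above is guided self-scheduling, where the batch size shrinks as the scan nears its end. This is not from any posted patch -- just a sketch of the idea with invented names:

#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's typedef */

/*
 * Guided self-scheduling: each request hands out a batch proportional
 * to the number of still-unassigned blocks, so batches start large
 * (cheap on the shared counter, friendly to readahead) and shrink
 * toward a single page near the end of the scan.
 */
static BlockNumber
next_batch_size(BlockNumber remaining, int nworkers, BlockNumber max_batch)
{
    BlockNumber batch = remaining / (BlockNumber) (2 * nworkers);

    if (batch < 1)
        batch = 1;              /* caller must check remaining > 0 first */
    if (batch > max_batch)
        batch = max_batch;
    return batch;
}

Early batches are large enough to keep the shared counter cool and the I/O sequential, while the final batches degrade toward the current page-at-a-time behaviour, bounding how long the other workers can be left waiting on a slow straggler.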
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Peter Geoghegan
Date:
On Thu, May 4, 2017 at 7:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.

I think that this could benefit parallel sort, beyond the obvious fact that it too must have the contention you describe.

We generally are faster at sorting presorted input for all kinds of reasons (e.g., insertion sort fallback for quicksort, merging based on replacement of heap's root item). It follows that it's to our advantage to have parallel tuplesort read multiple pages in a range into a worker at once within the parallel heap scan that feeds workers. The leader, which merges worker runs, may ultimately have to perform fewer comparisons as a result of this, which is where most of the benefit would be.

--
Peter Geoghegan
Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)
From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 6:17 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, May 4, 2017 at 7:20 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I ended up writing the attached (which I'd not intended to post until
>> some time closer to when the doors open for PG11). At the moment it's
>> basically just a test patch to see how it affects things when we give
>> workers a bit more to do before they come back to look for more work.
>> In this case, I've just given them 10 pages to work on, instead of the
>> 1 that's allocated in 9.6 and v10.
>
> I think that this could benefit parallel sort, beyond the obvious fact
> that it too must have the contention you describe.
>
> We generally are faster at sorting presorted input for all kinds of
> reasons (e.g., insertion sort fallback for quicksort, merging based on
> replacement of heap's root item). It follows that it's to our
> advantage to have parallel tuplesort read multiple pages in a range
> into a worker at once within the parallel heap scan that feeds
> workers. The leader, which merges worker runs, may ultimately have to
> perform fewer comparisons as a result of this, which is where most of
> the benefit would be.

On the other hand, it could hurt Gather Merge for essentially symmetric reasons - Gather Merge works best if all the tuples are in roughly the same range of values. Otherwise the work isn't equally distributed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company