Thread: Parallel query execution introduces performance regressions
We noticed that the following SQL query runs 3 times slower on the latest version of PostgreSQL due to the “parallel query execution” feature. Several other queries are affected by the same feature. We are planning to report the discovered regressions in an academic paper. We would like to get your feedback before doing so.
Here are the execution times for this query on an older (v9.5.16) and a newer (v11.2) version of PostgreSQL, in milliseconds, across TPC-C scale factors:
+---------+--------+---------+---------+----------+
|         | scale1 | scale10 | scale50 | scale300 |
+---------+--------+---------+---------+----------+
| v9.5.16 |     88 |     937 |    4721 |    27241 |
| v11.2   |    288 |    2822 |   13838 |    85081 |
+---------+--------+---------+---------+----------+
We have shared the following details below:
1) the associated query,
2) the commit that activated it,
3) our high-level analysis,
4) query execution plans in old and new versions of PostgreSQL, and
5) information on reproducing these regressions.
### QUERY
select
    ref_0.ol_delivery_d as c1
from
    public.order_line as ref_0
where EXISTS (
    select
        ref_1.i_im_id as c0
    from
        public.item as ref_1
    where ref_0.ol_d_id <= ref_1.i_im_id
)
### COMMIT
77cd477 (Enable parallel query by default.)
We found several other queries exhibiting regressions related to this commit.
### HIGH-LEVEL ANALYSIS
We believe that this regression is caused by parallel query execution being enabled by default. Surprisingly, we found that even on a larger TPC-C database (scale factor 50, roughly 4 GB in size), the parallel plan chosen by the new version is still slower than the non-parallel plan used by the old version. A quick way to verify this is sketched below.
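To isolate the effect of parallelism from any other planner change between the two versions, parallel plans can be disabled per session and the query re-run. A minimal sketch (max_parallel_workers_per_gather is the stock setting governing this in v11; as we understand it, the commit above is what raised the relevant default from 0 to 2):

  -- session-local; does not affect other connections
  SET max_parallel_workers_per_gather = 0;  -- force the serial plan, as in v9.5.16
  EXPLAIN (ANALYZE)
  select ref_0.ol_delivery_d as c1
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);
  RESET max_parallel_workers_per_gather;    -- back to the v11.2 default of 2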
### QUERY EXECUTION PLANS
[OLD version]
Nested Loop Semi Join  (cost=0.00..90020417940.08 rows=30005835 width=8) (actual time=0.034..24981.895 rows=90017507 loops=1)
  Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
  ->  Seq Scan on order_line ref_0  (cost=0.00..2011503.04 rows=90017504 width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
  ->  Materialize  (cost=0.00..2771.00 rows=100000 width=4) (actual time=0.000..0.000 rows=1 loops=90017507)
        ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000 width=4) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.290 ms
Execution time: 27241.239 ms
[NEW version]
Gather  (cost=1000.00..88047487498.82 rows=30005835 width=8) (actual time=0.265..82355.289 rows=90017507 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Nested Loop Semi Join  (cost=0.00..88044485915.32 rows=12502431 width=8) (actual time=0.033..68529.259 rows=30005836 loops=3)
        Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
        ->  Parallel Seq Scan on order_line ref_0  (cost=0.00..1486400.93 rows=37507293 width=12) (actual time=0.023..2789.901 rows=30005836 loops=3)
        ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000 width=4) (actual time=0.001..0.001 rows=1 loops=90017507)
Planning Time: 0.319 ms
Execution Time: 85081.158 ms
### REPRODUCING REGRESSION
* The queries can be downloaded here:
* You can reproduce these results by using the setup described in:
Best regards,
Jinho Jung
Hi,

On 2019-04-01 14:30:13 -0400, Jinho Jung wrote:
> We noticed that the following SQL query runs 3 times slower on the latest
> version of PostgreSQL due to the “parallel query execution” feature.
> Several other queries are affected by the same feature. We are planning to
> report the discovered regressions in an academic paper. We would like to
> get your feedback before doing so.

Didn't you post exactly this multiple times before? E.g.

https://www.postgresql.org/message-id/CAF6pyKfsYkkwk3-uCzcxXQTvY0SRp3SF%3DdFmwJXnP8vzVpXNjw%40mail.gmail.com
https://www.postgresql.org/message-id/CAF6pyKfRchfdLSvFOBJ4U96AhEFsHbm-gGcrQrjXdqPC4q%3DHpg%40mail.gmail.com

Please don't just start new threads because you didn't get the amount of
attention you wanted. Nor do I think this is a bug, so this would be the
wrong list in isolation, too.

Greetings,

Andres Freund
On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun@usc.edu> wrote:
> Surprisingly, we found that even on a larger TPC-C database (scale factor
> of 50, roughly 4GB of size), parallel scan is still slower than the
> non-parallel execution plan in the old version.

That's not a large database, and it's certainly not a large TPC-C
database. If you attempt to stay under the spec's maximum throughput per
warehouse, which is 12.86 tpmC, then you'll need several thousand
warehouses on modern hardware. We're talking several hundred gigabytes.
Otherwise, as far as the spec is concerned, you're testing an unrealistic
workload: individual customers making many more purchases than is humanly
possible, and hypothetical warehouse employees entering data into their
terminals at a rate that is likewise not humanly possible.

More importantly, this kind of analysis seems much too simplistic to be
useful. *Any* change to the optimizer or optimizer settings is certain to
regress some queries. We expect users that are very sensitive to small
regressions to take an active interest in performance tuning their
database. It would certainly be very useful if somebody came up with a
less complicated way of assessing these questions, but that seems to be
elusive.

--
Peter Geoghegan
Hi,

On 2019-04-01 11:52:54 -0700, Peter Geoghegan wrote:
> On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun@usc.edu> wrote:
> > Surprisingly, we found that even on a larger TPC-C database (scale
> > factor of 50, roughly 4GB of size), parallel scan is still slower than
> > the non-parallel execution plan in the old version.
>
> That's not a large database, and it's certainly not a large TPC-C
> database. [...]

I don't think that's really the problem here. It's that there's a
fundamental misestimation in the query:

> [OLD version]
> Nested Loop Semi Join  (cost=0.00..90020417940.08 rows=30005835 width=8)
>   (actual time=0.034..24981.895 rows=90017507 loops=1)
>   Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
>   ->  Seq Scan on order_line ref_0  (cost=0.00..2011503.04 rows=90017504
>       width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
>   ->  Materialize  (cost=0.00..2771.00 rows=100000 width=4) (actual
>       time=0.000..0.000 rows=1 loops=90017507)
>         ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000
>             width=4) (actual time=0.006..0.006 rows=1 loops=1)

Note the estimated rows=100000 vs. the actual rows=1 in the seq scan /
materialize. That's what makes the planner think this is much more
expensive than it is, which in turn triggers the use of a parallel scan.

Greetings,

Andres Freund
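For context on why the actual row count is 1: this is a semi join, so the inner scan stops at the first item row satisfying ol_d_id <= i_im_id, and in a stock TPC-C database the district id is tiny compared to the item image id, so the first inner row almost always qualifies. A quick sanity check, sketched under the assumption of standard TPC-C column ranges:

  SELECT min(ol_d_id), max(ol_d_id) FROM public.order_line;  -- districts: 1 .. 10
  SELECT min(i_im_id), max(i_im_id) FROM public.item;        -- image ids: ~1 .. 10000
  -- With ol_d_id <= 10 and nearly every i_im_id >= 10, the EXISTS matches on
  -- roughly the first inner row, so each rescan reads ~1 row, not 100000.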
On Mon, Apr 1, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> note the estimated rows=100000 vs the actual rows=1 in the seqscan /
> materialize. That's what makes the planner think this is much more
> expensive than it is, which in turn triggers the use of a parallel scan.

Yeah, I just noticed that. The sequential scan on the inner side of the
nestloop join is a problem.

More generally, as somebody familiar with the TPC-C schema, I cannot make
sense of the query itself. Why would anybody want to join "Image ID
associated to Item" from the item table to the district column of the
order_line table? It simply makes no sense.

--
Peter Geoghegan
Sorry for the duplicated thread here and thanks for the reply.
I have another question regarding parallel execution. We noticed that PostgreSQL sometimes launches only one worker. Is that also an intended behavior?
Best regards,
Jinho Jung
On Mon, Apr 1, 2019 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> [full quote of Andres's message above snipped]
Hi,

Please don't top-quote on PostgreSQL mailing lists.

On 2019-04-01 15:08:59 -0400, Jinho Jung wrote:
> Sorry for the duplicated thread here and thanks for the reply.
>
> I have another question regarding parallel execution. We noticed that
> PostgreSQL sometimes launches only one worker. Is that also an intended
> behavior?

Note that even a single worker is a form of parallelism - the "main"
backend also participates in query execution. So yes, that can be
reasonable behaviour. Also, it might be that the query finished faster
than a worker could be launched.

- Andres
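This is visible directly in v11: the leader's participation is controlled by a setting, and per-node loop counts include every participating process. A sketch (parallel_leader_participation is the stock v11 GUC; order_line is the table from the workload above):

  SHOW parallel_leader_participation;  -- 'on' by default in v11
  EXPLAIN (ANALYZE, VERBOSE) SELECT count(*) FROM public.order_line;
  -- With "Workers Launched: 1", the Parallel Seq Scan still shows loops=2:
  -- the single worker plus the participating leader backend.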
On Tue, 2 Apr 2019 at 08:00, Andres Freund <andres@anarazel.de> wrote:
> I don't think that's really the problem here. It's that there's a
> fundamental misestimation in the query:
>
> [quoted plan snipped]
>
> note the estimated rows=100000 vs the actual rows=1 in the seqscan /
> materialize. That's what makes the planner think this is much more
> expensive than it is, which in turn triggers the use of a parallel scan.

It's true that there's a bad row estimate here, but short of fixing that,
it seems we could fix it by having try_partial_nestloop_path() check
whether a Materialize node would help save rescanning costs of the inner
plan, and apply one if it does. The planner does think that it's going to
have to perform the inner seq scan 37507293 * 3 times, so materializing
is likely to help.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
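The value of that Materialize node can be eyeballed from SQL alone by planning the serial shape with and without it. A sketch (enable_material and max_parallel_workers_per_gather are stock planner GUCs; exact costs vary by machine, and the planner may pick a different shape when materialization is off):

  SET max_parallel_workers_per_gather = 0;  -- serial plan only
  EXPLAIN select ref_0.ol_delivery_d
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);  -- inner side materialized
  SET enable_material = off;
  EXPLAIN select ref_0.ol_delivery_d
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);  -- bare Seq Scan rescans
  RESET enable_material;
  RESET max_parallel_workers_per_gather;
  -- The second estimate charges the full item Seq Scan (~2271 cost units) per
  -- outer row; the first charges only the far cheaper Materialize rescan,
  -- which is the saving a partial nested loop path currently never considers.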