Thread: Parallel query execution introduces performance regressions
We noticed that the following SQL query runs 3 times slower on the latest version of PostgreSQL due to the “parallel query execution” feature. Several other queries are affected by the same feature. We are planning to report the discovered regressions in an academic paper. We would like to get your feedback before doing so.
Here are the execution times for this query on an older (v9.5.16) and a newer (v11.2) version of PostgreSQL, in milliseconds, across TPC-C scale factors:
+---------+--------+---------+---------+----------+
|         | scale1 | scale10 | scale50 | scale300 |
+---------+--------+---------+---------+----------+
| v9.5.16 |     88 |     937 |    4721 |    27241 |
| v11.2   |    288 |    2822 |   13838 |    85081 |
+---------+--------+---------+---------+----------+
We have shared the following details below:
1) the associated query,
2) the commit that activated it,
3) our high-level analysis,
4) query execution plans in old and new versions of PostgreSQL, and
5) information on reproducing these regressions.
### QUERY
select
    ref_0.ol_delivery_d as c1
from
    public.order_line as ref_0
where EXISTS (
    select
        ref_1.i_im_id as c0
    from
        public.item as ref_1
    where ref_0.ol_d_id <= ref_1.i_im_id
)
### COMMIT
77cd477 (Enable parallel query by default.)
We found several other queries exhibiting regressions related to this commit.
### HIGH-LEVEL ANALYSIS
We believe that this regression is caused by parallel query execution being enabled by default. Surprisingly, we found that even on a larger TPC-C database (scale factor 50, roughly 4 GB in size), the parallel plan chosen by the new version is still slower than the non-parallel plan used by the old version. A quick way to verify this is sketched below.
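To isolate the effect of parallelism from any other planner change between the two versions, parallel plans can be disabled per session and the query re-run. A minimal sketch (max_parallel_workers_per_gather is the stock setting governing this in v11; as we understand it, the commit above is what raised the relevant default from 0 to 2):

  -- session-local; does not affect other connections
  SET max_parallel_workers_per_gather = 0;  -- force the serial plan, as in v9.5.16
  EXPLAIN (ANALYZE)
  select ref_0.ol_delivery_d as c1
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);
  RESET max_parallel_workers_per_gather;    -- back to the v11.2 default of 2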
### QUERY EXECUTION PLANS
[OLD version]
Nested Loop Semi Join  (cost=0.00..90020417940.08 rows=30005835 width=8) (actual time=0.034..24981.895 rows=90017507 loops=1)
  Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
  ->  Seq Scan on order_line ref_0  (cost=0.00..2011503.04 rows=90017504 width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
  ->  Materialize  (cost=0.00..2771.00 rows=100000 width=4) (actual time=0.000..0.000 rows=1 loops=90017507)
        ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000 width=4) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.290 ms
Execution time: 27241.239 ms
[NEW version]
Gather  (cost=1000.00..88047487498.82 rows=30005835 width=8) (actual time=0.265..82355.289 rows=90017507 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Nested Loop Semi Join  (cost=0.00..88044485915.32 rows=12502431 width=8) (actual time=0.033..68529.259 rows=30005836 loops=3)
        Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
        ->  Parallel Seq Scan on order_line ref_0  (cost=0.00..1486400.93 rows=37507293 width=12) (actual time=0.023..2789.901 rows=30005836 loops=3)
        ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000 width=4) (actual time=0.001..0.001 rows=1 loops=90017507)
Planning Time: 0.319 ms
Execution Time: 85081.158 ms
### REPRODUCING REGRESSION
* The queries can be downloaded here:
* You can reproduce these results by using the setup described in:
Best regards,
Jinho Jung
Hi,

On 2019-04-01 14:30:13 -0400, Jinho Jung wrote:
> We noticed that the following SQL query runs 3 times slower on the latest
> version of PostgreSQL due to the “parallel query execution” feature.
> Several other queries are affected by the same feature. We are planning to
> report the discovered regressions in an academic paper. We would like to
> get your feedback before doing so.

Didn't you post exactly this multiple times before? E.g.

https://www.postgresql.org/message-id/CAF6pyKfsYkkwk3-uCzcxXQTvY0SRp3SF%3DdFmwJXnP8vzVpXNjw%40mail.gmail.com
https://www.postgresql.org/message-id/CAF6pyKfRchfdLSvFOBJ4U96AhEFsHbm-gGcrQrjXdqPC4q%3DHpg%40mail.gmail.com

Please don't just start new threads because you didn't get the amount of
attention you wanted. Nor do I think this is a bug, so this would be the
wrong list in isolation, too.

Greetings,

Andres Freund
On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun@usc.edu> wrote:
> Surprisingly, we found that even on a larger TPC-C database (scale factor
> of 50, roughly 4GB of size), parallel scan is still slower than the
> non-parallel execution plan in the old version.

That's not a large database, and it's certainly not a large TPC-C
database. If you attempt to stay under the spec's maximum throughput per
warehouse, which is 12.86 tpmC, then you'll need several thousand
warehouses on modern hardware. We're talking several hundred gigabytes.
Otherwise, as far as the spec is concerned, you're testing an unrealistic
workload: individual customers making many more purchases than is humanly
possible, and hypothetical warehouse employees entering data into their
terminals at a rate that is likewise not humanly possible.

More importantly, this kind of analysis seems much too simplistic to be
useful. *Any* change to the optimizer or optimizer settings is certain to
regress some queries. We expect users that are very sensitive to small
regressions to take an active interest in performance tuning their
database. It would certainly be very useful if somebody came up with a
less complicated way of assessing these questions, but that seems to be
elusive.

--
Peter Geoghegan
Hi,

On 2019-04-01 11:52:54 -0700, Peter Geoghegan wrote:
> On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun@usc.edu> wrote:
> > Surprisingly, we found that even on a larger TPC-C database (scale
> > factor of 50, roughly 4GB of size), parallel scan is still slower than
> > the non-parallel execution plan in the old version.
>
> That's not a large database, and it's certainly not a large TPC-C
> database. [...]

I don't think that's really the problem here. It's that there's a
fundamental misestimation in the query:

> [OLD version]
> Nested Loop Semi Join  (cost=0.00..90020417940.08 rows=30005835 width=8)
>   (actual time=0.034..24981.895 rows=90017507 loops=1)
>   Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
>   ->  Seq Scan on order_line ref_0  (cost=0.00..2011503.04 rows=90017504
>       width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
>   ->  Materialize  (cost=0.00..2771.00 rows=100000 width=4) (actual
>       time=0.000..0.000 rows=1 loops=90017507)
>         ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000
>             width=4) (actual time=0.006..0.006 rows=1 loops=1)

Note the estimated rows=100000 vs. the actual rows=1 in the seq scan /
materialize. That's what makes the planner think this is much more
expensive than it is, which in turn triggers the use of a parallel scan.

Greetings,

Andres Freund
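For context on why the actual row count is 1: this is a semi join, so the inner scan stops at the first item row satisfying ol_d_id <= i_im_id, and in a stock TPC-C database the district id is tiny compared to the item image id, so the first inner row almost always qualifies. A quick sanity check, sketched under the assumption of standard TPC-C column ranges:

  SELECT min(ol_d_id), max(ol_d_id) FROM public.order_line;  -- districts: 1 .. 10
  SELECT min(i_im_id), max(i_im_id) FROM public.item;        -- image ids: ~1 .. 10000
  -- With ol_d_id <= 10 and nearly every i_im_id >= 10, the EXISTS matches on
  -- roughly the first inner row, so each rescan reads ~1 row, not 100000.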
On Mon, Apr 1, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> note the estimated rows=100000 vs the actual rows=1 in the seqscan /
> materialize. That's what makes the planner think this is much more
> expensive than it is, which in turn triggers the use of a parallel scan.

Yeah, I just noticed that. The sequential scan on the inner side of the
nestloop join is a problem.

More generally, as somebody familiar with the TPC-C schema, I cannot make
sense of the query itself. Why would anybody want to join "Image ID
associated to Item" from the item table to the district column of the
order_line table? It simply makes no sense.

--
Peter Geoghegan
Sorry for the duplicated thread here and thanks for the reply.
I have another question regarding parallel execution. We noticed that PostgreSQL sometimes launches only one worker. Is that also an intended behavior?
Best regards,
Jinho Jung
On Mon, Apr 1, 2019 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> [full quote of Andres's message above snipped]
Hi,

Please don't top-quote on PostgreSQL mailing lists.

On 2019-04-01 15:08:59 -0400, Jinho Jung wrote:
> Sorry for the duplicated thread here and thanks for the reply.
>
> I have another question regarding parallel execution. We noticed that
> PostgreSQL sometimes launches only one worker. Is that also an intended
> behavior?

Note that even a single worker is a form of parallelism - the "main"
backend also participates in query execution. So yes, that can be
reasonable behaviour. Also, it might be that the query finished faster
than a worker could be launched.

- Andres
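This is visible directly in v11: the leader's participation is controlled by a setting, and per-node loop counts include every participating process. A sketch (parallel_leader_participation is the stock v11 GUC; order_line is the table from the workload above):

  SHOW parallel_leader_participation;  -- 'on' by default in v11
  EXPLAIN (ANALYZE, VERBOSE) SELECT count(*) FROM public.order_line;
  -- With "Workers Launched: 1", the Parallel Seq Scan still shows loops=2:
  -- the single worker plus the participating leader backend.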
On Tue, 2 Apr 2019 at 08:00, Andres Freund <andres@anarazel.de> wrote:
> I don't think that's really the problem here. It's that there's a
> fundamental misestimation in the query:
>
> [quoted plan snipped]
>
> note the estimated rows=100000 vs the actual rows=1 in the seqscan /
> materialize. That's what makes the planner think this is much more
> expensive than it is, which in turn triggers the use of a parallel scan.

It's true that there's a bad row estimate here, but short of fixing that,
it seems we could fix it by having try_partial_nestloop_path() check
whether a Materialize node would help save rescanning costs of the inner
plan, and apply one if it does. The planner does think that it's going to
have to perform the inner seq scan 37507293 * 3 times, so materializing
is likely to help.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
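The value of that Materialize node can be eyeballed from SQL alone by planning the serial shape with and without it. A sketch (enable_material and max_parallel_workers_per_gather are stock planner GUCs; exact costs vary by machine, and the planner may pick a different shape when materialization is off):

  SET max_parallel_workers_per_gather = 0;  -- serial plan only
  EXPLAIN select ref_0.ol_delivery_d
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);  -- inner side materialized
  SET enable_material = off;
  EXPLAIN select ref_0.ol_delivery_d
  from public.order_line as ref_0
  where EXISTS (select 1 from public.item as ref_1
                where ref_0.ol_d_id <= ref_1.i_im_id);  -- bare Seq Scan rescans
  RESET enable_material;
  RESET max_parallel_workers_per_gather;
  -- The second estimate charges the full item Seq Scan (~2271 cost units) per
  -- outer row; the first charges only the far cheaper Materialize rescan,
  -- which is the saving a partial nested loop path currently never considers.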