Re: Parallel query execution introduces performance regressions - Mailing list pgsql-bugs

From Andres Freund
Subject Re: Parallel query execution introduces performance regressions
Date
Msg-id 20190401190022.ccugidpkncnckqli@alap3.anarazel.de
Whole thread Raw
In response to Re: Parallel query execution introduces performance regressions  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Parallel query execution introduces performance regressions  (Peter Geoghegan <pg@bowt.ie>)
Re: Parallel query execution introduces performance regressions  (Jinho Jung <jinhojun@usc.edu>)
Re: Parallel query execution introduces performance regressions  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-bugs
Hi,

On 2019-04-01 11:52:54 -0700, Peter Geoghegan wrote:
> On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun@usc.edu> wrote:
> > Surprisingly, we found that even on a larger TPC-C database (scale factor of 50, roughly 4GB of size), parallel
scanis still slower than the non-parallel execution plan in the old version.
 
> 
> That's not a large database, and it's certainly not a large TPC-C
> database. If you attempt to stay under the spec's maximum
> tpmC/throughput per warehouse, which is 12.86 tpmC per warehouse, then
> you'll need several thousand warehouses on modern hardware. We're
> talking several hundred gigabytes. Otherwise, as far as the spec is
> concerned you're testing an unrealistic workload. There will be
> individual customers that make many more purchases than is humanly
> possible. You're modelling an app involving hypothetical warehouse
> employees that must enter data into their terminals at a rate that is
> not humanly possible.

I don't think that's really the problem here. It's that there's a
fundamental misestimation in the query:

> [OLD version]
> Nested Loop Semi Join  (cost=0.00..90020417940.08 rows=30005835 width=8)
> (actual time=0.034..24981.895 rows=90017507 loops=1)
>   Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
>   ->  Seq Scan on order_line ref_0  (cost=0.00..2011503.04 rows=90017504
> width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
>   ->  Materialize  (cost=0.00..2771.00 rows=100000 width=4) (actual
> time=0.000..0.000 rows=1 loops=90017507)
>       ->  Seq Scan on item ref_1  (cost=0.00..2271.00 rows=100000 width=4)
> (actual time=0.006..0.006 rows=1 loops=1)

note the estimated rows=100000 vs the actual rows=1 in the seqscan /
materialize. That's what makes the planner think this is much more
expensive than it is, which in turn triggers the use of a parallel scan.

Greetings,

Andres Freund



pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Parallel query execution introduces performance regressions
Next
From: Peter Geoghegan
Date:
Subject: Re: Parallel query execution introduces performance regressions