From: Jung, Jinho
Subject: Re: Performance regressions found using sqlfuzz
Msg-id: BN6PR07MB3409EE6CAAF8CCF43820AFB9EE670@BN6PR07MB3409.namprd07.prod.outlook.com
In response to: Re: Performance regressions found using sqlfuzz (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Performance regressions found using sqlfuzz (Andrew Gierth <andrew@tao11.riddles.org.uk>)
           Re: Performance regressions found using sqlfuzz (Andres Freund <andres@anarazel.de>)
List: pgsql-performance

Hi Jeff, 


Thanks for the feedback! The git bisect idea was particularly helpful.


We use query complexity constraints in sqlfuzz to ensure that the constructed queries are realistic (e.g., we limit the query size, automatically minimize the query, avoid esoteric expressions and functions, restrict the number of joins, etc.).


Our goal is to augment the test suite with queries that will help developers more comprehensively evaluate the impact of new optimization heuristics, query processing strategies, etc. We are working on improving the utility of the tool, and your feedback on these reports will be super helpful. Thanks.


For each regression, we share:


1) the associated query,

2) the commit that introduced it,

3) our high-level analysis, and

4) query execution plans in old and new versions of PostgreSQL.


All these regressions are observed on the latest version (dev HEAD).


####### QUERY 2: 


  select distinct   
    ref_0.i_im_id as c0,
    ref_1.ol_dist_info as c1
  from 
      public.item as ref_0 right join 
      public.order_line as ref_1
      on (ref_0.i_id = 5)

- Commit: 84f9a35 (Improve estimate of distinct values in estimate_num_groups())

- Our analysis: We believe that this regression is related to the new logic for estimating the number of distinct values in the optimizer. It affects even queries with point lookups (ref_0.i_id = 5) on the TPC-C benchmark.

- Query Execution Plans

  [OLD version]
  HashAggregate  (cost=11972.38..12266.20 rows=29382 width=29) (actual time=233.543..324.973 rows=300144 loops=1)
    Group Key: ref_0.i_im_id, ref_1.ol_dist_info
    ->  Nested Loop Left Join  (cost=0.29..10471.64 rows=300148 width=29) (actual time=0.012..114.955 rows=300148 loops=1)
          ->  Seq Scan on order_line ref_1  (cost=0.00..6711.48 rows=300148 width=25) (actual time=0.004..25.061 rows=300148 loops=1)
          ->  Materialize  (cost=0.29..8.32 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=300148)
                ->  Index Scan using item_pkey on item ref_0  (cost=0.29..8.31 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
                      Index Cond: (i_id = 10)
  Planning time: 0.267 ms
  Execution time: 338.027 ms


  [NEW version]
  Unique  (cost=44960.08..47211.19 rows=300148 width=29) (actual time=646.545..885.502 rows=300144 loops=1)
    ->  Sort  (cost=44960.08..45710.45 rows=300148 width=29) (actual time=646.544..838.933 rows=300148 loops=1)
          Sort Key: ref_0.i_im_id, ref_1.ol_dist_info
          Sort Method: external merge  Disk: 11480kB
          ->  Nested Loop Left Join  (cost=0.29..10471.64 rows=300148 width=29) (actual time=0.016..111.889 rows=300148 loops=1)
                ->  Seq Scan on order_line ref_1  (cost=0.00..6711.48 rows=300148 width=25) (actual time=0.004..24.612 rows=300148 loops=1)
                ->  Materialize  (cost=0.29..8.32 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=300148)
                      ->  Index Scan using item_pkey on item ref_0  (cost=0.29..8.31 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
                            Index Cond: (i_id = 10)
  Planning Time: 0.341 ms
  Execution Time: 896.662 ms
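
  As a quick sanity check on the estimation side (a sketch, assuming the same TPC-C database and that the tables have been ANALYZEd), the per-column distinct-value statistics that estimate_num_groups() starts from can be inspected directly:

  -- Per-column n_distinct estimates the planner combines when estimating the
  -- number of DISTINCT groups for (i_im_id, ol_dist_info).
  SELECT tablename, attname, n_distinct
  FROM pg_stats
  WHERE (tablename, attname) IN (('item', 'i_im_id'),
                                 ('order_line', 'ol_dist_info'));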

####### QUERY 3: 


  select  
    cast(ref_1.ol_i_id as int4) as c0
  from 
    public.stock as ref_0
      left join public.order_line as ref_1
      on (ref_1.ol_number is not null)
  where ref_1.ol_number is null

- Commit: 77cd477 (Enable parallel query by default.)

- Our analysis: We believe that this regression is due to parallel queries being enabled by default. Surprisingly, we found that even on a larger TPC-C database (scale factor of 50, roughly 4GB), the parallel scan is still slower than the non-parallel scan used by the old version when the query returns no tuples.

- Query Execution Plans

  [OLD version]
  Nested Loop Anti Join  (cost=0.00..18006.81 rows=1 width=4) (actual time=28.689..28.689 rows=0 loops=1)
    ->  Seq Scan on stock ref_0  (cost=0.00..5340.00 rows=100000 width=0) (actual time=0.028..15.722 rows=100000 loops=1)
    ->  Materialize  (cost=0.00..9385.22 rows=300148 width=4) (actual time=0.000..0.000 rows=1 loops=100000)
          ->  Seq Scan on order_line ref_1  (cost=0.00..6711.48 rows=300148 width=4) (actual time=0.004..0.004 rows=1 loops=1)
                Filter: (ol_number IS NOT NULL)
  Planning time: 0.198 ms
  Execution time: 28.745 ms

  [NEW version]
  Gather  (cost=1000.00..15164.93 rows=1 width=4) (actual time=91.022..92.634 rows=0 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Nested Loop Anti Join  (cost=0.00..14164.83 rows=1 width=4) (actual time=88.889..88.889 rows=0 loops=3)
          ->  Parallel Seq Scan on stock ref_0  (cost=0.00..4756.67 rows=41667 width=0) (actual time=0.025..7.331 rows=33333 loops=3)
          ->  Seq Scan on order_line ref_1  (cost=0.00..6711.48 rows=300148 width=4) (actual time=0.002..0.002 rows=1 loops=100000)
                Filter: (ol_number IS NOT NULL)
  Planning Time: 0.258 ms
  Execution Time: 92.699 ms
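
  To confirm that the parallel plan itself is what loses here (a sketch, assuming the same database), the new version can be forced back to a serial plan for this one query:

  -- Disable parallel workers for this session only; the planner should then
  -- pick the same serial anti-join as the old version, making the comparison
  -- apples-to-apples.
  SET max_parallel_workers_per_gather = 0;
  EXPLAIN ANALYZE
  select cast(ref_1.ol_i_id as int4) as c0
  from public.stock as ref_0
    left join public.order_line as ref_1
    on (ref_1.ol_number is not null)
  where ref_1.ol_number is null;
  RESET max_parallel_workers_per_gather;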

####### QUERY 4: 


  select
    ref_0.s_dist_06 as c0
  from
    public.stock as ref_0
  where (ref_0.s_w_id < cast(least(0, 1) as int8))

- Commit: 5edc63b (Account for the effect of lossy pages when costing bitmap scans)

- Our analysis: We believe that this regression has to do with two factors: 1) conditional expressions (e.g., LEAST or NULLIF) are not reduced to constants, unlike string functions (e.g., CHAR_LENGTH); and 2) a change in the cost estimation function for bitmap scans. Execution time grows by three orders of magnitude. We note that this regression is only observed on large databases (e.g., scale factor of 50).

- Query Execution Plans

  [OLD version]
  Bitmap Heap Scan on stock ref_0  (cost=31201.11..273173.13 rows=1666668 width=25) (actual time=0.005..0.005 rows=0 loops=1)
    Recheck Cond: (s_w_id < (LEAST(0, 1))::bigint)
    ->  Bitmap Index Scan on stock_pkey  (cost=0.00..30784.44 rows=1666668 width=0) (actual time=0.005..0.005 rows=0 loops=1)
          Index Cond: (s_w_id < (LEAST(0, 1))::bigint)
  Planning time: 0.228 ms
  Execution time: 0.107 ms

  [NEW version]
  Seq Scan on stock ref_0  (cost=0.00..304469.17 rows=1666613 width=25) (actual time=716.397..716.397 rows=0 loops=1)
    Filter: (s_w_id < (LEAST(0, 1))::bigint)
    Rows Removed by Filter: 5000000
  Planning Time: 0.221 ms
  Execution Time: 716.513 ms
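
  To isolate the constant-folding part of the problem (a sketch, assuming the same scale-factor-50 database), the LEAST() call can be replaced by the literal it evaluates to; the planner should then estimate the predicate accurately and fall back to the index path:

  -- Same predicate with the conditional expression pre-evaluated by hand;
  -- if the index/bitmap plan comes back, the remaining gap is attributable
  -- to the expression not being reduced to a constant.
  EXPLAIN ANALYZE
  select ref_0.s_dist_06 as c0
  from public.stock as ref_0
  where (ref_0.s_w_id < cast(0 as int8));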

####### QUERY 1: 


  select  
    ref_0.o_d_id as c0
  from 
    public.oorder as ref_0
  where EXISTS (
    select              
      1
    from 
      (select distinct 
          ref_0.o_entry_d as c0, 
          ref_1.c_credit as c1
       from 
          public.customer as ref_1
       where (false)
      ) as subq_1
  );

- Commit: bf6c614 (Do execGrouping.c via expression eval machinery, take two)

- Our analysis: We are not sure about the root cause of this regression. It might have to do with the grouping logic.

- Query Execution Plans

  [OLD version]
  Seq Scan on oorder ref_0  (cost=0.00..77184338.54 rows=15022 width=4) (actual time=34.173..34.173 rows=0 loops=1)
    Filter: (SubPlan 1)
    Rows Removed by Filter: 30044
    SubPlan 1
      ->  Subquery Scan on subq_1  (cost=2569.01..2569.03 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=30044)
            ->  HashAggregate  (cost=2569.01..2569.02 rows=1 width=3) (actual time=0.000..0.000 rows=0 loops=30044)
                  Group Key: ref_0.o_entry_d, ref_1.c_credit
                  ->  Result  (cost=0.00..2569.00 rows=1 width=3) (actual time=0.000..0.000 rows=0 loops=30044)
                        One-Time Filter: false
                        ->  Seq Scan on customer ref_1  (cost=0.00..2569.00 rows=1 width=3) (never executed)
  Planning time: 0.325 ms
  Execution time: 34.234 ms

  [NEW version]
  Seq Scan on oorder ref_0  (cost=0.00..1152.32 rows=15022 width=4) (actual time=74.799..74.799 rows=0 loops=1)
    Filter: (SubPlan 1)
    Rows Removed by Filter: 30044
    SubPlan 1
      ->  Subquery Scan on subq_1  (cost=0.00..0.02 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=30044)
            ->  HashAggregate  (cost=0.00..0.01 rows=1 width=20) (actual time=0.000..0.000 rows=0 loops=30044)
                  Group Key: ref_0.o_entry_d, c_credit
                  ->  Result  (cost=0.00..0.00 rows=0 width=20) (actual time=0.000..0.000 rows=0 loops=30044)
                        One-Time Filter: false
  Planning Time: 0.350 ms
  Execution Time: 79.237 ms
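
  One hedged way to test the grouping-logic hypothesis (a sketch, assuming the same database) is to drop the DISTINCT, which removes the HashAggregate from the subplan; if the slowdown disappears, the per-call grouping setup is the likely culprit:

  -- Same query without DISTINCT: the subplan no longer needs a HashAggregate,
  -- so any remaining slowdown would point away from the grouping machinery.
  EXPLAIN ANALYZE
  select ref_0.o_d_id as c0
  from public.oorder as ref_0
  where EXISTS (
    select 1
    from (select ref_0.o_entry_d as c0,
                 ref_1.c_credit as c1
          from public.customer as ref_1
          where (false)) as subq_1
  );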


From: Jeff Janes <jeff.janes@gmail.com>
Sent: Tuesday, February 12, 2019 1:03 PM
To: Jung, Jinho
Cc: pgsql-performance@postgresql.org
Subject: Re: Performance regressions found using sqlfuzz
 
On Tue, Feb 12, 2019 at 4:23 AM Jung, Jinho <jinho.jung@gatech.edu> wrote:


Hello,

We are developing a tool called sqlfuzz for automatically finding performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing to generate SQL queries that take more time to execute on the latest version of PostgreSQL compared to prior versions. We hope that these queries would help further increase the utility of the regression test suite.

We would greatly appreciate feedback from the community regarding the queries found by the tool so far. We have already incorporated prior feedback from the community in the latest version of sqlfuzz.

This approach doesn't seem very exciting to me as-is, because optimization is a very pragmatic endeavor.  We make decisions all the time that might make some queries better and others worse.  If the queries that get better are natural/common ones, and the ones that get worse are weird/uncommon ones (like generated by a fuzzer), then making that change is an improvement even if there are some performance (as opposed to correctness) regressions. 

I would be more interested in investigating some of these if the report would:

1) include the exact commit in which the regression was introduced (i.e. automate "git bisect").
2) verify that the regression still exists in the dev HEAD and report which commit it was verified in (since HEAD changes frequently).
3) report which queries (if any) in your corpus were made better by the same commit which made the victim query worse.

Cheers,

Jeff
