BUG #16635: Query Optimizer - Performance optimization for the UNION function - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #16635: Query Optimizer - Performance optimization for the UNION function
Date
Msg-id 16635-b06e1ff6be5a1975@postgresql.org
Whole thread Raw
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16635
Logged by:          XINYU LIU
Email address:      XINYULIU@UMICH.EDU
PostgreSQL version: 13.0
Operating system:   Ubuntu 20.04
Description:

Hello,

We are sharing a pair of TPC-H queries that show a potential performance
optimization in this report:

First query:
SELECT "o_orderstatus", 
       "o_clerk", 
       "o_comment" 
FROM   "orders" 
WHERE  "o_orderstatus" LIKE '%' 
                            || 'F' 
UNION 
SELECT "o_orderstatus", 
       "o_clerk", 
       "o_comment" 
FROM   "orders";    

Second query:
SELECT "o_orderstatus", 
       "o_clerk", 
       "o_comment" 
FROM   (SELECT "o_orderstatus", 
               "o_clerk", 
               "o_comment" 
        FROM   "orders" 
        WHERE  "o_orderstatus" LIKE '%' 
                                    || 'F' 
        UNION ALL 
        SELECT "o_orderstatus", 
               "o_clerk", 
               "o_comment" 
        FROM   "orders") AS "t6" 
GROUP  BY "o_orderstatus", 
          "o_clerk", 
          "o_comment"; 

[Actual Behavior]
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes 40 seconds, while the second query only takes 15 seconds. We
think the time difference results from different plans selected.

[Query Execution Plan]
First query:
                                                                QUERY PLAN
                                                                

-----------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=5807645.69..5919521.71 rows=11187602 width=248) (actual
time=32713.029..40265.378 rows=7499932 loops=1)
   ->  Sort  (cost=5807645.69..5835614.69 rows=11187602 width=248) (actual
time=32713.028..38766.718 rows=11154501 loops=1)
         Sort Key: orders.o_orderstatus, orders.o_clerk, orders.o_comment
         Sort Method: external merge  Disk: 854080kB
         ->  Append  (cost=0.00..597462.95 rows=11187602 width=248) (actual
time=0.380..2995.901 rows=11154501 loops=1)
               ->  Seq Scan on orders  (cost=0.00..224198.29 rows=3688539
width=67) (actual time=0.379..1420.758 rows=3654501 loops=1)
                     Filter: (o_orderstatus ~~ '%F'::text)
                     Rows Removed by Filter: 3845499
               ->  Seq Scan on orders orders_1  (cost=0.00..205450.63
rows=7499063 width=67) (actual time=0.016..984.347 rows=7500000 loops=1)
 Planning Time: 5.560 ms
 Execution Time: 40517.865 ms
(11 rows)






Second query:
                                                                     QUERY
PLAN
 

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=1210086.17..1247494.70 rows=1118760 width=67)
(actual time=10487.829..14780.165 rows=7499932 loops=1)
   Group Key: orders_1.o_orderstatus, orders_1.o_clerk, orders_1.o_comment
   Planned Partitions: 64  Batches: 697  Memory Usage: 4625kB  Disk Usage:
984304kB
   ->  Gather  (cost=773639.83..1088421.02 rows=2237520 width=67) (actual
time=3937.253..6917.972 rows=11153666 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial HashAggregate  (cost=772639.83..863669.02 rows=1118760
width=67) (actual time=3932.997..5636.808 rows=3717889 loops=3)
               Group Key: orders_1.o_orderstatus, orders_1.o_clerk,
orders_1.o_comment
               Planned Partitions: 64  Batches: 425  Memory Usage: 4185kB
Disk Usage: 323584kB
               Worker 0:  Batches: 393  Memory Usage: 4185kB  Disk Usage:
323584kB
               Worker 1:  Batches: 453  Memory Usage: 4185kB  Disk Usage:
370688kB
               ->  Parallel Append  (cost=0.00..402174.85 rows=6813149
width=67) (actual time=0.139..2475.761 rows=3718167 loops=3)
                     ->  Parallel Seq Scan on orders orders_1
(cost=0.00..169517.62 rows=1536891 width=67) (actual time=0.014..769.725
rows=1218167 loops=3)
                           Filter: (o_orderstatus ~~ '%F'::text)
                           Rows Removed by Filter: 1281833
                     ->  Parallel Seq Scan on orders  (cost=0.00..161706.10
rows=3124610 width=67) (actual time=0.202..2193.033 rows=3750000 loops=2)
 Planning Time: 5.025 ms
 Execution Time: 15067.809 ms
(18 rows)






[Expected Behavior]
Since these two queries are semantically equivalent, we want to know whether
PostgreSQL would consider adding such optimization given the potential of
performance improvement.

[Test Environment]
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
Output of the “\d orders”: 

tpch5=# \d orders
                          Table "public.orders"
     Column      |         Type          | Collation | Nullable | Default 
-----------------+-----------------------+-----------+----------+---------
 o_orderkey      | integer               |           | not null | 
 o_custkey       | integer               |           | not null | 
 o_orderstatus   | character(1)          |           | not null | 
 o_totalprice    | numeric(15,2)         |           | not null | 
 o_orderdate     | date                  |           | not null | 
 o_orderpriority | character(15)         |           | not null | 
 o_clerk         | character(15)         |           | not null | 
 o_shippriority  | integer               |           | not null | 
 o_comment       | character varying(79) |           | not null | 
Indexes:
    "orders_pkey" PRIMARY KEY, btree (o_orderkey)
Foreign-key constraints:
    "orders_fk1" FOREIGN KEY (o_custkey) REFERENCES customer(c_custkey)
Referenced by:
    TABLE "lineitem" CONSTRAINT "lineitem_fk1" FOREIGN KEY (l_orderkey)
REFERENCES orders(o_orderkey)






[Steps for reproducing our observations:]    
Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
    echo $i
    name=`echo $i|cut -d'.' -f1`
    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
Execute the queries


pgsql-bugs by date:

Previous
From: Xinyu Liu
Date:
Subject: Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification
Next
From: Michał Albrycht
Date:
Subject: Re: BUG #16634: Conflicting names of indexes for partitioned tables