BUG #16635: Query Optimizer - Performance optimization for the UNION function - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #16635: Query Optimizer - Performance optimization for the UNION function |
Date | |
Msg-id | 16635-b06e1ff6be5a1975@postgresql.org Whole thread Raw |
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 16635 Logged by: XINYU LIU Email address: XINYULIU@UMICH.EDU PostgreSQL version: 13.0 Operating system: Ubuntu 20.04 Description: Hello, We are sharing a pair of TPC-H queries that show a potential performance optimization in this report: First query: SELECT "o_orderstatus", "o_clerk", "o_comment" FROM "orders" WHERE "o_orderstatus" LIKE '%' || 'F' UNION SELECT "o_orderstatus", "o_clerk", "o_comment" FROM "orders"; Second query: SELECT "o_orderstatus", "o_clerk", "o_comment" FROM (SELECT "o_orderstatus", "o_clerk", "o_comment" FROM "orders" WHERE "o_orderstatus" LIKE '%' || 'F' UNION ALL SELECT "o_orderstatus", "o_clerk", "o_comment" FROM "orders") AS "t6" GROUP BY "o_orderstatus", "o_clerk", "o_comment"; [Actual Behavior] We executed both queries on the TPC-H benchmark of scale factor 5: the first query takes 40 seconds, while the second query only takes 15 seconds. We think the time difference results from different plans selected. [Query Execution Plan] First query: QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=5807645.69..5919521.71 rows=11187602 width=248) (actual time=32713.029..40265.378 rows=7499932 loops=1) -> Sort (cost=5807645.69..5835614.69 rows=11187602 width=248) (actual time=32713.028..38766.718 rows=11154501 loops=1) Sort Key: orders.o_orderstatus, orders.o_clerk, orders.o_comment Sort Method: external merge Disk: 854080kB -> Append (cost=0.00..597462.95 rows=11187602 width=248) (actual time=0.380..2995.901 rows=11154501 loops=1) -> Seq Scan on orders (cost=0.00..224198.29 rows=3688539 width=67) (actual time=0.379..1420.758 rows=3654501 loops=1) Filter: (o_orderstatus ~~ '%F'::text) Rows Removed by Filter: 3845499 -> Seq Scan on orders orders_1 (cost=0.00..205450.63 rows=7499063 width=67) (actual time=0.016..984.347 rows=7500000 loops=1) Planning Time: 5.560 ms Execution Time: 40517.865 ms (11 rows) Second query: QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Finalize HashAggregate (cost=1210086.17..1247494.70 rows=1118760 width=67) (actual time=10487.829..14780.165 rows=7499932 loops=1) Group Key: orders_1.o_orderstatus, orders_1.o_clerk, orders_1.o_comment Planned Partitions: 64 Batches: 697 Memory Usage: 4625kB Disk Usage: 984304kB -> Gather (cost=773639.83..1088421.02 rows=2237520 width=67) (actual time=3937.253..6917.972 rows=11153666 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial HashAggregate (cost=772639.83..863669.02 rows=1118760 width=67) (actual time=3932.997..5636.808 rows=3717889 loops=3) Group Key: orders_1.o_orderstatus, orders_1.o_clerk, orders_1.o_comment Planned Partitions: 64 Batches: 425 Memory Usage: 4185kB Disk Usage: 323584kB Worker 0: Batches: 393 Memory Usage: 4185kB Disk Usage: 323584kB Worker 1: Batches: 453 Memory Usage: 4185kB Disk Usage: 370688kB -> Parallel Append (cost=0.00..402174.85 rows=6813149 width=67) (actual time=0.139..2475.761 rows=3718167 loops=3) -> Parallel Seq Scan on orders orders_1 (cost=0.00..169517.62 rows=1536891 width=67) (actual time=0.014..769.725 rows=1218167 loops=3) Filter: (o_orderstatus ~~ '%F'::text) Rows Removed by Filter: 1281833 -> Parallel Seq Scan on orders (cost=0.00..161706.10 rows=3124610 width=67) (actual time=0.202..2193.033 rows=3750000 loops=2) Planning Time: 5.025 ms Execution Time: 15067.809 ms (18 rows) [Expected Behavior] Since these two queries are semantically equivalent, we want to know whether PostgreSQL would consider adding such optimization given the potential of performance improvement. [Test Environment] Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux" PostgreSQL v12.3 Database: TPC-H benchmark (with scale factor 5) Output of the “\d orders”: tpch5=# \d orders Table "public.orders" Column | Type | Collation | Nullable | Default -----------------+-----------------------+-----------+----------+--------- o_orderkey | integer | | not null | o_custkey | integer | | not null | o_orderstatus | character(1) | | not null | o_totalprice | numeric(15,2) | | not null | o_orderdate | date | | not null | o_orderpriority | character(15) | | not null | o_clerk | character(15) | | not null | o_shippriority | integer | | not null | o_comment | character varying(79) | | not null | Indexes: "orders_pkey" PRIMARY KEY, btree (o_orderkey) Foreign-key constraints: "orders_fk1" FOREIGN KEY (o_custkey) REFERENCES customer(c_custkey) Referenced by: TABLE "lineitem" CONSTRAINT "lineitem_fk1" FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey) [Steps for reproducing our observations:] Download the dataset from the link: https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing Set up TPC-H benchmark tar xzvf tpch5_postgresql.tar.gz cd tpch5_postgresql db=tpch5 createdb $db psql -d $db < dss.ddl for i in `ls *.tbl` do echo $i name=`echo $i|cut -d'.' -f1` psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';" done psql -d $db < dss_postgres.ri Execute the queries
pgsql-bugs by date: