Re: UNION ALL - Mailing list pgsql-hackers

From Mark Pasterkamp
Subject Re: UNION ALL
Date
Msg-id CACa2YyAgpEoZzi51bxftX3eWqjH48TiX32knwj+PsZ309M80pA@mail.gmail.com
Whole thread Raw
In response to Re: UNION ALL  (Ibrar Ahmed <ibrar.ahmad@gmail.com>)
Responses Re: UNION ALL
List pgsql-hackers
First of all, thank you for the replies.

I am using a base installation of postgres 10.10, with no modifications to any of the system defaults.

I am trying to speedup a join between two tables: the title table and the cast_info table.

The title table is a table containing information about different movies.
it contains 4626969 records.
the table also has a foreign key index on the cast_info table, enabling the planner to use a hash-join.

The cast_info table is a table containing the information of which actor was casted in which movie and contains 62039343 records.

The database also contains a materialized view ci_t_15, defined as:
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year < 2015

I am comparing two queries, q1 and q2 respectively.
Query q1 is the original query and q2 is an attempt to reduce the cost of execution via leveraging the materialized view ci_t_15.

Query q1 is defined as:
select * from cast_info join title on cast_info.movie_id = title.id

Query q2 is defined as
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year >= 2015
UNION ALL
select * from ci_t_15

Both queries are executed on a Dell xps laptop with an I7-8750H processor and 16 (2*8) gb ram on an SSD running on ubuntu 18.04.2 LTS.

Running explain analyze on both queries I get the following execution plans.
q1:
"Hash Join  (cost=199773.80..2561662.10 rows=62155656 width=103) (actual time=855.063..25786.264 rows=62039343 loops=1)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Seq Scan on cast_info  (cost=0.00..1056445.56 rows=62155656 width=42) (actual time=0.027..3837.722 rows=62039343 loops=1)"
"  ->  Hash  (cost=92232.69..92232.69 rows=4626969 width=61) (actual time=854.548..854.548 rows=4626969 loops=1)"
"        Buckets: 65536  Batches: 128  Memory Usage: 3431kB"
"        ->  Seq Scan on title  (cost=0.00..92232.69 rows=4626969 width=61) (actual time=0.005..327.588 rows=4626969 loops=1)"
"Planning time: 5.097 ms"
"Execution time: 27236.088 ms"

q2:
"Append  (cost=123209.65..3713445.65 rows=61473488 width=105) (actual time=442.207..29713.621 rows=60918189 loops=1)"
"  ->  Gather  (cost=123209.65..2412792.77 rows=10639784 width=103) (actual time=442.206..14634.427 rows=10046633 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Hash Join  (cost=122209.65..1347814.37 rows=4433243 width=103) (actual time=471.969..12527.840 rows=3348878 loops=3)"
"              Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"              ->  Parallel Seq Scan on cast_info  (cost=0.00..693870.90 rows=25898190 width=42) (actual time=0.006..7302.679 rows=20679781 loops=3)"
"              ->  Hash  (cost=103800.11..103800.11 rows=792043 width=61) (actual time=471.351..471.351 rows=775098 loops=3)"
"                    Buckets: 65536  Batches: 32  Memory Usage: 2515kB"
"                    ->  Seq Scan on title  (cost=0.00..103800.11 rows=792043 width=61) (actual time=0.009..376.127 rows=775098 loops=3)"
"                          Filter: (t_production_year >= 2015)"
"                          Rows Removed by Filter: 3851871"
"  ->  Seq Scan on ci_t_15  (cost=0.00..1194255.04 rows=50833704 width=105) (actual time=1.143..11967.391 rows=50871556 loops=1)"
"Planning time: 0.268 ms"
"Execution time: 31379.854 ms"

Due to using the materialized view I can reduce the amount of records going into the hash join, lowering the time from 25786.264 msec to 12527.840 msec.
However, this is where my question comes in, this reduction is completely negated by the cost of appending both results in the UNION ALL command.
I was wondering if this is normal behaviour.
In my mind, I wouldn't expect appending 2 resultsets to have such a relative huge cost associated with it.
This is also why I asked what exactly a UNION ALL does to achieve its functionality, to perhaps gain some insight in its cost.


With kind regards,

Mark

On Thu, 15 Aug 2019 at 21:22, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:


On Fri, Aug 16, 2019 at 12:16 AM <066ce286@free.fr> wrote:
Generally speaking, when executing UNION ; a DISTINCT is run afterward on the resultset.

So, if you're sure that each part of UNION cannot return a line returned by another one, you may use UNION ALL, you'll cut the cost of the final implicit DISTINCT.


----- Mail original -----
De: "Mark Pasterkamp" <markpasterkamp1994@gmail.com>
À: pgsql-hackers@lists.postgresql.org
Envoyé: Jeudi 15 Août 2019 20:37:06
Objet: UNION ALL


Dear all,


I was wondering if someone could help me understands what a union all actually does.


For my thesis I am using Apache Calcite to rewrite queries into using materialized views which I then give to a Postgres database.
For some queries, this means that they will be rewritten in a UNION ALL style query between an expression and a table scan of a materialized view.
However, contrary to what I expected, the UNION ALL query is actually a lot slower.


As an example, say I have 2 tables: actor and movie. Furthermore, there is also a foreign key index on movie to actor.
I also have a materialized view with the join of these 2 tables for all movies <= 2015 called A.
Now, if I want to query all entries in the join between actor and movie, I would assume that a UNION ALL between the join of actor and movie for movies >2015 and A is faster than executing the original query..
If I look at the explain analyze part, I can certainly see a reduction in cost up until the UNION ALL part, which carries a respective cost more than negating the cost reduction up to a point where I might as well not use the existing materialized view.


I have some trouble understanding this phenomenon.
One thought which came to my mind was that perhaps UNION ALL might create a temporary table containing both result sets, and then do a table scan and return that result.

this would greatly increase IO cost which could attribute to the problem.
However, I am really not sure what UNION ALL actually does to append both result sets so I was wondering if someone would be able to help me out with this.




Mark


066ce286@free.fr:  Please, avoid top-posting. It makes harder to follow the
discussion.

--
Ibrar Ahmed

pgsql-hackers by date:

Previous
From: Konstantin Knizhnik
Date:
Subject: Re: Global temporary tables
Next
From: Craig Ringer
Date:
Subject: Re: Global temporary tables