Slow performance with trivial self-joins - Mailing list pgsql-performance

From Benny Kramek
Subject Slow performance with trivial self-joins
Date
Msg-id CAGPCyEZWEVjkDtuHipOaPKbO2WXdLUuNBhj818-wEuhi4KLbzA@mail.gmail.com
Whole thread Raw
Responses Re: Slow performance with trivial self-joins  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Hello,

I am experiencing slow performance when joining a table against itself on its
primary key column.

I expect the query plan to be identical for both of the below queries (and I
expect the performance to also be identical). But the second one is much slower:

The FAST QUERY has a planning time of 0.110 ms and execution time of 3.836 ms
The SLOW QUERY has a planning time of 0.296 ms and execution time of 22.969 ms

The reason I believe that they should be the same is because the postgres query
planner should notice that I am joining a table against itself on its primary
key column (which is not null + unique) and therefore it should realize that it
doesn't actually have to do any additional work and can simply directly access
the existing columns.

I've tested this on PostgreSQL 10, 11, 12, 12.1 and 13devel (source snapshot
from 2020-02-03, git commit f1f10a1ba9e17e606a7b217ccccdd3cc4d8cb771)

Here is a full example session:


---------
-- SETUP
---------

CREATE TABLE test_data (
    id SERIAL4 PRIMARY KEY,
    value TEXT
);

INSERT INTO test_data (value)
SELECT value FROM (
    SELECT
        generate_series(1, 100000) AS id,
        md5(random()::TEXT) AS value
) q;

--------------
-- FAST QUERY
--------------

EXPLAIN ANALYZE SELECT
    test_data.id,
    md5(test_data.value) AS x,
    md5(md5(md5(md5(md5(test_data.value))))) AS y
FROM
    test_data
WHERE TRUE
    AND test_data.id BETWEEN 3000 AND 4000;

--------------
-- SLOW QUERY
--------------

EXPLAIN ANALYZE SELECT
    test_data.id,
    md5(test_data.value) AS x,
    md5(md5(md5(md5(md5(t2.value))))) AS y
FROM
    test_data,
    test_data AS t2
WHERE TRUE
    AND t2.id = test_data.id
    AND test_data.id BETWEEN 3000 AND 4000;

--- END ---


Here is the query plan of the FAST QUERY:

 Index Scan using test_data_pkey on test_data  (cost=0.29..60.17
rows=1025 width=68) (actual time=0.047..3.747 rows=1001 loops=1)
   Index Cond: ((id >= 3000) AND (id <= 4000))
 Planning Time: 0.110 ms
 Execution Time: 3.836 ms
(4 rows)


Here is the query plan of the SLOW QUERY:

 Hash Join  (cost=57.60..2169.49 rows=1025 width=68) (actual
time=1.372..22.876 rows=1001 loops=1)
   Hash Cond: (t2.id = test_data.id)
   ->  Seq Scan on test_data t2  (cost=0.00..1834.00 rows=100000
width=37) (actual time=0.010..8.800 rows=100000 loops=1)
   ->  Hash  (cost=44.79..44.79 rows=1025 width=37) (actual
time=0.499..0.499 rows=1001 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 84kB
         ->  Index Scan using test_data_pkey on test_data
(cost=0.29..44.79 rows=1025 width=37) (actual time=0.023..0.287
rows=1001 loops=1)
               Index Cond: ((id >= 3000) AND (id <= 4000))
 Planning Time: 0.296 ms
 Execution Time: 22.969 ms
(9 rows)



pgsql-performance by date:

Previous
From: Marco Colli
Date:
Subject: Re: Statistics on array values
Next
From: Tom Lane
Date:
Subject: Re: Slow performance with trivial self-joins