Thread: Slow performance with trivial self-joins

From: Benny Kramek

Hello,

I am experiencing slow performance when joining a table against itself on its
primary key column.

I expect the query plan to be identical for both of the below queries (and I
expect the performance to also be identical). But the second one is much slower:

The FAST QUERY has a planning time of 0.110 ms and an execution time of 3.836 ms.
The SLOW QUERY has a planning time of 0.296 ms and an execution time of 22.969 ms.

The reason I believe they should be the same is that the Postgres query
planner should notice that I am joining a table against itself on its primary
key column (which is NOT NULL and UNIQUE), realize that no additional work is
actually needed, and simply access the existing columns directly.

I've tested this on PostgreSQL 10, 11, 12, 12.1 and 13devel (source snapshot
from 2020-02-03, git commit f1f10a1ba9e17e606a7b217ccccdd3cc4d8cb771)

Here is a full example session:


---------
-- SETUP
---------

CREATE TABLE test_data (
    id SERIAL4 PRIMARY KEY,
    value TEXT
);

INSERT INTO test_data (value)
SELECT value FROM (
    SELECT
        generate_series(1, 100000) AS id,
        md5(random()::TEXT) AS value
) q;

--------------
-- FAST QUERY
--------------

EXPLAIN ANALYZE SELECT
    test_data.id,
    md5(test_data.value) AS x,
    md5(md5(md5(md5(md5(test_data.value))))) AS y
FROM
    test_data
WHERE TRUE
    AND test_data.id BETWEEN 3000 AND 4000;

--------------
-- SLOW QUERY
--------------

EXPLAIN ANALYZE SELECT
    test_data.id,
    md5(test_data.value) AS x,
    md5(md5(md5(md5(md5(t2.value))))) AS y
FROM
    test_data,
    test_data AS t2
WHERE TRUE
    AND t2.id = test_data.id
    AND test_data.id BETWEEN 3000 AND 4000;

--- END ---


Here is the query plan of the FAST QUERY:

 Index Scan using test_data_pkey on test_data  (cost=0.29..60.17 rows=1025 width=68) (actual time=0.047..3.747 rows=1001 loops=1)
   Index Cond: ((id >= 3000) AND (id <= 4000))
 Planning Time: 0.110 ms
 Execution Time: 3.836 ms
(4 rows)


Here is the query plan of the SLOW QUERY:

 Hash Join  (cost=57.60..2169.49 rows=1025 width=68) (actual time=1.372..22.876 rows=1001 loops=1)
   Hash Cond: (t2.id = test_data.id)
   ->  Seq Scan on test_data t2  (cost=0.00..1834.00 rows=100000 width=37) (actual time=0.010..8.800 rows=100000 loops=1)
   ->  Hash  (cost=44.79..44.79 rows=1025 width=37) (actual time=0.499..0.499 rows=1001 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 84kB
         ->  Index Scan using test_data_pkey on test_data  (cost=0.29..44.79 rows=1025 width=37) (actual time=0.023..0.287 rows=1001 loops=1)
               Index Cond: ((id >= 3000) AND (id <= 4000))
 Planning Time: 0.296 ms
 Execution Time: 22.969 ms
(9 rows)



Re: Slow performance with trivial self-joins

From: Tom Lane

Benny Kramek <benny@medflyt.com> writes:
> I expect the query plan to be identical for both of the below queries (and I
> expect the performance to also be identical).

[ shrug... ] Your expectation is mistaken.  There is no code in Postgres
to eliminate useless self-joins.  People have been fooling around with
a patch to do so [1], but I'm unsure whether it'll ever get committed,
or whether we even want the feature.  It seems not unlikely that the
time wasted trying to identify useless self-joins (in queries where
the optimization doesn't actually apply) would outweigh the win when
it does apply.  So there's a limit to how much effort the server should
spend trying to clean up after poorly-written queries.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru
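
For reference, the redundant join in the SLOW QUERY can be removed by hand today: since t2.id = test_data.id on a NOT NULL unique key, every reference to t2 can simply be replaced by test_data. A sketch of the manual rewrite (equivalent to the FAST QUERY above, and to what the patch would produce automatically):

```sql
-- Manual self-join removal: t2 is joined on the primary key,
-- so each t2 column reference is replaced by test_data itself.
SELECT
    test_data.id,
    md5(test_data.value) AS x,
    md5(md5(md5(md5(md5(test_data.value))))) AS y
FROM
    test_data
WHERE test_data.id BETWEEN 3000 AND 4000;
```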



Re: Slow performance with trivial self-joins

From: Benny Kramek

Thank you for your response. I have tested the patch from the linked
thread, and it works very well on a number of complex queries I have
tried, improving planning time significantly and execution time
drastically.

I have also read through the entire linked discussion thread as well
as a few other large threads linked from it, and found the discussion
very interesting.

I don't believe that all such queries are "poorly written". As was
discussed in the other threads, these kinds of self-joins tend to
appear when you use SQL views. You can create a library of reusable
views that are small, easy to understand and readable, build them up
into bigger views, and finally query from them; but then you end up
with lots of (hidden) self-joins. The alternative is to copy and paste
the shared logic from the views into every query.

I understand the need to be conservative about which optimizations to
apply, so as not to waste time looking for opportunities that don't
exist. One idea I didn't see mentioned is the following heuristic:
attempt self-join removal only if the query references one or more SQL
views. This should be enough because, as you say, no human would
intentionally write such a query. Queries generated by ORMs were also
discussed, so it might also be worthwhile to consider queries that
contain inner SELECTs.
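
A minimal sketch of how views hide such a self-join, using the test_data table from earlier in this thread (the view names here are hypothetical; the join only becomes visible once the views are inlined by the planner):

```sql
-- Two small, reusable views over the same base table.
CREATE VIEW test_values AS
    SELECT id, md5(value) AS x FROM test_data;

CREATE VIEW test_hashes AS
    SELECT id, md5(md5(md5(md5(md5(value))))) AS y FROM test_data;

-- Each view is trivial on its own, but combining them produces a
-- hidden self-join of test_data with itself on the primary key:
SELECT v.id, v.x, h.y
FROM test_values v
JOIN test_hashes h ON h.id = v.id
WHERE v.id BETWEEN 3000 AND 4000;
```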



Re: Slow performance with trivial self-joins

From: Adam Brusselback

> You can create a library of
> reusable views that are small, easy-to-understand and readable. Then
> you build them up into bigger views, and finally query from them. But
> then you end up with lots of (hidden) self-joins.  

I can confirm that this use case is pretty common, but it is also
something I have actively avoided wherever performance matters,
precisely because this optimization is missing.

Even so, I have 20+ views like that in my database.

Re: Slow performance with trivial self-joins

From: David Rowley

On Thu, 6 Feb 2020 at 11:12, Adam Brusselback <adambrusselback@gmail.com> wrote:
>
> > You can create a library of
> > reusable views that are small, easy-to-understand and readable. Then
> > you build them up into bigger views, and finally query from them. But
> > then you end up with lots of (hidden) self-joins.
>
> I will concur with this use case being pretty common, but also something I
> have actively avoided anywhere performance is important because of the
> lack of this optimization.
>
> Even still, I have 20+ views like that in my database.

I think the best way to push this forward would be to benchmark the
proposed patch and see whether the overhead of detecting self-joined
relations is measurable across queries with varying numbers of joins.

It does not sound like it would be a great deal of effort to look
through the range table for duplicate Oids and only do further
processing to attempt self-join removal if any are found. However, if
that effort happened to slow down all queries by, say, 5%, then
perhaps it would be a bad idea. Opinions don't carry much weight in
arguments like this; unbiased and reproducible benchmarks should be
used as evidence to support the discussion. Doing worst-case and
average-case benchmarks up front will save you time, as someone will
almost certainly ask for them if you don't.

(I've not been following the thread for the patch)
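
One way to set up such a benchmark is with pgbench custom scripts (a sketch only; it assumes the test_data table from earlier in the thread, a database named bench, and pgbench on the PATH — all of which are this editor's assumptions, not part of the original suggestion):

```shell
# Write two pgbench scripts: one with the redundant self-join,
# one with the join removed by hand.
cat > with_selfjoin.sql <<'EOF'
SELECT td.id, md5(t2.value)
FROM test_data td, test_data t2
WHERE t2.id = td.id AND td.id BETWEEN 3000 AND 4000;
EOF

cat > without_selfjoin.sql <<'EOF'
SELECT td.id, md5(td.value)
FROM test_data td
WHERE td.id BETWEEN 3000 AND 4000;
EOF

# Then run each script against both an unpatched and a patched
# server build and compare latency / tps, for example:
#   pgbench -n -f with_selfjoin.sql -T 30 bench
#   pgbench -n -f without_selfjoin.sql -T 30 bench
```

Running both the worst case (self-join present but not removable) and the average case, as suggested above, is what makes the numbers persuasive.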

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services