Re: Add rows removed by hash join clause to instrumentation - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Add rows removed by hash join clause to instrumentation
Date
Msg-id 3295bbb0-99e1-4f9d-a8e9-036dbdfc36ce@gmail.com
In response to Add rows removed by hash join clause to instrumentation  (Andrei Lepikhov <lepihov@gmail.com>)
List pgsql-hackers
On 17/1/26 00:51, Andrei Lepikhov wrote:
> Thoughts?
> 
In this letter, I want to demonstrate how to use the proposed feature.

Install PostgreSQL 18 and the Join-Order-Benchmark [1]. We need just one 
pass, executing each query one by one. To identify issues, I use the 
pg_track_optimizer extension [2], which examines a 'bad join' criterion. 
For each JOIN node, the following value is calculated:

jf_max = (nfiltered1 + nfiltered2 + nfiltered3) / nloops
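
As a purely illustrative reading of this formula, the three nfiltered 
counters are the per-clause "rows removed" totals a join node 
accumulates across its executions. The annotated fragment below is a 
hypothetical EXPLAIN (ANALYZE) shape, not actual output from the patch:

```sql
-- Hypothetical annotated plan fragment; all figures are illustrative.
-- Hash Join (actual rows=100 loops=2)
--   Hash Cond: (t1.a = t2.a)           -- removals feed one counter
--   Join Filter: (t1.b < t2.b)
--   Rows Removed by Join Filter: 600   -- removals feed another counter
--
-- jf_max for this node = (600 + 0 + 0) / 2 loops = 300
```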

The extension stores MAX(jf_max) for each query to detect potentially 
optimisable ones. So, let's benchmark both with and without the proposed 
change. The TOP-5 bad queries expose quite a different picture:

SELECT queryid,LEFT(query,12) AS query,floor(jf_max) AS jf_max
FROM pg_track_optimizer WHERE jf_max > 0
ORDER BY jf_max DESC LIMIT 5;

Without the patch:

       queryid       |    query     | jf_max
---------------------+--------------+--------
 4335842597099666660 | /* 27c.sql * | 490348

With the patched Postgres and extension:

       queryid        |    query     | jf_max
----------------------+--------------+--------
 -3135863217603061370 | /* 32a.sql * | 871591
  6389747766960672879 | /* 27c.sql * | 491129
 -6029921280260605067 | /* 6a.sql */ | 218912
  5025464644910963332 | /* 25b.sql * |  25753
  1798119524484989875 | /* 10c.sql * |  22939

In the patched variant, we see more potential cases. Let’s take a look 
at the first three of them.

Query 27c.sql (see analysis-27c.sql in the attachment) is detected as 
filtering too much in both cases because, on occasion, intensive 
filtering occurs in a NestLoop node. Index identification is a separate 
topic, but here I used a quite simple approach: I just asked Claude to 
collect join clauses that filter out many rows and propose potential 
indexes. That is not suitable for production, but it worked for such a 
simple and limited set of queries.

So, without the patch, we have extra indexes:

CREATE INDEX ON movie_info(movie_id);
CREATE INDEX ON movie_companies(movie_id);

The patched version finds one more problematic join, and the 
recommendation changes a little:

CREATE INDEX ind_3 ON movie_companies(movie_id,company_id,company_type_id);

After applying these indexes, the query speeds up from 3340 ms to 32 ms (x100).
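
Timings like these can be reproduced in psql with \timing; a minimal 
sketch (the script path is illustrative):

```sql
\timing on
\i 27c.sql   -- run the JOB query before and after creating the indexes
```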

Now 32a.sql, undetected without the patch, exposes that we can employ 
the following indexes:

CREATE INDEX ON movie_keyword (keyword_id,movie_id);
CREATE INDEX ON movie_link(movie_id);

It speeds up the query: 250 ms → 16 ms (x15)

And the last one, 6a.sql, needs two iterations:

The first one demonstrates that we need the index:
CREATE INDEX ON cast_info(movie_id,person_id);

An additional pass shows we can reduce the number of fetched tuples with 
the following index:
CREATE INDEX ON movie_keyword(keyword_id,movie_id);

Finally, that speeds up the query: 150 ms → 220 ms → 26 ms (x7)
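
The two-iteration loop above can be sketched in psql as follows (the 
file name comes from the JOB suite; the path is illustrative):

```sql
\timing on
CREATE INDEX ON cast_info (movie_id, person_id);       -- iteration 1
\i 6a.sql                                              -- re-measure
CREATE INDEX ON movie_keyword (keyword_id, movie_id);  -- iteration 2
\i 6a.sql
```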

As you can see, the main idea is to draw attention to specific queries 
and, given a specific methodology, propose indexes that may reduce the 
amount of initially fetched data.

[1] https://github.com/danolivo/jo-bench
[2] https://github.com/danolivo/pg_track_optimizer/tree/nfiltered3

-- 
regards, Andrei Lepikhov,
pgEdge
Attachment
