Performance of lateral join - Mailing list pgsql-performance

From Simen Andreas Andreassen Lønsethagen
Subject Performance of lateral join
Date
Msg-id A67CAB4B-C442-4641-BDF1-EDBE1DBF5876@fremtind.no
Whole thread Raw
Responses Re: Performance of lateral join  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-performance
Hi, first time posting, hope I have included the relevant information.

I am trying to understand the performance of a query which is intended to retrieve a subset of the following table:

    Table "contracts.bis_person_alle_endringer"
              Column              |           Type           | Collation | Nullable | Default 
    ----------------------------------+--------------------------+-----------+----------+---------
     person_id                        | uuid                     |           | not null | 
     dpd_gyldig_fra_dato              | date                     |           | not null | 
     dpd_i_kraft_fra_dato             | date                     |           | not null | 
     dpd_i_kraft_til_dato             | date                     |           | not null | 
     dpd_endret_tidspunkt             | timestamp with time zone |           | not null | 
     dpd_bis_foedselsnummer           | text                     |           |          | 
     dpd_bis_treffkilde_id            | text                     |           |          | 
    ... [omitted for brevity] ...
     dpd_endret_av                    | text                     |           |          | 
     dpd_bis_kjoenn_id                | text                     |           |          | 
    Indexes:
        "bis_person_alle_endringer_by_person_id" btree (person_id)
        "bis_person_alle_endringer_unique_descending" UNIQUE, btree (dpd_bis_foedselsnummer, dpd_gyldig_fra_dato DESC,
dpd_endret_tidspunktDESC)
 



    dpd=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions,
pg_table_size(oid)FROM pg_class WHERE relname='bis_person_alle_endringer';
 
          relname          | relpages |  reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions
|pg_table_size
 

---------------------------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
     bis_person_alle_endringer |  9367584 | 1.09584e+08 |       6392129 | r       |      106 | f              |
  |   76760489984
 
    (1 row)

I have ommitted most of the columns, as there are 106 columns in total. The ommitted columns have data types text,
numericor date, all are nullable.
 

To create the subsets, I (or rather my application) will receive lists of records which should be matched according to
somebusiness logic. Each of these lists will be read into a temporary table:
 

           Table "pg_temp_9.records_to_filter_on"
           Column        | Type | Collation | Nullable | Default
    ---------------------+------+-----------+----------+---------
     foedselsnummer      | text |           |          |
     tariff_dato         | date |           |          |
     versjons_dato       | date |           |          |
     kjent_i_system_dato | date |           |          |

The subset is then created by the following query, which finds the records in contracts.bis_person_alle_endringer which
satisfiesthe business logic (if any).
 

    select * from records_to_filter_on r
    left join lateral (
        select * from contracts.bis_person_alle_endringer b
        where b.dpd_bis_foedselsnummer = r.foedselsnummer AND
            r.kjent_i_system_dato >= b.dpd_endret_tidspunkt AND
            r.tariff_dato > b.dpd_gyldig_fra_dato 
        order by b.dpd_gyldig_fra_dato desc, b.dpd_endret_tidspunkt desc
        limit 1
    ) c on true
    where person_id is not null and
        r.versjons_dato < c.dpd_i_kraft_til_dato

The temporary table records_to_filter_on and the result of the above query will typically contain 1-5 million rows (the
returnedsubsets are used for training machine learning models).
 

I've created a sample data set with 3.75 million rows and run EXPLAIN (ANALYZE, BUFFERS) on the query,
https://explain.dalibo.com/plan/U41(and also attached). Running the full EXPLAIN (ANALYZE, BUFFERS) takes about 30
minutes,which seems quite slow. However, as I am new to postgres, I find it difficult to interpret the output of the
EXPLAIN(ANALYZE, BUFFERS) - most of the time is spent during an index scan, which to my understanding is "good".
However,I don't think I understand postgres well enough to judge whether this is the best I can achieve (or at last
closeenough) or if the query should be rewritten. Alternatively, is it not realistic to expect faster performance given
thesize of the table and the hardware of the database instance?
 

I am running PostgreSQL 11.9 on x86_64-pc-linux-gnu using AWS Aurora on a db.t3.large instance
(https://aws.amazon.com/rds/instance-types/).The output of 
 

    SELECT name, current_setting(name), source
      FROM pg_settings
      WHERE source NOT IN ('default', 'override');

is attached in pg_settings.conf.

I realize that these questions are a little vague, but any guidance would be much appreciated.

Thanks, Simen Lønsethagen


pgsql-performance by date:

Previous
From: "ldh@laurent-hasson.com"
Date:
Subject: RE: Big performance slowdown from 11.2 to 13.3
Next
From: Justin Pryzby
Date:
Subject: Re: Performance of lateral join