Thread: How to solve my slow disk i/o throughput during index scan

How to solve my slow disk i/o throughput during index scan

From
"FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)"
Date:

Hello everyone,

I have a database large enough that it does not fit entirely in RAM, including indexes that also exceed RAM capacity. Most of my big fact tables follow the same scheme:

  • 3–5 ID columns.
  • Partitioned along one ID (ID_JOUR).
  • 10–100 GB partitions.
  • 1–5 GB primary-key indexes for each partition.
  • 10–120 columns, all int or float8. Some columns contain a lot of NaNs, because they were not all created and added to daily processing at the same date.
  • Queries request a small subset (100k–1M rows), always filtering on the primary key (a WHERE … clause on each ID column; some filters go through dimension tables + joins, but I tried applying them directly and it did not solve my problem).
  • See the annex for the DDL of one of those tables.
  • See the annex for the sizes of my different tables.
  • Stable data (I am focusing on past data; rare to no insertions; VACUUM + CLUSTER done after those rare modifications).
  • I have run REINDEX, VACUUM, CLUSTER, and ANALYZE on those tables (see the sketch after this list).
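
For reference, those maintenance commands end up running against each partition; a minimal sketch for one partition (the partition and index names here are assumptions for illustration):

REINDEX TABLE public."F_TDOJ_HIST_1_P2023";
CLUSTER public."F_TDOJ_HIST_1_P2023" USING "F_TDOJ_HIST_1_P2023_pkey";
VACUUM ANALYZE public."F_TDOJ_HIST_1_P2023";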

 

When performing queries, I observe significant differences in processing time depending on whether the index needs to be read from disk or is already loaded in RAM. I think I have confirmed, using EXPLAIN ANALYZE, that the issue stems from index scans. See for example:

 

 

I measured the speed at which my index is loaded into RAM during a query: approximately 2 to 3 MB/s. However, my infrastructure theoretically supports I/O speeds of around 900 MB/s.
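
For reference, this kind of measurement can be reproduced with track_io_timing and EXPLAIN (ANALYZE, BUFFERS); a sketch, with placeholder ID values:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT "RES", "REV"
FROM public."F_TDOJ_HIST_1"
WHERE "ID_TRAIN" = 17855   -- placeholder IDs
  AND "ID_JOUR" = 1
  AND "ID_OD"   = 1
  AND "JX"      = 0;
-- "Buffers: shared read=N" * 8 kB, divided by the "I/O Timings: read=..."
-- figure, approximates the on-disk read throughput of the scan.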


On some older partitions, I sometimes got better throughput (see e.g. https://explain.dalibo.com/plan/4db409d1d6d95d4b).


I do not understand why reading my index from disk is so slow. I suspect that the index is not read sequentially, but I do not know how the PostgreSQL internals really behave, so this is just a supposition.

 

My question is: what can I change to get better index read speed?

 

What I already tried:

  • Setting random_page_cost to a prohibitive value (10000000) to force a bitmap heap scan, because those can be parallelized. This has not worked; the optimizer still does an index scan on my fact table.
  • Changing effective_io_concurrency, max_parallel_workers_per_gather, and work_mem to much higher values (see the sketch after this list).
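
In session-level terms, those attempts amount to something like the following (the exact "much higher" values are assumptions for illustration):

SET random_page_cost = 10000000;           -- prohibitive, to penalize index scans
SET effective_io_concurrency = 500;        -- assumed "much higher" value
SET max_parallel_workers_per_gather = 16;  -- assumed "much higher" value
SET work_mem = '4GB';                      -- value mentioned in the annex
-- then re-run the slow query under EXPLAIN (ANALYZE, BUFFERS)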


Thank you in advance for your help; any idea/advice is greatly appreciated!

Simon F.

ANNEX:

Some more details about my environment:

 

 

==> /etc/lsb-release <==
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"

==> /etc/os-release <==
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04 LTS"
VERSION_ID="20.04"
HOME_URL=https://www.ubuntu.com/
SUPPORT_URL=https://help.ubuntu.com/
BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

==> /etc/SNCF_release <==
HOSTNAME       : uzcordbr05
OS_NAME        : UBUNTU 20.04
OS_DESCRIPTION : Ubuntu 20.04 LTS
OS_RELEASE     : 20.04
OS_CODENAME    : focal
CMDB_ENV       : Recette
AENV           : hprod
REPO_IT        : https://repos.it.sncf.fr/repos/os/ubuntu/bugfix/dists/focal-bugfix/Release

 

  • Default query settings:
    • effective_cache_size = '96GB',
    • effective_io_concurrency = '200',
    • max_parallel_workers_per_gather = '4',
    • random_page_cost = '1.1',
    • search_path = 'public',
    • work_mem = '64MB' → I tried changing work_mem to 4GB; it did not change anything.

 

  • Postgres custom configuration settings

 

name                             | current_setting | source
---------------------------------+-----------------+--------------------
application_name                 | DBeaver 21.2.5 - SQLEditor <Script-19.sql> | session
archive_command                  | (disabled) | configuration file
archive_mode                     | off | configuration file
archive_timeout                  | 2h | configuration file
autovacuum_analyze_scale_factor  | 0.05 | configuration file
autovacuum_analyze_threshold     | 50 | configuration file
autovacuum_max_workers           | 6 | configuration file
autovacuum_naptime               | 15s | configuration file
autovacuum_vacuum_cost_delay     | 10ms | configuration file
autovacuum_vacuum_cost_limit     | -1 | configuration file
autovacuum_vacuum_scale_factor   | 0.01 | configuration file
autovacuum_vacuum_threshold      | 50 | configuration file
autovacuum_work_mem              | 512MB | configuration file
checkpoint_completion_target     | 0.9 | configuration file
client_encoding                  | UTF8 | client
cluster_name                     | irdbr010 | configuration file
DateStyle                        | ISO, DMY | client
default_text_search_config       | pg_catalog.french | configuration file
effective_cache_size             | 96GB | configuration file
effective_io_concurrency         | 200 | database
extra_float_digits               | 3 | session
lc_messages                      | C | configuration file
lc_monetary                      | fr_FR.UTF8 | configuration file
lc_numeric                       | fr_FR.UTF8 | configuration file
lc_time                          | fr_FR.UTF8 | configuration file
listen_addresses                 | * | configuration file
log_autovacuum_min_duration      | 0 | configuration file
log_checkpoints                  | on | configuration file
log_connections                  | on | configuration file
log_disconnections               | off | configuration file
log_file_mode                    | 0640 | configuration file
log_line_prefix                  | %t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h | configuration file
log_lock_waits                   | on | configuration file
log_min_duration_statement       | 5s | configuration file
log_min_error_statement          | warning | configuration file
log_statement                    | ddl | configuration file
log_temp_files                   | 0 | configuration file
log_timezone                     | Europe/Paris | configuration file
logging_collector                | on | configuration file
maintenance_work_mem             | 4GB | configuration file
max_connections                  | 500 | configuration file
max_locks_per_transaction        | 1024 | configuration file
max_parallel_workers_per_gather  | 4 | configuration file
max_stack_depth                  | 2MB | environment variable
max_wal_size                     | 4GB | configuration file
min_wal_size                     | 128MB | configuration file
password_encryption              | scram-sha-256 | configuration file
pg_stat_statements.max           | 15000 | configuration file
pg_stat_statements.save          | off | configuration file
pg_stat_statements.track         | all | configuration file
pg_stat_statements.track_utility | off | configuration file
port                             | 5433 | configuration file
random_page_cost                 | 1.1 | database
restore_command                  | /home/postgres/admin/bin/pgbackrest --config=/etc/pgbackrest.conf --pg1-path=/home/postgres/data/irdbr010/systeme --stanza=rdb_backup archive-get %f "%p" | configuration file
search_path                      | public, public, temporaire, dtm_2019 | session
shared_buffers                   | 32GB | configuration file
ssl                              | on | configuration file
statement_timeout                | 0 | user
tcp_keepalives_count             | 10 | configuration file
tcp_keepalives_idle              | 900 | configuration file
tcp_keepalives_interval          | 75 | configuration file
TimeZone                         | Europe/Paris | client
unix_socket_group                | postgres | configuration file
unix_socket_permissions          | 0700 | configuration file
wal_buffers                      | 16MB | configuration file
work_mem                         | 64MB | configuration file



DDL for one of my tables:


-- public."F_TDOJ_HIST_1" definition

 

-- Drop table

 

-- DROP TABLE public."F_TDOJ_HIST_1";

 

CREATE TABLE public."F_TDOJ_HIST_1" (

       "ID_TRAIN" int4 NOT NULL,

       "ID_JOUR" int4 NOT NULL,

       "ID_OD" int4 NOT NULL,

       "JX" int4 NOT NULL,

       "RES" int4 NULL,

       "REV" float8 NULL,

       "OFFRE" int4 NULL,

       "CC_OUV" int4 NULL,

       "GENV_NUM" int8 NULL,

       "GENV_DEN" int8 NULL,

       "GENR_NUM" int8 NULL,

       "GENR_DEN" int8 NULL,

       "GENH_NUM" int8 NULL,

       "GENH_DEN" int8 NULL,

       "RES_CC0" int4 NULL,

       "RES_CC1" int4 NULL,

       "RES_CC2" int4 NULL,

       "RES_CC3" int4 NULL,

       "RES_CC4" int4 NULL,

       "RES_CC5" int4 NULL,

       "RES_CC6" int4 NULL,

       "RES_CC7" int4 NULL,

       "RES_CC8" int4 NULL,

       "RES_CC9" int4 NULL,

       "RES_CC10" int4 NULL,

       "RES_CC11" int4 NULL,

       "RES_CC12" int4 NULL,

       "RES_CC13" int4 NULL,

       "RES_CC14" int4 NULL,

       "RES_CC15" int4 NULL,

       "RES_CC16" int4 NULL,

       "RES_CC17" int4 NULL,

       "RES_CC18" int4 NULL,

       "RES_CC19" int4 NULL,

       "RES_CC20" int4 NULL,

       "AUT_CC0" int4 NULL,

       "AUT_CC1" int4 NULL,

       "AUT_CC2" int4 NULL,

       "AUT_CC3" int4 NULL,

       "AUT_CC4" int4 NULL,

       "AUT_CC5" int4 NULL,

       "AUT_CC6" int4 NULL,

       "AUT_CC7" int4 NULL,

       "AUT_CC8" int4 NULL,

       "AUT_CC9" int4 NULL,

       "AUT_CC10" int4 NULL,

       "AUT_CC11" int4 NULL,

       "AUT_CC12" int4 NULL,

       "AUT_CC13" int4 NULL,

       "AUT_CC14" int4 NULL,

       "AUT_CC15" int4 NULL,

       "AUT_CC16" int4 NULL,

       "AUT_CC17" int4 NULL,

       "AUT_CC18" int4 NULL,

       "AUT_CC19" int4 NULL,

       "AUT_CC20" int4 NULL,

       "DSP_CC0" int4 NULL,

       "DSP_CC1" int4 NULL,

       "DSP_CC2" int4 NULL,

       "DSP_CC3" int4 NULL,

       "DSP_CC4" int4 NULL,

       "DSP_CC5" int4 NULL,

       "DSP_CC6" int4 NULL,

       "DSP_CC7" int4 NULL,

       "DSP_CC8" int4 NULL,

       "DSP_CC9" int4 NULL,

       "DSP_CC10" int4 NULL,

       "DSP_CC11" int4 NULL,

       "DSP_CC12" int4 NULL,

       "DSP_CC13" int4 NULL,

       "DSP_CC14" int4 NULL,

       "DSP_CC15" int4 NULL,

       "DSP_CC16" int4 NULL,

       "DSP_CC17" int4 NULL,

       "DSP_CC18" int4 NULL,

       "DSP_CC19" int4 NULL,

       "DSP_CC20" int4 NULL,

       "REV_CC0" float8 NULL,

       "REV_CC1" float8 NULL,

       "REV_CC2" float8 NULL,

       "REV_CC3" float8 NULL,

       "REV_CC4" float8 NULL,

       "REV_CC5" float8 NULL,

       "REV_CC6" float8 NULL,

       "REV_CC7" float8 NULL,

       "REV_CC8" float8 NULL,

       "REV_CC9" float8 NULL,

       "REV_CC10" float8 NULL,

       "REV_CC11" float8 NULL,

       "REV_CC12" float8 NULL,

       "REV_CC13" float8 NULL,

       "REV_CC14" float8 NULL,

       "REV_CC15" float8 NULL,

       "REV_CC16" float8 NULL,

       "REV_CC17" float8 NULL,

       "REV_CC18" float8 NULL,

       "REV_CC19" float8 NULL,

       "REV_CC20" float8 NULL,

       "RES_CHD" int4 NULL,

       "REV_CHD" float8 NULL,

       "RES_PRO" int4 NULL,

       "REV_PRO" float8 NULL,

       "RES_SOC" int4 NULL,

       "REV_SOC" float8 NULL,

       "RES_TMX" int4 NULL,

       "REV_TMX" float8 NULL,

       "RES_MAX" int4 NULL,

       "RES_GRP" int4 NULL,

       "REV_GRP" float8 NULL,

       "PREV_RES" int4 NULL,

       "PREV_REV" float8 NULL,

       "OPTIM_RES" int4 NULL,

       "OPTIM_REV" float8 NULL,

       "RES_FFX" int4 NULL,

       "REV_FFX" float8 NULL,

       "RES_SFE" int4 NULL,

       "REV_SFE" float8 NULL,

       "RES_SFN" int4 NULL,

       "REV_SFN" float8 NULL,

       "RES_NFE" int4 NULL,

       "REV_NFE" float8 NULL,

       "RES_NFN" int4 NULL,

       "REV_NFN" float8 NULL,

       "RES_ABO" int4 NULL,

       "REV_ABO" float8 NULL,

       "RES_AGN" int4 NULL,

       "REV_AGN" float8 NULL,

       "RES_BPR" int4 NULL,

       "REV_BPR" float8 NULL,

       "RES_LIB" int4 NULL,

       "REV_LIB" float8 NULL,

       "RES_FFN" int4 NULL,

       "REV_FFN" float8 NULL,

       "RES_PRI" int4 NULL,

       "REV_PRI" float8 NULL,

       CONSTRAINT "F_TDOJ_HIST_1_OLDP_pkey" PRIMARY KEY ("ID_TRAIN", "ID_JOUR", "ID_OD", "JX")

)

PARTITION BY RANGE ("ID_JOUR");

CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR", "JX");

CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR");

CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_OD_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_OD");

CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_TRAIN_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_TRAIN");

CREATE INDEX "F_TDOJ_HIST_1_OLDP_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("JX");

 

 

-- public."F_TDOJ_HIST_1" foreign keys

 

ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_JOUR_fkey" FOREIGN KEY ("ID_JOUR") REFERENCES public."D_JOUR"("ID_JOUR");

ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_OD_fkey" FOREIGN KEY ("ID_OD") REFERENCES public."D_OD"("ID_OD");

ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_TRAIN_fkey" FOREIGN KEY ("ID_TRAIN") REFERENCES public."D_TRAIN"("ID_TRAIN");

ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_JX_pkey" FOREIGN KEY ("JX") REFERENCES public."D_JX"("JX");

 

 

 

Result of:

SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid)
FROM pg_class
WHERE relname IN (
    'F_TDLJ_HIST_1', 'F_TDLJ_HIST_2', 'F_TDLJ_HIST', 'F_TDOJ_HIST_1', 'F_TDOJ_HIST_2', 'F_TDOJ_HIST'
);



relpages|reltuples|relname      |relallvisible|relkind|relnatts|relhassubclass|reloptions|pg_table_size|
--------+---------+-------------+-------------+-------+--------+--------------+----------+-------------+
       0|      0.0|F_TDLJ_HIST  |            0|p      |      47|true          |NULL      |            0|
   95806| 442969.0|F_TDLJ_HIST  |        95806|r      |      47|false         |NULL      |    785080320|
       0|      0.0|F_TDLJ_HIST_1|            0|p      |     129|true          |NULL      |            0|
  197458| 730226.0|F_TDLJ_HIST_1|       157954|r      |     129|false         |NULL      |   1618059264|
       0|      0.0|F_TDLJ_HIST_2|            0|p      |     159|true          |NULL      |            0|
  278359| 441524.0|F_TDLJ_HIST_2|       278359|r      |     159|false         |NULL      |   2280972288|
       0|      0.0|F_TDOJ_HIST  |            0|p      |     104|true          |NULL      |            0|
  311913|1424975.0|F_TDOJ_HIST  |       311913|r      |      56|false         |NULL      |   2555928576|
       0|      0.0|F_TDOJ_HIST_1|            0|p      |     135|true          |NULL      |            0|
  682522|1241940.0|F_TDOJ_HIST_1|       682522|r      |     135|false         |NULL      |   5592793088|
       0|      0.0|F_TDOJ_HIST_2|            0|p      |     163|true          |NULL      |            0|
  661324|1397598.0|F_TDOJ_HIST_2|       661324|r      |     163|false         |NULL      |   5419098112|



Re: How to solve my slow disk i/o throughput during index scan

From
Andrei Lepikhov
Date:
On 4/7/2024 20:25, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE 
TGV / DM RMP YIELD MANAGEMENT) wrote:
> *My question is: what can I change to get better index read speed?*
> 
> What I already tried:
> 
>   * Setting random_page_cost to a prohibitive value (10000000) to force a
>     bitmap heap scan, because those can be parallelized. This has not
>     worked; the optimizer still does an index scan on my fact table.
>   * Changing effective_io_concurrency, max_parallel_workers_per_gather,
>     and work_mem to much higher values.
I'm not sure the case is only about the speed of the index scan. Just look at the slow index clause:
fth."ID_TRAIN" = ANY ('{17855,13945,536795,18838,18837,13574 ...
and many more values.
An IndexScan needs to probe the index for each of these values, and for each value it goes through the pages to check the other conditions.
We have already discussed some optimisations related to this case in a couple of pgsql-hackers threads, but I'm not sure we have a quick solution right now.
If you want to use a BitmapScan (which might be reasonable to try here), you need to split the huge ANY (...) clause into a sequence of ORs.
Also, maybe a parallel append could help here? You can change the corresponding startup and tuple costs to force such a plan.
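
For illustration, that rewrite might look like this (a sketch; the value list is shortened to the IDs above):

-- one huge array in a single index clause:
SELECT * FROM public."F_TDOJ_HIST_1" fth
WHERE fth."ID_TRAIN" = ANY ('{17855,13945,536795}'::int4[]);

-- the same filter split into OR'ed branches, which the planner can
-- satisfy with a BitmapOr over several bitmap index scans:
SELECT * FROM public."F_TDOJ_HIST_1" fth
WHERE fth."ID_TRAIN" = ANY ('{17855,13945}'::int4[])
   OR fth."ID_TRAIN" = ANY ('{536795}'::int4[]);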

-- 
regards, Andrei Lepikhov




RE: How to solve my slow disk i/o throughput during index scan

From
"FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)"
Date:

Hello,

Thank you! Splitting the query with “OR” definitely enables bitmap heap scans, and thus parallelized reads from disk 😃! I did not understand your second point, though: what is parallel append, and how do I enable it?

 

Simon F.

 



Re: How to solve my slow disk i/o throughput during index scan

From
Andrei Lepikhov
Date:
On 7/4/24 22:23, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE 
TGV / DM RMP YIELD MANAGEMENT) wrote:
> Hello,
> 
> Thank you! Splitting the query with “OR” definitely enables bitmap heap
> scans, and thus parallelized reads from disk 😃! I did not understand your
> second point, though: what is parallel append, and how do I enable it?
Just for example:

DROP TABLE IF EXISTS t CASCADE;
CREATE TABLE t (id int not null, payload text) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF t FOR VALUES FROM (0) TO (1000);
CREATE TABLE p2 PARTITION OF t FOR VALUES FROM (1000) TO (2000);
CREATE TABLE p3 PARTITION OF t FOR VALUES FROM (2000) TO (3000);
CREATE TABLE p4 PARTITION OF t FOR VALUES FROM (3000) TO (4000);
INSERT INTO t SELECT x % 4000, repeat('a',128) || x FROM 
generate_series(1,1E5) AS x;
ANALYZE t;

SET enable_parallel_append = on;
SET parallel_setup_cost = 0.00001;
SET parallel_tuple_cost = 0.00001;
SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;

EXPLAIN (COSTS OFF)
SELECT t.id, t.payload FROM t WHERE t.id % 2 = 0
GROUP BY t.id, t.payload;

  Group
    Group Key: t.id, t.payload
    ->  Gather Merge
          Workers Planned: 6
          ->  Sort
                Sort Key: t.id, t.payload
                ->  Parallel Append
                      ->  Parallel Seq Scan on p1 t_1
                            Filter: ((id % 2) = 0)
                      ->  Parallel Seq Scan on p2 t_2
                            Filter: ((id % 2) = 0)
                      ->  Parallel Seq Scan on p3 t_3
                            Filter: ((id % 2) = 0)
                      ->  Parallel Seq Scan on p4 t_4
                            Filter: ((id % 2) = 0)

Here the table is scanned in parallel. It also works with IndexScan.
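
Applied to the fact table from the original message, the same cost tweaks might produce a Parallel Append over per-partition scans; a sketch (the near-zero costs are for demonstration only, as in the example above, and the ID list is a placeholder):

SET enable_parallel_append = on;
SET parallel_setup_cost = 0.00001;
SET parallel_tuple_cost = 0.00001;
EXPLAIN (COSTS OFF)
SELECT * FROM public."F_TDOJ_HIST_1"
WHERE "ID_TRAIN" = ANY ('{17855,13945}'::int4[]);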

-- 
regards, Andrei Lepikhov




RE: How to solve my slow disk i/o throughput during index scan

From
"FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)"
Date:


Hello, and thank you again for your example!

Sorry for my late answer; I was working on a patch for our queries. I still do not completely understand what is happening, though. Here is the plan of a query where I split the calls with OR as you suggested, which seems to have enabled parallel scans.

https://explain.dalibo.com/plan/gfa1cf9fffd01bcg#plan/node/1

But I still wonder: why was my query that slow? My current understanding of what happened is:

 

  • When PostgreSQL does an Index Scan, it goes through a (non-parallel) loop of requesting a chunk of data and then processing it. It waits until it has processed the data before requesting the next chunk, instead of loading the whole index into RAM (which, I suppose, would be much faster, but is also not feasible if the index is too big and the RAM too small, so PostgreSQL does not do it). Hence the 2 MB/s.
  • When it does a Bitmap Index Scan, it can parallelize disk interactions and does not use the processor to discard rows, hence much faster index loading and processing (see the sketch after this list).
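
If that model is right, the prefetching in the second bullet is what effective_io_concurrency governs: it only applies to bitmap heap scans, which prefetch heap pages asynchronously, while a plain index scan issues one synchronous read at a time. A quick sensitivity test (a sketch; the query placeholders stand for one of the bitmap-scan queries above):

SET effective_io_concurrency = 1;    -- little to no prefetching
-- EXPLAIN (ANALYZE, BUFFERS) <bitmap-scan query>;  note the I/O timings
SET effective_io_concurrency = 200;  -- aggressive prefetching
-- EXPLAIN (ANALYZE, BUFFERS) <bitmap-scan query>;  run again and compare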

 

I might be completely wrong, and I would really like to understand the details, in order to explain them to my team and to others who might encounter the same problem.

Again, thank you very much for your help; we were really struggling with those slow queries!

Simon FREYBURGER

 



RE: How to solve my slow disk i/o throughput during index scan

From
"FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)"
Date:

Also, it might not be related, but I see suspiciously similar slow reads when I am inserting into the database. Could it be related?

See e.g.: https://explain.dalibo.com/plan/43d37de5870e1651

The table I am inserting into looks like:

-- public."F_TDOJ_SC_HIST_2" definition

 

-- Drop table

 

-- DROP TABLE public."F_TDOJ_SC_HIST_2";

 

CREATE TABLE public."F_TDOJ_SC_HIST_2" (

       "ID_TRAIN" int4 NOT NULL,

       "ID_JOUR" int4 NOT NULL,

       "ID_OD" int4 NOT NULL,

       "JX" int4 NOT NULL,

       "RES" int4 NULL,

       "REV" float4 NULL,

       "RES_SC1" int4 NULL,

       "RES_SC2" int4 NULL,

       "RES_SC3" int4 NULL,

       "RES_SC4" int4 NULL,

       "RES_SC5" int4 NULL,

       "RES_SC6" int4 NULL,

       "RES_SC7" int4 NULL,

       "RES_SC8" int4 NULL,

       "RES_SC9" int4 NULL,

       "RES_SC10" int4 NULL,

       "RES_SC11" int4 NULL,

       "RES_SC12" int4 NULL,

       "RES_SC13" int4 NULL,

       "RES_SC14" int4 NULL,

       "RES_SC15" int4 NULL,

       "RES_SC16" int4 NULL,

       "RES_SC17" int4 NULL,

       "RES_SC18" int4 NULL,

       "AUT_SC1" int4 NULL,

       "AUT_SC2" int4 NULL,

       "AUT_SC3" int4 NULL,

       "AUT_SC4" int4 NULL,

       "AUT_SC5" int4 NULL,

       "AUT_SC6" int4 NULL,

       "AUT_SC7" int4 NULL,

       "AUT_SC8" int4 NULL,

       "AUT_SC9" int4 NULL,

       "AUT_SC10" int4 NULL,

       "AUT_SC11" int4 NULL,

       "AUT_SC12" int4 NULL,

       "AUT_SC13" int4 NULL,

       "AUT_SC14" int4 NULL,

       "AUT_SC15" int4 NULL,

       "AUT_SC16" int4 NULL,

       "AUT_SC17" int4 NULL,

       "AUT_SC18" int4 NULL,

       "DSP_SC1" int4 NULL,

       "DSP_SC2" int4 NULL,

       "DSP_SC3" int4 NULL,

       "DSP_SC4" int4 NULL,

       "DSP_SC5" int4 NULL,

       "DSP_SC6" int4 NULL,

       "DSP_SC7" int4 NULL,

       "DSP_SC8" int4 NULL,

       "DSP_SC9" int4 NULL,

       "DSP_SC10" int4 NULL,

       "DSP_SC11" int4 NULL,

       "DSP_SC12" int4 NULL,

       "DSP_SC13" int4 NULL,

       "DSP_SC14" int4 NULL,

       "DSP_SC15" int4 NULL,

       "DSP_SC16" int4 NULL,

       "DSP_SC17" int4 NULL,

       "DSP_SC18" int4 NULL,

       "REV_SC1" float4 NULL,

       "REV_SC2" float4 NULL,

       "REV_SC3" float4 NULL,

       "REV_SC4" float4 NULL,

       "REV_SC5" float4 NULL,

       "REV_SC6" float4 NULL,

       "REV_SC7" float4 NULL,

       "REV_SC8" float4 NULL,

       "REV_SC9" float4 NULL,

       "REV_SC10" float4 NULL,

       "REV_SC11" float4 NULL,

       "REV_SC12" float4 NULL,

       "REV_SC13" float4 NULL,

       "REV_SC14" float4 NULL,

       "REV_SC15" float4 NULL,

       "REV_SC16" float4 NULL,

       "REV_SC17" float4 NULL,

       "REV_SC18" float4 NULL,

       CONSTRAINT "F_TDOJ_SC_HIST_2_pkey" PRIMARY KEY ("ID_TRAIN","ID_JOUR","ID_OD","JX")

)

PARTITION BY RANGE ("ID_JOUR");

 

 

-- public."F_TDOJ_SC_HIST_2" foreign keys

 

ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_JOUR_fkey" FOREIGN KEY ("ID_JOUR") REFERENCES public."D_JOUR"("ID_JOUR");

ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_OD_fkey" FOREIGN KEY ("ID_OD") REFERENCES public."D_OD"("ID_OD");

ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_TRAIN_fkey" FOREIGN KEY ("ID_TRAIN") REFERENCES public."D_TRAIN"("ID_TRAIN");

ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_JX_fkey" FOREIGN KEY ("JX") REFERENCES public."D_JX"("JX");

I’m using a 3-step process to insert rows into the table (a sketch follows the list):

  • COPY into a temporary table
  • DELETE FROM on the perimeter I will be inserting into
  • INSERT INTO mytable SELECT … FROM temporarytable ON CONFLICT DO NOTHING
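
A minimal sketch of that pipeline (the file path, staging-table name, and the choice of "ID_JOUR" as the perimeter column are assumptions for illustration):

BEGIN;
CREATE TEMPORARY TABLE staging (LIKE public."F_TDOJ_SC_HIST_2") ON COMMIT DROP;
COPY staging FROM '/tmp/batch.csv' WITH (FORMAT csv);
-- clear the perimeter being reloaded
DELETE FROM public."F_TDOJ_SC_HIST_2" t
USING (SELECT DISTINCT "ID_JOUR" FROM staging) s
WHERE t."ID_JOUR" = s."ID_JOUR";
INSERT INTO public."F_TDOJ_SC_HIST_2"
SELECT * FROM staging
ON CONFLICT DO NOTHING;
COMMIT;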

 

Is it possible to parallelize the scans during the modify step?

 

Regards

Simon FREYBURGER

 



Re: How to solve my slow disk i/o throughput during index scan

From
"David G. Johnston"
Date:
On Thursday, July 11, 2024, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) <simon.freyburger@sncf.fr> wrote:

Also, it might not be related, but I see suspiciously similar slow reads when I am inserting into the database. Could it be related?

I’m using a 3-step process to insert rows into the table:

  • COPY into a temporary table
  • DELETE FROM on the perimeter I will be inserting into
  • INSERT INTO mytable SELECT … FROM temporarytable ON CONFLICT DO NOTHING

Is it possible to parallelize the scans during the modify step?



This tells you when parallelism is used: https://www.postgresql.org/docs/current/when-can-parallel-query-be-used.html


David J.

Re: How to solve my slow disk i/o throughput during index scan

From
Andrei Lepikhov
Date:
On 11/7/2024 22:09, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION 
GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> Is it possible to parallelize the scans during the modify step ?
Temporary tables can't be used inside a query with parallel workers involved, because such a table is local to a single process.

As for your question - I'm not sure without the whole bunch of data. But you get the maximum speedup by disabling as many constraints as possible - ideally, fill each partition individually, with no constraints or indexes at all, before uniting them into one partitioned table (a sketch follows).
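
A minimal sketch of that loading pattern (the partition name, file path, and bound values are assumptions for illustration):

-- 1. Load a standalone table with no indexes or constraints:
CREATE TABLE "F_TDOJ_SC_HIST_2_P1" (LIKE public."F_TDOJ_SC_HIST_2");
COPY "F_TDOJ_SC_HIST_2_P1" FROM '/tmp/p1.csv' WITH (FORMAT csv);

-- 2. Index it once it is full:
ALTER TABLE "F_TDOJ_SC_HIST_2_P1"
  ADD PRIMARY KEY ("ID_TRAIN", "ID_JOUR", "ID_OD", "JX");

-- 3. Attach it as a partition (a CHECK constraint matching the bound
--    lets PostgreSQL skip the validation scan on attach):
ALTER TABLE "F_TDOJ_SC_HIST_2_P1"
  ADD CONSTRAINT p1_bound CHECK ("ID_JOUR" >= 1 AND "ID_JOUR" < 1000);
ALTER TABLE public."F_TDOJ_SC_HIST_2"
  ATTACH PARTITION "F_TDOJ_SC_HIST_2_P1" FOR VALUES FROM (1) TO (1000);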

-- 
regards, Andrei Lepikhov




Re: How to solve my slow disk i/o throughput during index scan

From
Andrei Lepikhov
Date:
On 11/7/2024 21:59, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION 
GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> 
> Hello, and thank you again for your example!
> 
> Sorry for my late answer; I was working on a patch for our queries. I
> still do not completely understand what is happening, though. Here is
> the plan of a query where I split the calls with OR as you suggested,
> which seems to have enabled parallel scans.
Thanks for the feedback!
Generally, I don't understand why you needed to transform ANY -> OR at all to get a BitmapScan. Can you just disable IndexScan (and possibly SeqScan) to see whether it is a hard transformation limit or a mistake in cost estimation?
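
In session terms, that experiment would look like this (a sketch; the ANY list is abbreviated):

SET enable_indexscan = off;   -- take plain index scans off the table
SET enable_seqscan = off;     -- optionally rule out sequential scans too
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM public."F_TDOJ_HIST_1" fth
WHERE fth."ID_TRAIN" = ANY ('{17855,13945,536795}'::int4[]);
RESET enable_indexscan;
RESET enable_seqscan;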

-- 
regards, Andrei Lepikhov