Thread: How to solve my slow disk i/o throughput during index scan
Hello everyone,
I have a database too large to fit entirely in RAM, with indexes that also exceed RAM capacity. Most of my big fact tables follow the same schema:
- 3-5 ID columns.
- Partitioned along one ID (ID_JOUR).
- 10-100 GB partitions.
- 1-5 GB primary-key index for each partition.
- 10-120 columns, all of them int or float8. Some columns contain a lot of NULLs, because they were not all created and added to the daily processing on the same date.
- Queries retrieve a small subset (100k-1M rows) from the database, always filtering on the primary key (there is a WHERE ... filter on each ID column; some filters go through dimension tables + joins, but I tried applying those directly and it did not solve my problem).
- See the annex for the DDL of one of these tables.
- See the annex for the sizes of my different tables.
- Stable data (I am focusing on past data; rare to no insertions; VACUUM + CLUSTER done after those rare modifications).
- REINDEX, VACUUM, CLUSTER and ANALYZE have been run on those tables.
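For reference, the maintenance pass described above might look like the following on one partition (the partition and index names here are hypothetical, not taken from the actual schema):

```sql
-- Hypothetical partition name; run after the rare batch modifications.
VACUUM (VERBOSE, ANALYZE) "F_TDOJ_HIST_1_2023";
-- CLUSTER rewrites the partition in primary-key order:
CLUSTER "F_TDOJ_HIST_1_2023" USING "F_TDOJ_HIST_1_2023_pkey";
REINDEX TABLE "F_TDOJ_HIST_1_2023";
ANALYZE "F_TDOJ_HIST_1_2023";
```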
When performing queries, I observe significant differences in processing time depending on whether the index has to be read from disk or is already loaded in RAM. Using EXPLAIN ANALYZE, I believe I have confirmed that the issue stems from index scans. See for example:
- https://explain.dalibo.com/plan/2c85077gagh98a17: very slow, because part of the index is "read" (from disk) rather than "hit".
- https://explain.dalibo.com/plan/gfd20f8cadaa5261#plan/node/8: 1M rows retrieved "instantaneously" (2 s) when the index is in RAM.
- https://explain.dalibo.com/plan/1b394hc5a26cf747: same kind of query, with track_io_timing enabled.
I measured the speed at which my index is loaded into RAM during a query: approximately 2 to 3 MB/s. However, my infrastructure theoretically supports I/O throughput of around 900 MB/s.
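For anyone reproducing this measurement: with track_io_timing enabled, EXPLAIN (ANALYZE, BUFFERS) reports both the blocks read from disk and the time spent reading them. The query below is only an illustrative shape; the filter values are hypothetical:

```sql
SET track_io_timing = on;

-- Illustrative query shape; real filters come via dimension tables.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   "F_TDOJ_HIST_1"
WHERE  "ID_JOUR" BETWEEN 200 AND 230
  AND  "ID_TRAIN" = ANY ('{17855,13945}'::int4[]);

-- In the output, "Buffers: shared read=N" and "I/O Timings: read=T" (ms)
-- give throughput ~ N * 8 kB / T.
-- e.g. 25000 blocks read in 100 s of I/O time -> ~2 MB/s.
```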
On some older partitions, I was sometimes able to get better throughput (see e.g. https://explain.dalibo.com/plan/4db409d1d6d95d4b).
I do not understand why reading my index from disk is so slow. I suspect the index is not read sequentially, but I do not know how PostgreSQL internals really behave, so this is just a supposition.
My question is: what can I change to get a better index reading speed?
What I already tried :
- Setting random_page_cost to a prohibitive value (10000000) to force a bitmap heap scan, because those can be parallelized. This did not work; the optimizer still performs an index scan on my fact table.
- Changing effective_io_concurrency, max_parallel_workers_per_gather and work_mem to much higher values.
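For completeness, these experiments amount to session-level settings along the following lines (the exact values other than random_page_cost are illustrative):

```sql
-- Session-level experiments described above:
SET random_page_cost = 10000000;          -- try to make index scans unattractive
SET effective_io_concurrency = 200;       -- allow more concurrent prefetching
SET max_parallel_workers_per_gather = 8;  -- allow more parallel workers
SET work_mem = '4GB';                     -- give sorts/bitmaps more memory
```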
Thank you in advance for your help; any idea or advice is greatly appreciated!
Simon F.
ANNEX :
Some more details about my environment :
- I am working on Azure. My VM is an E16s_v3 (see https://learn.microsoft.com/en-us/azure/virtual-machines/ev3-esv3-series) with P80 disks (https://azure.microsoft.com/en-us/pricing/details/managed-disks/).
- I was unable to run the hardware speed test, because I do not have sudo rights, but through VACUUM execution, the
- Postgresql version : PostgreSQL 12.15 (Ubuntu 12.15-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
- OS version :
==> /etc/lsb-release <==
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"
==> /etc/os-release <==
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04 LTS"
VERSION_ID="20.04"
HOME_URL=https://www.ubuntu.com/
SUPPORT_URL=https://help.ubuntu.com/
BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
==> /etc/SNCF_release <==
HOSTNAME : uzcordbr05
OS_NAME : UBUNTU 20.04
OS_DESCRIPTION : Ubuntu 20.04 LTS
OS_RELEASE : 20.04
OS_CODENAME : focal
CMDB_ENV : Recette
AENV : hprod
REPO_IT : https://repos.it.sncf.fr/repos/os/ubuntu/bugfix/dists/focal-bugfix/Release
- Default query settings:
- effective_cache_size = '96GB',
- effective_io_concurrency = '200',
- max_parallel_workers_per_gather = '4',
- random_page_cost = '1.1',
- search_path = 'public',
- work_mem = '64MB' (I tried changing work_mem to 4GB; it did not change anything.)
- Postgres custom configuration settings
name |current_setting |source |
--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
application_name |DBeaver 21.2.5 - SQLEditor <Script-19.sql> |session |
archive_command |(disabled) |configuration file |
archive_mode |off |configuration file |
archive_timeout |2h |configuration file |
autovacuum_analyze_scale_factor |0.05 |configuration file |
autovacuum_analyze_threshold |50 |configuration file |
autovacuum_max_workers |6 |configuration file |
autovacuum_naptime |15s |configuration file |
autovacuum_vacuum_cost_delay |10ms |configuration file |
autovacuum_vacuum_cost_limit |-1 |configuration file |
autovacuum_vacuum_scale_factor |0.01 |configuration file |
autovacuum_vacuum_threshold |50 |configuration file |
autovacuum_work_mem |512MB |configuration file |
checkpoint_completion_target |0.9 |configuration file |
client_encoding |UTF8 |client |
cluster_name |irdbr010 |configuration file |
DateStyle |ISO, DMY |client |
default_text_search_config |pg_catalog.french |configuration file |
effective_cache_size |96GB |configuration file |
effective_io_concurrency |200 |database |
extra_float_digits |3 |session |
lc_messages |C |configuration file |
lc_monetary |fr_FR.UTF8 |configuration file |
lc_numeric |fr_FR.UTF8 |configuration file |
lc_time |fr_FR.UTF8 |configuration file |
listen_addresses |* |configuration file |
log_autovacuum_min_duration |0 |configuration file |
log_checkpoints |on |configuration file |
log_connections |on |configuration file |
log_disconnections |off |configuration file |
log_file_mode |0640 |configuration file |
log_line_prefix |%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h |configuration file |
log_lock_waits |on |configuration file |
log_min_duration_statement |5s |configuration file |
log_min_error_statement |warning |configuration file |
log_statement |ddl |configuration file |
log_temp_files |0 |configuration file |
log_timezone |Europe/Paris |configuration file |
logging_collector |on |configuration file |
maintenance_work_mem |4GB |configuration file |
max_connections |500 |configuration file |
max_locks_per_transaction |1024 |configuration file |
max_parallel_workers_per_gather |4 |configuration file |
max_stack_depth |2MB |environment variable|
max_wal_size |4GB |configuration file |
min_wal_size |128MB |configuration file |
password_encryption |scram-sha-256 |configuration file |
pg_stat_statements.max |15000 |configuration file |
pg_stat_statements.save |off |configuration file |
pg_stat_statements.track |all |configuration file |
pg_stat_statements.track_utility|off |configuration file |
port |5433 |configuration file |
random_page_cost |1.1 |database |
restore_command |/home/postgres/admin/bin/pgbackrest --config=/etc/pgbackrest.conf --pg1-path=/home/postgres/data/irdbr010/systeme --stanza=rdb_backup archive-get %f "%p"|configuration file |
search_path |public, public, temporaire, dtm_2019 |session |
shared_buffers |32GB |configuration file |
ssl |on |configuration file |
statement_timeout |0 |user |
tcp_keepalives_count |10 |configuration file |
tcp_keepalives_idle |900 |configuration file |
tcp_keepalives_interval |75 |configuration file |
TimeZone |Europe/Paris |client |
unix_socket_group |postgres |configuration file |
unix_socket_permissions |0700 |configuration file |
wal_buffers |16MB |configuration file |
work_mem |64MB |configuration file |
DDL for one of my tables:
-- public."F_TDOJ_HIST_1" definition
-- Drop table
-- DROP TABLE public."F_TDOJ_HIST_1";
CREATE TABLE public."F_TDOJ_HIST_1" (
"ID_TRAIN" int4 NOT NULL,
"ID_JOUR" int4 NOT NULL,
"ID_OD" int4 NOT NULL,
"JX" int4 NOT NULL,
"RES" int4 NULL,
"REV" float8 NULL,
"OFFRE" int4 NULL,
"CC_OUV" int4 NULL,
"GENV_NUM" int8 NULL,
"GENV_DEN" int8 NULL,
"GENR_NUM" int8 NULL,
"GENR_DEN" int8 NULL,
"GENH_NUM" int8 NULL,
"GENH_DEN" int8 NULL,
"RES_CC0" int4 NULL,
"RES_CC1" int4 NULL,
"RES_CC2" int4 NULL,
"RES_CC3" int4 NULL,
"RES_CC4" int4 NULL,
"RES_CC5" int4 NULL,
"RES_CC6" int4 NULL,
"RES_CC7" int4 NULL,
"RES_CC8" int4 NULL,
"RES_CC9" int4 NULL,
"RES_CC10" int4 NULL,
"RES_CC11" int4 NULL,
"RES_CC12" int4 NULL,
"RES_CC13" int4 NULL,
"RES_CC14" int4 NULL,
"RES_CC15" int4 NULL,
"RES_CC16" int4 NULL,
"RES_CC17" int4 NULL,
"RES_CC18" int4 NULL,
"RES_CC19" int4 NULL,
"RES_CC20" int4 NULL,
"AUT_CC0" int4 NULL,
"AUT_CC1" int4 NULL,
"AUT_CC2" int4 NULL,
"AUT_CC3" int4 NULL,
"AUT_CC4" int4 NULL,
"AUT_CC5" int4 NULL,
"AUT_CC6" int4 NULL,
"AUT_CC7" int4 NULL,
"AUT_CC8" int4 NULL,
"AUT_CC9" int4 NULL,
"AUT_CC10" int4 NULL,
"AUT_CC11" int4 NULL,
"AUT_CC12" int4 NULL,
"AUT_CC13" int4 NULL,
"AUT_CC14" int4 NULL,
"AUT_CC15" int4 NULL,
"AUT_CC16" int4 NULL,
"AUT_CC17" int4 NULL,
"AUT_CC18" int4 NULL,
"AUT_CC19" int4 NULL,
"AUT_CC20" int4 NULL,
"DSP_CC0" int4 NULL,
"DSP_CC1" int4 NULL,
"DSP_CC2" int4 NULL,
"DSP_CC3" int4 NULL,
"DSP_CC4" int4 NULL,
"DSP_CC5" int4 NULL,
"DSP_CC6" int4 NULL,
"DSP_CC7" int4 NULL,
"DSP_CC8" int4 NULL,
"DSP_CC9" int4 NULL,
"DSP_CC10" int4 NULL,
"DSP_CC11" int4 NULL,
"DSP_CC12" int4 NULL,
"DSP_CC13" int4 NULL,
"DSP_CC14" int4 NULL,
"DSP_CC15" int4 NULL,
"DSP_CC16" int4 NULL,
"DSP_CC17" int4 NULL,
"DSP_CC18" int4 NULL,
"DSP_CC19" int4 NULL,
"DSP_CC20" int4 NULL,
"REV_CC0" float8 NULL,
"REV_CC1" float8 NULL,
"REV_CC2" float8 NULL,
"REV_CC3" float8 NULL,
"REV_CC4" float8 NULL,
"REV_CC5" float8 NULL,
"REV_CC6" float8 NULL,
"REV_CC7" float8 NULL,
"REV_CC8" float8 NULL,
"REV_CC9" float8 NULL,
"REV_CC10" float8 NULL,
"REV_CC11" float8 NULL,
"REV_CC12" float8 NULL,
"REV_CC13" float8 NULL,
"REV_CC14" float8 NULL,
"REV_CC15" float8 NULL,
"REV_CC16" float8 NULL,
"REV_CC17" float8 NULL,
"REV_CC18" float8 NULL,
"REV_CC19" float8 NULL,
"REV_CC20" float8 NULL,
"RES_CHD" int4 NULL,
"REV_CHD" float8 NULL,
"RES_PRO" int4 NULL,
"REV_PRO" float8 NULL,
"RES_SOC" int4 NULL,
"REV_SOC" float8 NULL,
"RES_TMX" int4 NULL,
"REV_TMX" float8 NULL,
"RES_MAX" int4 NULL,
"RES_GRP" int4 NULL,
"REV_GRP" float8 NULL,
"PREV_RES" int4 NULL,
"PREV_REV" float8 NULL,
"OPTIM_RES" int4 NULL,
"OPTIM_REV" float8 NULL,
"RES_FFX" int4 NULL,
"REV_FFX" float8 NULL,
"RES_SFE" int4 NULL,
"REV_SFE" float8 NULL,
"RES_SFN" int4 NULL,
"REV_SFN" float8 NULL,
"RES_NFE" int4 NULL,
"REV_NFE" float8 NULL,
"RES_NFN" int4 NULL,
"REV_NFN" float8 NULL,
"RES_ABO" int4 NULL,
"REV_ABO" float8 NULL,
"RES_AGN" int4 NULL,
"REV_AGN" float8 NULL,
"RES_BPR" int4 NULL,
"REV_BPR" float8 NULL,
"RES_LIB" int4 NULL,
"REV_LIB" float8 NULL,
"RES_FFN" int4 NULL,
"REV_FFN" float8 NULL,
"RES_PRI" int4 NULL,
"REV_PRI" float8 NULL,
CONSTRAINT "F_TDOJ_HIST_1_OLDP_pkey" PRIMARY KEY ("ID_TRAIN", "ID_JOUR", "ID_OD", "JX")
)
PARTITION BY RANGE ("ID_JOUR");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR", "JX");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_OD_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_OD");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_TRAIN_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_TRAIN");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("JX");
-- public."F_TDOJ_HIST_1" foreign keys
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_JOUR_fkey" FOREIGN KEY ("ID_JOUR") REFERENCES public."D_JOUR"("ID_JOUR");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_OD_fkey" FOREIGN KEY ("ID_OD") REFERENCES public."D_OD"("ID_OD");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_TRAIN_fkey" FOREIGN KEY ("ID_TRAIN") REFERENCES public."D_TRAIN"("ID_TRAIN");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_JX_pkey" FOREIGN KEY ("JX") REFERENCES public."D_JX"("JX");
Result of :
SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname in (
'F_TDLJ_HIST_1', 'F_TDLJ_HIST_2', 'F_TDLJ_HIST', 'F_TDOJ_HIST_1', 'F_TDOJ_HIST_2', 'F_TDOJ_HIST'
);
relpages|reltuples|relname |relallvisible|relkind|relnatts|relhassubclass|reloptions|pg_table_size|
--------+---------+-------------+-------------+-------+--------+--------------+----------+-------------+
0| 0.0|F_TDLJ_HIST | 0|p | 47|true |NULL | 0|
95806| 442969.0|F_TDLJ_HIST | 95806|r | 47|false |NULL | 785080320|
0| 0.0|F_TDLJ_HIST_1| 0|p | 129|true |NULL | 0|
197458| 730226.0|F_TDLJ_HIST_1| 157954|r | 129|false |NULL | 1618059264|
0| 0.0|F_TDLJ_HIST_2| 0|p | 159|true |NULL | 0|
278359| 441524.0|F_TDLJ_HIST_2| 278359|r | 159|false |NULL | 2280972288|
0| 0.0|F_TDOJ_HIST | 0|p | 104|true |NULL | 0|
311913|1424975.0|F_TDOJ_HIST | 311913|r | 56|false |NULL | 2555928576|
0| 0.0|F_TDOJ_HIST_1| 0|p | 135|true |NULL | 0|
682522|1241940.0|F_TDOJ_HIST_1| 682522|r | 135|false |NULL | 5592793088|
0| 0.0|F_TDOJ_HIST_2| 0|p | 163|true |NULL | 0|
661324|1397598.0|F_TDOJ_HIST_2| 661324|r | 163|false |NULL | 5419098112|
This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it.
On 4/7/2024 20:25, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> *My question is : what can I change to get a better index reading speed ?*
>
> What I already tried :
>
> * Setting random_page_cost to prohibitive value (10000000) to force a
>   bitmap heap scan, because those can be made in parallel. This has
>   not worked, the optimizer is still doing an index scan on my fact table.
> * Change effective_io_concurrency, max_parallel_workers_per_gather,
>   work_mem to much higher values.

I'm not sure the case is only about the speed of the index scan. Just look at the slow index clause:
fth."ID_TRAIN" = ANY ('{17855,13945,536795,18838,18837,13574 ...
and many more values.
An IndexScan needs to make a scan for each of these values, and for each value go through the pages to check the other conditions.
We have already discussed some optimisations related to this case in a couple of pgsql-hackers threads, but I'm not sure we have a quick solution right now.
If you want to use a BitmapScan (which might be reasonable to try here), you need to split the huge ANY (...) clause into a sequence of ORs.
Also, maybe parallel append could help here, if you change the corresponding startup and tuple costs to force such a plan.
--
regards, Andrei Lepikhov
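The rewrite Andrei describes replaces one large ANY list with explicit OR'ed disjuncts, which the planner can turn into a BitmapOr of bitmap index scans. A minimal sketch, with hypothetical ID values:

```sql
-- Before: one big ANY list, planned as a single index scan.
-- SELECT ... WHERE "ID_TRAIN" = ANY ('{17855,13945,536795}'::int4[]);

-- After: a sequence of ORs, eligible for BitmapOr / Bitmap Heap Scan.
SELECT *
FROM   "F_TDOJ_HIST_1"
WHERE  "ID_TRAIN" = 17855
   OR  "ID_TRAIN" = 13945
   OR  "ID_TRAIN" = 536795;
```

Whether the planner actually chooses a bitmap plan still depends on costing, so it is worth checking the resulting plan with EXPLAIN.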
RE: How to solve my slow disk i/o throughput during index scan
Hello,
Thank you, splitting the query with "OR" definitely enables bitmap heap scans, and thus parallelized reads from disk 😃! I did not understand your second point, though: what is parallel append, and how do I enable it?
Simon F.
From: Andrei Lepikhov <lepihov@gmail.com>
Sent: Thursday, 4 July 2024 16:37
To: FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) <simon.freyburger@sncf.fr>; pgsql-performance@lists.postgresql.org; Peter Geoghegan <pg@bowt.ie>
Subject: Re: How to solve my slow disk i/o throughput during index scan
On 7/4/24 22:23, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> Thank you, splitting in "OR" query definitely enables bitmap heap scans,
> and thus parallelized read to disk 😃! I though did not understand your
> second point, what is parallel append, and how to enable it ?

Just for example:

DROP TABLE IF EXISTS t CASCADE;
CREATE TABLE t (id int not null, payload text) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF t FOR VALUES FROM (0) TO (1000);
CREATE TABLE p2 PARTITION OF t FOR VALUES FROM (1000) TO (2000);
CREATE TABLE p3 PARTITION OF t FOR VALUES FROM (2000) TO (3000);
CREATE TABLE p4 PARTITION OF t FOR VALUES FROM (3000) TO (4000);
INSERT INTO t SELECT x % 4000, repeat('a',128) || x
FROM generate_series(1,1E5) AS x;
ANALYZE t;
SET enable_parallel_append = on;
SET parallel_setup_cost = 0.00001;
SET parallel_tuple_cost = 0.00001;
SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;
EXPLAIN (COSTS OFF)
SELECT t.id, t.payload FROM t WHERE t.id % 2 = 0
GROUP BY t.id, t.payload;

 Group
   Group Key: t.id, t.payload
   ->  Gather Merge
         Workers Planned: 6
         ->  Sort
               Sort Key: t.id, t.payload
               ->  Parallel Append
                     ->  Parallel Seq Scan on p1 t_1
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p2 t_2
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p3 t_3
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p4 t_4
                           Filter: ((id % 2) = 0)

Here the table is scanned in parallel. It also works with IndexScan.
--
regards, Andrei Lepikhov
RE: How to solve my slow disk i/o throughput during index scan
Hello, and thank you again for your example!
Sorry for my late answer, I was working on a patch for our queries. I am, though, not completely understanding what is happening. Here is the plan of a query where I split the calls with OR as you suggested, which seems to have enabled parallel scans:
https://explain.dalibo.com/plan/gfa1cf9fffd01bcg#plan/node/1
But I still wonder: why was my query that slow? My current understanding of what happened is:
- When PostgreSQL does an Index Scan, it goes through a (non-parallel) loop of asking for a chunk of data and then processing it. It waits until it has processed the data before asking for the next chunk, instead of loading the whole index into RAM (which, I suppose, would be much faster, but is not feasible if the index is too big and the RAM too small, so PostgreSQL does not do it). Hence the 2 MB/s.
- When it does a Bitmap Index Scan, it can parallelize disk interactions and does not use the processor to discard rows, hence a much faster index load and processing.
I might be completely wrong, and would really like to understand the details, in order to explain them to my team and to others who might encounter the same problem.
Again, thank you very much for your help, we were really struggling with those slow queries !
Simon FREYBURGER
From: Andrei Lepikhov <lepihov@gmail.com>
Sent: Friday, 5 July 2024 04:05
To: FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) <simon.freyburger@sncf.fr>; pgsql-performance@lists.postgresql.org; Peter Geoghegan <pg@bowt.ie>
Subject: Re: How to solve my slow disk i/o throughput during index scan
RE: How to solve my slow disk i/o throughput during index scan
Also, it might not be related, but I observe suspiciously similar slow reads when I am inserting into the database. Could it be related?
See e.g. : https://explain.dalibo.com/plan/43d37de5870e1651
The table I am inserting into looks like :
-- public."F_TDOJ_SC_HIST_2" definition
-- Drop table
-- DROP TABLE public."F_TDOJ_SC_HIST_2";
CREATE TABLE public."F_TDOJ_SC_HIST_2" (
"ID_TRAIN" int4 NOT NULL,
"ID_JOUR" int4 NOT NULL,
"ID_OD" int4 NOT NULL,
"JX" int4 NOT NULL,
"RES" int4 NULL,
"REV" float4 NULL,
"RES_SC1" int4 NULL,
"RES_SC2" int4 NULL,
"RES_SC3" int4 NULL,
"RES_SC4" int4 NULL,
"RES_SC5" int4 NULL,
"RES_SC6" int4 NULL,
"RES_SC7" int4 NULL,
"RES_SC8" int4 NULL,
"RES_SC9" int4 NULL,
"RES_SC10" int4 NULL,
"RES_SC11" int4 NULL,
"RES_SC12" int4 NULL,
"RES_SC13" int4 NULL,
"RES_SC14" int4 NULL,
"RES_SC15" int4 NULL,
"RES_SC16" int4 NULL,
"RES_SC17" int4 NULL,
"RES_SC18" int4 NULL,
"AUT_SC1" int4 NULL,
"AUT_SC2" int4 NULL,
"AUT_SC3" int4 NULL,
"AUT_SC4" int4 NULL,
"AUT_SC5" int4 NULL,
"AUT_SC6" int4 NULL,
"AUT_SC7" int4 NULL,
"AUT_SC8" int4 NULL,
"AUT_SC9" int4 NULL,
"AUT_SC10" int4 NULL,
"AUT_SC11" int4 NULL,
"AUT_SC12" int4 NULL,
"AUT_SC13" int4 NULL,
"AUT_SC14" int4 NULL,
"AUT_SC15" int4 NULL,
"AUT_SC16" int4 NULL,
"AUT_SC17" int4 NULL,
"AUT_SC18" int4 NULL,
"DSP_SC1" int4 NULL,
"DSP_SC2" int4 NULL,
"DSP_SC3" int4 NULL,
"DSP_SC4" int4 NULL,
"DSP_SC5" int4 NULL,
"DSP_SC6" int4 NULL,
"DSP_SC7" int4 NULL,
"DSP_SC8" int4 NULL,
"DSP_SC9" int4 NULL,
"DSP_SC10" int4 NULL,
"DSP_SC11" int4 NULL,
"DSP_SC12" int4 NULL,
"DSP_SC13" int4 NULL,
"DSP_SC14" int4 NULL,
"DSP_SC15" int4 NULL,
"DSP_SC16" int4 NULL,
"DSP_SC17" int4 NULL,
"DSP_SC18" int4 NULL,
"REV_SC1" float4 NULL,
"REV_SC2" float4 NULL,
"REV_SC3" float4 NULL,
"REV_SC4" float4 NULL,
"REV_SC5" float4 NULL,
"REV_SC6" float4 NULL,
"REV_SC7" float4 NULL,
"REV_SC8" float4 NULL,
"REV_SC9" float4 NULL,
"REV_SC10" float4 NULL,
"REV_SC11" float4 NULL,
"REV_SC12" float4 NULL,
"REV_SC13" float4 NULL,
"REV_SC14" float4 NULL,
"REV_SC15" float4 NULL,
"REV_SC16" float4 NULL,
"REV_SC17" float4 NULL,
"REV_SC18" float4 NULL,
CONSTRAINT "F_TDOJ_SC_HIST_2_pkey" PRIMARY KEY ("ID_TRAIN","ID_JOUR","ID_OD","JX")
)
PARTITION BY RANGE ("ID_JOUR");
-- public."F_TDOJ_SC_HIST_2" foreign keys
ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_JOUR_fkey" FOREIGN KEY ("ID_JOUR") REFERENCES public."D_JOUR"("ID_JOUR");
ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_OD_fkey" FOREIGN KEY ("ID_OD") REFERENCES public."D_OD"("ID_OD");
ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_ID_TRAIN_fkey" FOREIGN KEY ("ID_TRAIN") REFERENCES public."D_TRAIN"("ID_TRAIN");
ALTER TABLE public."F_TDOJ_SC_HIST_2" ADD CONSTRAINT "F_TDOJ_SC_HIST_2_JX_fkey" FOREIGN KEY ("JX") REFERENCES public."D_JX"("JX");
I'm using a 3-step process to insert my rows into the table:
- COPY into a temporary table
- DELETE FROM on the perimeter I will be inserting into
- INSERT INTO mytable SELECT … FROM temporarytable ON CONFLICT DO NOTHING
Is it possible to parallelize the scans during the modify step ?
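The three steps above, sketched against the DDL shown earlier (the file path, ID_JOUR range, and staging-table name are hypothetical):

```sql
BEGIN;

-- 1. Bulk-load the incoming batch into a session-local staging table.
CREATE TEMPORARY TABLE staging (LIKE "F_TDOJ_SC_HIST_2" INCLUDING DEFAULTS);
COPY staging FROM '/tmp/batch.csv' WITH (FORMAT csv);  -- or psql's \copy client-side

-- 2. Clear the perimeter being reloaded.
DELETE FROM "F_TDOJ_SC_HIST_2" WHERE "ID_JOUR" BETWEEN 200 AND 230;

-- 3. Re-insert, ignoring duplicates on the primary key.
INSERT INTO "F_TDOJ_SC_HIST_2" SELECT * FROM staging ON CONFLICT DO NOTHING;

COMMIT;
```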
Regards
Simon FREYBURGER
On 11/7/2024 22:09, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> Is it possible to parallelize the scans during the modify step ?

Temporary tables can't be used inside a query with parallel workers involved, because such a table is local to a single process.
As for your question - I'm not sure without the whole bunch of data. But you can get the maximum speedup by disabling as many constraints as possible - ideally, fill each partition individually, with no constraints and indexes at all, before uniting them into one partitioned table.
--
regards, Andrei Lepikhov
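Andrei's "fill each partition individually, then unite" approach corresponds to loading a plain, index-free table and attaching it afterwards. A sketch, with hypothetical table names, file path, and ID_JOUR range:

```sql
-- Load into a plain table with no indexes or constraints ...
CREATE TABLE "F_TDOJ_SC_HIST_2_p200" (LIKE "F_TDOJ_SC_HIST_2");
COPY "F_TDOJ_SC_HIST_2_p200" FROM '/tmp/p200.csv' WITH (FORMAT csv);

-- ... index it once loaded, then attach it as a partition.
ALTER TABLE "F_TDOJ_SC_HIST_2_p200"
  ADD PRIMARY KEY ("ID_TRAIN", "ID_JOUR", "ID_OD", "JX");
ALTER TABLE "F_TDOJ_SC_HIST_2"
  ATTACH PARTITION "F_TDOJ_SC_HIST_2_p200"
  FOR VALUES FROM (200) TO (231);
```

Building the index after the bulk load avoids maintaining it row by row during the COPY.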
On 11/7/2024 21:59, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT) wrote:
> Hello, and thank you again for your example !
>
> Sorry for my late answer, I was working on a patch for our requests. I
> am though not completely understanding what is happening. Here is a plan
> of a query where I splitted the calls with OR as you suggested, what
> seemed to have enabled parallel scans.

Thanks for the feedback!
Generally, I don't understand why you needed to transform ANY -> OR at all to get a BitmapScan. Can you just disable IndexScan (and possibly SeqScan) to see whether it is a hard transformation limit or a mistake in cost estimation?
--
regards, Andrei Lepikhov
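The diagnostic Andrei suggests can be run per session with the planner's enable_* toggles; the query shape and ID values below are hypothetical:

```sql
SET enable_indexscan = off;  -- forbid plain index scans for this session
SET enable_seqscan   = off;  -- optionally also forbid sequential scans

-- If a bitmap plan appears now, the earlier choice was a costing issue,
-- not a hard limit on planning ANY (...) as a bitmap scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   "F_TDOJ_HIST_1"
WHERE  "ID_TRAIN" = ANY ('{17855,13945,536795}'::int4[]);

RESET enable_indexscan;
RESET enable_seqscan;
```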