Parallel Query Stats - Mailing list pgsql-hackers
From: Benoit Lobréau
Subject: Parallel Query Stats
Date:
Msg-id: d657df20-c4bf-63f6-e74c-cb85a81d0383@dalibo.com
Responses: Re: Parallel Query Stats
List: pgsql-hackers
Hi hackers,

Parallelism has been in core since 9.6; it's a great feature that has received several upgrades since then. However, it lacks metrics to determine whether and how parallelism is used, and to help tune the parameters related to it.

Currently, the only information available is pg_stat_activity.backend_type and pg_stat_activity.leader_pid. These could be sampled to get statistics about the number of queries using parallel workers and the number of workers spawned (globally or per statement), but this is not ideal because:

* the sampling period would require a high frequency to get stats close enough to reality without missing lots of short-duration queries;
* with sampling we cannot get an accurate count of parallel queries;
* we don't know how many queries couldn't get the workers they asked for.

We thought about several places where we could add some user-facing metrics, and would like some input on the design before working on a larger patch. The various names chosen are obviously not settled.

# Traces

We could add a GUC "log_parallel_draught": it would add a message to the logs when a query or utility asks for parallel workers but can't get all of them. The message could look like the examples below. It could be issued several times per query, since workers can be requested for different parts of the plan.

LOG: Parallel worker draught detected: worker launched: 0, requested: 2
STATEMENT: explain analyze select * from pq_foo inner join pq_bar using(id);

LOG: Parallel worker draught detected: worker launched: 0, requested: 1
CONTEXT: while scanning relation "public.pv_tbl"
STATEMENT: VACUUM (PARALLEL 2, VERBOSE) pv_tbl;

LOG: Parallel worker draught detected: worker launched: 0, requested: 1
STATEMENT: CREATE INDEX ON pi_tbl(i);

This could be used by tools like pgBadger to aggregate stats on statements that didn't get their workers, but we might need additional information to know why workers are lacking.
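For reference, the sampling workaround described above boils down to polling something like the following (the pg_stat_activity columns exist today; the aggregation is just a sketch):

```sql
-- Sample currently running parallel workers, grouped by their leader.
-- leader_pid is set only on parallel workers and points at the leader backend.
SELECT leader_pid,
       count(*) AS active_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker'
GROUP BY leader_pid;
```

Run at a high frequency this gives an approximation of parallel worker usage, but, as noted, it misses short-lived queries entirely and says nothing about workers that were requested and never launched.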
We have a working PoC patch for this, since it seems the most straightforward to implement and use.

# pg_stat_bgworker view

I was initially thinking about metrics like:

* number of parallel queries;
* number of parallel queries that didn't get their workers.

But without the number of eligible queries, these are not very useful. Instead, the following metrics could be useful:

* how many workers were requested;
* how many workers were obtained.

The data would be updated as the workers are spawned (or aren't). It would be interesting to have this information per background worker type, in order to identify which pool is the source of a parallel worker draught. The view could look like this:

* bgworker_type: possible values would be: logical replication worker / parallel worker / parallel maintenance worker / a name given by an extension;
* datname: the database the workers were connected to if applicable, or null otherwise;
* active: number of currently running workers;
* requested: number of requested workers;
* obtained: number of obtained workers;
* duration: the aggregation of all durations; we could update this field when a background worker finishes, and add the duration from the ones still running to produce a more accurate number;
* stats_reset: the reset would be handled the same way other pg_stat_* views handle it.

The parallel maintenance worker type doesn't exist in pg_stat_activity. I think it would be worthwhile to add it, since this kind of parallel worker has its own pool.

This view could be used by monitoring or metrology tools to raise alerts or graph background worker usage, and to determine if, when and where there is a shortage of workers. Tools like pg_activity, check_postgres/check_pgactivity or Prometheus exporters could use these stats.

# pg_stat_statements

This view is dedicated to per-query statistics.
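To illustrate the intended usage, here is a hypothetical query against the proposed view (pg_stat_bgworker does not exist yet; column names follow the sketch above):

```sql
-- Hypothetical: flag worker pools where requests regularly exceed
-- what was actually obtained, i.e. candidate sources of a draught.
SELECT bgworker_type,
       datname,
       requested,
       obtained,
       round(100.0 * obtained / nullif(requested, 0), 1) AS satisfied_pct
FROM pg_stat_bgworker
ORDER BY satisfied_pct ASC NULLS LAST;
```

A pool with a low satisfied_pct would prompt a look at max_parallel_workers, max_parallel_maintenance_workers or max_logical_replication_workers, depending on the bgworker_type.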
We could add a few metrics related to parallelism:

* parallelized_calls: how many executions were planned with parallelism;
* parallelized_draught_calls: how many executions were planned with parallelism but didn't get all their workers;
* parallel_workers_requested: how many workers were requested for this parallel statement;
* parallel_workers_total: how many workers were obtained for this parallel statement.

This information is useful to detect queries that regularly don't get their workers. If it's sampled, we could also know when. It could be used by tools like POWA to, e.g., visualize the query runtime depending on the number of workers, the moments of the day it lacks the requested workers, etc. The last two metrics could help estimate whether a query makes heavy use of parallelism.

Note: I have skimmed through the thread "Expose Parallelism counters planned/execute in pg_stat_statements" [1] and still need to take a closer look at it.

[1] https://www.postgresql.org/message-id/flat/6acbe570-068e-bd8e-95d5-00c737b865e8%40gmail.com

# pg_stat_all_tables and pg_stat_all_indexes

We could add a parallel_seq_scan counter to pg_stat_all_tables. The column would be incremented for each worker participating in a scan. The leader would also increment the counter if it participates. The same thing could be done for pg_stat_all_indexes with a parallel_index_scan column.

These metrics could be used in relation to system stats and other PostgreSQL metrics such as pg_statio_* in tools like POWA.

# Workflow

An overview of background worker usage could be obtained via the pg_stat_bgworker view. It could help detect, and in some cases explain, parallel worker draughts. It would also help adapt the size of the worker pools, and prompt us to look into the logs or pg_stat_statements.
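With the proposed columns, spotting starved statements in pg_stat_statements could look like this (hypothetical columns, per the list above):

```sql
-- Hypothetical: statements most often planned with parallelism
-- but short of workers at execution time.
SELECT query,
       parallelized_calls,
       parallelized_draught_calls,
       parallel_workers_requested,
       parallel_workers_total
FROM pg_stat_statements
WHERE parallelized_draught_calls > 0
ORDER BY parallelized_draught_calls DESC
LIMIT 10;
```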
The statistics gathered in pg_stat_statements can be used the usual way:

* get an idea of parallel query usage on the server;
* detect queries that are starved of parallel workers;
* compare snapshots to see the impact of parameter modifications;
* combine the statistics with other sources to know:
  * if a decrease in parallel workers had an impact on the average execution duration;
  * if an increase in parallel worker allocation had an impact on the system time.

The logs can be used to pinpoint specific queries with their parameters, or to get global statistics when pg_stat_statements is not available or can't be used.

Once a query is singled out, it can be analysed as usual with EXPLAIN to determine:

* if the lack of workers is a problem;
* how parallelism helps in this particular case.

Finally, the per-relation statistics could be combined with system and other PostgreSQL metrics to identify why the storage is stressed.

If you have reached this point, thank you for reading me! Many thanks to Melanie Plageman for the pointers she shared with us around the pgsessions in Paris, and for her time in general.

--
Benoit Lobréau
Consultant
http://dalibo.com