pg_stat_statements: Add `calls_aborted` counter for tracking query cancellations - Mailing list pgsql-hackers
From | Benoit Tigeot |
---|---|
Subject | pg_stat_statements: Add `calls_aborted` counter for tracking query cancellations |
Date | |
Msg-id | CAHUgstAuVpiSr1yRXtCR1mT5U9kvkur6P+kCs1M0dp1c_mDMUQ@mail.gmail.com Whole thread Raw |
Responses |
Re: pg_stat_statements: Add `calls_aborted` counter for tracking query cancellations
|
List | pgsql-hackers |
Hello This is my first patch that I wanted to submit since nearly 2 years. It adds a new `calls_aborted` counter to `pg_stat_statements` that tracks queries terminated due to statement timeouts, user cancellations, or other errors during execution. Unlike the existing calls counter which only increments on successful completion, `calls_aborted` provides visibility into failed query attempts, enabling better monitoring of query reliability and timeout issues without needing to examine logs. The counter is incremented in the `ExecutorRun` `PG_FINALLY` block to capture queries that start execution but fail to complete. I work mostly as a web developer in environments with sensitive data. Leveraging `pg_stat_statements` for cancelled and timed-out queries instead of logs could be very beneficial for developers who want quick access to an overview of problematic queries without having access to all logs and more sensitive data (slow queries, auto explain, etc). Status: For discussion Branch: master Testing: Compiles successfully and passes all regression tests, including new tests for this feature Performance impact: Minimal performance impact. Only adds: - One additional int64 field to the counters structure - A hash table lookup and counter increment in the `PG_FINALLY` block (only for aborted queries) - No impact on successful query execution paths Implementation rationale: The patch tracks aborted queries specifically in `ExecutorRun's PG_FINALLY` block because: 1. Execution focus: Queries that fail during parsing/planning are fundamentally different from queries that start executing but get cancelled 2. Practical monitoring: Operations teams probably care most about queries that consume resources (start execution) but fail to complete. It follows my earlier need of more "dev oriented" columns. 3. Consistent semantics: Matches the existing calls counter which tracks execution completions even if this counter is quite "alone" as it is not linking to other incremented columns. The approach only increments `calls_aborted` for existing entries (doesn't create new ones, using parsed) since queries that would be tracked are already identified during earlier query phases. Anticipated questions and responses (Thanks to Michael Paquier and Tomas Vondra for early discussion on this patch 🙌) A. Why not use PGSS_ABORT kind? pgssStoreKind is designed for successful operations (plans, executions) or storing "setups" with PGSS_INVALID. For me aborted calls don't have meaningful timing/resource statistics to store in the arrays, and it would require extending ALL the timing/buffer/WAL arrays for incomplete operations. B. Could we collect additional stats about aborted runs (buffer/WAL usage, etc.)? I chose not to do this because it would require modifying execMain.c and dealing with interrupted queries to gather partial information. It seems quite ambitious to do as a first patch. I see `calls_aborted` more as a probe to identify problematic queries for further investigation in logs or at the application level. C. Why not track earlier phases? I deliberately focused `calls_aborted` on executor-level failures rather than earlier phases (parser/analyzer/rewriter/planner) because they serve different operational purposes. Earlier phase failures are typically development-time errors (syntax mistakes, missing tables, type mismatches) that don't provide actionable operational insights. Executor aborts represent runtime operational issues (query timeouts, cancellations, resource exhaustion, lock conflicts, etc.) that indicate performance degradation or capacity problems requiring attention. This design keeps the feature focused on what matters for production monitoring: distinguishing between queries that "worked before but now fail operationally" versus "never worked due to code bugs." The implementation is also cleaner, avoiding the complexity of hooking multiple subsystems and classifying different error types but of course I may be wrong. ;) Previous discussions: - https://www.postgresql.org/message-id/20171112223906.dqtl3kk3pd7vn6yc%40alap3.anarazel.de - https://www.postgresql.org/message-id/4671.1510537167%40sss.pgh.pa.us - https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference "Could we count up timeouts?" Thanks __________ Benoit Tigeot benoit.tigeot@gmail.com
Attachment
pgsql-hackers by date: