Thread: Re: Sample rate added to pg_stat_statements
> On 18 Nov 2024, at 23:33, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> Hi hackers,
>
> Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One potential solution to this issue could be query sampling in pg_stat_statements.
>
> A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate, for the pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in pg_stat_statements.
>
> This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is viable and applicable.

+1 for the idea. I heard a lot of complaints that pgss is costly. Most of them were using it wrong though. But at least it could give an easy way to rule out the performance impact of pgss.

> On 19 Nov 2024, at 15:09, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> I believe we should also include this check in the pgss_ExecutorEnd() function because sampling in pgss_ExecutorEnd() ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured while maintaining efficient sampling.

Is there a reason why pgss_ProcessUtility is excluded?

Best regards, Andrey Borodin.
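For orientation, a minimal sketch of how such a GUC could be registered in the extension's _PG_init(); the variable name, default and limits here are assumptions for illustration, not taken from the posted patch:

#include "postgres.h"

#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/*
 * Hypothetical fraction of statements to track; 1.0 would mean "track
 * everything".  Name and defaults are assumptions, not the patch.
 */
static double pgss_sample_rate = 1.0;

void
_PG_init(void)
{
	DefineCustomRealVariable("pg_stat_statements.sample_rate",
							 "Fraction of queries tracked by pg_stat_statements.",
							 NULL,
							 &pgss_sample_rate,
							 1.0,		/* boot value: track everything */
							 0.0,		/* minimum */
							 1.0,		/* maximum */
							 PGC_SUSET,
							 0,
							 NULL, NULL, NULL);
}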
> On 19 Nov 2024, at 17:39, Greg Sabino Mullane <htamfids@gmail.com> wrote:
>
> I'm curious what "using it wrong" means exactly?

Here's an example. pgSCV is querying pgss for every database separately every minute. It makes sense for the project. But when you have ~1000 databases, you have a lot of traffic to pgss alone. Even if you have only one active database, because all pgss results must be put into a tuplestore before filtering. See [0] for details.

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/1AEEB240-9B68-44D5-8A29-8F9FDB22C801%40yandex-team.ru
On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:
> > Oh, and a +1 in general to the patch, OP, although it would also be nice to
> > start finding the bottlenecks that cause such performance issues.
>
> FWIW, I'm not eager to integrate this proposal without looking at this
> exact argument in depth.
>
> One piece of it would be to see how much of such "bottlenecks" we
> would be able to get rid of by integrating pg_stat_statements into
> the central pgstats with the custom APIs, without pushing the module
> into core.  This means that we would combine the existing hash of pgss
> to shrink to 8 bytes for objid rather than 13 bytes now as the current
> code relies on (toplevel, userid, queryid) for the entry lookup (entry
> removal is sniped with these three values as well, or dshash seq
> scans).  The odds of conflicts still play in our favor even if we have
> a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then the spinlock in pgss_store() can become quite a bottleneck. And I'm not sure switching all counters to atomics could improve the situation, given we already have pretty many counters.

I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for query execution time, and sampling just everything below that threshold. Slower queries shouldn't be sampled, because they can't be too frequent, and it could also be more valuable to count them individually (while very fast queries probably only matter "on average").

------
Regards,
Alexander Korotkov
Supabase
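To make the spinlock discussion above concrete, here is a rough, self-contained sketch of the kind of per-entry counter update pgss_store() performs; the struct shapes below are simplified stand-ins, not the actual pg_stat_statements definitions:

#include "postgres.h"
#include "storage/spin.h"

/* Simplified, illustrative stand-ins for the real pgss structures. */
typedef struct pgssCounters
{
	int64		calls;
	double		total_time;
} pgssCounters;

typedef struct pgssEntry
{
	pgssCounters counters;
	slock_t		mutex;			/* protects the counters */
} pgssEntry;

/*
 * Every execution of the same normalized statement serializes on the
 * entry's spinlock; with many backends hammering one hot entry this is
 * what surfaces as SpinDelay waits.
 */
static void
pgss_bump_counters(pgssEntry *entry, double total_time)
{
	volatile pgssEntry *e = (volatile pgssEntry *) entry;

	SpinLockAcquire(&e->mutex);
	e->counters.calls += 1;
	e->counters.total_time += total_time;
	SpinLockRelease(&e->mutex);
}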
On 22.11.2024 09:08, Alexander Korotkov wrote:On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still still play in our favor even if we have a few million entries, or even ten times that.If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for a query execution time, and sample just everything below that threshold. The slower query shouldn't be sampled, because it can't be too frequent, and also it could be more valuable to be counter individually (while very fast queries probably only matter "in average"). ------ Regards, Alexander Korotkov Supabase
I really liked your idea, and I’d like to propose an enhancement that I believe improves it further.
Yes, if a query’s execution time exceeds the threshold, it should always be tracked without sampling. However, for queries with execution times below the threshold, the sampling logic should prioritize shorter queries over those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled, while queries closer to the threshold are less likely to be sampled.
This behavior can be achieved with the following logic:
pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time
Here’s how it works:
- As a query’s execution time approaches zero, the probability of it being sampled approaches one.
- Conversely, as a query’s execution time approaches the threshold, the probability of it being sampled approaches zero.
In other words, the sampling probability decreases linearly from 1 to 0 as the execution time gets closer to the threshold.
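For illustration, a minimal C sketch of the check described above; the GUC name pgss_sample_exectime_threshold and the helper are hypothetical, not taken from the attached patch:

#include "postgres.h"
#include "common/pg_prng.h"

/* Hypothetical threshold GUC, in milliseconds; 0 disables the check. */
static double pgss_sample_exectime_threshold = 0.0;

/*
 * Returns whether "threshold * random(0, 1) < total_time" holds for the
 * finished statement.  Statements at or above the threshold always satisfy
 * it; below the threshold the probability falls off linearly.
 */
static bool
exectime_threshold_hit(double total_time_ms)
{
	if (pgss_sample_exectime_threshold <= 0.0)
		return true;

	return pgss_sample_exectime_threshold *
		pg_prng_double(&pg_global_prng_state) < total_time_ms;
}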
I believe this approach offers an ideal user experience. I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables or documentation. I’m always open to discussion.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
I've been closely reviewing my latest patch (v5-*.patch) implementing time-based sampling, and I've realized that it might not be the best approach. Let me explain the reasons:
- We can only perform sampling before the 'pgss_planner()' function. However, at that point, we don’t yet know the query's execution time since it only becomes available during 'pgss_ExecutorEnd()' or 'pgss_ProcessUtility()';
- If we wait to sample until execution completes and we have the actual execution time, this introduces a problem: by that point, we might have already recorded the query's statistics into shared memory from pgss_planner(), making it too late to decide whether to sample the query;
- Delaying sampling until execution finishes would require waiting for the execution time, which could introduce performance overhead. This defeats the purpose of sampling, which aims to reduce the cost of tracking queries.
I believe we should reconsider the approach and revert to sampling based on v4-*.patch. If I’ve missed anything or there’s an alternative way to implement time threshold-based sampling efficiently, I’d be grateful to hear your insights.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
> On 6 Jan 2025, at 15:50, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> Any suggestions for improvements?

The patch looks good to me, just a few nits.

0. Perhaps, we could have a test for edge values of 0 and 1. I do not insist, just an idea to think about.

1. This code seems a little different from your patch. It is trying to avoid engaging the PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
    (log_xact_sample_rate == 1 ||
     pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Thanks!

Best regards, Andrey Borodin.
Hi,

I was looking at this patch, and I was specifically curious about how this works with prepared statements. The way the patch works now is that it determines if the statement is to be sampled at post_parse_analyze time, which could lead to unexpected behavior.

Let's take the example below, in which pg_stat_statements.sample_rate is set to 0 (to mimic some sampling rate < 1 in which this query does not get sampled). At that point, all subsequent executions of the statement will not get tracked at all. Is this what is expected for prepared statements? My concern is we will lose even more stats than what a user may expect. This of course will not be an issue for simple query.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
   pg_stat_statements_reset
-------------------------------
 2025-01-06 11:45:23.484793-06
(1 row)

postgres=# SELECT $1 \parse stmt
postgres=#
postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

Regards,

Sami Imseih
Amazon Web Services (AWS)
> On 7 Jan 2025, at 16:05, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
>> 1. This code seems a little different from your patch. It is trying to avoid engaging the PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".
>>
>> xact_is_sampled = log_xact_sample_rate != 0 &&
>>     (log_xact_sample_rate == 1 ||
>>      pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);
>>
> Are we sure we're discussing the same patch? Because these remarks refer to version 5 of the patch, which I abandoned due to your remarks.

Yes. v6 has this code

+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);

while upstream has the code that I cited. And the logic is slightly different.

Best regards, Andrey Borodin.
>> You are right. This is absolutely unexpected for users. Thank you for
>> the review.
>>
>> To fix this, I suggest storing a random number in the [0, 1) range in a
>> separate variable, which will be used for comparisons in any place. We
>> will compare 'sample_rate' and random value not only in
>> pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
>> pgss_planner().
>>
>> I attached patch with test and fixes.

I still think this is not the correct approach. It seems that post_parse_analyze should not have to deal with checking for a sample rate. This is because post_parse_analyze, which is the only hook with access to jstate, is only responsible for storing a normalized query text on disk and creating a not-yet user visible entry in the hash. i.e. pgss_store will never increment counters when called from pgss_post_parse_analyze.

    /* Increment the counts, except when jstate is not NULL */
    if (!jstate)
    {

What I think may be a better idea is to add something similar to auto_explain.c, but it should only be added to the top of pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.

    if (nesting_level == 0)
    {
        if (!IsParallelWorker())
            current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
        else
            current_query_sampled = false;
    }

This is needed for ExecutorStart and not ExecutorEnd because ExecutorStart is where the instrumentation is initialized with queryDesc->totaltime. The above code block could be turned into a macro and reused in the routines mentioned.

However, it seems with this change, we can see a much higher likelihood of non-normalized query texts being stored. This is because jstate is only available during post_parse_analyze. Therefore, if the first time you are sampling the statement is in ExecutorEnd, then you will end up storing a non-normalized version of the query text; see the example below with the attached v8.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
   pg_stat_statements_reset
-------------------------------
 2025-01-07 13:02:11.807952-06
(1 row)

postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
                    query                     | calls
----------------------------------------------+-------
 SELECT 1                                     |     1
 SELECT query, calls FROM pg_stat_statements  |     1
(2 rows)

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
                    query                     | calls
----------------------------------------------+-------
 SELECT 1                                     |     2
 SELECT query, calls FROM pg_stat_statements  |     2
(2 rows)

One idea is to make jstate available to all hooks, and completely remove reliance on post_parse_analyze in pg_stat_statements.
I can't find the thread now, but I know this has come up in past discussions when troubleshooting gaps in query normalization. My concern is this feature will greatly increase the likelihood of non-normalized query texts.

Also, with regards to the benchmarks, it seems sampling will be beneficial for workloads that are touching a small number of entries with high concurrency. This is why we see a benefit for a standard pgbench workload. Sampling becomes less beneficial when there is a large set of queries being updated. I still think this is a good approach for workloads that touch a small set of entries.

Regards,

Sami
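As an illustration of the "turn it into a macro" idea mentioned above, a minimal sketch; the macro name is made up, and it assumes the same nesting_level, current_query_sampled and pgss_sample_rate variables used in the snippet:

/*
 * Hypothetical macro wrapping the sampling decision above, meant to be
 * invoked at the top of pgss_planner, pgss_ExecutorStart and
 * pgss_ProcessUtility.  The name is illustrative only.
 */
#define PGSS_SAMPLE_CURRENT_QUERY() \
	do { \
		if (nesting_level == 0) \
		{ \
			if (!IsParallelWorker()) \
				current_query_sampled = \
					pg_prng_double(&pg_global_prng_state) < pgss_sample_rate; \
			else \
				current_query_sampled = false; \
		} \
	} while (0)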
On 07.01.2025 22:29, Sami Imseih wrote:
>>> You are right. This is absolutely unexpected for users. Thank you for
>>> the review.
>>>
>>> To fix this, I suggest storing a random number in the [0, 1) range in a
>>> separate variable, which will be used for comparisons in any place. We
>>> will compare 'sample_rate' and random value not only in
>>> pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
>>> pgss_planner().
>>>
>>> I attached patch with test and fixes.
>
> I still think this is not the correct approach. It seems that post_parse_analyze
> should not have to deal with checking for a sample rate. This is because
> post_parse_analyze, which is the only hook with access to jstate, is
> only responsible for storing a normalized query text on disk and creating
> a not-yet user visible entry in the hash. i.e. pgss_store will never
> increment counters when called from pgss_post_parse_analyze.
>
>     /* Increment the counts, except when jstate is not NULL */
>     if (!jstate)
>     {
>
> What I think may be a better idea is to add something similar
> to auto_explain.c, but it should only be added to the top of
> pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.
>
>     if (nesting_level == 0)
>     {
>         if (!IsParallelWorker())
>             current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
>         else
>             current_query_sampled = false;
>     }
>
> This is needed for ExecutorStart and not ExecutorEnd because
> ExecutorStart is where the instrumentation is initialized with
> queryDesc->totaltime. The above code block could be
> turned into a macro and reused in the routines mentioned.
>
> However, it seems with this change, we can see a much
> higher likelihood of non-normalized query texts being stored.
> This is because jstate is only available during post_parse_analyze.
> Therefore, if the first time you are sampling the statement is in ExecutorEnd,
> then you will end up storing a non-normalized version of the query text;
> see the example below with the attached v8.
>
> [psql example with v8 showing 'SELECT 1' stored non-normalized]
>
> One idea is to make jstate available to all hooks, and completely
> remove reliance on post_parse_analyze in pg_stat_statements.
>
> I can't find the thread now, but I know this has come up in past discussions
> when troubleshooting gaps in query normalization. My concern is this
> feature will greatly increase the likelihood of non-normalized query texts.
>
> Also, with regards to the benchmarks, it seems
> sampling will be beneficial for workloads that are touching a small number
> of entries with high concurrency. This is why we see a benefit for
> a standard pgbench workload.
> Sampling becomes less beneficial when there is a large set of queries
> being updated.
> I still think this is a good approach for workloads that touch a small set
> of entries.
>
> Regards,
>
> Sami

Wow, thank you for pointing this out. Your solution looks interesting, but I'd like to explore other ways to solve the issue besides making it available to all hooks. If I don't find anything better, I'll go with yours.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
On 08.01.2025 22:39, Ilia Evdokimov wrote:
>
> After the changes in the v9 patch, won't all the necessary queries now be
> normalized? Since there are no longer any restrictions in the parser
> hook, queries will be normalized as usual, and pgss_planner,
> pgss_ExecutorStart, and pgss_ExecutorEnd will simply fetch them from the
> 'pgss_query_texts.stat' file.
>
> For now, this seems like the simplest approach instead of making
> JumbleState available to other hooks. Moreover, if we do proceed with
> that, we might be able to remove the 'pgss_query_texts.stat' file
> altogether, along with other improvements. In my view, if no other
> options arise, we should address making JumbleState available to other
> hooks in a separate thread. If you have any suggestions, I'm always
> open to feedback.
>
> I attached the v9 patch with the changes and test above.

Unfortunately, these changes do not achieve the intended sampling goal. I looked into this more deeply: while the sampled-out queries do not appear in pg_stat_statements, an entry is still allocated in the hash table after normalization, which, in my view, should not happen when sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
> Unfortunately, these changes do not achieve the intended sampling goal.
> I looked into this more deeply: while the sampled-out queries do not
> appear in pg_stat_statements, an entry is still allocated in the hash
> table after normalization, which, in my view, should not happen when
> sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

pg_stat_statements creates entries as "sticky" initially to give them more time to stay in the hash before the first execution completes. It's not perfect, but it works for the majority of cases. So, what you are observing is how pg_stat_statements currently works.

If an entry is popular enough, we will need it anyway (even with the proposed sampling). An entry that's not popular will eventually be aged out.

From my understanding, what the proposed sampling will do is reduce the overhead of incrementing counters of popular entries, because of the spinlock to update the counters. This is particularly the case with high concurrency on large machines (high CPU count), and especially when there is a small set of popular entries. IMO, this patch should also have a benchmark that proves that a user can benefit from sampling in those types of workloads.

Regards,

Sami
On 09.01.2025 05:29, Sami Imseih wrote:
>> Unfortunately, these changes do not achieve the intended sampling goal.
>> I looked into this more deeply: while the sampled-out queries do not
>> appear in pg_stat_statements, an entry is still allocated in the hash
>> table after normalization, which, in my view, should not happen when
>> sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.
>
> pg_stat_statements creates entries as "sticky" initially to give them
> more time to stay in the hash before the first execution completes.
> It's not perfect, but it works for the majority of cases. So, what you
> are observing is how pg_stat_statements currently works.
>
> If an entry is popular enough, we will need it anyway (even
> with the proposed sampling). An entry that's not popular will
> eventually be aged out.
>
> From my understanding, what the proposed sampling will do is
> reduce the overhead of incrementing counters of popular entries,
> because of the spinlock to update the counters. This is particularly
> the case with high concurrency on large machines (high CPU count),
> and especially when there is a small set of popular entries.
> IMO, this patch should also have a benchmark that proves
> that a user can benefit from sampling in those types of
> workloads.

Ah, so patch version 9 might be the best fit to achieve this. I'll need to benchmark it on a large, high-concurrency machine then.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Hi,

I have a few comments about the patch.

1/ Tests are missing for pg_stat_statements.track = all with a sample rate < 1 applied. Can we add tests for this case?

2/ This declaration:

+static bool current_query_sampled = false;

should be moved near the declaration of nesting_level, the other local variable.

3/ I do not really like the calls to update_current_query_sampled() the way they are, and I think pgss_enabled should encapsulate the new sample rate check as well. My thoughts are: change the name of the function from update_current_query_sampled to current_query_sampled. Also, the static bool should be called is_query_sampled or something like that. The function should also allow you to skip the sample rate check, such as if called from pgss_post_parse_analyze. We can also remove the IsParallelWorker() check in this case, since that is done in pgss_enabled.

Something like this is what I am thinking:

static bool
current_query_sampled(bool skip)
{
    if (skip)
        return true;

    if (nesting_level == 0)
    {
        is_query_sampled = pgss_sample_rate != 0.0 &&
            (pgss_sample_rate == 1.0 ||
             pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
    }

    return is_query_sampled;
}

#define pgss_enabled(level, skip_sampling_check) \
    (!IsParallelWorker() && \
     (pgss_track == PGSS_TRACK_ALL || \
      (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
     current_query_sampled(skip_sampling_check))

What do you think?

4/ Now we have pg_stat_statements.track = "off", which is effectively the same thing as pg_stat_statements.sample_rate = "0". Does this need to be called out in the docs?

Regards,

Sami
On 28.01.2025 23:50, Ilia Evdokimov wrote:
>
>>> If anyone has the capability to run this benchmark on machines with more
>>> CPUs or with different queries, it would be nice. I'd appreciate any
>>> suggestions or feedback.
>>
>> I wanted to share some additional benchmarks I ran as well
>> on a r8g.48xlarge (192 vCPUs, 1,536 GiB of memory) configured
>> with 16GB of shared_buffers. I also attached the benchmark.sh
>> script used to generate the output.
>> The benchmark is running the select-only pgbench workload,
>> so we have a single heavily contentious entry, which is the
>> worst case.
>>
>> The test shows that the spinlock (SpinDelay waits)
>> becomes an issue at high connection counts and will
>> become worse on larger machines. A sample_rate going from
>> 1 to .75 shows a 60% improvement; but this is on a single
>> contentious entry. Most workloads will likely not see this type
>> of improvement. I also could not really observe
>> this type of difference on smaller machines (i.e. 32 vCPUs),
>> as expected.
>>
>> ## init
>> pgbench -i -s500
>>
>> ### 192 connections
>> pgbench -c192 -j20 -S -Mprepared -T120 --progress 10
>>
>> sample_rate = 1
>> tps = 484338.769799 (without initial connection time)
>> waits
>> -----
>> 11107 SpinDelay
>> 9568 CPU
>> 929 ClientRead
>> 13 DataFileRead
>> 3 BufferMapping
>>
>> sample_rate = .75
>> tps = 909547.562124 (without initial connection time)
>> waits
>> -----
>> 12079 CPU
>> 4781 SpinDelay
>> 2100 ClientRead
>>
>> sample_rate = .5
>> tps = 1028594.555273 (without initial connection time)
>> waits
>> -----
>> 13253 CPU
>> 3378 ClientRead
>> 174 SpinDelay
>>
>> sample_rate = .25
>> tps = 1019507.126313 (without initial connection time)
>> waits
>> -----
>> 13397 CPU
>> 3423 ClientRead
>>
>> sample_rate = 0
>> tps = 1015425.288538 (without initial connection time)
>> waits
>> -----
>> 13106 CPU
>> 3502 ClientRead
>>
>> ### 32 connections
>> pgbench -c32 -j20 -S -Mprepared -T120 --progress 10
>>
>> sample_rate = 1
>> tps = 620667.049565 (without initial connection time)
>> waits
>> -----
>> 1782 CPU
>> 560 ClientRead
>>
>> sample_rate = .75
>> tps = 620663.131347 (without initial connection time)
>> waits
>> -----
>> 1736 CPU
>> 554 ClientRead
>>
>> sample_rate = .5
>> tps = 624094.688239 (without initial connection time)
>> waits
>> -----
>> 1741 CPU
>> 648 ClientRead
>>
>> sample_rate = .25
>> tps = 628638.538204 (without initial connection time)
>> waits
>> -----
>> 1702 CPU
>> 576 ClientRead
>>
>> sample_rate = 0
>> tps = 630483.464912 (without initial connection time)
>> waits
>> -----
>> 1638 CPU
>> 574 ClientRead
>>
>> Regards,
>>
>> Sami
>
> Thank you so much for benchmarking this on a pretty large machine with
> a large number of CPUs. The results look fantastic, and I truly
> appreciate your effort.
>
> BTW, I realized that the 'sampling' test needs to be added not only to
> the Makefile but also to meson.build. I've included that in the v14
> patch.
>
> --
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.

In my opinion, if we can't observe a spinlock bottleneck at 32 CPUs, we should determine the CPU count at which it becomes one. This will help us understand the scale of the problem. Does this make sense, or are there really no real workloads where the same query runs on more than 32 CPUs, and we've been trying to solve a non-existent problem?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
> and we've been trying to solve a non-existent problem?

Anecdotally speaking, most users will encounter bottlenecks due to exclusive locks on the pgss hash (dealloc and garbage collection) more often than they will on the spinlock. The spinlock will likely become more of an issue when you are not seeing much dealloc/gc and running on a very large machine. So, I don't think it's a non-existent problem, but I don't think it's the most common problem.

Regards,

Sami
> To summarize the results of all benchmarks, I compiled them into a table:

Thanks for compiling the benchmark data above.

The main benefit of this patch will be to give the user a toggle if they are observing high spinlock contention due to pg_stat_statements, which will likely occur on larger machines.

I also did not see this thread [1] mentioned in the thread above, but one of the reasons pg_stat_statements.track_planning was turned off by default was the additional spinlock acquisition needed to track the planning stats. Bringing this up because sample_rate might be beneficial as well if a user decides to track planning.

Regards,

Sami

[1] https://www.postgresql.org/message-id/flat/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com
On 04.02.2025 20:59, Sami Imseih wrote:
>> To summarize the results of all benchmarks, I compiled them into a table:
>
> Thanks for compiling the benchmark data above.
>
> The main benefit of this patch will be to give the user a toggle if they
> are observing high spinlock contention due to pg_stat_statements, which
> will likely occur on larger machines.
>
> I also did not see this thread [1] mentioned in the thread above, but one
> of the reasons pg_stat_statements.track_planning was turned off by default
> was the additional spinlock acquisition needed to track the planning stats.
> Bringing this up because sample_rate might be beneficial as well if a user
> decides to track planning.
>
> Regards,
>
> Sami
>
> [1] https://www.postgresql.org/message-id/flat/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com

Thanks for the thread. As we can see, simply enabling or disabling not only track_planning but also other pg_stat_statements parameters is too radical a measure for managing performance. With the introduction of sample_rate, users now have more flexibility in controlling spinlock contention. This allows them to balance overhead and statistics collection rather than completely disabling the feature.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
On 14.02.2025 16:17, Ilia Evdokimov wrote:
> Hi hackers,
>
> I've decided to explore a slightly different approach to reducing spinlock
> contention: introducing a simple execution time threshold. If a query's
> execution time exceeds this threshold, it is recorded in
> pg_stat_statements; otherwise, it is ignored. As Alexander [0] pointed out,
> this helps retain valuable queries for further analysis. A similar
> mechanism is already present in auto_explain and pg_store_plans. When
> pg_stat_statements.track_min_duration = -1, tracking is disabled; when it
> is 0, all statements are tracked.
>
> I benchmarked this approach using -M prepared -S on my machine with 48
> CPUs. However, I couldn't reproduce spinlock contention because the machine
> isn't large enough to create sufficient concurrency. Nevertheless, I'm
> sharing my results for reference and for checking that the threshold
> behaves correctly.
>
> Here's the benchmarking procedure I followed:
>
> createdb pgbench
> pgbench -i -s 3000 pgbench
> psql -c 'SELECT pg_stat_statements_reset()'
> pgbench -c 46 -j 46 -T 120 -M prepared -S --progress=10 pgbench
>
> select query, calls, min_exec_time, max_exec_time, mean_exec_time,
> stddev_exec_time from pg_stat_statements where query = 'SELECT abalance
> FROM pgbench_accounts WHERE aid = $1';
>
> track_min_duration |   calls   | min_exec_time | max_exec_time |    mean_exec_time    |  stddev_exec_time
> -------------------+-----------+---------------+---------------+----------------------+---------------------
>                  0 | 111282955 |       0.00365 |      15.56946 | 0.015042374707317802 | 0.06067634978916631
>                  5 |       458 |       5.00627 |     15.699129 |    5.962879746724887 |  1.1432124887616204
>                 10 |        14 |     10.538461 |     16.113204 |   12.415218999999999 |  1.5598854455354354
>                 20 |         - |             - |             - |                    - |                   -
>                 -1 |         - |             - |             - |                    - |                   -
>
> I'd greatly appreciate any feedback on this alternative approach, as well
> as benchmarking on a pretty large machine to see its impact at scale.
>
> [0]: https://www.postgresql.org/message-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w%3Dutk7thQcOMNr7Q%40mail.gmail.com
>
> --
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.

Hi hackers,

I rebased this patch to v18 to fix an inaccurate additional description of the GUC parameter. The -1 value should be described first, followed by 0.

Does anyone have other suggestions on how we could sample queries in pg_stat_statements to reduce spinlock contention on entries?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
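As a rough illustration of the duration-threshold behavior described above, a minimal sketch; the variable and helper names are assumptions, not necessarily those used in the attached patch:

/*
 * Hypothetical track_min_duration filter: -1 disables tracking entirely,
 * 0 tracks every statement, and a positive value (in milliseconds) tracks
 * only statements whose execution time reaches it.
 */
static int	pgss_track_min_duration = 0;	/* hypothetical GUC, in ms */

static bool
pgss_duration_tracked(double total_time_ms)
{
	if (pgss_track_min_duration < 0)
		return false;			/* tracking disabled */

	return total_time_ms >= (double) pgss_track_min_duration;
}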
I don't see v18 attached. But I think it's wrong that you changed the design of this patch (sample rate to duration-based) after it was ready for committer. This CF [0] should go back to "Needs Review" if this is the case.

--
Sami

[0] https://commitfest.postgresql.org/patch/5390/
> I don't see v18 attached.
Ah, sorry, I attached v18.
> But I think it's wrong that you changed the design of this patch (sample rate to duration-based) after it was ready for committer.
I've decided to reconsider how to better reduce the load on the spin-lock in pg_stat_statements.
If we look at the benchmark with 192 CPUs [0], we can see that even a small reduction in the frequency of incrementing entry counters gives a significant performance boost. For example, at sample_rate = 0.75, performance almost doubles, while at sample_rate < 0.5 the spinlock bottleneck is already eliminated. This suggests that even small adjustments to the frequency can significantly improve the situation.
But instead of blindly reducing the frequency via PRNG, we can take a more thoughtful approach with a threshold on execution time:
- Find the most frequent query by column 'calls' in pg_stat_statements;
- For this query, look at its execution-time info: min_exec_time, max_exec_time, etc.;
- Gradually increase the threshold from min_exec_time to max_exec_time, limiting the tracking of this query.
- Monitor performance: once the bottleneck is resolved, stop at the current threshold value.
This approach allows us to:
- Eliminate the spin-lock bottleneck;
- Preserve data about slow queries, which may be critical for performance analysis;
- Reduce the load on the most frequent queries causing contention, instead of uniformly reducing the frequency for all queries.
BTW, I track queries in pgss_ProcessUtility without checking execution time because these queries are unlikely to lead to a bottleneck.
Any other thoughts or objections?
> This CF [0] should go back to "Needs Review" if this is the case.
Done!
P.S. During the recent review, I noticed an interesting behavior: when we introduced \parse, changing 'track' from 'none' to 'all' might lead to undefined behavior. The steps are:
postgres=# SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 t
---
 t
(1 row)

postgres=# SET pg_stat_statements.track = "none";
SET
postgres=# SELECT 1 \parse stmt
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=# SET pg_stat_statements.track = "all";
SET
postgres=# \bind_named stmt \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
  query   | calls
----------+-------
 SELECT 1 |     1
(1 row)
As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
> As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?

No, this behavior is expected. The ability to normalize a statement is only available during post_parse_analyze (pgss_post_parse_analyze), as this is when we have access to JumbleState.

In your example, pgss_post_parse_analyze is skipped, and during the subsequent \bind_named, ExecutorEnd (pgss_ExecutorEnd) does not have access to JumbleState, leading to the query being stored without normalization.

It would be nice to make JumbleState available to all hooks, IMO.

--
Sami
> But instead of blindly reducing the frequency via PRNG, we can take a more thoughtful approach with a threshold on execution time:
> - Find the most frequent query by column 'calls' in pg_stat_statements;
> - For this query, look at its execution-time info: min_exec_time, max_exec_time, etc.;
> - Gradually increase the threshold from min_exec_time to max_exec_time, limiting the tracking of this query.
> - Monitor performance: once the bottleneck is resolved, stop at the current threshold value.
>
> This approach allows us to:
> - Eliminate the spin-lock bottleneck;
> - Preserve data about slow queries, which may be critical for performance analysis;
> - Reduce the load on the most frequent queries causing contention, instead of uniformly reducing the frequency for all queries.

In my opinion, sample rate is a better fit for pg_stat_statements, since the queries that you care about the most are usually the most frequently executed. Sampling them will still provide enough good data without the risk of not capturing statistics about them at all.

Longer-running queries will also likely be the least frequent, so they are already not likely contributing to the spinlock contention. Also, the least frequent queries will likely be aged out faster, so pg_stat_statements was never really a good candidate to track those anyway; slow query logging with log_min_duration_statement is a better way to ensure you capture the data.

Maybe others have a different opinion?

--
Sami
On 20.02.2025 03:32, Sami Imseih wrote:
>> As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?
>
> No, this behavior is expected. The ability to normalize a statement is
> only available during post_parse_analyze (pgss_post_parse_analyze), as
> this is when we have access to JumbleState.
>
> In your example, pgss_post_parse_analyze is skipped, and during the
> subsequent \bind_named, ExecutorEnd (pgss_ExecutorEnd) does not have
> access to JumbleState, leading to the query being stored without
> normalization.
>
> It would be nice to make JumbleState available to all hooks, IMO.
>
> --
> Sami

Got it. I think it's worth considering, but not within the scope of this thread.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.