Thread: Re: Sample rate added to pg_stat_statements

Re: Sample rate added to pg_stat_statements

From
"Andrey M. Borodin"
Date:
> On 18 Nov 2024, at 23:33, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> Hi hackers,
>
> Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces
> substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable
> pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One
> potential solution to this issue could be query sampling in pg_stat_statements.
>
> A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very
> useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for
> improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate, for the
> pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in
> pg_stat_statements.
>
> This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is
> viable and applicable.

+1 for the idea. I've heard a lot of complaints that pgss is costly. Most of them were using it wrong, though. But at
least it would give an easy way to rule out the performance impact of pgss.

> On 19 Nov 2024, at 15:09, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> I believe we should also include this check in the pgss_ExecutorEnd() function, because sampling in pgss_ExecutorEnd()
> ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the
> pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured
> while maintaining efficient sampling.

Is there a reason why pgss_ProcessUtility is excluded?


Best regards, Andrey Borodin.


Re: Sample rate added to pg_stat_statements

From
Greg Sabino Mullane
Date:
On Tue, Nov 19, 2024 at 7:12 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
+1 for the idea. I've heard a lot of complaints that pgss is costly. Most of them were using it wrong, though.

I'm curious what "using it wrong" means exactly?

Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.

Cheers,
Greg

Re: Sample rate added to pg_stat_statements

From
Greg Sabino Mullane
Date:
On Tue, Nov 19, 2024 at 5:07 PM Michael Paquier <michael@paquier.xyz> wrote:
One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core.

Any particular reason these days we cannot push this into core and allow disabling on startup? To say this extension is widely used would be an understatement.

Cheers,
Greg

Re: Sample rate added to pg_stat_statements

From
"Andrey M. Borodin"
Date:
> On 19 Nov 2024, at 17:39, Greg Sabino Mullane <htamfids@gmail.com> wrote:
>
> I'm curious what "using it wrong" means exactly?

Here's an example. pgSCV is querying pgss for every database separately every minute. It makes sense for the project.
But when you have ~1000 databases, you have a lot of traffic to pgss alone. Even if you have only one active database,
because all pgss results must be put into a tuplestore before filtering. See [0] for details.


Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/1AEEB240-9B68-44D5-8A29-8F9FDB22C801%40yandex-team.ru


Re: Sample rate added to pg_stat_statements

From
Alexander Korotkov
Date:
On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:
> > Oh, and a +1 in general to the patch, OP, although it would also be nice to
> > start finding the bottlenecks that cause such performance issues.
>
> FWIW, I'm not eager to integrate this proposal without looking at this
> exact argument in depth.
>
> One piece of it would be to see how much of such "bottlenecks" we
> would be able to get rid of by integrating pg_stat_statements into
> the central pgstats with the custom APIs, without pushing the module
> into core.  This means that we would combine the existing hash of pgss
> to shrink to 8 bytes for objid rather than 13 bytes now as the current
> code relies on (toplevel, userid, queryid) for the entry lookup (entry
> removal is sniped with these three values as well, or dshash seq
> scans).  The odds of conflicts still play in our favor even if
> we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then the spinlock in pgss_store() can become quite
a bottleneck.  And I'm not sure switching all the counters to
atomics would improve the situation, given we already have
quite a few counters.

I'm generally +1 for the approach taken in this thread.  But I would
suggest introducing a threshold value for query execution time, and
sample just the queries below that threshold.  A slower query
shouldn't be sampled, because it can't be too frequent, and it
could be more valuable to count it individually (while very fast
queries probably only matter "on average").
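
As a rough illustration, that scheme could look like this (the GUC names
and helper are hypothetical, purely for the sketch):

    /*
     * Sketch: queries at or above the threshold are always tracked;
     * those below it are sampled uniformly at pgss_sample_rate.
     */
    static bool
    query_is_tracked(double total_time_ms)
    {
        if (total_time_ms >= pgss_exectime_threshold_ms)
            return true;        /* slow queries are never sampled out */

        return pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
    }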

------
Regards,
Alexander Korotkov
Supabase



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:


On 26.11.2024 01:15, Ilia Evdokimov wrote:


On 22.11.2024 09:08, Alexander Korotkov wrote:
On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:
Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.
FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core.  This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans).  The odds of conflicts still play in our favor even if
we have a few million entries, or even ten times that.
If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then the spinlock in pgss_store() can become quite
a bottleneck.  And I'm not sure switching all the counters to
atomics would improve the situation, given we already have
quite a few counters.

I'm generally +1 for the approach taken in this thread.  But I would
suggest introducing a threshold value for query execution time, and
sample just the queries below that threshold.  A slower query
shouldn't be sampled, because it can't be too frequent, and it
could be more valuable to count it individually (while very fast
queries probably only matter "on average").

------
Regards,
Alexander Korotkov
Supabase

I really liked your idea, and I’d like to propose an enhancement that I believe improves it further.

Yes, if a query’s execution time exceeds the threshold, it should always be tracked without sampling. However, for queries with execution times below the threshold, the sampling logic should skip shorter queries more aggressively than those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled out, while queries closer to the threshold are the least likely to be skipped.

This behavior can be achieved with the following logic, where a query is recorded when the condition holds:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here’s how it works:

  • As a query’s execution time approaches zero, the probability of it being sampled out approaches one.
  • Conversely, as a query’s execution time approaches the threshold, the probability of it being sampled out approaches zero.

In other words, the probability of skipping a query decreases linearly from 1 to 0 as its execution time approaches the threshold, so the probability of recording a query is its execution time divided by the threshold.
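
A minimal sketch of this rule (variable names are illustrative, not the patch's actual ones):

    /*
     * Record the query iff threshold * random(0,1) < total_time, so below
     * the threshold P(recorded) = total_time / threshold, and anything at
     * or above the threshold is always recorded.
     */
    static bool
    query_time_tracked(double total_time_ms)
    {
        if (total_time_ms >= pgss_exectime_threshold_ms)
            return true;

        return pgss_exectime_threshold_ms *
               pg_prng_double(&pg_global_prng_state) < total_time_ms;
    }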

I believe this approach offers an ideal user experience. I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables or documentation. I’m always open to discussion.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


I’ve been closely reviewing my latest patch (v5-*.patch) implementing time-based sampling, and I’ve realized that it might not be the best approach. Let me explain the reasons.

  • We can only perform sampling before the 'pgss_planner()' function. However, at that point, we don’t yet know the query's execution time since it only becomes available during 'pgss_ExecutorEnd()' or 'pgss_ProcessUtility()';
  • If we wait to sample until the execution completes and we have the actual execution time, this introduces a problem. By that point, we might have already recorded the query's statistics into shared memory from 'pgss_planner()', making it too late to decide whether to sample the query;
  • Delaying sampling until the execution finishes would require waiting for the execution time, which could introduce performance overhead. This defeats the purpose of sampling, which aims to reduce the cost of tracking queries.

I believe we should reconsider the approach and revert to sampling based on v4-*.patch. If I’ve missed anything or there’s an alternative way to implement time threshold-based sampling efficiently, I’d be grateful to hear your insights.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Sample rate added to pg_stat_statements

From
"Andrey M. Borodin"
Date:
> On 6 Jan 2025, at 15:50, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
> Any suggestions for improvements?

The patch looks good to me, just a few nits.

0. Perhaps, we could have a test for edge values of 0 and 1. I do not insist, just an idea to think about.

1. This code seems a little different from your patch. It is trying to avoid engaging the PRNG. I'm not sure it's a good
idea, but still. Also, it uses "<=", not "<".

    xact_is_sampled = log_xact_sample_rate != 0 &&
        (log_xact_sample_rate == 1 ||
         pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);
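
Adapted to pgss, the same short-circuit pattern would read roughly like this (a sketch only):

    /* Avoids the PRNG call entirely for the edge values 0 and 1. */
    current_query_sampled = pgss_sample_rate != 0.0 &&
        (pgss_sample_rate == 1.0 ||
         pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);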

Thanks!


Best regards, Andrey Borodin.


Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
Hi,

I was looking at this patch, and I was specifically curious about
how this works with prepared statements. The way the patch
works now is that it determines if the statement is to be sampled
at post_parse_analyze time, which could lead to unexpected
behavior.

Let's take an example below in which the
pg_stat_statements.sample_rate is set to 0 ( to mimic
some sampling rate < 1 in which this query does not
get sampled ). At that point, all subsequent executions
of the statement will not get tracked at all. Is this
what is expected for prepared statements? My concern
is we will lose even more stats than what a user
may expect.

This, of course, will not be an issue for the simple query protocol.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
   pg_stat_statements_reset
-------------------------------
 2025-01-06 11:45:23.484793-06
(1 row)

postgres=# SELECT $1 \parse stmt

postgres=#
postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# \bind_named stmt 1 \g
 ?column?
----------
 1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

Regards,

Sami Imseih
Amazon Web Services (AWS)



Re: Sample rate added to pg_stat_statements

From
"Andrey M. Borodin"
Date:
> On 7 Jan 2025, at 16:05, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:
>
>> 1. This code seems a little different from your patch. It is trying to avoid engaging the PRNG. I'm not sure it's a good
>> idea, but still. Also, it uses "<=", not "<".
>>
>>     xact_is_sampled = log_xact_sample_rate != 0 &&
>>         (log_xact_sample_rate == 1 ||
>>          pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);
>>
> Are we sure we're discussing the same patch? Because these remarks refer to version 5 of the patch, which I
> abandoned due to your remarks.

Yes. v6 has this code
+    if (nesting_level == 0)
+        current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);

while upstream has the code that I cited, and the logic is slightly different.


Best regards, Andrey Borodin.


Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
>> You are right. This is absolutely unexpected for users. Thank you for
>> the review.
>>
>> To fix this, I suggest storing a random number in the [0, 1) range in a
>> separate variable, which will be used for comparisons in any place. We
>> will compare 'sample_rate' and random value not only in
>> pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
>> pgss_planner().
>>
>> I attached patch with test and fixes.

I still think this is not the correct approach. It seems that
post_parse_analyze should not have to deal with checking for a sample
rate. This is because post_parse_analyze, which is the only hook with
access to jstate, is only responsible for storing a normalized query
text on disk and creating a not-yet user-visible entry in the hash,
i.e. pgss_store will never increment counters when called from
pgss_post_parse_analyze.

/* Increment the counts, except when jstate is not NULL */
if (!jstate)
{

What I think may be a better idea is to add something similar
to auto_explain.c, but it should only be added to the top of
pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.

    if (nesting_level == 0)
    {
        if (!IsParallelWorker())
            current_query_sampled = pg_prng_double(&pg_global_prng_state) <
                                    pgss_sample_rate;
        else
            current_query_sampled = false;
    }

This is needed for ExecutorStart and not ExecutorEnd because
ExecutorStart is where the instrumentation is initialized with
queryDesc->totaltime. The above code block could be
turned into a macro and reused in the routines mentioned.
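
For instance, such a macro might look like this (a sketch, not code from the patch):

    /* Hypothetical macro wrapping the sampling decision shown above. */
    #define PGSS_UPDATE_CURRENT_QUERY_SAMPLED() \
        do { \
            if (nesting_level == 0) \
            { \
                if (!IsParallelWorker()) \
                    current_query_sampled = \
                        pg_prng_double(&pg_global_prng_state) < pgss_sample_rate; \
                else \
                    current_query_sampled = false; \
            } \
        } while (0)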

However, it seems with this change, we can see a much
higher likelihood of non-normalized query texts being stored.
This is because jstate is only available during post_parse_analyze.
Therefore, if the first time you are sampling the statement is in ExecutorEnd,
then you will end up storing a non-normalized version of the query text,
see the example below with the attached v8.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
   pg_stat_statements_reset
-------------------------------
 2025-01-07 13:02:11.807952-06
(1 row)

postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=#  \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
                    query                    | calls
---------------------------------------------+-------
 SELECT 1                                    |     1
 SELECT query, calls FROM pg_stat_statements |     1
(2 rows)

postgres=#  \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
                    query                    | calls
---------------------------------------------+-------
 SELECT 1                                    |     2
 SELECT query, calls FROM pg_stat_statements |     2
(2 rows)

One idea is to make jstate available to all hooks, and completely
remove reliance on post_parse_analyze in pg_stat_statements.
I can't find the thread now, but I know this has come up in past discussions
when troubleshooting gaps in query normalization. My concern is this
feature will greatly increase the likelihood of non-normalized query texts.

Also, with regards to the benchmarks, it seems sampling will be
beneficial for workloads that touch a small number of entries with
high concurrency. This is why we see a benefit for a standard pgbench
workload. Sampling becomes less beneficial when there is a large set
of queries being updated. I still think this is a good approach for
workloads that touch a small set of entries.

Regards,

Sami



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 07.01.2025 22:29, Sami Imseih wrote:
>>> You are right. This is absolutely unexpected for users. Thank you for
>>> the review.
>>>
>>> To fix this, I suggest storing a random number in the [0, 1) range in a
>>> separate variable, which will be used for comparisons in any place. We
>>> will compare 'sample_rate' and random value not only in
>>> pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
>>> pgss_planner().
>>>
>>> I attached patch with test and fixes.
> I still think this is not the correct approach. It seems that post_parse_analyze
> should not have to deal with checking for a sample rate. This is because
> post_parse_analyze, which is the only hook with access to jstate, is
> only responsible
> for storing a normalized query text on disk and creating a not-yet
> user visible entry
> in the hash. i.e. pgss_store will never increment counters when called from
> pgss_post_parse_analyze.
>
> /* Increment the counts, except when jstate is not NULL */
> if (!jstate)
> {
>
> What I think may be a better idea is to add something similar
> to auto_explain.c, but it should only be added to the top of
> pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.
>
> if (nesting_level == 0)
> {
>     if (!IsParallelWorker())
>         current_query_sampled = pg_prng_double(&pg_global_prng_state) <
>                                 pgss_sample_rate;
>     else
>         current_query_sampled = false;
> }
>
> This is needed for ExecutorStart and not ExecutorEnd because
> ExecutorStart is where the instrumentation is initialized with
> queryDesc->totaltime. The above code block could be
> turned into a macro and reused in the routines mentioned.
>
> However, it seems with this change, we can see a much
> higher likelihood of non-normalized query texts being stored.
> This is because jstate is only available during post_parse_analyze.
> Therefore, if the first time you are sampling the statement is in ExecutorEnd,
> then you will end up storing a non-normalized version of the query text,
> see below example with the attached v8.
>
> postgres=# set pg_stat_statements.sample_rate = 0;
> SET
> postgres=# select pg_stat_statements_reset();
>     pg_stat_statements_reset
> -------------------------------
>   2025-01-07 13:02:11.807952-06
> (1 row)
>
> postgres=# SELECT 1 \parse stmt
>
> postgres=# \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# SELECT query, calls FROM pg_stat_statements;
>   query | calls
> -------+-------
> (0 rows)
>
> postgres=# set pg_stat_statements.sample_rate = 1;
> SET
> postgres=# \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# SELECT query, calls FROM pg_stat_statements;
>   query | calls
> -------+-------
> (0 rows)
>
> postgres=#  \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# SELECT query, calls FROM pg_stat_statements;
>                      query                    | calls
> ---------------------------------------------+-------
>   SELECT 1                                    |     1
>   SELECT query, calls FROM pg_stat_statements |     1
> (2 rows)
>
> postgres=#  \bind_named stmt \g
>   ?column?
> ----------
>          1
> (1 row)
>
> postgres=# SELECT query, calls FROM pg_stat_statements;
>                      query                    | calls
> ---------------------------------------------+-------
>   SELECT 1                                    |     2
>   SELECT query, calls FROM pg_stat_statements |     2
> (2 rows)
>
> One idea is to make jstate available to all hooks, and completely
> remove reliance on post_parse_analyze in pg_stat_statements.
> I can't find the thread now, but I know this has come up in past discussions
> when troubleshooting gaps in query normalization. My concern is this
> feature will greatly increase the likelihood of non-normalized query texts.
>
> Also, with regards to the benchmarks, it seems sampling will be
> beneficial for workloads that touch a small number of entries with
> high concurrency. This is why we see a benefit for a standard pgbench
> workload. Sampling becomes less beneficial when there is a large set
> of queries being updated. I still think this is a good approach for
> workloads that touch a small set of entries.
>
> Regards,
>
> Sami
>
>

Wow, thank you for pointing this out. Your solution looks interesting,
but I'd like to explore other ways to solve the issue besides making
JumbleState available to all hooks. If I don't find anything better,
I'll go with yours.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 08.01.2025 22:39, Ilia Evdokimov wrote:
>
> After the changes in v9-patch, won’t all the necessary queries now be 
> normalized? Since there are no longer any restrictions in the parser 
> hook, queries will be normalized as usual, and pgss_planner, 
> pgss_ExecutorStart, and ExecutorEnd will simply fetch them from 
> the 'pgss_query_texts.stat' file.
>
> For now, this seems like the simplest approach instead of making 
> JumbleState available to other hooks. Moreover, if we do proceed with 
> that, we might be able to remove the 'pgss_query_texts.stat' file 
> altogether, along with other improvements. In my view, if no other 
> options arise, we should address making JumbleState available to other 
> hooks in a separate thread. If you have any suggestions, I'm always 
> open to feedback.
>
> I attached v9 patch with changes and test above.


Unfortunately, these changes do not achieve the intended sampling goal. 
I looked into this more deeply: while the sampled-out queries do not 
appear in pg_stat_statements, an entry is still allocated in the hash 
table after normalization, which, in my view, should not happen when 
sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
> Unfortunately, these changes do not achieve the intended sampling goal.
> I looked into this more deeply: while the sampled-out queries do not
> appear in pg_stat_statements, an entry is still allocated in the hash
> table after normalization, which, in my view, should not happen when
> sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

pg_stat_statements creates entries as "sticky" initially to give them
more time to stay in the hash before the first execution completes.
It's not perfect, but it works for the majority of cases. So, what you
are observing is how pg_stat_statements currently works.

If an entry is popular enough, we will need it anyways ( even
with the proposed sampling ). An entry that's not popular will
eventually be aged out.

From my understanding, what the proposed sampling will do is
reduce the overhead of incrementing the counters of popular entries,
because of the spinlock taken to update the counters. This is particularly
the case with high concurrency on large machines ( high cpu count ),
and especially when there is a small set of popular entries.
IMO, this patch should also have a benchmark that proves
that a user can benefit from sampling in those types of
workloads.

Regards,

Sami



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 09.01.2025 05:29, Sami Imseih wrote:
>> Unfortunately, these changes do not achieve the intended sampling goal.
>> I looked into this more deeply: while the sampled-out queries do not
>> appear in pg_stat_statements, an entry is still allocated in the hash
>> table after normalization, which, in my view, should not happen when
>> sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.
> pg_stat_statements creates entries as "sticky" initially to give them
> more time to stay in the hash before the first execution completes.
> It's not perfect, but it works for the majority of cases. So, what you
> are observing is how pg_stat_statements currently works.
>
> If an entry is popular enough, we will need it anyways ( even
> with the proposed sampling ). An entry that's not popular will
> eventually be aged out.
>
>   From my understanding, what the proposed sampling will do is
> reduce the overhead of incrementing the counters of popular entries,
> because of the spinlock taken to update the counters. This is particularly
> the case with high concurrency on large machines ( high cpu count ),
> and especially when there is a small set of popular entries.
> IMO, this patch should also have a benchmark that proves
> that a user can benefit from sampling in those types of
> workloads.


Ah, so patch version 9 might be the best fit to achieve this. I’ll need 
to benchmark it on a large, high-concurrency machine then.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
Hi,

I have a few comments about the patch.

1/
tests are missing for pg_stat_statements.track=all and with a sample
rate < 1 applied.
Can we add tests for this case?

2/

This declaration:

+static bool current_query_sampled = false;

should be moved near the declaration of nesting_level,
the other local variable.


3/
I do not really like the calls to update_current_query_sampled()
the way they are, and I think pgss_enabled should encapsulate
the new sample rate check as well.

My thoughts are:

1/ change the name of the function from update_current_query_sampled
to current_query_sampled

Also, the static bool should be called is_query_sampled or something like that.

The function should also allow you to skip the sample rate check, such as if
called from pgss_post_parse_analyze.

We can also remove the IsParallelWorker() check in this case, since that
is done in pgss_enabled.

something like this is what I am thinking:

static bool
current_query_sampled(bool skip)
{
    if (skip)
        return true;

    if (nesting_level == 0)
    {
        is_query_sampled = pgss_sample_rate != 0.0 &&
            (pgss_sample_rate == 1.0 ||
             pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
    }

    return is_query_sampled;
}

#define pgss_enabled(level, skip_sampling_check) \
    (!IsParallelWorker() && \
     (pgss_track == PGSS_TRACK_ALL || \
      (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
     current_query_sampled(skip_sampling_check))
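
A call site would then reduce to something like this (illustrative only):

    /* e.g. at the top of pgss_ExecutorStart */
    if (pgss_enabled(nesting_level, false))
    {
        /* ... set up instrumentation and track the query ... */
    }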

What do you think?

4/ Now we have pg_stat_statements.track = "none", which is effectively
the same thing as pg_stat_statements.sample_rate = 0. Does this need
to be called out in docs?

Regards,

Sami



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 28.01.2025 23:50, Ilia Evdokimov wrote:
>
>>
>>> If anyone has the capability to run this benchmark on machines with 
>>> more
>>> CPUs or with different queries, it would be nice. I’d appreciate any
>>> suggestions or feedback.
>> I wanted to share some additional benchmarks I ran as well
>> on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured
>> with 16GB of shared_buffers. I also attached the benchmark.sh
>> script used to generate the output.
>> The benchmark is running the select-only pgbench workload,
>> so we have a single heavily contentious entry, which is the
>> worst case.
>>
>> The test shows that the spinlock (SpinDelay waits)
>> becomes an issue at high connection counts and will
>> become worse on larger machines. A sample_rate going from
>> 1 to .75 shows a 60% improvement; but this is on a single
>> contentious entry. Most workloads will likely not see this type
>> of improvement. I also could not really observe
>> this type of difference on smaller machines ( i.e. 32 vCPUs),
>> as expected.
>>
>> ## init
>> pgbench -i -s500
>>
>> ### 192 connections
>> pgbench -c192 -j20 -S -Mprepared -T120 --progress 10
>>
>> sample_rate = 1
>> tps = 484338.769799 (without initial connection time)
>> waits
>> -----
>>    11107  SpinDelay
>>     9568  CPU
>>      929  ClientRead
>>       13  DataFileRead
>>        3  BufferMapping
>>
>> sample_rate = .75
>> tps = 909547.562124 (without initial connection time)
>> waits
>> -----
>>    12079  CPU
>>     4781  SpinDelay
>>     2100  ClientRead
>>
>> sample_rate = .5
>> tps = 1028594.555273 (without initial connection time)
>> waits
>> -----
>>    13253  CPU
>>     3378  ClientRead
>>      174  SpinDelay
>>
>> sample_rate = .25
>> tps = 1019507.126313 (without initial connection time)
>> waits
>> -----
>>    13397  CPU
>>     3423  ClientRead
>>
>> sample_rate = 0
>> tps = 1015425.288538 (without initial connection time)
>> waits
>> -----
>>    13106  CPU
>>     3502  ClientRead
>>
>> ### 32 connections
>> pgbench -c32 -j20 -S -Mprepared -T120 --progress 10
>>
>> sample_rate = 1
>> tps = 620667.049565 (without initial connection time)
>> waits
>> -----
>>     1782  CPU
>>      560  ClientRead
>>
>> sample_rate = .75
>> tps = 620663.131347 (without initial connection time)
>> waits
>> -----
>>     1736  CPU
>>      554  ClientRead
>>
>> sample_rate = .5
>> tps = 624094.688239 (without initial connection time)
>> waits
>> -----
>>     1741  CPU
>>      648  ClientRead
>>
>> sample_rate = .25
>> tps = 628638.538204 (without initial connection time)
>> waits
>> -----
>>     1702  CPU
>>      576  ClientRead
>>
>> sample_rate = 0
>> tps = 630483.464912 (without initial connection time)
>> waits
>> -----
>>     1638  CPU
>>      574  ClientRead
>>
>> Regards,
>>
>> Sami
>
>
> Thank you so much for benchmarking this on a pretty large machine with 
> a large number of CPUs. The results look fantastic, and I truly 
> appreciate your effort.
>
> BTW, I realized that the 'sampling' test needs to be added not only to 
> the Makefile but also to meson.build. I've included that in the v14 
> patch.
>
> -- 
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.


In my opinion, if we can't observe the spinlock bottleneck on 32 CPUs, we 
should determine the CPU count at which it appears. This will help us 
understand the scale of the problem. Does this make sense, or are there 
really no real workloads where the same query runs on more than 32 CPUs, 
and we've been trying to solve a non-existent problem?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
> and we've been trying to solve a non-existent problem?

Anecdotally speaking, most users will encounter bottlenecks
due to exclusive locks on the pgss hash ( dealloc and garbage collection )
more often than they will on the spinlock. The spinlock will likely
become more of an issue when you are not seeing much dealloc/gc
and running on a very large machine. So, I don't think it's a non-existent
problem, but I don't think it's the most common problem.

Regards,

Sami



Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
> To summarize the results of all benchmarks, I compiled them into a table:
Thanks for compiling the benchmark data above.

The main benefit of this patch will be to give the user
a toggle if they are observing high spinlock contention
due to pg_stat_statements which will likely occur
on larger machines.

I also did not see this thread [1] mentioned in the thread above,
but one of the reasons pg_stat_statements.track_planning
was turned off by default was the additional spinlock
acquisition to track the planning stats. I'm bringing this up because
sample_rate might also be beneficial if a user decides to track planning.

Regards,

Sami

[1] https://www.postgresql.org/message-id/flat/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 04.02.2025 20:59, Sami Imseih wrote:
>> To summarize the results of all benchmarks, I compiled them into a table:
> Thanks for compiling the benchmark data above.
>
> The main benefit of this patch will be to give the user
> a toggle if they are observing high spinlock contention
> due to pg_stat_statements which will likely occur
> on larger machines.
>
> I also did not see this thread [1] mentioned in the thread above,
> but one of the reasons pg_stat_statements.track_planning
> was turned off by default was the additional spinlock
> acquisition to track the planning stats. I'm bringing this up because
> sample_rate might also be beneficial if a user decides to track planning.
>
> Regards,
>
> Sami
>
> [1] https://www.postgresql.org/message-id/flat/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com


Thanks for the thread. As we can see, simply enabling or disabling not 
only track_planning but also other pg_stat_statements parameters is too 
radical a measure for managing performance. With the introduction of 
sample_rate, users now have more flexibility in controlling spinlock 
contention. This allows them to balance overhead and statistics 
collection rather than completely disabling the feature.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 14.02.2025 16:17, Ilia Evdokimov wrote:
> Hi hackers,
>
> I've decided to explore a slightly different approach to reducing 
> spinlock contention—by introducing a simple execution time threshold. 
> If a query’s execution time exceeds this threshold, it is recorded in 
> pg_stat_statements; otherwise, it is ignored. As Alexander [0] pointed 
> out, this helps retain valuable queries for further analysis. A 
> similar mechanism is already present in auto_explain and 
> pg_store_plans. When pg_stat_statements.track_min_duration = -1, 
> tracking is disabled. When pg_stat_statements.track_min_duration = 0, 
> all statements are tracked.
>
> I benchmarked this approach using -M prepared -S on my machine with 48 
> CPUs. However, I couldn’t reproduce spinlock contention because the 
> machine isn’t large enough to create sufficient concurrency. 
> Nevertheless, I’m sharing my results for reference and to verify 
> that the threshold behaves correctly.
>
> Here’s the benchmarking procedure I followed:
>     createdb pgbench
>     pgbench -i -s 3000 pgbench
>     psql -c 'SELECT pg_stat_statements_reset()'
>     pgbench -c 46 -j 46 -T 120 -M prepared -S --progress=10 pgbench
>
> select query, calls, min_exec_time, max_exec_time, mean_exec_time, 
> stddev_exec_time from pg_stat_statements where query = 'SELECT 
> abalance FROM pgbench_accounts WHERE aid = $1';
>
> track_min_duration |   calls   | min_exec_time | max_exec_time |    mean_exec_time    |  stddev_exec_time
> -------------------+-----------+---------------+---------------+----------------------+---------------------
>                  0 | 111282955 |       0.00365 |      15.56946 | 0.015042374707317802 |  0.06067634978916631
>                  5 |       458 |       5.00627 |     15.699129 |    5.962879746724887 |   1.1432124887616204
>                 10 |        14 |     10.538461 |     16.113204 |   12.415218999999999 |   1.5598854455354354
>                 20 |         - |             - |             - |                    - |                    -
>                 -1 |         - |             - |             - |                    - |                    -
>
> I’d greatly appreciate any feedback on this alternative approach, as 
> well as benchmarking on a pretty large machine to see its impact at 
> scale.
>
> [0]: 
> https://www.postgresql.org/message-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w%3Dutk7thQcOMNr7Q%40mail.gmail.com
>
> -- 
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.


Hi hackers,

I rebased the patch (v18) to fix an inaccurate description of the GUC 
parameter: the -1 value should be described first, followed by 0.

Does anyone have other suggestions on how we could sample queries in 
pg_stat_statements to reduce spin-lock contention on entries?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.





Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
I don't see v18 attached.

But, I think it's wrong that you changed the design of this patch
(from sample-rate to duration-based) after it was ready for committer.

This CF [0] should go back to "Needs Review" if this is the case.

--

Sami

[0] https://commitfest.postgresql.org/patch/5390/



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:


On 19.02.2025 22:11, Sami Imseih wrote:
I don't see v18 attached.


Ah, sorry, I attached v18.


But, I think it's wrong that you changed the design of this patch
(sample rate to duration based ) after it was ready for committer.



I've decided to reconsider how to better reduce the load on the spin-lock in pg_stat_statements.

If we look at the benchmark with 192 CPUs [0], we can see that even a small reduction in the frequency of incrementing entry counters gives a significant performance boost. For example, at sample_rate = 0.75, performance almost doubles, while at sample_rate < 0.5, the spin-lock bottleneck is already eliminated. This suggests that even small adjustments to the frequency can significantly improve the situation.

But instead of blindly reducing the frequency via PRNG, we can take a more thoughtful approach with an execution-time threshold:

  • Find the most frequent query by column 'calls' in pg_stat_statements;
  • In this query, look at its execution-time info: min_exec_time, max_exec_time, etc;
  • Gradually increase the threshold from min_exec_time to max_exec_time, limiting the tracking of this query.
  • Monitor performance: once the bottleneck is resolved, stop at the current threshold value.

This approach allows us to:

  • Eliminate the spin-lock bottleneck;
  • Preserve data about slow queries, which may be critical for performance analysis;
  • Reduce the load on the most frequent queries causing contention, instead of uniformly reducing the frequency for all queries.

BTW, I track queries in pgss_ProcessUtility without checking execution time, because these queries are unlikely to lead to a bottleneck.
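
For clarity, a duration gate like the one described could look roughly like this (names assumed, not the actual patch):

    /*
     * Hypothetical gate, e.g. in pgss_ExecutorEnd: -1 disables tracking,
     * 0 tracks everything, otherwise only queries whose execution time
     * (in ms) reaches the threshold are stored.
     */
    static bool
    duration_is_tracked(double total_time_ms)
    {
        if (pgss_track_min_duration < 0)
            return false;
        return total_time_ms >= (double) pgss_track_min_duration;
    }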

Any other thoughts or objections?


This CF [0] should go back to "Needs Review" if this is the case.


Done!

P.S. During the recent review, I noticed an interesting behavior: with \parse, changing 'track' from 'none' to 'all' may lead to undefined behavior. The steps are:

postgres=# SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 t
---
 t
(1 row)

postgres=# SET pg_stat_statements.track = "none";
SET
postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
 query | calls
-------+-------
(0 rows)

postgres=# SET pg_stat_statements.track = "all";
SET
postgres=# \bind_named stmt \g
 ?column?
----------
        1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
  query   | calls
----------+-------
 SELECT 1 |     1
(1 row)

As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?

[0]: https://www.postgresql.org/message-id/CAA5RZ0vxn_UiUkUK5SdngObLZzw40RhafB7SHydxzBw2_xjjiA%40mail.gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
> As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?

No, this behavior is expected. The ability to normalize a statement is
only available during post_parse_analyze (pgss_post_parse_analyze), as
this is when we have access to JumbleState.

In your example, pgss_post_parse_analyze is skipped, and during the
subsequent \bind_named, ExecutorEnd (pgss_ExecutorEnd) does not have
access to JumbleState, leading to the query being stored without
normalization.

It will be nice to make JumbleState available to all hooks, IMO.

--
Sami



Re: Sample rate added to pg_stat_statements

From
Sami Imseih
Date:
> But instead of blindly reducing the frequency via PRNG, we can take a more thoughtful approach with an execution-time
> threshold:
>
> Find the most frequent query by column 'calls' in pg_stat_statements;
> In this query, look at its execution-time info: min_exec_time, max_exec_time, etc;
> Gradually increase the threshold from min_exec_time to max_exec_time, limiting the tracking of this query.
> Monitor performance: once the bottleneck is resolved, stop at the current threshold value.
>
> This approach allows us to:
>
> Eliminate the spin-lock bottleneck;
> Preserve data about slow queries, which may be critical for performance analysis;
> Reduce the load on the most frequent queries causing contention, instead of uniformly reducing the frequency for all
> queries.

In my opinion, sample rate is a better fit for pg_stat_statements,
since the queries that
you care about the most are usually the most frequently executed. Sampling them
will still provide enough good data without the risk of not capturing
statistics about
them at all.

Longer running queries will also likely be the least frequent, so they
are already not likely contributing to the spinlock contention. Also,
the least frequent queries will likely be aged out faster, so
pg_stat_statements was never really a good candidate to track those
anyway; slow query logging with log_min_duration_statement is a better
way to ensure you capture the data.

Maybe others may have a different opinion?

--

Sami



Re: Sample rate added to pg_stat_statements

From
Ilia Evdokimov
Date:
On 20.02.2025 03:32, Sami Imseih wrote:
>> As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?
> No, this behavior is expected. The ability to normalize a statement is
> only available during post_parse_analyze (pgss_post_parse_analyze), as
> this is when we have access to JumbleState.
>
> In your example, pgss_post_parse_analyze is skipped, and during the
> subsequent \bind_named, ExecutorEnd (pgss_ExecutorEnd) does not have
> access to JumbleState, leading to the query being stored without
> normalization.
>
> It will be nice to make JumbleState available to all hooks, IMO.
>
> --
> Sami


Got it. I think it's worth considering, but not within the scope of this 
thread.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.