Doc update for pg_stat_statements normalization

From
"Imseih (AWS), Sami"
Date:

Replacing constants in pg_stat_statements is done on a best-effort basis. On a busy workload with heavy entry deallocation, it is not unlikely that a user will observe a query with its constants intact in pg_stat_statements.

From what I can see, this is because the only time an entry is normalized is during post_parse_analyze, and the entry may be deallocated by the time query execution ends. At that point, the original form (with constants) of the query is used.

 

It is not clear how prevalent this is in real-world workloads, but it's easily reproducible on a workload with high entry deallocation. Attached are the repro steps on the latest branch.

I think the only thing to do here is to call this out in the docs, with a suggestion to increase pg_stat_statements.max to reduce the likelihood. I also attached the suggested doc enhancement.

Any thoughts?

Regards,

--
Sami Imseih
Amazon Web Services

Attachment

Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Fri, Feb 24, 2023 at 08:54:00PM +0000, Imseih (AWS), Sami wrote:
> I think the only thing to do here is to call this out in docs with a
> suggestion to increase pg_stat_statements.max to reduce the
> likelihood. I also attached the suggested  doc enhancement as well.

Improving the docs about that sounds like a good idea.  This would be
less surprising for users, if we had some details about that.

> Any thoughts?

The risk of deallocation of an entry between the post-analyze hook and
the planner/utility hook represented with two calls of pgss_store()
means that this will never be able to work correctly as long as we
don't know the shape of the normalized query in the second code path
(planner, utility execution) updating the entries with the call
information, etc.  And we only want to pay the cost of normalization
once, after the post-analyze where we do the query jumbling.

Could things be done in a more stable way?  For example, imagine that
we have an extra Query field called void *private_data that extensions
can use to store custom data associated to a query ID, then we could
do something like that:
- In the post-analyze hook, check if an entry with the query ID
calculated exists.
-- If the entry exists, grab a copy of the existing query string,
which may be normalized or not, and save it into Query->private_data.
-- If the entry does not exist, normalize the query, store it in
Query->private_data but do not yet create an entry in the hash table.
- In the planner/utility hook, fetch the normalized query from
private_data, then use it if an entry needs to be created in the hash
table.  The entry may have been deallocated since the post-analyze
hook, in which case it is re-created with the normalized copy saved in
the first phase.
--
Michael


Re: Doc update for pg_stat_statements normalization

From
Julien Rouhaud
Date:
On Sat, Feb 25, 2023 at 02:58:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 08:54:00PM +0000, Imseih (AWS), Sami wrote:
> > I think the only thing to do here is to call this out in docs with a
> > suggestion to increase pg_stat_statements.max to reduce the
> > likelihood. I also attached the suggested  doc enhancement as well.
> 
> Improving the docs about that sounds like a good idea.  This would be
> less surprising for users, if we had some details about that.
> 
> > Any thoughts?
> 
> The risk of deallocation of an entry between the post-analyze hook and
> the planner/utility hook represented with two calls of pgss_store()
> means that this will never be able to work correctly as long as we
> don't know the shape of the normalized query in the second code path
> (planner, utility execution) updating the entries with the call
> information, etc.  And we only want to pay the cost of normalization
> once, after the post-analyze where we do the query jumbling.

Note also that this is a somewhat wanted behavior (to evict queries that
never recorded any planning or execution stats), per the
STICKY_DECREASE_FACTOR and related logic.

> Could things be done in a more stable way?  For example, imagine that
> we have an extra Query field called void *private_data that extensions
> can use to store custom data associated to a query ID, then we could
> do something like that:
> - In the post-analyze hook, check if an entry with the query ID
> calculated exists.
> -- If the entry exists, grab a copy of the existing query string,
> which may be normalized or not, and save it into Query->private_data.
> -- If the entry does not exist, normalize the query, store it in
> Query->private_data but do not yet create an entry in the hash table.
> - In the planner/utility hook, fetch the normalized query from
> private_data, then use it if an entry needs to be created in the hash
> table.  The entry may have been deallocated since the post-analyze
> hook, in which case it is re-created with the normalized copy saved in
> the first phase.

I think the idea of a "private_data" like thing has been discussed before and
rejected IIRC, as it could be quite expensive and would also need to
accommodate multiple extensions and so on.

Overall, I think that if the pgss eviction rate is high enough to be
problematic for performance analysis, the performance overhead will be so
bad that simply removing pg_stat_statements will give you a ~2x performance
increase.  I don't see much point in trying to make such a performance-killer
scenario more usable.



Re: Doc update for pg_stat_statements normalization

From
"Imseih (AWS), Sami"
Date:
>    > Could things be done in a more stable way?  For example, imagine that
>    > we have an extra Query field called void *private_data that extensions
>    > can use to store custom data associated to a query ID, then we could
>    > do something like that:
>    > - In the post-analyze hook, check if an entry with the query ID
>    > calculated exists.
>    > -- If the entry exists, grab a copy of the existing query string,
>    > which may be normalized or not, and save it into Query->private_data.
>    > -- If the entry does not exist, normalize the query, store it in
>    > Query->private_data but do not yet create an entry in the hash table.
>    > - In the planner/utility hook, fetch the normalized query from
>    > private_data, then use it if an entry needs to be created in the hash
>    > table.  The entry may have been deallocated since the post-analyze
>    > hook, in which case it is re-created with the normalized copy saved in
>    > the first phase.

>    I think the idea of a "private_data" like thing has been discussed before and
>    rejected IIRC, as it could be quite expensive and would also need to
>    accommodate for multiple extensions and so on.

The overhead of storing this additional private data for the life of the query
execution may not be desirable. I think we will also need to copy the
private data into QueryDesc to make it available to the planner/utility/exec
hooks.

>    Overall, I think that if the pgss eviction rate is high enough that it's
>    problematic for doing performance analysis, the performance overhead will be so
>    bad that simply removing pg_stat_statements will give you a ~ x2 performance
>    increase.  I don't see much point trying to make such a performance killer
>    scenario more usable.

In v14, we added a dealloc metric to pg_stat_statements_info, which is helpful.
However, this only deals with the pgss_hash entry deallocation.
I think we should also add a metric for the text file garbage collection.

Regards

-- 
Sami Imseih
Amazon Web Services


Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Sat, Feb 25, 2023 at 01:59:04PM +0000, Imseih (AWS), Sami wrote:
> The overhead of storing this additional private data for the life of the query
> execution may not be  desirable.

Okay, but why?

> I think we also will need to copy the
> private data to QueryDesc as well to make it available to planner/utility/exec
> hooks.

This seems like the key point to me here.  If we copy more information
into the Query structures, then we basically have no need for sticky
entries, which could be an advantage on its own as it simplifies the
deallocation and lookup logic.

For a DML or a SELECT, the manipulation of the hash table would still
be a three-step process (post-analyze, planner and execution end), but
the first step would have no need to use an exclusive lock on the hash
table because we could just read and copy over the Query the
normalized query if an entry exists, meaning that we could actually
relax things a bit?  This relaxation has as cost the extra memory used
to store more data to allow the insertion to use a proper state of the
Query[Desc] coming from the JumbleState (this extra data has no need
to be JumbleState, just the results we generate from it aka the
normalized query).

> In v14, we added a dealloc metric to pg_stat_statements_info, which is helpful.
> However, this only deals with the pgss_hash entry deallocation.
> I think we should also add a metric for the text file garbage collection.

This sounds like a good idea on its own.
--
Michael


Re: Doc update for pg_stat_statements normalization

From
"Imseih (AWS), Sami"
Date:
>    On Sat, Feb 25, 2023 at 01:59:04PM +0000, Imseih (AWS), Sami wrote:
>    > The overhead of storing this additional private data for the life of the query
>    > execution may not be  desirable.

>    Okay, but why?

Additional memory to maintain the JumbleState data in other structs, and
the additional copy operations.

>    This seems like the key point to me here.  If we copy more information
>    into the Query structures, then we basically have no need for sticky
>    entries, which could be an advantage on its own as it simplifies the
>    deallocation and lookup logic.

Removing the sticky entry logic will be a big plus, and of course we can
eliminate the case of a query not showing up properly normalized.

>    For a DML or a SELECT, the manipulation of the hash table would still
>    be a three-step process (post-analyze, planner and execution end), but
>    the first step would have no need to use an exclusive lock on the hash
>    table because we could just read and copy over the Query the
>    normalized query if an entry exists, meaning that we could actually
>    relax things a bit?  

No lock is held while normalizing, and a shared lock is held while storing,
so there is no apparent benefit from that aspect.

>    This relaxation has as cost the extra memory used
>    to store more data to allow the insertion to use a proper state of the
>    Query[Desc] coming from the JumbleState (this extra data has no need
>    to be JumbleState, just the results we generate from it aka the
>    normalized query).

Wouldn't it be less in terms of memory usage to just store the required
JumbleState fields in Query[Desc]?

clocations_count,
highest_extern_param_id,
clocations_locations,
clocations_length

>    > In v14, we added a dealloc metric to pg_stat_statements_info, which is helpful.
>    > However, this only deals with the pgss_hash entry deallocation.
>    > I think we should also add a metric for the text file garbage collection.

>    This sounds like a good idea on its own.

I can create a separate patch for this.

Regards,

--
Sami Imseih
Amazon Web Services


Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Mon, Feb 27, 2023 at 10:53:26PM +0000, Imseih (AWS), Sami wrote:
> Wouldn't it be less in terms of memory usage to just store the required
> JumbleState fields in Query[Desc]?
>
> clocations_count,
> highest_extern_param_id,
> clocations_locations,
> clocations_length

Yes, these would be enough to ensure a proper rebuild of the
normalized query in either the 2nd or 3rd call of pgss_store().  With
a high deallocation rate, perhaps we just don't care about bearing
the extra computation of building a normalized query more than once,
though it could still be noticeable?

Anything that gets changed is going to need some serious benchmarking
(based on deallocation rate, string length, etc.) to check that the
cost of this extra memory is worth the correctness gained when storing
the normalization.  FWIW, I am all in if it means code simplifications
with better performance and better correctness of the results.
--
Michael


Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Fri, Feb 24, 2023 at 08:54:00PM +0000, Imseih (AWS), Sami wrote:
> I think the only thing to do here is to call this out in docs with a suggestion to increase
> pg_stat_statements.max to reduce the likelihood. I also attached the suggested
> doc enhancement as well.

+  <para>
+   A query text may be observed with constants in
+   <structname>pg_stat_statements</structname>, especially when there is a high
+   rate of entry deallocations. To reduce the likelihood of this occurring, consider
+   increasing <varname>pg_stat_statements.max</varname>.
+   The <structname>pg_stat_statements_info</structname> view provides entry
+   deallocation statistics.
+  </para>

I am OK with an addition to the documentation to warn that one may
have to increase the maximum number of entries that can be stored if
seeing a non-normalized entry that should have been normalized.

Shouldn't this text make it clear that this concerns only query
strings that can be normalized?  Utility queries can have constants,
for one (well, not for long for most of them with the work I have been
doing lately, but there will still be some exceptions like CREATE
VIEW or utilities with A_Const nodes).
--
Michael


Re: Doc update for pg_stat_statements normalization

From
"Imseih (AWS), Sami"
Date:
> I am OK with an addition to the documentation to warn that one may
> have to increase the maximum number of entries that can be stored if
> seeing a non-normalized entry that should have been normalized.

I agree. We introduce the concept of a plannable statement in a
previous section, and we can then make this distinction in the new
paragraph.

I also added a link to pg_stat_statements_info since that is introduced
later on in the doc.


Regards,

-- 
Sami Imseih
Amazon Web Services




Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Tue, Feb 28, 2023 at 11:11:30PM +0000, Imseih (AWS), Sami wrote:
> I agree. We introduce the concept of a plannable statement in a
> previous section and we can then make this distinction in the new
> paragraph.
>
> I also added a link to pg_stat_statements_info since that is introduced
> later on in the doc.

I have reworded the paragraph a bit to be more general so that it will
not need an update once more normalization is applied to utility
queries (I am going to fix the part where we mention that we use the
strings for utilities, which is not the case anymore):
+  <para>
+   Queries on which normalization can be applied may be observed with constant
+   values in <structname>pg_stat_statements</structname>, especially when there
+   is a high rate of entry deallocations. To reduce the likelihood of this
+   happening, consider increasing <varname>pg_stat_statements.max</varname>.
+   The <structname>pg_stat_statements_info</structname> view, discussed below
+   in <xref linkend="pgstatstatements-pg-stat-statements-info"/>,
+   provides statistics about entry deallocations.
+  </para>

Are you OK with this text?
--
Michael


Re: Doc update for pg_stat_statements normalization

From
"Imseih (AWS), Sami"
Date:
> + <para>
> + Queries on which normalization can be applied may be observed with constant
> + values in <structname>pg_stat_statements</structname>, especially when there
> + is a high rate of entry deallocations. To reduce the likelihood of this
> + happening, consider increasing <varname>pg_stat_statements.max</varname>.
> + The <structname>pg_stat_statements_info</structname> view, discussed below
> + in <xref linkend="pgstatstatements-pg-stat-statements-info"/>,
> + provides statistics about entry deallocations.
> + </para>

> Are you OK with this text?

Yes, that makes sense.


Regards,

--
Sami Imseih
Amazon Web Services




Re: Doc update for pg_stat_statements normalization

From
Michael Paquier
Date:
On Wed, Mar 01, 2023 at 12:43:40AM +0000, Imseih (AWS), Sami wrote:
> Yes, that makes sense.

Okay, thanks.  Done that now on HEAD.
--
Michael
