Thread: pg_stat_statements and "IN" conditions

pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

12 August 2020, 16:19:02

I would like to start another thread to follow up on [1], mostly to bump up the
topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
queries like:

SELECT something FROM table WHERE col IN (1, 2, 3, ...)

The current implementation produces different jumble hash for every different
number of arguments for essentially the same query. Unfortunately a lot of ORMs
like to generate these types of queries, which in turn leads to
pg_stat_statements pollution. Ideally we want to prevent this and have only one
record for such a query.

As the result of [1] I've identified two highlighted approaches to improve this
situation:

* Reduce the generated ArrayExpr to an array Const immediately, in cases where
all the inputs are Consts.

* Make repeating Const to contribute nothing to the resulting hash.

I've tried to prototype both approaches to find out pros/cons and be more
specific. Attached patches could not be considered a completed piece of work,
but they seem to work, mostly pass the tests and demonstrate the point. I would
like to get some high level input about them and ideally make it clear what is
the preferred solution to continue with.

# Reducing ArrayExpr to an array Const

IIUC this requires producing a Const with ArrayType constvalue in
transformAExprIn for ScalarArrayOpExpr. This could be a general improvement,
since apparently it's being done later anyway. But it deals only with Const,
which leaves more on the table, e.g. Params and other similar types of
duplication we observe when repeating constants are wrapped into VALUES.

Another point here is that it's quite possible this approach will still require
corresponding changes in pg_stat_statements, since just preventing duplicates
to show also loses the information. Ideally we also need to have some
understanding how many elements are actually there and display it, e.g. in
cases when there is just one outlier query that contains a huge IN list.

# Contribute nothing to the hash

I guess there could be multiple ways of doing this, but the first idea I had in
mind is to skip jumbling when necessary. At the same time it can be implemented
more centralized for different types of queries (although in the attached patch
there are only Const & Values). In the simplest case we just identify sequence
of constants of the same type, which just ignores any other cases when stuff is
mixed. But I believe it's something that could be considered a rare corner case
and it's better to start with the simplest solution.

Having said that I believe the second approach of contributing nothing to the
hash sounds more appealing, but would love to hear other opinions.

[1]: https://www.postgresql.org/message-id/flat/CAF42k%3DJCfHMJtkAVXCzBn2XBxDC83xb4VhV7jU7enPnZ0CfEQQ%40mail.gmail.com

Attachment

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

18 November 2020, 16:04:32

> On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
>
> I would like to start another thread to follow up on [1], mostly to bump up the
> topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> queries like:
>
>     SELECT something FROM table WHERE col IN (1, 2, 3, ...)
>
> The current implementation produces different jumble hash for every different
> number of arguments for essentially the same query. Unfortunately a lot of ORMs
> like to generate these types of queries, which in turn leads to
> pg_stat_statements pollution. Ideally we want to prevent this and have only one
> record for such a query.
>
> As the result of [1] I've identified two highlighted approaches to improve this
> situation:
>
> * Reduce the generated ArrayExpr to an array Const immediately, in cases where
>   all the inputs are Consts.
>
> * Make repeating Const to contribute nothing to the resulting hash.
>
> I've tried to prototype both approaches to find out pros/cons and be more
> specific. Attached patches could not be considered a completed piece of work,
> but they seem to work, mostly pass the tests and demonstrate the point. I would
> like to get some high level input about them and ideally make it clear what is
> the preferred solution to continue with.

I've implemented the second approach mentioned above, this version was
tested on our test clusters for some time without visible issues. Will
create a CF item and would appreciate any feedback.

Attachment

v1-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Chengxi Sun

Date:

09 December 2020, 03:37:40

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hi, I did some test and it works well

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

09 December 2020, 18:49:44

> On Wed, Dec 09, 2020 at 03:37:40AM +0000, Chengxi Sun wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> Hi, I did some test and it works well

Thanks for testing!

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

26 December 2020, 10:46:35

> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote:
> > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
> >
> > I would like to start another thread to follow up on [1], mostly to bump up the
> > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> > queries like:
> >
> >     SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> >
> > The current implementation produces different jumble hash for every different
> > number of arguments for essentially the same query. Unfortunately a lot of ORMs
> > like to generate these types of queries, which in turn leads to
> > pg_stat_statements pollution. Ideally we want to prevent this and have only one
> > record for such a query.
> >
> > As the result of [1] I've identified two highlighted approaches to improve this
> > situation:
> >
> > * Reduce the generated ArrayExpr to an array Const immediately, in cases where
> >   all the inputs are Consts.
> >
> > * Make repeating Const to contribute nothing to the resulting hash.
> >
> > I've tried to prototype both approaches to find out pros/cons and be more
> > specific. Attached patches could not be considered a completed piece of work,
> > but they seem to work, mostly pass the tests and demonstrate the point. I would
> > like to get some high level input about them and ideally make it clear what is
> > the preferred solution to continue with.
>
> I've implemented the second approach mentioned above, this version was
> tested on our test clusters for some time without visible issues. Will
> create a CF item and would appreciate any feedback.

After more testing I found couple of things that could be improved,
namely in the presence of non-reducible constants one part of the query
was not copied into the normalized version, and this approach also could
be extended for Params. These are incorporated in the attached patch.

Attachment

v2-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Zhihong Yu

Date:

26 December 2020, 16:53:28

Hi,

A few comments.

+ "After this number of duplicating constants start to merge them.",

duplicating -> duplicate

+ foreach(lc, (List *) expr)
+ {
+ Node * subExpr = (Node *) lfirst(lc);
+
+ if (!IsA(subExpr, Const))
+ {
+ allConst = false;
+ break;
+ }
+ }

It seems the above foreach loop (within foreach(temp, (List *) node)) can be preceded with a check that allConst is true. Otherwise the loop can be skipped.

+ if (currentExprIdx == pgss_merge_threshold - 1)
+ {
+ JumbleExpr(jstate, expr);
+
+ /*
+ * A const expr is already found, so JumbleExpr must
+ * record it. Mark it as merged, it will be the first
+ * merged but still present in the statement query.
+ */
+ Assert(jstate->clocations_count > 0);
+ jstate->clocations[jstate->clocations_count - 1].merged = true;
+ currentExprIdx++;
+ }

The above snippet occurs a few times. Maybe extract into a helper method.

Cheers

On Sat, Dec 26, 2020 at 2:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote:
> > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
> >
> > I would like to start another thread to follow up on [1], mostly to bump up the
> > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> > queries like:
> >
> > SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> >
> > The current implementation produces different jumble hash for every different
> > number of arguments for essentially the same query. Unfortunately a lot of ORMs
> > like to generate these types of queries, which in turn leads to
> > pg_stat_statements pollution. Ideally we want to prevent this and have only one
> > record for such a query.
> >
> > As the result of [1] I've identified two highlighted approaches to improve this
> > situation:
> >
> > * Reduce the generated ArrayExpr to an array Const immediately, in cases where
> > all the inputs are Consts.
> >
> > * Make repeating Const to contribute nothing to the resulting hash.
> >
> > I've tried to prototype both approaches to find out pros/cons and be more
> > specific. Attached patches could not be considered a completed piece of work,
> > but they seem to work, mostly pass the tests and demonstrate the point. I would
> > like to get some high level input about them and ideally make it clear what is
> > the preferred solution to continue with.
>
> I've implemented the second approach mentioned above, this version was
> tested on our test clusters for some time without visible issues. Will
> create a CF item and would appreciate any feedback.

After more testing I found couple of things that could be improved,
namely in the presence of non-reducible constants one part of the query
was not copied into the normalized version, and this approach also could
be extended for Params. These are incorporated in the attached patch.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

05 January 2021, 12:52:30

> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote:
> Hi,
> A few comments.
>
> +               foreach(lc, (List *) expr)
> +               {
> +                   Node * subExpr = (Node *) lfirst(lc);
> +
> +                   if (!IsA(subExpr, Const))
> +                   {
> +                       allConst = false;
> +                       break;
> +                   }
> +               }
>
> It seems the above foreach loop (within foreach(temp, (List *) node)) can
> be preceded with a check that allConst is true. Otherwise the loop can be
> skipped.

Thanks for noticing. Now that I look at it closer I think it's the other
way around, the loop above checking constants for the first expression
is not really necessary.

> +                   if (currentExprIdx == pgss_merge_threshold - 1)
> +                   {
> +                       JumbleExpr(jstate, expr);
> +
> +                       /*
> +                        * A const expr is already found, so JumbleExpr must
> +                        * record it. Mark it as merged, it will be the
> first
> +                        * merged but still present in the statement query.
> +                        */
> +                       Assert(jstate->clocations_count > 0);
> +                       jstate->clocations[jstate->clocations_count -
> 1].merged = true;
> +                       currentExprIdx++;
> +                   }
>
> The above snippet occurs a few times. Maybe extract into a helper method.

Originally I was hesitant to extract it was because it's quite small
part of the code. But now I've realized that the part relevant to lists
is not really correct, which makes those bits even more different, so I
think it makes sense to leave it like that. What do you think?

Attachment

v3-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Zhihong Yu

Date:

05 January 2021, 15:51:42

Hi, Dmitry:

+ int lastExprLenght = 0;

Did you mean to name the variable lastExprLenghth ?

w.r.t. extracting to helper method, the second and third if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.

It is up to you whether to create the helper method.

I am fine with the current formation.

Cheers

On Tue, Jan 5, 2021 at 4:51 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote:
> Hi,
> A few comments.
>
> + foreach(lc, (List *) expr)
> + {
> + Node * subExpr = (Node *) lfirst(lc);
> +
> + if (!IsA(subExpr, Const))
> + {
> + allConst = false;
> + break;
> + }
> + }
>
> It seems the above foreach loop (within foreach(temp, (List *) node)) can
> be preceded with a check that allConst is true. Otherwise the loop can be
> skipped.

Thanks for noticing. Now that I look at it closer I think it's the other
way around, the loop above checking constants for the first expression
is not really necessary.

> + if (currentExprIdx == pgss_merge_threshold - 1)
> + {
> + JumbleExpr(jstate, expr);
> +
> + /*
> + * A const expr is already found, so JumbleExpr must
> + * record it. Mark it as merged, it will be the
> first
> + * merged but still present in the statement query.
> + */
> + Assert(jstate->clocations_count > 0);
> + jstate->clocations[jstate->clocations_count -
> 1].merged = true;
> + currentExprIdx++;
> + }
>
> The above snippet occurs a few times. Maybe extract into a helper method.

Originally I was hesitant to extract it was because it's quite small
part of the code. But now I've realized that the part relevant to lists
is not really correct, which makes those bits even more different, so I
think it makes sense to leave it like that. What do you think?

Re: pg_stat_statements and "IN" conditions

From

David Steele

Date:

18 March 2021, 13:38:09

On 1/5/21 10:51 AM, Zhihong Yu wrote:
> 
> +   int         lastExprLenght = 0;
> 
> Did you mean to name the variable lastExprLenghth ?
> 
> w.r.t. extracting to helper method, the second and third 
> if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> It is up to you whether to create the helper method.
> I am fine with the current formation.

Dmitry, thoughts on this review?

Regards,
-- 
-David
david@pgmasters.net

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

18 March 2021, 15:50:02

> On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> On 1/5/21 10:51 AM, Zhihong Yu wrote:
> >
> > +   int         lastExprLenght = 0;
> >
> > Did you mean to name the variable lastExprLenghth ?
> >
> > w.r.t. extracting to helper method, the second and third
> > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > It is up to you whether to create the helper method.
> > I am fine with the current formation.
>
> Dmitry, thoughts on this review?

Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
the helper is not strictly necessary I wanted to wait a bit hoping for
more feedback and eventually to post an accumulated patch. Doesn't make
sense to post another version only to fix one typo :)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

15 June 2021, 15:18:50

> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote:
> > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> > On 1/5/21 10:51 AM, Zhihong Yu wrote:
> > >
> > > +   int         lastExprLenght = 0;
> > >
> > > Did you mean to name the variable lastExprLenghth ?
> > >
> > > w.r.t. extracting to helper method, the second and third
> > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > > It is up to you whether to create the helper method.
> > > I am fine with the current formation.
> >
> > Dmitry, thoughts on this review?
>
> Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
> the helper is not strictly necessary I wanted to wait a bit hoping for
> more feedback and eventually to post an accumulated patch. Doesn't make
> sense to post another version only to fix one typo :)

Hi,

I've prepared a new rebased version to deal with the new way of
computing query id, but as always there is one tricky part. From what I
understand, now an external module can provide custom implementation for
query id computation algorithm. It seems natural to think this machinery
could be used instead of patch in the thread, i.e. one could create a
custom logic that will enable constants collapsing as needed, so that
same queries with different number of constants in an array will be
hashed into the same record.

But there is a limitation in how such queries will be normalized
afterwards — to reduce level of surprise it's necessary to display the
fact that a certain query in fact had more constants that are showed in
pgss record. Ideally LocationLen needs to carry some bits of information
on what exactly could be skipped, and generate_normalized_query needs to
understand that, both are not reachable for an external module with
custom query id logic (without replicating significant part of the
existing code). Hence, a new version of the patch.

Attachment

v4-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

16 June 2021, 14:02:12

> On Tue, Jun 15, 2021 at 05:18:50PM +0200, Dmitry Dolgov wrote:
> > On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote:
> > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> > > On 1/5/21 10:51 AM, Zhihong Yu wrote:
> > > >
> > > > +   int         lastExprLenght = 0;
> > > >
> > > > Did you mean to name the variable lastExprLenghth ?
> > > >
> > > > w.r.t. extracting to helper method, the second and third
> > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > > > It is up to you whether to create the helper method.
> > > > I am fine with the current formation.
> > >
> > > Dmitry, thoughts on this review?
> >
> > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
> > the helper is not strictly necessary I wanted to wait a bit hoping for
> > more feedback and eventually to post an accumulated patch. Doesn't make
> > sense to post another version only to fix one typo :)
>
> Hi,
>
> I've prepared a new rebased version to deal with the new way of
> computing query id, but as always there is one tricky part. From what I
> understand, now an external module can provide custom implementation for
> query id computation algorithm. It seems natural to think this machinery
> could be used instead of patch in the thread, i.e. one could create a
> custom logic that will enable constants collapsing as needed, so that
> same queries with different number of constants in an array will be
> hashed into the same record.
>
> But there is a limitation in how such queries will be normalized
> afterwards — to reduce level of surprise it's necessary to display the
> fact that a certain query in fact had more constants that are showed in
> pgss record. Ideally LocationLen needs to carry some bits of information
> on what exactly could be skipped, and generate_normalized_query needs to
> understand that, both are not reachable for an external module with
> custom query id logic (without replicating significant part of the
> existing code). Hence, a new version of the patch.

Forgot to mention a couple of people who already reviewed the patch.

Attachment

v4-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

30 September 2021, 13:49:30

>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
>
> > I've prepared a new rebased version to deal with the new way of
> > computing query id, but as always there is one tricky part. From what I
> > understand, now an external module can provide custom implementation for
> > query id computation algorithm. It seems natural to think this machinery
> > could be used instead of patch in the thread, i.e. one could create a
> > custom logic that will enable constants collapsing as needed, so that
> > same queries with different number of constants in an array will be
> > hashed into the same record.
> >
> > But there is a limitation in how such queries will be normalized
> > afterwards — to reduce level of surprise it's necessary to display the
> > fact that a certain query in fact had more constants that are showed in
> > pgss record. Ideally LocationLen needs to carry some bits of information
> > on what exactly could be skipped, and generate_normalized_query needs to
> > understand that, both are not reachable for an external module with
> > custom query id logic (without replicating significant part of the
> > existing code). Hence, a new version of the patch.
>
> Forgot to mention a couple of people who already reviewed the patch.

And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).

Attachment

v5-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Zhihong Yu

Date:

30 September 2021, 15:03:16

On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
>
> > I've prepared a new rebased version to deal with the new way of
> > computing query id, but as always there is one tricky part. From what I
> > understand, now an external module can provide custom implementation for
> > query id computation algorithm. It seems natural to think this machinery
> > could be used instead of patch in the thread, i.e. one could create a
> > custom logic that will enable constants collapsing as needed, so that
> > same queries with different number of constants in an array will be
> > hashed into the same record.
> >
> > But there is a limitation in how such queries will be normalized
> > afterwards — to reduce level of surprise it's necessary to display the
> > fact that a certain query in fact had more constants that are showed in
> > pgss record. Ideally LocationLen needs to carry some bits of information
> > on what exactly could be skipped, and generate_normalized_query needs to
> > understand that, both are not reachable for an external module with
> > custom query id logic (without replicating significant part of the
> > existing code). Hence, a new version of the patch.
>
> Forgot to mention a couple of people who already reviewed the patch.

And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).

Hi,

bq. and at position further that specified threshold.

that specified threshold -> than specified threshold

Cheers

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

30 September 2021, 15:09:57

> On Thu, Sep 30, 2021 at 08:03:16AM -0700, Zhihong Yu wrote:
> On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > >On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
> > >
> > > > I've prepared a new rebased version to deal with the new way of
> > > > computing query id, but as always there is one tricky part. From what I
> > > > understand, now an external module can provide custom implementation
> > for
> > > > query id computation algorithm. It seems natural to think this
> > machinery
> > > > could be used instead of patch in the thread, i.e. one could create a
> > > > custom logic that will enable constants collapsing as needed, so that
> > > > same queries with different number of constants in an array will be
> > > > hashed into the same record.
> > > >
> > > > But there is a limitation in how such queries will be normalized
> > > > afterwards — to reduce level of surprise it's necessary to display the
> > > > fact that a certain query in fact had more constants that are showed in
> > > > pgss record. Ideally LocationLen needs to carry some bits of
> > information
> > > > on what exactly could be skipped, and generate_normalized_query needs
> > to
> > > > understand that, both are not reachable for an external module with
> > > > custom query id logic (without replicating significant part of the
> > > > existing code). Hence, a new version of the patch.
> > >
> > > Forgot to mention a couple of people who already reviewed the patch.
> >
> > And now for something completely different, here is a new patch version.
> > It contains a small fix for one problem we've found during testing (one
> > path code was incorrectly assuming find_const_walker results).
> >
> Hi,
>
> bq. and at position further that specified threshold.
>
>  that specified threshold -> than specified threshold

You mean in the patch commit message, nowhere else, right? Yep, my spell
checker didn't catch that, thanks for noticing!

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

04 January 2022, 23:02:43

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> And now for something completely different, here is a new patch version.
> It contains a small fix for one problem we've found during testing (one
> path code was incorrectly assuming find_const_walker results).

I've been saying from day one that pushing the query-hashing code into the
core was a bad idea, and I think this patch perfectly illustrates why.
We can debate whether the rules proposed here are good for
pg_stat_statements or not, but it seems inevitable that they will be a
disaster for some other consumers of the query hash.  In particular,
dropping external parameters from the hash seems certain to break
something for somebody --- do you really think that a query with two int
parameters is equivalent to one with five float parameters for all
query-identifying purposes?

I can see the merits of allowing different numbers of IN elements
to be considered equivalent for pg_stat_statements, but this patch
seems to go far beyond that basic idea, and I fear the side-effects
will be very bad.

Also, calling eval_const_expressions in the query jumbler is flat
out unacceptable.  There is way too much code that could be reached
that way (more or less the entire executor, to start with).  I
don't have a lot of faith that it'd never modify the input tree,
either.

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

"Andrey V. Lepikhov"

Date:

05 January 2022, 04:37:33

On 1/5/22 4:02 AM, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> And now for something completely different, here is a new patch version.
>> It contains a small fix for one problem we've found during testing (one
>> path code was incorrectly assuming find_const_walker results).
> 
> I've been saying from day one that pushing the query-hashing code into the
> core was a bad idea, and I think this patch perfectly illustrates why.
> We can debate whether the rules proposed here are good for
> pg_stat_statements or not, but it seems inevitable that they will be a
> disaster for some other consumers of the query hash.  In particular,
> dropping external parameters from the hash seems certain to break
> something for somebody
+1.

In a couple of extensions I use different logic of query jumbling - hash 
value is more stable in some cases than in default implementation. For 
example, it should be stable to permutations in 'FROM' section of a query.
And If anyone subtly changes jumbling logic when the extension is 
active, the instance could get huge performance issues.

Let me suggest, that the core should allow an extension at least to 
detect such interference between extensions. Maybe hook could be 
replaced with callback to allow extension see an queryid with underlying 
generation logic what it expects.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

05 January 2022, 05:13:30

"Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> writes:
> On 1/5/22 4:02 AM, Tom Lane wrote:
>> I've been saying from day one that pushing the query-hashing code into the
>> core was a bad idea, and I think this patch perfectly illustrates why.

> +1.

> Let me suggest, that the core should allow an extension at least to 
> detect such interference between extensions. Maybe hook could be 
> replaced with callback to allow extension see an queryid with underlying 
> generation logic what it expects.

I feel like we need to get away from the idea that there is just
one query hash, and somehow let different extensions attach
differently-calculated hashes to a query.  I don't have any immediate
ideas about how to do that in a reasonably inexpensive way.

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

05 January 2022, 21:11:11

> On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote:
> We can debate whether the rules proposed here are good for
> pg_stat_statements or not, but it seems inevitable that they will be a
> disaster for some other consumers of the query hash.

Hm, which consumers do you mean here, potential extension? Isn't the
ability to use an external module to compute queryid make this situation
possible anyway?

> do you really think that a query with two int
> parameters is equivalent to one with five float parameters for all
> query-identifying purposes?

Nope, and it will be hard to figure this out no matter which approach
we're talking about, because it mostly depends on the context and type
of queries I guess. Instead, such functionality should allow some
reasonable configuration. To be clear, the use case I have in mind here
is not four or five, but rather a couple of hundreds constants where
chances that the whole construction was generated automatically by ORM
is higher than normal.

> I can see the merits of allowing different numbers of IN elements
> to be considered equivalent for pg_stat_statements, but this patch
> seems to go far beyond that basic idea, and I fear the side-effects
> will be very bad.

Not sure why it goes far beyond, but then there were two approaches
under consideration, as I've stated in the first message. I already
don't remember all the details, but another one was evolving around
doing similar things in a more limited fashion in transformAExprIn. The
problem would be then to carry the information, necessary to represent
the act of "merging" some number of queryids together. Any thoughts
here?

The idea of keeping the original queryid untouched and add another type
of id instead sounds interesting, but it will add too much overhead for
a quite small use case I guess.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

10 March 2022, 16:38:37

> On Wed, Jan 05, 2022 at 10:11:11PM +0100, Dmitry Dolgov wrote:
> > On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote:
> > We can debate whether the rules proposed here are good for
> > pg_stat_statements or not, but it seems inevitable that they will be a
> > disaster for some other consumers of the query hash.
>
> Hm, which consumers do you mean here, potential extension? Isn't the
> ability to use an external module to compute queryid make this situation
> possible anyway?
>
> > do you really think that a query with two int
> > parameters is equivalent to one with five float parameters for all
> > query-identifying purposes?
>
> Nope, and it will be hard to figure this out no matter which approach
> we're talking about, because it mostly depends on the context and type
> of queries I guess. Instead, such functionality should allow some
> reasonable configuration. To be clear, the use case I have in mind here
> is not four or five, but rather a couple of hundreds constants where
> chances that the whole construction was generated automatically by ORM
> is higher than normal.
>
> > I can see the merits of allowing different numbers of IN elements
> > to be considered equivalent for pg_stat_statements, but this patch
> > seems to go far beyond that basic idea, and I fear the side-effects
> > will be very bad.
>
> Not sure why it goes far beyond, but then there were two approaches
> under consideration, as I've stated in the first message. I already
> don't remember all the details, but another one was evolving around
> doing similar things in a more limited fashion in transformAExprIn. The
> problem would be then to carry the information, necessary to represent
> the act of "merging" some number of queryids together. Any thoughts
> here?
>
> The idea of keeping the original queryid untouched and add another type
> of id instead sounds interesting, but it will add too much overhead for
> a quite small use case I guess.

```
Thu, 10 Mar 2022
New status: Waiting on Author
```

This seems incorrect, as the only feedback I've got was "this is a bad
idea", and no reaction on follow-up questions.

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

10 March 2022, 17:11:59

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> New status: Waiting on Author

> This seems incorrect, as the only feedback I've got was "this is a bad
> idea", and no reaction on follow-up questions.

I changed the status because it seems to me there is no chance of
this being committed as-is.

1. I think an absolute prerequisite before we could even consider
changing the query jumbler rules this much is to do the work that was
put off when the jumbler was moved into core: that is, provide some
honest support for multiple query-ID generation methods being used at
the same time.  Even if you successfully make a case for
pg_stat_statements to act this way, other consumers of query IDs
aren't going to be happy with it.

2. You haven't made a case for it.  The original complaint was
about different lengths of IN lists not being treated as equivalent,
but this patch has decided to do I'm-not-even-sure-quite-what
about treating different Params as equivalent.  Plus you're trying
to invoke eval_const_expressions in the jumbler; that is absolutely
Not OK, for both safety and semantic reasons.

If you backed off to just treating ArrayExprs containing different
numbers of Consts as equivalent, maybe that'd be something we could
adopt without fixing point 1.  I don't think anything that fuzzes the
treatment of Params can get away with that, though.

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

Robert Haas

Date:

10 March 2022, 17:32:08

On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > This seems incorrect, as the only feedback I've got was "this is a bad
> > idea", and no reaction on follow-up questions.
>
> I changed the status because it seems to me there is no chance of
> this being committed as-is.
>
> 1. I think an absolute prerequisite before we could even consider
> changing the query jumbler rules this much is to do the work that was
> put off when the jumbler was moved into core: that is, provide some
> honest support for multiple query-ID generation methods being used at
> the same time.  Even if you successfully make a case for
> pg_stat_statements to act this way, other consumers of query IDs
> aren't going to be happy with it.

FWIW, I don't find this convincing at all. Query jumbling is already
somewhat expensive, and it seems unlikely that the same person is
going to want to jumble queries in one way for pg_stat_statements and
another way for pg_stat_broccoli or whatever their other extension is.
Putting a lot of engineering work into something with such a marginal
use case seems not worthwhile to me - and also likely futile, because
I don't see how it could realistically be made nearly as cheap as a
single jumble.

> 2. You haven't made a case for it.  The original complaint was
> about different lengths of IN lists not being treated as equivalent,
> but this patch has decided to do I'm-not-even-sure-quite-what
> about treating different Params as equivalent.  Plus you're trying
> to invoke eval_const_expressions in the jumbler; that is absolutely
> Not OK, for both safety and semantic reasons.

I think there are two separate points here, one about patch quality
and the other about whether the basic idea is good. I think the basic
idea is good. I do not contend that collapsing IN-lists of arbitrary
length is what everyone wants in all cases, but it seems entirely
reasonable to me to think that it is what some people want. So I would
say just make it a parameter and let people configure whichever
behavior they want. My bet is 95% of users would prefer to have it on,
but even if that's wildly wrong, having it as an optional behavior
hurts nobody. Let it be off by default and let those who want it flip
the toggle. On the code quality issue, I haven't read the patch but
your concerns sound well-founded to me from reading what you wrote.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

10 March 2022, 19:06:51

> On Thu, Mar 10, 2022 at 12:32:08PM -0500, Robert Haas wrote:
> On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > 2. You haven't made a case for it.  The original complaint was
> > about different lengths of IN lists not being treated as equivalent,
> > but this patch has decided to do I'm-not-even-sure-quite-what
> > about treating different Params as equivalent.  Plus you're trying
> > to invoke eval_const_expressions in the jumbler; that is absolutely
> > Not OK, for both safety and semantic reasons.
>
> I think there are two separate points here, one about patch quality
> and the other about whether the basic idea is good. I think the basic
> idea is good. I do not contend that collapsing IN-lists of arbitrary
> length is what everyone wants in all cases, but it seems entirely
> reasonable to me to think that it is what some people want. So I would
> say just make it a parameter and let people configure whichever
> behavior they want. My bet is 95% of users would prefer to have it on,
> but even if that's wildly wrong, having it as an optional behavior
> hurts nobody. Let it be off by default and let those who want it flip
> the toggle. On the code quality issue, I haven't read the patch but
> your concerns sound well-founded to me from reading what you wrote.

I have the same understanding, there is a toggle in the patch exactly
for this purpose.

To give a bit more context, the whole development was ORM-driven rather
than pulled out of thin air -- people were complaining about huge
generated queries that could be barely displayed in monitoring, I was
trying to address it via collapsing the list where it was happening. In
other words "I'm-not-even-sure-quite-what" part may be indeed too
extensive, but was triggered by real world issues.

Of course, I could get the implementation not quite right, e.g. I wasn't
aware about dangers of using eval_const_expressions. But that's what the
CF item and the corresponding discussion is for, I guess. Let me see
what I could do to improve it.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

12 March 2022, 14:10:30

> On Thu, Mar 10, 2022 at 12:11:59PM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > New status: Waiting on Author
>
> > This seems incorrect, as the only feedback I've got was "this is a bad
> > idea", and no reaction on follow-up questions.
>
> I changed the status because it seems to me there is no chance of
> this being committed as-is.
>
> 1. I think an absolute prerequisite before we could even consider
> changing the query jumbler rules this much is to do the work that was
> put off when the jumbler was moved into core: that is, provide some
> honest support for multiple query-ID generation methods being used at
> the same time.  Even if you successfully make a case for
> pg_stat_statements to act this way, other consumers of query IDs
> aren't going to be happy with it.
>
> 2. You haven't made a case for it.  The original complaint was
> about different lengths of IN lists not being treated as equivalent,
> but this patch has decided to do I'm-not-even-sure-quite-what
> about treating different Params as equivalent.  Plus you're trying
> to invoke eval_const_expressions in the jumbler; that is absolutely
> Not OK, for both safety and semantic reasons.
>
> If you backed off to just treating ArrayExprs containing different
> numbers of Consts as equivalent, maybe that'd be something we could
> adopt without fixing point 1.  I don't think anything that fuzzes the
> treatment of Params can get away with that, though.

Here is the limited version of list collapsing functionality, which
doesn't utilize eval_const_expressions and ignores most of the stuff
except ArrayExprs. Any thoughts/more suggestions?

Attachment

v6-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Robert Haas

Date:

14 March 2022, 14:17:57

On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Here is the limited version of list collapsing functionality, which
> doesn't utilize eval_const_expressions and ignores most of the stuff
> except ArrayExprs. Any thoughts/more suggestions?

The proposed commit message says this commit intends to "Make Consts
contribute nothing to the jumble hash if they're part of a series and
at position further that specified threshold." I'm not sure whether
that's what the patch actually implements because I can't immediately
understand the new logic you've added, but I think if we did what that
sentence said then, supposing the threshold is set to 1, it would
result in producing the same hash for "x in (1,2)" that we do for "x
in (1,3)" but a different hash for "x in (2,3)" which does not sound
like what we want. What I would have thought we'd do is: if the list
is all constants and long enough to satisfy the threshold then nothing
in the list gets jumbled.

I'm a little surprised that there's not more context-awareness in this
code. It seems that it applies to every ArrayExpr found in the query,
which I think would extend to cases beyond something = IN(whatever).
In particular, any use of ARRAY[] in the query would be impacted. Now,
the comments seem to imply that's pretty intentional, but from the
user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two
different things. If anything like this is to be adopted, we certainly
need to be precise about exactly what it is doing and which cases are
covered. I thought of looking at the documentation to see whether
you'd tried to clarify this there, and found that you hadn't written
any.

In short, I think this patch is not really very close to being in
committable shape even if nobody were objecting to the concept.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

14 March 2022, 14:57:34

> On Mon, Mar 14, 2022 at 10:17:57AM -0400, Robert Haas wrote:
> On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Here is the limited version of list collapsing functionality, which
> > doesn't utilize eval_const_expressions and ignores most of the stuff
> > except ArrayExprs. Any thoughts/more suggestions?
>
> The proposed commit message says this commit intends to "Make Consts
> contribute nothing to the jumble hash if they're part of a series and
> at position further that specified threshold." I'm not sure whether
> that's what the patch actually implements because I can't immediately
> understand the new logic you've added, but I think if we did what that
> sentence said then, supposing the threshold is set to 1, it would
> result in producing the same hash for "x in (1,2)" that we do for "x
> in (1,3)" but a different hash for "x in (2,3)" which does not sound
> like what we want. What I would have thought we'd do is: if the list
> is all constants and long enough to satisfy the threshold then nothing
> in the list gets jumbled.

Well, yeah, the commit message is somewhat clumsy in this regard. It
works almost in the way you've described, except if the list is all
constants and long enough to satisfy the threshold then *first N
elements (where N == threshold) will be jumbled -- to leave at least
some traces of it in pgss.

> I'm a little surprised that there's not more context-awareness in this
> code. It seems that it applies to every ArrayExpr found in the query,
> which I think would extend to cases beyond something = IN(whatever).
> In particular, any use of ARRAY[] in the query would be impacted. Now,
> the comments seem to imply that's pretty intentional, but from the
> user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two
> different things. If anything like this is to be adopted, we certainly
> need to be precise about exactly what it is doing and which cases are
> covered.

I'm not sure if I follow the last point. WHERE x in (1,3) and x =
any(array[1,3]) are two different things for sure, but in which way are
they going to be mixed together because of this change? My goal was to
make only the following transformation, without leaving any uncertainty:

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

> I thought of looking at the documentation to see whether you'd tried
> to clarify this there, and found that you hadn't written any.
>
> In short, I think this patch is not really very close to being in
> committable shape even if nobody were objecting to the concept.

Sure, I'll add documentation. To be honest I'm not targeting PG15 with
this, just want to make some progress. Thanks for the feedback, I'm glad
to see it coming!

Re: pg_stat_statements and "IN" conditions

From

Robert Haas

Date:

14 March 2022, 15:02:16

On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Well, yeah, the commit message is somewhat clumsy in this regard. It
> works almost in the way you've described, except if the list is all
> constants and long enough to satisfy the threshold then *first N
> elements (where N == threshold) will be jumbled -- to leave at least
> some traces of it in pgss.

But that seems to me to be a thing we would not want. Why do you think
otherwise?

> I'm not sure if I follow the last point. WHERE x in (1,3) and x =
> any(array[1,3]) are two different things for sure, but in which way are
> they going to be mixed together because of this change? My goal was to
> make only the following transformation, without leaving any uncertainty:
>
> WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
> WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

I understand. I think it might be OK to transform both of those
things, but I don't think it's very clear either from the comments or
the nonexistent documentation that both of those cases are affected --
and I think that needs to be clear. Not sure exactly how to do that,
just saying that we can't add behavior unless it will be clear to
users what the behavior is.

> Sure, I'll add documentation. To be honest I'm not targeting PG15 with
> this, just want to make some progress.

wfm!

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

14 March 2022, 15:10:28

> On Mon, Mar 14, 2022 at 11:02:16AM -0400, Robert Haas wrote:
> On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Well, yeah, the commit message is somewhat clumsy in this regard. It
> > works almost in the way you've described, except if the list is all
> > constants and long enough to satisfy the threshold then *first N
> > elements (where N == threshold) will be jumbled -- to leave at least
> > some traces of it in pgss.
>
> But that seems to me to be a thing we would not want. Why do you think
> otherwise?

Hm. Well, if the whole list would be not jumbled, the transformation
would look like this, right?

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (...)

Leaving some number of original elements in place gives some clue for
the reader about at least what type of data the array has contained.
Which hopefully makes it a bit easier to identify even in the collapsed
form:

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)

> > I'm not sure if I follow the last point. WHERE x in (1,3) and x =
> > any(array[1,3]) are two different things for sure, but in which way are
> > they going to be mixed together because of this change? My goal was to
> > make only the following transformation, without leaving any uncertainty:
> >
> > WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
> > WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])
>
> I understand. I think it might be OK to transform both of those
> things, but I don't think it's very clear either from the comments or
> the nonexistent documentation that both of those cases are affected --
> and I think that needs to be clear. Not sure exactly how to do that,
> just saying that we can't add behavior unless it will be clear to
> users what the behavior is.

Yep, got it.

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

14 March 2022, 15:23:17

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I'm not sure if I follow the last point. WHERE x in (1,3) and x =
>> any(array[1,3]) are two different things for sure, but in which way are
>> they going to be mixed together because of this change? My goal was to
>> make only the following transformation, without leaving any uncertainty:
>>
>> WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
>> WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

> I understand. I think it might be OK to transform both of those
> things, but I don't think it's very clear either from the comments or
> the nonexistent documentation that both of those cases are affected --
> and I think that needs to be clear.

We've transformed IN(...) to ANY(ARRAY[...]) at the parser stage for a
long time, and this has been visible to users of either EXPLAIN or
pg_stat_statements for the same length of time.  I doubt people are
going to find that surprising.  Even if they do, it's not the query
jumbler's fault.

I do find it odd that the proposed patch doesn't cause the *entire*
list to be skipped over.  That seems like extra complexity and confusion
to no benefit.

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

14 March 2022, 15:33:46

> On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>
> I do find it odd that the proposed patch doesn't cause the *entire*
> list to be skipped over.  That seems like extra complexity and confusion
> to no benefit.

That's a bit surprising for me, I haven't even thought that folks could
think this is an odd behaviour. As I've mentioned above, the original
idea was to give some clues about what was inside the collapsed array,
but if everyone finds it unnecessary I can of course change it.

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

14 March 2022, 15:38:23

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
>> I do find it odd that the proposed patch doesn't cause the *entire*
>> list to be skipped over.  That seems like extra complexity and confusion
>> to no benefit.

> That's a bit surprising for me, I haven't even thought that folks could
> think this is an odd behaviour. As I've mentioned above, the original
> idea was to give some clues about what was inside the collapsed array,
> but if everyone finds it unnecessary I can of course change it.

But if what we're doing is skipping over an all-Consts list, then the
individual Consts would be elided from the pg_stat_statements entry
anyway, no?  All that would remain is information about how many such
Consts there were, which is exactly the information you want to drop.

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

14 March 2022, 15:51:50

> On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> >> I do find it odd that the proposed patch doesn't cause the *entire*
> >> list to be skipped over.  That seems like extra complexity and confusion
> >> to no benefit.
>
> > That's a bit surprising for me, I haven't even thought that folks could
> > think this is an odd behaviour. As I've mentioned above, the original
> > idea was to give some clues about what was inside the collapsed array,
> > but if everyone finds it unnecessary I can of course change it.
>
> But if what we're doing is skipping over an all-Consts list, then the
> individual Consts would be elided from the pg_stat_statements entry
> anyway, no?  All that would remain is information about how many such
> Consts there were, which is exactly the information you want to drop.

Hm, yes, you're right. I guess I was thinking about this more like about
shortening some text with ellipsis, but indeed no actual Consts will end
up in the result anyway. Thanks for clarification, will modify the
patch!

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

26 March 2022, 17:40:35

> On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote:
> > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> > Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> > >> I do find it odd that the proposed patch doesn't cause the *entire*
> > >> list to be skipped over.  That seems like extra complexity and confusion
> > >> to no benefit.
> >
> > > That's a bit surprising for me, I haven't even thought that folks could
> > > think this is an odd behaviour. As I've mentioned above, the original
> > > idea was to give some clues about what was inside the collapsed array,
> > > but if everyone finds it unnecessary I can of course change it.
> >
> > But if what we're doing is skipping over an all-Consts list, then the
> > individual Consts would be elided from the pg_stat_statements entry
> > anyway, no?  All that would remain is information about how many such
> > Consts there were, which is exactly the information you want to drop.
>
> Hm, yes, you're right. I guess I was thinking about this more like about
> shortening some text with ellipsis, but indeed no actual Consts will end
> up in the result anyway. Thanks for clarification, will modify the
> patch!

Here is another iteration. Now the patch doesn't leave any trailing
Consts in the normalized query, and contains more documentation. I hope
it's getting better.

Attachment

v7-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

24 July 2022, 10:06:36

> On Sat, Mar 26, 2022 at 06:40:35PM +0100, Dmitry Dolgov wrote:
> > On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote:
> > > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> > > Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> > > >> I do find it odd that the proposed patch doesn't cause the *entire*
> > > >> list to be skipped over.  That seems like extra complexity and confusion
> > > >> to no benefit.
> > >
> > > > That's a bit surprising for me, I haven't even thought that folks could
> > > > think this is an odd behaviour. As I've mentioned above, the original
> > > > idea was to give some clues about what was inside the collapsed array,
> > > > but if everyone finds it unnecessary I can of course change it.
> > >
> > > But if what we're doing is skipping over an all-Consts list, then the
> > > individual Consts would be elided from the pg_stat_statements entry
> > > anyway, no?  All that would remain is information about how many such
> > > Consts there were, which is exactly the information you want to drop.
> >
> > Hm, yes, you're right. I guess I was thinking about this more like about
> > shortening some text with ellipsis, but indeed no actual Consts will end
> > up in the result anyway. Thanks for clarification, will modify the
> > patch!
>
> Here is another iteration. Now the patch doesn't leave any trailing
> Consts in the normalized query, and contains more documentation. I hope
> it's getting better.

Hi,

Here is the rebased version, with no other changes.

Attachment

v8-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re:pg_stat_statements and "IN" conditions

From

Sergei Kornilov

Date:

16 September 2022, 18:25:13

Hello!

Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)

I'm reviewing a patch on top of a previous commit and noticed a failed test:

#   Failed test 'no parameters missing from postgresql.conf.sample'
#   at t/003_check_guc.pl line 82.
#          got: '1'
#     expected: '0'
# Looks like you failed 1 test of 3.
t/003_check_guc.pl .............. 

The new option has not been added to the postgresql.conf.sample

PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just
becausesome ORM sends requests with a different number of parameters.
 

regards, Sergei

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

24 September 2022, 14:07:14

> On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> Hello!
>
> Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
>
> I'm reviewing a patch on top of a previous commit and noticed a failed test:
>
> #   Failed test 'no parameters missing from postgresql.conf.sample'
> #   at t/003_check_guc.pl line 82.
> #          got: '1'
> #     expected: '0'
> # Looks like you failed 1 test of 3.
> t/003_check_guc.pl ..............
>
> The new option has not been added to the postgresql.conf.sample
>
> PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just
becausesome ORM sends requests with a different number of parameters.
 

Thanks! I'll post the rebased version soon.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

24 September 2022, 23:59:39

> On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote:
> > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> > Hello!
> >
> > Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
> >
> > I'm reviewing a patch on top of a previous commit and noticed a failed test:
> >
> > #   Failed test 'no parameters missing from postgresql.conf.sample'
> > #   at t/003_check_guc.pl line 82.
> > #          got: '1'
> > #     expected: '0'
> > # Looks like you failed 1 test of 3.
> > t/003_check_guc.pl ..............
> >
> > The new option has not been added to the postgresql.conf.sample
> >
> > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries
justbecause some ORM sends requests with a different number of parameters.
 
>
> Thanks! I'll post the rebased version soon.

And here it is.

Attachment

v9-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

vignesh C

Date:

27 January 2023, 14:45:29

On Sun, 25 Sept 2022 at 05:29, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote:
> > > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> > > Hello!
> > >
> > > Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
> > >
> > > I'm reviewing a patch on top of a previous commit and noticed a failed test:
> > >
> > > #   Failed test 'no parameters missing from postgresql.conf.sample'
> > > #   at t/003_check_guc.pl line 82.
> > > #          got: '1'
> > > #     expected: '0'
> > > # Looks like you failed 1 test of 3.
> > > t/003_check_guc.pl ..............
> > >
> > > The new option has not been added to the postgresql.conf.sample
> > >
> > > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries
justbecause some ORM sends requests with a different number of parameters.
 
> >
> > Thanks! I'll post the rebased version soon.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
456fa635a909ee36f73ca84d340521bd730f265f ===
=== applying patch
./v9-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
....
can't find file to patch at input line 746
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/backend/utils/misc/queryjumble.c
b/src/backend/utils/misc/queryjumble.c
|index a67487e5fe..063b4be725 100644
|--- a/src/backend/utils/misc/queryjumble.c
|+++ b/src/backend/utils/misc/queryjumble.c
--------------------------
No file to patch.  Skipping patch.
8 out of 8 hunks ignored
can't find file to patch at input line 913
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/include/utils/queryjumble.h b/src/include/utils/queryjumble.h
|index 3c2d9beab2..b50cc42d4e 100644
|--- a/src/include/utils/queryjumble.h
|+++ b/src/include/utils/queryjumble.h
--------------------------
No file to patch.  Skipping patch.

[1] - http://cfbot.cputube.org/patch_41_2837.log

Regards,
Vignesh

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

29 January 2023, 12:22:42

> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Thanks. I think this one should do the trick.

Attachment

v10-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Marcos Pegoraro

Date:

29 January 2023, 12:56:02

Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com> escreveu:

> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Thanks. I think this one should do the trick.

There is a typo on DOC part

+ and it's length is larger than <varname> const_merge_threshold </varname>,
+ then array elements will contribure nothing to the query identifier.
+ Thus the query will get the same identifier no matter how many constants

That "contribure" should be "contribute"

regards

Marcos

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

29 January 2023, 13:32:19

> On Sun, Jan 29, 2023 at 09:56:02AM -0300, Marcos Pegoraro wrote:
> Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com>
> escreveu:
>
> > > On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> > > The patch does not apply on top of HEAD as in [1], please post a rebased
> > patch:
> >
> > Thanks. I think this one should do the trick.
> >
>
> There is a typo on DOC part
> +        and it's length is larger than <varname> const_merge_threshold
> </varname>,
> +        then array elements will contribure nothing to the query
> identifier.
> +        Thus the query will get the same identifier no matter how many
> constants
>
> That "contribure" should be "contribute"

Indeed, thanks for noticing.

Attachment

v11-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Alvaro Herrera

Date:

02 February 2023, 14:07:27

This appears to have massive conflicts.  Would you please rebase?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

02 February 2023, 15:05:54

> On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote:
> This appears to have massive conflicts.  Would you please rebase?

Sure, I was already mentally preparing myself to do so in the view of
recent changes in query jumbling. Will post soon.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

04 February 2023, 17:08:41

> On Thu, Feb 02, 2023 at 04:05:54PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote:
> > This appears to have massive conflicts.  Would you please rebase?
>
> Sure, I was already mentally preparing myself to do so in the view of
> recent changes in query jumbling. Will post soon.

Here is the rebased version. To adapt to the latest changes, I've marked
ArrayExpr with custom_query_jumble to implement this functionality, but
tried to make the actual merge logic relatively independent. Otherwise,
everything is the same.

Attachment

v12-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Michael Paquier

Date:

05 February 2023, 01:30:25

On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote:
> Here is the rebased version. To adapt to the latest changes, I've marked
> ArrayExpr with custom_query_jumble to implement this functionality, but
> tried to make the actual merge logic relatively independent. Otherwise,
> everything is the same.

I was quickly looking at this patch, so these are rough impressions.

+   bool        merged;         /* whether or not the location was marked as
+                                  not contributing to jumble */

This part of the patch is a bit disturbing..  We have node attributes
to track if portions of a node should be ignored or have a location
marked, still this "merged" flag is used as an extension to track if a
location should be done or not.  Is that a concept that had better be
controlled via a new node attribute?

+--
+-- Consts merging
+--
+CREATE TABLE test_merge (id int, data int);
+-- IN queries
+-- No merging

Would it be better to split this set of tests into a new file?  FWIW,
I have a patch in baking process that refactors a bit the whole,
before being able to extend it so as we have more coverage for
normalized utility queries, as of now the query strings stored by
pg_stat_statements don't reflect that even if the jumbling computation
marks the location of the Const nodes included in utility statements
(partition bounds, queries of COPY, etc.).  I should be able to send
that tomorrow, and my guess that you could take advantage of that
even for this thread.
--
Michael

Attachment

signature.asc

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

05 February 2023, 11:33:46

> On Sun, Feb 05, 2023 at 10:30:25AM +0900, Michael Paquier wrote:
> On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote:
> > Here is the rebased version. To adapt to the latest changes, I've marked
> > ArrayExpr with custom_query_jumble to implement this functionality, but
> > tried to make the actual merge logic relatively independent. Otherwise,
> > everything is the same.
>
> I was quickly looking at this patch, so these are rough impressions.
>
> +   bool        merged;         /* whether or not the location was marked as
> +                                  not contributing to jumble */
>
> This part of the patch is a bit disturbing..  We have node attributes
> to track if portions of a node should be ignored or have a location
> marked, still this "merged" flag is used as an extension to track if a
> location should be done or not.  Is that a concept that had better be
> controlled via a new node attribute?

Good question. I need to think a bit more if it's possible to leverage
node attributes mechanism, but at the moment I'm still inclined to
extend LocationLen. The reason is that it doesn't exactly serve the
tracking purpose, i.e. whether to capture a location (I have to update
the code commentary), it helps differentiate cases when locations A and
D are obtained from merging A B C D instead of just being A and D.

I'm thinking about this in the following way: the core jumbling logic is
responsible for deriving locations based on the input expressions; in
the case of merging we produce less locations; pgss have to represent
the result only using locations and has to be able to differentiate
simple locations and locations after merging.

> +--
> +-- Consts merging
> +--
> +CREATE TABLE test_merge (id int, data int);
> +-- IN queries
> +-- No merging
>
> Would it be better to split this set of tests into a new file?  FWIW,
> I have a patch in baking process that refactors a bit the whole,
> before being able to extend it so as we have more coverage for
> normalized utility queries, as of now the query strings stored by
> pg_stat_statements don't reflect that even if the jumbling computation
> marks the location of the Const nodes included in utility statements
> (partition bounds, queries of COPY, etc.).  I should be able to send
> that tomorrow, and my guess that you could take advantage of that
> even for this thread.

Sure, I'll take a look how I can benefit from your work, thanks.

Re: pg_stat_statements and "IN" conditions

From

Tom Lane

Date:

05 February 2023, 16:02:32

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> I'm thinking about this in the following way: the core jumbling logic is
> responsible for deriving locations based on the input expressions; in
> the case of merging we produce less locations; pgss have to represent
> the result only using locations and has to be able to differentiate
> simple locations and locations after merging.

Uh ... why?  ISTM you're just going to elide all inside the IN,
so why do you need more than a start and stop position?

            regards, tom lane

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

05 February 2023, 19:56:00

> On Sun, Feb 05, 2023 at 11:02:32AM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > I'm thinking about this in the following way: the core jumbling logic is
> > responsible for deriving locations based on the input expressions; in
> > the case of merging we produce less locations; pgss have to represent
> > the result only using locations and has to be able to differentiate
> > simple locations and locations after merging.
>
> Uh ... why?  ISTM you're just going to elide all inside the IN,
> so why do you need more than a start and stop position?

Exactly, start and stop positions. But if there would be no information
that merging was applied, the following queries will look the same after
jumbling, right?

    -- input query
    SELECT * FROM test_merge WHERE id IN (1, 2);
    -- jumbling result, two LocationLen, for values 1 and 2
    SELECT * FROM test_merge WHERE id IN ($1, $2);

    -- input query
    SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    -- jumbling result, two LocationLen after merging, for values 1 and 10
    SELECT * FROM test_merge WHERE id IN (...);
    -- without remembering about merging the result would be
    SELECT * FROM test_merge WHERE id IN ($1, $2);

Re:pg_stat_statements and "IN" conditions

From

Sergei Kornilov

Date:

07 February 2023, 20:14:52

Hello!

Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 

It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 

select pg_stat_statements_reset();
set const_merge_threshold to 1;
select * from test where i in (1,2,3);
select * from test where i in (1,2);
select * from test where i in (1);
select query, calls from pg_stat_statements order by query;

                query                | calls 
-------------------------------------+-------
 select * from test where i in (...) |     2
 select * from test where i in ($1)  |     1

Probably const_merge_threshold = 1 should produce only "i in (...)"?

const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not
equals)?I think the documentation is ambiguous in this regard.
 

I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number

regards, Sergei

Re: pg_stat_statements and "IN" conditions

From

Peter Eisentraut

Date:

09 February 2023, 13:30:34

On 07.02.23 21:14, Sergei Kornilov wrote:
> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 

What is the point of making this a numeric setting?  Either you want to 
merge all values or you don't want to merge any values.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

09 February 2023, 15:02:38

> On Tue, Feb 07, 2023 at 11:14:52PM +0300, Sergei Kornilov wrote:
> Hello!

Thanks for reviewing.

> Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 

Yep, my favourite game, rebaseball. Will post a new version soon, after
figuring out all the recent questions.

> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 
>
> select pg_stat_statements_reset();
> set const_merge_threshold to 1;
> select * from test where i in (1,2,3);
> select * from test where i in (1,2);
> select * from test where i in (1);
> select query, calls from pg_stat_statements order by query;
>
>                 query                | calls
> -------------------------------------+-------
>  select * from test where i in (...) |     2
>  select * from test where i in ($1)  |     1
>
> Probably const_merge_threshold = 1 should produce only "i in (...)"?

Well, it's not intentional, probably I need to be more careful with
off-by-one. Although I agree to a certain extent with Peter questioning
the value of having numerical option here, let me think about this.

> const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not
equals)?I think the documentation is ambiguous in this regard.
 
>
> I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number

Yep, I'll rephrase the documentation.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

09 February 2023, 15:12:26

> On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
> On 07.02.23 21:14, Sergei Kornilov wrote:
> > It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2

>
> What is the point of making this a numeric setting?  Either you want to
> merge all values or you don't want to merge any values.

At least in theory the definition of "too many constants" is different
for different use cases and I see allowing to configure it as a way of
reducing the level of surprise here. The main scenario for a numerical
setting would be to distinguish between normal usage with just a handful
of constants (and the user expecting to see them represented in pgss)
and some sort of outliers with thousands of constants in a query (e.g.
as a defence mechanism for the infrastructure working with those
metrics). But I agree that it's not clear how much value is in that.

Not having strong opinion about this I would be fine changing it to a
boolean option (with an actual limit hidden internally) if everyone
agrees it fits better.

Re: pg_stat_statements and "IN" conditions

From

Alvaro Herrera

Date:

09 February 2023, 17:26:51

On 2023-Feb-09, Dmitry Dolgov wrote:

> > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:

> > What is the point of making this a numeric setting?  Either you want
> > to merge all values or you don't want to merge any values.
> 
> At least in theory the definition of "too many constants" is different
> for different use cases and I see allowing to configure it as a way of
> reducing the level of surprise here.

I was thinking about this a few days ago and I agree that we don't
necessarily want to make it just a boolean thing; we may want to make it
more complex.  One trivial idea is to make it group entries in powers of
10: for 0-9 elements, you get one entry, and 10-99 you get a different
one, and so on:

# group everything in a single bucket
const_merge_threshold = true / yes / on 

# group 0-9, 10-99, 100-999, 1000-9999
const_merge_treshold = powers

Ideally the value would be represented somehow in the query text. For
example

                         query                            | calls
----------------------------------------------------------+-------
 select * from test where i in ({... 0-9 entries ...})    |     2
 select * from test where i in ({... 10-99 entries ...})  |     1

What do you think?  The jumble would have to know how to reduce all
values within each power-of-ten group to one specific value, but I don't
think that should be particularly difficult.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

09 February 2023, 19:43:29

> On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote:
> On 2023-Feb-09, Dmitry Dolgov wrote:
>
> > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
>
> > > What is the point of making this a numeric setting?  Either you want
> > > to merge all values or you don't want to merge any values.
> >
> > At least in theory the definition of "too many constants" is different
> > for different use cases and I see allowing to configure it as a way of
> > reducing the level of surprise here.
>
> I was thinking about this a few days ago and I agree that we don't
> necessarily want to make it just a boolean thing; we may want to make it
> more complex.  One trivial idea is to make it group entries in powers of
> 10: for 0-9 elements, you get one entry, and 10-99 you get a different
> one, and so on:
>
> # group everything in a single bucket
> const_merge_threshold = true / yes / on
>
> # group 0-9, 10-99, 100-999, 1000-9999
> const_merge_treshold = powers
>
> Ideally the value would be represented somehow in the query text. For
> example
>
>                          query                            | calls
> ----------------------------------------------------------+-------
>  select * from test where i in ({... 0-9 entries ...})    |     2
>  select * from test where i in ({... 10-99 entries ...})  |     1
>
> What do you think?  The jumble would have to know how to reduce all
> values within each power-of-ten group to one specific value, but I don't
> think that should be particularly difficult.

Yeah, it sounds appealing and conveniently addresses the question of
losing the information about how many constants originally were there.
Not sure if the example above would be the most natural way to represent
it in the query text, but otherwise I'm going to try implementing this.
Stay tuned.

Re: pg_stat_statements and "IN" conditions

From

David Geier

Date:

11 February 2023, 10:03:36

Hi,

On 2/9/23 16:02, Dmitry Dolgov wrote:
>> Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 
I reviewed the last patch applied to some commit from Feb. 4th.
>> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 
>>
>> select pg_stat_statements_reset();
>> set const_merge_threshold to 1;
>> select * from test where i in (1,2,3);
>> select * from test where i in (1,2);
>> select * from test where i in (1);
>> select query, calls from pg_stat_statements order by query;
>>
>>                  query                | calls
>> -------------------------------------+-------
>>   select * from test where i in (...) |     2
>>   select * from test where i in ($1)  |     1
>>
>> Probably const_merge_threshold = 1 should produce only "i in (...)"?
> Well, it's not intentional, probably I need to be more careful with
> off-by-one. Although I agree to a certain extent with Peter questioning

Please add tests for all the corner cases. At least for (1) IN only 
contains a single element and (2) const_merge_threshold = 1.

Beyond that:

- There's a comment about find_const_walker(). I cannot find that 
function anywhere. What am I missing?

- What about renaming IsConstList() to IsMergableConstList().

- Don't you intend to use the NUMERIC data column in SELECT * FROM 
test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? 
Otherwise, the test is identical to previous test cases and you're not 
checking for what happens with NUMERICs which are wrapped in FuncExpr 
because of the implicit coercion.

- Don't we want to extend IsConstList() to allow a list of all 
implicitly coerced constants? It's inconsistent that otherwise e.g. 
NUMERICs don't work.

- Typo in /* The firsts merged constant */ (first not firsts)

- Prepared statements are not supported as they contain INs with Param 
instead of Const nodes. While less likely, I've seen applications that 
use prepared statements in conjunction with queries generated through a 
UI which ended up with tons of prepared queries with different number of 
elements in the IN clause. Not necessarily something that must go into 
this patch but maybe worth thinking about.

- The setting name const_merge_threshold is not very telling without 
knowing the context. While being a little longer, what about 
jumble_const_merge_threshold or queryid_const_merge_threshold or similar?

- Why do we actually only want to merge constants? Why don't we ignore 
the type of element in the IN and merge whatever is there? Is this 
because the original jumbling logic as of now only has support for 
constants?

- Ideally we would even remove duplicates. That would even improve 
cardinality estimation but I guess we don't want to spend the cycles on 
doing so in the planner?

- Why did you change palloc() to palloc0() for clocations array? The 
length is initialized to 0 and FWICS RecordConstLocation() initializes 
all members. Seems to me like we don't have to spend these cycles.

- Can't the logic at the end of IsConstList() not be simplified to:

         foreach(temp, elements)
             if (!IsA(lfirst(temp), Const))
                 return false;

         // All elements are of type Const
         *firstConst = linitial_node(Const, elements);
         *lastConst = llast_node(Const, elements);
         return true;

-- 
David Geier
(ServiceNow)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

11 February 2023, 10:47:07

> On Sat, Feb 11, 2023 at 11:03:36AM +0100, David Geier wrote:
> Hi,
>
> On 2/9/23 16:02, Dmitry Dolgov wrote:
> > > Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)

> I reviewed the last patch applied to some commit from Feb. 4th.

Thanks for looking. Few quick answers about high-level questions below,
the rest I'll incorporate in the new version.

> - There's a comment about find_const_walker(). I cannot find that function
> anywhere. What am I missing?
>
> [...]
>
> - Don't you intend to use the NUMERIC data column in SELECT * FROM
> test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise,
> the test is identical to previous test cases and you're not checking for
> what happens with NUMERICs which are wrapped in FuncExpr because of the
> implicit coercion.
>
> - Don't we want to extend IsConstList() to allow a list of all implicitly
> coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't
> work.
>
> [...]
>
> - Prepared statements are not supported as they contain INs with Param
> instead of Const nodes. While less likely, I've seen applications that use
> prepared statements in conjunction with queries generated through a UI which
> ended up with tons of prepared queries with different number of elements in
> the IN clause. Not necessarily something that must go into this patch but
> maybe worth thinking about.

The original version of the patch was doing all of this, i.e. handling
numerics, Param nodes, RTE_VALUES. The commentary about
find_const_walker in tests is referring to a part of that, that was
dealing with evaluation of expression to see if it could be reduced to a
constant.

Unfortunately there was a significant push back from reviewers because
of those features. That's why I've reduced the patch to it's minimally
useful version, having in mind re-implementing them as follow-up patches
in the future. This is the reason as well why I left tests covering all
this missing functionality -- as breadcrumbs to already discovered
cases, important for the future extensions.

> - Why do we actually only want to merge constants? Why don't we ignore the
> type of element in the IN and merge whatever is there? Is this because the
> original jumbling logic as of now only has support for constants?
>
> - Ideally we would even remove duplicates. That would even improve
> cardinality estimation but I guess we don't want to spend the cycles on
> doing so in the planner?

I believe these points are beyond the patch goals, as it's less clear
(at least to me) if it's safe or desirable to do so.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

11 February 2023, 12:08:20

> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
>
> The original version of the patch was doing all of this, i.e. handling
> numerics, Param nodes, RTE_VALUES. The commentary about
> find_const_walker in tests is referring to a part of that, that was
> dealing with evaluation of expression to see if it could be reduced to a
> constant.
>
> Unfortunately there was a significant push back from reviewers because
> of those features. That's why I've reduced the patch to it's minimally
> useful version, having in mind re-implementing them as follow-up patches
> in the future. This is the reason as well why I left tests covering all
> this missing functionality -- as breadcrumbs to already discovered
> cases, important for the future extensions.

I'd like to elaborate on this a bit and remind about the origins of the
patch, as it's lost somewhere in the beginning of the thread. The idea
is not pulled out of thin air, everything is coming from our attempts to
improve one particular monitoring infrastructure in a real commercial
setting. Every covered use case and test in the original proposal was a
result of field trials, when some application-side library or ORM was
responsible for gigabytes of data in pgss, chocking the monitoring agent.

Re: pg_stat_statements and "IN" conditions

From

David Geier

Date:

15 February 2023, 07:51:56

Hi,

On 2/11/23 13:08, Dmitry Dolgov wrote:
>> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
>>
>> The original version of the patch was doing all of this, i.e. handling
>> numerics, Param nodes, RTE_VALUES. The commentary about
>> find_const_walker in tests is referring to a part of that, that was
>> dealing with evaluation of expression to see if it could be reduced to a
>> constant.
>>
>> Unfortunately there was a significant push back from reviewers because
>> of those features. That's why I've reduced the patch to it's minimally
>> useful version, having in mind re-implementing them as follow-up patches
>> in the future. This is the reason as well why I left tests covering all
>> this missing functionality -- as breadcrumbs to already discovered
>> cases, important for the future extensions.
> I'd like to elaborate on this a bit and remind about the origins of the
> patch, as it's lost somewhere in the beginning of the thread. The idea
> is not pulled out of thin air, everything is coming from our attempts to
> improve one particular monitoring infrastructure in a real commercial
> setting. Every covered use case and test in the original proposal was a
> result of field trials, when some application-side library or ORM was
> responsible for gigabytes of data in pgss, chocking the monitoring agent.

Thanks for the clarifications. I didn't mean to contend the usefulness 
of the patch and I wasn't aware that you already jumped through the 
loops of handling Param, etc. Seems like supporting only constants is a 
good starting point. The only thing that is likely confusing for users 
is that NUMERICs (and potentially constants of other types) are 
unsupported. Wouldn't it be fairly simple to support them via something 
like the following?

     is_const(element) || (is_coercion(element) && is_const(element->child))

-- 
David Geier
(ServiceNow)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

17 February 2023, 15:46:02

> On Wed, Feb 15, 2023 at 08:51:56AM +0100, David Geier wrote:
> Hi,
>
> On 2/11/23 13:08, Dmitry Dolgov wrote:
> > > On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
> > >
> > > The original version of the patch was doing all of this, i.e. handling
> > > numerics, Param nodes, RTE_VALUES. The commentary about
> > > find_const_walker in tests is referring to a part of that, that was
> > > dealing with evaluation of expression to see if it could be reduced to a
> > > constant.
> > >
> > > Unfortunately there was a significant push back from reviewers because
> > > of those features. That's why I've reduced the patch to it's minimally
> > > useful version, having in mind re-implementing them as follow-up patches
> > > in the future. This is the reason as well why I left tests covering all
> > > this missing functionality -- as breadcrumbs to already discovered
> > > cases, important for the future extensions.
> > I'd like to elaborate on this a bit and remind about the origins of the
> > patch, as it's lost somewhere in the beginning of the thread. The idea
> > is not pulled out of thin air, everything is coming from our attempts to
> > improve one particular monitoring infrastructure in a real commercial
> > setting. Every covered use case and test in the original proposal was a
> > result of field trials, when some application-side library or ORM was
> > responsible for gigabytes of data in pgss, chocking the monitoring agent.
>
> Thanks for the clarifications. I didn't mean to contend the usefulness of
> the patch and I wasn't aware that you already jumped through the loops of
> handling Param, etc.

No worries, I just wanted to emphasize that we've already collected
quite some number of use cases.

> Seems like supporting only constants is a good starting
> point. The only thing that is likely confusing for users is that NUMERICs
> (and potentially constants of other types) are unsupported. Wouldn't it be
> fairly simple to support them via something like the following?
>
>     is_const(element) || (is_coercion(element) && is_const(element->child))

It definitely makes sense to implement that, although I don't think it's
going to be acceptable to do that via directly listing conditions an
element has to satisfy. It probably has to be more flexible, sice we
would like to extend it in the future. My plan is to address this in a
follow-up patch, when the main mechanism is approved. Would you agree
with this approach?

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

17 February 2023, 15:46:43

> On Thu, Feb 09, 2023 at 08:43:29PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote:
> > On 2023-Feb-09, Dmitry Dolgov wrote:
> >
> > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
> >
> > > > What is the point of making this a numeric setting?  Either you want
> > > > to merge all values or you don't want to merge any values.
> > >
> > > At least in theory the definition of "too many constants" is different
> > > for different use cases and I see allowing to configure it as a way of
> > > reducing the level of surprise here.
> >
> > I was thinking about this a few days ago and I agree that we don't
> > necessarily want to make it just a boolean thing; we may want to make it
> > more complex.  One trivial idea is to make it group entries in powers of
> > 10: for 0-9 elements, you get one entry, and 10-99 you get a different
> > one, and so on:
> >
> > # group everything in a single bucket
> > const_merge_threshold = true / yes / on
> >
> > # group 0-9, 10-99, 100-999, 1000-9999
> > const_merge_treshold = powers
> >
> > Ideally the value would be represented somehow in the query text. For
> > example
> >
> >                          query                            | calls
> > ----------------------------------------------------------+-------
> >  select * from test where i in ({... 0-9 entries ...})    |     2
> >  select * from test where i in ({... 10-99 entries ...})  |     1
> >
> > What do you think?  The jumble would have to know how to reduce all
> > values within each power-of-ten group to one specific value, but I don't
> > think that should be particularly difficult.
>
> Yeah, it sounds appealing and conveniently addresses the question of
> losing the information about how many constants originally were there.
> Not sure if the example above would be the most natural way to represent
> it in the query text, but otherwise I'm going to try implementing this.
> Stay tuned.

It took me couple of evenings, here is what I've got:

* The representation is not that far away from your proposal, I've
  settled on:

    SELECT * FROM test_merge WHERE id IN (... [10-99 entries])

* To not reinvent the wheel, I've reused decimalLenght function from
  numutils, hence one more patch to make it available to reuse.

* This approach resolves my concerns about letting people tuning
  the behaviour of merging, as now it's possible to distinguish between
  calls with different number of constants up to the power of 10. So
  I've decided to simplify the configuration and make the guc boolean to
  turn it off or on.

* To separate queries with constants falling into different ranges
  (10-99, 100-999, etc), the order of magnitude is added into the jumble
  hash.

* I've incorporated feedback from Sergei and David, as well as tried to
  make comments and documentation more clear.

Any feedback is welcomed, thanks!

> On Tue, Mar 14, 2023 at 08:04:32PM +0100, Dmitry Dolgov wrote:
> > On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote:
> > So I was seeing that this patch needs a rebase according to cfbot.
>
> Yeah, folks are getting up to speed in with pgss improvements recently.
> Thanks for letting me know.

Following recent refactoring of pg_stat_statements tests, I've created a
new one for merging functionality in the patch. This should solve any
conflicts.

On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote:
> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote:
>> IMHO this adds way too much complexity to something that most users would
>> expect to be an on/off switch.
>
> This documentation is exclusively to be precise about how does it work.
> Users don't have to worry about all this, and pretty much turn it
> on/off, as you've described. I agree though, I could probably write this
> text a bit differently.

FWIW, I am going to side with Nathan on this one, but not completely
either.  I was looking at the patch and it brings too much complexity
for a monitoring feature in this code path.  In my experience, I've
seen people complain about IN/ANY never strimmed down to a single
parameter in pg_stat_statements but I still have to hear from somebody
outside this thread that they'd like to reduce an IN clause depending
on the number of items, or something else.

>> If I understand Álvaro's suggestion [0] correctly, he's saying that in
>> addition to allowing "on" and "off", it might be worth allowing
>> something like "powers" to yield roughly the behavior described above.
>> I don't think he's suggesting that this "powers" behavior should be
>> the only available option.
>
> Independently of what Álvaro was suggesting, I find the "powers"
> approach more suitable, because it answers my own concerns about the
> previous implementation. Having "on"/"off" values means we would have to
> scratch heads coming up with a one-size-fit-all default value, or to
> introduce another option for the actual cut-off threshold. I would like
> to avoid both of those options, that's why I went with "powers" only.

Now, it doesn't mean that this approach with the "powers" will never
happen, but based on the set of opinions I am gathering on this thread
I would suggest to rework the patch as follows:
- First implement an on/off switch that reduces the lists in IN and/or
ANY to one parameter.  Simply.
- Second, refactor the powers routine.
- Third, extend the on/off switch, or just implement a threshold with
a second switch.

When it comes to my opinion, I am not seeing any objections to the
feature as a whole, and I'm OK with the first step.  I'm also OK to
keep the door open for more improvements in controlling how these
IN/ANY lists show up, but there could be more than just the number of
items as parameter (say the query size, different behaviors depending
on the number of clauses in queries, subquery context or CTEs/WITH,
etc. just to name a few things coming in mind).
--
Michael

Attachment

signature.asc

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

13 October 2023, 13:35:19

> On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote:
> Now, it doesn't mean that this approach with the "powers" will never
> happen, but based on the set of opinions I am gathering on this thread
> I would suggest to rework the patch as follows:
> - First implement an on/off switch that reduces the lists in IN and/or
> ANY to one parameter.  Simply.
> - Second, refactor the powers routine.
> - Third, extend the on/off switch, or just implement a threshold with
> a second switch.

Well, if it will help move this patch forward, why not. To clarify, I'm
going to split the current implementation into three patches, one for
each point you've mentioned.

> When it comes to my opinion, I am not seeing any objections to the
> feature as a whole, and I'm OK with the first step.  I'm also OK to
> keep the door open for more improvements in controlling how these
> IN/ANY lists show up, but there could be more than just the number of
> items as parameter (say the query size, different behaviors depending
> on the number of clauses in queries, subquery context or CTEs/WITH,
> etc. just to name a few things coming in mind).

Interesting point, but now it's my turn to have troubles imagining the
case, where list representation could be controlled depending on
something else than the number of elements in it. Do you have any
specific example in mind?

Re: pg_stat_statements and "IN" conditions

From

Nathan Bossart

Date:

13 October 2023, 16:37:30

On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote:
>> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote:
>> Also, it seems counterintuitive that queries with fewer than 10
>> constants are not merged.
> 
> Why? What would be your intuition using this feature?

For the "powers" setting, I would've expected queries with 0-9 constants to
be merged.  Then 10-99, 100-999, 1000-9999, etc.  I suppose there might be
an argument for separating 0 from 1-9, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

17 October 2023, 08:15:41

> On Fri, Oct 13, 2023 at 03:35:19PM +0200, Dmitry Dolgov wrote:
> > On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote:
> > Now, it doesn't mean that this approach with the "powers" will never
> > happen, but based on the set of opinions I am gathering on this thread
> > I would suggest to rework the patch as follows:
> > - First implement an on/off switch that reduces the lists in IN and/or
> > ANY to one parameter.  Simply.
> > - Second, refactor the powers routine.
> > - Third, extend the on/off switch, or just implement a threshold with
> > a second switch.
>
> Well, if it will help move this patch forward, why not. To clarify, I'm
> going to split the current implementation into three patches, one for
> each point you've mentioned.

Here is what I had mind. The first patch implements the basic notion of
merging, and I guess everyone agrees on its usefulness. The second and
third implement merging into groups power of 10, which I still find
useful as well. The last one adds a lower threshold for merging on top
of that. My intentions are to get the first one in, ideally I would love
to see the second and third applied as well.

> > When it comes to my opinion, I am not seeing any objections to the
> > feature as a whole, and I'm OK with the first step.  I'm also OK to
> > keep the door open for more improvements in controlling how these
> > IN/ANY lists show up, but there could be more than just the number of
> > items as parameter (say the query size, different behaviors depending
> > on the number of clauses in queries, subquery context or CTEs/WITH,
> > etc. just to name a few things coming in mind).
>
> Interesting point, but now it's my turn to have troubles imagining the
> case, where list representation could be controlled depending on
> something else than the number of elements in it. Do you have any
> specific example in mind?

In the current patch version I didn't add anything yet to address the
question of having more parameters to tune constants merging. The main
obstacle as I see it is that the information for that has to be
collected when jumbling various query nodes. Anything except information
about the ArrayExpr itself would have to be acquired when jumbling some
other parts of the query, not directly related to the ArrayExpr. It
seems to me this interdependency between otherwise unrelated nodes
outweigh the value it brings, and would require some more elaborate (and
more invasive for the purpose of this patch) mechanism to implement.

Attachment

Re: pg_stat_statements and "IN" conditions

From

Michael Paquier

Date:

26 October 2023, 00:08:42

On Tue, Oct 17, 2023 at 10:15:41AM +0200, Dmitry Dolgov wrote:
> In the current patch version I didn't add anything yet to address the
> question of having more parameters to tune constants merging. The main
> obstacle as I see it is that the information for that has to be
> collected when jumbling various query nodes. Anything except information
> about the ArrayExpr itself would have to be acquired when jumbling some
> other parts of the query, not directly related to the ArrayExpr. It
> seems to me this interdependency between otherwise unrelated nodes
> outweigh the value it brings, and would require some more elaborate (and
> more invasive for the purpose of this patch) mechanism to implement.

 typedef struct ArrayExpr
 {
+    pg_node_attr(custom_query_jumble)
+

Hmm.  I am not sure that this is the best approach
implementation-wise.  Wouldn't it be better to invent a new
pg_node_attr (these can include parameters as well!), say
query_jumble_merge or query_jumble_agg_location that aggregates all
the parameters of a list to be considered as a single element.  To put
it short, we could also apply the same property to other parts of a
parsed tree, and not only an ArrayExpr's list.

 /* GUC parameters */
 extern PGDLLIMPORT int compute_query_id;
-
+extern PGDLLIMPORT bool query_id_const_merge;

Not much a fan of this addition as well for an in-core GUC.  I would
suggest pushing the GUC layer to pg_stat_statements, maintaining the
computation method to use as a field of JumbleState as I suspect that
this is something we should not enforce system-wide, but at
extension-level instead.

+    /*
+     * Indicates the constant represents the beginning or the end of a merged
+     * constants interval.
+     */
+    bool        merged;

Not sure that this is the best thing to do either.  Instead of this
extra boolean flag, could it be simpler if we switch LocationLen so as
we track the start position and the end position of a constant in a
query string, so as we'd use one LocationLen for a whole set of Const
nodes in an ArrayExpr?  Perhaps this could just be a refactoring piece
of its own?

+    /*
+     * If the first expression is a constant, verify if the following elements
+     * are constants as well. If yes, the list is eligible for merging, and the
+     * order of magnitude need to be calculated.
+     */
+    if (IsA(firstExpr, Const))
+    {
+        foreach(temp, elements)
+            if (!IsA(lfirst(temp), Const))
+                return false;

This path should be benchmarked, IMO.
--
Michael

Attachment

signature.asc

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

27 October 2023, 15:02:44

> On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
>  typedef struct ArrayExpr
>  {
> +    pg_node_attr(custom_query_jumble)
> +
>
> Hmm.  I am not sure that this is the best approach
> implementation-wise.  Wouldn't it be better to invent a new
> pg_node_attr (these can include parameters as well!), say
> query_jumble_merge or query_jumble_agg_location that aggregates all
> the parameters of a list to be considered as a single element.  To put
> it short, we could also apply the same property to other parts of a
> parsed tree, and not only an ArrayExpr's list.

Sounds like an interesting idea, something like:

    typedef struct ArrayExpr
    {
        ...
        List       *elements pg_node_attr(query_jumble_merge);

to replace simple JUMBLE_NODE(elements) with more elaborated logic.

>  /* GUC parameters */
>  extern PGDLLIMPORT int compute_query_id;
> -
> +extern PGDLLIMPORT bool query_id_const_merge;
>
> Not much a fan of this addition as well for an in-core GUC.  I would
> suggest pushing the GUC layer to pg_stat_statements, maintaining the
> computation method to use as a field of JumbleState as I suspect that
> this is something we should not enforce system-wide, but at
> extension-level instead.

I also do not particularly like an extra GUC here, but as far as I can
tell to make it pg_stat_statements GUC only it has to be something
similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be
called from pgss. Does this sound better?

> +    /*
> +     * Indicates the constant represents the beginning or the end of a merged
> +     * constants interval.
> +     */
> +    bool        merged;
>
> Not sure that this is the best thing to do either.  Instead of this
> extra boolean flag, could it be simpler if we switch LocationLen so as
> we track the start position and the end position of a constant in a
> query string, so as we'd use one LocationLen for a whole set of Const
> nodes in an ArrayExpr?  Perhaps this could just be a refactoring piece
> of its own?

Sounds interesting as well, but it seems to me there is a catch. I'll
try to elaborate, bear with me:

* if the start and the end positions of a constant means the first and the
last character representing it, we need the textual length of the
constant in the query to be able to construct such a LocationLen.  The
lengths are calculated in pg_stat_statements later, not in JumbleQuery,
and it uses parser for that. Doing all of this in JumbleQuery doesn't
sound reasonable to me.

* if instead we talk about the start and the end positions in a
set of constants, that would mean locations of the first and the last
constants in the set, and everything seems fine. But for such
LocationLen to represent a single constant (not a set of constants), it
means that only the start position would be meaningful, the end position
will not be used.

The second approach is somewhat close to be simpler than the merge flag,
but assumes the ugliness for a single constant. What do you think about
this?

> +    /*
> +     * If the first expression is a constant, verify if the following elements
> +     * are constants as well. If yes, the list is eligible for merging, and the
> +     * order of magnitude need to be calculated.
> +     */
> +    if (IsA(firstExpr, Const))
> +    {
> +        foreach(temp, elements)
> +            if (!IsA(lfirst(temp), Const))
> +                return false;
>
> This path should be benchmarked, IMO.

I can do some benchmarking here, but of course it's going to be slower
than the baseline. The main idea behind the patch is to trade this
overhead for the benefits in the future while processing pgss records,
hoping that it's going to be worth it (and in those extreme cases I'm
trying to address it's definitely worth it).

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

31 October 2023, 09:03:07

> On Fri, Oct 27, 2023 at 05:02:44PM +0200, Dmitry Dolgov wrote:
> > On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
> >  typedef struct ArrayExpr
> >  {
> > +    pg_node_attr(custom_query_jumble)
> > +
> >
> > Hmm.  I am not sure that this is the best approach
> > implementation-wise.  Wouldn't it be better to invent a new
> > pg_node_attr (these can include parameters as well!), say
> > query_jumble_merge or query_jumble_agg_location that aggregates all
> > the parameters of a list to be considered as a single element.  To put
> > it short, we could also apply the same property to other parts of a
> > parsed tree, and not only an ArrayExpr's list.
>
> Sounds like an interesting idea, something like:
>
>     typedef struct ArrayExpr
>     {
>         ...
>         List       *elements pg_node_attr(query_jumble_merge);
>
> to replace simple JUMBLE_NODE(elements) with more elaborated logic.
>
> >  /* GUC parameters */
> >  extern PGDLLIMPORT int compute_query_id;
> > -
> > +extern PGDLLIMPORT bool query_id_const_merge;
> >
> > Not much a fan of this addition as well for an in-core GUC.  I would
> > suggest pushing the GUC layer to pg_stat_statements, maintaining the
> > computation method to use as a field of JumbleState as I suspect that
> > this is something we should not enforce system-wide, but at
> > extension-level instead.
>
> I also do not particularly like an extra GUC here, but as far as I can
> tell to make it pg_stat_statements GUC only it has to be something
> similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be
> called from pgss. Does this sound better?

For clarity, here is what I had in mind for those two points.

> On Mon, Jan 08, 2024 at 05:10:20PM +0100, Dmitry Dolgov wrote:
> > On Sat, Jan 06, 2024 at 09:04:54PM +0530, vignesh C wrote:
> >
> > CFBot shows documentation build has failed at [1] with:
> > [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc
> > [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF
> > attribute linkend references an unknown ID
> > "guc-query-id-const-merge-threshold"
> > [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4
> > [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml'
> > [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2
> > [07:44:58.182] make: *** [Makefile:16: all] Error 2
> >
> > [1] - https://cirrus-ci.com/task/6688578378399744
>
> Indeed, after moving the configuration option to pgss I forgot to update
> its reference in the docs. Thanks for noticing, will update soon.

Here is the fixed version.

> On Mon, Jan 22, 2024 at 06:07:27PM +0100, Dmitry Dolgov wrote:
> > Please notice that at the moment, it's not being tested at all because
> > of a patch-apply failure -- that's what the little triangular symbol
> > means.  The rest of the display concerns the test results from the
> > last successfully-applied patch version.  (Perhaps that isn't a
> > particularly great UI design.)
> >
> > If you click on the triangle you find out
> >
> > == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 ===
> > === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
> > patching file contrib/pg_stat_statements/Makefile
> > Hunk #1 FAILED at 19.
> > 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej
> > patching file contrib/pg_stat_statements/expected/merging.out
> > patching file contrib/pg_stat_statements/meson.build
>
> Oh, I see, thanks. Give me a moment, will fix this.

Here is it.

> On Wed, Mar 27, 2024 at 08:56:12AM +0900, Yasuo Honda wrote:
> Thanks for the useful info.
>
> Ruby on Rails uses bigint as a default data type for the primary key
> and prepared statements have been enabled by default for PostgreSQL.
> I'm looking forward to these current patches being merged as a first
> step and future versions of pg_stat_statements will support
> normalizing bigint and prepared statements.

Here is the rebased version. In the meantime I'm going to experiment
with how to support more use cases in a way that will be more acceptable
for the community.

> On Tue, Apr 23, 2024 at 10:18:15AM +0200, Dmitry Dolgov wrote:
> > On Mon, Apr 15, 2024 at 06:09:29PM +0900, Sutou Kouhei wrote:
> >
> > Thanks. I'm not familiar with this code base but I've
> > reviewed these patches because I'm interested in this
> > feature too.
>
> Thanks for the review! The commentaries for the first patch make sense
> to me, will apply.

Here is the new version. It turned out you were right about memory for
the normalized query, if the number of constants goes close to INT_MAX,
there were indeed not enough allocated. I've added a fix for this on top
of the applied changes, and also improved readability for
pg_stat_statements part.

> On Sun, Aug 11, 2024 at 09:34:55PM GMT, Dmitry Dolgov wrote:
> > On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote:
> >
> > This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who
thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the
discussionit was mentioned that it would be good to measure performance difference.

> >
> > PS: patch cannot be applied at this time, it needs another rebase.
>
> Yeah, it seems like most people are fine with the first patch and the
> simplest approach. I'm going to post a rebased version and a short
> thread summary soon.

Ok, here is the rebased version. If anyone would like to review them, below is
the short summary of the thread. Currently the patch series contains 4 changes:

* 0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

  Implements the simplest way to handle constant arrays, if the array contains
  only constants it will be reduced. This is the basis, if I read it correctly
  Nathan and Michael expressed that they're mostly fine with this one.

  Michael seems to be skeptical about the "merged" flag in the LocationLen
  struct, but from what I see the proposed alternative has problems as well.
  There was also a note that the loop over constants has to be benchmarked, but
  it's not entirely clear for me in which dimentions to benchmark (i.e. what
  are the expectations). For both I'm waiting on a reply to my questions.

* 0002-Reusable-decimalLength-functions.patch

  A small refactoring to make already existing "powers" functonality reusable
  for following patches.

* 0003-Merge-constants-in-ArrayExpr-into-groups.patch

  Makes handling of constant arrays smarter by taking into account number of
  elements in the array. This way records are merged into groups power of 10,
  i.e. arrays with length 55 will land in a group 10-99, with lenght 555 in a
  group 100-999 etc. This was proposed by Alvaro, and personally I like this
  approach, because it remediates the issue of one-size-fits-all for the static
  threshold. But there are opinions that this introduces too much complexity.

* 0004-Introduce-query_id_const_merge_threshold.patch

  Fine tuning for the previous patch, makes only arrays with the length over a
  certain threshold to be reduced.

On top of that Yasuo Honda and Jakub Wartak have provided a couple of practical
examples, where handling of constant arrays has to be improved. David Geier
pointed out some examples that might be confusing as well. All those are
definitely worth addressing, but out of scope of this patch for now.

On 2025-Feb-18, Sami Imseih wrote:

> > It's not a question about whether it's possible to implement this,
> > but about whether it makes sense. In case of plain constants it's
> > straightforward -- they will not change anything meaningfully and
> > hence could be squashed from the query. Now for a function, that
> > might return different values for the same set of constant
> > arguments, it's much less obvious and omitting such expressions
> > might have unexpected consequences.
> 
> query jumbling should not care about the behavior of the function. If
> we take a regular call to a volatile function, we will generate the
> same queryId for every call regardless of the input to the function.
> Why does the in-list case need to care about the volatility of the
> function?

I feel quite insecure about this idea TBH.  At least with immutable
functions I don't expect the system to behave wildly different than with
actual constants.  What non-immutable functions do you have in mind that
would be useful to fold as if they were constants in the IN list in such
a query?

In the meantime, here's v28 which is Dmitry's v27 plus pgindent.  No
other changes.  Dmitry, were you planning to submit a new version?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The problem with the future is that it keeps turning into the present"
(Hobbes)

Attachment

v28-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

03 March, 16:13:54

> On Mon, Mar 03, 2025 at 12:56:24PM GMT, Álvaro Herrera wrote:
> In the meantime, here's v28 which is Dmitry's v27 plus pgindent.  No
> other changes.  Dmitry, were you planning to submit a new version?

Nope, there was nothing I wanted to add so far.

Re: pg_stat_statements and "IN" conditions

From

Sami Imseih

Date:

03 March, 20:25:53

> > > It's not a question about whether it's possible to implement this,
> > > but about whether it makes sense. In case of plain constants it's
> > > straightforward -- they will not change anything meaningfully and
> > > hence could be squashed from the query. Now for a function, that
> > > might return different values for the same set of constant
> > > arguments, it's much less obvious and omitting such expressions
> > > might have unexpected consequences.
> >
> > query jumbling should not care about the behavior of the function. If
> > we take a regular call to a volatile function, we will generate the
> > same queryId for every call regardless of the input to the function.
> > Why does the in-list case need to care about the volatility of the
> > function?
>
> I feel quite insecure about this idea TBH.  At least with immutable
> functions I don't expect the system to behave wildly different than with
> actual constants.  What non-immutable functions do you have in mind that
> would be useful to fold as if they were constants in the IN list in such
> a query?

I don't have an example of non-immutable functions that should be folded,
but I also don't think query jumbling should care about the
function's volatility at all.

When dealing with FuncExpr, Why can't we simply skip merging when
func->funcformat != COERCE_IMPLICIT_CALL and the arguments
to the function are not constants? Meaning that anytime we have either
explicit casts or explicit function calls, this makes the list not eligible
for merging. See the attached .txt file to demonstrate what I am thinking.
This attached passes all the v28 test cases, but we will need to add tests
for implicit casts and external parameters.

-- 

Sami

Attachment

isMergeableConst.txt

Re: pg_stat_statements and "IN" conditions

From

vignesh C

Date:

17 March, 09:37:44

On Mon, 3 Mar 2025 at 17:26, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2025-Feb-18, Sami Imseih wrote:
>
> > > It's not a question about whether it's possible to implement this,
> > > but about whether it makes sense. In case of plain constants it's
> > > straightforward -- they will not change anything meaningfully and
> > > hence could be squashed from the query. Now for a function, that
> > > might return different values for the same set of constant
> > > arguments, it's much less obvious and omitting such expressions
> > > might have unexpected consequences.
> >
> > query jumbling should not care about the behavior of the function. If
> > we take a regular call to a volatile function, we will generate the
> > same queryId for every call regardless of the input to the function.
> > Why does the in-list case need to care about the volatility of the
> > function?
>
> I feel quite insecure about this idea TBH.  At least with immutable
> functions I don't expect the system to behave wildly different than with
> actual constants.  What non-immutable functions do you have in mind that
> would be useful to fold as if they were constants in the IN list in such
> a query?
>
> In the meantime, here's v28 which is Dmitry's v27 plus pgindent.  No
> other changes.  Dmitry, were you planning to submit a new version?

I noticed that the feedback from Sami at [1] has not yet been
addressed, I have changed the status to Waiting on Author, kindly
address them and update the status to Needs review.
[1] - https://www.postgresql.org/message-id/CAA5RZ0vt29Om%2BtKFOcUNhXV%2BkKpNnj0yj6OFho3-wngcMHWnAQ%40mail.gmail.com

Regards,
Vignesh

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

17 March, 11:11:58

> On Mon, Mar 17, 2025 at 12:07:44PM GMT, vignesh C wrote:
>
> I noticed that the feedback from Sami at [1] has not yet been
> addressed, I have changed the status to Waiting on Author, kindly
> address them and update the status to Needs review.
> [1] - https://www.postgresql.org/message-id/CAA5RZ0vt29Om%2BtKFOcUNhXV%2BkKpNnj0yj6OFho3-wngcMHWnAQ%40mail.gmail.com

I'm afraid there is a disagreement about this part of the feedback. I'm
not yet convinced about the idea suggested over there (treating mutable
functions in the same way as constants) and not planning to change
anything, at least not in the current version of the patch.

Re: pg_stat_statements and "IN" conditions

From

vignesh C

Date:

17 March, 11:28:33

On Mon, 17 Mar 2025 at 13:42, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Mar 17, 2025 at 12:07:44PM GMT, vignesh C wrote:
> >
> > I noticed that the feedback from Sami at [1] has not yet been
> > addressed, I have changed the status to Waiting on Author, kindly
> > address them and update the status to Needs review.
> > [1] -
https://www.postgresql.org/message-id/CAA5RZ0vt29Om%2BtKFOcUNhXV%2BkKpNnj0yj6OFho3-wngcMHWnAQ%40mail.gmail.com
>
> I'm afraid there is a disagreement about this part of the feedback. I'm
> not yet convinced about the idea suggested over there (treating mutable
> functions in the same way as constants) and not planning to change
> anything, at least not in the current version of the patch.

@Sami Imseih Do you have any other suggestions to solve this?
Others: Any thoughts on which way is better in this case?

Regards,
Vignesh

Re: pg_stat_statements and "IN" conditions

From

Álvaro Herrera

Date:

17 March, 11:44:20

On 2025-Mar-17, Dmitry Dolgov wrote:

> I'm afraid there is a disagreement about this part of the feedback. I'm
> not yet convinced about the idea suggested over there (treating mutable
> functions in the same way as constants) and not planning to change
> anything, at least not in the current version of the patch.

I have to admit that I am leaning towards removing the immutability
constraint.  The reason is that we already require the function to be
boostrapped (due to the OID test) and to have implicit cast form, so
that limits which functions are recognized; the only ones there that are
not immutable are:

         castsource          │        casttarget        │                 castfunc                 
─────────────────────────────┼──────────────────────────┼──────────────────────────────────────────
 text                        │ regclass                 │ regclass(text)
 character varying           │ regclass                 │ regclass(text)
 date                        │ timestamp with time zone │ timestamptz(date)
 time without time zone      │ time with time zone      │ timetz(time without time zone)
 timestamp without time zone │ timestamp with time zone │ timestamptz(timestamp without time zone)

Looking at this list, it seems rather random to me to say that we should
not squash arrays with types using these casts.  Should we really
consider two queries to be different because they run with different
search_path or TimeZone settings?

But kindly do not submit a new version of the patch, as I already have
some changes of my own (mostly on removing the term "merge" from code
and comments to replace with "squash", as well as adding some more
comments).  I'll post it soon.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: pg_stat_statements and "IN" conditions

From

Álvaro Herrera

Date:

17 March, 22:05:42

On 2025-Feb-14, Dmitry Dolgov wrote:

> This should do it. The last patch for today, otherwise I'll probably add
> more bugs than features :)

Thank you.  I've spent some time with this patch in the last few days,
and I propose a few changes.  I renamed everything from "merge" to
"squash"; apart from that, it's mostly docs and code comments changes,
but I also removed the addition of a boolean argument to JUMBLE_LOCATION
which AFAICS is unnecessary, and did away with the business of checking
for function immutability.  I also changed the code layout of
generate_normalized_query(); no functional changes, I just reordered the
code blocks (which caused a couple of lines to appear repeated that
weren't before).

You can see my patch on top of yours here:
https://github.com/alvherre/postgres/commits/query_id_squash_values/
and the CI run here:
https://cirrus-ci.com/build/5660053472018432

In addition, here I attach the complete patch on top of current master.
Unless there's some opposition to this, I intend to push this tomorrow.

I have to admit that looking at this part of the test,

+SELECT * FROM test_squash_cast WHERE data IN
+       (1::int4::casttesttype, 2::int4::casttesttype, 3::int4::casttesttype,
+        4::int4::casttesttype, 5::int4::casttesttype, 6::int4::casttesttype,
+        7::int4::casttesttype, 8::int4::casttesttype, 9::int4::casttesttype,
+        10::int4::casttesttype, 11::int4::casttesttype);
+ id | data 
+----+------
+(0 rows)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT * FROM test_squash_cast WHERE data IN      +|     1
+         ($1 /*, ... */::int4::casttesttype)        | 
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(2 rows)

and

+SELECT * FROM test_squash WHERE id IN (1::oid, 2::oid, 3::oid, 4::oid, 5::oid, 6::oid, 7::oid, 8::oid, 9::oid);
+ id | data 
+----+------
+(0 rows)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                           query                            | calls 
+------------------------------------------------------------+-------
+ SELECT * FROM test_squash WHERE id IN ($1 /*, ... */::oid) |     1

I am tempted to say that explicit casts should also be considered
squashable (that is, in IsSquashableConst() also allow the case of
func->funcformat == COERCE_EXPLICIT_CAST).  That would also squash
queries such as this one:

+SELECT * FROM test_squash_bigint WHERE data IN
+   (1::bigint, 2::bigint, 3::bigint, 4::bigint, 5::bigint, 6::bigint,
+    7::bigint, 8::bigint, 9::bigint, 10::bigint, 11::bigint);
+ id | data 
+----+------
+(0 rows)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                      query                                       | calls 
+----------------------------------------------------------------------------------+-------
+ SELECT * FROM test_squash_bigint WHERE data IN                                  +|     1
+         ($1::bigint, $2::bigint, $3::bigint, $4::bigint, $5::bigint, $6::bigint,+| 
+          $7::bigint, $8::bigint, $9::bigint, $10::bigint, $11::bigint)           | 

I, frankly, see little argument for making a distinction here.  We can
still discuss whether we prefer it one way or the other; we don't need
that decision to prevent me from pushing the patch I here attach, I think.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge faster
 than society gathers wisdom."  (Isaac Asimov)

Attachment

v29-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

Re: pg_stat_statements and "IN" conditions

From

Álvaro Herrera

Date:

17 March, 22:14:16

On 2025-Mar-17, Álvaro Herrera wrote:

> You can see my patch on top of yours here:
> https://github.com/alvherre/postgres/commits/query_id_squash_values/
> and the CI run here:
> https://cirrus-ci.com/build/5660053472018432

Heh, this blew up on bogus SGML markup :-(  Fixed and running again:
https://cirrus-ci.com/build/4822893680394240

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on."                             (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

18 March, 12:45:14

> On Mon, Mar 17, 2025 at 08:14:16PM GMT, Álvaro Herrera wrote:
> On 2025-Mar-17, Álvaro Herrera wrote:
>
> > You can see my patch on top of yours here:
> > https://github.com/alvherre/postgres/commits/query_id_squash_values/
> > and the CI run here:
> > https://cirrus-ci.com/build/5660053472018432
>
> Heh, this blew up on bogus SGML markup :-(  Fixed and running again:
> https://cirrus-ci.com/build/4822893680394240

Thanks, much appreciated! I've inspected the diff between patches and run few
tests, at the first glance everything looks fine.

> I am tempted to say that explicit casts should also be considered
> squashable (that is, in IsSquashableConst() also allow the case of
> func->funcformat == COERCE_EXPLICIT_CAST).

Well, I admit I may have been burned too much by the initial reception of the
patch and handled it too conservatively in this regard. Originally I also had a
concern about normalized queries representation for explicit cast case, but it
was resolved by Julien's suggestion to switch to the /* ... */ format.

> I have to admit that I am leaning towards removing the immutability
> constraint.  The reason is that we already require the function to be
> boostrapped (due to the OID test) and to have implicit cast form, so
> that limits which functions are recognized; the only ones there that are
> not immutable are:
>
>          castsource          │        casttarget        │                 castfunc
> ─────────────────────────────┼──────────────────────────┼──────────────────────────────────────────
>  text                        │ regclass                 │ regclass(text)
>  character varying           │ regclass                 │ regclass(text)
>  date                        │ timestamp with time zone │ timestamptz(date)
>  time without time zone      │ time with time zone      │ timetz(time without time zone)
>  timestamp without time zone │ timestamp with time zone │ timestamptz(timestamp without time zone)

Agree, when put together with the OID limitation it doesn't look so bad.
Somehow I was thinking about the Sami's proposal and the discussion in more
abstract terms, as if we talk about any arbitrary mutable functions to squash
-- I still would be cautious about hiding non-bootstrapped mutable functions.

Re: pg_stat_statements and "IN" conditions

From

Álvaro Herrera

Date:

18 March, 21:33:25

On 2025-Mar-18, Dmitry Dolgov wrote:

> Thanks, much appreciated! I've inspected the diff between patches and
> run few tests, at the first glance everything looks fine.

Thanks for looking once more.

> > I am tempted to say that explicit casts should also be considered
> > squashable (that is, in IsSquashableConst() also allow the case of
> > func->funcformat == COERCE_EXPLICIT_CAST).
> 
> Well, I admit I may have been burned too much by the initial reception
> of the patch and handled it too conservatively in this regard.

I can totally understand that.  I have added this and pushed.  Hopefully
nobody will hate this too much, and some people might even like it.

By the way, I'm still open to adding the 'powers' mechanism that was
discussed earlier and other simple additions on top of what's there now,
if you have some spare energy to spend on this.  (For full disclosure:
the "powers" thing was discussed in a developer's meeting last year, and
several people said they'd prefer to stick with 0001 and forget about
powers, on the grounds that it produced query texts that were weird and
invalid SQL.  But now that we have the commented-out syntax, it's no
longer invalid SQL, and maybe it's not so weird either.)

But there are other proposals such as handling Params and whatnot.  At
this point, what we'd need for that is a more extensive test suite to
show that we're not breaking other things ...

> Agree, when put together with the OID limitation it doesn't look so bad.
> Somehow I was thinking about the Sami's proposal and the discussion in more
> abstract terms,

Yeah, that happened to me too, and then I checked again and realized
that my initial impression was wrong.

> as if we talk about any arbitrary mutable functions to squash -- I
> still would be cautious about hiding non-bootstrapped mutable
> functions.

Yeah, absolutely.

One thing I noticed while going over the code, is that we say in some
code comments that the list of elements only contributes the first and
last elements to the jumble.  But this is not true -- the list actually
contributes _nothing at all_ to the jumble.  I don't think this causes
any terrible problems, but maybe somebody will discover that I'm wrong
on that.  This isn't trivial to solve, because if you try to add
anything to the jumble from there, you'd break the first/last location
pair matching.  We could maybe fix this by returning the actual
bottommost Const node from IsSquashableConstList() instead of whatever
is wrapping it, and then arrange for _jumbleConst() to receive a boolean
that turns off jumbling of the location.

However, contributing nothing already makes such a query different from
another query that has exactly one element, because that one jumbles
that element.  It could only be confused (in the sense of identical
query_ids) with another list that has zero elements.

Anyway, something to play with.

BTW, it's fun to execute a query that's literally
  select col from tab where col in (1 /*, ... */);
and then
  select col from tab where col in (1, 2);

because now you have two entries in pg_stat_statements with visibly the
same query text, but two different query_ids.  I'm not terribly worried
about this, because who uses a literal "/*, ... */" in a query anyway?
And even if they do, it's easily explained.  But jesters could probably
get a good laugh messing about with these reports.

Thanks for keeping at this for so long!

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: pg_stat_statements and "IN" conditions

From

Sami Imseih

Date:

18 March, 21:54:18

I want to mention that the current patch does not deal

with external parameters ( prepared statements ) [0] [1]. This could be a

follow-up, as it may need some further discussion. It is important to

address this case, IMO.

[0]

https://www.postgresql.org/message-id/CAA5RZ0uGfxXyzhp9N5nfsS%2BZqF5ngEMC3YtBPtLoeK8EPsjHbw%40mail.gmail.com

[1]

https://www.postgresql.org/message-id/blauoky77sash2qzrbnz6ilfyi7odtvxtdr4ifg4hq4bpqp2uk%406z5yjfsxtcnl

Regards,

Sami

Re: pg_stat_statements and "IN" conditions

From

Álvaro Herrera

Date:

18 March, 23:17:18

On 2025-Mar-18, Sami Imseih wrote:

> I want to mention that the current patch does not deal
> with external parameters ( prepared statements ) [0] [1]. This could be a
> follow-up, as it may need some further discussion. It is important to
> address this case, IMO.

Yes, I realize that.  Feel free to send a patch.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

19 March, 10:31:15

> On Tue, Mar 18, 2025 at 07:33:25PM GMT, Álvaro Herrera wrote:
>
> By the way, I'm still open to adding the 'powers' mechanism that was
> discussed earlier and other simple additions on top of what's there now,
> if you have some spare energy to spend on this.  (For full disclosure:
> the "powers" thing was discussed in a developer's meeting last year, and
> several people said they'd prefer to stick with 0001 and forget about
> powers, on the grounds that it produced query texts that were weird and
> invalid SQL.  But now that we have the commented-out syntax, it's no
> longer invalid SQL, and maybe it's not so weird either.)
>
> But there are other proposals such as handling Params and whatnot.  At
> this point, what we'd need for that is a more extensive test suite to
> show that we're not breaking other things ...

Yes, I'm planning to continue working on this topic, there are still
plenty of things that could be improved.

> One thing I noticed while going over the code, is that we say in some
> code comments that the list of elements only contributes the first and
> last elements to the jumble.  But this is not true -- the list actually
> contributes _nothing at all_ to the jumble.  I don't think this causes
> any terrible problems, but maybe somebody will discover that I'm wrong
> on that.  This isn't trivial to solve, because if you try to add
> anything to the jumble from there, you'd break the first/last location
> pair matching.  We could maybe fix this by returning the actual
> bottommost Const node from IsSquashableConstList() instead of whatever
> is wrapping it, and then arrange for _jumbleConst() to receive a boolean
> that turns off jumbling of the location.
>
> However, contributing nothing already makes such a query different from
> another query that has exactly one element, because that one jumbles
> that element.  It could only be confused (in the sense of identical
> query_ids) with another list that has zero elements.
>
> Anyway, something to play with.

Yep, I don't see this as an immediate problem as well, but will do some
experiments with that.

Re: pg_stat_statements and "IN" conditions

From

Dmitry Dolgov

Date:

19 March, 10:32:55

> On Tue, Mar 18, 2025 at 02:54:18PM GMT, Sami Imseih wrote:
> I want to mention that the current patch does not deal
> with external parameters ( prepared statements ) [0] [1]. This could be a
> follow-up, as it may need some further discussion. It is important to
> address this case, IMO.

Sure, it's important and I'm planning to tackle this next. If you want,
we can collaborate on a patch for that.

Re: pg_stat_statements and "IN" conditions

From

Sami Imseih

Date:

22 March, 23:59:02

> Sure, it's important and I'm planning to tackle this next. If you want,
> we can collaborate on a patch for that.

I spent some time looking some more at this, and I believe all that needs
to be done is check for a PRAM node with a type of PARAM_EXTERN.

During planning the planner turns the Param into a Const during
eval_const_expressions_mutator.

If it's as simple as I think it is, I hope we can get this committed for 18.
If not, and a longer discussion is needed, a new thread can be started
for this.

--
Sami Imseih
Amazon Web Services (AWS)

Attachment

v1-0001-Allow-query-jumble-to-squash-a-list-external-paramet.patch