Thread: pg_stat_statements and "IN" conditions

pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
Hi

I would like to start another thread to follow up on [1], mostly to bump up the
topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
queries like:

    SELECT something FROM table WHERE col IN (1, 2, 3, ...)

The current implementation produces different jumble hash for every different
number of arguments for essentially the same query. Unfortunately a lot of ORMs
like to generate these types of queries, which in turn leads to
pg_stat_statements pollution. Ideally we want to prevent this and have only one
record for such a query.

As the result of [1] I've identified two highlighted approaches to improve this
situation:

* Reduce the generated ArrayExpr to an array Const immediately, in cases where
  all the inputs are Consts.

* Make repeating Const to contribute nothing to the resulting hash.

I've tried to prototype both approaches to find out pros/cons and be more
specific. Attached patches could not be considered a completed piece of work,
but they seem to work, mostly pass the tests and demonstrate the point. I would
like to get some high level input about them and ideally make it clear what is
the preferred solution to continue with.

# Reducing ArrayExpr to an array Const

IIUC this requires producing a Const with ArrayType constvalue in
transformAExprIn for ScalarArrayOpExpr. This could be a general improvement,
since apparently it's being done later anyway. But it deals only with Const,
which leaves more on the table, e.g. Params and other similar types of
duplication we observe when repeating constants are wrapped into VALUES.

Another point here is that it's quite possible this approach will still require
corresponding changes in pg_stat_statements, since just preventing duplicates
to show also loses the information. Ideally we also need to have some
understanding how many elements are actually there and display it, e.g. in
cases when there is just one outlier query that contains a huge IN list.

# Contribute nothing to the hash

I guess there could be multiple ways of doing this, but the first idea I had in
mind is to skip jumbling when necessary. At the same time it can be implemented
more centralized for different types of queries (although in the attached patch
there are only Const & Values). In the simplest case we just identify sequence
of constants of the same type, which just ignores any other cases when stuff is
mixed. But I believe it's something that could be considered a rare corner case
and it's better to start with the simplest solution.

Having said that I believe the second approach of contributing nothing to the
hash sounds more appealing, but would love to hear other opinions.

[1]: https://www.postgresql.org/message-id/flat/CAF42k%3DJCfHMJtkAVXCzBn2XBxDC83xb4VhV7jU7enPnZ0CfEQQ%40mail.gmail.com

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
>
> I would like to start another thread to follow up on [1], mostly to bump up the
> topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> queries like:
>
>     SELECT something FROM table WHERE col IN (1, 2, 3, ...)
>
> The current implementation produces different jumble hash for every different
> number of arguments for essentially the same query. Unfortunately a lot of ORMs
> like to generate these types of queries, which in turn leads to
> pg_stat_statements pollution. Ideally we want to prevent this and have only one
> record for such a query.
>
> As the result of [1] I've identified two highlighted approaches to improve this
> situation:
>
> * Reduce the generated ArrayExpr to an array Const immediately, in cases where
>   all the inputs are Consts.
>
> * Make repeating Const to contribute nothing to the resulting hash.
>
> I've tried to prototype both approaches to find out pros/cons and be more
> specific. Attached patches could not be considered a completed piece of work,
> but they seem to work, mostly pass the tests and demonstrate the point. I would
> like to get some high level input about them and ideally make it clear what is
> the preferred solution to continue with.

I've implemented the second approach mentioned above, this version was
tested on our test clusters for some time without visible issues. Will
create a CF item and would appreciate any feedback.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Chengxi Sun
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hi, I did some test and it works well

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Dec 09, 2020 at 03:37:40AM +0000, Chengxi Sun wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> Hi, I did some test and it works well

Thanks for testing!



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote:
> > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
> >
> > I would like to start another thread to follow up on [1], mostly to bump up the
> > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> > queries like:
> >
> >     SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> >
> > The current implementation produces different jumble hash for every different
> > number of arguments for essentially the same query. Unfortunately a lot of ORMs
> > like to generate these types of queries, which in turn leads to
> > pg_stat_statements pollution. Ideally we want to prevent this and have only one
> > record for such a query.
> >
> > As the result of [1] I've identified two highlighted approaches to improve this
> > situation:
> >
> > * Reduce the generated ArrayExpr to an array Const immediately, in cases where
> >   all the inputs are Consts.
> >
> > * Make repeating Const to contribute nothing to the resulting hash.
> >
> > I've tried to prototype both approaches to find out pros/cons and be more
> > specific. Attached patches could not be considered a completed piece of work,
> > but they seem to work, mostly pass the tests and demonstrate the point. I would
> > like to get some high level input about them and ideally make it clear what is
> > the preferred solution to continue with.
>
> I've implemented the second approach mentioned above, this version was
> tested on our test clusters for some time without visible issues. Will
> create a CF item and would appreciate any feedback.

After more testing I found couple of things that could be improved,
namely in the presence of non-reducible constants one part of the query
was not copied into the normalized version, and this approach also could
be extended for Params. These are incorporated in the attached patch.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Zhihong Yu
Date:
Hi,
A few comments.

+                           "After this number of duplicating constants start to merge them.",

duplicating -> duplicate

+               foreach(lc, (List *) expr)
+               {
+                   Node * subExpr = (Node *) lfirst(lc);
+
+                   if (!IsA(subExpr, Const))
+                   {
+                       allConst = false;
+                       break;
+                   }
+               }

It seems the above foreach loop (within foreach(temp, (List *) node)) can be preceded with a check that allConst is true. Otherwise the loop can be skipped.

+                   if (currentExprIdx == pgss_merge_threshold - 1)
+                   {
+                       JumbleExpr(jstate, expr);
+
+                       /*
+                        * A const expr is already found, so JumbleExpr must
+                        * record it. Mark it as merged, it will be the first
+                        * merged but still present in the statement query.
+                        */
+                       Assert(jstate->clocations_count > 0);
+                       jstate->clocations[jstate->clocations_count - 1].merged = true;
+                       currentExprIdx++;
+                   }

The above snippet occurs a few times. Maybe extract into a helper method.

Cheers

On Sat, Dec 26, 2020 at 2:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote:
> > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
> >
> > I would like to start another thread to follow up on [1], mostly to bump up the
> > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> > queries like:
> >
> >     SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> >
> > The current implementation produces different jumble hash for every different
> > number of arguments for essentially the same query. Unfortunately a lot of ORMs
> > like to generate these types of queries, which in turn leads to
> > pg_stat_statements pollution. Ideally we want to prevent this and have only one
> > record for such a query.
> >
> > As the result of [1] I've identified two highlighted approaches to improve this
> > situation:
> >
> > * Reduce the generated ArrayExpr to an array Const immediately, in cases where
> >   all the inputs are Consts.
> >
> > * Make repeating Const to contribute nothing to the resulting hash.
> >
> > I've tried to prototype both approaches to find out pros/cons and be more
> > specific. Attached patches could not be considered a completed piece of work,
> > but they seem to work, mostly pass the tests and demonstrate the point. I would
> > like to get some high level input about them and ideally make it clear what is
> > the preferred solution to continue with.
>
> I've implemented the second approach mentioned above, this version was
> tested on our test clusters for some time without visible issues. Will
> create a CF item and would appreciate any feedback.

After more testing I found couple of things that could be improved,
namely in the presence of non-reducible constants one part of the query
was not copied into the normalized version, and this approach also could
be extended for Params. These are incorporated in the attached patch.

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote:
> Hi,
> A few comments.
>
> +               foreach(lc, (List *) expr)
> +               {
> +                   Node * subExpr = (Node *) lfirst(lc);
> +
> +                   if (!IsA(subExpr, Const))
> +                   {
> +                       allConst = false;
> +                       break;
> +                   }
> +               }
>
> It seems the above foreach loop (within foreach(temp, (List *) node)) can
> be preceded with a check that allConst is true. Otherwise the loop can be
> skipped.

Thanks for noticing. Now that I look at it closer I think it's the other
way around, the loop above checking constants for the first expression
is not really necessary.

> +                   if (currentExprIdx == pgss_merge_threshold - 1)
> +                   {
> +                       JumbleExpr(jstate, expr);
> +
> +                       /*
> +                        * A const expr is already found, so JumbleExpr must
> +                        * record it. Mark it as merged, it will be the
> first
> +                        * merged but still present in the statement query.
> +                        */
> +                       Assert(jstate->clocations_count > 0);
> +                       jstate->clocations[jstate->clocations_count -
> 1].merged = true;
> +                       currentExprIdx++;
> +                   }
>
> The above snippet occurs a few times. Maybe extract into a helper method.

Originally I was hesitant to extract it was because it's quite small
part of the code. But now I've realized that the part relevant to lists
is not really correct, which makes those bits even more different, so I
think it makes sense to leave it like that. What do you think?

Attachment

Re: pg_stat_statements and "IN" conditions

From
Zhihong Yu
Date:
Hi, Dmitry:

+   int         lastExprLenght = 0;

Did you mean to name the variable lastExprLenghth ?

w.r.t. extracting to helper method, the second and third if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
It is up to you whether to create the helper method.
I am fine with the current formation.

Cheers

On Tue, Jan 5, 2021 at 4:51 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote:
> Hi,
> A few comments.
>
> +               foreach(lc, (List *) expr)
> +               {
> +                   Node * subExpr = (Node *) lfirst(lc);
> +
> +                   if (!IsA(subExpr, Const))
> +                   {
> +                       allConst = false;
> +                       break;
> +                   }
> +               }
>
> It seems the above foreach loop (within foreach(temp, (List *) node)) can
> be preceded with a check that allConst is true. Otherwise the loop can be
> skipped.

Thanks for noticing. Now that I look at it closer I think it's the other
way around, the loop above checking constants for the first expression
is not really necessary.

> +                   if (currentExprIdx == pgss_merge_threshold - 1)
> +                   {
> +                       JumbleExpr(jstate, expr);
> +
> +                       /*
> +                        * A const expr is already found, so JumbleExpr must
> +                        * record it. Mark it as merged, it will be the
> first
> +                        * merged but still present in the statement query.
> +                        */
> +                       Assert(jstate->clocations_count > 0);
> +                       jstate->clocations[jstate->clocations_count -
> 1].merged = true;
> +                       currentExprIdx++;
> +                   }
>
> The above snippet occurs a few times. Maybe extract into a helper method.

Originally I was hesitant to extract it was because it's quite small
part of the code. But now I've realized that the part relevant to lists
is not really correct, which makes those bits even more different, so I
think it makes sense to leave it like that. What do you think?

Re: pg_stat_statements and "IN" conditions

From
David Steele
Date:
On 1/5/21 10:51 AM, Zhihong Yu wrote:
> 
> +   int         lastExprLenght = 0;
> 
> Did you mean to name the variable lastExprLenghth ?
> 
> w.r.t. extracting to helper method, the second and third 
> if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> It is up to you whether to create the helper method.
> I am fine with the current formation.

Dmitry, thoughts on this review?

Regards,
-- 
-David
david@pgmasters.net



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> On 1/5/21 10:51 AM, Zhihong Yu wrote:
> >
> > +   int         lastExprLenght = 0;
> >
> > Did you mean to name the variable lastExprLenghth ?
> >
> > w.r.t. extracting to helper method, the second and third
> > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > It is up to you whether to create the helper method.
> > I am fine with the current formation.
>
> Dmitry, thoughts on this review?

Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
the helper is not strictly necessary I wanted to wait a bit hoping for
more feedback and eventually to post an accumulated patch. Doesn't make
sense to post another version only to fix one typo :)



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote:
> > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> > On 1/5/21 10:51 AM, Zhihong Yu wrote:
> > >
> > > +   int         lastExprLenght = 0;
> > >
> > > Did you mean to name the variable lastExprLenghth ?
> > >
> > > w.r.t. extracting to helper method, the second and third
> > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > > It is up to you whether to create the helper method.
> > > I am fine with the current formation.
> >
> > Dmitry, thoughts on this review?
>
> Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
> the helper is not strictly necessary I wanted to wait a bit hoping for
> more feedback and eventually to post an accumulated patch. Doesn't make
> sense to post another version only to fix one typo :)

Hi,

I've prepared a new rebased version to deal with the new way of
computing query id, but as always there is one tricky part. From what I
understand, now an external module can provide custom implementation for
query id computation algorithm. It seems natural to think this machinery
could be used instead of patch in the thread, i.e. one could create a
custom logic that will enable constants collapsing as needed, so that
same queries with different number of constants in an array will be
hashed into the same record.

But there is a limitation in how such queries will be normalized
afterwards — to reduce level of surprise it's necessary to display the
fact that a certain query in fact had more constants that are showed in
pgss record. Ideally LocationLen needs to carry some bits of information
on what exactly could be skipped, and generate_normalized_query needs to
understand that, both are not reachable for an external module with
custom query id logic (without replicating significant part of the
existing code). Hence, a new version of the patch.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Jun 15, 2021 at 05:18:50PM +0200, Dmitry Dolgov wrote:
> > On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote:
> > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> > > On 1/5/21 10:51 AM, Zhihong Yu wrote:
> > > >
> > > > +   int         lastExprLenght = 0;
> > > >
> > > > Did you mean to name the variable lastExprLenghth ?
> > > >
> > > > w.r.t. extracting to helper method, the second and third
> > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > > > It is up to you whether to create the helper method.
> > > > I am fine with the current formation.
> > >
> > > Dmitry, thoughts on this review?
> >
> > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
> > the helper is not strictly necessary I wanted to wait a bit hoping for
> > more feedback and eventually to post an accumulated patch. Doesn't make
> > sense to post another version only to fix one typo :)
>
> Hi,
>
> I've prepared a new rebased version to deal with the new way of
> computing query id, but as always there is one tricky part. From what I
> understand, now an external module can provide custom implementation for
> query id computation algorithm. It seems natural to think this machinery
> could be used instead of patch in the thread, i.e. one could create a
> custom logic that will enable constants collapsing as needed, so that
> same queries with different number of constants in an array will be
> hashed into the same record.
>
> But there is a limitation in how such queries will be normalized
> afterwards — to reduce level of surprise it's necessary to display the
> fact that a certain query in fact had more constants that are showed in
> pgss record. Ideally LocationLen needs to carry some bits of information
> on what exactly could be skipped, and generate_normalized_query needs to
> understand that, both are not reachable for an external module with
> custom query id logic (without replicating significant part of the
> existing code). Hence, a new version of the patch.

Forgot to mention a couple of people who already reviewed the patch.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
>
> > I've prepared a new rebased version to deal with the new way of
> > computing query id, but as always there is one tricky part. From what I
> > understand, now an external module can provide custom implementation for
> > query id computation algorithm. It seems natural to think this machinery
> > could be used instead of patch in the thread, i.e. one could create a
> > custom logic that will enable constants collapsing as needed, so that
> > same queries with different number of constants in an array will be
> > hashed into the same record.
> >
> > But there is a limitation in how such queries will be normalized
> > afterwards — to reduce level of surprise it's necessary to display the
> > fact that a certain query in fact had more constants that are showed in
> > pgss record. Ideally LocationLen needs to carry some bits of information
> > on what exactly could be skipped, and generate_normalized_query needs to
> > understand that, both are not reachable for an external module with
> > custom query id logic (without replicating significant part of the
> > existing code). Hence, a new version of the patch.
>
> Forgot to mention a couple of people who already reviewed the patch.

And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).

Attachment

Re: pg_stat_statements and "IN" conditions

From
Zhihong Yu
Date:


On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
>
> > I've prepared a new rebased version to deal with the new way of
> > computing query id, but as always there is one tricky part. From what I
> > understand, now an external module can provide custom implementation for
> > query id computation algorithm. It seems natural to think this machinery
> > could be used instead of patch in the thread, i.e. one could create a
> > custom logic that will enable constants collapsing as needed, so that
> > same queries with different number of constants in an array will be
> > hashed into the same record.
> >
> > But there is a limitation in how such queries will be normalized
> > afterwards — to reduce level of surprise it's necessary to display the
> > fact that a certain query in fact had more constants that are showed in
> > pgss record. Ideally LocationLen needs to carry some bits of information
> > on what exactly could be skipped, and generate_normalized_query needs to
> > understand that, both are not reachable for an external module with
> > custom query id logic (without replicating significant part of the
> > existing code). Hence, a new version of the patch.
>
> Forgot to mention a couple of people who already reviewed the patch.

And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).
Hi,

bq. and at position further that specified threshold.

 that specified threshold -> than specified threshold

Cheers

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Sep 30, 2021 at 08:03:16AM -0700, Zhihong Yu wrote:
> On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > >On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
> > >
> > > > I've prepared a new rebased version to deal with the new way of
> > > > computing query id, but as always there is one tricky part. From what I
> > > > understand, now an external module can provide custom implementation
> > for
> > > > query id computation algorithm. It seems natural to think this
> > machinery
> > > > could be used instead of patch in the thread, i.e. one could create a
> > > > custom logic that will enable constants collapsing as needed, so that
> > > > same queries with different number of constants in an array will be
> > > > hashed into the same record.
> > > >
> > > > But there is a limitation in how such queries will be normalized
> > > > afterwards — to reduce level of surprise it's necessary to display the
> > > > fact that a certain query in fact had more constants that are showed in
> > > > pgss record. Ideally LocationLen needs to carry some bits of
> > information
> > > > on what exactly could be skipped, and generate_normalized_query needs
> > to
> > > > understand that, both are not reachable for an external module with
> > > > custom query id logic (without replicating significant part of the
> > > > existing code). Hence, a new version of the patch.
> > >
> > > Forgot to mention a couple of people who already reviewed the patch.
> >
> > And now for something completely different, here is a new patch version.
> > It contains a small fix for one problem we've found during testing (one
> > path code was incorrectly assuming find_const_walker results).
> >
> Hi,
>
> bq. and at position further that specified threshold.
>
>  that specified threshold -> than specified threshold

You mean in the patch commit message, nowhere else, right? Yep, my spell
checker didn't catch that, thanks for noticing!



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> And now for something completely different, here is a new patch version.
> It contains a small fix for one problem we've found during testing (one
> path code was incorrectly assuming find_const_walker results).

I've been saying from day one that pushing the query-hashing code into the
core was a bad idea, and I think this patch perfectly illustrates why.
We can debate whether the rules proposed here are good for
pg_stat_statements or not, but it seems inevitable that they will be a
disaster for some other consumers of the query hash.  In particular,
dropping external parameters from the hash seems certain to break
something for somebody --- do you really think that a query with two int
parameters is equivalent to one with five float parameters for all
query-identifying purposes?

I can see the merits of allowing different numbers of IN elements
to be considered equivalent for pg_stat_statements, but this patch
seems to go far beyond that basic idea, and I fear the side-effects
will be very bad.

Also, calling eval_const_expressions in the query jumbler is flat
out unacceptable.  There is way too much code that could be reached
that way (more or less the entire executor, to start with).  I
don't have a lot of faith that it'd never modify the input tree,
either.

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
"Andrey V. Lepikhov"
Date:
On 1/5/22 4:02 AM, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> And now for something completely different, here is a new patch version.
>> It contains a small fix for one problem we've found during testing (one
>> path code was incorrectly assuming find_const_walker results).
> 
> I've been saying from day one that pushing the query-hashing code into the
> core was a bad idea, and I think this patch perfectly illustrates why.
> We can debate whether the rules proposed here are good for
> pg_stat_statements or not, but it seems inevitable that they will be a
> disaster for some other consumers of the query hash.  In particular,
> dropping external parameters from the hash seems certain to break
> something for somebody
+1.

In a couple of extensions I use different logic of query jumbling - hash 
value is more stable in some cases than in default implementation. For 
example, it should be stable to permutations in 'FROM' section of a query.
And If anyone subtly changes jumbling logic when the extension is 
active, the instance could get huge performance issues.

Let me suggest, that the core should allow an extension at least to 
detect such interference between extensions. Maybe hook could be 
replaced with callback to allow extension see an queryid with underlying 
generation logic what it expects.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
"Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> writes:
> On 1/5/22 4:02 AM, Tom Lane wrote:
>> I've been saying from day one that pushing the query-hashing code into the
>> core was a bad idea, and I think this patch perfectly illustrates why.

> +1.

> Let me suggest, that the core should allow an extension at least to 
> detect such interference between extensions. Maybe hook could be 
> replaced with callback to allow extension see an queryid with underlying 
> generation logic what it expects.

I feel like we need to get away from the idea that there is just
one query hash, and somehow let different extensions attach
differently-calculated hashes to a query.  I don't have any immediate
ideas about how to do that in a reasonably inexpensive way.

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote:
> We can debate whether the rules proposed here are good for
> pg_stat_statements or not, but it seems inevitable that they will be a
> disaster for some other consumers of the query hash.

Hm, which consumers do you mean here, potential extension? Isn't the
ability to use an external module to compute queryid make this situation
possible anyway?

> do you really think that a query with two int
> parameters is equivalent to one with five float parameters for all
> query-identifying purposes?

Nope, and it will be hard to figure this out no matter which approach
we're talking about, because it mostly depends on the context and type
of queries I guess. Instead, such functionality should allow some
reasonable configuration. To be clear, the use case I have in mind here
is not four or five, but rather a couple of hundreds constants where
chances that the whole construction was generated automatically by ORM
is higher than normal.

> I can see the merits of allowing different numbers of IN elements
> to be considered equivalent for pg_stat_statements, but this patch
> seems to go far beyond that basic idea, and I fear the side-effects
> will be very bad.

Not sure why it goes far beyond, but then there were two approaches
under consideration, as I've stated in the first message. I already
don't remember all the details, but another one was evolving around
doing similar things in a more limited fashion in transformAExprIn. The
problem would be then to carry the information, necessary to represent
the act of "merging" some number of queryids together. Any thoughts
here?

The idea of keeping the original queryid untouched and add another type
of id instead sounds interesting, but it will add too much overhead for
a quite small use case I guess.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Jan 05, 2022 at 10:11:11PM +0100, Dmitry Dolgov wrote:
> > On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote:
> > We can debate whether the rules proposed here are good for
> > pg_stat_statements or not, but it seems inevitable that they will be a
> > disaster for some other consumers of the query hash.
>
> Hm, which consumers do you mean here, potential extension? Isn't the
> ability to use an external module to compute queryid make this situation
> possible anyway?
>
> > do you really think that a query with two int
> > parameters is equivalent to one with five float parameters for all
> > query-identifying purposes?
>
> Nope, and it will be hard to figure this out no matter which approach
> we're talking about, because it mostly depends on the context and type
> of queries I guess. Instead, such functionality should allow some
> reasonable configuration. To be clear, the use case I have in mind here
> is not four or five, but rather a couple of hundreds constants where
> chances that the whole construction was generated automatically by ORM
> is higher than normal.
>
> > I can see the merits of allowing different numbers of IN elements
> > to be considered equivalent for pg_stat_statements, but this patch
> > seems to go far beyond that basic idea, and I fear the side-effects
> > will be very bad.
>
> Not sure why it goes far beyond, but then there were two approaches
> under consideration, as I've stated in the first message. I already
> don't remember all the details, but another one was evolving around
> doing similar things in a more limited fashion in transformAExprIn. The
> problem would be then to carry the information, necessary to represent
> the act of "merging" some number of queryids together. Any thoughts
> here?
>
> The idea of keeping the original queryid untouched and add another type
> of id instead sounds interesting, but it will add too much overhead for
> a quite small use case I guess.

```
Thu, 10 Mar 2022
New status: Waiting on Author
```

This seems incorrect, as the only feedback I've got was "this is a bad
idea", and no reaction on follow-up questions.



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> New status: Waiting on Author

> This seems incorrect, as the only feedback I've got was "this is a bad
> idea", and no reaction on follow-up questions.

I changed the status because it seems to me there is no chance of
this being committed as-is.

1. I think an absolute prerequisite before we could even consider
changing the query jumbler rules this much is to do the work that was
put off when the jumbler was moved into core: that is, provide some
honest support for multiple query-ID generation methods being used at
the same time.  Even if you successfully make a case for
pg_stat_statements to act this way, other consumers of query IDs
aren't going to be happy with it.

2. You haven't made a case for it.  The original complaint was
about different lengths of IN lists not being treated as equivalent,
but this patch has decided to do I'm-not-even-sure-quite-what
about treating different Params as equivalent.  Plus you're trying
to invoke eval_const_expressions in the jumbler; that is absolutely
Not OK, for both safety and semantic reasons.

If you backed off to just treating ArrayExprs containing different
numbers of Consts as equivalent, maybe that'd be something we could
adopt without fixing point 1.  I don't think anything that fuzzes the
treatment of Params can get away with that, though.

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Robert Haas
Date:
On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > This seems incorrect, as the only feedback I've got was "this is a bad
> > idea", and no reaction on follow-up questions.
>
> I changed the status because it seems to me there is no chance of
> this being committed as-is.
>
> 1. I think an absolute prerequisite before we could even consider
> changing the query jumbler rules this much is to do the work that was
> put off when the jumbler was moved into core: that is, provide some
> honest support for multiple query-ID generation methods being used at
> the same time.  Even if you successfully make a case for
> pg_stat_statements to act this way, other consumers of query IDs
> aren't going to be happy with it.

FWIW, I don't find this convincing at all. Query jumbling is already
somewhat expensive, and it seems unlikely that the same person is
going to want to jumble queries in one way for pg_stat_statements and
another way for pg_stat_broccoli or whatever their other extension is.
Putting a lot of engineering work into something with such a marginal
use case seems not worthwhile to me - and also likely futile, because
I don't see how it could realistically be made nearly as cheap as a
single jumble.

> 2. You haven't made a case for it.  The original complaint was
> about different lengths of IN lists not being treated as equivalent,
> but this patch has decided to do I'm-not-even-sure-quite-what
> about treating different Params as equivalent.  Plus you're trying
> to invoke eval_const_expressions in the jumbler; that is absolutely
> Not OK, for both safety and semantic reasons.

I think there are two separate points here, one about patch quality
and the other about whether the basic idea is good. I think the basic
idea is good. I do not contend that collapsing IN-lists of arbitrary
length is what everyone wants in all cases, but it seems entirely
reasonable to me to think that it is what some people want. So I would
say just make it a parameter and let people configure whichever
behavior they want. My bet is 95% of users would prefer to have it on,
but even if that's wildly wrong, having it as an optional behavior
hurts nobody. Let it be off by default and let those who want it flip
the toggle. On the code quality issue, I haven't read the patch but
your concerns sound well-founded to me from reading what you wrote.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Mar 10, 2022 at 12:32:08PM -0500, Robert Haas wrote:
> On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > 2. You haven't made a case for it.  The original complaint was
> > about different lengths of IN lists not being treated as equivalent,
> > but this patch has decided to do I'm-not-even-sure-quite-what
> > about treating different Params as equivalent.  Plus you're trying
> > to invoke eval_const_expressions in the jumbler; that is absolutely
> > Not OK, for both safety and semantic reasons.
>
> I think there are two separate points here, one about patch quality
> and the other about whether the basic idea is good. I think the basic
> idea is good. I do not contend that collapsing IN-lists of arbitrary
> length is what everyone wants in all cases, but it seems entirely
> reasonable to me to think that it is what some people want. So I would
> say just make it a parameter and let people configure whichever
> behavior they want. My bet is 95% of users would prefer to have it on,
> but even if that's wildly wrong, having it as an optional behavior
> hurts nobody. Let it be off by default and let those who want it flip
> the toggle. On the code quality issue, I haven't read the patch but
> your concerns sound well-founded to me from reading what you wrote.

I have the same understanding, there is a toggle in the patch exactly
for this purpose.

To give a bit more context, the whole development was ORM-driven rather
than pulled out of thin air -- people were complaining about huge
generated queries that could be barely displayed in monitoring, I was
trying to address it via collapsing the list where it was happening. In
other words "I'm-not-even-sure-quite-what" part may be indeed too
extensive, but was triggered by real world issues.

Of course, I could get the implementation not quite right, e.g. I wasn't
aware about dangers of using eval_const_expressions. But that's what the
CF item and the corresponding discussion is for, I guess. Let me see
what I could do to improve it.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Mar 10, 2022 at 12:11:59PM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > New status: Waiting on Author
>
> > This seems incorrect, as the only feedback I've got was "this is a bad
> > idea", and no reaction on follow-up questions.
>
> I changed the status because it seems to me there is no chance of
> this being committed as-is.
>
> 1. I think an absolute prerequisite before we could even consider
> changing the query jumbler rules this much is to do the work that was
> put off when the jumbler was moved into core: that is, provide some
> honest support for multiple query-ID generation methods being used at
> the same time.  Even if you successfully make a case for
> pg_stat_statements to act this way, other consumers of query IDs
> aren't going to be happy with it.
>
> 2. You haven't made a case for it.  The original complaint was
> about different lengths of IN lists not being treated as equivalent,
> but this patch has decided to do I'm-not-even-sure-quite-what
> about treating different Params as equivalent.  Plus you're trying
> to invoke eval_const_expressions in the jumbler; that is absolutely
> Not OK, for both safety and semantic reasons.
>
> If you backed off to just treating ArrayExprs containing different
> numbers of Consts as equivalent, maybe that'd be something we could
> adopt without fixing point 1.  I don't think anything that fuzzes the
> treatment of Params can get away with that, though.

Here is the limited version of list collapsing functionality, which
doesn't utilize eval_const_expressions and ignores most of the stuff
except ArrayExprs. Any thoughts/more suggestions?

Attachment

Re: pg_stat_statements and "IN" conditions

From
Robert Haas
Date:
On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Here is the limited version of list collapsing functionality, which
> doesn't utilize eval_const_expressions and ignores most of the stuff
> except ArrayExprs. Any thoughts/more suggestions?

The proposed commit message says this commit intends to "Make Consts
contribute nothing to the jumble hash if they're part of a series and
at position further that specified threshold." I'm not sure whether
that's what the patch actually implements because I can't immediately
understand the new logic you've added, but I think if we did what that
sentence said then, supposing the threshold is set to 1, it would
result in producing the same hash for "x in (1,2)" that we do for "x
in (1,3)" but a different hash for "x in (2,3)" which does not sound
like what we want. What I would have thought we'd do is: if the list
is all constants and long enough to satisfy the threshold then nothing
in the list gets jumbled.

I'm a little surprised that there's not more context-awareness in this
code. It seems that it applies to every ArrayExpr found in the query,
which I think would extend to cases beyond something = IN(whatever).
In particular, any use of ARRAY[] in the query would be impacted. Now,
the comments seem to imply that's pretty intentional, but from the
user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two
different things. If anything like this is to be adopted, we certainly
need to be precise about exactly what it is doing and which cases are
covered. I thought of looking at the documentation to see whether
you'd tried to clarify this there, and found that you hadn't written
any.

In short, I think this patch is not really very close to being in
committable shape even if nobody were objecting to the concept.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Mar 14, 2022 at 10:17:57AM -0400, Robert Haas wrote:
> On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Here is the limited version of list collapsing functionality, which
> > doesn't utilize eval_const_expressions and ignores most of the stuff
> > except ArrayExprs. Any thoughts/more suggestions?
>
> The proposed commit message says this commit intends to "Make Consts
> contribute nothing to the jumble hash if they're part of a series and
> at position further that specified threshold." I'm not sure whether
> that's what the patch actually implements because I can't immediately
> understand the new logic you've added, but I think if we did what that
> sentence said then, supposing the threshold is set to 1, it would
> result in producing the same hash for "x in (1,2)" that we do for "x
> in (1,3)" but a different hash for "x in (2,3)" which does not sound
> like what we want. What I would have thought we'd do is: if the list
> is all constants and long enough to satisfy the threshold then nothing
> in the list gets jumbled.

Well, yeah, the commit message is somewhat clumsy in this regard. It
works almost in the way you've described, except if the list is all
constants and long enough to satisfy the threshold then *first N
elements (where N == threshold) will be jumbled -- to leave at least
some traces of it in pgss.

> I'm a little surprised that there's not more context-awareness in this
> code. It seems that it applies to every ArrayExpr found in the query,
> which I think would extend to cases beyond something = IN(whatever).
> In particular, any use of ARRAY[] in the query would be impacted. Now,
> the comments seem to imply that's pretty intentional, but from the
> user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two
> different things. If anything like this is to be adopted, we certainly
> need to be precise about exactly what it is doing and which cases are
> covered.

I'm not sure if I follow the last point. WHERE x in (1,3) and x =
any(array[1,3]) are two different things for sure, but in which way are
they going to be mixed together because of this change? My goal was to
make only the following transformation, without leaving any uncertainty:

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

> I thought of looking at the documentation to see whether you'd tried
> to clarify this there, and found that you hadn't written any.
>
> In short, I think this patch is not really very close to being in
> committable shape even if nobody were objecting to the concept.

Sure, I'll add documentation. To be honest I'm not targeting PG15 with
this, just want to make some progress. Thanks for the feedback, I'm glad
to see it coming!



Re: pg_stat_statements and "IN" conditions

From
Robert Haas
Date:
On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Well, yeah, the commit message is somewhat clumsy in this regard. It
> works almost in the way you've described, except if the list is all
> constants and long enough to satisfy the threshold then *first N
> elements (where N == threshold) will be jumbled -- to leave at least
> some traces of it in pgss.

But that seems to me to be a thing we would not want. Why do you think
otherwise?

> I'm not sure if I follow the last point. WHERE x in (1,3) and x =
> any(array[1,3]) are two different things for sure, but in which way are
> they going to be mixed together because of this change? My goal was to
> make only the following transformation, without leaving any uncertainty:
>
> WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
> WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

I understand. I think it might be OK to transform both of those
things, but I don't think it's very clear either from the comments or
the nonexistent documentation that both of those cases are affected --
and I think that needs to be clear. Not sure exactly how to do that,
just saying that we can't add behavior unless it will be clear to
users what the behavior is.

> Sure, I'll add documentation. To be honest I'm not targeting PG15 with
> this, just want to make some progress.

wfm!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Mar 14, 2022 at 11:02:16AM -0400, Robert Haas wrote:
> On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Well, yeah, the commit message is somewhat clumsy in this regard. It
> > works almost in the way you've described, except if the list is all
> > constants and long enough to satisfy the threshold then *first N
> > elements (where N == threshold) will be jumbled -- to leave at least
> > some traces of it in pgss.
>
> But that seems to me to be a thing we would not want. Why do you think
> otherwise?

Hm. Well, if the whole list would be not jumbled, the transformation
would look like this, right?

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (...)

Leaving some number of original elements in place gives some clue for
the reader about at least what type of data the array has contained.
Which hopefully makes it a bit easier to identify even in the collapsed
form:

WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)

> > I'm not sure if I follow the last point. WHERE x in (1,3) and x =
> > any(array[1,3]) are two different things for sure, but in which way are
> > they going to be mixed together because of this change? My goal was to
> > make only the following transformation, without leaving any uncertainty:
> >
> > WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
> > WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])
>
> I understand. I think it might be OK to transform both of those
> things, but I don't think it's very clear either from the comments or
> the nonexistent documentation that both of those cases are affected --
> and I think that needs to be clear. Not sure exactly how to do that,
> just saying that we can't add behavior unless it will be clear to
> users what the behavior is.

Yep, got it.



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I'm not sure if I follow the last point. WHERE x in (1,3) and x =
>> any(array[1,3]) are two different things for sure, but in which way are
>> they going to be mixed together because of this change? My goal was to
>> make only the following transformation, without leaving any uncertainty:
>>
>> WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...)
>> WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...])

> I understand. I think it might be OK to transform both of those
> things, but I don't think it's very clear either from the comments or
> the nonexistent documentation that both of those cases are affected --
> and I think that needs to be clear.

We've transformed IN(...) to ANY(ARRAY[...]) at the parser stage for a
long time, and this has been visible to users of either EXPLAIN or
pg_stat_statements for the same length of time.  I doubt people are
going to find that surprising.  Even if they do, it's not the query
jumbler's fault.

I do find it odd that the proposed patch doesn't cause the *entire*
list to be skipped over.  That seems like extra complexity and confusion
to no benefit.

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>
> I do find it odd that the proposed patch doesn't cause the *entire*
> list to be skipped over.  That seems like extra complexity and confusion
> to no benefit.

That's a bit surprising for me, I haven't even thought that folks could
think this is an odd behaviour. As I've mentioned above, the original
idea was to give some clues about what was inside the collapsed array,
but if everyone finds it unnecessary I can of course change it.



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
>> I do find it odd that the proposed patch doesn't cause the *entire*
>> list to be skipped over.  That seems like extra complexity and confusion
>> to no benefit.

> That's a bit surprising for me, I haven't even thought that folks could
> think this is an odd behaviour. As I've mentioned above, the original
> idea was to give some clues about what was inside the collapsed array,
> but if everyone finds it unnecessary I can of course change it.

But if what we're doing is skipping over an all-Consts list, then the
individual Consts would be elided from the pg_stat_statements entry
anyway, no?  All that would remain is information about how many such
Consts there were, which is exactly the information you want to drop.

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> >> I do find it odd that the proposed patch doesn't cause the *entire*
> >> list to be skipped over.  That seems like extra complexity and confusion
> >> to no benefit.
>
> > That's a bit surprising for me, I haven't even thought that folks could
> > think this is an odd behaviour. As I've mentioned above, the original
> > idea was to give some clues about what was inside the collapsed array,
> > but if everyone finds it unnecessary I can of course change it.
>
> But if what we're doing is skipping over an all-Consts list, then the
> individual Consts would be elided from the pg_stat_statements entry
> anyway, no?  All that would remain is information about how many such
> Consts there were, which is exactly the information you want to drop.

Hm, yes, you're right. I guess I was thinking about this more like about
shortening some text with ellipsis, but indeed no actual Consts will end
up in the result anyway. Thanks for clarification, will modify the
patch!



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote:
> > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> > Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> > >> I do find it odd that the proposed patch doesn't cause the *entire*
> > >> list to be skipped over.  That seems like extra complexity and confusion
> > >> to no benefit.
> >
> > > That's a bit surprising for me, I haven't even thought that folks could
> > > think this is an odd behaviour. As I've mentioned above, the original
> > > idea was to give some clues about what was inside the collapsed array,
> > > but if everyone finds it unnecessary I can of course change it.
> >
> > But if what we're doing is skipping over an all-Consts list, then the
> > individual Consts would be elided from the pg_stat_statements entry
> > anyway, no?  All that would remain is information about how many such
> > Consts there were, which is exactly the information you want to drop.
>
> Hm, yes, you're right. I guess I was thinking about this more like about
> shortening some text with ellipsis, but indeed no actual Consts will end
> up in the result anyway. Thanks for clarification, will modify the
> patch!

Here is another iteration. Now the patch doesn't leave any trailing
Consts in the normalized query, and contains more documentation. I hope
it's getting better.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Mar 26, 2022 at 06:40:35PM +0100, Dmitry Dolgov wrote:
> > On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote:
> > > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote:
> > > Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote:
> > > >> I do find it odd that the proposed patch doesn't cause the *entire*
> > > >> list to be skipped over.  That seems like extra complexity and confusion
> > > >> to no benefit.
> > >
> > > > That's a bit surprising for me, I haven't even thought that folks could
> > > > think this is an odd behaviour. As I've mentioned above, the original
> > > > idea was to give some clues about what was inside the collapsed array,
> > > > but if everyone finds it unnecessary I can of course change it.
> > >
> > > But if what we're doing is skipping over an all-Consts list, then the
> > > individual Consts would be elided from the pg_stat_statements entry
> > > anyway, no?  All that would remain is information about how many such
> > > Consts there were, which is exactly the information you want to drop.
> >
> > Hm, yes, you're right. I guess I was thinking about this more like about
> > shortening some text with ellipsis, but indeed no actual Consts will end
> > up in the result anyway. Thanks for clarification, will modify the
> > patch!
>
> Here is another iteration. Now the patch doesn't leave any trailing
> Consts in the normalized query, and contains more documentation. I hope
> it's getting better.

Hi,

Here is the rebased version, with no other changes.

Attachment

Re:pg_stat_statements and "IN" conditions

From
Sergei Kornilov
Date:
Hello!

Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)

I'm reviewing a patch on top of a previous commit and noticed a failed test:

#   Failed test 'no parameters missing from postgresql.conf.sample'
#   at t/003_check_guc.pl line 82.
#          got: '1'
#     expected: '0'
# Looks like you failed 1 test of 3.
t/003_check_guc.pl .............. 

The new option has not been added to the postgresql.conf.sample

PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just
becausesome ORM sends requests with a different number of parameters.
 

regards, Sergei



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> Hello!
>
> Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
>
> I'm reviewing a patch on top of a previous commit and noticed a failed test:
>
> #   Failed test 'no parameters missing from postgresql.conf.sample'
> #   at t/003_check_guc.pl line 82.
> #          got: '1'
> #     expected: '0'
> # Looks like you failed 1 test of 3.
> t/003_check_guc.pl ..............
>
> The new option has not been added to the postgresql.conf.sample
>
> PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just
becausesome ORM sends requests with a different number of parameters.
 

Thanks! I'll post the rebased version soon.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote:
> > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> > Hello!
> >
> > Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
> >
> > I'm reviewing a patch on top of a previous commit and noticed a failed test:
> >
> > #   Failed test 'no parameters missing from postgresql.conf.sample'
> > #   at t/003_check_guc.pl line 82.
> > #          got: '1'
> > #     expected: '0'
> > # Looks like you failed 1 test of 3.
> > t/003_check_guc.pl ..............
> >
> > The new option has not been added to the postgresql.conf.sample
> >
> > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries
justbecause some ORM sends requests with a different number of parameters.
 
>
> Thanks! I'll post the rebased version soon.

And here it is.

Attachment

Re: pg_stat_statements and "IN" conditions

From
vignesh C
Date:
On Sun, 25 Sept 2022 at 05:29, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote:
> > > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote:
> > > Hello!
> > >
> > > Unfortunately the patch needs another rebase due to the recent split of guc.c
(0a20ff54f5e66158930d5328f89f087d4e9ab400)
> > >
> > > I'm reviewing a patch on top of a previous commit and noticed a failed test:
> > >
> > > #   Failed test 'no parameters missing from postgresql.conf.sample'
> > > #   at t/003_check_guc.pl line 82.
> > > #          got: '1'
> > > #     expected: '0'
> > > # Looks like you failed 1 test of 3.
> > > t/003_check_guc.pl ..............
> > >
> > > The new option has not been added to the postgresql.conf.sample
> > >
> > > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries
justbecause some ORM sends requests with a different number of parameters.
 
> >
> > Thanks! I'll post the rebased version soon.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
456fa635a909ee36f73ca84d340521bd730f265f ===
=== applying patch
./v9-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
....
can't find file to patch at input line 746
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/backend/utils/misc/queryjumble.c
b/src/backend/utils/misc/queryjumble.c
|index a67487e5fe..063b4be725 100644
|--- a/src/backend/utils/misc/queryjumble.c
|+++ b/src/backend/utils/misc/queryjumble.c
--------------------------
No file to patch.  Skipping patch.
8 out of 8 hunks ignored
can't find file to patch at input line 913
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/include/utils/queryjumble.h b/src/include/utils/queryjumble.h
|index 3c2d9beab2..b50cc42d4e 100644
|--- a/src/include/utils/queryjumble.h
|+++ b/src/include/utils/queryjumble.h
--------------------------
No file to patch.  Skipping patch.

[1] - http://cfbot.cputube.org/patch_41_2837.log

Regards,
Vignesh



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Thanks. I think this one should do the trick.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Marcos Pegoraro
Date:
Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com> escreveu:
> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Thanks. I think this one should do the trick.

There is a typo on DOC part
+        and it's length is larger than <varname> const_merge_threshold </varname>,
+        then array elements will contribure nothing to the query identifier.
+        Thus the query will get the same identifier no matter how many constants

That "contribure" should be "contribute"

regards
Marcos 

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Jan 29, 2023 at 09:56:02AM -0300, Marcos Pegoraro wrote:
> Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com>
> escreveu:
>
> > > On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> > > The patch does not apply on top of HEAD as in [1], please post a rebased
> > patch:
> >
> > Thanks. I think this one should do the trick.
> >
>
> There is a typo on DOC part
> +        and it's length is larger than <varname> const_merge_threshold
> </varname>,
> +        then array elements will contribure nothing to the query
> identifier.
> +        Thus the query will get the same identifier no matter how many
> constants
>
> That "contribure" should be "contribute"

Indeed, thanks for noticing.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Alvaro Herrera
Date:
This appears to have massive conflicts.  Would you please rebase?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote:
> This appears to have massive conflicts.  Would you please rebase?

Sure, I was already mentally preparing myself to do so in the view of
recent changes in query jumbling. Will post soon.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 02, 2023 at 04:05:54PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote:
> > This appears to have massive conflicts.  Would you please rebase?
>
> Sure, I was already mentally preparing myself to do so in the view of
> recent changes in query jumbling. Will post soon.

Here is the rebased version. To adapt to the latest changes, I've marked
ArrayExpr with custom_query_jumble to implement this functionality, but
tried to make the actual merge logic relatively independent. Otherwise,
everything is the same.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Michael Paquier
Date:
On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote:
> Here is the rebased version. To adapt to the latest changes, I've marked
> ArrayExpr with custom_query_jumble to implement this functionality, but
> tried to make the actual merge logic relatively independent. Otherwise,
> everything is the same.

I was quickly looking at this patch, so these are rough impressions.

+   bool        merged;         /* whether or not the location was marked as
+                                  not contributing to jumble */

This part of the patch is a bit disturbing..  We have node attributes
to track if portions of a node should be ignored or have a location
marked, still this "merged" flag is used as an extension to track if a
location should be done or not.  Is that a concept that had better be
controlled via a new node attribute?

+--
+-- Consts merging
+--
+CREATE TABLE test_merge (id int, data int);
+-- IN queries
+-- No merging

Would it be better to split this set of tests into a new file?  FWIW,
I have a patch in baking process that refactors a bit the whole,
before being able to extend it so as we have more coverage for
normalized utility queries, as of now the query strings stored by
pg_stat_statements don't reflect that even if the jumbling computation
marks the location of the Const nodes included in utility statements
(partition bounds, queries of COPY, etc.).  I should be able to send
that tomorrow, and my guess that you could take advantage of that
even for this thread.
--
Michael

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Feb 05, 2023 at 10:30:25AM +0900, Michael Paquier wrote:
> On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote:
> > Here is the rebased version. To adapt to the latest changes, I've marked
> > ArrayExpr with custom_query_jumble to implement this functionality, but
> > tried to make the actual merge logic relatively independent. Otherwise,
> > everything is the same.
>
> I was quickly looking at this patch, so these are rough impressions.
>
> +   bool        merged;         /* whether or not the location was marked as
> +                                  not contributing to jumble */
>
> This part of the patch is a bit disturbing..  We have node attributes
> to track if portions of a node should be ignored or have a location
> marked, still this "merged" flag is used as an extension to track if a
> location should be done or not.  Is that a concept that had better be
> controlled via a new node attribute?

Good question. I need to think a bit more if it's possible to leverage
node attributes mechanism, but at the moment I'm still inclined to
extend LocationLen. The reason is that it doesn't exactly serve the
tracking purpose, i.e. whether to capture a location (I have to update
the code commentary), it helps differentiate cases when locations A and
D are obtained from merging A B C D instead of just being A and D.

I'm thinking about this in the following way: the core jumbling logic is
responsible for deriving locations based on the input expressions; in
the case of merging we produce less locations; pgss have to represent
the result only using locations and has to be able to differentiate
simple locations and locations after merging.

> +--
> +-- Consts merging
> +--
> +CREATE TABLE test_merge (id int, data int);
> +-- IN queries
> +-- No merging
>
> Would it be better to split this set of tests into a new file?  FWIW,
> I have a patch in baking process that refactors a bit the whole,
> before being able to extend it so as we have more coverage for
> normalized utility queries, as of now the query strings stored by
> pg_stat_statements don't reflect that even if the jumbling computation
> marks the location of the Const nodes included in utility statements
> (partition bounds, queries of COPY, etc.).  I should be able to send
> that tomorrow, and my guess that you could take advantage of that
> even for this thread.

Sure, I'll take a look how I can benefit from your work, thanks.



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> I'm thinking about this in the following way: the core jumbling logic is
> responsible for deriving locations based on the input expressions; in
> the case of merging we produce less locations; pgss have to represent
> the result only using locations and has to be able to differentiate
> simple locations and locations after merging.

Uh ... why?  ISTM you're just going to elide all inside the IN,
so why do you need more than a start and stop position?

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Feb 05, 2023 at 11:02:32AM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> > I'm thinking about this in the following way: the core jumbling logic is
> > responsible for deriving locations based on the input expressions; in
> > the case of merging we produce less locations; pgss have to represent
> > the result only using locations and has to be able to differentiate
> > simple locations and locations after merging.
>
> Uh ... why?  ISTM you're just going to elide all inside the IN,
> so why do you need more than a start and stop position?

Exactly, start and stop positions. But if there would be no information
that merging was applied, the following queries will look the same after
jumbling, right?

    -- input query
    SELECT * FROM test_merge WHERE id IN (1, 2);
    -- jumbling result, two LocationLen, for values 1 and 2
    SELECT * FROM test_merge WHERE id IN ($1, $2);

    -- input query
    SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    -- jumbling result, two LocationLen after merging, for values 1 and 10
    SELECT * FROM test_merge WHERE id IN (...);
    -- without remembering about merging the result would be
    SELECT * FROM test_merge WHERE id IN ($1, $2);



Re:pg_stat_statements and "IN" conditions

From
Sergei Kornilov
Date:
Hello!

Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 

It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 

select pg_stat_statements_reset();
set const_merge_threshold to 1;
select * from test where i in (1,2,3);
select * from test where i in (1,2);
select * from test where i in (1);
select query, calls from pg_stat_statements order by query;

                query                | calls 
-------------------------------------+-------
 select * from test where i in (...) |     2
 select * from test where i in ($1)  |     1

Probably const_merge_threshold = 1 should produce only "i in (...)"?

const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not
equals)?I think the documentation is ambiguous in this regard.
 

I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number

regards, Sergei



Re: pg_stat_statements and "IN" conditions

From
Peter Eisentraut
Date:
On 07.02.23 21:14, Sergei Kornilov wrote:
> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 

What is the point of making this a numeric setting?  Either you want to 
merge all values or you don't want to merge any values.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Feb 07, 2023 at 11:14:52PM +0300, Sergei Kornilov wrote:
> Hello!

Thanks for reviewing.

> Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 

Yep, my favourite game, rebaseball. Will post a new version soon, after
figuring out all the recent questions.

> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 
>
> select pg_stat_statements_reset();
> set const_merge_threshold to 1;
> select * from test where i in (1,2,3);
> select * from test where i in (1,2);
> select * from test where i in (1);
> select query, calls from pg_stat_statements order by query;
>
>                 query                | calls
> -------------------------------------+-------
>  select * from test where i in (...) |     2
>  select * from test where i in ($1)  |     1
>
> Probably const_merge_threshold = 1 should produce only "i in (...)"?

Well, it's not intentional, probably I need to be more careful with
off-by-one. Although I agree to a certain extent with Peter questioning
the value of having numerical option here, let me think about this.

> const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not
equals)?I think the documentation is ambiguous in this regard.
 
>
> I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number

Yep, I'll rephrase the documentation.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
> On 07.02.23 21:14, Sergei Kornilov wrote:
> > It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 
>
> What is the point of making this a numeric setting?  Either you want to
> merge all values or you don't want to merge any values.

At least in theory the definition of "too many constants" is different
for different use cases and I see allowing to configure it as a way of
reducing the level of surprise here. The main scenario for a numerical
setting would be to distinguish between normal usage with just a handful
of constants (and the user expecting to see them represented in pgss)
and some sort of outliers with thousands of constants in a query (e.g.
as a defence mechanism for the infrastructure working with those
metrics). But I agree that it's not clear how much value is in that.

Not having strong opinion about this I would be fine changing it to a
boolean option (with an actual limit hidden internally) if everyone
agrees it fits better.



Re: pg_stat_statements and "IN" conditions

From
Alvaro Herrera
Date:
On 2023-Feb-09, Dmitry Dolgov wrote:

> > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:

> > What is the point of making this a numeric setting?  Either you want
> > to merge all values or you don't want to merge any values.
> 
> At least in theory the definition of "too many constants" is different
> for different use cases and I see allowing to configure it as a way of
> reducing the level of surprise here.

I was thinking about this a few days ago and I agree that we don't
necessarily want to make it just a boolean thing; we may want to make it
more complex.  One trivial idea is to make it group entries in powers of
10: for 0-9 elements, you get one entry, and 10-99 you get a different
one, and so on:

# group everything in a single bucket
const_merge_threshold = true / yes / on 

# group 0-9, 10-99, 100-999, 1000-9999
const_merge_treshold = powers

Ideally the value would be represented somehow in the query text. For
example

                         query                            | calls
----------------------------------------------------------+-------
 select * from test where i in ({... 0-9 entries ...})    |     2
 select * from test where i in ({... 10-99 entries ...})  |     1

What do you think?  The jumble would have to know how to reduce all
values within each power-of-ten group to one specific value, but I don't
think that should be particularly difficult.


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote:
> On 2023-Feb-09, Dmitry Dolgov wrote:
>
> > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
>
> > > What is the point of making this a numeric setting?  Either you want
> > > to merge all values or you don't want to merge any values.
> >
> > At least in theory the definition of "too many constants" is different
> > for different use cases and I see allowing to configure it as a way of
> > reducing the level of surprise here.
>
> I was thinking about this a few days ago and I agree that we don't
> necessarily want to make it just a boolean thing; we may want to make it
> more complex.  One trivial idea is to make it group entries in powers of
> 10: for 0-9 elements, you get one entry, and 10-99 you get a different
> one, and so on:
>
> # group everything in a single bucket
> const_merge_threshold = true / yes / on
>
> # group 0-9, 10-99, 100-999, 1000-9999
> const_merge_treshold = powers
>
> Ideally the value would be represented somehow in the query text. For
> example
>
>                          query                            | calls
> ----------------------------------------------------------+-------
>  select * from test where i in ({... 0-9 entries ...})    |     2
>  select * from test where i in ({... 10-99 entries ...})  |     1
>
> What do you think?  The jumble would have to know how to reduce all
> values within each power-of-ten group to one specific value, but I don't
> think that should be particularly difficult.

Yeah, it sounds appealing and conveniently addresses the question of
losing the information about how many constants originally were there.
Not sure if the example above would be the most natural way to represent
it in the query text, but otherwise I'm going to try implementing this.
Stay tuned.



Re: pg_stat_statements and "IN" conditions

From
David Geier
Date:
Hi,

On 2/9/23 16:02, Dmitry Dolgov wrote:
>> Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 
I reviewed the last patch applied to some commit from Feb. 4th.
>> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with
const_merge_threshold= 2
 
>>
>> select pg_stat_statements_reset();
>> set const_merge_threshold to 1;
>> select * from test where i in (1,2,3);
>> select * from test where i in (1,2);
>> select * from test where i in (1);
>> select query, calls from pg_stat_statements order by query;
>>
>>                  query                | calls
>> -------------------------------------+-------
>>   select * from test where i in (...) |     2
>>   select * from test where i in ($1)  |     1
>>
>> Probably const_merge_threshold = 1 should produce only "i in (...)"?
> Well, it's not intentional, probably I need to be more careful with
> off-by-one. Although I agree to a certain extent with Peter questioning

Please add tests for all the corner cases. At least for (1) IN only 
contains a single element and (2) const_merge_threshold = 1.

Beyond that:

- There's a comment about find_const_walker(). I cannot find that 
function anywhere. What am I missing?

- What about renaming IsConstList() to IsMergableConstList().

- Don't you intend to use the NUMERIC data column in SELECT * FROM 
test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? 
Otherwise, the test is identical to previous test cases and you're not 
checking for what happens with NUMERICs which are wrapped in FuncExpr 
because of the implicit coercion.

- Don't we want to extend IsConstList() to allow a list of all 
implicitly coerced constants? It's inconsistent that otherwise e.g. 
NUMERICs don't work.

- Typo in /* The firsts merged constant */ (first not firsts)

- Prepared statements are not supported as they contain INs with Param 
instead of Const nodes. While less likely, I've seen applications that 
use prepared statements in conjunction with queries generated through a 
UI which ended up with tons of prepared queries with different number of 
elements in the IN clause. Not necessarily something that must go into 
this patch but maybe worth thinking about.

- The setting name const_merge_threshold is not very telling without 
knowing the context. While being a little longer, what about 
jumble_const_merge_threshold or queryid_const_merge_threshold or similar?

- Why do we actually only want to merge constants? Why don't we ignore 
the type of element in the IN and merge whatever is there? Is this 
because the original jumbling logic as of now only has support for 
constants?

- Ideally we would even remove duplicates. That would even improve 
cardinality estimation but I guess we don't want to spend the cycles on 
doing so in the planner?

- Why did you change palloc() to palloc0() for clocations array? The 
length is initialized to 0 and FWICS RecordConstLocation() initializes 
all members. Seems to me like we don't have to spend these cycles.

- Can't the logic at the end of IsConstList() not be simplified to:

         foreach(temp, elements)
             if (!IsA(lfirst(temp), Const))
                 return false;

         // All elements are of type Const
         *firstConst = linitial_node(Const, elements);
         *lastConst = llast_node(Const, elements);
         return true;

-- 
David Geier
(ServiceNow)




Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Feb 11, 2023 at 11:03:36AM +0100, David Geier wrote:
> Hi,
>
> On 2/9/23 16:02, Dmitry Dolgov wrote:
> > > Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs (
9ba37b2cb6a174b37fc51d0649ef73e56eae27fc)
 
> I reviewed the last patch applied to some commit from Feb. 4th.

Thanks for looking. Few quick answers about high-level questions below,
the rest I'll incorporate in the new version.

> - There's a comment about find_const_walker(). I cannot find that function
> anywhere. What am I missing?
>
> [...]
>
> - Don't you intend to use the NUMERIC data column in SELECT * FROM
> test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise,
> the test is identical to previous test cases and you're not checking for
> what happens with NUMERICs which are wrapped in FuncExpr because of the
> implicit coercion.
>
> - Don't we want to extend IsConstList() to allow a list of all implicitly
> coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't
> work.
>
> [...]
>
> - Prepared statements are not supported as they contain INs with Param
> instead of Const nodes. While less likely, I've seen applications that use
> prepared statements in conjunction with queries generated through a UI which
> ended up with tons of prepared queries with different number of elements in
> the IN clause. Not necessarily something that must go into this patch but
> maybe worth thinking about.

The original version of the patch was doing all of this, i.e. handling
numerics, Param nodes, RTE_VALUES. The commentary about
find_const_walker in tests is referring to a part of that, that was
dealing with evaluation of expression to see if it could be reduced to a
constant.

Unfortunately there was a significant push back from reviewers because
of those features. That's why I've reduced the patch to it's minimally
useful version, having in mind re-implementing them as follow-up patches
in the future. This is the reason as well why I left tests covering all
this missing functionality -- as breadcrumbs to already discovered
cases, important for the future extensions.

> - Why do we actually only want to merge constants? Why don't we ignore the
> type of element in the IN and merge whatever is there? Is this because the
> original jumbling logic as of now only has support for constants?
>
> - Ideally we would even remove duplicates. That would even improve
> cardinality estimation but I guess we don't want to spend the cycles on
> doing so in the planner?

I believe these points are beyond the patch goals, as it's less clear
(at least to me) if it's safe or desirable to do so.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
>
> The original version of the patch was doing all of this, i.e. handling
> numerics, Param nodes, RTE_VALUES. The commentary about
> find_const_walker in tests is referring to a part of that, that was
> dealing with evaluation of expression to see if it could be reduced to a
> constant.
>
> Unfortunately there was a significant push back from reviewers because
> of those features. That's why I've reduced the patch to it's minimally
> useful version, having in mind re-implementing them as follow-up patches
> in the future. This is the reason as well why I left tests covering all
> this missing functionality -- as breadcrumbs to already discovered
> cases, important for the future extensions.

I'd like to elaborate on this a bit and remind about the origins of the
patch, as it's lost somewhere in the beginning of the thread. The idea
is not pulled out of thin air, everything is coming from our attempts to
improve one particular monitoring infrastructure in a real commercial
setting. Every covered use case and test in the original proposal was a
result of field trials, when some application-side library or ORM was
responsible for gigabytes of data in pgss, chocking the monitoring agent.



Re: pg_stat_statements and "IN" conditions

From
David Geier
Date:
Hi,

On 2/11/23 13:08, Dmitry Dolgov wrote:
>> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
>>
>> The original version of the patch was doing all of this, i.e. handling
>> numerics, Param nodes, RTE_VALUES. The commentary about
>> find_const_walker in tests is referring to a part of that, that was
>> dealing with evaluation of expression to see if it could be reduced to a
>> constant.
>>
>> Unfortunately there was a significant push back from reviewers because
>> of those features. That's why I've reduced the patch to it's minimally
>> useful version, having in mind re-implementing them as follow-up patches
>> in the future. This is the reason as well why I left tests covering all
>> this missing functionality -- as breadcrumbs to already discovered
>> cases, important for the future extensions.
> I'd like to elaborate on this a bit and remind about the origins of the
> patch, as it's lost somewhere in the beginning of the thread. The idea
> is not pulled out of thin air, everything is coming from our attempts to
> improve one particular monitoring infrastructure in a real commercial
> setting. Every covered use case and test in the original proposal was a
> result of field trials, when some application-side library or ORM was
> responsible for gigabytes of data in pgss, chocking the monitoring agent.

Thanks for the clarifications. I didn't mean to contend the usefulness 
of the patch and I wasn't aware that you already jumped through the 
loops of handling Param, etc. Seems like supporting only constants is a 
good starting point. The only thing that is likely confusing for users 
is that NUMERICs (and potentially constants of other types) are 
unsupported. Wouldn't it be fairly simple to support them via something 
like the following?

     is_const(element) || (is_coercion(element) && is_const(element->child))

-- 
David Geier
(ServiceNow)




Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Feb 15, 2023 at 08:51:56AM +0100, David Geier wrote:
> Hi,
>
> On 2/11/23 13:08, Dmitry Dolgov wrote:
> > > On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
> > >
> > > The original version of the patch was doing all of this, i.e. handling
> > > numerics, Param nodes, RTE_VALUES. The commentary about
> > > find_const_walker in tests is referring to a part of that, that was
> > > dealing with evaluation of expression to see if it could be reduced to a
> > > constant.
> > >
> > > Unfortunately there was a significant push back from reviewers because
> > > of those features. That's why I've reduced the patch to it's minimally
> > > useful version, having in mind re-implementing them as follow-up patches
> > > in the future. This is the reason as well why I left tests covering all
> > > this missing functionality -- as breadcrumbs to already discovered
> > > cases, important for the future extensions.
> > I'd like to elaborate on this a bit and remind about the origins of the
> > patch, as it's lost somewhere in the beginning of the thread. The idea
> > is not pulled out of thin air, everything is coming from our attempts to
> > improve one particular monitoring infrastructure in a real commercial
> > setting. Every covered use case and test in the original proposal was a
> > result of field trials, when some application-side library or ORM was
> > responsible for gigabytes of data in pgss, chocking the monitoring agent.
>
> Thanks for the clarifications. I didn't mean to contend the usefulness of
> the patch and I wasn't aware that you already jumped through the loops of
> handling Param, etc.

No worries, I just wanted to emphasize that we've already collected
quite some number of use cases.

> Seems like supporting only constants is a good starting
> point. The only thing that is likely confusing for users is that NUMERICs
> (and potentially constants of other types) are unsupported. Wouldn't it be
> fairly simple to support them via something like the following?
>
>     is_const(element) || (is_coercion(element) && is_const(element->child))

It definitely makes sense to implement that, although I don't think it's
going to be acceptable to do that via directly listing conditions an
element has to satisfy. It probably has to be more flexible, sice we
would like to extend it in the future. My plan is to address this in a
follow-up patch, when the main mechanism is approved. Would you agree
with this approach?



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 09, 2023 at 08:43:29PM +0100, Dmitry Dolgov wrote:
> > On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote:
> > On 2023-Feb-09, Dmitry Dolgov wrote:
> >
> > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote:
> >
> > > > What is the point of making this a numeric setting?  Either you want
> > > > to merge all values or you don't want to merge any values.
> > >
> > > At least in theory the definition of "too many constants" is different
> > > for different use cases and I see allowing to configure it as a way of
> > > reducing the level of surprise here.
> >
> > I was thinking about this a few days ago and I agree that we don't
> > necessarily want to make it just a boolean thing; we may want to make it
> > more complex.  One trivial idea is to make it group entries in powers of
> > 10: for 0-9 elements, you get one entry, and 10-99 you get a different
> > one, and so on:
> >
> > # group everything in a single bucket
> > const_merge_threshold = true / yes / on
> >
> > # group 0-9, 10-99, 100-999, 1000-9999
> > const_merge_treshold = powers
> >
> > Ideally the value would be represented somehow in the query text. For
> > example
> >
> >                          query                            | calls
> > ----------------------------------------------------------+-------
> >  select * from test where i in ({... 0-9 entries ...})    |     2
> >  select * from test where i in ({... 10-99 entries ...})  |     1
> >
> > What do you think?  The jumble would have to know how to reduce all
> > values within each power-of-ten group to one specific value, but I don't
> > think that should be particularly difficult.
>
> Yeah, it sounds appealing and conveniently addresses the question of
> losing the information about how many constants originally were there.
> Not sure if the example above would be the most natural way to represent
> it in the query text, but otherwise I'm going to try implementing this.
> Stay tuned.

It took me couple of evenings, here is what I've got:

* The representation is not that far away from your proposal, I've
  settled on:

    SELECT * FROM test_merge WHERE id IN (... [10-99 entries])

* To not reinvent the wheel, I've reused decimalLenght function from
  numutils, hence one more patch to make it available to reuse.

* This approach resolves my concerns about letting people tuning
  the behaviour of merging, as now it's possible to distinguish between
  calls with different number of constants up to the power of 10. So
  I've decided to simplify the configuration and make the guc boolean to
  turn it off or on.

* To separate queries with constants falling into different ranges
  (10-99, 100-999, etc), the order of magnitude is added into the jumble
  hash.

* I've incorporated feedback from Sergei and David, as well as tried to
  make comments and documentation more clear.

Any feedback is welcomed, thanks!

Attachment

Re: pg_stat_statements and "IN" conditions

From
David Geier
Date:
Hi,

>> Seems like supporting only constants is a good starting
>> point. The only thing that is likely confusing for users is that NUMERICs
>> (and potentially constants of other types) are unsupported. Wouldn't it be
>> fairly simple to support them via something like the following?
>>
>>      is_const(element) || (is_coercion(element) && is_const(element->child))
> It definitely makes sense to implement that, although I don't think it's
> going to be acceptable to do that via directly listing conditions an
> element has to satisfy. It probably has to be more flexible, sice we
> would like to extend it in the future. My plan is to address this in a
> follow-up patch, when the main mechanism is approved. Would you agree
> with this approach?

I still think it's counterintuitive and I'm pretty sure people would 
even report this as a bug because not knowing about the difference in 
internal representation you would expect NUMERICs to work the same way 
other constants work. If anything we would have to document it.

Can't we do something pragmatic and have something like 
IsMergableInElement() which for now only supports constants and later 
can be extended? Or what exactly do you mean by "more flexible"?

-- 
David Geier
(ServiceNow)




Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Feb 23, 2023 at 09:48:35AM +0100, David Geier wrote:
> Hi,
>
> > > Seems like supporting only constants is a good starting
> > > point. The only thing that is likely confusing for users is that NUMERICs
> > > (and potentially constants of other types) are unsupported. Wouldn't it be
> > > fairly simple to support them via something like the following?
> > >
> > >      is_const(element) || (is_coercion(element) && is_const(element->child))
> > It definitely makes sense to implement that, although I don't think it's
> > going to be acceptable to do that via directly listing conditions an
> > element has to satisfy. It probably has to be more flexible, sice we
> > would like to extend it in the future. My plan is to address this in a
> > follow-up patch, when the main mechanism is approved. Would you agree
> > with this approach?
>
> I still think it's counterintuitive and I'm pretty sure people would even
> report this as a bug because not knowing about the difference in internal
> representation you would expect NUMERICs to work the same way other
> constants work. If anything we would have to document it.
>
> Can't we do something pragmatic and have something like
> IsMergableInElement() which for now only supports constants and later can be
> extended? Or what exactly do you mean by "more flexible"?

Here is how I see it (pls correct me if I'm wrong at any point). To
support numerics as presented in the tests from this patch, we have to
deal with FuncExpr (the function converting a value into a numeric).
Having in mind only numerics, we would need to filter out any other
FuncExpr (which already sounds dubious). Then we need to validate for
example that the function is immutable and have constant arguments,
which is already implemented in evaluate_function and is a part of
eval_const_expression. There is nothing special about numerics at this
point, and this approach leads us back to eval_const_expression to a
certain degree. Do you see any other way?

I'm thinking about Michael idea in this context, and want to see if it
would be possible to make the mechanism more flexible using some node
attributes. But I see it only as a follow-up step, not a prerequisite.



Re: pg_stat_statements and "IN" conditions

From
"Gregory Stark (as CFM)"
Date:
So I was seeing that this patch needs a rebase according to cfbot.

However it looks like the review feedback you're looking for is more
of design questions. What jumbling is best to include in the feature
set and which is best to add in later patches. It sounds like you've
gotten conflicting feedback from initial reviews.

It does sound like the patch is pretty mature and you're actively
responding to feedback so if you got more authoritative feedback it
might even be committable now. It's already been two years of being
rolled forward so it would be a shame to keep rolling it forward.

Or is there some fatal problem that you're trying to work around and
still haven't found the magic combination that convinces any
committers this is something we want? In which case perhaps we set
this patch returned? I don't get that impression myself though.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote:
> So I was seeing that this patch needs a rebase according to cfbot.

Yeah, folks are getting up to speed in with pgss improvements recently.
Thanks for letting me know.

> However it looks like the review feedback you're looking for is more
> of design questions. What jumbling is best to include in the feature
> set and which is best to add in later patches. It sounds like you've
> gotten conflicting feedback from initial reviews.
>
> It does sound like the patch is pretty mature and you're actively
> responding to feedback so if you got more authoritative feedback it
> might even be committable now. It's already been two years of being
> rolled forward so it would be a shame to keep rolling it forward.

You got it about right. There is a balance to strike between
implementation, that would cover more useful cases, but has more
dependencies (something like possibility of having multiple query id),
and more minimalistic implementation that would actually be acceptable
in the way it is now. But I haven't heard back from David about it, so I
assume everybody is fine with the minimalistic approach.

> Or is there some fatal problem that you're trying to work around and
> still haven't found the magic combination that convinces any
> committers this is something we want? In which case perhaps we set
> this patch returned? I don't get that impression myself though.

Nothing like this on my side, although I'm not good at conjuring
committing powers of the nature.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Mar 14, 2023 at 08:04:32PM +0100, Dmitry Dolgov wrote:
> > On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote:
> > So I was seeing that this patch needs a rebase according to cfbot.
>
> Yeah, folks are getting up to speed in with pgss improvements recently.
> Thanks for letting me know.

Following recent refactoring of pg_stat_statements tests, I've created a
new one for merging functionality in the patch. This should solve any
conflicts.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Nathan Bossart
Date:
On Sun, Mar 19, 2023 at 01:27:34PM +0100, Dmitry Dolgov wrote:
> +        If this parameter is on, two queries with an array will get the same
> +        query identifier if the only difference between them is the number of
> +        constants, both numbers is of the same order of magnitude and greater or
> +        equal 10 (so the order of magnitude is greather than 1, it is not worth
> +        the efforts otherwise).

IMHO this adds way too much complexity to something that most users would
expect to be an on/off switch.  If I understand Álvaro's suggestion [0]
correctly, he's saying that in addition to allowing "on" and "off", it
might be worth allowing something like "powers" to yield roughly the
behavior described above.  I don't think he's suggesting that this "powers"
behavior should be the only available option.  Also, it seems
counterintuitive that queries with fewer than 10 constants are not merged.

In the interest of moving this patch forward, I would suggest making it a
simple on/off switch in 0002 and moving the "powers" functionality to a new
0003 patch.  I think separating out the core part of this feature might
help reviewers.  As you can see, I got distracted by the complicated
threshold logic and ended up focusing my first round of review there.

[0] https://postgr.es/m/20230209172651.cfgrebpyyr72h7fv%40alvherre.pgsql

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote:

Thanks for reviewing.

> On Sun, Mar 19, 2023 at 01:27:34PM +0100, Dmitry Dolgov wrote:
> > +        If this parameter is on, two queries with an array will get the same
> > +        query identifier if the only difference between them is the number of
> > +        constants, both numbers is of the same order of magnitude and greater or
> > +        equal 10 (so the order of magnitude is greather than 1, it is not worth
> > +        the efforts otherwise).
>
> IMHO this adds way too much complexity to something that most users would
> expect to be an on/off switch.

This documentation is exclusively to be precise about how does it work.
Users don't have to worry about all this, and pretty much turn it
on/off, as you've described. I agree though, I could probably write this
text a bit differently.

> If I understand Álvaro's suggestion [0] correctly, he's saying that in
> addition to allowing "on" and "off", it might be worth allowing
> something like "powers" to yield roughly the behavior described above.
> I don't think he's suggesting that this "powers" behavior should be
> the only available option.

Independently of what Álvaro was suggesting, I find the "powers"
approach more suitable, because it answers my own concerns about the
previous implementation. Having "on"/"off" values means we would have to
scratch heads coming up with a one-size-fit-all default value, or to
introduce another option for the actual cut-off threshold. I would like
to avoid both of those options, that's why I went with "powers" only.

> Also, it seems counterintuitive that queries with fewer than 10
> constants are not merged.

Why? What would be your intuition using this feature?

> In the interest of moving this patch forward, I would suggest making it a
> simple on/off switch in 0002 and moving the "powers" functionality to a new
> 0003 patch.  I think separating out the core part of this feature might
> help reviewers.  As you can see, I got distracted by the complicated
> threshold logic and ended up focusing my first round of review there.

I would disagree. As I've described above, to me "powers" seems to be a
better fit, and the complicated logic is in fact reusing one already
existing function. Do those arguments sound convincing to you?



Re: pg_stat_statements and "IN" conditions

From
Jakub Wartak
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            tested, passed

I've tested the patched on 17devel/master and it is my feeling - especially given the proliferation of the ORMs - that
weneed such thing in pgss. Thread already took almost 3 years, so it would be pity to waste so much development time of
yours.Cfbot is green, and patch works very well for me. IMVHO commitfest status should be even set to
ready-for-comitter.

Given the:
    SET query_id_const_merge = on;
    SELECT pg_stat_statements_reset();
    SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11);
    SELECT * FROM test WHERE a IN (1, 2, 3);
    SELECT * FROM test WHERE a = ALL('{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}');
    SELECT * FROM test WHERE a = ANY (ARRAY[11,10,9,8,7,6,5,4,3,2,1]);

The patch results in:
                          q                          | calls
-----------------------------------------------------+-------
 SELECT * FROM test WHERE a = ALL($1)                |     1
 SELECT pg_stat_statements_reset()                   |     1
 SELECT * FROM test WHERE a IN ($1, $2, $3)          |     1
 SELECT * FROM test WHERE a IN (... [10-99 entries]) |     2

Of course it's pity it doesn't collapse the below ones:

SELECT * FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11)) AS t (num);
INSERT INTO dummy VALUES(1, 'text 1'),(2, 'text 2'),(3, 'text 3'),(4, 'text 3'),(5, 'text 3'),(6, 'text 3'),(7, 'text
3'),(8,'text 3'),(9, 'text 3'),(10, 'text 3') ON CONFLICT (id) DO NOTHING;
 
PREPARE s3(int[], int[], int[], int[], int[], int[], int[], int[], int[], int[], int[]) AS SELECT * FROM test WHERE 
    a = ANY ($1::int[]) OR 
    a = ANY ($2::int[]) OR
[..]
    a = ANY ($11::int[]) ;

but given the convoluted thread history, it's understandable and as you stated - maybe in future.

There's one additional benefit to this patch: the pg_hint_plan extension seems to borrow pgss's
generate_normalized_query().So if that's changed in next major release, the pg_hint_plan hint table (transparent plan
rewriteusing table) will automatically benefit from generalization of the query string here (imagine fixing plans for
ORMthat generate N {1,1024} number of IN() array elements; today that would be N number of entries in the
"hint_plan.hints"table). 

The new status of this patch is: Needs review

Re: pg_stat_statements and "IN" conditions

From
Maciek Sakrejda
Date:
I've also tried the patch and I see the same results as Jakub, which
make sense to me. I did have issues getting it to apply, though: `git
am` complains about a conflict, though patch itself was able to apply
it.



Re: pg_stat_statements and "IN" conditions

From
Yasuo Honda
Date:
Hi, this is my first email to the pgsql hackers.

I came across this email thread while looking at
https://github.com/rails/rails/pull/49388 for Ruby on Rails one of the
popular web application framework by replacing every query `in` clause
with `any` to reduce similar entries in `pg_stat_statements`.

I want this to be solved on the PostgreSQL side, mainly because I want
to avoid replacing
every in clause with any to reduce similar entries in pg_stat_statements.

It would be nice to have this patch reviewed.

As I'm not familiar with C and PostgreSQL source code, I'm not
reviewing this patch myself,
I applied this patch to my local PostgreSQL and the Active Record unit
tests ran successfully.
--
Yasuo Honda



Re: pg_stat_statements and "IN" conditions

From
Michael Paquier
Date:
On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote:
> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote:
>> IMHO this adds way too much complexity to something that most users would
>> expect to be an on/off switch.
>
> This documentation is exclusively to be precise about how does it work.
> Users don't have to worry about all this, and pretty much turn it
> on/off, as you've described. I agree though, I could probably write this
> text a bit differently.

FWIW, I am going to side with Nathan on this one, but not completely
either.  I was looking at the patch and it brings too much complexity
for a monitoring feature in this code path.  In my experience, I've
seen people complain about IN/ANY never strimmed down to a single
parameter in pg_stat_statements but I still have to hear from somebody
outside this thread that they'd like to reduce an IN clause depending
on the number of items, or something else.

>> If I understand Álvaro's suggestion [0] correctly, he's saying that in
>> addition to allowing "on" and "off", it might be worth allowing
>> something like "powers" to yield roughly the behavior described above.
>> I don't think he's suggesting that this "powers" behavior should be
>> the only available option.
>
> Independently of what Álvaro was suggesting, I find the "powers"
> approach more suitable, because it answers my own concerns about the
> previous implementation. Having "on"/"off" values means we would have to
> scratch heads coming up with a one-size-fit-all default value, or to
> introduce another option for the actual cut-off threshold. I would like
> to avoid both of those options, that's why I went with "powers" only.

Now, it doesn't mean that this approach with the "powers" will never
happen, but based on the set of opinions I am gathering on this thread
I would suggest to rework the patch as follows:
- First implement an on/off switch that reduces the lists in IN and/or
ANY to one parameter.  Simply.
- Second, refactor the powers routine.
- Third, extend the on/off switch, or just implement a threshold with
a second switch.

When it comes to my opinion, I am not seeing any objections to the
feature as a whole, and I'm OK with the first step.  I'm also OK to
keep the door open for more improvements in controlling how these
IN/ANY lists show up, but there could be more than just the number of
items as parameter (say the query size, different behaviors depending
on the number of clauses in queries, subquery context or CTEs/WITH,
etc. just to name a few things coming in mind).
--
Michael

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote:
> Now, it doesn't mean that this approach with the "powers" will never
> happen, but based on the set of opinions I am gathering on this thread
> I would suggest to rework the patch as follows:
> - First implement an on/off switch that reduces the lists in IN and/or
> ANY to one parameter.  Simply.
> - Second, refactor the powers routine.
> - Third, extend the on/off switch, or just implement a threshold with
> a second switch.

Well, if it will help move this patch forward, why not. To clarify, I'm
going to split the current implementation into three patches, one for
each point you've mentioned.

> When it comes to my opinion, I am not seeing any objections to the
> feature as a whole, and I'm OK with the first step.  I'm also OK to
> keep the door open for more improvements in controlling how these
> IN/ANY lists show up, but there could be more than just the number of
> items as parameter (say the query size, different behaviors depending
> on the number of clauses in queries, subquery context or CTEs/WITH,
> etc. just to name a few things coming in mind).

Interesting point, but now it's my turn to have troubles imagining the
case, where list representation could be controlled depending on
something else than the number of elements in it. Do you have any
specific example in mind?



Re: pg_stat_statements and "IN" conditions

From
Nathan Bossart
Date:
On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote:
>> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote:
>> Also, it seems counterintuitive that queries with fewer than 10
>> constants are not merged.
> 
> Why? What would be your intuition using this feature?

For the "powers" setting, I would've expected queries with 0-9 constants to
be merged.  Then 10-99, 100-999, 1000-9999, etc.  I suppose there might be
an argument for separating 0 from 1-9, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Fri, Oct 13, 2023 at 03:35:19PM +0200, Dmitry Dolgov wrote:
> > On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote:
> > Now, it doesn't mean that this approach with the "powers" will never
> > happen, but based on the set of opinions I am gathering on this thread
> > I would suggest to rework the patch as follows:
> > - First implement an on/off switch that reduces the lists in IN and/or
> > ANY to one parameter.  Simply.
> > - Second, refactor the powers routine.
> > - Third, extend the on/off switch, or just implement a threshold with
> > a second switch.
>
> Well, if it will help move this patch forward, why not. To clarify, I'm
> going to split the current implementation into three patches, one for
> each point you've mentioned.

Here is what I had mind. The first patch implements the basic notion of
merging, and I guess everyone agrees on its usefulness. The second and
third implement merging into groups power of 10, which I still find
useful as well. The last one adds a lower threshold for merging on top
of that. My intentions are to get the first one in, ideally I would love
to see the second and third applied as well.

> > When it comes to my opinion, I am not seeing any objections to the
> > feature as a whole, and I'm OK with the first step.  I'm also OK to
> > keep the door open for more improvements in controlling how these
> > IN/ANY lists show up, but there could be more than just the number of
> > items as parameter (say the query size, different behaviors depending
> > on the number of clauses in queries, subquery context or CTEs/WITH,
> > etc. just to name a few things coming in mind).
>
> Interesting point, but now it's my turn to have troubles imagining the
> case, where list representation could be controlled depending on
> something else than the number of elements in it. Do you have any
> specific example in mind?

In the current patch version I didn't add anything yet to address the
question of having more parameters to tune constants merging. The main
obstacle as I see it is that the information for that has to be
collected when jumbling various query nodes. Anything except information
about the ArrayExpr itself would have to be acquired when jumbling some
other parts of the query, not directly related to the ArrayExpr. It
seems to me this interdependency between otherwise unrelated nodes
outweigh the value it brings, and would require some more elaborate (and
more invasive for the purpose of this patch) mechanism to implement.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 10:15:41AM +0200, Dmitry Dolgov wrote:
> In the current patch version I didn't add anything yet to address the
> question of having more parameters to tune constants merging. The main
> obstacle as I see it is that the information for that has to be
> collected when jumbling various query nodes. Anything except information
> about the ArrayExpr itself would have to be acquired when jumbling some
> other parts of the query, not directly related to the ArrayExpr. It
> seems to me this interdependency between otherwise unrelated nodes
> outweigh the value it brings, and would require some more elaborate (and
> more invasive for the purpose of this patch) mechanism to implement.

 typedef struct ArrayExpr
 {
+    pg_node_attr(custom_query_jumble)
+

Hmm.  I am not sure that this is the best approach
implementation-wise.  Wouldn't it be better to invent a new
pg_node_attr (these can include parameters as well!), say
query_jumble_merge or query_jumble_agg_location that aggregates all
the parameters of a list to be considered as a single element.  To put
it short, we could also apply the same property to other parts of a
parsed tree, and not only an ArrayExpr's list.

 /* GUC parameters */
 extern PGDLLIMPORT int compute_query_id;
-
+extern PGDLLIMPORT bool query_id_const_merge;

Not much a fan of this addition as well for an in-core GUC.  I would
suggest pushing the GUC layer to pg_stat_statements, maintaining the
computation method to use as a field of JumbleState as I suspect that
this is something we should not enforce system-wide, but at
extension-level instead.

+    /*
+     * Indicates the constant represents the beginning or the end of a merged
+     * constants interval.
+     */
+    bool        merged;

Not sure that this is the best thing to do either.  Instead of this
extra boolean flag, could it be simpler if we switch LocationLen so as
we track the start position and the end position of a constant in a
query string, so as we'd use one LocationLen for a whole set of Const
nodes in an ArrayExpr?  Perhaps this could just be a refactoring piece
of its own?

+    /*
+     * If the first expression is a constant, verify if the following elements
+     * are constants as well. If yes, the list is eligible for merging, and the
+     * order of magnitude need to be calculated.
+     */
+    if (IsA(firstExpr, Const))
+    {
+        foreach(temp, elements)
+            if (!IsA(lfirst(temp), Const))
+                return false;

This path should be benchmarked, IMO.
--
Michael

Attachment

Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
>  typedef struct ArrayExpr
>  {
> +    pg_node_attr(custom_query_jumble)
> +
>
> Hmm.  I am not sure that this is the best approach
> implementation-wise.  Wouldn't it be better to invent a new
> pg_node_attr (these can include parameters as well!), say
> query_jumble_merge or query_jumble_agg_location that aggregates all
> the parameters of a list to be considered as a single element.  To put
> it short, we could also apply the same property to other parts of a
> parsed tree, and not only an ArrayExpr's list.

Sounds like an interesting idea, something like:

    typedef struct ArrayExpr
    {
        ...
        List       *elements pg_node_attr(query_jumble_merge);

to replace simple JUMBLE_NODE(elements) with more elaborated logic.

>  /* GUC parameters */
>  extern PGDLLIMPORT int compute_query_id;
> -
> +extern PGDLLIMPORT bool query_id_const_merge;
>
> Not much a fan of this addition as well for an in-core GUC.  I would
> suggest pushing the GUC layer to pg_stat_statements, maintaining the
> computation method to use as a field of JumbleState as I suspect that
> this is something we should not enforce system-wide, but at
> extension-level instead.

I also do not particularly like an extra GUC here, but as far as I can
tell to make it pg_stat_statements GUC only it has to be something
similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be
called from pgss. Does this sound better?

> +    /*
> +     * Indicates the constant represents the beginning or the end of a merged
> +     * constants interval.
> +     */
> +    bool        merged;
>
> Not sure that this is the best thing to do either.  Instead of this
> extra boolean flag, could it be simpler if we switch LocationLen so as
> we track the start position and the end position of a constant in a
> query string, so as we'd use one LocationLen for a whole set of Const
> nodes in an ArrayExpr?  Perhaps this could just be a refactoring piece
> of its own?

Sounds interesting as well, but it seems to me there is a catch. I'll
try to elaborate, bear with me:

* if the start and the end positions of a constant means the first and the
last character representing it, we need the textual length of the
constant in the query to be able to construct such a LocationLen.  The
lengths are calculated in pg_stat_statements later, not in JumbleQuery,
and it uses parser for that. Doing all of this in JumbleQuery doesn't
sound reasonable to me.

* if instead we talk about the start and the end positions in a
set of constants, that would mean locations of the first and the last
constants in the set, and everything seems fine. But for such
LocationLen to represent a single constant (not a set of constants), it
means that only the start position would be meaningful, the end position
will not be used.

The second approach is somewhat close to be simpler than the merge flag,
but assumes the ugliness for a single constant. What do you think about
this?

> +    /*
> +     * If the first expression is a constant, verify if the following elements
> +     * are constants as well. If yes, the list is eligible for merging, and the
> +     * order of magnitude need to be calculated.
> +     */
> +    if (IsA(firstExpr, Const))
> +    {
> +        foreach(temp, elements)
> +            if (!IsA(lfirst(temp), Const))
> +                return false;
>
> This path should be benchmarked, IMO.

I can do some benchmarking here, but of course it's going to be slower
than the baseline. The main idea behind the patch is to trade this
overhead for the benefits in the future while processing pgss records,
hoping that it's going to be worth it (and in those extreme cases I'm
trying to address it's definitely worth it).



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Fri, Oct 27, 2023 at 05:02:44PM +0200, Dmitry Dolgov wrote:
> > On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
> >  typedef struct ArrayExpr
> >  {
> > +    pg_node_attr(custom_query_jumble)
> > +
> >
> > Hmm.  I am not sure that this is the best approach
> > implementation-wise.  Wouldn't it be better to invent a new
> > pg_node_attr (these can include parameters as well!), say
> > query_jumble_merge or query_jumble_agg_location that aggregates all
> > the parameters of a list to be considered as a single element.  To put
> > it short, we could also apply the same property to other parts of a
> > parsed tree, and not only an ArrayExpr's list.
>
> Sounds like an interesting idea, something like:
>
>     typedef struct ArrayExpr
>     {
>         ...
>         List       *elements pg_node_attr(query_jumble_merge);
>
> to replace simple JUMBLE_NODE(elements) with more elaborated logic.
>
> >  /* GUC parameters */
> >  extern PGDLLIMPORT int compute_query_id;
> > -
> > +extern PGDLLIMPORT bool query_id_const_merge;
> >
> > Not much a fan of this addition as well for an in-core GUC.  I would
> > suggest pushing the GUC layer to pg_stat_statements, maintaining the
> > computation method to use as a field of JumbleState as I suspect that
> > this is something we should not enforce system-wide, but at
> > extension-level instead.
>
> I also do not particularly like an extra GUC here, but as far as I can
> tell to make it pg_stat_statements GUC only it has to be something
> similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be
> called from pgss. Does this sound better?

For clarity, here is what I had in mind for those two points.

Attachment

Re: pg_stat_statements and "IN" conditions

From
vignesh C
Date:
On Tue, 31 Oct 2023 at 14:36, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 27, 2023 at 05:02:44PM +0200, Dmitry Dolgov wrote:
> > > On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote:
> > >  typedef struct ArrayExpr
> > >  {
> > > +   pg_node_attr(custom_query_jumble)
> > > +
> > >
> > > Hmm.  I am not sure that this is the best approach
> > > implementation-wise.  Wouldn't it be better to invent a new
> > > pg_node_attr (these can include parameters as well!), say
> > > query_jumble_merge or query_jumble_agg_location that aggregates all
> > > the parameters of a list to be considered as a single element.  To put
> > > it short, we could also apply the same property to other parts of a
> > > parsed tree, and not only an ArrayExpr's list.
> >
> > Sounds like an interesting idea, something like:
> >
> >     typedef struct ArrayExpr
> >     {
> >         ...
> >         List     *elements pg_node_attr(query_jumble_merge);
> >
> > to replace simple JUMBLE_NODE(elements) with more elaborated logic.
> >
> > >  /* GUC parameters */
> > >  extern PGDLLIMPORT int compute_query_id;
> > > -
> > > +extern PGDLLIMPORT bool query_id_const_merge;
> > >
> > > Not much a fan of this addition as well for an in-core GUC.  I would
> > > suggest pushing the GUC layer to pg_stat_statements, maintaining the
> > > computation method to use as a field of JumbleState as I suspect that
> > > this is something we should not enforce system-wide, but at
> > > extension-level instead.
> >
> > I also do not particularly like an extra GUC here, but as far as I can
> > tell to make it pg_stat_statements GUC only it has to be something
> > similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be
> > called from pgss. Does this sound better?
>
> For clarity, here is what I had in mind for those two points.

CFBot shows documentation build has failed at [1] with:
[07:44:55.531] time make -s -j${BUILD_JOBS} -C doc
[07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF
attribute linkend references an unknown ID
"guc-query-id-const-merge-threshold"
[07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4
[07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml'
[07:44:58.181] make[1]: *** [Makefile:8: all] Error 2
[07:44:58.182] make: *** [Makefile:16: all] Error 2

[1] - https://cirrus-ci.com/task/6688578378399744

Regards,
Vignesh



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Jan 06, 2024 at 09:04:54PM +0530, vignesh C wrote:
>
> CFBot shows documentation build has failed at [1] with:
> [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc
> [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF
> attribute linkend references an unknown ID
> "guc-query-id-const-merge-threshold"
> [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4
> [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml'
> [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2
> [07:44:58.182] make: *** [Makefile:16: all] Error 2
>
> [1] - https://cirrus-ci.com/task/6688578378399744

Indeed, after moving the configuration option to pgss I forgot to update
its reference in the docs. Thanks for noticing, will update soon.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Jan 08, 2024 at 05:10:20PM +0100, Dmitry Dolgov wrote:
> > On Sat, Jan 06, 2024 at 09:04:54PM +0530, vignesh C wrote:
> >
> > CFBot shows documentation build has failed at [1] with:
> > [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc
> > [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF
> > attribute linkend references an unknown ID
> > "guc-query-id-const-merge-threshold"
> > [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4
> > [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml'
> > [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2
> > [07:44:58.182] make: *** [Makefile:16: all] Error 2
> >
> > [1] - https://cirrus-ci.com/task/6688578378399744
>
> Indeed, after moving the configuration option to pgss I forgot to update
> its reference in the docs. Thanks for noticing, will update soon.

Here is the fixed version.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/2837/
[2] https://cirrus-ci.com/task/6688578378399744

Kind Regards,
Peter Smith.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote:
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there was a CFbot test failure last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ======
> [1] https://commitfest.postgresql.org/46/2837/
> [2] https://cirrus-ci.com/task/6688578378399744

It's the same failing pipeline Vignesh C was talking above. I've fixed
the issue in the latest patch version, but looks like it wasn't picked
up yet (from what I understand, the latest build for this CF is 8 weeks
old).



Re: pg_stat_statements and "IN" conditions

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote:
>> Hi, This patch has a CF status of "Needs Review" [1], but it seems
>> there was a CFbot test failure last time it was run [2]. Please have a
>> look and post an updated version if necessary.
>>
>> ======
>> [1] https://commitfest.postgresql.org/46/2837/
>> [2] https://cirrus-ci.com/task/6688578378399744

> It's the same failing pipeline Vignesh C was talking above. I've fixed
> the issue in the latest patch version, but looks like it wasn't picked
> up yet (from what I understand, the latest build for this CF is 8 weeks
> old).

Please notice that at the moment, it's not being tested at all because
of a patch-apply failure -- that's what the little triangular symbol
means.  The rest of the display concerns the test results from the
last successfully-applied patch version.  (Perhaps that isn't a
particularly great UI design.)

If you click on the triangle you find out

== Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 ===
=== applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
patching file contrib/pg_stat_statements/Makefile
Hunk #1 FAILED at 19.
1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej
patching file contrib/pg_stat_statements/expected/merging.out
patching file contrib/pg_stat_statements/meson.build
...

            regards, tom lane



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Jan 22, 2024 at 11:35:22AM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> >> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote:
> >> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> >> there was a CFbot test failure last time it was run [2]. Please have a
> >> look and post an updated version if necessary.
> >>
> >> ======
> >> [1] https://commitfest.postgresql.org/46/2837/
> >> [2] https://cirrus-ci.com/task/6688578378399744
>
> > It's the same failing pipeline Vignesh C was talking above. I've fixed
> > the issue in the latest patch version, but looks like it wasn't picked
> > up yet (from what I understand, the latest build for this CF is 8 weeks
> > old).
>
> Please notice that at the moment, it's not being tested at all because
> of a patch-apply failure -- that's what the little triangular symbol
> means.  The rest of the display concerns the test results from the
> last successfully-applied patch version.  (Perhaps that isn't a
> particularly great UI design.)
>
> If you click on the triangle you find out
>
> == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 ===
> === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
> patching file contrib/pg_stat_statements/Makefile
> Hunk #1 FAILED at 19.
> 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej
> patching file contrib/pg_stat_statements/expected/merging.out
> patching file contrib/pg_stat_statements/meson.build

Oh, I see, thanks. Give me a moment, will fix this.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Jan 22, 2024 at 06:07:27PM +0100, Dmitry Dolgov wrote:
> > Please notice that at the moment, it's not being tested at all because
> > of a patch-apply failure -- that's what the little triangular symbol
> > means.  The rest of the display concerns the test results from the
> > last successfully-applied patch version.  (Perhaps that isn't a
> > particularly great UI design.)
> >
> > If you click on the triangle you find out
> >
> > == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 ===
> > === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
> > patching file contrib/pg_stat_statements/Makefile
> > Hunk #1 FAILED at 19.
> > 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej
> > patching file contrib/pg_stat_statements/expected/merging.out
> > patching file contrib/pg_stat_statements/meson.build
>
> Oh, I see, thanks. Give me a moment, will fix this.

Here is it.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Yasuo Honda
Date:
Hi, I'm interested in this feature. It looks like these patches have
some conflicts.

http://cfbot.cputube.org/patch_47_2837.log

Would you rebase these patches?

Thanks,
--
Yasuo Honda

On Sat, Mar 23, 2024 at 4:11 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> > Oh, I see, thanks. Give me a moment, will fix this.
>
> Here is it.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sat, Mar 23, 2024 at 04:13:44PM +0900, Yasuo Honda wrote:
> Hi, I'm interested in this feature. It looks like these patches have
> some conflicts.
>
> http://cfbot.cputube.org/patch_47_2837.log
>
> Would you rebase these patches?

Sure, I can rebase, give me a moment. If you don't want to wait, there
is a base commit in the patch, against which it should be applied
without issues, 0eb23285a2.



Re: pg_stat_statements and "IN" conditions

From
Yasuo Honda
Date:
Thanks for the information. I can apply these 4 patches from
0eb23285a2 . I tested this branch from Ruby on Rails and it gets some
unexpected behavior from my point of view.
Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not
normalize sql queries whose number of in clauses exceeds 5.

Here are test steps.
https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior

It would be appreciated if I can get my understanding correct.
--
Yasuo Honda

On Sun, Mar 24, 2024 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> Sure, I can rebase, give me a moment. If you don't want to wait, there
> is a base commit in the patch, against which it should be applied
> without issues, 0eb23285a2.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Mar 24, 2024 at 11:36:38PM +0900, Yasuo Honda wrote:
> Thanks for the information. I can apply these 4 patches from
> 0eb23285a2 . I tested this branch from Ruby on Rails and it gets some
> unexpected behavior from my point of view.
> Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not
> normalize sql queries whose number of in clauses exceeds 5.
>
> Here are test steps.
> https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior
>
> It would be appreciated if I can get my understanding correct.

From what I understand out of the description this ruby script uses
prepared statements, passing values as parameters, right? Unfortunately
the current version of the patch doesn't handle that, it works with
constants only [1]. The original incarnation of this feature was able to
handle that, but the implementation was considered to be not suitable --
thus, to make some progress, it was left outside.

The plan is, if everything goes fine at some point, to do a follow-up
patch to handle Params and the rest.

[1]: https://www.postgresql.org/message-id/20230211104707.grsicemegr7d3mgh%40erthalion.local



Re: pg_stat_statements and "IN" conditions

From
Yasuo Honda
Date:
Yes. The script uses prepared statements because Ruby on Rails enables
prepared statements by default for PostgreSQL databases.

Then I tested this branch
https://github.com/yahonda/postgres/tree/pg_stat_statements without
using prepared statements as follows and all of them do not normalize
in clause values.

- Disabled prepared statements by setting `prepared_statements: false`
https://gist.github.com/yahonda/2c2d6ac7a955886a305750eecfd07c5e

- Use ruby-pg
https://gist.github.com/yahonda/2f0efb11ae888d8f6b27a07e0b833fdf

- Use psql
https://gist.github.com/yahonda/c830379b33d66a743aef159aa03d7e49

I do not know why even if I use psql, the query column at
pg_stat_sql_statement shows it is like a prepared statement "IN ($1,
$2)".

On Tue, Mar 26, 2024 at 1:35 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> From what I understand out of the description this ruby script uses
> prepared statements, passing values as parameters, right? Unfortunately
> the current version of the patch doesn't handle that, it works with
> constants only [1]. The original incarnation of this feature was able to
> handle that, but the implementation was considered to be not suitable --
> thus, to make some progress, it was left outside.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Mar 26, 2024 at 04:21:46PM +0900, Yasuo Honda wrote:
> Yes. The script uses prepared statements because Ruby on Rails enables
> prepared statements by default for PostgreSQL databases.
>
> Then I tested this branch
> https://github.com/yahonda/postgres/tree/pg_stat_statements without
> using prepared statements as follows and all of them do not normalize
> in clause values.
>
> - Disabled prepared statements by setting `prepared_statements: false`
> https://gist.github.com/yahonda/2c2d6ac7a955886a305750eecfd07c5e
>
> - Use ruby-pg
> https://gist.github.com/yahonda/2f0efb11ae888d8f6b27a07e0b833fdf
>
> - Use psql
> https://gist.github.com/yahonda/c830379b33d66a743aef159aa03d7e49
>
> I do not know why even if I use psql, the query column at
> pg_stat_sql_statement shows it is like a prepared statement "IN ($1,
> $2)".

It's a similar case: the column is defined as bigint, thus PostgreSQL
has to wrap every constant expression in a function expression that
converts its type to bigint. The current patch version doesn't try to
reduce a FuncExpr into Const (event if the wrapped value is a Const),
thus this array is not getting merged. If you replace bigint with an
int, no type conversion would be required and merging logic will kick
in.

Again, the original version of the patch was able to handle this case,
but it was stripped away to make the patch smaller in hope of moving
forward. Anyway, thanks for reminding about how annoying the current
handling of constant arrays can look like in practice!



Re: pg_stat_statements and "IN" conditions

From
Yasuo Honda
Date:
Thanks for the useful info.

Ruby on Rails uses bigint as a default data type for the primary key
and prepared statements have been enabled by default for PostgreSQL.
I'm looking forward to these current patches being merged as a first
step and future versions of pg_stat_statements will support
normalizing bigint and prepared statements.

On Wed, Mar 27, 2024 at 6:00 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> It's a similar case: the column is defined as bigint, thus PostgreSQL
> has to wrap every constant expression in a function expression that
> converts its type to bigint. The current patch version doesn't try to
> reduce a FuncExpr into Const (event if the wrapped value is a Const),
> thus this array is not getting merged. If you replace bigint with an
> int, no type conversion would be required and merging logic will kick
> in.
>
> Again, the original version of the patch was able to handle this case,
> but it was stripped away to make the patch smaller in hope of moving
> forward. Anyway, thanks for reminding about how annoying the current
> handling of constant arrays can look like in practice!



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Wed, Mar 27, 2024 at 08:56:12AM +0900, Yasuo Honda wrote:
> Thanks for the useful info.
>
> Ruby on Rails uses bigint as a default data type for the primary key
> and prepared statements have been enabled by default for PostgreSQL.
> I'm looking forward to these current patches being merged as a first
> step and future versions of pg_stat_statements will support
> normalizing bigint and prepared statements.

Here is the rebased version. In the meantime I'm going to experiment
with how to support more use cases in a way that will be more acceptable
for the community.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Sutou Kouhei
Date:
Hi,

In <20240404143514.a26f7ttxrbdfc73a@erthalion.local>
  "Re: pg_stat_statements and "IN" conditions" on Thu, 4 Apr 2024 16:35:14 +0200,
  Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> Here is the rebased version.

Thanks. I'm not familiar with this code base but I've
reviewed these patches because I'm interested in this
feature too.

0001:

> diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
> index be823a7f8fa..e9473def361 100644
> --- a/src/backend/nodes/queryjumblefuncs.c
> +++ b/src/backend/nodes/queryjumblefuncs.c
> 
> @@ -212,15 +233,67 @@ RecordConstLocation(JumbleState *jstate, int location)
> ...
> +static bool
> +IsMergeableConstList(List *elements, Const **firstConst, Const **lastConst)
> +{
> +    ListCell   *temp;
> +    Node       *firstExpr = NULL;
> +
> +    if (elements == NULL)

"elements == NIL" will be better for List.

> +static void
> +_jumbleElements(JumbleState *jstate, List *elements)
> +{
> +    Const *first, *last;
> +    if(IsMergeableConstList(elements, &first, &last))

A space is missing between "if" and "(".

> diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
> index aa727e722cc..cf4f900d4ed 100644
> --- a/src/include/nodes/primnodes.h
> +++ b/src/include/nodes/primnodes.h
> @@ -1333,7 +1333,7 @@ typedef struct ArrayExpr
>      /* common type of array elements */
>      Oid            element_typeid pg_node_attr(query_jumble_ignore);
>      /* the array elements or sub-arrays */
> -    List       *elements;
> +    List       *elements pg_node_attr(query_jumble_merge);

Should we also update the pg_node_attr() comment for
query_jumble_merge in nodes.h?


0003:

> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index d7841b51cc9..00eec30feb1 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> ...
> @@ -2883,12 +2886,22 @@ generate_normalized_query(JumbleState *jstate, const char *query,
>          /* The firsts merged constant */
>          else if (!skip)
>          {
> +            static const uint32 powers_of_ten[] = {
> +                1, 10, 100,
> +                1000, 10000, 100000,
> +                1000000, 10000000, 100000000,
> +                1000000000
> +            };
> +            int lower_merged = powers_of_ten[magnitude - 1];
> +            int upper_merged = powers_of_ten[magnitude];

How about adding a reverse function of decimalLength32() to
numutils.h and use it here?

> -            n_quer_loc += sprintf(norm_query + n_quer_loc, "...");
> +            n_quer_loc += sprintf(norm_query + n_quer_loc, "... [%d-%d entries]",
> +                                  lower_merged, upper_merged - 1);

Do we still have enough space in norm_query for this change?
It seems that norm_query expects up to 10 additional bytes
per jstate->clocations[i].


It seems that we can merge 0001, 0003 and 0004 to one patch.
(Sorry. I haven't read all discussions yet. If we already
discuss this, sorry for this noise.)


Thanks,
-- 
kou



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Mon, Apr 15, 2024 at 06:09:29PM +0900, Sutou Kouhei wrote:
>
> Thanks. I'm not familiar with this code base but I've
> reviewed these patches because I'm interested in this
> feature too.

Thanks for the review! The commentaries for the first patch make sense
to me, will apply.

> 0003:
>
> > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> > index d7841b51cc9..00eec30feb1 100644
> > --- a/contrib/pg_stat_statements/pg_stat_statements.c
> > +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> > ...
> > @@ -2883,12 +2886,22 @@ generate_normalized_query(JumbleState *jstate, const char *query,
> >          /* The firsts merged constant */
> >          else if (!skip)
> >          {
> > +            static const uint32 powers_of_ten[] = {
> > +                1, 10, 100,
> > +                1000, 10000, 100000,
> > +                1000000, 10000000, 100000000,
> > +                1000000000
> > +            };
> > +            int lower_merged = powers_of_ten[magnitude - 1];
> > +            int upper_merged = powers_of_ten[magnitude];
>
> How about adding a reverse function of decimalLength32() to
> numutils.h and use it here?

I was pondering that at some point, but eventually decided to keep it
this way, because:

* the use case is quite specific, I can't image it's being used anywhere
  else

* it would not be strictly reverse, as the transformation itself is not
  reversible in the strict sense

> > -            n_quer_loc += sprintf(norm_query + n_quer_loc, "...");
> > +            n_quer_loc += sprintf(norm_query + n_quer_loc, "... [%d-%d entries]",
> > +                                  lower_merged, upper_merged - 1);
>
> Do we still have enough space in norm_query for this change?
> It seems that norm_query expects up to 10 additional bytes
> per jstate->clocations[i].

As far as I understand there should be enough space, because we're going
to replace at least 10 constants with one merge record. But it's a good
point, this should be called out in the commentary explaining why 10
additional bytes are added.

> It seems that we can merge 0001, 0003 and 0004 to one patch.
> (Sorry. I haven't read all discussions yet. If we already
> discuss this, sorry for this noise.)

There is a certain disagreement about which portion of this feature
makes sense to go with first, thus I think keeping all options open is a
good idea. In the end a committer can squash the patches if needed.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Tue, Apr 23, 2024 at 10:18:15AM +0200, Dmitry Dolgov wrote:
> > On Mon, Apr 15, 2024 at 06:09:29PM +0900, Sutou Kouhei wrote:
> >
> > Thanks. I'm not familiar with this code base but I've
> > reviewed these patches because I'm interested in this
> > feature too.
>
> Thanks for the review! The commentaries for the first patch make sense
> to me, will apply.

Here is the new version. It turned out you were right about memory for
the normalized query, if the number of constants goes close to INT_MAX,
there were indeed not enough allocated. I've added a fix for this on top
of the applied changes, and also improved readability for
pg_stat_statements part.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Sergei Kornilov
Date:
Hello

This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who
thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the
discussionit was mentioned that it would be good to measure performance difference.
 

PS: patch cannot be applied at this time, it needs another rebase.

regards, Sergei



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote:
>
> This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who
thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the
discussionit was mentioned that it would be good to measure performance difference.
 
>
> PS: patch cannot be applied at this time, it needs another rebase.

Yeah, it seems like most people are fine with the first patch and the
simplest approach. I'm going to post a rebased version and a short
thread summary soon.



Re: pg_stat_statements and "IN" conditions

From
Dmitry Dolgov
Date:
> On Sun, Aug 11, 2024 at 09:34:55PM GMT, Dmitry Dolgov wrote:
> > On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote:
> >
> > This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who
thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the
discussionit was mentioned that it would be good to measure performance difference.
 
> >
> > PS: patch cannot be applied at this time, it needs another rebase.
>
> Yeah, it seems like most people are fine with the first patch and the
> simplest approach. I'm going to post a rebased version and a short
> thread summary soon.

Ok, here is the rebased version. If anyone would like to review them, below is
the short summary of the thread. Currently the patch series contains 4 changes:

* 0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch

  Implements the simplest way to handle constant arrays, if the array contains
  only constants it will be reduced. This is the basis, if I read it correctly
  Nathan and Michael expressed that they're mostly fine with this one.

  Michael seems to be skeptical about the "merged" flag in the LocationLen
  struct, but from what I see the proposed alternative has problems as well.
  There was also a note that the loop over constants has to be benchmarked, but
  it's not entirely clear for me in which dimentions to benchmark (i.e. what
  are the expectations). For both I'm waiting on a reply to my questions.

* 0002-Reusable-decimalLength-functions.patch

  A small refactoring to make already existing "powers" functonality reusable
  for following patches.

* 0003-Merge-constants-in-ArrayExpr-into-groups.patch

  Makes handling of constant arrays smarter by taking into account number of
  elements in the array. This way records are merged into groups power of 10,
  i.e. arrays with length 55 will land in a group 10-99, with lenght 555 in a
  group 100-999 etc. This was proposed by Alvaro, and personally I like this
  approach, because it remediates the issue of one-size-fits-all for the static
  threshold. But there are opinions that this introduces too much complexity.

* 0004-Introduce-query_id_const_merge_threshold.patch

  Fine tuning for the previous patch, makes only arrays with the length over a
  certain threshold to be reduced.

On top of that Yasuo Honda and Jakub Wartak have provided a couple of practical
examples, where handling of constant arrays has to be improved. David Geier
pointed out some examples that might be confusing as well. All those are
definitely worth addressing, but out of scope of this patch for now.

Attachment

Re: pg_stat_statements and "IN" conditions

From
Kirill Reshke
Date:
On Wed, 14 Aug 2024 at 01:06, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sun, Aug 11, 2024 at 09:34:55PM GMT, Dmitry Dolgov wrote:
> > > On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote:
> > >
> > > This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread
whothink this is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the
discussionit was mentioned that it would be good to measure performance difference. 
> > >
> > > PS: patch cannot be applied at this time, it needs another rebase.
> >
> > Yeah, it seems like most people are fine with the first patch and the
> > simplest approach. I'm going to post a rebased version and a short
> > thread summary soon.
>
> Ok, here is the rebased version. If anyone would like to review them, below is
> the short summary of the thread. Currently the patch series contains 4 changes:
>
> * 0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch
>
>   Implements the simplest way to handle constant arrays, if the array contains
>   only constants it will be reduced. This is the basis, if I read it correctly
>   Nathan and Michael expressed that they're mostly fine with this one.
>
>   Michael seems to be skeptical about the "merged" flag in the LocationLen
>   struct, but from what I see the proposed alternative has problems as well.
>   There was also a note that the loop over constants has to be benchmarked, but
>   it's not entirely clear for me in which dimentions to benchmark (i.e. what
>   are the expectations). For both I'm waiting on a reply to my questions.
>
> * 0002-Reusable-decimalLength-functions.patch
>
>   A small refactoring to make already existing "powers" functonality reusable
>   for following patches.
>
> * 0003-Merge-constants-in-ArrayExpr-into-groups.patch
>
>   Makes handling of constant arrays smarter by taking into account number of
>   elements in the array. This way records are merged into groups power of 10,
>   i.e. arrays with length 55 will land in a group 10-99, with lenght 555 in a
>   group 100-999 etc. This was proposed by Alvaro, and personally I like this
>   approach, because it remediates the issue of one-size-fits-all for the static
>   threshold. But there are opinions that this introduces too much complexity.
>
> * 0004-Introduce-query_id_const_merge_threshold.patch
>
>   Fine tuning for the previous patch, makes only arrays with the length over a
>   certain threshold to be reduced.
>
> On top of that Yasuo Honda and Jakub Wartak have provided a couple of practical
> examples, where handling of constant arrays has to be improved. David Geier
> pointed out some examples that might be confusing as well. All those are
> definitely worth addressing, but out of scope of this patch for now.

Hi! Can you please send a rebased version of this?

--
Best regards,
Kirill Reshke