Thread: pg_stat_statements and "IN" conditions
Hi I would like to start another thread to follow up on [1], mostly to bump up the topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in queries like: SELECT something FROM table WHERE col IN (1, 2, 3, ...) The current implementation produces different jumble hash for every different number of arguments for essentially the same query. Unfortunately a lot of ORMs like to generate these types of queries, which in turn leads to pg_stat_statements pollution. Ideally we want to prevent this and have only one record for such a query. As the result of [1] I've identified two highlighted approaches to improve this situation: * Reduce the generated ArrayExpr to an array Const immediately, in cases where all the inputs are Consts. * Make repeating Const to contribute nothing to the resulting hash. I've tried to prototype both approaches to find out pros/cons and be more specific. Attached patches could not be considered a completed piece of work, but they seem to work, mostly pass the tests and demonstrate the point. I would like to get some high level input about them and ideally make it clear what is the preferred solution to continue with. # Reducing ArrayExpr to an array Const IIUC this requires producing a Const with ArrayType constvalue in transformAExprIn for ScalarArrayOpExpr. This could be a general improvement, since apparently it's being done later anyway. But it deals only with Const, which leaves more on the table, e.g. Params and other similar types of duplication we observe when repeating constants are wrapped into VALUES. Another point here is that it's quite possible this approach will still require corresponding changes in pg_stat_statements, since just preventing duplicates to show also loses the information. Ideally we also need to have some understanding how many elements are actually there and display it, e.g. in cases when there is just one outlier query that contains a huge IN list. # Contribute nothing to the hash I guess there could be multiple ways of doing this, but the first idea I had in mind is to skip jumbling when necessary. At the same time it can be implemented more centralized for different types of queries (although in the attached patch there are only Const & Values). In the simplest case we just identify sequence of constants of the same type, which just ignores any other cases when stuff is mixed. But I believe it's something that could be considered a rare corner case and it's better to start with the simplest solution. Having said that I believe the second approach of contributing nothing to the hash sounds more appealing, but would love to hear other opinions. [1]: https://www.postgresql.org/message-id/flat/CAF42k%3DJCfHMJtkAVXCzBn2XBxDC83xb4VhV7jU7enPnZ0CfEQQ%40mail.gmail.com
Attachment
> On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote: > > I would like to start another thread to follow up on [1], mostly to bump up the > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in > queries like: > > SELECT something FROM table WHERE col IN (1, 2, 3, ...) > > The current implementation produces different jumble hash for every different > number of arguments for essentially the same query. Unfortunately a lot of ORMs > like to generate these types of queries, which in turn leads to > pg_stat_statements pollution. Ideally we want to prevent this and have only one > record for such a query. > > As the result of [1] I've identified two highlighted approaches to improve this > situation: > > * Reduce the generated ArrayExpr to an array Const immediately, in cases where > all the inputs are Consts. > > * Make repeating Const to contribute nothing to the resulting hash. > > I've tried to prototype both approaches to find out pros/cons and be more > specific. Attached patches could not be considered a completed piece of work, > but they seem to work, mostly pass the tests and demonstrate the point. I would > like to get some high level input about them and ideally make it clear what is > the preferred solution to continue with. I've implemented the second approach mentioned above, this version was tested on our test clusters for some time without visible issues. Will create a CF item and would appreciate any feedback.
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested Hi, I did some test and it works well
> On Wed, Dec 09, 2020 at 03:37:40AM +0000, Chengxi Sun wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: not tested > > Hi, I did some test and it works well Thanks for testing!
> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote: > > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote: > > > > I would like to start another thread to follow up on [1], mostly to bump up the > > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in > > queries like: > > > > SELECT something FROM table WHERE col IN (1, 2, 3, ...) > > > > The current implementation produces different jumble hash for every different > > number of arguments for essentially the same query. Unfortunately a lot of ORMs > > like to generate these types of queries, which in turn leads to > > pg_stat_statements pollution. Ideally we want to prevent this and have only one > > record for such a query. > > > > As the result of [1] I've identified two highlighted approaches to improve this > > situation: > > > > * Reduce the generated ArrayExpr to an array Const immediately, in cases where > > all the inputs are Consts. > > > > * Make repeating Const to contribute nothing to the resulting hash. > > > > I've tried to prototype both approaches to find out pros/cons and be more > > specific. Attached patches could not be considered a completed piece of work, > > but they seem to work, mostly pass the tests and demonstrate the point. I would > > like to get some high level input about them and ideally make it clear what is > > the preferred solution to continue with. > > I've implemented the second approach mentioned above, this version was > tested on our test clusters for some time without visible issues. Will > create a CF item and would appreciate any feedback. After more testing I found couple of things that could be improved, namely in the presence of non-reducible constants one part of the query was not copied into the normalized version, and this approach also could be extended for Params. These are incorporated in the attached patch.
Attachment
Hi,
A few comments.
+ "After this number of duplicating constants start to merge them.",
duplicating -> duplicate
+ foreach(lc, (List *) expr)
+ {
+ Node * subExpr = (Node *) lfirst(lc);
+
+ if (!IsA(subExpr, Const))
+ {
+ allConst = false;
+ break;
+ }
+ }
+ {
+ Node * subExpr = (Node *) lfirst(lc);
+
+ if (!IsA(subExpr, Const))
+ {
+ allConst = false;
+ break;
+ }
+ }
It seems the above foreach loop (within foreach(temp, (List *) node)) can be preceded with a check that allConst is true. Otherwise the loop can be skipped.
+ if (currentExprIdx == pgss_merge_threshold - 1)
+ {
+ JumbleExpr(jstate, expr);
+
+ /*
+ * A const expr is already found, so JumbleExpr must
+ * record it. Mark it as merged, it will be the first
+ * merged but still present in the statement query.
+ */
+ Assert(jstate->clocations_count > 0);
+ jstate->clocations[jstate->clocations_count - 1].merged = true;
+ currentExprIdx++;
+ }
+ {
+ JumbleExpr(jstate, expr);
+
+ /*
+ * A const expr is already found, so JumbleExpr must
+ * record it. Mark it as merged, it will be the first
+ * merged but still present in the statement query.
+ */
+ Assert(jstate->clocations_count > 0);
+ jstate->clocations[jstate->clocations_count - 1].merged = true;
+ currentExprIdx++;
+ }
The above snippet occurs a few times. Maybe extract into a helper method.
Cheers
On Sat, Dec 26, 2020 at 2:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Wed, Nov 18, 2020 at 05:04:32PM +0100, Dmitry Dolgov wrote:
> > On Wed, Aug 12, 2020 at 06:19:02PM +0200, Dmitry Dolgov wrote:
> >
> > I would like to start another thread to follow up on [1], mostly to bump up the
> > topic. Just to remind, it's about how pg_stat_statements jumbling ArrayExpr in
> > queries like:
> >
> > SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> >
> > The current implementation produces different jumble hash for every different
> > number of arguments for essentially the same query. Unfortunately a lot of ORMs
> > like to generate these types of queries, which in turn leads to
> > pg_stat_statements pollution. Ideally we want to prevent this and have only one
> > record for such a query.
> >
> > As the result of [1] I've identified two highlighted approaches to improve this
> > situation:
> >
> > * Reduce the generated ArrayExpr to an array Const immediately, in cases where
> > all the inputs are Consts.
> >
> > * Make repeating Const to contribute nothing to the resulting hash.
> >
> > I've tried to prototype both approaches to find out pros/cons and be more
> > specific. Attached patches could not be considered a completed piece of work,
> > but they seem to work, mostly pass the tests and demonstrate the point. I would
> > like to get some high level input about them and ideally make it clear what is
> > the preferred solution to continue with.
>
> I've implemented the second approach mentioned above, this version was
> tested on our test clusters for some time without visible issues. Will
> create a CF item and would appreciate any feedback.
After more testing I found couple of things that could be improved,
namely in the presence of non-reducible constants one part of the query
was not copied into the normalized version, and this approach also could
be extended for Params. These are incorporated in the attached patch.
> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote: > Hi, > A few comments. > > + foreach(lc, (List *) expr) > + { > + Node * subExpr = (Node *) lfirst(lc); > + > + if (!IsA(subExpr, Const)) > + { > + allConst = false; > + break; > + } > + } > > It seems the above foreach loop (within foreach(temp, (List *) node)) can > be preceded with a check that allConst is true. Otherwise the loop can be > skipped. Thanks for noticing. Now that I look at it closer I think it's the other way around, the loop above checking constants for the first expression is not really necessary. > + if (currentExprIdx == pgss_merge_threshold - 1) > + { > + JumbleExpr(jstate, expr); > + > + /* > + * A const expr is already found, so JumbleExpr must > + * record it. Mark it as merged, it will be the > first > + * merged but still present in the statement query. > + */ > + Assert(jstate->clocations_count > 0); > + jstate->clocations[jstate->clocations_count - > 1].merged = true; > + currentExprIdx++; > + } > > The above snippet occurs a few times. Maybe extract into a helper method. Originally I was hesitant to extract it was because it's quite small part of the code. But now I've realized that the part relevant to lists is not really correct, which makes those bits even more different, so I think it makes sense to leave it like that. What do you think?
Attachment
Hi, Dmitry:
+ int lastExprLenght = 0;
Did you mean to name the variable lastExprLenghth ?
w.r.t. extracting to helper method, the second and third if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
It is up to you whether to create the helper method.
I am fine with the current formation.
Cheers
On Tue, Jan 5, 2021 at 4:51 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Sat, Dec 26, 2020 at 08:53:28AM -0800, Zhihong Yu wrote:
> Hi,
> A few comments.
>
> + foreach(lc, (List *) expr)
> + {
> + Node * subExpr = (Node *) lfirst(lc);
> +
> + if (!IsA(subExpr, Const))
> + {
> + allConst = false;
> + break;
> + }
> + }
>
> It seems the above foreach loop (within foreach(temp, (List *) node)) can
> be preceded with a check that allConst is true. Otherwise the loop can be
> skipped.
Thanks for noticing. Now that I look at it closer I think it's the other
way around, the loop above checking constants for the first expression
is not really necessary.
> + if (currentExprIdx == pgss_merge_threshold - 1)
> + {
> + JumbleExpr(jstate, expr);
> +
> + /*
> + * A const expr is already found, so JumbleExpr must
> + * record it. Mark it as merged, it will be the
> first
> + * merged but still present in the statement query.
> + */
> + Assert(jstate->clocations_count > 0);
> + jstate->clocations[jstate->clocations_count -
> 1].merged = true;
> + currentExprIdx++;
> + }
>
> The above snippet occurs a few times. Maybe extract into a helper method.
Originally I was hesitant to extract it was because it's quite small
part of the code. But now I've realized that the part relevant to lists
is not really correct, which makes those bits even more different, so I
think it makes sense to leave it like that. What do you think?
On 1/5/21 10:51 AM, Zhihong Yu wrote: > > + int lastExprLenght = 0; > > Did you mean to name the variable lastExprLenghth ? > > w.r.t. extracting to helper method, the second and third > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > It is up to you whether to create the helper method. > I am fine with the current formation. Dmitry, thoughts on this review? Regards, -- -David david@pgmasters.net
> On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote: > On 1/5/21 10:51 AM, Zhihong Yu wrote: > > > > + int lastExprLenght = 0; > > > > Did you mean to name the variable lastExprLenghth ? > > > > w.r.t. extracting to helper method, the second and third > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > > It is up to you whether to create the helper method. > > I am fine with the current formation. > > Dmitry, thoughts on this review? Oh, right. lastExprLenghth is obviously a typo, and as we agreed that the helper is not strictly necessary I wanted to wait a bit hoping for more feedback and eventually to post an accumulated patch. Doesn't make sense to post another version only to fix one typo :)
> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote: > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote: > > On 1/5/21 10:51 AM, Zhihong Yu wrote: > > > > > > + int lastExprLenght = 0; > > > > > > Did you mean to name the variable lastExprLenghth ? > > > > > > w.r.t. extracting to helper method, the second and third > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > > > It is up to you whether to create the helper method. > > > I am fine with the current formation. > > > > Dmitry, thoughts on this review? > > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that > the helper is not strictly necessary I wanted to wait a bit hoping for > more feedback and eventually to post an accumulated patch. Doesn't make > sense to post another version only to fix one typo :) Hi, I've prepared a new rebased version to deal with the new way of computing query id, but as always there is one tricky part. From what I understand, now an external module can provide custom implementation for query id computation algorithm. It seems natural to think this machinery could be used instead of patch in the thread, i.e. one could create a custom logic that will enable constants collapsing as needed, so that same queries with different number of constants in an array will be hashed into the same record. But there is a limitation in how such queries will be normalized afterwards — to reduce level of surprise it's necessary to display the fact that a certain query in fact had more constants that are showed in pgss record. Ideally LocationLen needs to carry some bits of information on what exactly could be skipped, and generate_normalized_query needs to understand that, both are not reachable for an external module with custom query id logic (without replicating significant part of the existing code). Hence, a new version of the patch.
Attachment
> On Tue, Jun 15, 2021 at 05:18:50PM +0200, Dmitry Dolgov wrote: > > On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote: > > > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote: > > > On 1/5/21 10:51 AM, Zhihong Yu wrote: > > > > > > > > + int lastExprLenght = 0; > > > > > > > > Did you mean to name the variable lastExprLenghth ? > > > > > > > > w.r.t. extracting to helper method, the second and third > > > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar. > > > > It is up to you whether to create the helper method. > > > > I am fine with the current formation. > > > > > > Dmitry, thoughts on this review? > > > > Oh, right. lastExprLenghth is obviously a typo, and as we agreed that > > the helper is not strictly necessary I wanted to wait a bit hoping for > > more feedback and eventually to post an accumulated patch. Doesn't make > > sense to post another version only to fix one typo :) > > Hi, > > I've prepared a new rebased version to deal with the new way of > computing query id, but as always there is one tricky part. From what I > understand, now an external module can provide custom implementation for > query id computation algorithm. It seems natural to think this machinery > could be used instead of patch in the thread, i.e. one could create a > custom logic that will enable constants collapsing as needed, so that > same queries with different number of constants in an array will be > hashed into the same record. > > But there is a limitation in how such queries will be normalized > afterwards — to reduce level of surprise it's necessary to display the > fact that a certain query in fact had more constants that are showed in > pgss record. Ideally LocationLen needs to carry some bits of information > on what exactly could be skipped, and generate_normalized_query needs to > understand that, both are not reachable for an external module with > custom query id logic (without replicating significant part of the > existing code). Hence, a new version of the patch. Forgot to mention a couple of people who already reviewed the patch.
Attachment
>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote: > > > I've prepared a new rebased version to deal with the new way of > > computing query id, but as always there is one tricky part. From what I > > understand, now an external module can provide custom implementation for > > query id computation algorithm. It seems natural to think this machinery > > could be used instead of patch in the thread, i.e. one could create a > > custom logic that will enable constants collapsing as needed, so that > > same queries with different number of constants in an array will be > > hashed into the same record. > > > > But there is a limitation in how such queries will be normalized > > afterwards — to reduce level of surprise it's necessary to display the > > fact that a certain query in fact had more constants that are showed in > > pgss record. Ideally LocationLen needs to carry some bits of information > > on what exactly could be skipped, and generate_normalized_query needs to > > understand that, both are not reachable for an external module with > > custom query id logic (without replicating significant part of the > > existing code). Hence, a new version of the patch. > > Forgot to mention a couple of people who already reviewed the patch. And now for something completely different, here is a new patch version. It contains a small fix for one problem we've found during testing (one path code was incorrectly assuming find_const_walker results).
Attachment
On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote:
>
> > I've prepared a new rebased version to deal with the new way of
> > computing query id, but as always there is one tricky part. From what I
> > understand, now an external module can provide custom implementation for
> > query id computation algorithm. It seems natural to think this machinery
> > could be used instead of patch in the thread, i.e. one could create a
> > custom logic that will enable constants collapsing as needed, so that
> > same queries with different number of constants in an array will be
> > hashed into the same record.
> >
> > But there is a limitation in how such queries will be normalized
> > afterwards — to reduce level of surprise it's necessary to display the
> > fact that a certain query in fact had more constants that are showed in
> > pgss record. Ideally LocationLen needs to carry some bits of information
> > on what exactly could be skipped, and generate_normalized_query needs to
> > understand that, both are not reachable for an external module with
> > custom query id logic (without replicating significant part of the
> > existing code). Hence, a new version of the patch.
>
> Forgot to mention a couple of people who already reviewed the patch.
And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).
Hi,
bq. and at position further that specified threshold.
that specified threshold -> than specified threshold
Cheers
> On Thu, Sep 30, 2021 at 08:03:16AM -0700, Zhihong Yu wrote: > On Thu, Sep 30, 2021 at 6:49 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > >On Wed, Jun 16, 2021 at 04:02:12PM +0200, Dmitry Dolgov wrote: > > > > > > > I've prepared a new rebased version to deal with the new way of > > > > computing query id, but as always there is one tricky part. From what I > > > > understand, now an external module can provide custom implementation > > for > > > > query id computation algorithm. It seems natural to think this > > machinery > > > > could be used instead of patch in the thread, i.e. one could create a > > > > custom logic that will enable constants collapsing as needed, so that > > > > same queries with different number of constants in an array will be > > > > hashed into the same record. > > > > > > > > But there is a limitation in how such queries will be normalized > > > > afterwards — to reduce level of surprise it's necessary to display the > > > > fact that a certain query in fact had more constants that are showed in > > > > pgss record. Ideally LocationLen needs to carry some bits of > > information > > > > on what exactly could be skipped, and generate_normalized_query needs > > to > > > > understand that, both are not reachable for an external module with > > > > custom query id logic (without replicating significant part of the > > > > existing code). Hence, a new version of the patch. > > > > > > Forgot to mention a couple of people who already reviewed the patch. > > > > And now for something completely different, here is a new patch version. > > It contains a small fix for one problem we've found during testing (one > > path code was incorrectly assuming find_const_walker results). > > > Hi, > > bq. and at position further that specified threshold. > > that specified threshold -> than specified threshold You mean in the patch commit message, nowhere else, right? Yep, my spell checker didn't catch that, thanks for noticing!
Dmitry Dolgov <9erthalion6@gmail.com> writes: > And now for something completely different, here is a new patch version. > It contains a small fix for one problem we've found during testing (one > path code was incorrectly assuming find_const_walker results). I've been saying from day one that pushing the query-hashing code into the core was a bad idea, and I think this patch perfectly illustrates why. We can debate whether the rules proposed here are good for pg_stat_statements or not, but it seems inevitable that they will be a disaster for some other consumers of the query hash. In particular, dropping external parameters from the hash seems certain to break something for somebody --- do you really think that a query with two int parameters is equivalent to one with five float parameters for all query-identifying purposes? I can see the merits of allowing different numbers of IN elements to be considered equivalent for pg_stat_statements, but this patch seems to go far beyond that basic idea, and I fear the side-effects will be very bad. Also, calling eval_const_expressions in the query jumbler is flat out unacceptable. There is way too much code that could be reached that way (more or less the entire executor, to start with). I don't have a lot of faith that it'd never modify the input tree, either. regards, tom lane
On 1/5/22 4:02 AM, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: >> And now for something completely different, here is a new patch version. >> It contains a small fix for one problem we've found during testing (one >> path code was incorrectly assuming find_const_walker results). > > I've been saying from day one that pushing the query-hashing code into the > core was a bad idea, and I think this patch perfectly illustrates why. > We can debate whether the rules proposed here are good for > pg_stat_statements or not, but it seems inevitable that they will be a > disaster for some other consumers of the query hash. In particular, > dropping external parameters from the hash seems certain to break > something for somebody +1. In a couple of extensions I use different logic of query jumbling - hash value is more stable in some cases than in default implementation. For example, it should be stable to permutations in 'FROM' section of a query. And If anyone subtly changes jumbling logic when the extension is active, the instance could get huge performance issues. Let me suggest, that the core should allow an extension at least to detect such interference between extensions. Maybe hook could be replaced with callback to allow extension see an queryid with underlying generation logic what it expects. -- regards, Andrey Lepikhov Postgres Professional
"Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> writes: > On 1/5/22 4:02 AM, Tom Lane wrote: >> I've been saying from day one that pushing the query-hashing code into the >> core was a bad idea, and I think this patch perfectly illustrates why. > +1. > Let me suggest, that the core should allow an extension at least to > detect such interference between extensions. Maybe hook could be > replaced with callback to allow extension see an queryid with underlying > generation logic what it expects. I feel like we need to get away from the idea that there is just one query hash, and somehow let different extensions attach differently-calculated hashes to a query. I don't have any immediate ideas about how to do that in a reasonably inexpensive way. regards, tom lane
> On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote: > We can debate whether the rules proposed here are good for > pg_stat_statements or not, but it seems inevitable that they will be a > disaster for some other consumers of the query hash. Hm, which consumers do you mean here, potential extension? Isn't the ability to use an external module to compute queryid make this situation possible anyway? > do you really think that a query with two int > parameters is equivalent to one with five float parameters for all > query-identifying purposes? Nope, and it will be hard to figure this out no matter which approach we're talking about, because it mostly depends on the context and type of queries I guess. Instead, such functionality should allow some reasonable configuration. To be clear, the use case I have in mind here is not four or five, but rather a couple of hundreds constants where chances that the whole construction was generated automatically by ORM is higher than normal. > I can see the merits of allowing different numbers of IN elements > to be considered equivalent for pg_stat_statements, but this patch > seems to go far beyond that basic idea, and I fear the side-effects > will be very bad. Not sure why it goes far beyond, but then there were two approaches under consideration, as I've stated in the first message. I already don't remember all the details, but another one was evolving around doing similar things in a more limited fashion in transformAExprIn. The problem would be then to carry the information, necessary to represent the act of "merging" some number of queryids together. Any thoughts here? The idea of keeping the original queryid untouched and add another type of id instead sounds interesting, but it will add too much overhead for a quite small use case I guess.
> On Wed, Jan 05, 2022 at 10:11:11PM +0100, Dmitry Dolgov wrote: > > On Tue, Jan 04, 2022 at 06:02:43PM -0500, Tom Lane wrote: > > We can debate whether the rules proposed here are good for > > pg_stat_statements or not, but it seems inevitable that they will be a > > disaster for some other consumers of the query hash. > > Hm, which consumers do you mean here, potential extension? Isn't the > ability to use an external module to compute queryid make this situation > possible anyway? > > > do you really think that a query with two int > > parameters is equivalent to one with five float parameters for all > > query-identifying purposes? > > Nope, and it will be hard to figure this out no matter which approach > we're talking about, because it mostly depends on the context and type > of queries I guess. Instead, such functionality should allow some > reasonable configuration. To be clear, the use case I have in mind here > is not four or five, but rather a couple of hundreds constants where > chances that the whole construction was generated automatically by ORM > is higher than normal. > > > I can see the merits of allowing different numbers of IN elements > > to be considered equivalent for pg_stat_statements, but this patch > > seems to go far beyond that basic idea, and I fear the side-effects > > will be very bad. > > Not sure why it goes far beyond, but then there were two approaches > under consideration, as I've stated in the first message. I already > don't remember all the details, but another one was evolving around > doing similar things in a more limited fashion in transformAExprIn. The > problem would be then to carry the information, necessary to represent > the act of "merging" some number of queryids together. Any thoughts > here? > > The idea of keeping the original queryid untouched and add another type > of id instead sounds interesting, but it will add too much overhead for > a quite small use case I guess. ``` Thu, 10 Mar 2022 New status: Waiting on Author ``` This seems incorrect, as the only feedback I've got was "this is a bad idea", and no reaction on follow-up questions.
Dmitry Dolgov <9erthalion6@gmail.com> writes: > New status: Waiting on Author > This seems incorrect, as the only feedback I've got was "this is a bad > idea", and no reaction on follow-up questions. I changed the status because it seems to me there is no chance of this being committed as-is. 1. I think an absolute prerequisite before we could even consider changing the query jumbler rules this much is to do the work that was put off when the jumbler was moved into core: that is, provide some honest support for multiple query-ID generation methods being used at the same time. Even if you successfully make a case for pg_stat_statements to act this way, other consumers of query IDs aren't going to be happy with it. 2. You haven't made a case for it. The original complaint was about different lengths of IN lists not being treated as equivalent, but this patch has decided to do I'm-not-even-sure-quite-what about treating different Params as equivalent. Plus you're trying to invoke eval_const_expressions in the jumbler; that is absolutely Not OK, for both safety and semantic reasons. If you backed off to just treating ArrayExprs containing different numbers of Consts as equivalent, maybe that'd be something we could adopt without fixing point 1. I don't think anything that fuzzes the treatment of Params can get away with that, though. regards, tom lane
On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > This seems incorrect, as the only feedback I've got was "this is a bad > > idea", and no reaction on follow-up questions. > > I changed the status because it seems to me there is no chance of > this being committed as-is. > > 1. I think an absolute prerequisite before we could even consider > changing the query jumbler rules this much is to do the work that was > put off when the jumbler was moved into core: that is, provide some > honest support for multiple query-ID generation methods being used at > the same time. Even if you successfully make a case for > pg_stat_statements to act this way, other consumers of query IDs > aren't going to be happy with it. FWIW, I don't find this convincing at all. Query jumbling is already somewhat expensive, and it seems unlikely that the same person is going to want to jumble queries in one way for pg_stat_statements and another way for pg_stat_broccoli or whatever their other extension is. Putting a lot of engineering work into something with such a marginal use case seems not worthwhile to me - and also likely futile, because I don't see how it could realistically be made nearly as cheap as a single jumble. > 2. You haven't made a case for it. The original complaint was > about different lengths of IN lists not being treated as equivalent, > but this patch has decided to do I'm-not-even-sure-quite-what > about treating different Params as equivalent. Plus you're trying > to invoke eval_const_expressions in the jumbler; that is absolutely > Not OK, for both safety and semantic reasons. I think there are two separate points here, one about patch quality and the other about whether the basic idea is good. I think the basic idea is good. I do not contend that collapsing IN-lists of arbitrary length is what everyone wants in all cases, but it seems entirely reasonable to me to think that it is what some people want. So I would say just make it a parameter and let people configure whichever behavior they want. My bet is 95% of users would prefer to have it on, but even if that's wildly wrong, having it as an optional behavior hurts nobody. Let it be off by default and let those who want it flip the toggle. On the code quality issue, I haven't read the patch but your concerns sound well-founded to me from reading what you wrote. -- Robert Haas EDB: http://www.enterprisedb.com
> On Thu, Mar 10, 2022 at 12:32:08PM -0500, Robert Haas wrote: > On Thu, Mar 10, 2022 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > 2. You haven't made a case for it. The original complaint was > > about different lengths of IN lists not being treated as equivalent, > > but this patch has decided to do I'm-not-even-sure-quite-what > > about treating different Params as equivalent. Plus you're trying > > to invoke eval_const_expressions in the jumbler; that is absolutely > > Not OK, for both safety and semantic reasons. > > I think there are two separate points here, one about patch quality > and the other about whether the basic idea is good. I think the basic > idea is good. I do not contend that collapsing IN-lists of arbitrary > length is what everyone wants in all cases, but it seems entirely > reasonable to me to think that it is what some people want. So I would > say just make it a parameter and let people configure whichever > behavior they want. My bet is 95% of users would prefer to have it on, > but even if that's wildly wrong, having it as an optional behavior > hurts nobody. Let it be off by default and let those who want it flip > the toggle. On the code quality issue, I haven't read the patch but > your concerns sound well-founded to me from reading what you wrote. I have the same understanding, there is a toggle in the patch exactly for this purpose. To give a bit more context, the whole development was ORM-driven rather than pulled out of thin air -- people were complaining about huge generated queries that could be barely displayed in monitoring, I was trying to address it via collapsing the list where it was happening. In other words "I'm-not-even-sure-quite-what" part may be indeed too extensive, but was triggered by real world issues. Of course, I could get the implementation not quite right, e.g. I wasn't aware about dangers of using eval_const_expressions. But that's what the CF item and the corresponding discussion is for, I guess. Let me see what I could do to improve it.
> On Thu, Mar 10, 2022 at 12:11:59PM -0500, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > New status: Waiting on Author > > > This seems incorrect, as the only feedback I've got was "this is a bad > > idea", and no reaction on follow-up questions. > > I changed the status because it seems to me there is no chance of > this being committed as-is. > > 1. I think an absolute prerequisite before we could even consider > changing the query jumbler rules this much is to do the work that was > put off when the jumbler was moved into core: that is, provide some > honest support for multiple query-ID generation methods being used at > the same time. Even if you successfully make a case for > pg_stat_statements to act this way, other consumers of query IDs > aren't going to be happy with it. > > 2. You haven't made a case for it. The original complaint was > about different lengths of IN lists not being treated as equivalent, > but this patch has decided to do I'm-not-even-sure-quite-what > about treating different Params as equivalent. Plus you're trying > to invoke eval_const_expressions in the jumbler; that is absolutely > Not OK, for both safety and semantic reasons. > > If you backed off to just treating ArrayExprs containing different > numbers of Consts as equivalent, maybe that'd be something we could > adopt without fixing point 1. I don't think anything that fuzzes the > treatment of Params can get away with that, though. Here is the limited version of list collapsing functionality, which doesn't utilize eval_const_expressions and ignores most of the stuff except ArrayExprs. Any thoughts/more suggestions?
Attachment
On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Here is the limited version of list collapsing functionality, which > doesn't utilize eval_const_expressions and ignores most of the stuff > except ArrayExprs. Any thoughts/more suggestions? The proposed commit message says this commit intends to "Make Consts contribute nothing to the jumble hash if they're part of a series and at position further that specified threshold." I'm not sure whether that's what the patch actually implements because I can't immediately understand the new logic you've added, but I think if we did what that sentence said then, supposing the threshold is set to 1, it would result in producing the same hash for "x in (1,2)" that we do for "x in (1,3)" but a different hash for "x in (2,3)" which does not sound like what we want. What I would have thought we'd do is: if the list is all constants and long enough to satisfy the threshold then nothing in the list gets jumbled. I'm a little surprised that there's not more context-awareness in this code. It seems that it applies to every ArrayExpr found in the query, which I think would extend to cases beyond something = IN(whatever). In particular, any use of ARRAY[] in the query would be impacted. Now, the comments seem to imply that's pretty intentional, but from the user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two different things. If anything like this is to be adopted, we certainly need to be precise about exactly what it is doing and which cases are covered. I thought of looking at the documentation to see whether you'd tried to clarify this there, and found that you hadn't written any. In short, I think this patch is not really very close to being in committable shape even if nobody were objecting to the concept. -- Robert Haas EDB: http://www.enterprisedb.com
> On Mon, Mar 14, 2022 at 10:17:57AM -0400, Robert Haas wrote: > On Sat, Mar 12, 2022 at 9:11 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Here is the limited version of list collapsing functionality, which > > doesn't utilize eval_const_expressions and ignores most of the stuff > > except ArrayExprs. Any thoughts/more suggestions? > > The proposed commit message says this commit intends to "Make Consts > contribute nothing to the jumble hash if they're part of a series and > at position further that specified threshold." I'm not sure whether > that's what the patch actually implements because I can't immediately > understand the new logic you've added, but I think if we did what that > sentence said then, supposing the threshold is set to 1, it would > result in producing the same hash for "x in (1,2)" that we do for "x > in (1,3)" but a different hash for "x in (2,3)" which does not sound > like what we want. What I would have thought we'd do is: if the list > is all constants and long enough to satisfy the threshold then nothing > in the list gets jumbled. Well, yeah, the commit message is somewhat clumsy in this regard. It works almost in the way you've described, except if the list is all constants and long enough to satisfy the threshold then *first N elements (where N == threshold) will be jumbled -- to leave at least some traces of it in pgss. > I'm a little surprised that there's not more context-awareness in this > code. It seems that it applies to every ArrayExpr found in the query, > which I think would extend to cases beyond something = IN(whatever). > In particular, any use of ARRAY[] in the query would be impacted. Now, > the comments seem to imply that's pretty intentional, but from the > user's point of view, WHERE x in (1,3) and x = any(array[1,3]) are two > different things. If anything like this is to be adopted, we certainly > need to be precise about exactly what it is doing and which cases are > covered. I'm not sure if I follow the last point. WHERE x in (1,3) and x = any(array[1,3]) are two different things for sure, but in which way are they going to be mixed together because of this change? My goal was to make only the following transformation, without leaving any uncertainty: WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...) WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...]) > I thought of looking at the documentation to see whether you'd tried > to clarify this there, and found that you hadn't written any. > > In short, I think this patch is not really very close to being in > committable shape even if nobody were objecting to the concept. Sure, I'll add documentation. To be honest I'm not targeting PG15 with this, just want to make some progress. Thanks for the feedback, I'm glad to see it coming!
On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Well, yeah, the commit message is somewhat clumsy in this regard. It > works almost in the way you've described, except if the list is all > constants and long enough to satisfy the threshold then *first N > elements (where N == threshold) will be jumbled -- to leave at least > some traces of it in pgss. But that seems to me to be a thing we would not want. Why do you think otherwise? > I'm not sure if I follow the last point. WHERE x in (1,3) and x = > any(array[1,3]) are two different things for sure, but in which way are > they going to be mixed together because of this change? My goal was to > make only the following transformation, without leaving any uncertainty: > > WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...) > WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...]) I understand. I think it might be OK to transform both of those things, but I don't think it's very clear either from the comments or the nonexistent documentation that both of those cases are affected -- and I think that needs to be clear. Not sure exactly how to do that, just saying that we can't add behavior unless it will be clear to users what the behavior is. > Sure, I'll add documentation. To be honest I'm not targeting PG15 with > this, just want to make some progress. wfm! -- Robert Haas EDB: http://www.enterprisedb.com
> On Mon, Mar 14, 2022 at 11:02:16AM -0400, Robert Haas wrote: > On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Well, yeah, the commit message is somewhat clumsy in this regard. It > > works almost in the way you've described, except if the list is all > > constants and long enough to satisfy the threshold then *first N > > elements (where N == threshold) will be jumbled -- to leave at least > > some traces of it in pgss. > > But that seems to me to be a thing we would not want. Why do you think > otherwise? Hm. Well, if the whole list would be not jumbled, the transformation would look like this, right? WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (...) Leaving some number of original elements in place gives some clue for the reader about at least what type of data the array has contained. Which hopefully makes it a bit easier to identify even in the collapsed form: WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...) > > I'm not sure if I follow the last point. WHERE x in (1,3) and x = > > any(array[1,3]) are two different things for sure, but in which way are > > they going to be mixed together because of this change? My goal was to > > make only the following transformation, without leaving any uncertainty: > > > > WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...) > > WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...]) > > I understand. I think it might be OK to transform both of those > things, but I don't think it's very clear either from the comments or > the nonexistent documentation that both of those cases are affected -- > and I think that needs to be clear. Not sure exactly how to do that, > just saying that we can't add behavior unless it will be clear to > users what the behavior is. Yep, got it.
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 14, 2022 at 10:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> I'm not sure if I follow the last point. WHERE x in (1,3) and x = >> any(array[1,3]) are two different things for sure, but in which way are >> they going to be mixed together because of this change? My goal was to >> make only the following transformation, without leaving any uncertainty: >> >> WHERE x in (1, 2, 3, 4, 5) -> WHERE x in (1, 2, ...) >> WHERE x = any(array[1, 2, 3, 4, 5]) -> WHERE x = any(array[1, 2, ...]) > I understand. I think it might be OK to transform both of those > things, but I don't think it's very clear either from the comments or > the nonexistent documentation that both of those cases are affected -- > and I think that needs to be clear. We've transformed IN(...) to ANY(ARRAY[...]) at the parser stage for a long time, and this has been visible to users of either EXPLAIN or pg_stat_statements for the same length of time. I doubt people are going to find that surprising. Even if they do, it's not the query jumbler's fault. I do find it odd that the proposed patch doesn't cause the *entire* list to be skipped over. That seems like extra complexity and confusion to no benefit. regards, tom lane
> On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I do find it odd that the proposed patch doesn't cause the *entire* > list to be skipped over. That seems like extra complexity and confusion > to no benefit. That's a bit surprising for me, I haven't even thought that folks could think this is an odd behaviour. As I've mentioned above, the original idea was to give some clues about what was inside the collapsed array, but if everyone finds it unnecessary I can of course change it.
Dmitry Dolgov <9erthalion6@gmail.com> writes: > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote: >> I do find it odd that the proposed patch doesn't cause the *entire* >> list to be skipped over. That seems like extra complexity and confusion >> to no benefit. > That's a bit surprising for me, I haven't even thought that folks could > think this is an odd behaviour. As I've mentioned above, the original > idea was to give some clues about what was inside the collapsed array, > but if everyone finds it unnecessary I can of course change it. But if what we're doing is skipping over an all-Consts list, then the individual Consts would be elided from the pg_stat_statements entry anyway, no? All that would remain is information about how many such Consts there were, which is exactly the information you want to drop. regards, tom lane
> On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote: > >> I do find it odd that the proposed patch doesn't cause the *entire* > >> list to be skipped over. That seems like extra complexity and confusion > >> to no benefit. > > > That's a bit surprising for me, I haven't even thought that folks could > > think this is an odd behaviour. As I've mentioned above, the original > > idea was to give some clues about what was inside the collapsed array, > > but if everyone finds it unnecessary I can of course change it. > > But if what we're doing is skipping over an all-Consts list, then the > individual Consts would be elided from the pg_stat_statements entry > anyway, no? All that would remain is information about how many such > Consts there were, which is exactly the information you want to drop. Hm, yes, you're right. I guess I was thinking about this more like about shortening some text with ellipsis, but indeed no actual Consts will end up in the result anyway. Thanks for clarification, will modify the patch!
> On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote: > > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote: > > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote: > > >> I do find it odd that the proposed patch doesn't cause the *entire* > > >> list to be skipped over. That seems like extra complexity and confusion > > >> to no benefit. > > > > > That's a bit surprising for me, I haven't even thought that folks could > > > think this is an odd behaviour. As I've mentioned above, the original > > > idea was to give some clues about what was inside the collapsed array, > > > but if everyone finds it unnecessary I can of course change it. > > > > But if what we're doing is skipping over an all-Consts list, then the > > individual Consts would be elided from the pg_stat_statements entry > > anyway, no? All that would remain is information about how many such > > Consts there were, which is exactly the information you want to drop. > > Hm, yes, you're right. I guess I was thinking about this more like about > shortening some text with ellipsis, but indeed no actual Consts will end > up in the result anyway. Thanks for clarification, will modify the > patch! Here is another iteration. Now the patch doesn't leave any trailing Consts in the normalized query, and contains more documentation. I hope it's getting better.
Attachment
> On Sat, Mar 26, 2022 at 06:40:35PM +0100, Dmitry Dolgov wrote: > > On Mon, Mar 14, 2022 at 04:51:50PM +0100, Dmitry Dolgov wrote: > > > On Mon, Mar 14, 2022 at 11:38:23AM -0400, Tom Lane wrote: > > > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > > > On Mon, Mar 14, 2022 at 11:23:17AM -0400, Tom Lane wrote: > > > >> I do find it odd that the proposed patch doesn't cause the *entire* > > > >> list to be skipped over. That seems like extra complexity and confusion > > > >> to no benefit. > > > > > > > That's a bit surprising for me, I haven't even thought that folks could > > > > think this is an odd behaviour. As I've mentioned above, the original > > > > idea was to give some clues about what was inside the collapsed array, > > > > but if everyone finds it unnecessary I can of course change it. > > > > > > But if what we're doing is skipping over an all-Consts list, then the > > > individual Consts would be elided from the pg_stat_statements entry > > > anyway, no? All that would remain is information about how many such > > > Consts there were, which is exactly the information you want to drop. > > > > Hm, yes, you're right. I guess I was thinking about this more like about > > shortening some text with ellipsis, but indeed no actual Consts will end > > up in the result anyway. Thanks for clarification, will modify the > > patch! > > Here is another iteration. Now the patch doesn't leave any trailing > Consts in the normalized query, and contains more documentation. I hope > it's getting better. Hi, Here is the rebased version, with no other changes.
Attachment
Hello! Unfortunately the patch needs another rebase due to the recent split of guc.c (0a20ff54f5e66158930d5328f89f087d4e9ab400) I'm reviewing a patch on top of a previous commit and noticed a failed test: # Failed test 'no parameters missing from postgresql.conf.sample' # at t/003_check_guc.pl line 82. # got: '1' # expected: '0' # Looks like you failed 1 test of 3. t/003_check_guc.pl .............. The new option has not been added to the postgresql.conf.sample PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just becausesome ORM sends requests with a different number of parameters. regards, Sergei
> On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote: > Hello! > > Unfortunately the patch needs another rebase due to the recent split of guc.c (0a20ff54f5e66158930d5328f89f087d4e9ab400) > > I'm reviewing a patch on top of a previous commit and noticed a failed test: > > # Failed test 'no parameters missing from postgresql.conf.sample' > # at t/003_check_guc.pl line 82. > # got: '1' > # expected: '0' > # Looks like you failed 1 test of 3. > t/003_check_guc.pl .............. > > The new option has not been added to the postgresql.conf.sample > > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries just becausesome ORM sends requests with a different number of parameters. Thanks! I'll post the rebased version soon.
> On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote: > > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote: > > Hello! > > > > Unfortunately the patch needs another rebase due to the recent split of guc.c (0a20ff54f5e66158930d5328f89f087d4e9ab400) > > > > I'm reviewing a patch on top of a previous commit and noticed a failed test: > > > > # Failed test 'no parameters missing from postgresql.conf.sample' > > # at t/003_check_guc.pl line 82. > > # got: '1' > > # expected: '0' > > # Looks like you failed 1 test of 3. > > t/003_check_guc.pl .............. > > > > The new option has not been added to the postgresql.conf.sample > > > > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries justbecause some ORM sends requests with a different number of parameters. > > Thanks! I'll post the rebased version soon. And here it is.
Attachment
On Sun, 25 Sept 2022 at 05:29, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Sat, Sep 24, 2022 at 04:07:14PM +0200, Dmitry Dolgov wrote: > > > On Fri, Sep 16, 2022 at 09:25:13PM +0300, Sergei Kornilov wrote: > > > Hello! > > > > > > Unfortunately the patch needs another rebase due to the recent split of guc.c (0a20ff54f5e66158930d5328f89f087d4e9ab400) > > > > > > I'm reviewing a patch on top of a previous commit and noticed a failed test: > > > > > > # Failed test 'no parameters missing from postgresql.conf.sample' > > > # at t/003_check_guc.pl line 82. > > > # got: '1' > > > # expected: '0' > > > # Looks like you failed 1 test of 3. > > > t/003_check_guc.pl .............. > > > > > > The new option has not been added to the postgresql.conf.sample > > > > > > PS: I would also like to have such a feature. It's hard to increase pg_stat_statements.max or lose some entries justbecause some ORM sends requests with a different number of parameters. > > > > Thanks! I'll post the rebased version soon. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID 456fa635a909ee36f73ca84d340521bd730f265f === === applying patch ./v9-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch .... can't find file to patch at input line 746 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |diff --git a/src/backend/utils/misc/queryjumble.c b/src/backend/utils/misc/queryjumble.c |index a67487e5fe..063b4be725 100644 |--- a/src/backend/utils/misc/queryjumble.c |+++ b/src/backend/utils/misc/queryjumble.c -------------------------- No file to patch. Skipping patch. 8 out of 8 hunks ignored can't find file to patch at input line 913 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |diff --git a/src/include/utils/queryjumble.h b/src/include/utils/queryjumble.h |index 3c2d9beab2..b50cc42d4e 100644 |--- a/src/include/utils/queryjumble.h |+++ b/src/include/utils/queryjumble.h -------------------------- No file to patch. Skipping patch. [1] - http://cfbot.cputube.org/patch_41_2837.log Regards, Vignesh
> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote: > The patch does not apply on top of HEAD as in [1], please post a rebased patch: Thanks. I think this one should do the trick.
Attachment
Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com> escreveu:
> On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote:
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:
Thanks. I think this one should do the trick.
There is a typo on DOC part
+ and it's length is larger than <varname> const_merge_threshold </varname>,
+ then array elements will contribure nothing to the query identifier.
+ Thus the query will get the same identifier no matter how many constants
+ then array elements will contribure nothing to the query identifier.
+ Thus the query will get the same identifier no matter how many constants
That "contribure" should be "contribute"
regards
Marcos
> On Sun, Jan 29, 2023 at 09:56:02AM -0300, Marcos Pegoraro wrote: > Em dom., 29 de jan. de 2023 às 09:24, Dmitry Dolgov <9erthalion6@gmail.com> > escreveu: > > > > On Fri, Jan 27, 2023 at 08:15:29PM +0530, vignesh C wrote: > > > The patch does not apply on top of HEAD as in [1], please post a rebased > > patch: > > > > Thanks. I think this one should do the trick. > > > > There is a typo on DOC part > + and it's length is larger than <varname> const_merge_threshold > </varname>, > + then array elements will contribure nothing to the query > identifier. > + Thus the query will get the same identifier no matter how many > constants > > That "contribure" should be "contribute" Indeed, thanks for noticing.
Attachment
This appears to have massive conflicts. Would you please rebase? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "¿Cómo puedes confiar en algo que pagas y que no ves, y no confiar en algo que te dan y te lo muestran?" (Germán Poo)
> On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote: > This appears to have massive conflicts. Would you please rebase? Sure, I was already mentally preparing myself to do so in the view of recent changes in query jumbling. Will post soon.
> On Thu, Feb 02, 2023 at 04:05:54PM +0100, Dmitry Dolgov wrote: > > On Thu, Feb 02, 2023 at 03:07:27PM +0100, Alvaro Herrera wrote: > > This appears to have massive conflicts. Would you please rebase? > > Sure, I was already mentally preparing myself to do so in the view of > recent changes in query jumbling. Will post soon. Here is the rebased version. To adapt to the latest changes, I've marked ArrayExpr with custom_query_jumble to implement this functionality, but tried to make the actual merge logic relatively independent. Otherwise, everything is the same.
Attachment
On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote: > Here is the rebased version. To adapt to the latest changes, I've marked > ArrayExpr with custom_query_jumble to implement this functionality, but > tried to make the actual merge logic relatively independent. Otherwise, > everything is the same. I was quickly looking at this patch, so these are rough impressions. + bool merged; /* whether or not the location was marked as + not contributing to jumble */ This part of the patch is a bit disturbing.. We have node attributes to track if portions of a node should be ignored or have a location marked, still this "merged" flag is used as an extension to track if a location should be done or not. Is that a concept that had better be controlled via a new node attribute? +-- +-- Consts merging +-- +CREATE TABLE test_merge (id int, data int); +-- IN queries +-- No merging Would it be better to split this set of tests into a new file? FWIW, I have a patch in baking process that refactors a bit the whole, before being able to extend it so as we have more coverage for normalized utility queries, as of now the query strings stored by pg_stat_statements don't reflect that even if the jumbling computation marks the location of the Const nodes included in utility statements (partition bounds, queries of COPY, etc.). I should be able to send that tomorrow, and my guess that you could take advantage of that even for this thread. -- Michael
Attachment
> On Sun, Feb 05, 2023 at 10:30:25AM +0900, Michael Paquier wrote: > On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote: > > Here is the rebased version. To adapt to the latest changes, I've marked > > ArrayExpr with custom_query_jumble to implement this functionality, but > > tried to make the actual merge logic relatively independent. Otherwise, > > everything is the same. > > I was quickly looking at this patch, so these are rough impressions. > > + bool merged; /* whether or not the location was marked as > + not contributing to jumble */ > > This part of the patch is a bit disturbing.. We have node attributes > to track if portions of a node should be ignored or have a location > marked, still this "merged" flag is used as an extension to track if a > location should be done or not. Is that a concept that had better be > controlled via a new node attribute? Good question. I need to think a bit more if it's possible to leverage node attributes mechanism, but at the moment I'm still inclined to extend LocationLen. The reason is that it doesn't exactly serve the tracking purpose, i.e. whether to capture a location (I have to update the code commentary), it helps differentiate cases when locations A and D are obtained from merging A B C D instead of just being A and D. I'm thinking about this in the following way: the core jumbling logic is responsible for deriving locations based on the input expressions; in the case of merging we produce less locations; pgss have to represent the result only using locations and has to be able to differentiate simple locations and locations after merging. > +-- > +-- Consts merging > +-- > +CREATE TABLE test_merge (id int, data int); > +-- IN queries > +-- No merging > > Would it be better to split this set of tests into a new file? FWIW, > I have a patch in baking process that refactors a bit the whole, > before being able to extend it so as we have more coverage for > normalized utility queries, as of now the query strings stored by > pg_stat_statements don't reflect that even if the jumbling computation > marks the location of the Const nodes included in utility statements > (partition bounds, queries of COPY, etc.). I should be able to send > that tomorrow, and my guess that you could take advantage of that > even for this thread. Sure, I'll take a look how I can benefit from your work, thanks.
Dmitry Dolgov <9erthalion6@gmail.com> writes: > I'm thinking about this in the following way: the core jumbling logic is > responsible for deriving locations based on the input expressions; in > the case of merging we produce less locations; pgss have to represent > the result only using locations and has to be able to differentiate > simple locations and locations after merging. Uh ... why? ISTM you're just going to elide all inside the IN, so why do you need more than a start and stop position? regards, tom lane
> On Sun, Feb 05, 2023 at 11:02:32AM -0500, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > I'm thinking about this in the following way: the core jumbling logic is > > responsible for deriving locations based on the input expressions; in > > the case of merging we produce less locations; pgss have to represent > > the result only using locations and has to be able to differentiate > > simple locations and locations after merging. > > Uh ... why? ISTM you're just going to elide all inside the IN, > so why do you need more than a start and stop position? Exactly, start and stop positions. But if there would be no information that merging was applied, the following queries will look the same after jumbling, right? -- input query SELECT * FROM test_merge WHERE id IN (1, 2); -- jumbling result, two LocationLen, for values 1 and 2 SELECT * FROM test_merge WHERE id IN ($1, $2); -- input query SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); -- jumbling result, two LocationLen after merging, for values 1 and 10 SELECT * FROM test_merge WHERE id IN (...); -- without remembering about merging the result would be SELECT * FROM test_merge WHERE id IN ($1, $2);
Hello! Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc) It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold= 2 select pg_stat_statements_reset(); set const_merge_threshold to 1; select * from test where i in (1,2,3); select * from test where i in (1,2); select * from test where i in (1); select query, calls from pg_stat_statements order by query; query | calls -------------------------------------+------- select * from test where i in (...) | 2 select * from test where i in ($1) | 1 Probably const_merge_threshold = 1 should produce only "i in (...)"? const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not equals)?I think the documentation is ambiguous in this regard. I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number regards, Sergei
On 07.02.23 21:14, Sergei Kornilov wrote: > It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold= 2 What is the point of making this a numeric setting? Either you want to merge all values or you don't want to merge any values.
> On Tue, Feb 07, 2023 at 11:14:52PM +0300, Sergei Kornilov wrote: > Hello! Thanks for reviewing. > Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc) Yep, my favourite game, rebaseball. Will post a new version soon, after figuring out all the recent questions. > It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold= 2 > > select pg_stat_statements_reset(); > set const_merge_threshold to 1; > select * from test where i in (1,2,3); > select * from test where i in (1,2); > select * from test where i in (1); > select query, calls from pg_stat_statements order by query; > > query | calls > -------------------------------------+------- > select * from test where i in (...) | 2 > select * from test where i in ($1) | 1 > > Probably const_merge_threshold = 1 should produce only "i in (...)"? Well, it's not intentional, probably I need to be more careful with off-by-one. Although I agree to a certain extent with Peter questioning the value of having numerical option here, let me think about this. > const_merge_threshold is "the minimal length of an array" (more or equal) or "array .. length is larger than" (not equals)?I think the documentation is ambiguous in this regard. > > I also noticed a typo in guc_tables.c: "Sets the minimal numer of constants in an array" -> number Yep, I'll rephrase the documentation.
> On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > On 07.02.23 21:14, Sergei Kornilov wrote: > > It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold= 2 > > What is the point of making this a numeric setting? Either you want to > merge all values or you don't want to merge any values. At least in theory the definition of "too many constants" is different for different use cases and I see allowing to configure it as a way of reducing the level of surprise here. The main scenario for a numerical setting would be to distinguish between normal usage with just a handful of constants (and the user expecting to see them represented in pgss) and some sort of outliers with thousands of constants in a query (e.g. as a defence mechanism for the infrastructure working with those metrics). But I agree that it's not clear how much value is in that. Not having strong opinion about this I would be fine changing it to a boolean option (with an actual limit hidden internally) if everyone agrees it fits better.
On 2023-Feb-09, Dmitry Dolgov wrote: > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > > What is the point of making this a numeric setting? Either you want > > to merge all values or you don't want to merge any values. > > At least in theory the definition of "too many constants" is different > for different use cases and I see allowing to configure it as a way of > reducing the level of surprise here. I was thinking about this a few days ago and I agree that we don't necessarily want to make it just a boolean thing; we may want to make it more complex. One trivial idea is to make it group entries in powers of 10: for 0-9 elements, you get one entry, and 10-99 you get a different one, and so on: # group everything in a single bucket const_merge_threshold = true / yes / on # group 0-9, 10-99, 100-999, 1000-9999 const_merge_treshold = powers Ideally the value would be represented somehow in the query text. For example query | calls ----------------------------------------------------------+------- select * from test where i in ({... 0-9 entries ...}) | 2 select * from test where i in ({... 10-99 entries ...}) | 1 What do you think? The jumble would have to know how to reduce all values within each power-of-ten group to one specific value, but I don't think that should be particularly difficult. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
> On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote: > On 2023-Feb-09, Dmitry Dolgov wrote: > > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > > > > What is the point of making this a numeric setting? Either you want > > > to merge all values or you don't want to merge any values. > > > > At least in theory the definition of "too many constants" is different > > for different use cases and I see allowing to configure it as a way of > > reducing the level of surprise here. > > I was thinking about this a few days ago and I agree that we don't > necessarily want to make it just a boolean thing; we may want to make it > more complex. One trivial idea is to make it group entries in powers of > 10: for 0-9 elements, you get one entry, and 10-99 you get a different > one, and so on: > > # group everything in a single bucket > const_merge_threshold = true / yes / on > > # group 0-9, 10-99, 100-999, 1000-9999 > const_merge_treshold = powers > > Ideally the value would be represented somehow in the query text. For > example > > query | calls > ----------------------------------------------------------+------- > select * from test where i in ({... 0-9 entries ...}) | 2 > select * from test where i in ({... 10-99 entries ...}) | 1 > > What do you think? The jumble would have to know how to reduce all > values within each power-of-ten group to one specific value, but I don't > think that should be particularly difficult. Yeah, it sounds appealing and conveniently addresses the question of losing the information about how many constants originally were there. Not sure if the example above would be the most natural way to represent it in the query text, but otherwise I'm going to try implementing this. Stay tuned.
Hi, On 2/9/23 16:02, Dmitry Dolgov wrote: >> Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc) I reviewed the last patch applied to some commit from Feb. 4th. >> It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold= 2 >> >> select pg_stat_statements_reset(); >> set const_merge_threshold to 1; >> select * from test where i in (1,2,3); >> select * from test where i in (1,2); >> select * from test where i in (1); >> select query, calls from pg_stat_statements order by query; >> >> query | calls >> -------------------------------------+------- >> select * from test where i in (...) | 2 >> select * from test where i in ($1) | 1 >> >> Probably const_merge_threshold = 1 should produce only "i in (...)"? > Well, it's not intentional, probably I need to be more careful with > off-by-one. Although I agree to a certain extent with Peter questioning Please add tests for all the corner cases. At least for (1) IN only contains a single element and (2) const_merge_threshold = 1. Beyond that: - There's a comment about find_const_walker(). I cannot find that function anywhere. What am I missing? - What about renaming IsConstList() to IsMergableConstList(). - Don't you intend to use the NUMERIC data column in SELECT * FROM test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise, the test is identical to previous test cases and you're not checking for what happens with NUMERICs which are wrapped in FuncExpr because of the implicit coercion. - Don't we want to extend IsConstList() to allow a list of all implicitly coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't work. - Typo in /* The firsts merged constant */ (first not firsts) - Prepared statements are not supported as they contain INs with Param instead of Const nodes. While less likely, I've seen applications that use prepared statements in conjunction with queries generated through a UI which ended up with tons of prepared queries with different number of elements in the IN clause. Not necessarily something that must go into this patch but maybe worth thinking about. - The setting name const_merge_threshold is not very telling without knowing the context. While being a little longer, what about jumble_const_merge_threshold or queryid_const_merge_threshold or similar? - Why do we actually only want to merge constants? Why don't we ignore the type of element in the IN and merge whatever is there? Is this because the original jumbling logic as of now only has support for constants? - Ideally we would even remove duplicates. That would even improve cardinality estimation but I guess we don't want to spend the cycles on doing so in the planner? - Why did you change palloc() to palloc0() for clocations array? The length is initialized to 0 and FWICS RecordConstLocation() initializes all members. Seems to me like we don't have to spend these cycles. - Can't the logic at the end of IsConstList() not be simplified to: foreach(temp, elements) if (!IsA(lfirst(temp), Const)) return false; // All elements are of type Const *firstConst = linitial_node(Const, elements); *lastConst = llast_node(Const, elements); return true; -- David Geier (ServiceNow)
> On Sat, Feb 11, 2023 at 11:03:36AM +0100, David Geier wrote: > Hi, > > On 2/9/23 16:02, Dmitry Dolgov wrote: > > > Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc) > I reviewed the last patch applied to some commit from Feb. 4th. Thanks for looking. Few quick answers about high-level questions below, the rest I'll incorporate in the new version. > - There's a comment about find_const_walker(). I cannot find that function > anywhere. What am I missing? > > [...] > > - Don't you intend to use the NUMERIC data column in SELECT * FROM > test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise, > the test is identical to previous test cases and you're not checking for > what happens with NUMERICs which are wrapped in FuncExpr because of the > implicit coercion. > > - Don't we want to extend IsConstList() to allow a list of all implicitly > coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't > work. > > [...] > > - Prepared statements are not supported as they contain INs with Param > instead of Const nodes. While less likely, I've seen applications that use > prepared statements in conjunction with queries generated through a UI which > ended up with tons of prepared queries with different number of elements in > the IN clause. Not necessarily something that must go into this patch but > maybe worth thinking about. The original version of the patch was doing all of this, i.e. handling numerics, Param nodes, RTE_VALUES. The commentary about find_const_walker in tests is referring to a part of that, that was dealing with evaluation of expression to see if it could be reduced to a constant. Unfortunately there was a significant push back from reviewers because of those features. That's why I've reduced the patch to it's minimally useful version, having in mind re-implementing them as follow-up patches in the future. This is the reason as well why I left tests covering all this missing functionality -- as breadcrumbs to already discovered cases, important for the future extensions. > - Why do we actually only want to merge constants? Why don't we ignore the > type of element in the IN and merge whatever is there? Is this because the > original jumbling logic as of now only has support for constants? > > - Ideally we would even remove duplicates. That would even improve > cardinality estimation but I guess we don't want to spend the cycles on > doing so in the planner? I believe these points are beyond the patch goals, as it's less clear (at least to me) if it's safe or desirable to do so.
> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: > > The original version of the patch was doing all of this, i.e. handling > numerics, Param nodes, RTE_VALUES. The commentary about > find_const_walker in tests is referring to a part of that, that was > dealing with evaluation of expression to see if it could be reduced to a > constant. > > Unfortunately there was a significant push back from reviewers because > of those features. That's why I've reduced the patch to it's minimally > useful version, having in mind re-implementing them as follow-up patches > in the future. This is the reason as well why I left tests covering all > this missing functionality -- as breadcrumbs to already discovered > cases, important for the future extensions. I'd like to elaborate on this a bit and remind about the origins of the patch, as it's lost somewhere in the beginning of the thread. The idea is not pulled out of thin air, everything is coming from our attempts to improve one particular monitoring infrastructure in a real commercial setting. Every covered use case and test in the original proposal was a result of field trials, when some application-side library or ORM was responsible for gigabytes of data in pgss, chocking the monitoring agent.
Hi, On 2/11/23 13:08, Dmitry Dolgov wrote: >> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: >> >> The original version of the patch was doing all of this, i.e. handling >> numerics, Param nodes, RTE_VALUES. The commentary about >> find_const_walker in tests is referring to a part of that, that was >> dealing with evaluation of expression to see if it could be reduced to a >> constant. >> >> Unfortunately there was a significant push back from reviewers because >> of those features. That's why I've reduced the patch to it's minimally >> useful version, having in mind re-implementing them as follow-up patches >> in the future. This is the reason as well why I left tests covering all >> this missing functionality -- as breadcrumbs to already discovered >> cases, important for the future extensions. > I'd like to elaborate on this a bit and remind about the origins of the > patch, as it's lost somewhere in the beginning of the thread. The idea > is not pulled out of thin air, everything is coming from our attempts to > improve one particular monitoring infrastructure in a real commercial > setting. Every covered use case and test in the original proposal was a > result of field trials, when some application-side library or ORM was > responsible for gigabytes of data in pgss, chocking the monitoring agent. Thanks for the clarifications. I didn't mean to contend the usefulness of the patch and I wasn't aware that you already jumped through the loops of handling Param, etc. Seems like supporting only constants is a good starting point. The only thing that is likely confusing for users is that NUMERICs (and potentially constants of other types) are unsupported. Wouldn't it be fairly simple to support them via something like the following? is_const(element) || (is_coercion(element) && is_const(element->child)) -- David Geier (ServiceNow)
> On Wed, Feb 15, 2023 at 08:51:56AM +0100, David Geier wrote: > Hi, > > On 2/11/23 13:08, Dmitry Dolgov wrote: > > > On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: > > > > > > The original version of the patch was doing all of this, i.e. handling > > > numerics, Param nodes, RTE_VALUES. The commentary about > > > find_const_walker in tests is referring to a part of that, that was > > > dealing with evaluation of expression to see if it could be reduced to a > > > constant. > > > > > > Unfortunately there was a significant push back from reviewers because > > > of those features. That's why I've reduced the patch to it's minimally > > > useful version, having in mind re-implementing them as follow-up patches > > > in the future. This is the reason as well why I left tests covering all > > > this missing functionality -- as breadcrumbs to already discovered > > > cases, important for the future extensions. > > I'd like to elaborate on this a bit and remind about the origins of the > > patch, as it's lost somewhere in the beginning of the thread. The idea > > is not pulled out of thin air, everything is coming from our attempts to > > improve one particular monitoring infrastructure in a real commercial > > setting. Every covered use case and test in the original proposal was a > > result of field trials, when some application-side library or ORM was > > responsible for gigabytes of data in pgss, chocking the monitoring agent. > > Thanks for the clarifications. I didn't mean to contend the usefulness of > the patch and I wasn't aware that you already jumped through the loops of > handling Param, etc. No worries, I just wanted to emphasize that we've already collected quite some number of use cases. > Seems like supporting only constants is a good starting > point. The only thing that is likely confusing for users is that NUMERICs > (and potentially constants of other types) are unsupported. Wouldn't it be > fairly simple to support them via something like the following? > > is_const(element) || (is_coercion(element) && is_const(element->child)) It definitely makes sense to implement that, although I don't think it's going to be acceptable to do that via directly listing conditions an element has to satisfy. It probably has to be more flexible, sice we would like to extend it in the future. My plan is to address this in a follow-up patch, when the main mechanism is approved. Would you agree with this approach?
> On Thu, Feb 09, 2023 at 08:43:29PM +0100, Dmitry Dolgov wrote: > > On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote: > > On 2023-Feb-09, Dmitry Dolgov wrote: > > > > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > > > > > > What is the point of making this a numeric setting? Either you want > > > > to merge all values or you don't want to merge any values. > > > > > > At least in theory the definition of "too many constants" is different > > > for different use cases and I see allowing to configure it as a way of > > > reducing the level of surprise here. > > > > I was thinking about this a few days ago and I agree that we don't > > necessarily want to make it just a boolean thing; we may want to make it > > more complex. One trivial idea is to make it group entries in powers of > > 10: for 0-9 elements, you get one entry, and 10-99 you get a different > > one, and so on: > > > > # group everything in a single bucket > > const_merge_threshold = true / yes / on > > > > # group 0-9, 10-99, 100-999, 1000-9999 > > const_merge_treshold = powers > > > > Ideally the value would be represented somehow in the query text. For > > example > > > > query | calls > > ----------------------------------------------------------+------- > > select * from test where i in ({... 0-9 entries ...}) | 2 > > select * from test where i in ({... 10-99 entries ...}) | 1 > > > > What do you think? The jumble would have to know how to reduce all > > values within each power-of-ten group to one specific value, but I don't > > think that should be particularly difficult. > > Yeah, it sounds appealing and conveniently addresses the question of > losing the information about how many constants originally were there. > Not sure if the example above would be the most natural way to represent > it in the query text, but otherwise I'm going to try implementing this. > Stay tuned. It took me couple of evenings, here is what I've got: * The representation is not that far away from your proposal, I've settled on: SELECT * FROM test_merge WHERE id IN (... [10-99 entries]) * To not reinvent the wheel, I've reused decimalLenght function from numutils, hence one more patch to make it available to reuse. * This approach resolves my concerns about letting people tuning the behaviour of merging, as now it's possible to distinguish between calls with different number of constants up to the power of 10. So I've decided to simplify the configuration and make the guc boolean to turn it off or on. * To separate queries with constants falling into different ranges (10-99, 100-999, etc), the order of magnitude is added into the jumble hash. * I've incorporated feedback from Sergei and David, as well as tried to make comments and documentation more clear. Any feedback is welcomed, thanks!
Attachment
Hi, >> Seems like supporting only constants is a good starting >> point. The only thing that is likely confusing for users is that NUMERICs >> (and potentially constants of other types) are unsupported. Wouldn't it be >> fairly simple to support them via something like the following? >> >> is_const(element) || (is_coercion(element) && is_const(element->child)) > It definitely makes sense to implement that, although I don't think it's > going to be acceptable to do that via directly listing conditions an > element has to satisfy. It probably has to be more flexible, sice we > would like to extend it in the future. My plan is to address this in a > follow-up patch, when the main mechanism is approved. Would you agree > with this approach? I still think it's counterintuitive and I'm pretty sure people would even report this as a bug because not knowing about the difference in internal representation you would expect NUMERICs to work the same way other constants work. If anything we would have to document it. Can't we do something pragmatic and have something like IsMergableInElement() which for now only supports constants and later can be extended? Or what exactly do you mean by "more flexible"? -- David Geier (ServiceNow)
> On Thu, Feb 23, 2023 at 09:48:35AM +0100, David Geier wrote: > Hi, > > > > Seems like supporting only constants is a good starting > > > point. The only thing that is likely confusing for users is that NUMERICs > > > (and potentially constants of other types) are unsupported. Wouldn't it be > > > fairly simple to support them via something like the following? > > > > > > is_const(element) || (is_coercion(element) && is_const(element->child)) > > It definitely makes sense to implement that, although I don't think it's > > going to be acceptable to do that via directly listing conditions an > > element has to satisfy. It probably has to be more flexible, sice we > > would like to extend it in the future. My plan is to address this in a > > follow-up patch, when the main mechanism is approved. Would you agree > > with this approach? > > I still think it's counterintuitive and I'm pretty sure people would even > report this as a bug because not knowing about the difference in internal > representation you would expect NUMERICs to work the same way other > constants work. If anything we would have to document it. > > Can't we do something pragmatic and have something like > IsMergableInElement() which for now only supports constants and later can be > extended? Or what exactly do you mean by "more flexible"? Here is how I see it (pls correct me if I'm wrong at any point). To support numerics as presented in the tests from this patch, we have to deal with FuncExpr (the function converting a value into a numeric). Having in mind only numerics, we would need to filter out any other FuncExpr (which already sounds dubious). Then we need to validate for example that the function is immutable and have constant arguments, which is already implemented in evaluate_function and is a part of eval_const_expression. There is nothing special about numerics at this point, and this approach leads us back to eval_const_expression to a certain degree. Do you see any other way? I'm thinking about Michael idea in this context, and want to see if it would be possible to make the mechanism more flexible using some node attributes. But I see it only as a follow-up step, not a prerequisite.
So I was seeing that this patch needs a rebase according to cfbot. However it looks like the review feedback you're looking for is more of design questions. What jumbling is best to include in the feature set and which is best to add in later patches. It sounds like you've gotten conflicting feedback from initial reviews. It does sound like the patch is pretty mature and you're actively responding to feedback so if you got more authoritative feedback it might even be committable now. It's already been two years of being rolled forward so it would be a shame to keep rolling it forward. Or is there some fatal problem that you're trying to work around and still haven't found the magic combination that convinces any committers this is something we want? In which case perhaps we set this patch returned? I don't get that impression myself though.
> On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote: > So I was seeing that this patch needs a rebase according to cfbot. Yeah, folks are getting up to speed in with pgss improvements recently. Thanks for letting me know. > However it looks like the review feedback you're looking for is more > of design questions. What jumbling is best to include in the feature > set and which is best to add in later patches. It sounds like you've > gotten conflicting feedback from initial reviews. > > It does sound like the patch is pretty mature and you're actively > responding to feedback so if you got more authoritative feedback it > might even be committable now. It's already been two years of being > rolled forward so it would be a shame to keep rolling it forward. You got it about right. There is a balance to strike between implementation, that would cover more useful cases, but has more dependencies (something like possibility of having multiple query id), and more minimalistic implementation that would actually be acceptable in the way it is now. But I haven't heard back from David about it, so I assume everybody is fine with the minimalistic approach. > Or is there some fatal problem that you're trying to work around and > still haven't found the magic combination that convinces any > committers this is something we want? In which case perhaps we set > this patch returned? I don't get that impression myself though. Nothing like this on my side, although I'm not good at conjuring committing powers of the nature.
> On Tue, Mar 14, 2023 at 08:04:32PM +0100, Dmitry Dolgov wrote: > > On Tue, Mar 14, 2023 at 02:14:17PM -0400, Gregory Stark (as CFM) wrote: > > So I was seeing that this patch needs a rebase according to cfbot. > > Yeah, folks are getting up to speed in with pgss improvements recently. > Thanks for letting me know. Following recent refactoring of pg_stat_statements tests, I've created a new one for merging functionality in the patch. This should solve any conflicts.
Attachment
On Sun, Mar 19, 2023 at 01:27:34PM +0100, Dmitry Dolgov wrote: > + If this parameter is on, two queries with an array will get the same > + query identifier if the only difference between them is the number of > + constants, both numbers is of the same order of magnitude and greater or > + equal 10 (so the order of magnitude is greather than 1, it is not worth > + the efforts otherwise). IMHO this adds way too much complexity to something that most users would expect to be an on/off switch. If I understand Álvaro's suggestion [0] correctly, he's saying that in addition to allowing "on" and "off", it might be worth allowing something like "powers" to yield roughly the behavior described above. I don't think he's suggesting that this "powers" behavior should be the only available option. Also, it seems counterintuitive that queries with fewer than 10 constants are not merged. In the interest of moving this patch forward, I would suggest making it a simple on/off switch in 0002 and moving the "powers" functionality to a new 0003 patch. I think separating out the core part of this feature might help reviewers. As you can see, I got distracted by the complicated threshold logic and ended up focusing my first round of review there. [0] https://postgr.es/m/20230209172651.cfgrebpyyr72h7fv%40alvherre.pgsql -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote: Thanks for reviewing. > On Sun, Mar 19, 2023 at 01:27:34PM +0100, Dmitry Dolgov wrote: > > + If this parameter is on, two queries with an array will get the same > > + query identifier if the only difference between them is the number of > > + constants, both numbers is of the same order of magnitude and greater or > > + equal 10 (so the order of magnitude is greather than 1, it is not worth > > + the efforts otherwise). > > IMHO this adds way too much complexity to something that most users would > expect to be an on/off switch. This documentation is exclusively to be precise about how does it work. Users don't have to worry about all this, and pretty much turn it on/off, as you've described. I agree though, I could probably write this text a bit differently. > If I understand Álvaro's suggestion [0] correctly, he's saying that in > addition to allowing "on" and "off", it might be worth allowing > something like "powers" to yield roughly the behavior described above. > I don't think he's suggesting that this "powers" behavior should be > the only available option. Independently of what Álvaro was suggesting, I find the "powers" approach more suitable, because it answers my own concerns about the previous implementation. Having "on"/"off" values means we would have to scratch heads coming up with a one-size-fit-all default value, or to introduce another option for the actual cut-off threshold. I would like to avoid both of those options, that's why I went with "powers" only. > Also, it seems counterintuitive that queries with fewer than 10 > constants are not merged. Why? What would be your intuition using this feature? > In the interest of moving this patch forward, I would suggest making it a > simple on/off switch in 0002 and moving the "powers" functionality to a new > 0003 patch. I think separating out the core part of this feature might > help reviewers. As you can see, I got distracted by the complicated > threshold logic and ended up focusing my first round of review there. I would disagree. As I've described above, to me "powers" seems to be a better fit, and the complicated logic is in fact reusing one already existing function. Do those arguments sound convincing to you?
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: tested, passed I've tested the patched on 17devel/master and it is my feeling - especially given the proliferation of the ORMs - that weneed such thing in pgss. Thread already took almost 3 years, so it would be pity to waste so much development time of yours.Cfbot is green, and patch works very well for me. IMVHO commitfest status should be even set to ready-for-comitter. Given the: SET query_id_const_merge = on; SELECT pg_stat_statements_reset(); SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11); SELECT * FROM test WHERE a IN (1, 2, 3); SELECT * FROM test WHERE a = ALL('{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}'); SELECT * FROM test WHERE a = ANY (ARRAY[11,10,9,8,7,6,5,4,3,2,1]); The patch results in: q | calls -----------------------------------------------------+------- SELECT * FROM test WHERE a = ALL($1) | 1 SELECT pg_stat_statements_reset() | 1 SELECT * FROM test WHERE a IN ($1, $2, $3) | 1 SELECT * FROM test WHERE a IN (... [10-99 entries]) | 2 Of course it's pity it doesn't collapse the below ones: SELECT * FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11)) AS t (num); INSERT INTO dummy VALUES(1, 'text 1'),(2, 'text 2'),(3, 'text 3'),(4, 'text 3'),(5, 'text 3'),(6, 'text 3'),(7, 'text 3'),(8,'text 3'),(9, 'text 3'),(10, 'text 3') ON CONFLICT (id) DO NOTHING; PREPARE s3(int[], int[], int[], int[], int[], int[], int[], int[], int[], int[], int[]) AS SELECT * FROM test WHERE a = ANY ($1::int[]) OR a = ANY ($2::int[]) OR [..] a = ANY ($11::int[]) ; but given the convoluted thread history, it's understandable and as you stated - maybe in future. There's one additional benefit to this patch: the pg_hint_plan extension seems to borrow pgss's generate_normalized_query().So if that's changed in next major release, the pg_hint_plan hint table (transparent plan rewriteusing table) will automatically benefit from generalization of the query string here (imagine fixing plans for ORMthat generate N {1,1024} number of IN() array elements; today that would be N number of entries in the "hint_plan.hints"table). The new status of this patch is: Needs review
I've also tried the patch and I see the same results as Jakub, which make sense to me. I did have issues getting it to apply, though: `git am` complains about a conflict, though patch itself was able to apply it.
Hi, this is my first email to the pgsql hackers. I came across this email thread while looking at https://github.com/rails/rails/pull/49388 for Ruby on Rails one of the popular web application framework by replacing every query `in` clause with `any` to reduce similar entries in `pg_stat_statements`. I want this to be solved on the PostgreSQL side, mainly because I want to avoid replacing every in clause with any to reduce similar entries in pg_stat_statements. It would be nice to have this patch reviewed. As I'm not familiar with C and PostgreSQL source code, I'm not reviewing this patch myself, I applied this patch to my local PostgreSQL and the Active Record unit tests ran successfully. -- Yasuo Honda
On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote: > On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote: >> IMHO this adds way too much complexity to something that most users would >> expect to be an on/off switch. > > This documentation is exclusively to be precise about how does it work. > Users don't have to worry about all this, and pretty much turn it > on/off, as you've described. I agree though, I could probably write this > text a bit differently. FWIW, I am going to side with Nathan on this one, but not completely either. I was looking at the patch and it brings too much complexity for a monitoring feature in this code path. In my experience, I've seen people complain about IN/ANY never strimmed down to a single parameter in pg_stat_statements but I still have to hear from somebody outside this thread that they'd like to reduce an IN clause depending on the number of items, or something else. >> If I understand Álvaro's suggestion [0] correctly, he's saying that in >> addition to allowing "on" and "off", it might be worth allowing >> something like "powers" to yield roughly the behavior described above. >> I don't think he's suggesting that this "powers" behavior should be >> the only available option. > > Independently of what Álvaro was suggesting, I find the "powers" > approach more suitable, because it answers my own concerns about the > previous implementation. Having "on"/"off" values means we would have to > scratch heads coming up with a one-size-fit-all default value, or to > introduce another option for the actual cut-off threshold. I would like > to avoid both of those options, that's why I went with "powers" only. Now, it doesn't mean that this approach with the "powers" will never happen, but based on the set of opinions I am gathering on this thread I would suggest to rework the patch as follows: - First implement an on/off switch that reduces the lists in IN and/or ANY to one parameter. Simply. - Second, refactor the powers routine. - Third, extend the on/off switch, or just implement a threshold with a second switch. When it comes to my opinion, I am not seeing any objections to the feature as a whole, and I'm OK with the first step. I'm also OK to keep the door open for more improvements in controlling how these IN/ANY lists show up, but there could be more than just the number of items as parameter (say the query size, different behaviors depending on the number of clauses in queries, subquery context or CTEs/WITH, etc. just to name a few things coming in mind). -- Michael
Attachment
> On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote: > Now, it doesn't mean that this approach with the "powers" will never > happen, but based on the set of opinions I am gathering on this thread > I would suggest to rework the patch as follows: > - First implement an on/off switch that reduces the lists in IN and/or > ANY to one parameter. Simply. > - Second, refactor the powers routine. > - Third, extend the on/off switch, or just implement a threshold with > a second switch. Well, if it will help move this patch forward, why not. To clarify, I'm going to split the current implementation into three patches, one for each point you've mentioned. > When it comes to my opinion, I am not seeing any objections to the > feature as a whole, and I'm OK with the first step. I'm also OK to > keep the door open for more improvements in controlling how these > IN/ANY lists show up, but there could be more than just the number of > items as parameter (say the query size, different behaviors depending > on the number of clauses in queries, subquery context or CTEs/WITH, > etc. just to name a few things coming in mind). Interesting point, but now it's my turn to have troubles imagining the case, where list representation could be controlled depending on something else than the number of elements in it. Do you have any specific example in mind?
On Tue, Jul 04, 2023 at 09:02:56PM +0200, Dmitry Dolgov wrote: >> On Mon, Jul 03, 2023 at 09:46:11PM -0700, Nathan Bossart wrote: >> Also, it seems counterintuitive that queries with fewer than 10 >> constants are not merged. > > Why? What would be your intuition using this feature? For the "powers" setting, I would've expected queries with 0-9 constants to be merged. Then 10-99, 100-999, 1000-9999, etc. I suppose there might be an argument for separating 0 from 1-9, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
> On Fri, Oct 13, 2023 at 03:35:19PM +0200, Dmitry Dolgov wrote: > > On Fri, Oct 13, 2023 at 05:07:00PM +0900, Michael Paquier wrote: > > Now, it doesn't mean that this approach with the "powers" will never > > happen, but based on the set of opinions I am gathering on this thread > > I would suggest to rework the patch as follows: > > - First implement an on/off switch that reduces the lists in IN and/or > > ANY to one parameter. Simply. > > - Second, refactor the powers routine. > > - Third, extend the on/off switch, or just implement a threshold with > > a second switch. > > Well, if it will help move this patch forward, why not. To clarify, I'm > going to split the current implementation into three patches, one for > each point you've mentioned. Here is what I had mind. The first patch implements the basic notion of merging, and I guess everyone agrees on its usefulness. The second and third implement merging into groups power of 10, which I still find useful as well. The last one adds a lower threshold for merging on top of that. My intentions are to get the first one in, ideally I would love to see the second and third applied as well. > > When it comes to my opinion, I am not seeing any objections to the > > feature as a whole, and I'm OK with the first step. I'm also OK to > > keep the door open for more improvements in controlling how these > > IN/ANY lists show up, but there could be more than just the number of > > items as parameter (say the query size, different behaviors depending > > on the number of clauses in queries, subquery context or CTEs/WITH, > > etc. just to name a few things coming in mind). > > Interesting point, but now it's my turn to have troubles imagining the > case, where list representation could be controlled depending on > something else than the number of elements in it. Do you have any > specific example in mind? In the current patch version I didn't add anything yet to address the question of having more parameters to tune constants merging. The main obstacle as I see it is that the information for that has to be collected when jumbling various query nodes. Anything except information about the ArrayExpr itself would have to be acquired when jumbling some other parts of the query, not directly related to the ArrayExpr. It seems to me this interdependency between otherwise unrelated nodes outweigh the value it brings, and would require some more elaborate (and more invasive for the purpose of this patch) mechanism to implement.
Attachment
On Tue, Oct 17, 2023 at 10:15:41AM +0200, Dmitry Dolgov wrote: > In the current patch version I didn't add anything yet to address the > question of having more parameters to tune constants merging. The main > obstacle as I see it is that the information for that has to be > collected when jumbling various query nodes. Anything except information > about the ArrayExpr itself would have to be acquired when jumbling some > other parts of the query, not directly related to the ArrayExpr. It > seems to me this interdependency between otherwise unrelated nodes > outweigh the value it brings, and would require some more elaborate (and > more invasive for the purpose of this patch) mechanism to implement. typedef struct ArrayExpr { + pg_node_attr(custom_query_jumble) + Hmm. I am not sure that this is the best approach implementation-wise. Wouldn't it be better to invent a new pg_node_attr (these can include parameters as well!), say query_jumble_merge or query_jumble_agg_location that aggregates all the parameters of a list to be considered as a single element. To put it short, we could also apply the same property to other parts of a parsed tree, and not only an ArrayExpr's list. /* GUC parameters */ extern PGDLLIMPORT int compute_query_id; - +extern PGDLLIMPORT bool query_id_const_merge; Not much a fan of this addition as well for an in-core GUC. I would suggest pushing the GUC layer to pg_stat_statements, maintaining the computation method to use as a field of JumbleState as I suspect that this is something we should not enforce system-wide, but at extension-level instead. + /* + * Indicates the constant represents the beginning or the end of a merged + * constants interval. + */ + bool merged; Not sure that this is the best thing to do either. Instead of this extra boolean flag, could it be simpler if we switch LocationLen so as we track the start position and the end position of a constant in a query string, so as we'd use one LocationLen for a whole set of Const nodes in an ArrayExpr? Perhaps this could just be a refactoring piece of its own? + /* + * If the first expression is a constant, verify if the following elements + * are constants as well. If yes, the list is eligible for merging, and the + * order of magnitude need to be calculated. + */ + if (IsA(firstExpr, Const)) + { + foreach(temp, elements) + if (!IsA(lfirst(temp), Const)) + return false; This path should be benchmarked, IMO. -- Michael
Attachment
> On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote: > typedef struct ArrayExpr > { > + pg_node_attr(custom_query_jumble) > + > > Hmm. I am not sure that this is the best approach > implementation-wise. Wouldn't it be better to invent a new > pg_node_attr (these can include parameters as well!), say > query_jumble_merge or query_jumble_agg_location that aggregates all > the parameters of a list to be considered as a single element. To put > it short, we could also apply the same property to other parts of a > parsed tree, and not only an ArrayExpr's list. Sounds like an interesting idea, something like: typedef struct ArrayExpr { ... List *elements pg_node_attr(query_jumble_merge); to replace simple JUMBLE_NODE(elements) with more elaborated logic. > /* GUC parameters */ > extern PGDLLIMPORT int compute_query_id; > - > +extern PGDLLIMPORT bool query_id_const_merge; > > Not much a fan of this addition as well for an in-core GUC. I would > suggest pushing the GUC layer to pg_stat_statements, maintaining the > computation method to use as a field of JumbleState as I suspect that > this is something we should not enforce system-wide, but at > extension-level instead. I also do not particularly like an extra GUC here, but as far as I can tell to make it pg_stat_statements GUC only it has to be something similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be called from pgss. Does this sound better? > + /* > + * Indicates the constant represents the beginning or the end of a merged > + * constants interval. > + */ > + bool merged; > > Not sure that this is the best thing to do either. Instead of this > extra boolean flag, could it be simpler if we switch LocationLen so as > we track the start position and the end position of a constant in a > query string, so as we'd use one LocationLen for a whole set of Const > nodes in an ArrayExpr? Perhaps this could just be a refactoring piece > of its own? Sounds interesting as well, but it seems to me there is a catch. I'll try to elaborate, bear with me: * if the start and the end positions of a constant means the first and the last character representing it, we need the textual length of the constant in the query to be able to construct such a LocationLen. The lengths are calculated in pg_stat_statements later, not in JumbleQuery, and it uses parser for that. Doing all of this in JumbleQuery doesn't sound reasonable to me. * if instead we talk about the start and the end positions in a set of constants, that would mean locations of the first and the last constants in the set, and everything seems fine. But for such LocationLen to represent a single constant (not a set of constants), it means that only the start position would be meaningful, the end position will not be used. The second approach is somewhat close to be simpler than the merge flag, but assumes the ugliness for a single constant. What do you think about this? > + /* > + * If the first expression is a constant, verify if the following elements > + * are constants as well. If yes, the list is eligible for merging, and the > + * order of magnitude need to be calculated. > + */ > + if (IsA(firstExpr, Const)) > + { > + foreach(temp, elements) > + if (!IsA(lfirst(temp), Const)) > + return false; > > This path should be benchmarked, IMO. I can do some benchmarking here, but of course it's going to be slower than the baseline. The main idea behind the patch is to trade this overhead for the benefits in the future while processing pgss records, hoping that it's going to be worth it (and in those extreme cases I'm trying to address it's definitely worth it).
> On Fri, Oct 27, 2023 at 05:02:44PM +0200, Dmitry Dolgov wrote: > > On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote: > > typedef struct ArrayExpr > > { > > + pg_node_attr(custom_query_jumble) > > + > > > > Hmm. I am not sure that this is the best approach > > implementation-wise. Wouldn't it be better to invent a new > > pg_node_attr (these can include parameters as well!), say > > query_jumble_merge or query_jumble_agg_location that aggregates all > > the parameters of a list to be considered as a single element. To put > > it short, we could also apply the same property to other parts of a > > parsed tree, and not only an ArrayExpr's list. > > Sounds like an interesting idea, something like: > > typedef struct ArrayExpr > { > ... > List *elements pg_node_attr(query_jumble_merge); > > to replace simple JUMBLE_NODE(elements) with more elaborated logic. > > > /* GUC parameters */ > > extern PGDLLIMPORT int compute_query_id; > > - > > +extern PGDLLIMPORT bool query_id_const_merge; > > > > Not much a fan of this addition as well for an in-core GUC. I would > > suggest pushing the GUC layer to pg_stat_statements, maintaining the > > computation method to use as a field of JumbleState as I suspect that > > this is something we should not enforce system-wide, but at > > extension-level instead. > > I also do not particularly like an extra GUC here, but as far as I can > tell to make it pg_stat_statements GUC only it has to be something > similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be > called from pgss. Does this sound better? For clarity, here is what I had in mind for those two points.
Attachment
On Tue, 31 Oct 2023 at 14:36, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Fri, Oct 27, 2023 at 05:02:44PM +0200, Dmitry Dolgov wrote: > > > On Thu, Oct 26, 2023 at 09:08:42AM +0900, Michael Paquier wrote: > > > typedef struct ArrayExpr > > > { > > > + pg_node_attr(custom_query_jumble) > > > + > > > > > > Hmm. I am not sure that this is the best approach > > > implementation-wise. Wouldn't it be better to invent a new > > > pg_node_attr (these can include parameters as well!), say > > > query_jumble_merge or query_jumble_agg_location that aggregates all > > > the parameters of a list to be considered as a single element. To put > > > it short, we could also apply the same property to other parts of a > > > parsed tree, and not only an ArrayExpr's list. > > > > Sounds like an interesting idea, something like: > > > > typedef struct ArrayExpr > > { > > ... > > List *elements pg_node_attr(query_jumble_merge); > > > > to replace simple JUMBLE_NODE(elements) with more elaborated logic. > > > > > /* GUC parameters */ > > > extern PGDLLIMPORT int compute_query_id; > > > - > > > +extern PGDLLIMPORT bool query_id_const_merge; > > > > > > Not much a fan of this addition as well for an in-core GUC. I would > > > suggest pushing the GUC layer to pg_stat_statements, maintaining the > > > computation method to use as a field of JumbleState as I suspect that > > > this is something we should not enforce system-wide, but at > > > extension-level instead. > > > > I also do not particularly like an extra GUC here, but as far as I can > > tell to make it pg_stat_statements GUC only it has to be something > > similar to EnableQueryId (e.g. EnableQueryConstMerging), that will be > > called from pgss. Does this sound better? > > For clarity, here is what I had in mind for those two points. CFBot shows documentation build has failed at [1] with: [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF attribute linkend references an unknown ID "guc-query-id-const-merge-threshold" [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4 [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml' [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2 [07:44:58.182] make: *** [Makefile:16: all] Error 2 [1] - https://cirrus-ci.com/task/6688578378399744 Regards, Vignesh
> On Sat, Jan 06, 2024 at 09:04:54PM +0530, vignesh C wrote: > > CFBot shows documentation build has failed at [1] with: > [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc > [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF > attribute linkend references an unknown ID > "guc-query-id-const-merge-threshold" > [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4 > [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml' > [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2 > [07:44:58.182] make: *** [Makefile:16: all] Error 2 > > [1] - https://cirrus-ci.com/task/6688578378399744 Indeed, after moving the configuration option to pgss I forgot to update its reference in the docs. Thanks for noticing, will update soon.
> On Mon, Jan 08, 2024 at 05:10:20PM +0100, Dmitry Dolgov wrote: > > On Sat, Jan 06, 2024 at 09:04:54PM +0530, vignesh C wrote: > > > > CFBot shows documentation build has failed at [1] with: > > [07:44:55.531] time make -s -j${BUILD_JOBS} -C doc > > [07:44:57.987] postgres.sgml:572: element xref: validity error : IDREF > > attribute linkend references an unknown ID > > "guc-query-id-const-merge-threshold" > > [07:44:58.179] make[2]: *** [Makefile:70: postgres-full.xml] Error 4 > > [07:44:58.179] make[2]: *** Deleting file 'postgres-full.xml' > > [07:44:58.181] make[1]: *** [Makefile:8: all] Error 2 > > [07:44:58.182] make: *** [Makefile:16: all] Error 2 > > > > [1] - https://cirrus-ci.com/task/6688578378399744 > > Indeed, after moving the configuration option to pgss I forgot to update > its reference in the docs. Thanks for noticing, will update soon. Here is the fixed version.
Attachment
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there was a CFbot test failure last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/2837/ [2] https://cirrus-ci.com/task/6688578378399744 Kind Regards, Peter Smith.
> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote: > 2024-01 Commitfest. > > Hi, This patch has a CF status of "Needs Review" [1], but it seems > there was a CFbot test failure last time it was run [2]. Please have a > look and post an updated version if necessary. > > ====== > [1] https://commitfest.postgresql.org/46/2837/ > [2] https://cirrus-ci.com/task/6688578378399744 It's the same failing pipeline Vignesh C was talking above. I've fixed the issue in the latest patch version, but looks like it wasn't picked up yet (from what I understand, the latest build for this CF is 8 weeks old).
Dmitry Dolgov <9erthalion6@gmail.com> writes: >> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote: >> Hi, This patch has a CF status of "Needs Review" [1], but it seems >> there was a CFbot test failure last time it was run [2]. Please have a >> look and post an updated version if necessary. >> >> ====== >> [1] https://commitfest.postgresql.org/46/2837/ >> [2] https://cirrus-ci.com/task/6688578378399744 > It's the same failing pipeline Vignesh C was talking above. I've fixed > the issue in the latest patch version, but looks like it wasn't picked > up yet (from what I understand, the latest build for this CF is 8 weeks > old). Please notice that at the moment, it's not being tested at all because of a patch-apply failure -- that's what the little triangular symbol means. The rest of the display concerns the test results from the last successfully-applied patch version. (Perhaps that isn't a particularly great UI design.) If you click on the triangle you find out == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 === === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch patching file contrib/pg_stat_statements/Makefile Hunk #1 FAILED at 19. 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej patching file contrib/pg_stat_statements/expected/merging.out patching file contrib/pg_stat_statements/meson.build ... regards, tom lane
> On Mon, Jan 22, 2024 at 11:35:22AM -0500, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: > >> On Mon, Jan 22, 2024 at 05:33:26PM +1100, Peter Smith wrote: > >> Hi, This patch has a CF status of "Needs Review" [1], but it seems > >> there was a CFbot test failure last time it was run [2]. Please have a > >> look and post an updated version if necessary. > >> > >> ====== > >> [1] https://commitfest.postgresql.org/46/2837/ > >> [2] https://cirrus-ci.com/task/6688578378399744 > > > It's the same failing pipeline Vignesh C was talking above. I've fixed > > the issue in the latest patch version, but looks like it wasn't picked > > up yet (from what I understand, the latest build for this CF is 8 weeks > > old). > > Please notice that at the moment, it's not being tested at all because > of a patch-apply failure -- that's what the little triangular symbol > means. The rest of the display concerns the test results from the > last successfully-applied patch version. (Perhaps that isn't a > particularly great UI design.) > > If you click on the triangle you find out > > == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 === > === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch > patching file contrib/pg_stat_statements/Makefile > Hunk #1 FAILED at 19. > 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej > patching file contrib/pg_stat_statements/expected/merging.out > patching file contrib/pg_stat_statements/meson.build Oh, I see, thanks. Give me a moment, will fix this.
> On Mon, Jan 22, 2024 at 06:07:27PM +0100, Dmitry Dolgov wrote: > > Please notice that at the moment, it's not being tested at all because > > of a patch-apply failure -- that's what the little triangular symbol > > means. The rest of the display concerns the test results from the > > last successfully-applied patch version. (Perhaps that isn't a > > particularly great UI design.) > > > > If you click on the triangle you find out > > > > == Applying patches on top of PostgreSQL commit ID b0f0a9432d0b6f53634a96715f2666f6d4ea25a1 === > > === applying patch ./v17-0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch > > patching file contrib/pg_stat_statements/Makefile > > Hunk #1 FAILED at 19. > > 1 out of 1 hunk FAILED -- saving rejects to file contrib/pg_stat_statements/Makefile.rej > > patching file contrib/pg_stat_statements/expected/merging.out > > patching file contrib/pg_stat_statements/meson.build > > Oh, I see, thanks. Give me a moment, will fix this. Here is it.
Attachment
Hi, I'm interested in this feature. It looks like these patches have some conflicts. http://cfbot.cputube.org/patch_47_2837.log Would you rebase these patches? Thanks, -- Yasuo Honda On Sat, Mar 23, 2024 at 4:11 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Oh, I see, thanks. Give me a moment, will fix this. > > Here is it.
> On Sat, Mar 23, 2024 at 04:13:44PM +0900, Yasuo Honda wrote: > Hi, I'm interested in this feature. It looks like these patches have > some conflicts. > > http://cfbot.cputube.org/patch_47_2837.log > > Would you rebase these patches? Sure, I can rebase, give me a moment. If you don't want to wait, there is a base commit in the patch, against which it should be applied without issues, 0eb23285a2.
Thanks for the information. I can apply these 4 patches from 0eb23285a2 . I tested this branch from Ruby on Rails and it gets some unexpected behavior from my point of view. Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not normalize sql queries whose number of in clauses exceeds 5. Here are test steps. https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior It would be appreciated if I can get my understanding correct. -- Yasuo Honda On Sun, Mar 24, 2024 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Sure, I can rebase, give me a moment. If you don't want to wait, there > is a base commit in the patch, against which it should be applied > without issues, 0eb23285a2.
> On Sun, Mar 24, 2024 at 11:36:38PM +0900, Yasuo Honda wrote: > Thanks for the information. I can apply these 4 patches from > 0eb23285a2 . I tested this branch from Ruby on Rails and it gets some > unexpected behavior from my point of view. > Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not > normalize sql queries whose number of in clauses exceeds 5. > > Here are test steps. > https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior > > It would be appreciated if I can get my understanding correct. From what I understand out of the description this ruby script uses prepared statements, passing values as parameters, right? Unfortunately the current version of the patch doesn't handle that, it works with constants only [1]. The original incarnation of this feature was able to handle that, but the implementation was considered to be not suitable -- thus, to make some progress, it was left outside. The plan is, if everything goes fine at some point, to do a follow-up patch to handle Params and the rest. [1]: https://www.postgresql.org/message-id/20230211104707.grsicemegr7d3mgh%40erthalion.local
Yes. The script uses prepared statements because Ruby on Rails enables prepared statements by default for PostgreSQL databases. Then I tested this branch https://github.com/yahonda/postgres/tree/pg_stat_statements without using prepared statements as follows and all of them do not normalize in clause values. - Disabled prepared statements by setting `prepared_statements: false` https://gist.github.com/yahonda/2c2d6ac7a955886a305750eecfd07c5e - Use ruby-pg https://gist.github.com/yahonda/2f0efb11ae888d8f6b27a07e0b833fdf - Use psql https://gist.github.com/yahonda/c830379b33d66a743aef159aa03d7e49 I do not know why even if I use psql, the query column at pg_stat_sql_statement shows it is like a prepared statement "IN ($1, $2)". On Tue, Mar 26, 2024 at 1:35 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > From what I understand out of the description this ruby script uses > prepared statements, passing values as parameters, right? Unfortunately > the current version of the patch doesn't handle that, it works with > constants only [1]. The original incarnation of this feature was able to > handle that, but the implementation was considered to be not suitable -- > thus, to make some progress, it was left outside.
> On Tue, Mar 26, 2024 at 04:21:46PM +0900, Yasuo Honda wrote: > Yes. The script uses prepared statements because Ruby on Rails enables > prepared statements by default for PostgreSQL databases. > > Then I tested this branch > https://github.com/yahonda/postgres/tree/pg_stat_statements without > using prepared statements as follows and all of them do not normalize > in clause values. > > - Disabled prepared statements by setting `prepared_statements: false` > https://gist.github.com/yahonda/2c2d6ac7a955886a305750eecfd07c5e > > - Use ruby-pg > https://gist.github.com/yahonda/2f0efb11ae888d8f6b27a07e0b833fdf > > - Use psql > https://gist.github.com/yahonda/c830379b33d66a743aef159aa03d7e49 > > I do not know why even if I use psql, the query column at > pg_stat_sql_statement shows it is like a prepared statement "IN ($1, > $2)". It's a similar case: the column is defined as bigint, thus PostgreSQL has to wrap every constant expression in a function expression that converts its type to bigint. The current patch version doesn't try to reduce a FuncExpr into Const (event if the wrapped value is a Const), thus this array is not getting merged. If you replace bigint with an int, no type conversion would be required and merging logic will kick in. Again, the original version of the patch was able to handle this case, but it was stripped away to make the patch smaller in hope of moving forward. Anyway, thanks for reminding about how annoying the current handling of constant arrays can look like in practice!
Thanks for the useful info. Ruby on Rails uses bigint as a default data type for the primary key and prepared statements have been enabled by default for PostgreSQL. I'm looking forward to these current patches being merged as a first step and future versions of pg_stat_statements will support normalizing bigint and prepared statements. On Wed, Mar 27, 2024 at 6:00 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > It's a similar case: the column is defined as bigint, thus PostgreSQL > has to wrap every constant expression in a function expression that > converts its type to bigint. The current patch version doesn't try to > reduce a FuncExpr into Const (event if the wrapped value is a Const), > thus this array is not getting merged. If you replace bigint with an > int, no type conversion would be required and merging logic will kick > in. > > Again, the original version of the patch was able to handle this case, > but it was stripped away to make the patch smaller in hope of moving > forward. Anyway, thanks for reminding about how annoying the current > handling of constant arrays can look like in practice!
> On Wed, Mar 27, 2024 at 08:56:12AM +0900, Yasuo Honda wrote: > Thanks for the useful info. > > Ruby on Rails uses bigint as a default data type for the primary key > and prepared statements have been enabled by default for PostgreSQL. > I'm looking forward to these current patches being merged as a first > step and future versions of pg_stat_statements will support > normalizing bigint and prepared statements. Here is the rebased version. In the meantime I'm going to experiment with how to support more use cases in a way that will be more acceptable for the community.
Attachment
Hi, In <20240404143514.a26f7ttxrbdfc73a@erthalion.local> "Re: pg_stat_statements and "IN" conditions" on Thu, 4 Apr 2024 16:35:14 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Here is the rebased version. Thanks. I'm not familiar with this code base but I've reviewed these patches because I'm interested in this feature too. 0001: > diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c > index be823a7f8fa..e9473def361 100644 > --- a/src/backend/nodes/queryjumblefuncs.c > +++ b/src/backend/nodes/queryjumblefuncs.c > > @@ -212,15 +233,67 @@ RecordConstLocation(JumbleState *jstate, int location) > ... > +static bool > +IsMergeableConstList(List *elements, Const **firstConst, Const **lastConst) > +{ > + ListCell *temp; > + Node *firstExpr = NULL; > + > + if (elements == NULL) "elements == NIL" will be better for List. > +static void > +_jumbleElements(JumbleState *jstate, List *elements) > +{ > + Const *first, *last; > + if(IsMergeableConstList(elements, &first, &last)) A space is missing between "if" and "(". > diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h > index aa727e722cc..cf4f900d4ed 100644 > --- a/src/include/nodes/primnodes.h > +++ b/src/include/nodes/primnodes.h > @@ -1333,7 +1333,7 @@ typedef struct ArrayExpr > /* common type of array elements */ > Oid element_typeid pg_node_attr(query_jumble_ignore); > /* the array elements or sub-arrays */ > - List *elements; > + List *elements pg_node_attr(query_jumble_merge); Should we also update the pg_node_attr() comment for query_jumble_merge in nodes.h? 0003: > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c > index d7841b51cc9..00eec30feb1 100644 > --- a/contrib/pg_stat_statements/pg_stat_statements.c > +++ b/contrib/pg_stat_statements/pg_stat_statements.c > ... > @@ -2883,12 +2886,22 @@ generate_normalized_query(JumbleState *jstate, const char *query, > /* The firsts merged constant */ > else if (!skip) > { > + static const uint32 powers_of_ten[] = { > + 1, 10, 100, > + 1000, 10000, 100000, > + 1000000, 10000000, 100000000, > + 1000000000 > + }; > + int lower_merged = powers_of_ten[magnitude - 1]; > + int upper_merged = powers_of_ten[magnitude]; How about adding a reverse function of decimalLength32() to numutils.h and use it here? > - n_quer_loc += sprintf(norm_query + n_quer_loc, "..."); > + n_quer_loc += sprintf(norm_query + n_quer_loc, "... [%d-%d entries]", > + lower_merged, upper_merged - 1); Do we still have enough space in norm_query for this change? It seems that norm_query expects up to 10 additional bytes per jstate->clocations[i]. It seems that we can merge 0001, 0003 and 0004 to one patch. (Sorry. I haven't read all discussions yet. If we already discuss this, sorry for this noise.) Thanks, -- kou
> On Mon, Apr 15, 2024 at 06:09:29PM +0900, Sutou Kouhei wrote: > > Thanks. I'm not familiar with this code base but I've > reviewed these patches because I'm interested in this > feature too. Thanks for the review! The commentaries for the first patch make sense to me, will apply. > 0003: > > > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c > > index d7841b51cc9..00eec30feb1 100644 > > --- a/contrib/pg_stat_statements/pg_stat_statements.c > > +++ b/contrib/pg_stat_statements/pg_stat_statements.c > > ... > > @@ -2883,12 +2886,22 @@ generate_normalized_query(JumbleState *jstate, const char *query, > > /* The firsts merged constant */ > > else if (!skip) > > { > > + static const uint32 powers_of_ten[] = { > > + 1, 10, 100, > > + 1000, 10000, 100000, > > + 1000000, 10000000, 100000000, > > + 1000000000 > > + }; > > + int lower_merged = powers_of_ten[magnitude - 1]; > > + int upper_merged = powers_of_ten[magnitude]; > > How about adding a reverse function of decimalLength32() to > numutils.h and use it here? I was pondering that at some point, but eventually decided to keep it this way, because: * the use case is quite specific, I can't image it's being used anywhere else * it would not be strictly reverse, as the transformation itself is not reversible in the strict sense > > - n_quer_loc += sprintf(norm_query + n_quer_loc, "..."); > > + n_quer_loc += sprintf(norm_query + n_quer_loc, "... [%d-%d entries]", > > + lower_merged, upper_merged - 1); > > Do we still have enough space in norm_query for this change? > It seems that norm_query expects up to 10 additional bytes > per jstate->clocations[i]. As far as I understand there should be enough space, because we're going to replace at least 10 constants with one merge record. But it's a good point, this should be called out in the commentary explaining why 10 additional bytes are added. > It seems that we can merge 0001, 0003 and 0004 to one patch. > (Sorry. I haven't read all discussions yet. If we already > discuss this, sorry for this noise.) There is a certain disagreement about which portion of this feature makes sense to go with first, thus I think keeping all options open is a good idea. In the end a committer can squash the patches if needed.
> On Tue, Apr 23, 2024 at 10:18:15AM +0200, Dmitry Dolgov wrote: > > On Mon, Apr 15, 2024 at 06:09:29PM +0900, Sutou Kouhei wrote: > > > > Thanks. I'm not familiar with this code base but I've > > reviewed these patches because I'm interested in this > > feature too. > > Thanks for the review! The commentaries for the first patch make sense > to me, will apply. Here is the new version. It turned out you were right about memory for the normalized query, if the number of constants goes close to INT_MAX, there were indeed not enough allocated. I've added a fix for this on top of the applied changes, and also improved readability for pg_stat_statements part.
Attachment
Hello This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the discussionit was mentioned that it would be good to measure performance difference. PS: patch cannot be applied at this time, it needs another rebase. regards, Sergei
> On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote: > > This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the discussionit was mentioned that it would be good to measure performance difference. > > PS: patch cannot be applied at this time, it needs another rebase. Yeah, it seems like most people are fine with the first patch and the simplest approach. I'm going to post a rebased version and a short thread summary soon.
> On Sun, Aug 11, 2024 at 09:34:55PM GMT, Dmitry Dolgov wrote: > > On Sun, Aug 11, 2024 at 07:54:05PM +0300, Sergei Kornilov wrote: > > > > This feature will improve my monitoring. Even in patch 0001. I think there are many other people in the thread who thinkthis is useful. So maybe we should move it forward? Any complaints about the overall design? I see in the discussionit was mentioned that it would be good to measure performance difference. > > > > PS: patch cannot be applied at this time, it needs another rebase. > > Yeah, it seems like most people are fine with the first patch and the > simplest approach. I'm going to post a rebased version and a short > thread summary soon. Ok, here is the rebased version. If anyone would like to review them, below is the short summary of the thread. Currently the patch series contains 4 changes: * 0001-Prevent-jumbling-of-every-element-in-ArrayExpr.patch Implements the simplest way to handle constant arrays, if the array contains only constants it will be reduced. This is the basis, if I read it correctly Nathan and Michael expressed that they're mostly fine with this one. Michael seems to be skeptical about the "merged" flag in the LocationLen struct, but from what I see the proposed alternative has problems as well. There was also a note that the loop over constants has to be benchmarked, but it's not entirely clear for me in which dimentions to benchmark (i.e. what are the expectations). For both I'm waiting on a reply to my questions. * 0002-Reusable-decimalLength-functions.patch A small refactoring to make already existing "powers" functonality reusable for following patches. * 0003-Merge-constants-in-ArrayExpr-into-groups.patch Makes handling of constant arrays smarter by taking into account number of elements in the array. This way records are merged into groups power of 10, i.e. arrays with length 55 will land in a group 10-99, with lenght 555 in a group 100-999 etc. This was proposed by Alvaro, and personally I like this approach, because it remediates the issue of one-size-fits-all for the static threshold. But there are opinions that this introduces too much complexity. * 0004-Introduce-query_id_const_merge_threshold.patch Fine tuning for the previous patch, makes only arrays with the length over a certain threshold to be reduced. On top of that Yasuo Honda and Jakub Wartak have provided a couple of practical examples, where handling of constant arrays has to be improved. David Geier pointed out some examples that might be confusing as well. All those are definitely worth addressing, but out of scope of this patch for now.