Thread: Multiple Query IDs for a rewritten parse tree
On 5/1/2022 10:13, Tom Lane wrote:
> I feel like we need to get away from the idea that there is just
> one query hash, and somehow let different extensions attach
> differently-calculated hashes to a query. I don't have any immediate
> ideas about how to do that in a reasonably inexpensive way.

Right now the queryId field represents a query class (depending on the jumbling implementation). Extensions use it as a simple way to track a query from the point where its parse tree is created to the end of its life, across all the hook calls the extension uses (keeping possible plan caching in mind).

I know of at least two different uses of queryId:
1) monitoring purposes - pg_stat_statements watches how often queries of a class show up in the database and collects statistics for each class;
2) adaptive purposes - some extensions influence planner decisions during the optimization stage and want to learn from the performance shift observed at the end of the execution stage.

Different purposes may require different jumbling implementations, but users may want to use such extensions at the same time. So we should allow many different query IDs to be attached to a query (maybe it is better to call it a 'query label'?).

Thinking about it for a while, I came up with three different ways to implement this:

1. Make queryId a trivial 64-bit counter. Each extension can then distinguish every query from any other, track it along all hooks, apply its own jumbling code and store its queryId internally. The one big problem I see here is the growing overhead when there are many consumers of the queryId feature.

2. Instead of a single queryId, store a list of pairs (QueryId, funcOid). An extension can register a callback for queryId generation and the core will build the list of queryIds right after query tree rewriting. funcOid is needed to distinguish jumbling implementations. Here we would have to invent an additional node type for an element of the list.

3. Instead of queryId, add a multi-purpose 'private' list to the Query struct. Any extension can add additional object(s) to this list (with a registered object type, of course). As an example, I can imagine a convention for queryIds in such a case - store a String node with the value '<extension name> - <Query ID>'. This way we would have to implement a registered callback mechanism too.

I think the third way is the cheapest, most flexible and simplest to implement.

Any thoughts, comments, criticism?

--
regards,
Andrey Lepikhov
Postgres Professional
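To make the third option a little more concrete, here is a minimal sketch of how an extension might attach such a label, assuming a hypothetical ext_labels List field added to struct Query; the field and helper names are illustrative only and not part of any patch:

/*
 * Minimal sketch of option 3.  Assumes a hypothetical "List *ext_labels"
 * field added to struct Query; only existing core primitives are used.
 */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "nodes/value.h"

static void
attach_query_label(Query *query, const char *extname, uint64 queryid)
{
    /* proposed convention: '<extension name> - <Query ID>' */
    char   *label = psprintf("%s - " UINT64_FORMAT, extname, queryid);

    query->ext_labels = lappend(query->ext_labels, makeString(label));
}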
Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> On 5/1/2022 10:13, Tom Lane wrote:
>> I feel like we need to get away from the idea that there is just
>> one query hash, and somehow let different extensions attach
>> differently-calculated hashes to a query. I don't have any immediate
>> ideas about how to do that in a reasonably inexpensive way.

> Thinking about it for a while, I came up with three different ways to
> implement this:
> 1. Make queryId a trivial 64-bit counter.

This seems pretty useless. The whole point of the query hash, at least for many use-cases, is to allow recognizing queries that are the same or similar.

> 2. Instead of a single queryId, store a list of pairs (QueryId, funcOid).
> An extension can register a callback for queryId generation and the core
> will build the list of queryIds right after query tree rewriting. funcOid
> is needed to distinguish jumbling implementations. Here we would have to
> invent an additional node type for an element of the list.

I'm not sure that funcOid is a reasonable way to tag different hash calculation methods, because it isn't necessarily stable across installations. For the same reason, it'd be hard for two extensions to collaborate on a common query-hash definition.

> 3. Instead of queryId, add a multi-purpose 'private' list to the Query
> struct. Any extension can add additional object(s) to this list (with a
> registered object type, of course). As an example, I can imagine a
> convention for queryIds in such a case - store a String node with the
> value '<extension name> - <Query ID>'.

Again, this is presuming that every extension is totally independent and has no interest in what any other code is doing. But I don't think we want to make every extension that wants a hash duplicate the whole of queryjumble.c.

The idea I'd been vaguely thinking about is to allow attaching a list of query-hash nodes to a Query, where each node would contain a "tag" identifying the specific hash calculation method, and also the value of the query's hash calculated according to that method. We could probably get away with saying that all such hash values must be uint64. The main difference from your function-OID idea, I think, is that I'm envisioning the tags as being small integers with well-known values, similarly to the way we manage stakind values in pg_statistic. In this way, an extension that wants a hash that the core knows how to calculate doesn't need its own copy of the code, and similarly one extension could publish a calculation method for use by other extensions.

We'd also need some mechanism for registering a function to be used to calculate the hash for any given tag value, of course.

        regards, tom lane
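For illustration, the structure described above might look roughly like the sketch below; none of these names or tag values exist in core, they are only meant to make the stakind analogy concrete:

/*
 * Sketch of a per-method query hash node and its registration interface.
 * The names and the fixed tag values are illustrative, not a proposal of
 * actual identifiers.
 */
#include "postgres.h"
#include "nodes/parsenodes.h"

typedef struct QueryHash
{
    NodeTag     type;
    int         tag;        /* well-known calculation method, stakind-style */
    uint64      value;      /* hash computed by that method */
} QueryHash;

/* hypothetical well-known tags, analogous to STATISTIC_KIND_* constants */
#define QUERYHASH_CORE_JUMBLE   1   /* result of in-core queryjumble.c */
#define QUERYHASH_MONITORING    2   /* published for monitoring consumers */

/* hypothetical registration of the calculation function for a tag */
typedef uint64 (*query_hash_fn) (Query *query, const char *query_text);
extern void RegisterQueryHashMethod(int tag, query_hash_fn fn);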
> On Sat, Jan 08, 2022 at 07:49:59PM -0500, Tom Lane wrote:
>
> The idea I'd been vaguely thinking about is to allow attaching a list
> of query-hash nodes to a Query, where each node would contain a "tag"
> identifying the specific hash calculation method, and also the value
> of the query's hash calculated according to that method. We could
> probably get away with saying that all such hash values must be uint64.
> The main difference from your function-OID idea, I think, is that
> I'm envisioning the tags as being small integers with well-known
> values, similarly to the way we manage stakind values in pg_statistic.
> In this way, an extension that wants a hash that the core knows how
> to calculate doesn't need its own copy of the code, and similarly
> one extension could publish a calculation method for use by other
> extensions.

An extension that wants a slightly modified version of the core's hash calculation implementation would still need to copy everything. The core probably has to provide more than one (hash, method) pair to cover some basic needs.
On Sun, Jan 09, 2022 at 12:43:06PM +0100, Dmitry Dolgov wrote:
>
> An extension that wants a slightly modified version of the core's hash
> calculation implementation would still need to copy everything. The
> core probably has to provide more than one (hash, method) pair to cover
> some basic needs.

Or just GUC(s) to adapt the behavior. But in any case there isn't much that can be done that won't result in a huge performance drop (like e.g. the wanted stability across logical replication or backup/restore).
On Sat, Jan 08, 2022 at 07:49:59PM -0500, Tom Lane wrote:
>
> The idea I'd been vaguely thinking about is to allow attaching a list
> of query-hash nodes to a Query, where each node would contain a "tag"
> identifying the specific hash calculation method, and also the value
> of the query's hash calculated according to that method.

For now the queryid mixes two different things: fingerprinting and query text normalization. Should each calculation method be allowed to do a different normalization too, and if yes, where should the state data needed for that be stored? If not, we would need some kind of primary hash for that purpose.

Looking at Andrey's use case for wanting multiple hashes, I don't think that adaptive optimization needs a normalized query string. The only use would be to output some statistics, but this could be achieved by storing a list of "primary queryid" for each adaptive entry. That's probably also true for anything that isn't intended for monitoring. Also, all monitoring consumers should probably agree on the same queryid, both fingerprint and normalized string, as otherwise it's impossible to cross-reference metric data.
On 1/9/22 5:13 PM, Julien Rouhaud wrote:
> For now the queryid mixes two different things: fingerprinting and query
> text normalization. Should each calculation method be allowed to do a
> different normalization too, and if yes, where should the state data
> needed for that be stored? If not, we would need some kind of primary
> hash for that purpose.

Do you mean JumbleState? I think that when registering a queryId generator we should also store a pointer (void **args) to an additional data entry, as usual.

> Looking at Andrey's use case for wanting multiple hashes, I don't think
> that adaptive optimization needs a normalized query string. The only use
> would be to output some statistics, but this could be achieved by storing
> a list of "primary queryid" for each adaptive entry. That's probably also
> true for anything that isn't intended for monitoring. Also, all monitoring
> consumers should probably agree on the same queryid, both fingerprint and
> normalized string, as otherwise it's impossible to cross-reference metric
> data.

I can add one more use case. Our extension for freezing query plans uses a query tree comparison technique to prove that a plan can be applied (so we don't need to run the planning procedure at all). The tree equality check is expensive, so we use the cheaper queryId comparison to identify possible candidates. So here, for better performance and query coverage, we need query tree normalization - the queryId should be stable under modifications of the query text that do not change its semantics. As an example, a query plan with external parameters can be used to execute a query with constants if those constants correspond by position and type to the parameters. So the queryId calculation also returns pointers to all constants and parameters found during the calculation.

--
regards,
Andrey Lepikhov
Postgres Professional
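As a point of reference, the constant/parameter tracking described here resembles what the in-core jumbling already records; a rough sketch of such a calculation state, with illustrative names only, could be:

/*
 * Sketch of a calculation state that remembers where each Const or Param
 * was found while hashing, loosely modelled on the core JumbleState.
 * Field and function names are illustrative; the state is assumed to be
 * initialized with a small palloc'd array before use.
 */
#include "postgres.h"

typedef struct QueryLabelState
{
    uint64      hash;               /* running hash value */
    int        *const_locations;    /* token locations of Consts and Params */
    int         nlocations;
    int         maxlocations;
} QueryLabelState;

static void
record_location(QueryLabelState *state, int location)
{
    if (state->nlocations >= state->maxlocations)
    {
        state->maxlocations *= 2;
        state->const_locations = repalloc(state->const_locations,
                                          state->maxlocations * sizeof(int));
    }
    state->const_locations[state->nlocations++] = location;
}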
On Mon, Jan 10, 2022 at 09:10:59AM +0500, Andrey V. Lepikhov wrote:
> On 1/9/22 5:13 PM, Julien Rouhaud wrote:
> > For now the queryid mixes two different things: fingerprinting and query
> > text normalization. Should each calculation method be allowed to do a
> > different normalization too, and if yes, where should the state data
> > needed for that be stored? If not, we would need some kind of primary
> > hash for that purpose.
>
> Do you mean JumbleState?

Yes, or some other abstraction. For instance, with the proposed patch to handle IN clauses differently, you need to change JumbleState.

> I think that when registering a queryId generator we should also store a
> pointer (void **args) to an additional data entry, as usual.

Well, yes, but only if you need to produce different versions of the normalized query text, and I'm not convinced that's really necessary.

> > Looking at Andrey's use case for wanting multiple hashes, I don't think
> > that adaptive optimization needs a normalized query string. The only use
> > would be to output some statistics, but this could be achieved by storing
> > a list of "primary queryid" for each adaptive entry. That's probably also
> > true for anything that isn't intended for monitoring. Also, all monitoring
> > consumers should probably agree on the same queryid, both fingerprint and
> > normalized string, as otherwise it's impossible to cross-reference metric
> > data.
>
> I can add one more use case.
> Our extension for freezing query plans uses a query tree comparison
> technique to prove that a plan can be applied (so we don't need to run
> the planning procedure at all).
> The tree equality check is expensive, so we use the cheaper queryId
> comparison to identify possible candidates. So here, for better
> performance and query coverage, we need query tree normalization - the
> queryId should be stable under modifications of the query text that do
> not change its semantics.
> As an example, a query plan with external parameters can be used to
> execute a query with constants if those constants correspond by position
> and type to the parameters. So the queryId calculation also returns
> pointers to all constants and parameters found during the calculation.

I'm also working on a similar extension, and yes, you can't accept just any fingerprinting approach for that. I don't know what the exact heuristics of your cheaper queryid calculation are, but is it reasonable to use it with something like pg_stat_statements? If yes, you don't really need a two-queryid approach for the sake of this single extension and therefore don't need to store multiple jumble states or similar per statement. Especially since requiring another one would mean a performance drop as soon as you want to use something as common as pg_stat_statements.
On 1/10/22 9:51 AM, Julien Rouhaud wrote:
> On Mon, Jan 10, 2022 at 09:10:59AM +0500, Andrey V. Lepikhov wrote:
>> I can add one more use case.
>> Our extension for freezing query plans uses a query tree comparison
>> technique to prove that a plan can be applied (so we don't need to run
>> the planning procedure at all).
>> The tree equality check is expensive, so we use the cheaper queryId
>> comparison to identify possible candidates. So here, for better
>> performance and query coverage, we need query tree normalization - the
>> queryId should be stable under modifications of the query text that do
>> not change its semantics.
>> As an example, a query plan with external parameters can be used to
>> execute a query with constants if those constants correspond by position
>> and type to the parameters. So the queryId calculation also returns
>> pointers to all constants and parameters found during the calculation.
>
> I'm also working on a similar extension, and yes, you can't accept just
> any fingerprinting approach for that. I don't know what the exact
> heuristics of your cheaper queryid calculation are, but is it reasonable
> to use it with something like pg_stat_statements? If yes, you don't really
> need a two-queryid approach for the sake of this single extension and
> therefore don't need to store multiple jumble states or similar per
> statement. Especially since requiring another one would mean a performance
> drop as soon as you want to use something as common as pg_stat_statements.

I think pg_stat_statements can live with the queryId generator of the sr_plan extension, but it replaces all constants with $XXX parameters in the query string. In our extension the user defines which plan is optimal and which constants can be used as parameters in the plan.
One drawback I see here is that creating or dropping my extension changes the behavior of pg_stat_statements, which distorts the DB load profile. Also, we have no guarantee that another extension will work correctly (or optimally) with such a queryId.

--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Jan 10, 2022 at 12:37:34PM +0500, Andrey V. Lepikhov wrote:
> I think pg_stat_statements can live with the queryId generator of the
> sr_plan extension, but it replaces all constants with $XXX parameters in
> the query string. In our extension the user defines which plan is optimal
> and which constants can be used as parameters in the plan.

I don't know the details of that extension, but I think that it should work as long as you have the constants information in the jumble state, whatever the resulting normalized query string is, right?

> One drawback I see here is that creating or dropping my extension changes
> the behavior of pg_stat_statements, which distorts the DB load profile.
> Also, we have no guarantee that another extension will work correctly (or
> optimally) with such a queryId.

But then, if generating 2 queryids is better for performance, does the extension really need an additional jumble state and/or normalized query string?
On 10/1/2022 13:56, Julien Rouhaud wrote:
> On Mon, Jan 10, 2022 at 12:37:34PM +0500, Andrey V. Lepikhov wrote:
>> I think pg_stat_statements can live with the queryId generator of the
>> sr_plan extension, but it replaces all constants with $XXX parameters in
>> the query string. In our extension the user defines which plan is optimal
>> and which constants can be used as parameters in the plan.
>
> I don't know the details of that extension, but I think that it should
> work as long as you have the constants information in the jumble state,
> whatever the resulting normalized query string is, right?

Yes. The same input query string doesn't prove that the frozen query plan can be used, because the rewrite rules could have changed. So we use only the query tree. Here we must have a custom jumbling implementation. The queryId in this extension determines two things:
1. How many input queries will be compared against the query tree template of the frozen statement.
2. As a consequence, the performance overhead of unsuccessful comparisons.

>> One drawback I see here is that creating or dropping my extension changes
>> the behavior of pg_stat_statements, which distorts the DB load profile.
>> Also, we have no guarantee that another extension will work correctly (or
>> optimally) with such a queryId.
>
> But then, if generating 2 queryids is better for performance, does the
> extension really need an additional jumble state and/or normalized query
> string?

An additional jumble state isn't necessary, but it would be better for performance to collect pointers to all constant nodes during hash generation.

--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Jan 10, 2022 at 03:24:46PM +0500, Andrey Lepikhov wrote:
> On 10/1/2022 13:56, Julien Rouhaud wrote:
> >
> > I don't know the details of that extension, but I think that it should
> > work as long as you have the constants information in the jumble state,
> > whatever the resulting normalized query string is, right?
>
> Yes. The same input query string doesn't prove that the frozen query plan
> can be used, because the rewrite rules could have changed. So we use only
> the query tree.

Yes, I'm fully aware of that. I wasn't asking about using the input query string but about the need for generating a dedicated normalized output query string that would be potentially different from the one generated by pg_stat_statements (or similar).

> > But then, if generating 2 queryids is better for performance, does the
> > extension really need an additional jumble state and/or normalized query
> > string?
>
> An additional jumble state isn't necessary, but it would be better for
> performance to collect pointers to all constant nodes during hash
> generation.

I agree, but it's even better for performance if this is collected only once. I still don't know if this extension (or any extension) would require something different from a common jumble state that would serve all purposes, or if we can work with a single jumble state and only handle multiple queryids.
On 10/1/2022 15:39, Julien Rouhaud wrote:
> On Mon, Jan 10, 2022 at 03:24:46PM +0500, Andrey Lepikhov wrote:
>> On 10/1/2022 13:56, Julien Rouhaud wrote:
>> Yes. The same input query string doesn't prove that the frozen query plan
>> can be used, because the rewrite rules could have changed. So we use only
>> the query tree.
>
> Yes, I'm fully aware of that. I wasn't asking about using the input query
> string but about the need for generating a dedicated normalized output
> query string that would be potentially different from the one generated by
> pg_stat_statements (or similar).

Thanks, now I get it. I don't remember a single situation where we would need to normalize a query string.

>>> But then, if generating 2 queryids is better for performance, does the
>>> extension really need an additional jumble state and/or normalized query
>>> string?
>> An additional jumble state isn't necessary, but it would be better for
>> performance to collect pointers to all constant nodes during hash
>> generation.
>
> I agree, but it's even better for performance if this is collected only
> once. I still don't know if this extension (or any extension) would
> require something different from a common jumble state that would serve
> all purposes, or if we can work with a single jumble state and only handle
> multiple queryids.

I think JumbleState is highly monitoring-specific, maybe even pg_stat_statements-specific (perhaps you can point out other examples). I have no idea how it could be used by an arbitrary extension. But it is not so difficult to imagine an implementation (as part of Tom's approach, described earlier) of such storage for each generation method.

--
regards,
Andrey Lepikhov
Postgres Professional
On 1/9/22 5:49 AM, Tom Lane wrote:
> The idea I'd been vaguely thinking about is to allow attaching a list
> of query-hash nodes to a Query, where each node would contain a "tag"
> identifying the specific hash calculation method, and also the value
> of the query's hash calculated according to that method. We could
> probably get away with saying that all such hash values must be uint64.
> The main difference from your function-OID idea, I think, is that
> I'm envisioning the tags as being small integers with well-known
> values, similarly to the way we manage stakind values in pg_statistic.
> In this way, an extension that wants a hash that the core knows how
> to calculate doesn't need its own copy of the code, and similarly
> one extension could publish a calculation method for use by other
> extensions.

To move forward, I have made a patch that implements this idea (see attachment). It is a POC, but it passes all regression tests.
Registration of a queryId generator is implemented by analogy with the extensible methods machinery. Also, I switched queryId to the int64 type and renamed it to 'label'.
Some lessons learned:
1. The single-queryId implementation is deeply entangled with the core code (the stats reporting machinery and parallel workers, for example).
2. We need a custom queryId that is based on a generated queryId (following the logic of pg_stat_statements).
3. We should think about the safety of the de-registration procedure.
4. We should reserve a position for the default in-core generator and think about the logic of enabling/disabling it.
5. We should add an EXPLAIN hook to allow an extension to print this custom queryId.

--
regards,
Andrey Lepikhov
Postgres Professional
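As an illustration of what registration "by analogy with the extensible methods machinery" could look like from an extension's point of view, here is a sketch in the spirit of RegisterExtensibleNodeMethods(); it is a guess at the general shape, not code from the attached patch, and all names are hypothetical:

/*
 * Hypothetical generator registration, modelled on
 * RegisterExtensibleNodeMethods().  Not taken from the POC patch.
 */
#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"

PG_MODULE_MAGIC;

typedef struct QueryLabelMethods
{
    const char *extname;    /* owner of this generator */
    int         kind;       /* tag the result is stored under */
    int64       (*generate) (Query *query, const char *query_text);
} QueryLabelMethods;

/* hypothetical core entry point */
extern void RegisterQueryLabelMethods(const QueryLabelMethods *methods);

static int64
my_generator(Query *query, const char *query_text)
{
    /* extension-specific jumbling would go here */
    return 0;
}

static const QueryLabelMethods my_methods = {
    .extname = "my_extension",
    .kind = 42,             /* illustrative, non-reserved kind */
    .generate = my_generator
};

void
_PG_init(void)
{
    RegisterQueryLabelMethods(&my_methods);
}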
> On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> On 1/9/22 5:49 AM, Tom Lane wrote:
> > The idea I'd been vaguely thinking about is to allow attaching a list
> > of query-hash nodes to a Query, where each node would contain a "tag"
> > identifying the specific hash calculation method, and also the value
> > of the query's hash calculated according to that method. We could
> > probably get away with saying that all such hash values must be uint64.
> > The main difference from your function-OID idea, I think, is that
> > I'm envisioning the tags as being small integers with well-known
> > values, similarly to the way we manage stakind values in pg_statistic.
> > In this way, an extension that wants a hash that the core knows how
> > to calculate doesn't need its own copy of the code, and similarly
> > one extension could publish a calculation method for use by other
> > extensions.
>
> To move forward, I have made a patch that implements this idea (see
> attachment). It is a POC, but it passes all regression tests.

Thanks. A couple of comments off the top of my head:

> Registration of a queryId generator is implemented by analogy with the
> extensible methods machinery.

Why not do it more like what was suggested above, with stakind-style slots in some data structure? All of those generators have to be iterated anyway, so I'm not sure a hash table makes sense.

> Also, I switched queryId to the int64 type and renamed it to 'label'.

A name with "id" in it would be better, I believe. "Label" could be thought of as "the query belongs to a certain category", while the purpose is identification.

> 2. We need a custom queryId that is based on a generated queryId
> (following the logic of pg_stat_statements).

Could you clarify?

> 4. We should reserve a position for the default in-core generator

From the discussion above I was under the impression that the core generator should be distinguished by a predefined kind.

> 5. We should add an EXPLAIN hook to allow an extension to print this
> custom queryId.

Why? It would make sense if the custom generation code were generating some complex structure, but the queryId itself is still a hash.
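For comparison, the slot-style alternative mentioned above could be as simple as a small fixed array, analogous to stakind1..stakindN in pg_statistic; the slot count and names below are illustrative only:

/*
 * Sketch of the slot-style alternative: a small fixed array that would
 * hang off struct Query.  Slot count and names are illustrative.
 */
#include "postgres.h"

#define QUERY_ID_SLOTS  5

typedef struct QueryIdSlot
{
    int     kind;       /* 0 = unused, else a well-known method id */
    int64   value;      /* id produced by that method */
} QueryIdSlot;

/* hypothetical member of struct Query: QueryIdSlot queryIds[QUERY_ID_SLOTS] */

static bool
set_query_id(QueryIdSlot *slots, int kind, int64 value)
{
    for (int i = 0; i < QUERY_ID_SLOTS; i++)
    {
        if (slots[i].kind == 0 || slots[i].kind == kind)
        {
            slots[i].kind = kind;
            slots[i].value = value;
            return true;
        }
    }
    return false;       /* no free slot */
}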
Hi,

On Fri, Jan 28, 2022 at 05:51:56PM +0100, Dmitry Dolgov wrote:
> > On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> > On 1/9/22 5:49 AM, Tom Lane wrote:
> > > The idea I'd been vaguely thinking about is to allow attaching a list
> > > of query-hash nodes to a Query, where each node would contain a "tag"
> > > identifying the specific hash calculation method, and also the value
> > > of the query's hash calculated according to that method. We could
> > > probably get away with saying that all such hash values must be uint64.
> > > The main difference from your function-OID idea, I think, is that
> > > I'm envisioning the tags as being small integers with well-known
> > > values, similarly to the way we manage stakind values in pg_statistic.
> > > In this way, an extension that wants a hash that the core knows how
> > > to calculate doesn't need its own copy of the code, and similarly
> > > one extension could publish a calculation method for use by other
> > > extensions.
> >
> > To move forward, I have made a patch that implements this idea (see
> > attachment). It is a POC, but it passes all regression tests.
> [...]
> > 4. We should reserve a position for the default in-core generator
>
> From the discussion above I was under the impression that the core
> generator should be distinguished by a predefined kind.

I don't really like this approach. IIUC, this patch as-is is meant to break pg_stat_statements extensibility. If kind == 0 is reserved for the in-core queryid, then you can't use pg_stat_statements with a different queryid generator anymore. Unless you meant that kind == 0 is reserved for monitoring / pg_stat_statements purposes, and custom extensions should register that specific kind too if that's what they are supposed to implement?

The patch also reserves kind == -1 for pg_stat_statements' internal purposes, and I'm not really sure why that's needed. Are additional extensions, ones that want to agree with pg_stat_statements on what the monitoring queryid is, supposed to do that too?

I'm also unsure of how extensions are supposed to cooperate in general, as I feel that queryids should be implemented for some "intent" (like monitoring, planning optimization...). That being said, I'm not sure whether e.g. the AQO heuristics are too specific to its needs or whether they could be shared with other extensions that might be dealing with similar concerns.
> On Sat, Jan 29, 2022 at 03:51:33PM +0800, Julien Rouhaud wrote:
> Hi,
>
> On Fri, Jan 28, 2022 at 05:51:56PM +0100, Dmitry Dolgov wrote:
> > > On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> > > On 1/9/22 5:49 AM, Tom Lane wrote:
> > > > The idea I'd been vaguely thinking about is to allow attaching a list
> > > > of query-hash nodes to a Query, where each node would contain a "tag"
> > > > identifying the specific hash calculation method, and also the value
> > > > of the query's hash calculated according to that method. We could
> > > > probably get away with saying that all such hash values must be uint64.
> > > > The main difference from your function-OID idea, I think, is that
> > > > I'm envisioning the tags as being small integers with well-known
> > > > values, similarly to the way we manage stakind values in pg_statistic.
> > > > In this way, an extension that wants a hash that the core knows how
> > > > to calculate doesn't need its own copy of the code, and similarly
> > > > one extension could publish a calculation method for use by other
> > > > extensions.
> > >
> > > To move forward, I have made a patch that implements this idea (see
> > > attachment). It is a POC, but it passes all regression tests.
> > [...]
> > > 4. We should reserve a position for the default in-core generator
> >
> > From the discussion above I was under the impression that the core
> > generator should be distinguished by a predefined kind.
>
> I don't really like this approach. IIUC, this patch as-is is meant to
> break pg_stat_statements extensibility. If kind == 0 is reserved for the
> in-core queryid, then you can't use pg_stat_statements with a different
> queryid generator anymore. Unless you meant that kind == 0 is reserved for
> monitoring / pg_stat_statements purposes, and custom extensions should
> register that specific kind too if that's what they are supposed to
> implement?
>
> [...]
>
> I'm also unsure of how extensions are supposed to cooperate in general, as
> I feel that queryids should be implemented for some "intent" (like
> monitoring, planning optimization...). That being said, I'm not sure
> whether e.g. the AQO heuristics are too specific to its needs or whether
> they could be shared with other extensions that might be dealing with
> similar concerns.

Assuming there are multiple providers and consumers of queryIds, every such consumer extension needs to know which type of queryId it wants to use. E.g. in the case of pg_stat_statements, it needs to be somehow configured to know which of those kinds to take, to preserve the extensibility you're talking about. Does the answer make sense, or did you mean something else?

Btw, the approach in this thread still doesn't give a clue what to do when an extension needs to reuse some parts of the core queryId generator, as in the case of pg_stat_statements and "IN" condition merging.
Hi,

On Sat, Jan 29, 2022 at 06:12:05PM +0100, Dmitry Dolgov wrote:
> > On Sat, Jan 29, 2022 at 03:51:33PM +0800, Julien Rouhaud wrote:
> >
> > I'm also unsure of how extensions are supposed to cooperate in general,
> > as I feel that queryids should be implemented for some "intent" (like
> > monitoring, planning optimization...). That being said, I'm not sure
> > whether e.g. the AQO heuristics are too specific to its needs or whether
> > they could be shared with other extensions that might be dealing with
> > similar concerns.
>
> Assuming there are multiple providers and consumers of queryIds, every
> such consumer extension needs to know which type of queryId it wants to
> use. E.g. in the case of pg_stat_statements, it needs to be somehow
> configured to know which of those kinds to take, to preserve the
> extensibility you're talking about. Does the answer make sense, or did
> you mean something else?

I guess, but I don't think that the proposed approach does that. The DBA should be able to configure a monitoring queryid provider, a planning queryid provider... and the extensions should have a way to know which is which. And I also don't think that the DBA should be allowed to set up multiple monitoring queryid providers, nor to change them dynamically.

> Btw, the approach in this thread still doesn't give a clue what to do
> when an extension needs to reuse some parts of the core queryId
> generator, as in the case of pg_stat_statements and "IN" condition
> merging.

Indeed, as the query text normalization is not extensible.
> On Sun, Jan 30, 2022 at 01:48:20AM +0800, Julien Rouhaud wrote:
> Hi,
>
> On Sat, Jan 29, 2022 at 06:12:05PM +0100, Dmitry Dolgov wrote:
> > > On Sat, Jan 29, 2022 at 03:51:33PM +0800, Julien Rouhaud wrote:
> > >
> > > I'm also unsure of how extensions are supposed to cooperate in
> > > general, as I feel that queryids should be implemented for some
> > > "intent" (like monitoring, planning optimization...). That being said,
> > > I'm not sure whether e.g. the AQO heuristics are too specific to its
> > > needs or whether they could be shared with other extensions that might
> > > be dealing with similar concerns.
> >
> > Assuming there are multiple providers and consumers of queryIds, every
> > such consumer extension needs to know which type of queryId it wants to
> > use. E.g. in the case of pg_stat_statements, it needs to be somehow
> > configured to know which of those kinds to take, to preserve the
> > extensibility you're talking about. Does the answer make sense, or did
> > you mean something else?
>
> I guess, but I don't think that the proposed approach does that.

Yes, it doesn't; I'm just channeling my understanding of the problem.
On 1/28/22 9:51 PM, Dmitry Dolgov wrote:
>> On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
>> Registration of a queryId generator is implemented by analogy with the
>> extensible methods machinery.
>
> Why not do it more like what was suggested above, with stakind-style
> slots in some data structure? All of those generators have to be
> iterated anyway, so I'm not sure a hash table makes sense.

Maybe, but it is not obvious. We don't really know how many extensions might set a queryId.
For example, adaptive planning extensions definitely want to set a unique id (a simplistic counter, for example) to trace a specific {query, plan} across all executions (remember the plancache too). And they would register their own generator for that purpose.

>> Also, I switched queryId to the int64 type and renamed it to 'label'.
>
> A name with "id" in it would be better, I believe. "Label" could be
> thought of as "the query belongs to a certain category", while the
> purpose is identification.

I think that is not entirely true. The current jumbling does not generate a unique queryId (intentionally, I hope), and pg_stat_statements uses the queryId to group queries into classes. To track a specific query along the execution path it has to make additional efforts (remembering the query nesting level, for example).
BTW, earlier [1] I tried to improve queryId so that it could be stable under permutations of tables in the 'FROM' clause and so on. That would allow reducing the number of pg_stat_statements entries (a critical factor when you use an ORM, like 1C for example). So I think queryId is an id and a category at the same time.

>> 2. We need a custom queryId that is based on a generated queryId
>> (following the logic of pg_stat_statements).
>
> Could you clarify?

pg_stat_statements takes the original queryId and changes it for its own reasons (sometimes zeroing it, sometimes not). So you can't use this value in another extension and be confident that you are using the original value generated by JumbleQuery(). A custom queryId solves this problem.

>> 4. We should reserve a position for the default in-core generator
>
> From the discussion above I was under the impression that the core
> generator should be distinguished by a predefined kind.

Yes, but I think we should have a range of values, enough for use by third-party extensions.

>> 5. We should add an EXPLAIN hook to allow an extension to print this
>> custom queryId.
>
> Why? It would make sense if the custom generation code were generating
> some complex structure, but the queryId itself is still a hash.

An extension could print not only the queryId, but also an explanation of its kind, maybe with additional logic. Moreover, why shouldn't an extension be able to show some useful monitoring data, collected during query execution, in verbose mode?

[1] https://www.postgresql.org/message-id/flat/e50c1e8f-e5d6-5988-48fa-63dd992e9565%40postgrespro.ru

--
regards,
Andrey Lepikhov
Postgres Professional
> On Mon, Jan 31, 2022 at 02:59:17PM +0500, Andrey V. Lepikhov wrote:
> On 1/28/22 9:51 PM, Dmitry Dolgov wrote:
> > > On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> > > Registration of a queryId generator is implemented by analogy with the
> > > extensible methods machinery.
> >
> > Why not do it more like what was suggested above, with stakind-style
> > slots in some data structure? All of those generators have to be
> > iterated anyway, so I'm not sure a hash table makes sense.
>
> Maybe, but it is not obvious. We don't really know how many extensions
> might set a queryId.
> For example, adaptive planning extensions definitely want to set a unique
> id (a simplistic counter, for example) to trace a specific {query, plan}
> across all executions (remember the plancache too). And they would
> register their own generator for that purpose.

I don't see how the number of extensions justifies a hash table? I mean, all of the generators will still most likely be executed at once (not on demand) and store the result somewhere. In any case I would like to remind you that this functionality wants to be on a pretty hot path, which means strong evidence of no significant performance overhead is needed. And sure, there could be multiple queryId consumers, but in the vast majority of cases only one queryId will be generated.

> > > 2. We need a custom queryId that is based on a generated queryId
> > > (following the logic of pg_stat_statements).
> >
> > Could you clarify?
>
> pg_stat_statements takes the original queryId and changes it for its own
> reasons (sometimes zeroing it, sometimes not).

IIRC it does that only for utility statements. I still fail to see the problem: why would a custom extension not register a new generator if it needs something different?

> > > 5. We should add an EXPLAIN hook to allow an extension to print this
> > > custom queryId.
> >
> > Why? It would make sense if the custom generation code were generating
> > some complex structure, but the queryId itself is still a hash.
>
> An extension could print not only the queryId, but also an explanation of
> its kind, maybe with additional logic.
> Moreover, why shouldn't an extension be able to show some useful
> monitoring data, collected during query execution, in verbose mode?

A good recommendation, which pops up every now and then on -hackers, is to concentrate on what is immediately useful, leaving "maybe useful" things for later. I have a strong feeling that this is the case here.
On 29/1/2022 12:51, Julien Rouhaud wrote:
>>> 4. We should reserve a position for the default in-core generator
>>
>> From the discussion above I was under the impression that the core
>> generator should be distinguished by a predefined kind.
>
> I don't really like this approach. IIUC, this patch as-is is meant to
> break pg_stat_statements extensibility. If kind == 0 is reserved for the
> in-core queryid, then you can't use pg_stat_statements with a different
> queryid generator anymore.

Yes, that is one more problem. Maybe, if we want to make it extensible, we could think about hooks in pg_stat_statements too?

> The patch also reserves kind == -1 for pg_stat_statements' internal
> purposes, and I'm not really sure why that's needed.

My idea was that tags with positive numbers are reserved for generation results, which are performance-critical. As I saw during the implementation, pg_stat_statements makes additional changes to the queryId (no matter which ones). Because our purpose is to avoid interference here, I introduced negative values, where extensions can store their queryIds, whether based on a generator or not. Maybe it is redundant - the main idea here was to highlight the issue.

> I'm also unsure of how extensions are supposed to cooperate in general, as
> I feel that queryids should be implemented for some "intent" (like
> monitoring, planning optimization...). That being said, I'm not sure
> whether e.g. the AQO heuristics are too specific to its needs or whether
> they could be shared with other extensions that might be dealing with
> similar concerns.

I think it depends on the specific purpose of the extension.

--
regards,
Andrey Lepikhov
Postgres Professional