Thread: [POC] Allow an extension to add data into Query and PlannedStmt nodes

[POC] Allow an extension to add data into Query and PlannedStmt nodes

From
Andrey Lepikhov
Date:
Hi,

Previously, we read int this mailing list some controversial opinions on 
queryid generation and Jumbling technique. Here we don't intend to solve 
these problems but help an extension at least don't conflict with others 
on the queryId value.

Extensions could need to pass some query-related data through all stages 
of the query planning and execution. As a trivial example, 
pg_stat_statements uses queryid at the end of execution to save some 
statistics. One more reason - extensions now conflict on queryid value 
and the logic of its changing. With this patch, it can be managed.

This patch introduces the structure 'ExtensionData' which allows to 
manage of a list of entries with a couple of interface functions 
addExtensionDataToNode() and GetExtensionData(). Keep in mind the 
possible future hiding of this structure from the public interface.
An extension should invent a symbolic key to identify its data. It may 
invent as many additional keys as it wants but the best option here - is 
no more than one entry for each extension.
Usage of this machinery is demonstrated by the pg_stat_statements 
example - here we introduced Bigint node just for natively storing of 
queryId value.

Ruthless pgbench benchmark shows that we got some overhead:
1.6% - in default mode
4% - in prepared mode
~0.1% in extended mode.

An optimization that avoids copying of queryId by storing it into the 
node pointer field directly allows to keep this overhead in a range of 
%0.5 for all these modes but increases complexity. So here we 
demonstrate not optimized variant.

Some questions still cause doubts:
- QueryRewrite: should we copy extension fields from the parent 
parsetree to the rewritten ones?
- Are we need to invent a registration procedure to do away with the 
names of entries and use some compact integer IDs?
- Do we need to optimize this structure to avoid a copy for simple data 
types, for example, inventing something like A_Const?

All in all, in our opinion, this issue is tend to grow with an 
increasing number of extensions that utilize planner and executor hooks 
for some purposes. So, any thoughts will be useful.

-- 
Regards
Andrey Lepikhov
Postgres Professional
Attachment

Re: [POC] Allow an extension to add data into Query and PlannedStmt nodes

From
Julien Rouhaud
Date:
Hi,

On Wed, Mar 29, 2023 at 12:02:30PM +0500, Andrey Lepikhov wrote:
>
> Previously, we read int this mailing list some controversial opinions on
> queryid generation and Jumbling technique. Here we don't intend to solve
> these problems but help an extension at least don't conflict with others on
> the queryId value.
>
> Extensions could need to pass some query-related data through all stages of
> the query planning and execution. As a trivial example, pg_stat_statements
> uses queryid at the end of execution to save some statistics. One more
> reason - extensions now conflict on queryid value and the logic of its
> changing. With this patch, it can be managed.

I just had a quick lookc at the patch, and IIUC it doesn't really help on that
side, as there's still a single official "queryid" that's computed, stored
everywhere and later used by pg_stat_statements (which does then store in
additionally to that official queryid).

You can currently change the main jumbling algorithm with a custom extension,
and all extensions will then use it as the source of truth, but I guess that
what you want is to e.g. have an additional and semantically different queryid,
and create multiple ecosystem of extensions, each using one or the other source
of queryid without changing the other ecosystem behavior.
>
> This patch introduces the structure 'ExtensionData' which allows to manage
> of a list of entries with a couple of interface functions
> addExtensionDataToNode() and GetExtensionData(). Keep in mind the possible
> future hiding of this structure from the public interface.
> An extension should invent a symbolic key to identify its data. It may
> invent as many additional keys as it wants but the best option here - is no
> more than one entry for each extension.
> Usage of this machinery is demonstrated by the pg_stat_statements example -
> here we introduced Bigint node just for natively storing of queryId value.
>
> Ruthless pgbench benchmark shows that we got some overhead:
> 1.6% - in default mode
> 4% - in prepared mode
> ~0.1% in extended mode.

That's a quite significant overhead.  But the only reason to accept such a
change is to actually use it to store additional data, so it would be
interesting to see what the overhead is like once you store at least 2
different values there.

> - Are we need to invent a registration procedure to do away with the names
> of entries and use some compact integer IDs?

Note that the patch as proposed doesn't have any defense for two extensions
trying to register something with the same name, or update a stored value, as
AddExtensionDataToNode() simply prepend the new value to the list.  You can
actually update the value by just storing the new value, but it will add a
significant overhead to every other extension that want to read another value.

The API is also quite limited as each stored value has a single identifier.
What if your extension needs to store multiple things?  Since it's all based on
Node you can't really store some custom struct, so you probably have to end up
with things like "my_extension.my_val1", "my_extension.my_val2" which isn't
great.



Re: [POC] Allow an extension to add data into Query and PlannedStmt nodes

From
Andrey Lepikhov
Date:
On 30/3/2023 12:57, Julien Rouhaud wrote:
>> Extensions could need to pass some query-related data through all stages of
>> the query planning and execution. As a trivial example, pg_stat_statements
>> uses queryid at the end of execution to save some statistics. One more
>> reason - extensions now conflict on queryid value and the logic of its
>> changing. With this patch, it can be managed.
> 
> I just had a quick lookc at the patch, and IIUC it doesn't really help on that
> side, as there's still a single official "queryid" that's computed, stored
> everywhere and later used by pg_stat_statements (which does then store in
> additionally to that official queryid).
Thank you for the attention!
This patch doesn't try to solve the problem of oneness of queryId. In 
this patch we change pg_stat_statements and it doesn't set 0 into 
queryId field according to its internal logic. And another extension 
should do the same - use queryId on your own but not erase it - erase 
your private copy in the ext_field.

>> Ruthless pgbench benchmark shows that we got some overhead:
>> 1.6% - in default mode
>> 4% - in prepared mode
>> ~0.1% in extended mode.
> 
> That's a quite significant overhead.  But the only reason to accept such a
> change is to actually use it to store additional data, so it would be
> interesting to see what the overhead is like once you store at least 2
> different values there.
Yeah, but as I said earlier, it can be reduced to much smaller value 
just with simple optimization. Here I intentionally avoid it to discuss 
the core concept.
> 
>> - Are we need to invent a registration procedure to do away with the names
>> of entries and use some compact integer IDs?
> 
> Note that the patch as proposed doesn't have any defense for two extensions
> trying to register something with the same name, or update a stored value, as
> AddExtensionDataToNode() simply prepend the new value to the list.  You can
> actually update the value by just storing the new value, but it will add a
> significant overhead to every other extension that want to read another value.
Thanks a lot! Patch in attachment implements such an idea - extension 
can allocate some entries and use these private IDs to add entries. I 
hope, an extension would prefer to use only one entry for all the data 
to manage overhead, but still.
> 
> The API is also quite limited as each stored value has a single identifier.
> What if your extension needs to store multiple things?  Since it's all based on
> Node you can't really store some custom struct, so you probably have to end up
> with things like "my_extension.my_val1", "my_extension.my_val2" which isn't
> great.
Main idea here - if an extension holds custom struct and want to pass it 
along all planning and execution stages it should use extensible node 
with custom read/write/copy routines.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment