Thread: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Hi, all!
It seems I have found a bug in the Query ID calculation.
Problem
=======
In some cases, we can get the same ID for non-identical query trees.
For two structurally similar query subnodes, we may have query
trees that look like this:
----
QueryA->subNodeOne = Value X;
QueryA->subNodeTwo = NULL;
QueryB->subNodeOne = NULL;
QueryB->subNodeTwo = Value X;
----
When the query jumble walker calculates the query ID, it traverses the
Query members from top to bottom and generates the same IDs for these
two queries because it does not change the hash value when visiting
an empty node (= NULL).
Here is an example (the accumulation into jstate->jumble is omitted):
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = BBBB; */
/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = AAAA; */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = BBBB */
----
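The collision mechanics can be reproduced with a toy model of the walk. In the sketch below, hash64() and ToyQuery are invented stand-ins for hash_any_extended() and Query, used only to illustrate why skipping NULL children makes the two traversals feed the hash identical bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyQuery
{
	const int  *subNodeOne;		/* NULL-able child */
	const int  *subNodeTwo;		/* NULL-able child */
} ToyQuery;

/* Mimics the buggy walk: a NULL child contributes nothing to the hash. */
static uint64_t toy_jumble(const ToyQuery *q)
{
	uint64_t	id = 0;

	if (q->subNodeOne)
		id = hash64(q->subNodeOne, sizeof(*q->subNodeOne), id);
	if (q->subNodeTwo)
		id = hash64(q->subNodeTwo, sizeof(*q->subNodeTwo), id);
	return id;
}
```

With Value X in subNodeOne of one query and in subNodeTwo of the other, both walks hash exactly the same byte sequence, so the toy IDs collide just like the real query IDs do.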
There are two pairs of subnodes that can trigger this error:
- distinctClause and sortClause (both contain a list of SortGroupClauses);
- limitOffset and limitCount (both contain an int8 expression).
Here are examples of such collisions (all queries were run on REL_17_0):
----
SET compute_query_id = on;
/* DISTINCT / ORDER BY *************************************/
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
/* Query Identifier: 751948508603549510 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
/* Query Identifier: 751948508603549510 */
/* LIMIT / OFFSET ******************************************/
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
/* Query Identifier: 5185884322440896420 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
/* Query Identifier: 5185884322440896420 */
----
Solution One
============
The simplest way to fix the problem is to place the scalar field used in
the query ID calculation between similar subnodes. A patch for this
solution is attached below (0001-Query-ID-Calculation-Fix-Variant-A.patch).
Here is an example:
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->scalar = Value Y;
/* QueryA_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryA_ID) = CCCC */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = CCCC; */
/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = AAAA; */
QueryB->scalar = Value Y;
/* QueryB_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryB_ID) = DDDD */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = EEEE */
----
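Under the same toy model (hash64() and the Toy structs are illustrative stand-ins, not the real jumble code), Variant A amounts to jumbling an always-present scalar between the two NULL-able children, which pins each child to a distinct position in the byte stream:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyQueryA
{
	const int  *subNodeOne;		/* NULL-able child */
	int			scalar;			/* always jumbled, like limitOption */
	const int  *subNodeTwo;		/* NULL-able child */
} ToyQueryA;

static uint64_t toy_jumble_a(const ToyQueryA *q)
{
	uint64_t	id = 0;

	if (q->subNodeOne)
		id = hash64(q->subNodeOne, sizeof(*q->subNodeOne), id);
	/* the scalar always lands between the two children */
	id = hash64(&q->scalar, sizeof(q->scalar), id);
	if (q->subNodeTwo)
		id = hash64(q->subNodeTwo, sizeof(*q->subNodeTwo), id);
	return id;
}
```

Because the scalar's bytes now precede subNodeTwo but follow subNodeOne, the two traversals feed different byte sequences to the hash and the IDs diverge.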
Solution Two
============
Alternatively, we can change the hash sum when we encounter an empty node.
This approach may impact performance but will protect us from such errors
in the future. A patch for this solution is attached below
(0001-Query-ID-Calculation-Fix-Variant-B.patch).
Here is an example:
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = hash_any_extended(&('\0'), sizeof(char), QueryA_ID) = CCCC */
/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = hash_any_extended(&('\0'), sizeof(char), QueryB_ID) = DDDD */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = EEEE */
----
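Variant B can be sketched in the same toy model (hash64() and ToyQuery are illustrative stand-ins, not the patch itself): every NULL child appends one '\0' byte instead of nothing, so the position of the non-NULL child is preserved in the byte stream:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyQuery
{
	const int  *subNodeOne;		/* NULL-able child */
	const int  *subNodeTwo;		/* NULL-able child */
} ToyQuery;

static uint64_t toy_jumble_b(const ToyQuery *q)
{
	const unsigned char nul = '\0';
	uint64_t	id = 0;

	if (q->subNodeOne)
		id = hash64(q->subNodeOne, sizeof(*q->subNodeOne), id);
	else
		id = hash64(&nul, 1, id);	/* NULL now changes the hash too */
	if (q->subNodeTwo)
		id = hash64(q->subNodeTwo, sizeof(*q->subNodeTwo), id);
	else
		id = hash64(&nul, 1, id);
	return id;
}
```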
Hello!
Last time, I forgot to attach the patches.
The problem still persists in the 17.3 release.
Solution One
============
The simplest way to fix the problem is to place the scalar field used in the query ID calculation
between similar subnodes.
A patch for this solution is attached below (0001-Query-ID-Calculation-Fix-Variant-A.patch).
Solution Two
============
Alternatively, we can change the hash sum when we encounter an empty node.
This approach may impact performance but will protect us from such errors in the future.
A patch for this solution is attached below (0001-Query-ID-Calculation-Fix-Variant-B.patch).
======
SELECT version();
version
-------------------------------------------------------------------------------------------------
PostgreSQL 17.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
SET compute_query_id = on;
/* LIMIT / OFFSET */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------
Limit (cost=0.00..0.04 rows=1 width=4)
Output: oid
-> Seq Scan on pg_catalog.pg_class (cost=0.00..18.15 rows=415 width=4)
Output: oid
Query Identifier: 5185884322440896420
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
QUERY PLAN
----------------------------------------------------------------------------
Limit (cost=0.04..18.15 rows=414 width=4)
Output: oid
-> Seq Scan on pg_catalog.pg_class (cost=0.00..18.15 rows=415 width=4)
Output: oid
Query Identifier: 5185884322440896420
/* DISTINCT / ORDER BY */
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Unique (cost=0.27..23.54 rows=415 width=4)
Output: oid
-> Index Only Scan using pg_class_oid_index on pg_catalog.pg_class (cost=0.27..22.50 rows=415 width=4)
Output: oid
Query Identifier: 751948508603549510
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
QUERY PLAN
------------------------------------------------------------------------------------------------------
Index Only Scan using pg_class_oid_index on pg_catalog.pg_class (cost=0.27..22.50 rows=415 width=4)
Output: oid
Query Identifier: 751948508603549510
Here is the bug description from
https://www.postgresql.org/message-id/flat/ca447b72d15745b9a877fad7e258407a%40localhost.localdomain
From: Быков Иван Александрович
Sent: Thursday, March 6, 2025 7:32 PM
To: 'pgsql-hackers@postgresql.org' <pgsql-hackers@postgresql.org>
Subject: RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Here are
0001-Query-ID-Calculation-Fix-Variant-A.patch
and
0001-Query-ID-Calculation-Fix-Variant-B.patch.
Hi,

It seems like there are multiple threads on this topic. This is the
original [0], but I suggest continuing the discussion in this thread
since it includes the examples and patches.

Regarding the issue itself, query jumbling behavior is often subjective,
making it difficult to classify as a bug. I'm not entirely sure this
qualifies as a bug either, but I do believe it should be addressed.

As I understand it, two nodes that are defined one after the other, and
that could both end up with the same expressions when traversed, cannot
be differentiated by query jumbling when one of them is NULL. In this
case, query jumbling can't differentiate between limitOffset and
limitCount when either one ends up with a similar FuncExpr.

By rearranging them as done in variant A, the position where the
expression is appended in the jumble changes, leading to a different
hash. I just don't like this solution because it requires one to
carefully construct a struct, but it may be the best out of the other
options.

Variant B is not acceptable IMO, as it adds a whole bunch of
null-terminators unnecessarily. For example, in a simple "select 1",
(expr == NULL) is true 19 times, so that is an extra 19 bytes.

I think a third option is to add a new pg_node_attr called
"query_jumble_null" and be very selective about which fields should
append a null-terminator to the jumble when the expression is NULL.

queryjumblefuncs.c could look like the below. JUMBLE_NULL would be
responsible for appending the null terminator.
"""
static void
_jumbleQuery(JumbleState *jstate, Node *node)
{
	Query	   *expr = (Query *) node;
	...

	if (!expr->limitOffset)
	{
		JUMBLE_NULL();
	}
	else
	{
		JUMBLE_NODE(limitOffset);
	}
	if (!expr->limitCount)
	{
		JUMBLE_NULL();
	}
	else
	{
		JUMBLE_NODE(limitCount);
	}
"""

What do you think? Maybe someone can suggest a better solution?

Regards,

Sami Imseih
Amazon Web Services (AWS)

[0] https://www.postgresql.org/message-id/ca447b72d15745b9a877fad7e258407a@localhost.localdomain
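As a toy, self-contained sketch of this selective null-terminator idea: ToyState, ToyLimits, and the hash64() mixer below are invented stand-ins for illustration, not the real queryjumblefuncs.c machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyState
{
	uint64_t	id;
} ToyState;

/* Append one NUL byte only where a field is explicitly marked. */
#define JUMBLE_NULL(st) \
	do { unsigned char z = 0; (st)->id = hash64(&z, 1, (st)->id); } while (0)
#define JUMBLE_NODE(st, ptr) \
	((st)->id = hash64((ptr), sizeof(*(ptr)), (st)->id))

typedef struct ToyLimits
{
	const long *limitOffset;	/* NULL-able */
	const long *limitCount;		/* NULL-able */
} ToyLimits;

static uint64_t toy_jumble_limits(const ToyLimits *q)
{
	ToyState	st = {0};

	if (!q->limitOffset)
		JUMBLE_NULL(&st);
	else
		JUMBLE_NODE(&st, q->limitOffset);
	if (!q->limitCount)
		JUMBLE_NULL(&st);
	else
		JUMBLE_NODE(&st, q->limitCount);
	return st.id;
}
```

Only the two marked fields pay the extra byte, so LIMIT-only and OFFSET-only queries hash differently without padding every NULL node in the tree.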
On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> Regarding the issue itself, query jumbling behavior is often subjective,
> making it difficult to classify as a bug. I'm not entirely sure this
> qualifies as a bug either, but I do believe it should be addressed.

I would call that a bug and something that we should fix, but not
something that we can really backpatch, as this unfortunately has an
impact on monitoring tools. Stability takes priority in this area in
already released branches.

> By rearranging them as done in variant A, the position where the expression
> is appended in the jumble changes, leading to a different hash. I just
> don't like this solution because it requires one to carefully construct
> a struct, but it may be the best out of the other options.

I prefer something like variant A. It would not be the first time we are
manipulating the structure of the parse nodes for the sake of making the
query jumbling more transparent. An advantage of depending on the
structure definition is that we can document the expectation in the
header, and not hide it in the middle of the jumble code.

> Variant B is not acceptable IMO as it adds a whole bunch of null-terminators
> unnecessarily. For example, in a simple "select 1", (expr == NULL) is
> true 19 times, so that is an extra 19 bytes.

Variant B is not acceptable here.

> I think a third option is to add a new pg_node_attr called "query_jumble_null"
> and be very selective on which fields should append a null-terminator to the
> jumble when the expression is null
>
> The queryjumblefuncs.c could look like the below. JUMBLE_NULL will
> be responsible for appending the null terminator.

Not much a fan of that. For the sake of this thread, I'd still go with
the simplicity of A. And please, let's add a couple of queries in pgss
to track that these lead to two different entries.
Another option that I was thinking about, and that could be slightly
cleaner, is the addition of an extra field in Query that is set when we
go through an offset_clause in the parser. It could just be a boolean,
false by default. We have been using this practice in DeallocateStmt,
for example, and there are only a few places where limitOffset is set.

However, I'd rather discard this idea, as transformSelectStmt() can also
transform a LIMIT clause into an OFFSET clause, changing a Query's
representation, and note that we calculate the query jumbling after
applying the query transformation.

For these reasons, variant A, where we put the LimitOption between the
two int8 expression nodes, feels like the "okay" approach here. But we
must document this expectation in the structure, and check for more
grammar variants of LIMIT and OFFSET clauses in pgss.

Another concept would be to add into the jumble the offset of a Node in
the upper structure we are jumbling, but this requires keeping track of
all the nested things, which would make the query jumbling code much
more complicated and more expensive, for sure.

I'd also recommend being careful and double-checking all these
transformation assumptions for LIMIT and OFFSET. We should shape things
so that an OFFSET never maps to a LIMIT variant when we differentiate
all these flavors at the query string level.
--
Michael
On Mon, 10 Mar 2025 at 12:39, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> > Regarding the issue itself, query jumbling behavior is often subjective,
> > making it difficult to classify as a bug. I'm not entirely sure this
> > qualifies as a bug either, but I do believe it should be addressed.
>
> I would call that a bug and something that we should fix, but not
> something that we can really backpatch as this has unfortunately an
> impact on monitoring tools. Stability takes priority in this area in
> already released branches.

I know you reviewed this, Michael, so you're aware; 2d3389c28 did
recently document that we'd only break this in minor versions as a last
resort. So, I agree that it sounds like a master-only fix is in order.

For the record, the docs [1] read:

"Generally, it can be assumed that queryid values are stable between
minor version releases of PostgreSQL, providing that instances are
running on the same machine architecture and the catalog metadata
details match. Compatibility will only be broken between minor versions
as a last resort."

David

[1] https://www.postgresql.org/docs/current/pgstatstatements.html
On Mon, Mar 10, 2025 at 02:14:01PM +1300, David Rowley wrote:
> I know you reviewed this, Michael, so you're aware; 2d3389c28 did
> recently document that we'd only break this in minor versions as a
> last resort. So, I agree that it sounds like a master-only fix is in
> order.

Yes, this thread's problem does not pass the compatibility bar. Thanks
for the reminder about the docs.
--
Michael
Hello!

>> Variant B is not acceptable IMO as it adds a whole bunch of
>> null-terminators unnecessarily. For example, in a simple "select 1",
>> (expr == NULL) is true 19 times, so that is an extra 19 bytes.
>
> Variant B is not acceptable here.

Could we improve Variant B? I was thinking about adding a count of NULL
pointers encountered by the query jumble walker since the last scalar or
non-NULL field visited. This way, we can traverse the query as follows:
====
QueryA->subNodeOne = &(Value X);
/* JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* JumbleState->Count = 1; */
QueryA->subNodeThree = NULL;
/* JumbleState->Count = 2; */
QueryA->subNodeFour = NULL;
/* JumbleState->Count = 3; */
QueryA->scalar = Value Z;
/* QueryA_ID = hash_any_extended(&(JumbleState->Count),
                                 sizeof(JumbleState->Count), QueryA_ID);
   JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value Z), sizeof(Value Z), QueryA_ID) */
====

Variant A improvement
---------------------
I have attached the updated Variant A patch with the following changes:
- Implemented a pg_stat_statements regression test (see similar test results
  on REL_17_3 in Query-ID-collision-at_pg_stat_statements-1ea6e890b22.txt).
- Added extra description about the Query ID collision
  in src/include/nodes/parsenodes.h.
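The NULL-counter traversal above can be sketched in a toy, self-contained form; hash64(), ToyJumbleState, and jumble_datum() are invented stand-ins for illustration, not the attached patch's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyJumbleState
{
	uint64_t	id;
	unsigned	null_count;		/* NULLs seen since the last datum */
} ToyJumbleState;

/*
 * Count consecutive NULL children; fold the counter's low byte into the
 * hash just before the next non-NULL datum.  Trailing NULLs are left
 * unflushed, matching the idea of skipping a final append.
 */
static void jumble_datum(ToyJumbleState *st, const void *val, size_t len)
{
	if (val == NULL)
	{
		st->null_count++;
		return;
	}
	if (st->null_count > 0)
	{
		unsigned char c = (unsigned char) st->null_count;

		st->id = hash64(&c, 1, st->id);
		st->null_count = 0;
	}
	st->id = hash64(val, len, st->id);
}
```

Feeding (X, NULL) and (NULL, X) through this walker yields different IDs, while a run of NULLs costs one byte in the jumble instead of one byte per NULL.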
On Mon, 10 Mar 2025 at 23:17, Bykov Ivan <i.bykov@modernsys.ru> wrote:
> I have attached the updated Variant A patch with the following changes:
> - Implemented a pg_stat_statements regression test (see similar test results
>   on REL_17_3 in Query-ID-collision-at_pg_stat_statements-1ea6e890b22.txt).
> - Added extra description about the Query ID collision
>   in src/include/nodes/parsenodes.h.

I've not paid much attention to the jumble code before, but I did glance
at this patch:

+ *
+ * The query jumbling process may trigger a Query ID collision when
+ * two Query fields located sequentially contain the same type of nodes
+ * (a list of nodes), and both of them may be NULL.
+ * List of such groups of fields:
+ * - limitOffset and limitCount (which may be an int8 expression or NULL);
+ * - distinctClause, and sortClause (which may be a list of
+ *   SortGroupClause entries or NULL);
+ * To prevent this collision, we should insert at least one scalar field
+ * without the attribute query_jumble_ignore between such fields.
  */
 typedef struct Query
 {
@@ -208,11 +218,11 @@ typedef struct Query

 	List	   *distinctClause; /* a list of SortGroupClause's */

-	List	   *sortClause;		/* a list of SortGroupClause's */
-	Node	   *limitOffset;	/* # of result tuples to skip (int8 expr) */
-	Node	   *limitCount;		/* # of result tuples to return (int8 expr) */
 	LimitOption limitOption;	/* limit type */
+	Node	   *limitCount;		/* # of result tuples to return (int8 expr) */
+
+	List	   *sortClause;		/* a list of SortGroupClause's */

 	List	   *rowMarks;		/* a list of RowMarkClause's */

It seems to me that if this fixes the issue, the next similar one is
already lurking in the shadows waiting to jump out on us.

Couldn't we adjust all this code so that we pass a seed to
AppendJumble() that's the offsetof(Struct, field) that we're jumbling,
and incorporate that seed into the hash? I don't think we could possibly
change the offset in a minor version, so I don't think there's a risk
that could cause jumble value instability. Possibly different CPU
architectures would come up with different offsets through having
different struct alignment requirements, but we don't make any
guarantees about the jumble values matching across different CPU
architectures anyway.

I admit to only having spent a few minutes looking at this, so there may
well be a good reason for not doing this. Maybe Michael can tell me what
that is...?

A very quickly put together patch is attached. I intend this only to
demonstrate what I mean, not because I want to work on it myself.

David
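As a toy illustration of this offsetof-seed idea (hash64() and ToyQuery below are invented stand-ins, not the attached patch): mixing the field's offset in the parent struct into the hash before the field's bytes means the same value jumbled from two different fields can no longer produce the same byte stream, even if NULL fields are still skipped entirely:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mixer (FNV-1a style); PostgreSQL really uses hash_any_extended(). */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = (const unsigned char *) data;
	uint64_t	h = seed ^ 0xcbf29ce484222325ULL;

	while (len-- > 0)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

typedef struct ToyQuery
{
	const int  *subNodeOne;		/* NULL-able child */
	const int  *subNodeTwo;		/* NULL-able child */
} ToyQuery;

/* Mix the field's struct offset in as positional entropy before its bytes. */
static uint64_t jumble_child(uint64_t id, size_t off, const int *child)
{
	if (child == NULL)
		return id;				/* NULL may still be skipped entirely */
	id = hash64(&off, sizeof(off), id);
	return hash64(child, sizeof(*child), id);
}

static uint64_t toy_jumble_off(const ToyQuery *q)
{
	uint64_t	id = 0;

	id = jumble_child(id, offsetof(ToyQuery, subNodeOne), q->subNodeOne);
	id = jumble_child(id, offsetof(ToyQuery, subNodeTwo), q->subNodeTwo);
	return id;
}
```

Because offsetof(ToyQuery, subNodeOne) and offsetof(ToyQuery, subNodeTwo) differ, the two traversals hash different byte sequences and the collision disappears, at the cost of extra bytes per non-NULL field.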
> It seems to me that if this fixes the issue, that the next similar one
> is already lurking in the shadows waiting to jump out on us.

Yes, it is true that there may be other cases, but I am not sure it's
worth carrying all the extra bytes in the jumble to deal with a few
cases like this. This is why I don't think Variant B or tracking the
offset is a thrilling idea. -1 for me.
--
Sami
> transformation. For these reasons, variant A where we put the
> LimitOption between the two int8 expression nodes feels like the
> "okay" approach here. But we must document this expectation in the
> structure, and check for more grammar variants of LIMIT and OFFSET
> clauses in pgss.

Please see the attached: Variant A with comments and some additional
test cases.

It should be noted that we currently have "WITH TIES/ROWS ONLY" tests in
pg_stat_statements, so I added another case to show that "FETCH FIRST 2
ROW ONLY" and "LIMIT 2" get the same queryId, and also added a query
that uses both a LIMIT and an OFFSET. I could not think of other cases
we need to cover.
--
Sami
On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> It seems to me that if this fixes the issue, that the next similar one
> is already lurking in the shadows waiting to jump out on us.

For how many years will we have to wait for this threat hidden in the
shadows? :)

This issue has existed since query jumbling was introduced in pgss, so
it goes really far down the road. I've spent a bit more than a few
minutes on that.

> Couldn't we adjust all this code so that we pass a seed to
> AppendJumble() that's the offsetof(Struct, field) that we're jumbling
> and incorporate that seed into the hash? I don't think we could
> possibly change the offset in a minor version, so I don't think
> there's a risk that could cause jumble value instability. Possibly
> different CPU architectures would come up with different offsets
> through having different struct alignment requirements, but we don't
> make any guarantees about the jumble values matching across different
> CPU architectures anyway.

Yeah, something like that has the potential to get rid of such problems
forever, particularly thanks to the fact that we automate the generation
of this code now, so it is mostly cost-free. This has a cost for the
custom jumbling functions, where one needs some imagination, but with
models being around while we keep the number of custom functions low,
that should be neither complicated nor costly, at least in my
experience.

When I was thinking about the addition of the offset to the jumbling
yesterday, I was wondering about two things:
- if there are some node structures where it could not work;
- if we'd need to pass down some information of the upper node when
  doing the jumbling of a node attached to it, which would make the code
  generation much more annoying.

But after sleeping on it, I think that these two points are kind of
moot: having only the offset passed down to a single _jumbleNode(), with
the offset compiled at its beginning, should be sufficient.

Using "seed" as a term is perhaps a bit confusing, because this is an
offset in the upper node, but perhaps that's OK as-is. It's just more
entropy addition. If somebody has a better idea for this term, feel
free.

_jumbleList() is an interesting one. If we want an equivalent of the
offset, this could use the item number in the list, which would be a
rather close equivalent to the offset used elsewhere. For the int, oid
and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for each
member, and apply the offset at the beginning of _jumbleList(), once, I
guess.

_jumbleVariableSetStmt() should be OK with the offset of "arg" in
VariableSetStmt. The addition of hash_combine64() to count for the
offset should be OK.

With that in mind, the cases with DISTINCT/ORDER BY and OFFSET/LIMIT
seem to work fine, without causing noise for the other cases tracked in
the regression tests of PGSS.

What do you think?
--
Michael
On Tue, 11 Mar 2025 at 19:50, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> > It seems to me that if this fixes the issue, that the next similar one
> > is already lurking in the shadows waiting to jump out on us.
>
> For how many years will we have to wait for this threat hidden in the
> shadows? :)
>
> This issue exists since the query jumbling exists in pgss, so it goes
> really down the road. I've spent a bit more than a few minutes on
> that.

I didn't mean to cause offence here. I just wanted to make it clear that
I don't think fixing these types of issues by swapping the order of the
fields is going to be a sustainable practice, and it would be good to
come up with a solution that eliminates the chances of this class of bug
from ever appearing again. Even if we were to trawl the entire Query
struct and everything that could ever be attached to it, and we deem
that today it's fine and no more such bugs exist, the person adding some
new SQL feature in the future that adds new data to store in the Query
probably isn't going to give query jumbling much thought, especially now
that the code generation is automatic.

> Yeah, something like that has potential to get rid of such problems
> forever, particularly thanks to the fact that we automate the
> generation of this code now so it is mostly cost-free. This has a
> cost for the custom jumbling functions where one needs some
> imagination, but with models being around while we keep the number of
> custom functions low, that should be neither complicated nor costly,
> at least in my experience.

I hadn't really processed this thread fully when I responded yesterday,
and my response was mostly aimed at the latest patch on the thread at
the time, which I had looked at quickly and wanted to put a stop to
changing field orders as a fix for this. Since then, I see that Ivan has
already submitted a patch that accounts for NULL nodes and adds a byte
to the jumble buffer to account for NULLs. This seems quite clean and
simple. However, Sami seems to have concerns about the overhead of doing
this. Is that warranted at all? Potentially, there could be a decent
number of NULL fields. It'll probably be much cheaper than the offsetof
idea I came up with.

> When I was thinking about the addition of the offset to the jumbling
> yesterday, I was wondering about two things:
> - if there are some node structures where it could not work.
> - if we'd need to pass down some information of the upper node when
>   doing the jumbling of a node attached to it, which would make the
>   code generation much more annoying.
>
> But after sleeping on it I think that these two points are kind of
> moot: having only the offset passed down to a single _jumbleNode()
> with the offset compiled at its beginning should be sufficient. Using
> "seed" as a term is perhaps a bit confusing, because this is an offset
> in the upper node, but perhaps that's OK as-is. It's just more
> entropy addition. If somebody has a better idea for this term, feel
> free.

Can you share your thoughts on Ivan's NULL jumble idea?

> _jumbleList() is an interesting one. If we want an equivalent of the
> offset, this could use the item number in the list which would be a
> rather close equivalent to the offset used elsewhere. For the int,
> oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
> each member, apply the offset at the beginning of _jumbleList(), once,
> I guess.

Is this affected by the same class of bug that we're aiming to fix here?
(This class being: not always jumbling all fields because of NULLs or
some other value, resulting in the next jumble field applying the same
bytes to the buffer as the previous field would have, had it not been
ignored.)

> _jumbleVariableSetStmt() should be OK with the offset of "arg" in
> VariableSetStmt. The addition of hash_combine64() to count for the
> offset should be OK.
>
> With that in mind, the cases with DISTINCT/ORDER BY and OFFSET/LIMIT
> seem to work fine, without causing noise for the other cases tracked
> in the regression tests of PGSS.
>
> What do you think?

I mostly now think this class of bug can be fixed by ensuring we never
ignore any jumble field for any reason and always put at least 1 byte
onto the buffer with some sort of entropy. I'm keen to understand if I'm
missing some case that you've thought about that I've not.

David
Hello!

> Since then, I see that Ivan has already submitted a patch that
> accounts for NULL nodes and adds a byte to the jumble buffer to
> account for NULLs. This seems quite clean and simple. However, Sami
> seems to have concerns about the overhead of doing this. Is that
> warranted at all? Potentially, there could be a decent number of NULL
> fields. It'll probably be much cheaper than the offsetof idea I came
> up with.

To address the concern about consuming a lot of bytes in the jumble
buffer for NULL marks, I have added a new patch version for Variant B
(v2-0001-Query-ID-Calculation-Fix-Variant-B.patch). This patch adds to
JumbleState the count of NULL nodes visited before a non-NULL one
appears. The least significant byte of this count is appended to the
jumble buffer every time the count is non-zero (indicating that we have
visited NULL nodes previously) and a new chunk of data is about to be
appended to the jumble buffer.

I intentionally did not add a final append for the NULL count in the
JumbleQuery processing, as NULL tail nodes of the query do not improve
the reliability of query ID collision resolution.

I believe that my first version of the Variant B patch is simple and
easy to understand, and I would prefer to use it, so I have attached an
updated version along with tests
(v3-0001-Query-ID-Calculation-Fix-Variant-B.patch). As you can see,
v2-0001-Query-ID-Calculation-Fix-Variant-B.patch is more complex and
invasive than v3-0001-Query-ID-Calculation-Fix-Variant-B.patch. I don't
think that saving a few bytes in a 1024-byte buffer is worth the extra
complexity, even for this simple optimization (v2).

> Can you share your thoughts on Ivan's NULL jumble idea?

I would appreciate any feedback. Thank you.
-----Original Message-----
From: David Rowley <dgrowleyml@gmail.com>
Sent: Tuesday, March 11, 2025 12:49 PM
To: Michael Paquier <michael@paquier.xyz>
Cc: Быков Иван Александрович <i.bykov@modernsys.ru>; Sami Imseih <samimseih@gmail.com>; pgsql-hackers@postgresql.org
Subject: Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

On Tue, 11 Mar 2025 at 19:50, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> > It seems to me that if this fixes the issue, that the next similar
> > one is already lurking in the shadows waiting to jump out on us.
>
> For how many years will we have to wait for this threat hidden in the
> shadows? :)
>
> This issue exists since the query jumbling exists in pgss, so it goes
> really down the road. I've spent a bit more than a few minutes on
> that.

I didn't mean to cause offence here. I just wanted to make it clear that I don't think fixing these types of issues by swapping the order of the fields is going to be a sustainable practice, and it would be good to come up with a solution that eliminates the chances of this class of bug from ever appearing again. Even if we were to trawl the entire Query struct and everything that could ever be attached to it, and we deem that today it's fine and no more such bugs exist, the person adding some new SQL feature in the future that adds new data to store in the Query probably isn't going to give query jumbling much thought, especially now that the code generation is automatic.

> > Couldn't we adjust all this code so that we pass a seed to
> > AppendJumble() that's the offsetof(Struct, field) that we're
> > jumbling and incorporate that seed into the hash? I don't think we
> > could possibly change the offset in a minor version, so I don't
> > think there's a risk that could cause jumble value instability.
> > Possibly different CPU architectures would come up with different
> > offsets through having different struct alignment requirements, but
> > we don't make any guarantees about the jumble values matching across
> > different CPU architectures anyway.
>
> Yeah, something like that has potential to get rid of such problems
> forever, particularly thanks to the fact that we automate the
> generation of this code now so it is mostly cost-free. This has a
> cost for the custom jumbling functions where one needs some
> imagination, but with models being around while we keep the number of
> custom functions low, that should be neither complicated nor costly,
> at least in my experience.

I hadn't really processed this thread fully when I responded yesterday, and my response was mostly aimed at the latest patch on the thread at the time, which I had looked at quickly and wanted to put a stop to changing field orders as a fix for this. Since then, I see that Ivan has already submitted a patch that accounts for NULL nodes and adds a byte to the jumble buffer to account for NULLs. This seems quite clean and simple. However, Sami seems to have concerns about the overhead of doing this. Is that warranted at all? Potentially, there could be a decent number of NULL fields. It'll probably be much cheaper than the offsetof idea I came up with.

> When I was thinking about the addition of the offset to the jumbling
> yesterday, I was wondering about two things:
> - if there are some node structures where it could not work.
> - if we'd need to pass down some information of the upper node when
> doing the jumbling of a node attached to it, which would make the code
> generation much more annoying.
>
> But after sleeping on it I think that these two points are kind of
> moot: having only the offset passed down to a single _jumbleNode()
> with the offset compiled at its beginning should be sufficient. Using
> "seed" as a term is perhaps a bit confusing, because this is an offset
> in the upper node, but perhaps that's OK as-is. It's just more
> entropy addition. If somebody has a better idea for this term, feel
> free.

Can you share your thoughts on Ivan's NULL jumble idea?

> _jumbleList() is an interesting one. If we want an equivalent of the
> offset, this could use the item number in the list which would be a
> rather close equivalent to the offset used elsewhere. For the int,
> oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
> each member, apply the offset at the beginning of _jumbleList(), once,
> I guess.

Is this affected by the same class of bug that we're aiming to fix here? (This class being not always jumbling all fields because of NULLs or some other value, resulting in the next jumble field applying the same bytes to the buffer as the previous field would have, had it not been ignored.)

> _jumbleVariableSetStmt() should be OK with the offset of "arg" in
> VariableSetStmt. The addition of hash_combine64() to count for the
> offset should be OK.
>
> With that in mind, the cases with DISTINCT/ORDER BY and OFFSET/LIMIT
> seem to work fine, without causing noise for the other cases tracked
> in the regression tests of PGSS.
>
> What do you think?

I mostly now think the class of bug can be fixed by ensuring we never ignore any jumble field for any reason and always put at least 1 byte onto the buffer with some sort of entropy. I'm keen to understand if I'm missing some case that you've thought about that I've not.

David
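The offsetof idea quoted above can be sketched in miniature. Here hash64() is a simple stand-in mixer, not PostgreSQL's hash_any_extended(), and ToyQuery is a hypothetical two-field node; the point is only that seeding each field's hash with its offset in the parent struct distinguishes identical data held in different fields:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the offsetof-seed idea: fold the field's offset within its
 * parent struct into the hash before the field's own bytes, so that two
 * fields holding identical data still hash differently.
 */
static uint64_t
hash64(const unsigned char *data, size_t len, uint64_t seed)
{
    uint64_t    h = seed ^ UINT64_C(0x9e3779b97f4a7c15);

    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * UINT64_C(0x100000001b3);
    return h;
}

typedef struct ToyQuery
{
    const int  *distinctClause;
    const int  *sortClause;
} ToyQuery;

static uint64_t
jumble_field(uint64_t h, const int *field, size_t field_offset)
{
    /* seed with the field's offset, even when the field itself is NULL */
    h = hash64((const unsigned char *) &field_offset, sizeof(field_offset), h);
    if (field != NULL)
        h = hash64((const unsigned char *) field, sizeof(*field), h);
    return h;
}

static uint64_t
toy_query_id(const ToyQuery *q)
{
    uint64_t    h = 0;

    h = jumble_field(h, q->distinctClause, offsetof(ToyQuery, distinctClause));
    h = jumble_field(h, q->sortClause, offsetof(ToyQuery, sortClause));
    return h;
}
```

A "DISTINCT-only" ToyQuery and an "ORDER BY-only" ToyQuery carrying the same value now hash the offset bytes in different positions, so their IDs diverge.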
Attachment
On Tue, Mar 11, 2025 at 08:48:33PM +1300, David Rowley wrote:
> On Tue, 11 Mar 2025 at 19:50, Michael Paquier <michael@paquier.xyz> wrote:
>> This issue exists since the query jumbling exists in pgss, so it goes
>> really down the road. I've spent a bit more than a few minutes on
>> that.
>
> I didn't mean to cause offence here. I just wanted to make it clear
> that I don't think fixing these types of issues by swapping the order
> of the fields is going to be a sustainable practice, and it would be
> good to come up with a solution that eliminates the chances of this
> class of bug from ever appearing again. Even if we were to trawl the
> entire Query struct and everything that could ever be attached to it
> and we deem that today it's fine and no more such bugs exist, the
> person adding some new SQL feature in the future that adds new data to
> store in the Query probably isn't going to give query jumbling much
> thought, especially now that the code generation is automatic.

FWIW, I've mentioned the use of the offset in a Node upthread, in the next-to-last paragraph of this email:
https://www.postgresql.org/message-id/Z84mhjm5RtWtLG4X@paquier.xyz

One thing I did not notice yesterday was that it is possible to rely on _jumbleNode() to pass an offset from the parent. So it looks like we had roughly the same idea independently, but I was not able to look more closely at that yesterday.

>> But after sleeping on it I think that these two points are kind of
>> moot: having only the offset passed down to a single _jumbleNode()
>> with the offset compiled at its beginning should be sufficient. Using
>> "seed" as a term is perhaps a bit confusing, because this is an offset
>> in the upper node, but perhaps that's OK as-is. It's just more
>> entropy addition. If somebody has a better idea for this term, feel
>> free.
>
> Can you share your thoughts on Ivan's NULL jumble idea?

This is an incomplete solution, I am afraid. The origin of the problem comes from the fact that a Node (here, Query) stores two equivalent Nodes successively that are used for separate data, with NULL being the difference between both. The problem is not limited to NULL: we could as well, for example, finish with a Node that has a custom jumbling function and hit the same issue depending on how it is used in a parent Node. Adding the offset from the parent in the jumbling is a much stronger insurance policy than the proposed solution specific to NULL, because each offset is unique in a single Node.

>> _jumbleList() is an interesting one. If we want an equivalent of the
>> offset, this could use the item number in the list which would be a
>> rather close equivalent to the offset used elsewhere. For the int,
>> oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
>> each member, apply the offset at the beginning of _jumbleList(), once,
>> I guess.
>
> Is this affected by the same class of bug that we're aiming to fix
> here? (This class being not always jumbling all fields because of
> NULLs or some other value, resulting in the next jumble field applying
> the same bytes to the buffer as the previous field would have, had it
> not been ignored)

Yep, that could be possible. I am not sure if it can be reached with the current set of parse nodes, but in theory it could be possible with a list of Nodes, at least.
--
Michael
Attachment
> and simple. However, Sami seems to have concerns about the overhead of
> doing this. Is that warranted at all? Potentially, there could be a
> decent number of NULL fields. It'll probably be much cheaper than the
> offsetof idea I came up with.

I have not benchmarked the overhead, so maybe there is not much to be concerned about. However, it just seems to me that tracking the extra data for all cases just to only deal with corner cases does not seem like the correct approach. This is what makes variant A the most attractive approach.
--
Sami
On Tue, Mar 11, 2025 at 05:35:10PM -0500, Sami Imseih wrote:
> I have not benchmarked the overhead, so maybe there is not much to
> be concerned about. However, it just seems to me that tracking the extra
> data for all cases just to only deal with corner cases does not seem like the
> correct approach. This is what makes variant A the most attractive
> approach.

I suspect that the overhead will be minimal for all the approaches I'm seeing on this thread, but it would not hurt to double-check all that. As the overhead of a single query jumbling is weightless compared to the overall query processing, the fastest method I've used in this area is a micro-benchmark with a hardcoded loop in JumbleQuery() with some rusage to get a few more metrics. This exaggerates the query jumbling computation, but it's good enough to see a difference once you take an average of the time taken for each loop.
--
Michael
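The micro-benchmark pattern described here can be sketched as below. jumble_once() is a placeholder workload standing in for the real JumbleQuery() call; only the loop-plus-getrusage() bracketing is the point:

```c
#include <sys/resource.h>

/*
 * Micro-benchmark sketch: run the code under test in a hardcoded loop
 * bracketed by getrusage() reads, then take the average per iteration.
 */
static void
jumble_once(void)
{
    static volatile unsigned long sink = 0;

    for (int i = 0; i < 1000; i++)
        sink += i;              /* stand-in for one jumbling pass */
}

static double
bench_usec_per_call(int loops)
{
    struct rusage before;
    struct rusage after;
    double      us;

    getrusage(RUSAGE_SELF, &before);
    for (int i = 0; i < loops; i++)
        jumble_once();
    getrusage(RUSAGE_SELF, &after);

    us = (after.ru_utime.tv_sec - before.ru_utime.tv_sec) * 1e6 +
        (after.ru_utime.tv_usec - before.ru_utime.tv_usec);

    /* the loop exaggerates the cost; the average smooths it back out */
    return us / loops;
}
```

Looping many times inside the measured region is what makes the per-call cost visible at all: a single jumbling pass is far below the resolution of the rusage counters.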
Attachment
On Tue, 11 Mar 2025 at 23:28, Michael Paquier <michael@paquier.xyz> wrote:
> > Can you share your thoughts on Ivan's NULL jumble idea?
>
> This is an incomplete solution, I am afraid. The origin of the
> problem comes from the fact that a Node (here, Query) stores two
> equivalent Nodes successively that are used for separate data, with
> NULL being the difference between both. The problem is not limited to
> NULL: we could as well, for example, finish with a Node that has a
> custom jumbling function and hit the same issue depending on how it is
> used in a parent Node. Adding the offset from the parent in the
> jumbling is a much stronger insurance policy than the proposed
> solution specific to NULL, because each offset is unique in a single
> Node.

One of us is not understanding the problem correctly. I'm very open to that being me, but I'll need a bit more explanation to know that for sure.

As far as I see it, providing we ensure we always add something with a bit of entropy to the jumble buffer for all possible values that a jumble field can be, this fixes the issue. The problem only occurs at the moment because we entirely skip over NULL node fields, and we end up with the same jumble bytes if we effectively move the same List of nodes between the Query's distinctClause and sortClause fields.

If we add something for NULLs, we'll always add distinctClause before the sortClause. If the distinctClause is NULL, we won't end up with the same jumble bytes as if the sortClause were NULL, as the position at which the NULL byte(s) are added to the buffer changes.

The fragment of the Query struct looks like:

    List *windowClause;   /* a list of WindowClause's */
    List *distinctClause; /* a list of SortGroupClause's */
    List *sortClause;     /* a list of SortGroupClause's */

Let's assume we have a non-NIL windowClause, a non-NIL distinctClause and a NIL sortClause. The jumble bytes currently look like: "<windowClause bytes><distinctClause bytes>", and if we have an ORDER BY instead of a DISTINCT, we get: "<windowClause bytes><sortClause bytes>". This is problematic when the distinctClause bytes and sortClause bytes are the same, as that results in the same hash.

If we account for the NULLs, those two cases become: "<windowClause bytes><distinctClause bytes><byte for NULL sortClause>" and "<windowClause bytes><byte for NULL distinctClause><sortClause bytes>", which is going to be highly unlikely to hash to the same value due to the buffer not being the same.

Can you explain where my understanding is wrong?

David
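The byte sequences described above can be reproduced with a toy buffer builder. build_buf() is illustrative only, and the short strings stand in for the jumbled bytes of each clause:

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy reproduction of the buffers described above: when NULL fields are
 * skipped entirely, the DISTINCT and ORDER BY cases produce identical
 * buffers; with one marker byte per NULL field they diverge.
 */
static size_t
build_buf(unsigned char *buf, const char *window, const char *distinct,
          const char *sort, int mark_nulls)
{
    const char *fields[3];
    size_t      len = 0;

    fields[0] = window;
    fields[1] = distinct;
    fields[2] = sort;

    for (int i = 0; i < 3; i++)
    {
        if (fields[i] == NULL)
        {
            if (mark_nulls)
                buf[len++] = '\0';  /* byte for the NULL field */
            continue;
        }
        memcpy(buf + len, fields[i], strlen(fields[i]));
        len += strlen(fields[i]);
    }
    return len;
}
```

With mark_nulls = 0, ("win", "sgc", NULL) and ("win", NULL, "sgc") build byte-identical buffers and therefore identical hashes; with mark_nulls = 1 the marker lands in different positions and the buffers differ.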
> If we add something for NULLs, we'll always add distinctClause before
> the sortClause. If the distinctClause is NULL we won't end up with the
> same jumble bytes as if the sortClause were NULL, as the order the
> NULL byte(s) are added to the buffer changes.

That is how I understand it as well. By appending a single-character null terminator to the jumble for every NULL expression, we change the composition of the final jumble. Passing offsets will accomplish the same thing, but I can't see how it buys us any additional safeguards.

> I suspect that the overhead will be minimal for all the approaches I'm
> seeing on this thread, but it would not hurt to double-check all that.

Perhaps my concern is overblown. Also, it seems the consensus is for a more comprehensive solution than that of variant A.

Here is an idea I was playing around with. Why don't we just append a null terminator for every expression (a variant of variant B)? I describe this as jumbling the expression position. And if we do that, it no longer seems necessary to jumble the expression type either, right?

+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -236,6 +236,13 @@
 do { \
     if (expr->str) \
         AppendJumble(jstate, (const unsigned char *) (expr->str), strlen(expr->str) + 1); \
 } while(0)
+#define JUMBLE_POSITION() \
+do { \
+    char nullterm = '\0'; \
+    AppendJumble(jstate, (const unsigned char *) &(nullterm), sizeof(nullterm)); \
+    if (expr == NULL) \
+        return; \
+} while(0)

 #include "queryjumblefuncs.funcs.c"

@@ -244,17 +251,15 @@ _jumbleNode(JumbleState *jstate, Node *node)
 {
     Node *expr = node;

-    if (expr == NULL)
-        return;
-
     /* Guard against stack overflow due to overly complex expressions */
     check_stack_depth();

     /*
-     * We always emit the node's NodeTag, then any additional fields that are
-     * considered significant, and then we recurse to any child nodes.
+     * We represent a position of a field in the jumble with a null-term.
+     * Doing so ensures that even NULL expressions are accounted for in
+     * the jumble.
      */
-    JUMBLE_FIELD(type);
+    JUMBLE_POSITION();

     switch (nodeTag(expr))
     {

> As the overhead of a single query jumbling is weightless compared to
> the overall query processing

yeah, that is a good point. At least benchmarking the above on a SELECT-only pgbench did not reveal any regression.

--
Sami Imseih
Amazon Web Services (AWS)
On Tue, 11 Mar 2025 at 20:48, David Rowley <dgrowleyml@gmail.com> wrote:
> Since then, I see that Ivan
> has already submitted a patch that accounts for NULL nodes and adds a
> byte to the jumble buffer to account for NULLs. This seems quite clean
> and simple. However, Sami seems to have concerns about the overhead of
> doing this. Is that warranted at all? Potentially, there could be a
> decent number of NULL fields. It'll probably be much cheaper than the
> offsetof idea I came up with.

To try and get some timing information about the overhead added for jumbling the NULLs, I compared master and the patch at [1] using the attached jumbletime.patch.txt. The patch just adds an additional call to JumbleQuery and times how long it takes and outputs the result in nanoseconds. Running make check has the advantage of trying out many different queries.

On an AMD 3990x machine, the results for jumbling all make check queries are:

master: 73.66 milliseconds
0001-Query-ID-Calculation-Fix-Variant-B.patch: 80.36 milliseconds

So that adds about 9.1% overhead to jumbling, on average. See the attached jumble_results.txt for full results and the code to extract the results.

David

[1] https://www.postgresql.org/message-id/attachment/173439/0001-Query-ID-Calculation-Fix-Variant-B.patch
Attachment
On Wed, Mar 12, 2025 at 03:28:27PM +1300, David Rowley wrote:
> If we account for the NULLs, those two cases become: "<windowClause
> bytes><distinctClause bytes><byte for NULL sortClause>" and
> "<windowClause bytes><byte for NULL distinct clause><sortClause
> bytes>", which is going to be highly unlikely to hash to the same
> value due to the buffer not being the same.
>
> Can you explain where my understanding is wrong?

(I am a bit busy this week, sorry for the delay.)

I have looked at the patches sent at
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f@localhost.localdomain,
and they don't really address what I'm trying to point at here. Custom jumbling functions could still be problematic, making a NULL check at the beginning of _jumbleNode() ineffective. Here is an example, possible by design, that would still be a problem:

static void
_jumbleNodeFoo(JumbleState *jstate, Node *node)
{
    Foo *expr = (Foo *) node;

    if (expr->field1 == 0)
        return;
    JUMBLE_FIELD(field2);
    [..]
}

typedef struct Foo
{
    pg_node_attr(custom_query_jumble)

    NodeTag type;
    int field1;
    char *field2;
    [..]
} Foo;

Then we have the same problem with another Node using this Foo node two times in a row, depending on how it's used by the query parsing and transform:

typedef struct FooBar
{
    pg_node_attr(custom_query_jumble)

    NodeTag type;
    Foo *foo1;
    Foo *foo2;
    [..]
} FooBar;

Adding a depth level incremented once we go through a Node would add more entropy, but it may not be completely bullet-proof either, even if we add an offset. I am pretty sure I could find a Node structure that acts as a counterexample even if you add a depth level and an offset in the jumbling..
--
Michael
Attachment
> Then we have the same problem with another Node using this Foo node
> two times in a row, depending on how it's used by the query parsing
> and transform:

hmm, it's hard to imagine such a case being real-world and useful for query jumbling purposes. Also, I think if we introduce something like JUMBLE_NULL() or JUMBLE_POSITION(), the author of a custom jumbling function should use it. This could be added in code comments somewhere, IMO.
--
Sami
On Thu, Mar 13, 2025 at 07:17:20PM -0500, Sami Imseih wrote:
>> Then we have the same problem with another Node using this Foo node
>> two times in a row, depending on how it's used by the query parsing
>> and transform:
>
> hmm, it's hard to imagine such a case being real-world and useful for
> query jumbling purposes.

Perhaps, but the code allows that by design, and people like forking Postgres.

> Also, I think if we introduce something like
> JUMBLE_NULL() or JUMBLE_POSITION(), the author of a custom jumbling
> function should use it. This could be added in code comments somewhere, IMO.

FWIW, another idea I have on top of my mind is the addition of a counter in JumbleState that we increment each time we enter _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the incrementation. And we can rely on this counter to be unique each time we jumble a node..
--
Michael
Attachment
> FWIW, another idea I have on top of my mind is the addition of a
> counter in JumbleState that we increment each time we enter
> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
> incrementation. And we can rely on this counter to be unique each
> time we jumble a node..

With this approach, the author of the custom function will need to remember to increment the counter, right?

Also, I am wondering what is the difference between appending an incrementing counter in JumbleState vs just appending a constant for every node (NULL or NOT NULL)?

appending some constant or null-term to the hash:

<expression 1> <0> <expression 2> <0> <0>

vs appending an incremented value:

<expression 1> <1> <expression 2> <2> <3>

Both will do the job of adding the required entropy.
--
Sami
> On Mar 14, 2025, at 8:15, Sami Imseih <samimseih@gmail.com> wrote:
>> FWIW, another idea I have on top of my mind is the addition of a
>> counter in JumbleState that we increment each time we enter
>> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
>> incrementation. And we can rely on this counter to be unique each
>> time we jumble a node..
>
> With this approach, the author of the custom function will need
> to remember to increment the counter, right?

Actually, no. _jumbleNode() is the unique entry point we use when jumbling a sub-Node in a bigger Node structure, so custom functions don't need to touch it.

> Also, I am wondering what is the difference between appending an incrementing
> counter in JState vs just appending a constant for every node ( NULL
> or NOT NULL ) ?

Well, the problem is not NULL vs non-NULL, so..
--
Michael
On Fri, 14 Mar 2025 at 14:51, Michael Paquier <michael@paquier.xyz> wrote:
>
> > On Mar 14, 2025, at 8:15, Sami Imseih <samimseih@gmail.com> wrote:
> >> FWIW, another idea I have on top of my mind is the addition of a
> >> counter in JumbleState that we increment each time we enter
> >> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
> >> incrementation. And we can rely on this counter to be unique each
> >> time we jumble a node..
> >
> > With this approach, the author of the custom function will need
> > to remember to increment the counter, right?
>
> Actually, no. _jumbleNode() is the unique entry point we use when
> jumbling a sub-Node in a bigger Node structure, so custom functions
> don't need to touch it.

Err, what about non-node types? Those'll go to AppendJumble().

I think basically everyone here apart from you is having a hard time understanding what additional benefits your counter solution brings over just ensuring we always AppendJumble with something, regardless of the field's value. I do want to understand what you're concerned about, but you've demonstrated nothing to us about the "always jumble something" idea that breaks. Your example custom function broke that rule as it skipped doing anything when "expr->field1 == 0".

Anyway, let's see your patch so we can judge actual code rather than waste our time arguing over assumptions about what the hypothetical code is and does.

David
> On Mar 14, 2025, at 9:27, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Err, what about non-node types? Those'll go to AppendJumble().
>
> I think basically everyone here apart from you is having a hard time
> understanding what additional benefits your counter solution brings
> over just ensuring we always AppendJumble with something, regardless
> of the field's value. I do want to understand what you're concerned
> about but you've demonstrated nothing to us about the "always jumble
> something" idea that breaks. Your example custom function broke that
> rule as it skipped doing anything when "expr->field1 == 0".

Because what I've mentioned is the kind of case I've wanted to support when designing this facility, keeping also in mind that we may use this stuff for more than just Querys. If you'd rather discard the argument I've made upthread, well, I guess that you're free to do so and bypass any comments I have. Now I do think that what's been sent does not check all the boxes if we want a solution other than shifting the fields of a Node.

> Anyway, let's see your patch so we can judge actual code rather than
> waste our time arguing over assumptions about what the hypothetical
> code is and does.

Sure. I'm going to need a few days for that, though :)
--
Michael
On Fri, 14 Mar 2025 at 15:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> > On Mar 14, 2025, at 9:27, David Rowley <dgrowleyml@gmail.com> wrote:
> > I think basically everyone here apart from you is having a hard time
> > understanding what additional benefits your counter solution brings
> > over just ensuring we always AppendJumble with something, regardless
> > of the field's value. I do want to understand what you're concerned
> > about but you've demonstrated nothing to us about the "always jumble
> > something" idea that breaks. Your example custom function broke that
> > rule as it skipped doing anything when "expr->field1 == 0".
>
> Because what I've mentioned is the kind of case I've wanted to support
> when designing this facility, keeping also in mind that we may use
> this stuff for more than just Querys. If you'd rather discard the
> argument I've made upthread, well, I guess that you're free to do so
> and bypass any comments I have. Now I do think that what's been sent
> does not check all the boxes if we want a solution other than shifting
> the fields of a Node.

I don't think I'm discarding them... As far as I'm aware, your remaining concern is with custom jumble functions, and you showed an example in [1] of a hypothetical jumble function that could cause the same bug as this thread is discussing. My response to that was that the custom jumble function is broken and should be fixed, which seems to me to be analogous to Ivan's proposal to fix _jumbleNode() to do something with NULL pointers.

I now can't tell which of the following is true:
1) I've missed one of your concerns, or;
2) you want to find a way to make badly coded custom jumble functions not suffer from this bug, or;
3) you think that adding jumble bytes unconditionally regardless of the field's value still does not fix the bug in question, or;
4) something else.

It would be good to know which of these it is.

David

[1] https://postgr.es/m/Z9NrnVk7MtMe8uNF@paquier.xyz
On Fri, Mar 14, 2025 at 04:14:25PM +1300, David Rowley wrote:
> I don't think I'm discarding them... As far as I'm aware, your
> remaining concern is with custom jumble functions and you showed an
> example in [1] of a hypothetical jumble function that could cause the
> same bug as this thread is discussing. My response to that was that
> the custom jumble function is broken and should be fixed, which seems
> to me to be analogous to Ivan's proposal to fix _jumbleNode() to do
> something with NULL pointers.

Okay, that's where our opinions diverge. So, well, please let me reformulate that with some different words and terms. This is not a problem of having a NULL node vs a non-NULL node in the jumbling, because by design it is possible for one to decide if a Node can be included in the jumbling depending on its internal values, particularly if it gets used across more than one type of parent node. You are claiming that such functions are broken, but what I am saying is that such a function can be OK depending on how one wants this code to behave for their node structure and what they expect from it. So I'm making a claim about keeping this stuff more flexible, while also addressing the problem of the same node defined successively more than once in a parent structure.

> I now can't tell which of the following is true: 1) I've missed one of
> your concerns, or; 2) you want to find a way to make badly coded
> custom jumble functions not suffer from this bug, or; 3) you think
> that adding jumble bytes unconditionally regardless of the field's
> value still does not fix the bug in question, or; 4) something else.
> It would be good to know which of these it is.

I hope that my opinion counts, as I've worked on this whole design in 3db72ebcbe20, 8eba3e3f0208 and other related commits.
FWIW, I don't really see any disadvantage in supporting what I'm claiming is OK, giving more flexibility to folks to do what they want with this facility, if it also tackles this thread's problem with the Node handling.

So, here is attached a counter-proposal, where we can simply add a counter tracking a node count in _jumbleNode() to add more entropy to the mix, incrementing it as well for NULL nodes.

By the way, I was looking at the set of tests proposed upthread at this message, which is as far as I know the latest version proposed:
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f@localhost.localdomain

The tests require neither a relation nor a stats reset, so let's make them cheaper as I've proposed upthread and in the attached, including checks with multiple queries and different constants to make sure that these are correctly grouped in the pgss reports. The null_sequence_number reset each time we run through a non-NULL node in variant B reduces the entropy a bit, btw..

The argument about adding a counter in AppendJumble() is that it is the wrong level for such a change. Anyway, perhaps we should have your extra byte '\0' or something equivalent added to JUMBLE_STRING() when dealing with a NULL string rather than ignoring it. There's an argument about simplicity, IMO; still, it is true that ignoring a NULL value reduces the entropy of the query ID. So I'd be OK with that if you feel strongly about this point, as it sounds to me that you do? This could be better than a hardcoded '\0' byte, as we could use a counter and JUMBLE_FIELD_SINGLE() in the NULL case of JUMBLE_STRING(). It's not proved to be a problem yet, and there's value in having a simpler solution, as well.

Anyway, one line I'd like to draw is that AppendJumble() remains as it is written currently. One extra thing that we could do is to update it so that the size of a given item is always more than zero, with an Assert(), to enforce a stricter policy.
--
Michael
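The counter idea in this counter-proposal can be modeled in miniature. CtrJumbleState and the helper names are illustrative stand-ins for JumbleState and _jumbleNode(), not the v4 patch itself:

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the counter idea: the node walker bumps a per-jumble
 * counter and appends it for every node visited, NULL or not, so every
 * visit leaves unique bytes in the buffer.
 */
typedef struct CtrJumbleState
{
    unsigned char buf[1024];
    size_t      len;
    unsigned int node_count;
} CtrJumbleState;

static void
ctr_append(CtrJumbleState *js, const void *item, size_t size)
{
    memcpy(js->buf + js->len, item, size);
    js->len += size;
}

static void
ctr_jumble_node(CtrJumbleState *js, const int *node)
{
    js->node_count++;
    ctr_append(js, &js->node_count, sizeof(js->node_count));
    if (node == NULL)
        return;                 /* the counter bytes already cover this visit */
    ctr_append(js, node, sizeof(*node));
}
```

Unlike a NULL-only marker, the counter bytes land in the buffer even when a (possibly custom) jumbling step otherwise emits nothing, so visiting (X, NULL) and (NULL, X) can never produce the same byte stream.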
Attachment
Hello, Michael!

> So, here is attached a counter-proposal, where we can simply add a
> counter tracking a node count in _jumbleNode() to add more entropy to
> the mix, incrementing it as well for NULL nodes.

It definitely looks like a more reliable solution than my variant, which only counts NULL nodes. However, we already knew about the overhead of adding `\0` bytes for every NULL field:

> So that adds about 9.1% overhead to jumbling, on average.

See:
https://www.postgresql.org/message-id/flat/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain#6c43f354f5f42d2a27e6824faa660a86

Is it really worth spending extra execution time to increase entropy when we have non-NULL nodes? Maybe we should choose to add node_count to the hash only when we visit non-NULL nodes, or only when we visit NULL ones, rather than every time. We could also add entropy if we see a change in the node->type value for the non-NULL variants.

Your Variant
------------
< node_count = 1 > < node 1 >
< node_count = 2 > /* node 2 = NULL */
< node_count = 3 > < node 3 >

Alternative 1 (mark only NULL Nodes)
------------------------------------
/* node_count = 1 */ < node 1 >
< node_count = 2 > /* node 2 = NULL */
/* node_count = 3 */ < node 3 >

Alternative 2 (mark only non-NULL Nodes)
----------------------------------------
This could address concerns about problems related to visiting nodes with the same content placed in different query tree branches.

< node_count = 1 > < node 1 >
/* node_count = 2 */ /* node 2 = NULL */
< node_count = 3 > < node 3 >
On Mon, Mar 17, 2025 at 07:33:42AM +0000, Bykov Ivan wrote:
> See:
> https://www.postgresql.org/message-id/flat/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain#6c43f354f5f42d2a27e6824faa660a86
>
> Is it really worth spending extra execution time to increase entropy
> when we have non-NULL nodes?

The computed time is already quite cheap, so I'm not much worried about that, FWIW.

> We could also add entropy if we see a change in the node->type value for
> non-NULL variants.

I am not sure I get this one. The issue shows up if we have the same Node computed successively, as reported on this thread. It would still be a problem for different node types, though less likely, when two node types use similar fields.
--
Michael
Attachment
On Mon, 17 Mar 2025 at 13:53, Michael Paquier <michael@paquier.xyz> wrote: > So, here is attached a counter-proposal, where we can simply added a > counter tracking a node count in _jumbleNode() to add more entropy to > the mix, incrementing it as well for NULL nodes. I had a look at this and it seems the main difference will be that this patch will protect against the case that a given node is non-null but has a custom jumble function which selects to not jumble anything in some cases. Since Ivan's patch only adds jumbling for non-null, that could lead to the same bug if a custom jumble function decided to not jumble anything at all. FWIW, I did the same test to jumble all make check queries with your patch, same as [1]: master: 73.66 milliseconds 0001-Query-ID-Calculation-Fix-Variant-B.patch: 80.36 milliseconds v4-0001-Add-more-entropy-to-query-jumbling.patch: 88.84 milliseconds I know you've said in [2] that you're not worried about performance, but I wanted to see if that's true... I can measure the cost of compute_query_id=on with pgbench -S workload on my AMD 3990x machine. 
I tried with that on "auto", then with "on" on master, and again with your v4 patch when set to "on". (The -c 156 -j 156 is just there to ensure this machine properly clocks up, which it still seems hesitant to do sometimes, even with the performance power settings.)

drowley@amd3990x:~$ echo master compute_query_id = auto && for i in {1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl start -D pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60 -c 156 -j 156 postgres | grep tps; done
master compute_query_id = auto
tps = 1202451.945998
tps = 1197645.856236
tps = 1182345.403291
tps = 1182440.562773
tps = 1180731.003390
tps = 1185277.478449
tps = 1174983.732094
tps = 1176310.828235
tps = 1179040.622103
tps = 1177068.520272

drowley@amd3990x:~$ echo master compute_query_id = on && for i in {1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl start -D pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60 -c 156 -j 156 postgres | grep tps; done
master compute_query_id = on
tps = 1158684.006821
tps = 1165808.752641
tps = 1158269.999683
tps = 1146730.269628
tps = 1154200.484194
tps = 1152750.152235
tps = 1150438.486321
tps = 1150649.669337
tps = 1144745.761593
tps = 1157743.530383

drowley@amd3990x:~$ echo v4-0001-Add-more-entropy-to-query-jumbling.patch compute_query_id = on && for i in {1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl start -D pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60 -c 156 -j 156 postgres | grep tps; done
v4-0001-Add-more-entropy-to-query-jumbling.patch compute_query_id = on
tps = 1156782.516710
tps = 1162780.019911
tps = 1142104.075069
tps = 1145651.299640
tps = 1143157.945891
tps = 1150275.033626
tps = 1145817.267485
tps = 1135915.694987
tps = 1138703.153873
tps = 1138436.802882

It's certainly hard to see much slowdown between master and v4, but it is visible if I sort the results and put them in a line graph and scale the vertical axis a bit. A 0.7% slowdown. See attached.
I didn't run the same test on the 0001-Query-ID-Calculation-Fix-Variant-B.patch patch, but with the first set of results I posted above, you could assume it's less than 0.7% overhead.

Locally, I did play around with internalising the AppendJumble function so my compiler would compile a dedicated JumbleNull function, which removes a bit of branching and code for the memcpy. I managed to get it going a bit faster that way.

I do agree that your v4 fixes the issue at hand and I'm not going to stand in your way if you want to proceed with it. However, from my point of view, it seems that we could achieve the same goal with much less overhead by just insisting that custom jumble functions always jumble something. We could provide a jumble function to do that if the custom function conditionally opts to jumble nothing. That would save having to add this additional jumbling of the node count.

Also, am I right in thinking that there's no way for an extension to add a custom jumble function? If so, then we have full control over any custom jumble function. That would make it easier to ensure these always jumble something.

David

[1] https://postgr.es/m/CAApHDvqaKySJdBf2v2_ybNuT=KObaUVBU4_5kgZfFTMSJr-Vmg@mail.gmail.com
[2] https://postgr.es/m/Z9flafNlH4-1YSJW@paquier.xyz
On Tue, Mar 18, 2025 at 05:24:42PM +1300, David Rowley wrote:
> I had a look at this and it seems the main difference will be that
> this patch will protect against the case that a given node is non-null
> but has a custom jumble function which selects to not jumble anything
> in some cases. Since Ivan's patch only adds jumbling for non-null,
> that could lead to the same bug if a custom jumble function decided to
> not jumble anything at all.

Yeah.

> It's certainly hard to see much slowdown between master and v4, but it
> is visible if I sort the results and put them in a line graph and
> scale the vertical axis a bit. A 0.7% slowdown. See attached.

Thanks for the numbers. I'd like to say that this is within noise, but it's clearly not if it's repeatable, as you are pointing out. Ugh.

> I do agree that your v4 fixes the issue at hand and I'm not going to
> stand in your way if you want to proceed with it. However, from my
> point of view, it seems that we could achieve the same goal with much
> less overhead by just insisting that custom jumble functions always
> jumble something. We could provide a jumble function to do that if the
> custom function conditionally opts to jumble nothing. That would save
> having to add this additional jumbling of the node count.

If we make the whole thing cheaper, with extra entropy added only for NULLs in nodes and strings, I'd rather have an insurance policy for the custom functions. Something like that:
- Forbid a size of 0 in AppendJumble().
- When dealing with a non-NULL case in _jumbleNode(), save the initial jumble_len and the jumble contents when starting, then complain in an assert if the jumble_len matches the initial length at the end and the contents are the same. A check on the length may be enough, but we'd depend on JUMBLE_SIZE, and nodes can get pretty big.

What do you think?

> Also, am I right in thinking that there's no way for an extension to
> add a custom jumble function?
> If so, then we have full control over
> any custom jumble function. That would make it easier to ensure these
> always jumble something.

Currently, no, that would not be possible through this code path. The only way would be to fork the code, as we rely entirely on the code generated from the in-core headers.
--
Michael
On Tue, 18 Mar 2025 at 21:00, Michael Paquier <michael@paquier.xyz> wrote:
> If we make the whole cheaper with only extra entropy added for NULLs
> in nodes and strings, I'd rather have an insurance policy for the
> custom functions. Something like that:
> - Forbid a size of 0 in AppendJumble().
> - When dealing with a non-NULL case in _jumbleNode(), save the initial
> jumble_len and the jumble contents when starting, then complain if the
> jumble_len matches with the initial length at the end and if the
> contents are the same in an assert. A check on the length may be
> enough, but we'd depend on JUMBLE_SIZE and nodes can get pretty big.
>
> What do you think?

If it's for Assert-enabled builds only, then to save having to look at the buffer, you could have an extra field similar to jumble_len that does not get reset when the jumble buffer fills. Just assert that total_jumbled_bytes has grown after jumbling a node, maybe?

David
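A minimal sketch of that suggestion (hypothetical names; the real JumbleState and AppendJumble() live in src/backend/nodes/queryjumblefuncs.c, and the tiny buffer here only exists to force flushes): a second length field keeps growing even when the working buffer is hashed and reset, so an assertion can verify that jumbling a node contributed at least one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define JUMBLE_SIZE 8			/* tiny buffer to force flushes in this toy */

typedef struct JumbleStateSketch
{
	unsigned char jumble[JUMBLE_SIZE];
	size_t		jumble_len;		/* resets when the buffer fills */
	size_t		total_jumble_len;	/* never resets: monotonic */
} JumbleStateSketch;

static void
append_jumble(JumbleStateSketch *jstate, const void *item, size_t size)
{
	const unsigned char *p = item;

	while (size > 0)
	{
		size_t		part = size;

		/* stand-in for hashing the full buffer and starting over */
		if (jstate->jumble_len >= JUMBLE_SIZE)
			jstate->jumble_len = 0;
		if (part > JUMBLE_SIZE - jstate->jumble_len)
			part = JUMBLE_SIZE - jstate->jumble_len;
		memcpy(jstate->jumble + jstate->jumble_len, p, part);
		jstate->jumble_len += part;
		jstate->total_jumble_len += part;	/* grows across flushes */
		p += part;
		size -= part;
	}
}

/* Usage: wrap each node's jumbling with a growth check. */
static void
jumble_node_checked(JumbleStateSketch *jstate, const void *node, size_t size)
{
	size_t		before = jstate->total_jumble_len;

	append_jumble(jstate, node, size);	/* a no-op custom function... */
	assert(jstate->total_jumble_len > before);	/* ...would trip this */
}
```

Because total_jumble_len survives buffer resets, the check works even when a node's bytes happen to straddle a flush boundary.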
On Tue, Mar 18, 2025 at 09:24:06PM +1300, David Rowley wrote:
> If it's for Assert enabled builds only, then to save from having to
> look at the buffer, you could have an extra field similar to
> jumble_len, but does not get reset when the jumble buffer fills. Just
> assert that the total_jumbled_bytes has grown after jumbling a node,
> maybe?

Indeed, this would work as well, and there's little point in going full-memcmp() mode for the sake of such a check.

I have been doing some measurements to see how the NULL addition affects the computation, and with the attached (based on what you have sent upthread), I can see that a "SELECT 1", which would be one of the worst cases possible based on its simple Query structure with 19 NULLs and not much else (?), jumps from 80ms to 107~109ms for 100k loops calling JumbleQuery(), meaning that a single computation moves from 800ns to 1070~1090ns on my local machine. So this stuff costs, which is not surprising, but it does not seem *that* bad compared to a full query run.

Potential ideas about optimizing the branches seem worth investigating. I am not sure that I would be able to dig into that before the feature freeze gong, unfortunately, and it would be nice to fix this issue for this release. I'll see about doing some pgbench runs on my side as well, with tpc-b Querys.

Thoughts?
--
Michael
On Fri, Mar 21, 2025 at 02:45:09PM +0900, Michael Paquier wrote:
> Potential ideas about optimizing the branches seem worth
> investigating, I am not sure that I would be able to dig into that
> until the feature freeze gong, unfortunately, and it would be nice to
> fix this issue for this release. I'll see about doing some pgbench
> runs on my side, as well, with tpc-b Querys.

So, here are some numbers with a "SELECT 1;" using pgbench, which would be a worst case. Then, at 128 clients on a machine I have around:

- HEAD, compute_query_id = on:
tps = 282081.839715
tps = 281379.854539
tps = 283708.205096
tps = 280197.225782 #worst
tps = 291955.914664 #best
tps = 284172.351609
tps = 287234.624720
tps = 287915.683291 #best
tps = 277004.123742 #worst
tps = 281326.071323

- Patch, compute_query_id = on:
tps = 289746.641246 #best
tps = 279689.654196
tps = 284874.774620 #best
tps = 282312.246610
tps = 275807.057758 #worst
tps = 276058.858384 #worst
tps = 282705.638685
tps = 282651.367641
tps = 276777.418569
tps = 283137.265298

Removing the two best and two worst numbers and taking an average, then computing the difference as:
(head_tps - patch_tps) * 2 / (head_tps + patch_tps)
that's 283317.15 (HEAD) vs 281212.26 (patch), so a 0.7% decrease. Still, that could be claimed as noise.

Second set, at 1 client, where a clearer pattern becomes visible:

- HEAD, compute_query_id = on:
tps = 19170.473320 #best
tps = 18722.828797
tps = 19138.765391
tps = 18243.611593
tps = 17878.605004 #worst
tps = 19557.753327 #best
tps = 18395.734885
tps = 18433.188326
tps = 18227.914138 #worst
tps = 18450.236841

- Patch, compute_query_id = on:
tps = 16611.766411
tps = 15992.734615 #worst
tps = 16465.077482
tps = 16726.097318
tps = 17284.874897 #best
tps = 16076.378905
tps = 17971.534072 #best
tps = 15733.718159 #worst
tps = 16911.014660
tps = 17217.202679

And here we get 18564.06 (HEAD) vs 16667.92 (patch), so a 10.7% difference in this run.
Hence the automatic addition of NULLs to the jumbling is disappointing here, even if I'd see this as a worst-case scenario, unlikely to match what one would see for real if they care about monitoring. On the other hand, we still have the trick of switching the fields, which would mean no performance penalty at all while still fixing the original issue. And with the feature freeze in sight, time is running short if we want to do something here.

NB: In case it helps, a second run with 1 client showing the same trend:

- HEAD:
tps = 19697.406181
tps = 19474.495729
tps = 18614.615630
tps = 18868.245926
tps = 18949.775164
tps = 18943.311921
tps = 18824.192383
tps = 19104.542301
tps = 18950.870646
tps = 19331.829745

- Patch:
tps = 17887.864764
tps = 17317.628065
tps = 17090.278323
tps = 15876.258607
tps = 16818.644938
tps = 16816.334313
tps = 18134.148809
tps = 16891.809962
tps = 18193.104125
tps = 16866.726990
--
Michael
On Mon, 24 Mar 2025 at 15:23, Michael Paquier <michael@paquier.xyz> wrote:
> And here we get 18564.06 (head) vs 16667.92 (patch) so a 10.7%
> difference in this run. Hence the automatic addition of NULL to the
> jumbling is disappointing here, even if this is what I'd see this as a
> worst case scenario, unlikely what one would see for real if they care
> about monitoring.

Can you share which patch you tested here? There are many patches on this thread and I don't want to make assumptions about which one these results are for. It sounds like it wasn't your latest patch, as you mentioned the "automatic addition of NULL to the jumbling", which doesn't sound like the correct description of your latest patch.

David
On Mon, Mar 24, 2025 at 07:42:21PM +1300, David Rowley wrote:
> Can you share which patch you tested here? There are many patches on
> this thread and I don't want to make assumptions about which one these
> are the results for.

The patch tested is the latest one that has been posted on this thread, for a comparison of HEAD at 8a3e4011f02d and this 0001 (without 0002, which was forcing a loop to make the jumbling heavier):
https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

The message holding the patch tested is the one I've replied to when posting the results shared here:
https://www.postgresql.org/message-id/Z-DB83o5AZmgnvnA@paquier.xyz
--
Michael
On Fri, 21 Mar 2025 at 18:45, Michael Paquier <michael@paquier.xyz> wrote:
> Potential ideas about optimizing the branches seem worth
> investigating, I am not sure that I would be able to dig into that
> until the feature freeze gong, unfortunately, and it would be nice to
> fix this issue for this release. I'll see about doing some pgbench
> runs on my side, as well, with tpc-b Querys.

I've come up with a version of this that fixes the issue and improves performance rather than regresses it. See attached.

This patch uses the same "ensure we jumble at least something when we get a NULL" method, but works differently to anything previously proposed here. Instead of modifying the jumble buffer each time we see a NULL, the patch just increments a counter. It's only when something else actually needs to append bytes to the buffer that we must flush the pending NULLs to the buffer beforehand. I've opted to just write the consecutive NULL count to the buffer, which allows the NULLs to be appended more efficiently. This has the advantage of having to do very little when we see NULL values, and would probably do well on your "SELECT 1" test, which has lots of NULLs.

I've also performed a round of optimisations to the code:

1) The jumble size in AppendJumble should never be zero, so the while (size > 0) loop can become a do while loop. This halves the size checks for the normal non-overflow case.

2) Added inlined functions to jumble common field sizes. This saves having to move the size into the appropriate register before calling the AppendJumble function and reduces the size of the compiled code as a result. It also allows the fast-path I added to AppendJumbleInternal for the non-overflow case to work more efficiently, as the memcpy can be transformed into fewer instructions due to the size being both known and small enough to perform the copy in a single instruction. This previously didn't work since part_size is a variable in the inlined function.
With the fast-path, "size" is used, which is a compile-time constant in the inlined function.

I tested the performance of this using the patch from [1] and running "make check" 50 times. This gave me the total time spent jumbling all of the make check queries. I sorted the results by time and graphed them in the attached graph. The patched version is on average 1.73% *faster* than master on my AMD 3990x machine:

master: 76.564 milliseconds
v7 patch: 75.236 milliseconds

It should be possible to still add the total_jumble_len stuff to this in assert builds with zero extra overhead in non-assert builds, which I think should address your concern with custom jumble functions not jumbling anything. I can adjust accordingly if you're happy to use the method I'm proposing.

The tests done were in a version of master prior to 5ac462e2b. I've not yet studied whether anything needs to be adjusted to account for that commit.

David

[1] https://www.postgresql.org/message-id/attachment/174042/jumbletime.patch.txt
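The pending-NULL scheme described above can be reduced to a toy sketch (hypothetical names; the hash is shown as a simple FNV-1a accumulator rather than the real jumble buffer): NULL visits only increment a counter, and the counter is flushed into the hash just before the next real append and before the final hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct JumbleSketch
{
	uint64_t	hash;			/* stand-in for the jumble buffer */
	uint32_t	pending_nulls;	/* consecutive NULLs not yet hashed */
} JumbleSketch;

static void
mix(JumbleSketch *js, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
		js->hash = (js->hash ^ *p++) * 1099511628211ULL;	/* FNV-1a */
}

static void
flush_pending_nulls(JumbleSketch *js)
{
	if (js->pending_nulls > 0)
	{
		uint32_t	n = js->pending_nulls;

		js->pending_nulls = 0;
		mix(js, &n, sizeof(n));	/* one append, however many NULLs */
	}
}

/* Seeing a NULL is nearly free: just a counter increment. */
static void
jumble_null(JumbleSketch *js)
{
	js->pending_nulls++;
}

/* Real data flushes any pending NULLs first, keeping field order. */
static void
jumble_value(JumbleSketch *js, int v)
{
	flush_pending_nulls(js);
	mix(js, &v, sizeof(v));
}

static uint64_t
finish(JumbleSketch *js)
{
	flush_pending_nulls(js);	/* trailing NULLs still count */
	return js->hash;
}
```

Because the flush happens before the next append, (X, NULL) and (NULL, X) hash different byte sequences, yet a long run of NULLs costs only increments until the single flush.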
Hello, David!

As I can see, your patch has the same idea as my v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1]. I think it would be better to extract the jumble buffer update with hash calculation into a function (CompressJumble in my patch). This will result in fewer changes to the source code. We can also simply dump the null count to the buffer when we encounter a non-null field (which will be processed in AppendJumble, indeed).

What do you think about my patch (v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

[1] https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain
On Tue, 25 Mar 2025 at 21:14, Bykov Ivan <i.bykov@modernsys.ru> wrote:
> As I can see, your patch has the same idea as my v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].
> I think it would be better to extract the jumble buffer update with hash calculation into a function
> (CompressJumble in my patch). This will result in fewer changes to the source code.
> We can also simply dump the null count to the buffer when we encounter a non-null field
> (which will be processed in AppendJumble, indeed).
>
> What do you think about my patch (v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

Seemingly, I missed that one. I agree with the general method being suitable, which might not be surprising since I ended up at the same conclusion independently. I quite like your CompressJumble() idea, but I don't know that it amounts to any compiled code changes.

I think it's a bit strange that you're flushing the pending nulls to the jumble buffer after jumbling the first non-null. I think it makes much more sense to do it before, as I did; otherwise, you're effectively jumbling fields out of order.

I see you opted to only jumble the low-order byte of the pending null counter. I used the full 4 bytes as I don't think the extra 3 bytes is a problem. With v7, the memcpy to copy the pending_nulls into the buffer is inlined into a single instruction. It's semi-conceivable that you could have more than 255 NULLs given that a List can contain Nodes. I don't know if we ever have any NULL items in Lists at the moment, but I don't think that that's disallowed. I'd feel much safer taking the full 4 bytes of the counter to help prevent future surprises.

Aside from that, v2b still has the performance regression issue that we're trying to solve. I did a fresh run of master as of 19c6eb06c and tested v7 and v2b. Here are the average times it took to jumble all the make check queries over 50 runs:

master: 76.052 ms
v7: 75.422 ms
v2b: 81.391 ms

I also attached a graph.
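The wrap-around concern is easy to make concrete with a toy calculation (illustrative only, not the patch's code; the FNV-1a constants are just a stand-in hash): if only the low-order byte of the counter is flushed, a run of 257 pending NULLs contributes exactly the same bytes as a run of 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hash a run of NULLs the way a flush would, writing counter_width
 * bytes of the pending counter, starting from the low-order byte.
 */
static uint64_t
hash_null_run(uint32_t pending_nulls, size_t counter_width)
{
	uint64_t	h = 14695981039346656037ULL;	/* FNV offset basis */
	size_t		i;

	for (i = 0; i < counter_width; i++)
	{
		h = (h ^ (pending_nulls & 0xFF)) * 1099511628211ULL;
		pending_nulls >>= 8;	/* consume bytes from the low end */
	}
	return h;
}
```

A 1-byte flush collides for 257 vs 1 pending NULLs (257 mod 256 = 1), while the full 4 bytes keep them distinct.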
I'm happy to proceed with the v7 version. I'm also happy to credit you as the primary author of it, given that you came up with the v2b patch. It might be best to extract the performance stuff that's in v7 and apply that separately from the bug fix. If you're still interested and have time in the next 7 hours to do it, would you be able to split the v7 as described, making v8-0001 the bug fix and v8-0002 the performance stuff? If so, could you also add the total_jumble_len to v8-0001 like Michael's last patch had, but adjusted so it's only maintained for Assert builds? Michael is quite keen on having more assurance that custom jumble functions don't decide to forego any jumbling.

If you don't have time, I'll do it in my morning (UTC+13).

Thanks

David

> [1] https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain
On Wed, 26 Mar 2025 at 01:11, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b
> patch. It might be best to extract the performance stuff that's in v7
> and apply that separately from the bug fix. If you're still interested
> and have time in the next 7 hours to do it, would you be able to split
> the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
> performance stuff? If so, could you also add the total_jumble_len to
> v8-0001 like Michael's last patch had, but adjust so it's only
> maintained for Assert builds. Michael is quite keen on having more
> assurance that custom jumble functions don't decide to forego any
> jumbling.
>
> If you don't have time, I'll do it in my morning (UTC+13).

Here is the v8 version with the bug fix and performance stuff separated out. I also added the Assert code to ensure we always add something to the jumble buffer when jumbling a node.

I didn't include the regression tests I saw in the v2b patch [1]. I felt these were just marking the fact that there used to be a bug here.

David

[1] https://postgr.es/m/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain
On Wed, Mar 26, 2025 at 11:56:50AM +1300, David Rowley wrote:
> Here is the v8 version with the bug fix and performance stuff
> separated out.

Why not. I assume that you would merge these together?

> I also added the Assert code to ensure we always add
> something to the jumble buffer when jumbling a node.

Thanks. This part looks good with its USE_ASSERT_CHECKING.

> I didn't include the regression tests I saw in the v2b patch [1]. I
> felt these were just marking the fact that there used to be a bug
> here.

I disagree about the lack of tests. These are cheap to add. I'd suggest not using the ones posted as part of v2 variant B, and instead using the set I have posted in patch 0001 here; these do not require the creation of a relation, so they can be made cheaper, and they cover most of the main scenarios from the parser with these nodes in the Query:
https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

It is a bit disappointing that we require JumbleQuery() to jumble the NULLs. There are discussions about making this code available not only for Querys, but for Nodes, and the NULL computations would be missed...
--
Michael
Hello, David!

> I see you opted to only jumble the low-order byte of the pending null
> counter. I used the full 4 bytes as I don't think the extra 3 bytes is
> a problem. With v7 the memcpy to copy the pending_nulls into the
> buffer is inlined into a single instruction. It's semi-conceivable
> that you could have more than 255 NULLs given that a List can contain
> Nodes. I don't know if we ever have any NULL items in Lists at the
> moment, but I don't think that that's disallowed. I'd feel much safer
> taking the full 4 bytes of the counter to help prevent future
> surprises.

Yes, _jumbleList could theoretically handle cases with over 256 Node elements in the same list, but most (or all) of them are non-NULL (as far as I know, only PlannedStmt->subplans accepts NULL elements). Most of the foreach cycles over query lists don't have any NULL safety checks; we just assume that the lists contain no NULL pointers. I still believe storing just one byte is sufficient.

> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b
> patch. It might be best to extract the performance stuff that's in v7
> and apply that separately from the bug fix. If you're still interested
> and have time in the next 7 hours to do it, would you be able to split
> the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
> performance stuff? If so, could you also add the total_jumble_len to
> v8-0001 like Michael's last patch had, but adjust so it's only
> maintained for Assert builds. Michael is quite keen on having more
> assurance that custom jumble functions don't decide to forego any
> jumbling.

As I can see, you have already split the patches. I apologize for the delay; I don't have much time to work on this issue. Thank you for your proposal to credit me as the patch author.
On Wed, 26 Mar 2025 at 14:04, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Mar 26, 2025 at 11:56:50AM +1300, David Rowley wrote:
> > Here is the v8 version with the bug fix and performance stuff
> > separated out.
>
> Why not. I assume that you would merge these together?

I think it's better as two commits. They're for different purposes.

> I disagree about the lack of tests. These are cheap to add. I'd
> suggest to not use the ones posted as part of v2 variant B as these do
> not require the creation of a relation so they can be made cheaper,
> and instead use the set I have posted in patch 0001 here which covers
> most of the main scenarios from the parser with this nodes in the
> Query:
> https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

Ok, I won't protest. I just feel the bug is a relic of the old design and I don't believe the tests are testing anything useful in the new design. It seems unlikely that someone would accidentally adjust the code and revert it to the old design. Anyway, I've included the tests from your patch.

> It is a bit disappointing that we require JumbleQuery() to jumble the
> NULLs. There are discussions about making this code available not
> only for Querys, but for Nodes, and the NUL computations would be
> missed..

I've adjusted the patch to add two static functions, InitJumble() and DoJumble(). JumbleQuery() uses these.

David
On Thu, Mar 27, 2025 at 12:09:35PM +1300, David Rowley wrote:
> I think it's better it's two commits. They're for different purposes.

Fine for me.

> Ok, I won't protest. I just feel the bug is a relic of the old design
> and I don't believe the tests are testing anything useful in the new
> design. It feels unlikely that someone accidentally adjusts the code
> and reverts it to the old design. Anyway, I've included the tests from
> your patch.

One habit that I've found really useful when it comes to this area of the code is to apply the tests first, to show what kind of behavior we had before changing the jumbling, and then apply the update to the query jumbling. This has two benefits:
- If the update of the second phase gets reverted, we still have the tests.
- It is possible to show in the git history how the behavior changes, analyzing the changes in isolation.

So I'd suggest an extra split, with the tests committed first.

>> It is a bit disappointing that we require JumbleQuery() to jumble the
>> NULLs. There are discussions about making this code available not
>> only for Querys, but for Nodes, and the NUL computations would be
>> missed..
>
> I've adjusted the patch to add two static functions. InitJumble() and
> DoJumble(). JumbleQuery() uses these.

That's cleaner, thanks for the split. I don't think we've reached a consensus on the other thread about what to provide as an interface, and it's likely too late for this release anyway; what you are proposing here is a good step forward, and will help avoid messing up again with the problem of this thread in the future.

+#ifdef USE_ASSERT_CHECKING
+	Size		prev_jumble_len = jstate->total_jumble_len;
+#endif
[...]
+	/* The total number of bytes added to the jumble buffer */
+	Size		total_jumble_len;
 } JumbleState;

Maybe hide that in the header with one extra USE_ASSERT_CHECKING if we are only going to increment it when assertions are enabled?
+ * These are flushed out to the jumble buffer before subsequent appends
+ * and before performing the final jumble hash.

This comment is slightly incorrect when 0001 is taken in isolation? The flush of the NULL counter is done in the patch via FlushPendingNulls() when jumbling the final value, not before more appends. It becomes true with 0002, where FlushPendingNulls() is called before appending any data.

Perhaps removing JUMBLE_FIELD_SINGLE() is better now.
--
Michael
Maybe I am missing something, but when I applied the v9-0001 patch alone, it did not solve the problem it claims to, and the pg_stat_statements regression test fails:

+++ /local/home/simseih/pghead/install1/postgresql/contrib/pg_stat_statements/results/select.out 2025-03-27 04:56:35.143855392 +0000
@@ -215,14 +215,12 @@
     3 |    3 | SELECT $1 + $2 + $3 AS "add"
     1 |    1 | SELECT $1 AS "float"
     2 |    2 | SELECT $1 AS "int"
-    2 |    2 | SELECT $1 AS "int" LIMIT $2
-    2 |    0 | SELECT $1 AS "int" OFFSET $2
+    4 |    2 | SELECT $1 AS "int" LIMIT $2
     6 |    0 | SELECT $1 AS "int" OFFSET $2 LIMIT $3
-    2 |    2 | SELECT $1 AS "int" ORDER BY 1
     1 |    2 | SELECT $1 AS i UNION SELECT $2 ORDER BY i
     1 |    1 | SELECT $1 || $2
     1 |    1 | SELECT $1, $2 LIMIT $3
-    2 |    2 | SELECT DISTINCT $1 AS "int"
+    4 |    4 | SELECT DISTINCT $1 AS "int"
     0 |    0 | SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"
     1 |    1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
     1 |    2 | WITH t(f) AS (                  +
@@ -230,7 +228,7 @@
       |      | )                               +
       |      | SELECT f FROM t ORDER BY f
     1 |    1 | select $1::jsonb ? $2
-(17 rows)
+(15 rows)

but when v9-0002 is applied, the regress tests then succeed.
--
Sami Imseih
Amazon Web Services (AWS)
On Thu, 27 Mar 2025 at 18:03, Sami Imseih <samimseih@gmail.com> wrote:
> Maybe I am missing something, but when I applied the
> v9-0001 patch alone, it did not solve the problem it
> claims to and the pg_s_s regression test fails:

That was an artifact of me not having made the split quite in the right place when dividing the patch into two. What's just been committed fixed that.

Thanks for looking.

David
On Thu, 27 Mar 2025 at 14:51, Michael Paquier <michael@paquier.xyz> wrote:
> One habit that I've found really useful to do when it comes to this
> area of the code is to apply the tests first to show what kind of
> behavior we had before changing the jumbling, then apply the update to
> the query jumbling. This has two benefits:
> - If the update of the second phase gets reverted, we still have the
> tests.
> - It is possible to show in the git history how the behavior changes,
> analyzing them in isolation.
>
> So I'd suggest an extra split, with the tests committed first.

I didn't do that. I can't quite process the logic of adding tests to check that we get the wrong behaviour.

> Maybe hide that in the header with one extra USE_ASSERT_CHECKING if we
> are only going to increment it when assertions are enabled?

I thought about that and had decided to leave it in. I had been concerned about the struct growing and shrinking between cassert and non-cassert builds. It's probably ok since it's expected to be an internal field. I guess if a cassert extension used the field and ran against a non-cassert PostgreSQL, then they'd have trouble. I did end up removing it in what I just pushed. I felt I might have been overly concerned, since the field is at the end of the struct and is commented as being for Asserts only.

> This comment is slightly incorrect when 0001 is taken in isolation?
> The flush of the NULL counter is done in the patch via
> FlushPendingNulls() when jumbling the final value, not before more
> appends? It becomes true with 0002, where FlushPendingNulls() is
> called before appending any data.

That was my mistake. I divided the patches incorrectly. There was meant to be a flush in there.

> Perhaps removing JUMBLE_FIELD_SINGLE() is better now.

Done and pushed. Thanks.

David