Thread: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Bykov Ivan

Hi, all!

It seems I have found a bug in the Query ID calculation.

Problem
=======
In some cases, we could get the same ID for non-identical query trees.
For two structurally similar query subnodes, we may have query
trees that look like this:

----
QueryA->subNodeOne = Value X;
QueryA->subNodeTwo = NULL;

QueryB->subNodeOne = NULL;
QueryB->subNodeTwo = Value X;
----

When the query jumble walker calculates the query ID, it traverses the
Query members from top to bottom and generates the same ID for these
two queries because it does not change the hash value when visiting
an empty node (= NULL).

Here is an example (the collection of jstate->jumble is omitted):
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = BBBB */

/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = AAAA */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = BBBB */
----
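For reference, this is the early return in _jumbleNode()
(src/backend/nodes/queryjumblefuncs.c) that skips empty nodes; a trimmed
excerpt of the current code:

----
static void
_jumbleNode(JumbleState *jstate, Node *node)
{
	Node	   *expr = node;

	if (expr == NULL)
		return;			/* a NULL subnode leaves the jumble untouched */

	/* Guard against stack overflow due to overly complex expressions */
	check_stack_depth();

	/* ... JUMBLE_FIELD(type) and per-node jumbling follow ... */
}
----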

There are two pairs of subnodes that can trigger this error:
- distinctClause and sortClause (both contain a list of SortGroupClauses);
- limitOffset and limitCount (both contain an int8 expression).

Here is an example of such errors (all queries run on REL_17_0):
----
SET compute_query_id = on;

/* DISTINCT / ORDER BY *************************************/
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
/* Query Identifier: 751948508603549510 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
/* Query Identifier: 751948508603549510 */

/* LIMIT / OFFSET ******************************************/
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
/* Query Identifier: 5185884322440896420 */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
/* Query Identifier: 5185884322440896420 */
----

Solution One
============

The simplest way to fix the problem is to place a scalar field used in
the query ID calculation between similar subnodes. A patch for this
solution is attached below (0001-Query-ID-Calculation-Fix-Variant-A.patch).

Here is an example:
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->scalar = Value Y;
/* QueryA_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryA_ID) = CCCC */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = CCCC */

/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = AAAA */
QueryB->scalar = Value Y;
/* QueryB_ID = hash_any_extended(&(Value Y), sizeof(Value Y), QueryB_ID) = DDDD */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = EEEE */
----
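Concretely, for the two problematic pairs in the Query struct, the
reordering looks like this (taken from the attached patch, quoted later
in this thread; limitOption, a scalar, now separates the two int8
expression nodes, and sortClause no longer directly follows
distinctClause):

----
	List	   *distinctClause; /* a list of SortGroupClause's */

	Node	   *limitOffset;	/* # of result tuples to skip (int8 expr) */
	LimitOption limitOption;	/* limit type */
	Node	   *limitCount;		/* # of result tuples to return (int8 expr) */

	List	   *sortClause;		/* a list of SortGroupClause's */
----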

Solution Two
============
Alternatively, we can change the hash sum when we encounter an empty node.
This approach may impact performance but will protect us from such errors
in the future. A patch for this solution is attached below
(0001-Query-ID-Calculation-Fix-Variant-B.patch).

Here is an example:
----
/* QueryA_ID = AAAA */
QueryA->subNodeOne = &(Value X);
/* QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) = BBBB */
QueryA->subNodeTwo = NULL;
/* QueryA_ID = hash_any_extended(&('\0'), sizeof(char), QueryA_ID) = CCCC */

/* QueryB_ID = AAAA */
QueryB->subNodeOne = NULL;
/* QueryB_ID = hash_any_extended(&('\0'), sizeof(char), QueryB_ID) = DDDD */
QueryB->subNodeTwo = &(Value X);
/* QueryB_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryB_ID) = EEEE */
----
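In code terms, the core of the Variant B change is a small tweak to
_jumbleNode(); a minimal sketch, assuming the existing AppendJumble()
helper (the attached patch is the authoritative version):

----
static void
_jumbleNode(JumbleState *jstate, Node *node)
{
	Node	   *expr = node;

	if (expr == NULL)
	{
		/* append one marker byte so a NULL still advances the hash */
		char		null_mark = '\0';

		AppendJumble(jstate, (const unsigned char *) &null_mark,
					 sizeof(null_mark));
		return;
	}

	/* ... existing per-node jumbling continues as before ... */
}
----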

Hello!

Last time, I forgot to attach the patches.
The problem still persists in the 17.3 release.

Solution One
============
The simplest way to fix the problem is to place a scalar field used in
the query ID calculation between similar subnodes.
A patch for this solution is attached below (0001-Query-ID-Calculation-Fix-Variant-A.patch).

Solution Two
============
Alternatively, we can change the hash sum when we encounter an empty node.
This approach may impact performance but will protect us from such errors in the future.
A patch for this solution is attached below (0001-Query-ID-Calculation-Fix-Variant-B.patch).

======
SELECT version();
                                          version
--------------------------------------------------------------------------------------------
 PostgreSQL 17.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

SET compute_query_id = on;

/* LIMIT / OFFSET */
EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class LIMIT 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Limit  (cost=0.00..0.04 rows=1 width=4)
   Output: oid
   ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.15 rows=415 width=4)
         Output: oid
 Query Identifier: 5185884322440896420

EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class OFFSET 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Limit  (cost=0.04..18.15 rows=414 width=4)
   Output: oid
   ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.15 rows=415 width=4)
         Output: oid
 Query Identifier: 5185884322440896420

/* DISTINCT / ORDER BY */
EXPLAIN (VERBOSE) SELECT DISTINCT "oid" FROM pg_class;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.27..23.54 rows=415 width=4)
   Output: oid
   ->  Index Only Scan using pg_class_oid_index on pg_catalog.pg_class  (cost=0.27..22.50 rows=415 width=4)
         Output: oid
 Query Identifier: 751948508603549510

EXPLAIN (VERBOSE) SELECT "oid" FROM pg_class ORDER BY "oid";
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Index Only Scan using pg_class_oid_index on pg_catalog.pg_class  (cost=0.27..22.50 rows=415 width=4)
   Output: oid
 Query Identifier: 751948508603549510

Attachment

Here is the bug description from
https://www.postgresql.org/message-id/flat/ca447b72d15745b9a877fad7e258407a%40localhost.localdomain

Here is 0001-Query-ID-Calculation-Fix-Variant-A.patch and
0001-Query-ID-Calculation-Fix-Variant-B.patch


 

Attachment
Hi,

It seems like there are multiple threads on this topic. This is the
original [0], but I suggest continuing the discussion in this thread
since it includes the examples and patches.

Regarding the issue itself, query jumbling behavior is often subjective,
making it difficult to classify as a bug. I'm not entirely sure this
qualifies as a bug either, but I do believe it should be addressed.

As I understand it, two nodes defined one after the other, where both
could end up with the same expressions when traversed, cannot be differentiated
by query jumbling when one of them is NULL. In this case, query jumbling can't
differentiate between limitOffset and limitCount when they both result in
a similar FuncExpr.

By rearranging them as done in variant A, the position where the expression
is appended in the jumble changes, leading to a different hash. I just don't
like this solution because it requires one to carefully construct a struct,
but it may be the best of the available options.

Variant B is not acceptable IMO as it adds a whole bunch of null-terminators
unnecessarily. For example, in a simple "select 1", (expr == NULL) is true
19 times, so that is an extra 19 bytes.

I think a third option is to add a new pg_node_attr called "query_jumble_null"
and be very selective on which fields should append a null-terminator to the
jumble when the expression is null.

queryjumblefuncs.c could look like the below; JUMBLE_NULL would
be responsible for appending the null terminator.

"""
static void
_jumbleQuery(JumbleState *jstate, Node *node)
{
        Query *expr = (Query *) node;

        ...

        if (!expr->limitOffset)
        {
                JUMBLE_NULL();
        }
        else
        {
                JUMBLE_NODE(limitOffset);
        }
        if (!expr->limitCount)
        {
                JUMBLE_NULL();
        }
        else
        {
                JUMBLE_NODE(limitCount);
        }
"""

What do you think? Maybe someone can suggest a better solution?

Regards,

Sami Imseih
Amazon Web Services (AWS)


[0] https://www.postgresql.org/message-id/ca447b72d15745b9a877fad7e258407a@localhost.localdomain



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> Regarding the issue itself, query jumbling behavior is often subjective,
> making it difficult to classify as a bug. I'm not entirely sure this
> qualifies as a bug either, but I do believe it should be addressed.

I would call that a bug and something that we should fix, but not
something that we can really backpatch as this has unfortunately an
impact on monitoring tools.  Stability takes priority in this area in
already released branches.

> By rearranging them as done in variant A, the position where the expression
> is appended in the jumble changes, leading to a different hash. I just
> don't like
> this solution because it requires one to carefully construct a struct,
> but it maybe the best out of the other options.

I prefer something like variant A.  It would not be the first time we
are manipulating the structure of the parse nodes for the sake of
making the query jumbling more transparent.  An advantage of depending
on the structure definition is that we can document the expectation in
the header, and not hide it in the middle of the jumble code.

> Variant B is not acceptable IMO as it adds a whole bunch of null-terminators
> unnecessarily. For example, in a simple "select 1", (expr == NULL) is
> true 19 times,
> so that is an extra 19 bytes.

Variant B is not acceptable here.

> I think a third option is to add a new pg_node_attr called "query_jumble_null"
> and be very selective on which fields should append a null-terminator to the
> jumble when the expression is null
>
> The queryjumblefuncs.c could look like the below. JUMBLE_NULL will
> be responsible for appending the null terminator.

Not much of a fan of that.  For the sake of this thread, I'd still go
with the simplicity of A.  And please, let's add a couple of queries
in pgss to track that these lead to two different entries.

Another option that I was thinking about and could be slightly cleaner
is the addition of an extra field in Query that is set when we go
through an offset_clause in the parser.  It could just be a boolean,
false by default.  We have been using this practice in
DeallocateStmt, for example.  And there are only a few places where
limitOffset is set.  However, I'd rather discard this idea as
transformSelectStmt() also has logic to transform a LIMIT clause
into an OFFSET clause, changing a Query representation.  And note that
we calculate the query jumbling after applying the query
transformation.  For these reasons, variant A where we put the
LimitOption between the two int8 expression nodes feels like the
"okay" approach here.  But we must document this expectation in the
structure, and check for more grammar variants of LIMIT and OFFSET
clauses in pgss.

Another concept would be to add into the jumble the offset of a Node
in the upper structure we are jumbling, but this requires keeping
track of all the nested things, which would make the query jumbling
code much more complicated and more expensive, for sure.

I'd also recommend being careful and double-checking all these
transformation assumptions for LIMIT and OFFSET.  We should shape
things so that an OFFSET never maps to a LIMIT variant when we
differentiate all these flavors at the query string level.
--
Michael

Attachment
On Mon, 10 Mar 2025 at 12:39, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> > Regarding the issue itself, query jumbling behavior is often subjective,
> > making it difficult to classify as a bug. I'm not entirely sure this
> > qualifies as a bug either, but I do believe it should be addressed.
>
> I would call that a bug and something that we should fix, but not
> something that we can really backpatch as this has unfortunately an
> impact on monitoring tools.  Stability takes priority in this area in
> already released branches.

I know you reviewed this, Michael, so you're aware; 2d3389c28 did
recently document that we'd only break this in minor versions as a
last resort. So, I agree that it sounds like a master-only fix is in
order.

For the record, the docs [1] read:

"Generally, it can be assumed that queryid values are stable between
minor version releases of PostgreSQL, providing that instances are
running on the same machine architecture and the catalog metadata
details match. Compatibility will only be broken between minor
versions as a last resort."

David

[1] https://www.postgresql.org/docs/current/pgstatstatements.html



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Mon, Mar 10, 2025 at 02:14:01PM +1300, David Rowley wrote:
> I know you reviewed this, Michael, so you're aware; 2d3389c28 did
> recently document that we'd only break this in minor versions as a
> last resort. So, I agree that it sounds like a master-only fix is in
> order.

Yes, this thread's problem does not pass the compatibility bar.
Thanks for the reminder about the docs.
--
Michael

Attachment
Hello!

>> Variant B is not acceptable IMO as it adds a whole bunch of
>> null-terminators unnecessarily. For example, in a simple "select 1",
>> (expr == NULL) is true 19 times, so that is an extra 19 bytes.

> Variant B is not acceptable here.

Could we improve Variant B?

I was thinking about adding a count of NULL pointers encountered by the query
jumble walker since the last scalar or non-null field visited.
This way, we can traverse the query as follows:
====
QueryA->subNodeOne = &(Value X);
/* JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value X), sizeof(Value X), QueryA_ID) */
QueryA->subNodeTwo = NULL;
/* JumbleState->Count = 1; */
QueryA->subNodeThree = NULL;
/* JumbleState->Count = 2; */
QueryA->subNodeFour = NULL;
/* JumbleState->Count = 3; */
QueryA->scalar = Value Z;
/* QueryA_ID = hash_any_extended(&(JumbleState->Count),
    sizeof(JumbleState->Count), QueryA_ID)
   JumbleState->Count = 0;
   QueryA_ID = hash_any_extended(&(Value Z), sizeof(Value Z), QueryA_ID) */
====
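A minimal sketch of the flush step, assuming a hypothetical Count field
in JumbleState and the existing AppendJumble() helper (FlushPendingNulls
is an illustrative name, not existing code):

====
/* Illustrative only: fold any pending NULL count into the jumble right
 * before real data is appended; only the least significant byte of the
 * count is used. */
static void
FlushPendingNulls(JumbleState *jstate)
{
	if (jstate->Count > 0)
	{
		unsigned char count_byte = (unsigned char) jstate->Count;

		AppendJumble(jstate, &count_byte, sizeof(count_byte));
		jstate->Count = 0;
	}
}
====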

Variant A improvement
---------------------

I have attached the updated Variant A patch with the following changes:
- Implemented a pg_stat_statements regression test (see similar test results
  on REL_17_3 in Query-ID-collision-at_pg_stat_statements-1ea6e890b22.txt).
- Added extra description about the Query ID collision
  in src/include/nodes/parsenodes.h.




Attachment
On Mon, 10 Mar 2025 at 23:17, Bykov Ivan <i.bykov@modernsys.ru> wrote:
> I have attached the updated Variant A patch with the following changes:
> - Implemented a pg_stat_statements regression test (see similar test results
>   on REL_17_3 in Query-ID-collision-at_pg_stat_statements-1ea6e890b22.txt).
> - Added extra description about the Query ID collision
>   in src/include/nodes/parsenodes.h.

I've not paid much attention to the jumble code before, but I did
glance at this patch:

+ *
+ *   The query jumbling process may trigger a Query ID collision when
+ *   two Query fields located sequentially contain the same type of nodes
+ *   (a list of nodes), and both of them may be NULL.
+ *   List of such groups of fields:
+ *     - limitOffset and limitCount (which may be an int8 expression or NULL);
+ *     - distinctClause, and sortClause (which may be a list of
+ *       SortGroupClause entries or NULL);
+ *   To prevent this collision, we should insert at least one scalar field
+ *   without the attribute query_jumble_ignore between such fields.
  */
 typedef struct Query
 {
@@ -208,11 +218,11 @@ typedef struct Query

  List    *distinctClause; /* a list of SortGroupClause's */

- List    *sortClause; /* a list of SortGroupClause's */
-
  Node    *limitOffset; /* # of result tuples to skip (int8 expr) */
- Node    *limitCount; /* # of result tuples to return (int8 expr) */
  LimitOption limitOption; /* limit type */
+ Node    *limitCount; /* # of result tuples to return (int8 expr) */
+
+ List    *sortClause; /* a list of SortGroupClause's */

  List    *rowMarks; /* a list of RowMarkClause's */

It seems to me that if this fixes the issue, that the next similar one
is already lurking in the shadows waiting to jump out on us.

Couldn't we adjust all this code so that we pass a seed to
AppendJumble() that's the offsetof(Struct, field) that we're jumbling
and incorporate that seed into the hash?  I don't think we could
possibly change the offset in a minor version, so I don't think
there's a risk that could cause jumble value instability. Possibly
different CPU architectures would come up with different offsets
through having different struct alignment requirements, but we don't
make any guarantees about the jumble values matching across different
CPU architectures anyway.
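A rough sketch of the shape of this idea (the extra offset parameter is
hypothetical; the real _jumbleNode() takes no such argument, and the
attached patch is the actual demonstration):

static void
_jumbleNode(JumbleState *jstate, Node *node, Size offset)
{
	/* mix in where this field lives in its parent struct, so two
	 * otherwise-identical sibling subnodes land differently in the hash */
	AppendJumble(jstate, (const unsigned char *) &offset, sizeof(offset));

	if (node == NULL)
		return;

	/* ... existing per-node jumbling follows ... */
}

with the generated call sites passing offsetof(Struct, field) for each
jumbled member.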

I admit to only having spent a few minutes looking at this, so there
may well be a good reason for not doing this. Maybe Michael can tell
me what that is...?

A very quickly put together patch is attached. I intend this only to
demonstrate what I mean, not because I want to work on it myself.

David

Attachment
> It seems to me that if this fixes the issue, that the next similar one
> is already lurking in the shadows waiting to jump out on us.

Yes, it is true that there may be other cases, but I am not sure if
it's worth carrying all the extra bytes in the jumble to deal with a
few cases like this. This is why I don't think Variant B or tracking
the offset is a thrilling idea. -1 for me.

--
Sami



> transformation.  For these reasons, variant A where we put the
> LimitOption between the two int8 expression nodes feels like the
> "okay" approach here.  But we must document this expectation in the
> structure, and check for more grammar variants of LIMIT and OFFSET
> clauses in pgss.

Please see the attached. Variant A with comments and some additional
test cases.

It should be noted that we currently have "WITH TIES/ROWS ONLY" tests in pg_s_s,
so I added another case to show that "FETCH FIRST 2 ROW ONLY" and "LIMIT 2"
produce the same queryId, and also added a query that uses both a LIMIT and
an OFFSET.

I could not think of other cases we need to cover.

--

Sami

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> It seems to me that if this fixes the issue, that the next similar one
> is already lurking in the shadows waiting to jump out on us.

For how many years will we have to wait for this threat hidden in the
shadows?  :)

This issue has existed since query jumbling was introduced in pgss, so
it goes back a long way.  I've spent a bit more than a few minutes on
that.

> Couldn't we adjust all this code so that we pass a seed to
> AppendJumble() that's the offsetof(Struct, field) that we're jumbling
> and incorporate that seed into the hash?  I don't think we could
> possibly change the offset in a minor version, so I don't think
> there's a risk that could cause jumble value instability. Possibly
> different CPU architectures would come up with different offsets
> through having different struct alignment requirements, but we don't
> make any guarantees about the jumble values matching across different
> CPU architectures anyway.

Yeah, something like that has potential to get rid of such problems
forever, particularly thanks to the fact that we automate the
generation of this code now so it is mostly cost-free.  This has a
cost for the custom jumbling functions where one needs some
imagination, but with models being around while we keep the number of
custom functions low, that should be neither complicated nor costly,
at least in my experience.

When I was thinking about the addition of the offset to the jumbling
yesterday, I was wondering about two things:
- if there are some node structures where it could not work.
- if we'd need to pass down some information of the upper node when
doing the jumbling of a node attached to it, which would make the code
generation much more annoying.

But after sleeping on it I think that these two points are kind of
moot: having only the offset passed down to a single _jumbleNode()
with the offset compiled at its beginning should be sufficient.  Using
"seed" as a term is perhaps a bit confusing, because this is an offset
in the upper node, but perhaps that's OK as-is.  It's just more
entropy addition.  If somebody has a better idea for this term, feel
free.

_jumbleList() is an interesting one.  If we want an equivalent of the
offset, this could use the item number in the list which would be a
rather close equivalent to the offset used elsewhere.  For the int,
oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
each member, apply the offset at the beginning of _jumbleList(), once,
I guess.
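For a plain list of Nodes, that could look roughly like this (a sketch
only; the int/oid/xid list variants and the initial offset application
are left out):

static void
_jumbleList(JumbleState *jstate, Node *node)
{
	List	   *expr = (List *) node;
	ListCell   *l;
	int			i = 0;

	foreach(l, expr)
	{
		/* the item number plays the role of the offset */
		JUMBLE_FIELD_SINGLE(i);
		_jumbleNode(jstate, lfirst(l));
		i++;
	}
}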

_jumbleVariableSetStmt() should be OK with the offset of "arg" in
VariableSetStmt.  The addition of hash_combine64() to account for the
offset should be OK.

With that in mind, the cases with DISTINCT/ORDER BY and OFFSET/LIMIT
seem to work fine, without causing noise for the other cases tracked
in the regression tests of PGSS.

What do you think?
--
Michael

Attachment
On Tue, 11 Mar 2025 at 19:50, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Mar 11, 2025 at 12:45:35AM +1300, David Rowley wrote:
> > It seems to me that if this fixes the issue, that the next similar one
> > is already lurking in the shadows waiting to jump out on us.
>
> For how many years will we have to wait for this threat hidden in the
> shadows?  :)
>
> This issue has existed since query jumbling was introduced in pgss, so
> it goes back a long way.  I've spent a bit more than a few minutes on
> that.

I didn't mean to cause offence here. I just wanted to make it clear
that I don't think fixing these types of issues by swapping the order
of the fields is going to be a sustainable practice, and it would be
good to come up with a solution that eliminates the chances of this
class of bug from ever appearing again.  Even if we were to trawl the
entire Query struct and everything that could ever be attached to it
and we deem that today it's fine and no more such bugs exist, the
person adding some new SQL feature in the future that adds new data to
store in the Query probably isn't going to give query jumbling much
thought, especially now that the code generation is automatic.

> > Couldn't we adjust all this code so that we pass a seed to
> > AppendJumble() that's the offsetof(Struct, field) that we're jumbling
> > and incorporate that seed into the hash?  I don't think we could
> > possibly change the offset in a minor version, so I don't think
> > there's a risk that could cause jumble value instability. Possibly
> > different CPU architectures would come up with different offsets
> > through having different struct alignment requirements, but we don't
> > make any guarantees about the jumble values matching across different
> > CPU architectures anyway.
>
> Yeah, something like that has potential to get rid of such problems
> forever, particularly thanks to the fact that we automate the
> generation of this code now so it is mostly cost-free.  This has a
> cost for the custom jumbling functions where one needs some
> imagination, but with models being around while we keep the number of
> custom functions low, that should be neither complicated nor costly,
> at least in my experience.

I hadn't really processed this thread fully when I responded yesterday
and my response was mostly aimed at the latest patch on the thread at
the time, which I had looked at quickly and wanted to put a stop to
changing field orders as a fix for this.  Since then, I see that Ivan
has already submitted a patch that accounts for NULL nodes and adds a
byte to the jumble buffer to account for NULLs. This seems quite clean
and simple. However, Sami seems to have concerns about the overhead of
doing this. Is that warranted at all? Potentially, there could be a
decent number of NULL fields. It'll probably be much cheaper than the
offsetof idea I came up with.

> When I was thinking about the addition of the offset to the jumbling
> yesterday, I was wondering about two things:
> - if there are some node structures where it could not work.
> - if we'd need to pass down some information of the upper node when
> doing the jumbling of a node attached to it, which would make the code
> generation much more annoying.
>
> But after sleeping on it I think that these two points are kind of
> moot: having only the offset passed down to a single _jumbleNode()
> with the offset compiled at its beginning should be sufficient.  Using
> "seed" as a term is perhaps a bit confusing, because this is an offset
> in the upper node, but perhaps that's OK as-is.  It's just more
> entropy addition.  If somebody has a better idea for this term, feel
> free.

Can you share your thoughts on Ivan's NULL jumble idea?

> _jumbleList() is an interesting one.  If we want an equivalent of the
> offset, this could use the item number in the list which would be a
> rather close equivalent to the offset used elsewhere.  For the int,
> oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
> each member, apply the offset at the beginning of _jumbleList(), once,
> I guess.

Is this affected by the same class of bug that we're aiming to fix
here?  (This class being not always jumbling all fields because of
NULLs or some other value, resulting in the next jumble field applying
the same bytes to the buffer as the previous field would have, had it
not been ignored)

> _jumbleVariableSetStmt() should be OK with the offset of "arg" in
> VariableSetStmt.  The addition of hash_combine64() to count for the
> offset should be OK.
>
> With that in mind, the cases with DISTINCT/ORDER BY and OFFSET/LIMIT
> seem to work fine, without causing noise for the other cases tracked
> in the regression tests of PGSS.
>
> What do you think?

I mostly now think the class of bug can be fixed by ensuring we never
ignore any jumble field for any reason and always put at least 1 byte
onto the buffer with some sort of entropy. I'm keen to understand if
I'm missing some case that you've thought about that I've not.

David



Hello!

> Since then, I see that Ivan
> has already submitted a patch that accounts for NULL nodes and adds a
> byte to the jumble buffer to account for NULLs. This seems quite clean
> and simple. However, Sami seems to have concerns about the overhead of
> doing this. Is that warranted at all? Potentially, there could be a
> decent number of NULL fields. It'll probably be much cheaper than the
> offsetof idea I came up with.

To address the issue of it consuming a lot of bytes in the jumble buffer
for NULL marks, I have added a new patch version for Variant B.
(v2-0001-Query-ID-Calculation-Fix-Variant-B.patch).

This patch adds to JumbleState the count of NULL nodes visited before a
non-NULL one appears. The least significant byte of this count is appended
to the jumble buffer whenever the count is not zero (indicating that we
have visited NULL nodes previously) and a new chunk of data is appended
to the jumble buffer. I intentionally did not add a final append for the NULL
count in the JumbleQuery processing, as NULL tail nodes of the query do not
improve the reliability of query ID collision resolution.

I believe that my first version, Variant B of the patch, is simple and easy to
understand. I would prefer to use the first version of the Variant B patch,
so I have attached an updated version along with tests
(v3-0001-Query-ID-Calculation-Fix-Variant-B.patch).

As you can see, v2-0001-Query-ID-Calculation-Fix-Variant-B.patch is more
complex and invasive than v3-0001-Query-ID-Calculation-Fix-Variant-B.patch.
I don't think that saving a few bytes in a 1024-byte buffer is worth
the extra complexity, even for this simple optimization (v2).

> Can you share your thoughts on Ivan's NULL jumble idea?

I would appreciate any feedback. Thank you.


Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Tue, Mar 11, 2025 at 08:48:33PM +1300, David Rowley wrote:
> On Tue, 11 Mar 2025 at 19:50, Michael Paquier <michael@paquier.xyz> wrote:
>> This issue exists since the query jumbling exists in pgss, so it goes
>> really down the road.  I've spent a bit more than a few minutes on
>> that.
>
> I didn't mean to cause offence here. I just wanted to make it clear
> that I don't think fixing these types of issues by swapping the order
> of the fields is going to be a sustainable practice, and it would be
> good to come up with a solution that eliminates the chances of this
> class of bug from ever appearing again.  Even if we were to trawl the
> entire Query struct and everything that could ever be attached to it
> and we deem that today it's fine and no more such bugs exist, the
> person adding some new SQL feature in the future that adds new data to
> store in the Query probably isn't going to give query jumbling much
> thought, especially now that the code generation is automatic.

FWIW, I've mentioned the use of the offset in a Node upthread, in the
next-to-last paragraph of this email:
https://www.postgresql.org/message-id/Z84mhjm5RtWtLG4X@paquier.xyz

One thing I did not notice yesterday was that it is possible to rely
on _jumbleNode() to pass an offset from the parent.  So it looks like
we had roughly the same idea independently, but I was not able to look
more closely at that yesterday.

>> But after sleeping on it I think that these two points are kind of
>> moot: having only the offset passed down to a single _jumbleNode()
>> with the offset compiled at its beginning should be sufficient.  Using
>> "seed" as a term is perhaps a bit confusing, because this is an offset
>> in the upper node, but perhaps that's OK as-is.  It's just more
>> entropy addition.  If somebody has a better idea for this term, feel
>> free.
>
> Can you share your thoughts on Ivan's NULL jumble idea?

This is an incomplete solution, I am afraid.  The origin of the
problem comes from the fact that a Node (here, Query) stores, twice in
succession, equivalent Nodes that are used for separate data, with
NULL being the difference between both.  The problem is not limited to
NULL, we could as well, for example, finish with a Node that has a
custom jumbling function and the same issue depending on how it is
used in a parent Node.  Adding the offset from the parent in the
jumbling is a much stronger insurance policy than the proposed
solution specific to NULL, because each offset is unique in a single
Node.

>> _jumbleList() is an interesting one.  If we want an equivalent of the
>> offset, this could use the item number in the list which would be a
>> rather close equivalent to the offset used elsewhere.  For the int,
>> oid and xid lists, we should do an extra JUMBLE_FIELD_SINGLE() for
>> each member, apply the offset at the beginning of _jumbleList(), once,
>> I guess.
>
> Is this affected by the same class of bug that we're aiming to fix
> here?  (This class being not always jumbling all fields because of
> NULLs or some other value, resulting in the next jumble field applying
> the same bytes to the buffer as the previous field would have, had it
> not been ignored)

Yep, that could be possible.  I am not sure if it can be reached
currently with the set of parse nodes, but in theory it could be
possible with a list of Nodes, at least.
--
Michael

Attachment
> and simple. However, Sami seems to have concerns about the overhead of
> doing this. Is that warranted at all? Potentially, there could be a
> decent number of NULL fields. It'll probably be much cheaper than the
> offsetof idea I came up with.

I have not benchmarked the overhead, so maybe there is not much to
be concerned about. However, it just seems to me that tracking the extra
data for all cases just to deal with corner cases does not seem like the
correct approach. This is what makes variant A the most attractive
approach.

--

Sami



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Tue, Mar 11, 2025 at 05:35:10PM -0500, Sami Imseih wrote:
> I have not benchmarked the overhead, so maybe there is not much to
> be concerned about. However, it just seems to me that tracking the extra
> data for all cases just to only deal with corner cases does not seem like the
> correct approach. This is what makes variant A the most attractive
> approach.

I suspect that the overhead will be minimal for all the approaches I'm
seeing on this thread, but it would not hurt to double-check all that.
As the overhead of a single query jumbling is weightless compared to
the overall query processing, the fastest method I've used in this
area is a micro-benchmark with a hardcoded loop in JumbleQuery() with
some rusage to get a few more metrics.  This exaggerates the query
jumbling computation, but it's good enough to see a difference once you
take an average of the time taken for each loop.
--
Michael

Attachment
On Tue, 11 Mar 2025 at 23:28, Michael Paquier <michael@paquier.xyz> wrote:
> > Can you share your thoughts on Ivan's NULL jumble idea?
>
> This is an incomplete solution, I am afraid.  The origin of the
> problem comes from the fact that a Node (here, Query) stores two times
> successively equivalent Nodes that are used for separate data, with
> NULL being the difference between both.  The problem is not limited to
> NULL, we could as well, for example, finish with a Node that has a
> custom jumbling function and the same issue depending on how it is
> used in a parent Node.  Adding the offset from the parent in the
> jumbling is a much stronger insurance policy that the proposed
> solution specific to NULL, because each offset is unique in a single
> Node.

One of us is not understanding the problem correctly. I'm very open
for that to be me, but I'll need a bit more explanation for me to know
that for sure.

As far as I see it, providing we ensure we always add something with a
bit of entropy to the jumble buffer for all possible values that a
jumble field can be, this fixes the issue.  The problem only occurs at
the moment because we entirely skip over NULL node fields and we end
up with the same jumble bytes if we effectively move the same List of
nodes between the Query's distinctClause and sortClause fields.

If we add something for NULLs, we'll always add distinctClause before
the sortClause. If the distinctClause is NULL we won't end up with the
same jumble bytes as if the sortClause were NULL, as the order the
NULL byte(s) are added to the buffer changes.

The fragment of the Query struct looks like:

List    *windowClause; /* a list of WindowClause's */

List    *distinctClause; /* a list of SortGroupClause's */

List    *sortClause; /* a list of SortGroupClause's */

Let's assume we have a non-NIL windowClause, a non-NIL distinctClause
and a NIL sortClause. The jumble bytes currently look like:
"<windowClause bytes><distinctClause bytes>", and if we have an ORDER
BY instead of a DISTINCT, we get: "<windowClause bytes><sortClause
bytes>", and this is problematic when the distinctClause bytes and
sortClause bytes are the same, as that results in the same hash.

If we account for the NULLs, those two cases become: "<windowClause
bytes><distinctClause bytes><byte for NULL sortClause>" and
"<windowClause bytes><byte for NULL distinct clause><sortClause
bytes>", which is going to be highly unlikely to hash to the same
value due to the buffer not being the same.

Can you explain where my understanding is wrong?

David



> If we add something for NULLs, we'll always add distinctClause before
> the sortClause. If the distinctClause is NULL we won't end up with the
> same jumble bytes as if the sortClause were NULL, as the order the
> NULL byte(s) are added to the buffer changes.

That is how I understand it as well. By appending a single character null
terminator to the jumble for every NULL expression, we change the composition
of the final jumble. Passing offsets will accomplish the same thing, but I can't
see how it buys us any additional safeguards.

> I suspect that the overhead will be minimal for all the approaches I'm
> seeing on this thread, but it would not hurt to double-check all that.

Perhaps my concern is overblown. Also, it seems the consensus is for a more
comprehensive solution than that of variant A.

Here is an idea I was playing around with. Why don't we just append a null
terminator for every expression (a variant of variant B)?
I describe this as jumbling the expression position. And if we do that,
it no longer seems necessary to jumble the expression type either, right?

+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -236,6 +236,13 @@ do { \
        if (expr->str) \
                AppendJumble(jstate, (const unsigned char *) (expr->str), strlen(expr->str) + 1); \
 } while(0)
+#define JUMBLE_POSITION() \
+do { \
+       char nullterm = '\0'; \
+       AppendJumble(jstate, (const unsigned char *) &(nullterm), sizeof(nullterm)); \
+       if (expr == NULL) \
+               return; \
+} while(0)

 #include "queryjumblefuncs.funcs.c"

@@ -244,17 +251,15 @@ _jumbleNode(JumbleState *jstate, Node *node)
 {
        Node       *expr = node;

-       if (expr == NULL)
-               return;
-
        /* Guard against stack overflow due to overly complex expressions */
        check_stack_depth();

        /*
-        * We always emit the node's NodeTag, then any additional fields that are
-        * considered significant, and then we recurse to any child nodes.
+        * We represent a position of a field in the jumble with a null-term.
+        * Doing so ensures that even NULL expressions are accounted for in
+        * the jumble.
         */
-       JUMBLE_FIELD(type);
+       JUMBLE_POSITION();

        switch (nodeTag(expr))
        {

> As the overhead of a single query jumbling is weightless compared to
> the overall query processing

yeah, that is a good point. At least benchmarking the above on a
SELECT-only pgbench did not reveal any regression.


--
Sami Imseih
Amazon Web Services (AWS)



On Tue, 11 Mar 2025 at 20:48, David Rowley <dgrowleyml@gmail.com> wrote:
> Since then, I see that Ivan
> has already submitted a patch that accounts for NULL nodes and adds a
> byte to the jumble buffer to account for NULLs. This seems quite clean
> and simple. However, Sami seems to have concerns about the overhead of
> doing this. Is that warranted at all? Potentially, there could be a
> decent number of NULL fields. It'll probably be much cheaper than the
> offsetof idea I came up with.

To try and get some timing information about the overhead added for
jumbling the NULLs, I compared master and the patch at [1] using the
attached jumbletime.patch.txt.  The patch just adds an additional call
to JumbleQuery and times how long it takes and outputs the result in
nanoseconds. Running make check has the advantage of trying out many
different queries.

On an AMD 3990x machine, the results for jumbling all make check queries are:

master: 73.66 milliseconds
0001-Query-ID-Calculation-Fix-Variant-B.patch: 80.36 milliseconds

So that adds about 9.1% overhead to jumbling, on average.

See the attached jumble_results.txt for full results and the code to
extract the results.

David

[1] https://www.postgresql.org/message-id/attachment/173439/0001-Query-ID-Calculation-Fix-Variant-B.patch

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Wed, Mar 12, 2025 at 03:28:27PM +1300, David Rowley wrote:
> If we account for the NULLs, those two cases become: "<windowClause
> bytes><distinctClause bytes><byte for NULL sortClause>" and
> "<windowClause bytes><byte for NULL distinct clause><sortClause
> bytes>", which is going to be highly unlikely to hash to the same
> value due to the buffer not being the same.
>
> Can you explain where my understanding is wrong?

(I am a bit busy this week, sorry for the delay)

I have looked at the patches sent at
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f@localhost.localdomain,
and they don't really address what I'm trying to point at here.

Custom jumbling functions could still be problematic, making
a NULL check at the beginning of _jumbleNode() ineffective.

Here is an example, possible by design, that would still be a problem:
static void
_jumbleNodeFoo(JumbleState *jstate, Node *node)
{
    Foo        *expr = (Foo *) node;

    if (expr->field1 == 0)
        return;

    JUMBLE_FIELD(field2);
    [..]
}

typedef struct Foo
{
    pg_node_attr(custom_query_jumble)

    NodeTag     type;
    int         field1;
    char       *field2;
    [..]
} Foo;

Then we have the same problem with another Node using this Foo node
two times in a row, depending on how it's used by the query parsing
and transform:
typedef struct FooBar
{
    pg_node_attr(custom_query_jumble)

    NodeTag     type;
    Foo        *foo1;
    Foo        *foo2;
    [..]
} FooBar;

Adding a depth level incremented once we go through a Node would add
more entropy, but it may not be completely bullet-proof either, even
if we add an offset.  I am pretty sure I could find a Node structure
that acts as a counterexample even if you add a depth level and an
offset in the jumbling.
--
Michael

Attachment
> Then we have the same problem with another Node using this Foo node
> two times in a row, depending on how it's used by the query parsing
> and transform:

hmm, it's hard to imagine such a case being real-world and useful for
query jumbling purposes. Also, I think if we introduce something like
JUMBLE_NULL() or JUMBLE_POSITION(), the author of a custom jumbling
function should use it. This could be added in code comments somewhere, IMO.


--
Sami



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
On Thu, Mar 13, 2025 at 07:17:20PM -0500, Sami Imseih wrote:
>> Then we have the same problem with another Node using this Foo node
>> two times in a row, depending on how it's used by the query parsing
>> and transform:
>
> hmm, it's hard to imagine such a case being real-world and useful for
> query jumbling purposes.

Perhaps, but the code allows that by design, and people like forking
Postgres.

> Also, I think if we introduce something like
> JUMBLE_NULL() or JUMBLE_POSITION(), the author of a custom jumbling
> function should use it. This could be added in code comments somewhere, IMO.

FWIW, another idea I have on top of my mind is the addition of a
counter in JumbleState that we increment each time we enter
_jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
incrementation.  And we can rely on this counter to be unique each
time we jumble a node.
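A minimal sketch, assuming a hypothetical node_count field in
JumbleState:

static void
_jumbleNode(JumbleState *jstate, Node *node)
{
	/* bump and jumble the counter on every entry, NULL or not, so each
	 * visited position contributes its own entropy */
	jstate->node_count++;
	JUMBLE_FIELD_SINGLE(jstate->node_count);

	if (node == NULL)
		return;

	/* ... existing per-node jumbling follows ... */
}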
--
Michael

Attachment
> FWIW, another idea I have on top of my mind is the addition of a
> counter in JumbleState that we increment each time we enter
> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
> incrementation.  And we can rely on this counter to be unique each
> time we jumble a node..

With this approach, the author of the custom function will need
to remember to increment the counter, right?

Also, I am wondering what is the difference between appending an incrementing
counter in JumbleState vs just appending a constant for every node (NULL
or NOT NULL)?

appending some constant or null terminator to the hash:
<expression 1> <0> <expression 2> <0> <0>

vs

appending an incremented value:
<expression 1> <1> <expression 2> <2> <3>

Both will do the job of adding the required entropy.

--
Sami



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From: Michael Paquier
> On Mar 14, 2025, at 8:15, Sami Imseih <samimseih@gmail.com> wrote:
>> FWIW, another idea I have on top of my mind is the addition of a
>> counter in JumbleState that we increment each time we enter
>> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
>> incrementation.  And we can rely on this counter to be unique each
>> time we jumble a node..
>
> With this approach, the author of the custom function will need
> to remember to increment the counter, right?

Actually, no. _jumbleNode() is the unique entry point we use when jumbling a sub-Node in a bigger Node structure, so
custom functions don't need to touch it.

> Also, I am wondering what is the difference between appending an incrementing
> counter in JState vs just appending a constant for every node ( NULL
> or NOT NULL ) ?

Well, the problem is not NULL vs non-NULL, so..
--
Michael



On Fri, 14 Mar 2025 at 14:51, Michael Paquier <michael@paquier.xyz> wrote:
>
>
> > On Mar 14, 2025, at 8:15, Sami Imseih <samimseih@gmail.com> wrote:
> >> FWIW, another idea I have on top of my mind is the addition of a
> >> counter in JumbleState that we increment each time we enter
> >> _jumbleNode(), then simply call JUMBLE_FIELD_SINGLE() after the
> >> incrementation.  And we can rely on this counter to be unique each
> >> time we jumble a node..
> >
> > With this approach, the author of the custom function will need
> > to remember to increment the counter, right?
>
> Actually, no. _jumbleNode() is the unique entry point we use when jumbling a sub-Node in a bigger Node structure, so
> custom functions don't need to touch it.

Err, what about non-node types? Those'll go to AppendJumble().

I think basically everyone here apart from you is having a hard time
understanding what additional benefits your counter solution brings
over just ensuring we always AppendJumble with something, regardless
of the field's value. I do want to understand what you're concerned
about but you've demonstrated nothing to us that breaks the "always
jumble something" idea. Your example custom function broke that
rule as it skipped doing anything when "expr->field1 == 0".

Anyway, let's see your patch so we can judge actual code rather than
waste our time arguing over assumptions about what the hypothetical
code is and does.

David



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
> On Mar 14, 2025, at 9:27, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Err, what about non-node types? Those'll go to AppendJumble().
>
> I think basically everyone here apart from you is having a hard time
> understanding what additional benefits your counter solution brings
> over just ensuring we always AppendJumble with something, regardless
> of the field's value. I do want to understand what you're concerned
> about but you've demonstrated nothing to us that breaks the "always
> jumble something" idea. Your example custom function broke that
> rule as it skipped doing anything when "expr->field1 == 0".

Because what I’ve mentioned is the kind of case I’ve wanted as supported when designing this facility, keeping also in
mindthat we may use this stuff for more than just Querys. If you’d rather discard what the argument I’ve done upthread,
well,I guess that you’re free to do so and bypass any comments I have. Now I do think that what’s been sent does not
checkall the boxes if we want a solution other than shifting the fields of a Node. 

> Anyway, let's see your patch so we can judge actual code rather than
> waste our time arguing over assumptions about what the hypothetical
> code is and does.

Sure. I’m going to need a few days for that, though :)
--
Michael



On Fri, 14 Mar 2025 at 15:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> > On Mar 14, 2025, at 9:27, David Rowley <dgrowleyml@gmail.com> wrote:
> > I think basically everyone here apart from you is having a hard time
> > understanding what additional benefits your counter solution brings
> > over just ensuring we always AppendJumble with something, regardless
> > of the field's value. I do want to understand what you're concerned
> > about but you've demonstrated nothing to us that breaks the "always
> > jumble something" idea. Your example custom function broke that
> > rule as it skipped doing anything when "expr->field1 == 0".
>
> Because what I’ve mentioned is the kind of case I’ve wanted as supported when designing this facility, keeping also
inmind that we may use this stuff for more than just Querys. If you’d rather discard what the argument I’ve done
upthread,well, I guess that you’re free to do so and bypass any comments I have. Now I do think that what’s been sent
doesnot check all the boxes if we want a solution other than shifting the fields of a Node. 

I don't think I'm discarding them... As far as I'm aware, your
remaining concern is with custom jumble functions and you showed an
example in [1] of a hypothetical jumble function that could cause the
same bug as this thread is discussing. My response to that was that
the custom jumble function is broken and should be fixed, which seems
to me to be analogous to Ivan's proposal to fix _jumbleNode() to do
something with NULL pointers.

I now can't tell which of the following is true: 1) I've missed one of
your concerns, or; 2) you want to find a way to make badly coded
custom jumble functions not suffer from this bug, or; 3) you think
that adding jumble bytes unconditionally regardless of the field's
value still does not fix the bug in question, or; 4) something else.
It would be good to know which of these it is.

David

[1] https://postgr.es/m/Z9NrnVk7MtMe8uNF@paquier.xyz



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Fri, Mar 14, 2025 at 04:14:25PM +1300, David Rowley wrote:
> I don't think I'm discarding them... As far as I'm aware, your
> remaining concern is with custom jumble functions and you showed an
> example in [1] of a hypothetical jumble function that could cause the
> same bug as this thread is discussing. My response to that was that
> the custom jumble function is broken and should be fixed, which seems
> to me to be analogous to Ivan's proposal to fix _jumbleNode() to do
> something with NULL pointers.

Okay, that's where our opinions diverge.  So, well, please let me
reformulate that with some different words and terms.  This is not a
problem of having a NULL node vs a non-NULL node in the jumbling,
because by design it is possible for one to decide if a Node can be
included in the jumbling depending on its internal values,
particularly if it gets used across more than one type of parent node.
You are claiming that such functions are broken, but what I am saying
is that such a function can be OK depending on how one wants this code
to behave for their node structure and what they expect from
them.  So I'm making a claim about keeping this stuff more
flexible, while also addressing the problem of the same node defined
successively more than once in a parent structure.

> I now can't tell which of the following is true: 1) I've missed one of
> your concerns, or; 2) you want to find a way to make badly coded
> custom jumble functions not suffer from this bug, or; 3) you think
> that adding jumble bytes unconditionally regardless of the field's
> value still does not fix the bug in question, or; 4) something else.
> It would be good to know which of these it is.

I hope that my opinion counts, as I've worked on this whole design
in 3db72ebcbe20, 8eba3e3f0208 and other related commits.  FWIW, I
don't really see any disadvantage in supporting what I'm claiming is
OK, giving more flexibility to folks to do what they want with this
facility, if it also tackles this thread's problem with the Node
handling.

So, here is attached a counter-proposal, where we can simply add a
counter tracking a node count in _jumbleNode() to add more entropy to
the mix, incrementing it as well for NULL nodes.

By the way, I was looking at the set of tests proposed upthread at
this message, which is as far as I know the latest version proposed:
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f@localhost.localdomain

The tests need not require a relation or a stats reset, so let's
make them cheaper as I've proposed upthread and in the attached,
including checks with multiple queries and different constants to make
sure that these are correctly grouped in the pgss reports.  The
null_sequence_number reset each time we run through a non-NULL node
in variant B reduces the entropy a bit, btw..  Adding a counter in
AppendJumble(), as argued upthread, targets the wrong level for such
a change.

Anyway, perhaps we should have your extra byte '\0' or something
equivalent added to JUMBLE_STRING() if dealing with a NULL string
rather than ignoring it.  There's an argument about simplicity, IMO,
still it is true that ignoring a NULL value reduces the entropy of the
query ID.  So I'd be OK with that if you feel strongly about this
point, as it sounds to me that you do?  This could be better than a
hardcoded '\0' byte as we could use a counter and
JUMBLE_FIELD_SINGLE() in the NULL case of JUMBLE_STRING().  It's not
proved to be a problem yet, and there's value in having a simpler
solution, as well.  Anyway, one line I'd like to draw is that
AppendJumble() remains as it is written currently.  One extra thing
that we could do is to update it so that the size of a given item is
always more than zero, with an Assert(), to enforce a stricter policy.
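
A quick sketch of that last point, with nothing else changed:

static void
AppendJumble(JumbleState *jstate, const unsigned char *item, Size size)
{
    /* stricter policy: a zero-size item is a bug in the caller */
    Assert(size > 0);

    /* ... body as written currently ... */
}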
--
Michael

Attachment
Hello, Michael!

> So, here is attached a counter-proposal, where we can simply add a
> counter tracking a node count in _jumbleNode() to add more entropy to
> the mix, incrementing it as well for NULL nodes.

It definitely looks like a more reliable solution than my variant, which only
counts NULL nodes.

However, we already knew about the overhead of adding `\0` bytes for
every NULL field.

> So that adds about 9.1% overhead to jumbling, on average.

See:

https://www.postgresql.org/message-id/flat/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain#6c43f354f5f42d2a27e6824faa660a86

Is it really worth spending extra execution time to increase entropy
when we have non-NULL nodes?

Maybe we should choose between adding node_count to the hash only when
we visit NULL nodes, or only when we visit non-NULL ones.
We could also add entropy if we see a change in the node->type value for
non-NULL variants.

Your Variant
------------

< node_count = 1 > < node 1 >
< node_count = 2 > /* node 2 = NULL */
< node_count = 3 > < node 3 >

Alternative 1 (mark only NULL Nodes)
------------------------------------

/* node_count = 1 */ < node 1 >
< node_count = 2 > /* node 2 = NULL */
/* node_count = 3 */ < node 3 >

Alternative 2 (mark only non-NULL Nodes)
----------------------------------------
This could address concerns about problems related to visiting nodes with the
same content placed in different query tree branches.

< node_count = 1 > < node 1 >
/* node_count = 2 */ /* node 2 = NULL */
< node_count = 3 > < node 3 >



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Mon, Mar 17, 2025 at 07:33:42AM +0000, Bykov Ivan wrote:
> See:
>
> https://www.postgresql.org/message-id/flat/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain
>
> Is it really worth spending extra execution time to increase entropy
> when we have non-NULL nodes?

The computation time is already quite cheap, so I'm not much worried
about that, FWIW.

> We could also add entropy if we see a change in the node->type value for
> non-NULL variants.

I am not sure I get this one.  The issue shows up if we have the same
Node computed successively, as reported on this thread.  It would still
be a problem for different node types, though less likely, when two
node types have similar fields.
--
Michael

Attachment
On Mon, 17 Mar 2025 at 13:53, Michael Paquier <michael@paquier.xyz> wrote:
> So, here is attached a counter-proposal, where we can simply add a
> counter tracking a node count in _jumbleNode() to add more entropy to
> the mix, incrementing it as well for NULL nodes.

I had a look at this and it seems the main difference will be that
this patch will protect against the case where a given node is non-NULL
but has a custom jumble function which opts not to jumble anything
in some cases. Since Ivan's patch only adds jumbling for non-NULL nodes,
that could lead to the same bug if a custom jumble function decided
not to jumble anything at all.

FWIW, I did the same test to jumble all make check queries with your
patch, same as [1]:

master: 73.66 milliseconds
0001-Query-ID-Calculation-Fix-Variant-B.patch: 80.36 milliseconds
v4-0001-Add-more-entropy-to-query-jumbling.patch: 88.84 milliseconds

I know you've said in [2] that you're not worried about performance,
but I wanted to see if that's true... I can measure the cost of
compute_query_id=on with pgbench -S workload on my AMD 3990x machine.
I tried with that on "auto", then with "on" with master and again with
your v4 patch when set to "on":

(the -c 156 -j 156 is just there to ensure this machine properly
clocks up, which it still seems hesitant to do sometimes, even with
the performance power settings)

drowley@amd3990x:~$ echo master compute_query_id = auto && for i in
{1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl start -D
pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60 -c 156
-j 156 postgres | grep tps; done
master compute_query_id = auto
tps = 1202451.945998
tps = 1197645.856236
tps = 1182345.403291
tps = 1182440.562773
tps = 1180731.003390
tps = 1185277.478449
tps = 1174983.732094
tps = 1176310.828235
tps = 1179040.622103
tps = 1177068.520272
drowley@amd3990x:~$ echo master compute_query_id = on && for i in
{1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl start -D
pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60 -c 156
-j 156 postgres | grep tps; done
master compute_query_id = on
tps = 1158684.006821
tps = 1165808.752641
tps = 1158269.999683
tps = 1146730.269628
tps = 1154200.484194
tps = 1152750.152235
tps = 1150438.486321
tps = 1150649.669337
tps = 1144745.761593
tps = 1157743.530383
drowley@amd3990x:~$ echo
v4-0001-Add-more-entropy-to-query-jumbling.patch compute_query_id = on
&& for i in {1..10}; do pg_ctl stop -D pgdata > /dev/null && pg_ctl
start -D pgdata -l pg.log > /dev/null && pgbench -n -M simple -S -T 60
-c 156 -j 156 postgres | grep tps; done
v4-0001-Add-more-entropy-to-query-jumbling.patch compute_query_id = on
tps = 1156782.516710
tps = 1162780.019911
tps = 1142104.075069
tps = 1145651.299640
tps = 1143157.945891
tps = 1150275.033626
tps = 1145817.267485
tps = 1135915.694987
tps = 1138703.153873
tps = 1138436.802882

It's certainly hard to see much slowdown between master and v4, but it
is visible if I sort the results and put them in a line graph and
scale the vertical axis a bit. A 0.7% slowdown. See attached.

I didn't run the same test on the
0001-Query-ID-Calculation-Fix-Variant-B.patch patch, but with the
first set of results I posted above, you could assume it's less than
0.7% overhead. Locally, I did play around with internalising the
AppendJumble function so my compiler would compile a dedicated
JumbleNull function, which removes a bit of branching and code for the
memcpy. I managed to get it going a bit faster that way.

I do agree that your v4 fixes the issue at hand and I'm not going to
stand in your way if you want to proceed with it. However, from my
point of view, it seems that we could achieve the same goal with much
less overhead by just insisting that custom jumble functions always
jumble something. We could provide a jumble function to do that if the
custom function conditionally opts to jumble nothing. That would save
having to add this additional jumbling of the node count.

Also, am I right in thinking that there's no way for an extension to
add a custom jumble function? If so, then we have full control over
any custom jumble function. That would make it easier to ensure these
always jumble something.

David

[1] https://postgr.es/m/CAApHDvqaKySJdBf2v2_ybNuT=KObaUVBU4_5kgZfFTMSJr-Vmg@mail.gmail.com
[2] https://postgr.es/m/Z9flafNlH4-1YSJW@paquier.xyz

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Tue, Mar 18, 2025 at 05:24:42PM +1300, David Rowley wrote:
> I had a look at this and it seems the main difference will be that
> this patch will protect against the case where a given node is non-NULL
> but has a custom jumble function which opts not to jumble anything
> in some cases. Since Ivan's patch only adds jumbling for non-NULL nodes,
> that could lead to the same bug if a custom jumble function decided
> not to jumble anything at all.

Yeah.

> It's certainly hard to see much slowdown between master and v4, but it
> is visible if I sort the results and put them in a line graph and
> scale the vertical axis a bit. A 0.7% slowdown. See attached.

Thanks for the numbers.  I'd like to say that this is within noise,
but that's clearly not the case if it's repeatable, as you are pointing out.  Ugh.

> I do agree that your v4 fixes the issue at hand and I'm not going to
> stand in your way if you want to proceed with it. However, from my
> point of view, it seems that we could achieve the same goal with much
> less overhead by just insisting that custom jumble functions always
> jumble something. We could provide a jumble function to do that if the
> custom function conditionally opts to jumble nothing. That would save
> having to add this additional jumbling of the node count.

If we make the whole thing cheaper, with extra entropy added only for NULLs
in nodes and strings, I'd rather have an insurance policy for the
custom functions.  Something like that:
- Forbid a size of 0 in AppendJumble().
- When dealing with a non-NULL case in _jumbleNode(), save the initial
jumble_len and the jumble contents when starting, then complain in an
assert if, at the end, the jumble_len matches the initial length and
the contents are the same.  A check on the length may be
enough, but we'd depend on JUMBLE_SIZE and nodes can get pretty big.

What do you think?

> Also, am I right in thinking that there's no way for an extension to
> add a custom jumble function? If so, then we have full control over
> any custom jumble function. That would make it easier to ensure these
> always jumble something.

Currently, no, that would not be possible through this code path.  The
only way would be to fork the code as we rely entirely on the code
generated from the in-core headers.
--
Michael

Attachment
On Tue, 18 Mar 2025 at 21:00, Michael Paquier <michael@paquier.xyz> wrote:
> If we make the whole thing cheaper, with extra entropy added only for NULLs
> in nodes and strings, I'd rather have an insurance policy for the
> custom functions.  Something like that:
> - Forbid a size of 0 in AppendJumble().
> - When dealing with a non-NULL case in _jumbleNode(), save the initial
> jumble_len and the jumble contents when starting, then complain in an
> assert if, at the end, the jumble_len matches the initial length and
> the contents are the same.  A check on the length may be
> enough, but we'd depend on JUMBLE_SIZE and nodes can get pretty big.
>
> What do you think?

If it's for Assert-enabled builds only, then to save from having to
look at the buffer, you could have an extra field similar to
jumble_len, but one that does not get reset when the jumble buffer
fills.  Just assert that total_jumbled_bytes has grown after jumbling
a node, maybe?
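
Something along these lines, with total_jumbled_bytes being the
hypothetical new field:

static void
_jumbleNode(JumbleState *jstate, Node *node)
{
#ifdef USE_ASSERT_CHECKING
    Size        prev_len = jstate->total_jumbled_bytes;
#endif

    /* ... jumble the node as today ... */

    /* every node, NULL or not, must contribute at least one byte */
    Assert(jstate->total_jumbled_bytes > prev_len);
}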

David



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Tue, Mar 18, 2025 at 09:24:06PM +1300, David Rowley wrote:
> If it's for Assert-enabled builds only, then to save from having to
> look at the buffer, you could have an extra field similar to
> jumble_len, but one that does not get reset when the jumble buffer
> fills.  Just assert that total_jumbled_bytes has grown after jumbling
> a node, maybe?

Indeed, this would work as well, and there's little point in going
full-memcmp() mode for the sake of such a check.

I have been doing some measurements to see how the NULL addition
affects the computation, and with the attached (based on what you have
sent upthread), I can see that a "SELECT 1", which would be one of the
worst cases possible based on its simple Query structure with 19 NULLs
and not much else (?), jumps from 80ms to 107~109ms for 100k loops
calling JumbleQuery(), meaning that a single computation moves from
800ns to 1070~1090ns on my local machine.  So this stuff costs, which
is not surprising, but it does not seem *that* bad compared to a full
query run.

Potential ideas about optimizing the branches seem worth
investigating, though I am not sure that I would be able to dig into that
until the feature freeze gong, unfortunately, and it would be nice to
fix this issue for this release.  I'll see about doing some pgbench
runs on my side, as well, with tpc-b Querys.

Thoughts?
--
Michael

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Fri, Mar 21, 2025 at 02:45:09PM +0900, Michael Paquier wrote:
> Potential ideas about optimizing the branches seem worth
> investigating, though I am not sure that I would be able to dig into that
> until the feature freeze gong, unfortunately, and it would be nice to
> fix this issue for this release.  I'll see about doing some pgbench
> runs on my side, as well, with tpc-b Querys.

So, here are some numbers with a "SELECT 1;" using pgbench, which
would be a worst-case.  First, at 128 clients on a machine I have
around:
- HEAD, compute_query_id = on:
tps = 282081.839715
tps = 281379.854539
tps = 283708.205096
tps = 280197.225782 #worst
tps = 291955.914664 #best
tps = 284172.351609
tps = 287234.624720
tps = 287915.683291 #best
tps = 277004.123742 #worst
tps = 281326.071323
- Patch, compute_query_id = on:
tps = 289746.641246 #best
tps = 279689.654196
tps = 284874.774620 #best
tps = 282312.246610
tps = 275807.057758 #worst
tps = 276058.858384 #worst
tps = 282705.638685
tps = 282651.367641
tps = 276777.418569
tps = 283137.265298

Removing the two best and two worst numbers, averaging the rest, and
taking the relative difference as in:
(head_tps - patch_tps) * 2 / (head_tps + patch_tps)
Then that's 283317.15 (HEAD) vs 281212.26 (patch) so a 0.7% decrease.
Still that could be claimed as noise.

Second set, at 1 client, and a better pattern becomes available:
- HEAD, compute_query_id = on:
tps = 19170.473320 #best
tps = 18722.828797
tps = 19138.765391
tps = 18243.611593
tps = 17878.605004 #worst
tps = 19557.753327 #best
tps = 18395.734885
tps = 18433.188326
tps = 18227.914138 #worst
tps = 18450.236841
- Patch, compute_query_id = on:
tps = 16611.766411
tps = 15992.734615 #worst
tps = 16465.077482
tps = 16726.097318
tps = 17284.874897 #best
tps = 16076.378905
tps = 17971.534072 #best
tps = 15733.718159 #worst
tps = 16911.014660
tps = 17217.202679

And here we get 18564.06 (head) vs 16667.92 (patch) so a 10.7%
difference in this run.  Hence the automatic addition of NULL to the
jumbling is disappointing here, even if I'd see this as a
worst-case scenario, unlike what one would see for real if they care
about monitoring.

So, on the other hand, we still have the trick to switch the fields,
which means that we would not suffer any performance penalty, while
fixing the original issue.  And with the feature freeze in sight, time
is running short if we want to do something here.

NB: In case, a second run with 1 client showing the same trend:
- HEAD:
tps = 19697.406181
tps = 19474.495729
tps = 18614.615630
tps = 18868.245926
tps = 18949.775164
tps = 18943.311921
tps = 18824.192383
tps = 19104.542301
tps = 18950.870646
tps = 19331.829745
- Patch:
tps = 17887.864764
tps = 17317.628065
tps = 17090.278323
tps = 15876.258607
tps = 16818.644938
tps = 16816.334313
tps = 18134.148809
tps = 16891.809962
tps = 18193.104125
tps = 16866.726990
--
Michael

Attachment
On Mon, 24 Mar 2025 at 15:23, Michael Paquier <michael@paquier.xyz> wrote:
> And here we get 18564.06 (head) vs 16667.92 (patch) so a 10.7%
> difference in this run.  Hence the automatic addition of NULL to the
> jumbling is disappointing here, even if I'd see this as a
> worst-case scenario, unlike what one would see for real if they care
> about monitoring.

Can you share which patch you tested here?  There are many patches on
this thread and I don't want to make assumptions about which one these
are the results for.

It sounds like it wasn't your latest patch as you mentioned "automatic
addition of NULL to the jumbling".  That doesn't sound like the
correct description for your latest patch.

David



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Mon, Mar 24, 2025 at 07:42:21PM +1300, David Rowley wrote:
> Can you share which patch you tested here?  There are many patches on
> this thread and I don't want to make assumptions about which one these
> are the results for.

The patch tested is the latest one that has been posted on this
thread, for a comparison of HEAD at 8a3e4011f02d and this 0001 (no
0002 which was forcing a loop to make the jumbling heavier):
https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

The message holding the patch tested is the one I've replied to when
posting the results shared here:
https://www.postgresql.org/message-id/Z-DB83o5AZmgnvnA@paquier.xyz
--
Michael

Attachment
On Fri, 21 Mar 2025 at 18:45, Michael Paquier <michael@paquier.xyz> wrote:
> Potential ideas about optimizing the branches seem worth
> investigating, though I am not sure that I would be able to dig into that
> until the feature freeze gong, unfortunately, and it would be nice to
> fix this issue for this release.  I'll see about doing some pgbench
> runs on my side, as well, with tpc-b Querys.

I've come up with a version of this that fixes the issue and improves
performance rather than regresses it. See attached.

This patch uses the same "ensure we jumble at least something when we
get a NULL" method, but works differently to anything previously
proposed here. Instead of modifying the jumble buffer each time we see
a NULL, the patch just increments a counter.  It's only when something
else actually needs to append bytes to the buffer that we must flush
the pending NULLs to the buffer beforehand. I've opted to just write
the consecutive NULL count to the buffer. This allows the NULLs to be
appended more efficiently. This has the advantage of having to do very
little when we see NULL values, and would probably do well on your
"SELECT 1" test, which has lots of NULLs.

I've also performed a round of optimisations to the code:

1) The jumble size in AppendJumble should never be zero, so the while
(size > 0) loop can become a do while loop.  This halves the size
checks for the normal non-overflow case.
2) Added inlined functions to jumble common field sizes. This saves
having to move the size into the appropriate register before calling
the AppendJumble function and reduces the size of the compiled code as
a result. It also allows the fast-path I added to AppendJumbleInternal
for the non-overflow case to work more efficiently, as the memcpy can
be transformed into fewer instructions due to the size being both known
and small enough to perform the copy in a single instruction. This
previously didn't work since part_size is a variable in the inlined
function. With the fast-path, "size" is used, which is a compile-time
const to the inlined function.
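
Roughly, each of those inlined helpers has this shape (names
approximate):

/* "8" is a compile-time constant here, so the fast-path memcpy in
 * AppendJumbleInternal can collapse into a single instruction */
static pg_attribute_always_inline void
AppendJumble8(JumbleState *jstate, const unsigned char *value)
{
    AppendJumbleInternal(jstate, value, 8);
}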

I tested the performance of this using the patch from [1] and running
"make check" 50 times. This gave me the total time spent jumbling all
of the make check queries. I sorted the results by time and graphed
them in the attached graph. The patched version is on average 1.73%
*faster* than master on my AMD 3990x machine:

master: 76.564 milliseconds
v7 patch 75.236 milliseconds

It should be possible to still add the total_jumble_len stuff to this
in assert builds with zero extra overhead in non-assert builds, which
I think should address your concern with custom jumble functions not
jumbling anything. I can adjust accordingly if you're happy to use the
method I'm proposing.

The tests done were in a version of master prior to 5ac462e2b. I've
not yet studied if anything needs to be adjusted to account for that
commit.

David

[1] https://www.postgresql.org/message-id/attachment/174042/jumbletime.patch.txt

Attachment
Hello, David!

As far as I can see, your patch has the same idea as my v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].
I think it would be better to extract the jumble buffer update with hash calculation into a function
(CompressJumble in my patch). This will result in fewer changes to the source code. 
We can also simply dump the null count to the buffer when we encounter a non-null field
(which will be processed in AppendJumble, indeed).

What do you think about my patch (v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

[1] https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain


On Tue, 25 Mar 2025 at 21:14, Bykov Ivan <i.bykov@modernsys.ru> wrote:
> As far as I can see, your patch has the same idea as my v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].
> I think it would be better to extract the jumble buffer update with hash calculation into a function
> (CompressJumble in my patch). This will result in fewer changes to the source code.
> We can also simply dump the null count to the buffer when we encounter a non-null field
> (which will be processed in AppendJumble, indeed).
>
> What do you think about my patch (v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

It seems I missed that one. I agree with the general method being
suitable, which might not be surprising since I ended up at the same
conclusion independently.

I quite like your CompressJumble() idea, but I don't know that it
amounts to any compiled code changes.

I think it's a bit strange that you're flushing the pending nulls to
the jumble buffer after jumbling the first non-null. I think it makes
much more sense to do it before as I did, otherwise, you're
effectively jumbling fields out of order.
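
i.e. roughly:

static void
AppendJumbleInternal(JumbleState *jstate, const unsigned char *item,
                     Size size)
{
    /* flush any NULLs seen since the last append, keeping field order */
    if (jstate->pending_nulls > 0)
        FlushPendingNulls(jstate);

    /* ... append "item" as before ... */
}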

I see you opted to only jumble the low-order byte of the pending null
counter. I used the full 4 bytes as I don't think the extra 3 bytes are
a problem. With v7, the memcpy to copy the pending_nulls into the
buffer is inlined into a single instruction. It's semi-conceivable
that you could have more than 255 NULLs given that a List can contain
Nodes. I don't know if we ever have any NULL items in Lists at the
moment, but I don't think that that's disallowed. I'd feel much safer
taking the full 4 bytes of the counter to help prevent future
surprises.

Aside from that, v2b still has the performance regression issue that
we're trying to solve. I did a fresh run of master as of 19c6eb06c and
tested v7 and v2b. Here are the average times it took to jumble all
the make check queries over 50 runs;

master: 76.052 ms
v7: 75.422 ms
v2b: 81.391 ms

I also attached a graph.

I'm happy to proceed with the v7 version. I'm also happy to credit you
as the primary author of it, given that you came up with the v2b
patch.  It might be best to extract the performance stuff that's in v7
and apply that separately from the bug fix. If you're still interested
and have time in the next 7 hours to do it, would you be able to split
the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
performance stuff? If so, could you also add the total_jumble_len to
v8-0001 like Michael's last patch had, but adjust so it's only
maintained for Assert builds. Michael is quite keen on having more
assurance that custom jumble functions don't decide to forego any
jumbling.

If you don't have time, I'll do it in my morning (UTC+13).

Thanks

David

> [1] https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain

Attachment
On Wed, 26 Mar 2025 at 01:11, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b
> patch.  It might be best to extract the performance stuff that's in v7
> and apply that separately from the bug fix. If you're still interested
> and have time in the next 7 hours to do it, would you be able to split
> the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
> performance stuff? If so, could you also add the total_jumble_len to
> v8-0001 like Michael's last patch had, but adjust so it's only
> maintained for Assert builds. Michael is quite keen on having more
> assurance that custom jumble functions don't decide to forego any
> jumbling.
>
> If you don't have time, I'll do it in my morning (UTC+13).

Here is the v8 version with the bug fix and performance stuff
separated out. I also added the Assert code to ensure we always add
something to the jumble buffer when jumbling a node.

I didn't include the regression tests I saw in the v2b patch [1]. I
felt these were just marking the fact that there used to be a bug
here.

David

[1] https://postgr.es/m/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Wed, Mar 26, 2025 at 11:56:50AM +1300, David Rowley wrote:
> Here is the v8 version with the bug fix and performance stuff
> separated out.

Why not.  I assume that you would merge these together?

> I also added the Assert code to ensure we always add
> something to the jumble buffer when jumbling a node.

Thanks.  This part looks good with its USE_ASSERT_CHECKING.

> I didn't include the regression tests I saw in the v2b patch [1]. I
> felt these were just marking the fact that there used to be a bug
> here.

I disagree about the lack of tests.  These are cheap to add.  I'd
suggest not using the ones posted as part of v2 variant B, as the
tests need not require the creation of a relation and can be made
cheaper; instead use the set I have posted in patch 0001 here, which
covers most of the main scenarios from the parser with these nodes in
the Query:
https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

It is a bit disappointing that we require JumbleQuery() to jumble the
NULLs.  There are discussions about making this code available not
only for Querys, but for Nodes, and the NULL computations would be
missed..
--
Michael

Attachment
Hello, David!

> I see you opted to only jumble the low-order byte of the pending null
> counter. I used the full 4 bytes as I don't think the extra 3 bytes are
> a problem. With v7, the memcpy to copy the pending_nulls into the
> buffer is inlined into a single instruction. It's semi-conceivable
> that you could have more than 255 NULLs given that a List can contain
> Nodes. I don't know if we ever have any NULL items in Lists at the
> moment, but I don't think that that's disallowed. I'd feel much safer
> taking the full 4 bytes of the counter to help prevent future
> surprises.

Yes, _jumbleList could theoretically handle cases with over 255 Node elements in the same list,
but most (or all) of them are non-NULL (as far as I know, only PlannedStmt->subplans accepts NULL elements).
Most of the foreach loops over query lists don't have any NULL-safety checks;
we just assume that the lists contain no NULL pointers.
I still believe storing just one byte is sufficient.

> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b
> patch.  It might be best to extract the performance stuff that's in v7
> and apply that separately from the bug fix. If you're still interested
> and have time in the next 7 hours to do it, would you be able to split
> the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
> performance stuff? If so, could you also add the total_jumble_len to
> v8-0001 like Michael's last patch had, but adjust so it's only
> maintained for Assert builds. Michael is quite keen on having more
> assurance that custom jumble functions don't decide to forego any
> jumbling.

I can see that you have already split the patches. I apologize for the delay; I don't have
much time to work on this issue. Thank you for your proposal to credit me as the patch author.






On Wed, 26 Mar 2025 at 14:04, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 26, 2025 at 11:56:50AM +1300, David Rowley wrote:
> > Here is the v8 version with the bug fix and performance stuff
> > separated out.
>
> Why not.  I assume that you would merge these together?

I think it's better as two commits. They're for different purposes.

> I disagree about the lack of tests.  These are cheap to add.  I'd
> suggest to not use the ones posted as part of v2 variant B as these do
> not require the creation of a relation so they can be made cheaper,
> and instead use the set I have posted in patch 0001 here which covers
> most of the main scenarios from the parser with this nodes in the
> Query:
> https://www.postgresql.org/message-id/Z9z85Ui5lPkPn2hq@paquier.xyz

Ok, I won't protest. I just feel the bug is a relic of the old design
and I don't believe the tests are testing anything useful in the new
design. It feels unlikely that someone accidentally adjusts the code
and reverts it to the old design. Anyway, I've included the tests from
your patch.

> It is a bit disappointing that we require JumbleQuery() to jumble the
> NULLs.  There are discussions about making this code available not
> only for Querys, but for Nodes, and the NUL computations would be
> missed..

I've adjusted the patch to add two static functions. InitJumble() and
DoJumble().  JumbleQuery() uses these.
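
Roughly, with approximate signatures:

static JumbleState *
InitJumble(void)
{
    JumbleState *jstate = (JumbleState *) palloc0(sizeof(JumbleState));

    jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE);
    jstate->jumble_len = 0;
    return jstate;
}

static uint64
DoJumble(JumbleState *jstate, Node *node)
{
    /* jumble any Node, then flush whatever NULLs are still pending */
    _jumbleNode(jstate, node);
    if (jstate->pending_nulls > 0)
        FlushPendingNulls(jstate);

    return DatumGetUInt64(hash_any_extended(jstate->jumble,
                                            jstate->jumble_len, 0));
}

JumbleQuery() just calls InitJumble() and then DoJumble() on the
Query, so other Node types could reuse the same pair later.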

David

Attachment

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

From
Michael Paquier
Date:
On Thu, Mar 27, 2025 at 12:09:35PM +1300, David Rowley wrote:
> I think it's better as two commits. They're for different purposes.

Fine for me.

> Ok, I won't protest. I just feel the bug is a relic of the old design
> and I don't believe the tests are testing anything useful in the new
> design. It feels unlikely that someone accidentally adjusts the code
> and reverts it to the old design. Anyway, I've included the tests from
> your patch.

One habit that I've found really useful to do when it comes to this
area of the code is to apply the tests first to show what kind of
behavior we had before changing the jumbling, then apply the update to
the query jumbling.  This has two benefits:
- If the update of the second phase gets reverted, we still have the
tests.
- It is possible to show in the git history how the behavior changes,
analyzing them in isolation.

So I'd suggest an extra split, with the tests committed first.

>> It is a bit disappointing that we require JumbleQuery() to jumble the
>> NULLs.  There are discussions about making this code available not
>> only for Querys, but for Nodes, and the NULL computations would be
>> missed..
>
> I've adjusted the patch to add two static functions. InitJumble() and
> DoJumble().  JumbleQuery() uses these.

That's cleaner, thanks for the split.  I don't think that we've
reached a consensus on the other thread about what to provide as an
interface, and that's likely too late for this release anyway; what you
are proposing here is a good step forward, and will help in not messing
up again with the problem of this thread in the future.

+#ifdef USE_ASSERT_CHECKING
+    Size        prev_jumble_len = jstate->total_jumble_len;
+#endif
[...]
+    /* The total number of bytes added to the jumble buffer */
+    Size        total_jumble_len;
 } JumbleState;

Maybe hide that in the header with one extra USE_ASSERT_CHECKING if we
are only going to increment it when assertions are enabled?
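
i.e. something like:

#ifdef USE_ASSERT_CHECKING
    /* Total bytes added to the jumble buffer, for assertions only */
    Size        total_jumble_len;
#endif
} JumbleState;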

+     * These are flushed out to the jumble buffer before subsequent appends
+     * and before performing the final jumble hash.

This comment is slightly incorrect when 0001 is taken in isolation?
The flush of the NULL counter is done in the patch via
FlushPendingNulls() when jumbling the final value, not before more
appends?  It becomes true with 0002, where FlushPendingNulls() is
called before appending any data.

Perhaps removing JUMBLE_FIELD_SINGLE() is better now.
--
Michael

Attachment
Maybe I am missing something, but when I applied the
v9-0001 patch alone, it did not solve the problem it
claims to and the pg_s_s regression test fails:

+++ /local/home/simseih/pghead/install1/postgresql/contrib/pg_stat_statements/results/select.out 2025-03-27 04:56:35.143855392 +0000
@@ -215,14 +215,12 @@
      3 |    3 | SELECT $1 + $2 + $3 AS "add"
      1 |    1 | SELECT $1 AS "float"
      2 |    2 | SELECT $1 AS "int"
-     2 |    2 | SELECT $1 AS "int" LIMIT $2
-     2 |    0 | SELECT $1 AS "int" OFFSET $2
+     4 |    2 | SELECT $1 AS "int" LIMIT $2
      6 |    0 | SELECT $1 AS "int" OFFSET $2 LIMIT $3
-     2 |    2 | SELECT $1 AS "int" ORDER BY 1
      1 |    2 | SELECT $1 AS i UNION SELECT $2 ORDER BY i
      1 |    1 | SELECT $1 || $2
      1 |    1 | SELECT $1, $2 LIMIT $3
-     2 |    2 | SELECT DISTINCT $1 AS "int"
+     4 |    4 | SELECT DISTINCT $1 AS "int"
      0 |    0 | SELECT calls, rows, query FROM pg_stat_statements
ORDER BY query COLLATE "C"
      1 |    1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
      1 |    2 | WITH t(f) AS (
                      +
@@ -230,7 +228,7 @@
        |      | )
                      +
        |      |   SELECT f FROM t ORDER BY f
      1 |    1 | select $1::jsonb ? $2
-(17 rows)
+(15 rows)

but when v9-0002 is applied, the regress tests then
succeed.

--
Sami Imseih
Amazon Web Services (AWS)



On Thu, 27 Mar 2025 at 18:03, Sami Imseih <samimseih@gmail.com> wrote:
> Maybe I am missing something, but when I applied the
> v9-0001 patch alone, it did not solve the problem it
> claims to and the pg_s_s regression test fails:

That was an artifact of me not having made the split quite in the
right place when dividing the patch into two.  What's just been
committed fixed that. Thanks for looking.

David



On Thu, 27 Mar 2025 at 14:51, Michael Paquier <michael@paquier.xyz> wrote:
> One habit that I've found really useful to do when it comes to this
> area of the code is to apply the tests first to show what kind of
> behavior we had before changing the jumbling, then apply the update to
> the query jumbling.  This has two benefits:
> - If the update of the second phase gets reverted, we still have the
> tests.
> - It is possible to show in the git history how the behavior changes,
> analyzing them in isolation.
>
> So I'd suggest an extra split, with the tests committed first.

I didn't do that. I can't quite process the logic of adding tests to
check we get the wrong behaviour.

> Maybe hide that in the header with one extra USE_ASSERT_CHECKING if we
> are only going to increment it when assertions are enabled?
>
> +        * These are flushed out to the jumble buffer before subsequent appends
> +        * and before performing the final jumble hash.

I thought about that and had decided to leave it in.  I had been
concerned about the struct growing and shrinking between
cassert/non-cassert builds. It's probably ok since it's expected to be
an internal field. I guess if a cassert extension used the field and
ran against a non-cassert PostgreSQL, then they'd have trouble. I did
end up removing it in what I just pushed. I felt I might be being
overly concerned since the field is at the end of the struct and is
commented that it's for Asserts only.

> This comment is slightly incorrect when 0001 is taken in isolation?
> The flush of the NULL counter is done in the patch via
> FlushPendingNulls() when jumbling the final value, not before more
> appends?  It becomes true with 0002, where FlushPendingNulls() is
> called before appending any data.

That was my mistake. I divided the patches incorrectly. There was
meant to be a flush in there.

> Perhaps removing JUMBLE_FIELD_SINGLE() is better now.

Done and pushed.

Thanks.

David