Thread: BUG #17668: Query normalization generates multiple queryId:s for calls to the same procedure

BUG #17668: Query normalization generates multiple queryId:s for calls to the same procedure

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17668
Logged by:          Marcus Kempe
Email address:      marcus.kempe@gmail.com
PostgreSQL version: 14.5
Operating system:   Linux
Description:

Consider the following case: 

postgres=# create function test(x bigint) returns void language sql as $$
select x ;$$;
CREATE FUNCTION
postgres=# select test(1);
 test 
------
 
(1 row)

postgres=# select calls,userid,dbid,queryid,query from
public.pg_stat_statements where query like 'select test%';
 calls | userid | dbid  |       queryid        |      query      
-------+--------+-------+----------------------+-----------------
     1 |     10 | 13757 | -8729651821477628722 | select test($1)
(1 row)

postgres=# select test(12345678910); -- larger that INTMAX.
 test 
------
 
(1 row)

postgres=# select calls,userid,dbid,queryid,query from
public.pg_stat_statements where query like 'select test%';
 calls | userid | dbid  |       queryid        |      query      
-------+--------+-------+----------------------+-----------------
     1 |     10 | 13757 | -8729651821477628722 | select test($1)
     1 |     10 | 13757 |  1452143938866253601 | select test($1)
(2 rows)

So given the above 2 calls for the same function, there are 2 different
queryIds generated, seemingly based on the fact that in the first case the
parameter can be cast to integer, and in the second case it can not.

I would have expected both calls to be normalized and fingerprinted to one
and the same queryId.

This behavior was similar even before queryjumble.c was introduced in PG14.


Hi,

On Thu, Oct 27, 2022 at 01:49:18PM +0000, PG Bug reporting form wrote:
>
> postgres=# select calls,userid,dbid,queryid,query from
> public.pg_stat_statements where query like 'select test%';
>  calls | userid | dbid  |       queryid        |      query
> -------+--------+-------+----------------------+-----------------
>      1 |     10 | 13757 | -8729651821477628722 | select test($1)
>      1 |     10 | 13757 |  1452143938866253601 | select test($1)
> (2 rows)
>
> So given the above 2 calls for the same function, there are 2 different
> queryIds generated, seemingly based on the fact that in the first case the
> parameter can be cast to integer, and in the second case it can not.
>
> I would have expected both calls to be normalized and fingerprinted to one
> and the same queryId.

Agreed, and that's actually a known problem that is currently being worked on.
You can look at
https://www.postgresql.org/message-id/flat/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com
for details about the discussion and in-progress fix.



Julien Rouhaud <rjuju123@gmail.com> writes:
> On Thu, Oct 27, 2022 at 01:49:18PM +0000, PG Bug reporting form wrote:
>> I would have expected both calls to be normalized and fingerprinted to one
>> and the same queryId.

> Agreed, and that's actually a known problem that is currently being worked on.
> You can look at
> https://www.postgresql.org/message-id/flat/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com
> for details about the discussion and in-progress fix.

Um ... that seems unrelated.  AFAICS the OP is complaining about
the fact that '42'::int4::int8 is not identical to '42'::int8.
Well, they're not.  I seriously doubt that we would or should
consider trying to get queryjumble to mask that.

            regards, tom lane



On Thu, Oct 27, 2022 at 02:28:17PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Thu, Oct 27, 2022 at 01:49:18PM +0000, PG Bug reporting form wrote:
> >> I would have expected both calls to be normalized and fingerprinted to one
> >> and the same queryId.
> 
> > Agreed, and that's actually a known problem that is currently being worked on.
> > You can look at
> > https://www.postgresql.org/message-id/flat/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com
> > for details about the discussion and in-progress fix.
> 
> Um ... that seems unrelated.  AFAICS the OP is complaining about
> the fact that '42'::int4::int8 is not identical to '42'::int8.
> Well, they're not.  I seriously doubt that we would or should
> consider trying to get queryjumble to mask that.

Oh wow I totally misread the bug report, sorry about that.  A good night sleep
and a coffee later I see what it's about, and I agree that we shouldn't try to
mask that even if it might seem surprising.



Appreciate the clarifications. The only thing I would then add is that if we agree that it's unintuitive, this could be better reflected in the documentation. 
Currently the only mention is the following (pgstatstatements.html):
"queries with identical texts might appear as separate entries, if they have different meanings as a result of factors such as different search_path settings."

Thanks and best regards, 
Marcus Kempe


On Fri, Oct 28, 2022 at 3:28 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Thu, Oct 27, 2022 at 02:28:17PM -0400, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Thu, Oct 27, 2022 at 01:49:18PM +0000, PG Bug reporting form wrote:
> >> I would have expected both calls to be normalized and fingerprinted to one
> >> and the same queryId.
>
> > Agreed, and that's actually a known problem that is currently being worked on.
> > You can look at
> > https://www.postgresql.org/message-id/flat/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com
> > for details about the discussion and in-progress fix.
>
> Um ... that seems unrelated.  AFAICS the OP is complaining about
> the fact that '42'::int4::int8 is not identical to '42'::int8.
> Well, they're not.  I seriously doubt that we would or should
> consider trying to get queryjumble to mask that.

Oh wow I totally misread the bug report, sorry about that.  A good night sleep
and a coffee later I see what it's about, and I agree that we shouldn't try to
mask that even if it might seem surprising.
Hello, I've seen possibly related case where that might be a problem.

PostgreSQL 12.8, pg_stat_statements.max=10000. There are queries like these:
insert into tablename (col1, ..., col28) values ($1, ..., $28), ... 
($2605, ... , $2660) on conflict do nothing.

Problem is that not only we might insert different number of rows but 
there might be many duplicates of the same exact query in 
pg_stat_statements (userid and dbid are the same in all rows):

select md5(query), count(*), pg_size_pretty(sum(length(query))) from 
pg_stat_statements
where query like 'insert into tablename%' group by 1 order by 2 desc 
limit 10;

    md5    | count | pg_size_pretty
----------+-------+----------------
  170...84 |  1283 | 2679 kB
  906...53 |   897 | 1724 kB
  55e...78 |   676 | 12 MB
  ea4...ef |   642 | 11 MB
  944...0f |   629 | 11 MB
  04e...ec |   530 | 9138 kB
  572...27 |   476 | 8575 kB
  5ea...3c |   430 | 7331 kB
  be1...6c |   324 | 5704 kB
  f69...47 |   313 | 550 kB

There are 8 bigint columns in the table, I suspect that we might insert 
into some of them values both greater and less than INT_MAX.
In result pg_stat_statements buffer is full of such queries (they take 
9k rows out of 10k right now, 10h since pg_stat_statements_reset call), 
currently total sum of query lengths is 117MB (97% of total sum).

 From monitoring perspective that might be a problem since calculating 
metrics from pg_stat_statements became more expensive and we might lose 
information about some queries when there is not enough space for them.

But I suppose such cases are pretty rare and some of them could be fixed 
with COPY queries instead of inserts.

--
Thanks,
Alexey Ermakov


On 2022-10-28 00:28, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
>> On Thu, Oct 27, 2022 at 01:49:18PM +0000, PG Bug reporting form wrote:
>>> I would have expected both calls to be normalized and fingerprinted to one
>>> and the same queryId.
>> Agreed, and that's actually a known problem that is currently being worked on.
>> You can look at
>> https://www.postgresql.org/message-id/flat/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com
>> for details about the discussion and in-progress fix.
> Um ... that seems unrelated.  AFAICS the OP is complaining about
> the fact that '42'::int4::int8 is not identical to '42'::int8.
> Well, they're not.  I seriously doubt that we would or should
> consider trying to get queryjumble to mask that.
>
>             regards, tom lane
>
>