Thread: pg_stat_statments queryid

pg_stat_statments queryid

From
Magnus Hagander
Date:
Was there any actual reason why we didn't end up exposing queryid in
the pg_stat_statements view?

It would be highly useful when tracking changes over time. Right now I
see people doing md5(query) to do that, which is a lot more ugly (and
obviously uses more space and is slow, too).

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: pg_stat_statments queryid

From
Peter Geoghegan
Date:
On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote:
> Was there any actual reason why we didn't end up exposing queryid in
> the pg_stat_statements view?
>
> It would be highly useful when tracking changes over time. Right now I
> see people doing md5(query) to do that, which is a lot more ugly (and
> obviously uses more space and is slow, too).

Right. I continue to maintain that this is a good idea. I raised the
issue more than once. However, my proposal was not accepted by Tom and
Robert, apparently on the basis that queryId's actual value was
partially dictated by things like the endianness of the architecture
used, and the value of OIDs when serialising and subsequently hashing
the post-analysis tree.

What I'd like to be able to do is aggregate this information over time
and/or across standbys in a cluster, as queries are evicted and
subsequently re-entered into pg_stat_statement's shared hash table.
Now, there are situations were this isn't going to work, like when a
third-party logical replication system is used. That's unfortunate,
but I wouldn't expect it makes the information any less useful to the
large majority of people. I'd also credit our users with being
discerning enough to realise that they should not jump to the
conclusion that the value will be stable according to any particular
standard.

Arguments against including an internal value in the view could
equally well be applied to any of the internal statistic collector
views, many of which have oid columns, despite the fact that various
"name" columns already unambiguously identify tuples in most cases. I
see no reason for the inconsistency, particularly given that the
pg_stat_statements.query column *is* still somewhat ambiguous, as
described in the docs, and given that the query hash value referred to
in the docs anyway.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: pg_stat_statments queryid

From
Magnus Hagander
Date:
On Thu, May 24, 2012 at 4:26 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote:
>> Was there any actual reason why we didn't end up exposing queryid in
>> the pg_stat_statements view?
>>
>> It would be highly useful when tracking changes over time. Right now I
>> see people doing md5(query) to do that, which is a lot more ugly (and
>> obviously uses more space and is slow, too).
>
> Right. I continue to maintain that this is a good idea. I raised the
> issue more than once. However, my proposal was not accepted by Tom and
> Robert, apparently on the basis that queryId's actual value was
> partially dictated by things like the endianness of the architecture
> used, and the value of OIDs when serialising and subsequently hashing
> the post-analysis tree.
>
> What I'd like to be able to do is aggregate this information over time
> and/or across standbys in a cluster, as queries are evicted and
> subsequently re-entered into pg_stat_statement's shared hash table.

That's exactly the usecase I'm looking at here, except it's not
actually across standbys in this case.


> Now, there are situations were this isn't going to work, like when a
> third-party logical replication system is used. That's unfortunate,
> but I wouldn't expect it makes the information any less useful to the
> large majority of people. I'd also credit our users with being
> discerning enough to realise that they should not jump to the
> conclusion that the value will be stable according to any particular
> standard.

As long as it's documented as such, I don't see a problem with that at all.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: pg_stat_statments queryid

From
Robert Haas
Date:
On Thu, May 24, 2012 at 10:26 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote:
>> Was there any actual reason why we didn't end up exposing queryid in
>> the pg_stat_statements view?
>>
>> It would be highly useful when tracking changes over time. Right now I
>> see people doing md5(query) to do that, which is a lot more ugly (and
>> obviously uses more space and is slow, too).
>
> Right. I continue to maintain that this is a good idea. I raised the
> issue more than once. However, my proposal was not accepted by Tom and
> Robert, apparently on the basis that queryId's actual value was
> partially dictated by things like the endianness of the architecture
> used, and the value of OIDs when serialising and subsequently hashing
> the post-analysis tree.

No, my concern was more that I wasn't clear on what you hoped to do with it.

But I think this explanation is enough to convince me that it might be
worthwhile:

> What I'd like to be able to do is aggregate this information over time
> and/or across standbys in a cluster, as queries are evicted and
> subsequently re-entered into pg_stat_statement's shared hash table.
> Now, there are situations were this isn't going to work, like when a
> third-party logical replication system is used. That's unfortunate,
> but I wouldn't expect it makes the information any less useful to the
> large majority of people. I'd also credit our users with being
> discerning enough to realise that they should not jump to the
> conclusion that the value will be stable according to any particular
> standard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_stat_statments queryid

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 24, 2012 at 10:26 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> But I think this explanation is enough to convince me that it might be
> worthwhile:

>> What I'd like to be able to do is aggregate this information over time
>> and/or across standbys in a cluster, as queries are evicted and
>> subsequently re-entered into pg_stat_statement's shared hash table.

It appears to me that the above ...

>> ... I'd also credit our users with being
>> discerning enough to realise that they should not jump to the
>> conclusion that the value will be stable according to any particular
>> standard.

... is in direct contradiction to this.  The proposed usage absolutely
requires that the hash be stable "over time and/or across standbys".
I do not want to promise that it's stable over any timeframe longer than
a server reboot.  Aside from the OID dependence problem, we might well
change the way the hash is calculated in minor releases, for example by
adding or removing struct fields.
        regards, tom lane


Re: pg_stat_statments queryid

From
Peter Geoghegan
Date:
On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I do not want to promise that it's stable over any timeframe longer than
> a server reboot.

You already have though, since pg_stat_statements persistently stores
statistics to disk by default, and can only ever recognise statement
equivalence based on the (dbid, userid, queryid) hash key.

> Aside from the OID dependence problem, we might well
> change the way the hash is calculated in minor releases, for example by
> adding or removing struct fields.

You've already invalidated the saved statistics if you do that, so all
bets are off anyway. If you have to do it, it'll be necessary to bump
PGSS_FILE_HEADER, so that pg_stat_statements will be cleared upon
restart. That will in turn necessitate documenting the issue in the
minor version release notes. I'd hope to avoid that, but it doesn't
seem to me that the situation is made any worse than before by
exposing the value. On the contrary, it could help users to understand
where the problem may have affected them.

If you don't expose the value, users are going to do this sort of
thing anyway, but will be far worse off due to using the query text or
a hash thereof instead of the internal value.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: pg_stat_statments queryid

From
Peter Geoghegan
Date:
On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What I'd like to be able to do is aggregate this information over time
>>> and/or across standbys in a cluster, as queries are evicted and
>>> subsequently re-entered into pg_stat_statement's shared hash table.
>
> It appears to me that the above ...
>
>>> ... I'd also credit our users with being
>>> discerning enough to realise that they should not jump to the
>>> conclusion that the value will be stable according to any particular
>>> standard.
>
> ... is in direct contradiction to this.

I simply meant that we ought to be able to trust that people will
actually investigate the stability guarantees of a newly exposed
query_id before going and writing a tool that does some sort of
aggregation - they should and will make informed decisions. If that
guarantee is limited to "we might have to change the walker logic
during a minor release, so the value might change for some queries, so
we don't promise that it will be a stable identifier but will
naturally strive to do our best to avoid invalidating existing
statistics (within postgres and aggregated by external tools)", that
wouldn't put many people off. Still, I don't think it's all that
likely that we'll ever have to adjust the walker logic, since
pg_stat_statements already doesn't strictly promise that collisions
cannot occur, while making them very improbable.

I'm not even asking that possible uses for queryId be documented as
being useful for this sort of thing. I only ask that we expose it and
document it.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: pg_stat_statments queryid

From
Magnus Hagander
Date:
On Thu, May 24, 2012 at 5:20 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not want to promise that it's stable over any timeframe longer than
>> a server reboot.
>
> You already have though, since pg_stat_statements persistently stores
> statistics to disk by default, and can only ever recognise statement
> equivalence based on the (dbid, userid, queryid) hash key.

Yes, if that's actually a problem, the whole way how
pg_stat_statements stores it's data persistently across restarts needs
to be rewritten. In a way that introduces an identifier that *is*
stable across restarts. In which case we could just expose that
identifier instead, and we're done.

What exactly is it that could/would be unstable across a reboot?

Not being stable across an initdb is of course a whole different story
- I think it's perfectly reasonable not to be that.



>> Aside from the OID dependence problem, we might well
>> change the way the hash is calculated in minor releases, for example by
>> adding or removing struct fields.
>
> You've already invalidated the saved statistics if you do that, so all
> bets are off anyway. If you have to do it, it'll be necessary to bump
> PGSS_FILE_HEADER, so that pg_stat_statements will be cleared upon
> restart. That will in turn necessitate documenting the issue in the
> minor version release notes. I'd hope to avoid that, but it doesn't
> seem to me that the situation is made any worse than before by
> exposing the value. On the contrary, it could help users to understand
> where the problem may have affected them.

Agreed. We already break something very user-visible in this case. Two
symptoms of the same breakage is really not that big an issue, IMO,
compared to the big gains to be had.


> If you don't expose the value, users are going to do this sort of
> thing anyway, but will be far worse off due to using the query text or
> a hash thereof instead of the internal value.

Exactly.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/