Thread: pg_stat_statments queryid
Was there any actual reason why we didn't end up exposing queryid in the pg_stat_statements view? It would be highly useful when tracking changes over time. Right now I see people doing md5(query) to do that, which is a lot more ugly (and obviously uses more space and is slow, too). -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote: > Was there any actual reason why we didn't end up exposing queryid in > the pg_stat_statements view? > > It would be highly useful when tracking changes over time. Right now I > see people doing md5(query) to do that, which is a lot more ugly (and > obviously uses more space and is slow, too). Right. I continue to maintain that this is a good idea. I raised the issue more than once. However, my proposal was not accepted by Tom and Robert, apparently on the basis that queryId's actual value was partially dictated by things like the endianness of the architecture used, and the value of OIDs when serialising and subsequently hashing the post-analysis tree. What I'd like to be able to do is aggregate this information over time and/or across standbys in a cluster, as queries are evicted and subsequently re-entered into pg_stat_statement's shared hash table. Now, there are situations were this isn't going to work, like when a third-party logical replication system is used. That's unfortunate, but I wouldn't expect it makes the information any less useful to the large majority of people. I'd also credit our users with being discerning enough to realise that they should not jump to the conclusion that the value will be stable according to any particular standard. Arguments against including an internal value in the view could equally well be applied to any of the internal statistic collector views, many of which have oid columns, despite the fact that various "name" columns already unambiguously identify tuples in most cases. I see no reason for the inconsistency, particularly given that the pg_stat_statements.query column *is* still somewhat ambiguous, as described in the docs, and given that the query hash value referred to in the docs anyway. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, May 24, 2012 at 4:26 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote: >> Was there any actual reason why we didn't end up exposing queryid in >> the pg_stat_statements view? >> >> It would be highly useful when tracking changes over time. Right now I >> see people doing md5(query) to do that, which is a lot more ugly (and >> obviously uses more space and is slow, too). > > Right. I continue to maintain that this is a good idea. I raised the > issue more than once. However, my proposal was not accepted by Tom and > Robert, apparently on the basis that queryId's actual value was > partially dictated by things like the endianness of the architecture > used, and the value of OIDs when serialising and subsequently hashing > the post-analysis tree. > > What I'd like to be able to do is aggregate this information over time > and/or across standbys in a cluster, as queries are evicted and > subsequently re-entered into pg_stat_statement's shared hash table. That's exactly the usecase I'm looking at here, except it's not actually across standbys in this case. > Now, there are situations were this isn't going to work, like when a > third-party logical replication system is used. That's unfortunate, > but I wouldn't expect it makes the information any less useful to the > large majority of people. I'd also credit our users with being > discerning enough to realise that they should not jump to the > conclusion that the value will be stable according to any particular > standard. As long as it's documented as such, I don't see a problem with that at all. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Thu, May 24, 2012 at 10:26 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 24 May 2012 11:50, Magnus Hagander <magnus@hagander.net> wrote: >> Was there any actual reason why we didn't end up exposing queryid in >> the pg_stat_statements view? >> >> It would be highly useful when tracking changes over time. Right now I >> see people doing md5(query) to do that, which is a lot more ugly (and >> obviously uses more space and is slow, too). > > Right. I continue to maintain that this is a good idea. I raised the > issue more than once. However, my proposal was not accepted by Tom and > Robert, apparently on the basis that queryId's actual value was > partially dictated by things like the endianness of the architecture > used, and the value of OIDs when serialising and subsequently hashing > the post-analysis tree. No, my concern was more that I wasn't clear on what you hoped to do with it. But I think this explanation is enough to convince me that it might be worthwhile: > What I'd like to be able to do is aggregate this information over time > and/or across standbys in a cluster, as queries are evicted and > subsequently re-entered into pg_stat_statement's shared hash table. > Now, there are situations were this isn't going to work, like when a > third-party logical replication system is used. That's unfortunate, > but I wouldn't expect it makes the information any less useful to the > large majority of people. I'd also credit our users with being > discerning enough to realise that they should not jump to the > conclusion that the value will be stable according to any particular > standard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 24, 2012 at 10:26 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > But I think this explanation is enough to convince me that it might be > worthwhile: >> What I'd like to be able to do is aggregate this information over time >> and/or across standbys in a cluster, as queries are evicted and >> subsequently re-entered into pg_stat_statement's shared hash table. It appears to me that the above ... >> ... I'd also credit our users with being >> discerning enough to realise that they should not jump to the >> conclusion that the value will be stable according to any particular >> standard. ... is in direct contradiction to this. The proposed usage absolutely requires that the hash be stable "over time and/or across standbys". I do not want to promise that it's stable over any timeframe longer than a server reboot. Aside from the OID dependence problem, we might well change the way the hash is calculated in minor releases, for example by adding or removing struct fields. regards, tom lane
On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I do not want to promise that it's stable over any timeframe longer than > a server reboot. You already have though, since pg_stat_statements persistently stores statistics to disk by default, and can only ever recognise statement equivalence based on the (dbid, userid, queryid) hash key. > Aside from the OID dependence problem, we might well > change the way the hash is calculated in minor releases, for example by > adding or removing struct fields. You've already invalidated the saved statistics if you do that, so all bets are off anyway. If you have to do it, it'll be necessary to bump PGSS_FILE_HEADER, so that pg_stat_statements will be cleared upon restart. That will in turn necessitate documenting the issue in the minor version release notes. I'd hope to avoid that, but it doesn't seem to me that the situation is made any worse than before by exposing the value. On the contrary, it could help users to understand where the problem may have affected them. If you don't expose the value, users are going to do this sort of thing anyway, but will be far worse off due to using the query text or a hash thereof instead of the internal value. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> What I'd like to be able to do is aggregate this information over time >>> and/or across standbys in a cluster, as queries are evicted and >>> subsequently re-entered into pg_stat_statement's shared hash table. > > It appears to me that the above ... > >>> ... I'd also credit our users with being >>> discerning enough to realise that they should not jump to the >>> conclusion that the value will be stable according to any particular >>> standard. > > ... is in direct contradiction to this. I simply meant that we ought to be able to trust that people will actually investigate the stability guarantees of a newly exposed query_id before going and writing a tool that does some sort of aggregation - they should and will make informed decisions. If that guarantee is limited to "we might have to change the walker logic during a minor release, so the value might change for some queries, so we don't promise that it will be a stable identifier but will naturally strive to do our best to avoid invalidating existing statistics (within postgres and aggregated by external tools)", that wouldn't put many people off. Still, I don't think it's all that likely that we'll ever have to adjust the walker logic, since pg_stat_statements already doesn't strictly promise that collisions cannot occur, while making them very improbable. I'm not even asking that possible uses for queryId be documented as being useful for this sort of thing. I only ask that we expose it and document it. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, May 24, 2012 at 5:20 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 24 May 2012 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do not want to promise that it's stable over any timeframe longer than >> a server reboot. > > You already have though, since pg_stat_statements persistently stores > statistics to disk by default, and can only ever recognise statement > equivalence based on the (dbid, userid, queryid) hash key. Yes, if that's actually a problem, the whole way how pg_stat_statements stores it's data persistently across restarts needs to be rewritten. In a way that introduces an identifier that *is* stable across restarts. In which case we could just expose that identifier instead, and we're done. What exactly is it that could/would be unstable across a reboot? Not being stable across an initdb is of course a whole different story - I think it's perfectly reasonable not to be that. >> Aside from the OID dependence problem, we might well >> change the way the hash is calculated in minor releases, for example by >> adding or removing struct fields. > > You've already invalidated the saved statistics if you do that, so all > bets are off anyway. If you have to do it, it'll be necessary to bump > PGSS_FILE_HEADER, so that pg_stat_statements will be cleared upon > restart. That will in turn necessitate documenting the issue in the > minor version release notes. I'd hope to avoid that, but it doesn't > seem to me that the situation is made any worse than before by > exposing the value. On the contrary, it could help users to understand > where the problem may have affected them. Agreed. We already break something very user-visible in this case. Two symptoms of the same breakage is really not that big an issue, IMO, compared to the big gains to be had. > If you don't expose the value, users are going to do this sort of > thing anyway, but will be far worse off due to using the query text or > a hash thereof instead of the internal value. Exactly. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/