Thread: Hash id in pg_stat_statements
Can we please expose the internal hash id of the statements in pg_stat_statements? I know there were discussions about it earlier, and it wasn't done, with the argument that it wouldn't be stable between releases (IIRC). I think we can live with that drawback, assuming of course that we document this properly. I've now run into multiple customer installations where it would be very useful to have. The use case is mainly storing snapshots of the pg_stat_statements output over time and analyzing those. Weird things happen for example when the query text is the same, but the hash is different (which can happen for example when a table is dropped and recreated). And even without that, in order to do anything useful with it, you end up hashing the query text anyway - so using the already existing hash would be easier and more useful. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
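A minimal sketch of the snapshot workflow Magnus describes (the statement_snapshots table is hypothetical): without an exposed hash, the snapshot key has to be derived from the query text, with md5() standing in for the internal hash:

    -- Hypothetical table for periodic captures of the view.
    CREATE TABLE statement_snapshots (
        captured_at timestamptz NOT NULL DEFAULT now(),
        query_key   text NOT NULL,   -- md5 of the text; a stand-in for the internal hash
        calls       bigint,
        total_time  double precision
    );

    -- Taken on a schedule (e.g. from cron); these are the captures whose
    -- keys the thread is debating:
    INSERT INTO statement_snapshots (query_key, calls, total_time)
    SELECT md5(query), calls, total_time
    FROM pg_stat_statements;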
On 1 October 2012 08:57, Magnus Hagander <magnus@hagander.net> wrote: > I know there were discussions about it earlier, and it wasn't done, > with the argument that it wouldn't be stable between releases (IIRC). > I think we can live with that drawback, assuming of course that we > document this properly. Well, I'll point out once again that the argument about its stability is invalid, because we serialise the entries to disk. If a point release changes the representation of the query tree such that the hash values won't match, then we have no recourse but to bump the pg_stat_statements version number, and invalidate all existing entries. I credit our users with the intelligence not to jump to any rash conclusions about the hash if directly exposed, such as assuming that it has any particular degree of stability with respect to the queries that are fingerprinted, or any degree greater than the self-evident bare minimum. I'm pretty sure that the "stability among point releases in the face of potential minor changes to query tree representation" thing was something that I imagined as a reason for the proposal being rejected, when I tried to read between the lines of a flat rejection. Perhaps I should have asked for clarification on that point. Now that I think about it, I'm pretty sure that the need to bump catversion, as a result of any change in how the query tree struct is dumped into stored rules, will preclude that problem in practice. > I've now run into multiple customer installations where it would be > very useful to have. The use case is mainly storing snapshots of the > pg_stat_statements output over time and analyzing those. Weird things > happen for example when the query text is the same, but the hash is > different (which can happen for example when a table is dropped and > recreated). And even without that, in order to do anything useful with > it, you end up hashing the query text anyway - so using the already > existing hash would be easier and more useful. Yes, these are all arguments that I'm familiar with :-). I've always thought of pg_stat_statements as a low-level statistical view that people would naturally want to store snapshots of for analysis, in much the same way as many do now with things like pg_stat_bgwriter, using tools like Munin. Who wouldn't want to know what queries were running half an hour ago, when the database server seemed slower than usual, for example? Such tools should naturally have access to the same "candidate key" for entries, rather than adding a subtle impedance mismatch by using the string. That reminds me - when are you writing the pg_stat_statements Postgres plugin for Munin? I was disappointed that my proposal was shot down, despite the fact that I independently raised it on list at least twice, and pushed as hard as I felt that I could at the time. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Magnus Hagander <magnus@hagander.net> writes: > Can we please expose the internal hash id of the statements in > pg_stat_statements? > I know there were discussions about it earlier, and it wasn't done, > with the argument that it wouldn't be stable between releases (IIRC). Worse than that: it could change across a minor version update. These are internal data structures we're hashing, and we've been known to have to change them for bug-fix purposes. > I've now run into multiple customer installations where it would be > very useful to have. The use case is mainly storing snapshots of the > pg_stat_statements output over time and analyzing those. Doesn't that immediately refute your claim that unstable hash values would be okay? regards, tom lane
On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Can we please expose the internal hash id of the statements in >> pg_stat_statements? > >> I know there were discussions about it earlier, and it wasn't done, >> with the argument that it wouldn't be stable between releases (IIRC). > > Worse than that: it could change across a minor version update. These > are internal data structures we're hashing, and we've been known to > have to change them for bug-fix purposes. As Peter pointed out, when we do that we have to change the file format anyway, and wipe the statistics. So changing that is something we'd have to *note*. >> I've now run into multiple customer installations where it would be >> very useful to have. The use case is mainly storing snapshots of the >> pg_stat_statements output over time and analyzing those. > > Doesn't that immediately refute your claim that unstable hash values > would be okay? No. It means they need to know when it changes, if it changes in a minor release. Which would obviously have to go in the release notes, since it would also invalidate any stored statistics in the *general* view. As long as we *tell* them under what conditions it might change, I think it's perfectly fine. Particularly those who are likely to use this functionality should certainly be capable of understanding that. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 01-10-2012 11:22, Magnus Hagander wrote: > As long as we *tell* them under what conditions it might change, I > think it's perfectly fine. Particularly those who are likely to use > this functionality should certainly be capable of understanding that. > Even if we do that, it is too much work for those statistics tools, isn't it? Instead of relying on a hash of internal structures, why don't you expose a hash of the query text to such tools? If you solve whitespace normalization, it is an almost perfect solution for your use case. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
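A hedged sketch of Euler's suggestion, done purely in SQL: collapsing runs of whitespace before hashing makes trivially reformatted query texts hash alike. This is text normalization only; it does not address the search_path problem raised later in the thread:

    SELECT md5(regexp_replace(query, '\s+', ' ', 'g')) AS text_key,
           calls, total_time
    FROM pg_stat_statements;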
On 1 October 2012 15:22, Magnus Hagander <magnus@hagander.net> wrote: > On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Worse than that: it could change across a minor version update. These >> are internal data structures we're hashing, and we've been known to >> have to change them for bug-fix purposes. > > As Peter pointed out, when we do that we have to change the file > format anyway, and wipe the statistics. So changing that is something > we'd have to *note*. I think the need to not break stored rules pins down the actual representation of the Query struct in release branches. I'm not prepared to say that it does so absolutely, since of course certain logic could be altered in a way that resulted in different values within the struct or its children, but I'm curious as to when this has actually occurred in the past. Discussion about exposing this value aside, it'd obviously be a bit disappointing if we had to invalidate existing statistics in a point release. Still, that situation isn't made any worse by exposing the value, and in fact doing so could aid in helping users to understand the issues involved. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Peter, all, * Peter Geoghegan (peter@2ndquadrant.com) wrote: > Well, I'll point out once again that the argument about its stability > is invalid, because we serialise the entries to disk. If a point > release changes the representation of the query tree such that the > hash values won't match, then we have no recourse but to bump > pg_stat_statements version number, and invalidate all existing > entries. What if we simply included the pg_stat_statements version number in what's shown to the user as the 'hash'? ver#.hash ? Thanks, Stephen
On 1 October 2012 17:12, Stephen Frost <sfrost@snowman.net> wrote: > Peter, all, > > * Peter Geoghegan (peter@2ndquadrant.com) wrote: >> Well, I'll point out once again that the argument about its stability >> is invalid, because we serialise the entries to disk. If a point >> release changes the representation of the query tree such that the >> hash values won't match, then we have no recourse but to bump >> pg_stat_statements version number, and invalidate all existing >> entries. > > What if we simply included the pg_stat_statements version number in > what's shown to the user as the 'hash'? ver#.hash ? That won't really help matters. There'd still be duplicate entries, from before and after the change, even if we make it immediately obvious which is which. The only reasonable solution in that scenario is to bump PGSS_FILE_HEADER, which will cause all existing entries to be invalidated. This is a hypothetical scenario, and concerns about it are totally orthogonal to the actual question of whether or not the queryid should be exposed... -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Peter, * Peter Geoghegan (peter@2ndquadrant.com) wrote: > That won't really help matters. There'd still be duplicate entries, > from before and after the change, even if we make it immediately > obvious which is which. The only reasonable solution in that scenario > is to bump PGSS_FILE_HEADER, which will cause all existing entries to > be invalidated. You're going to have to help me here, 'cause I don't see how there can be duplicates if we include the PGSS_FILE_HEADER as part of the hash, unless we're planning to keep PGSS_FILE_HEADER constant while we change what the hash value is for a given query, yet that goes against the assumptions that were laid out, aiui. If there's a change that results in a given query X no longer hashing to a value A, we need to change PGSS_FILE_HEADER to invalidate statistics which were collected for value A (or else we risk an independent query Y hashing to value A and ending up with completely invalid stats..). Provided we apply that consistently and don't reuse PGSS_FILE_HEADER values along the way, a combination of PGSS_FILE_HEADER and the hash value for a given query should be unique over time. We do need to document that the hash value for a given query may change.. Thanks, Stephen
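A sketch of how Stephen's composite key might look to a consumer, assuming hypothetical pgss_version and queryid columns were exposed; a point release that changed the hashing would then open a fresh key space rather than silently colliding with statistics stored under the old one:

    -- Both columns are hypothetical; neither exists today.
    SELECT pgss_version || '.' || queryid AS stable_key,
           calls, total_time
    FROM pg_stat_statements;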
On 1 October 2012 18:05, Stephen Frost <sfrost@snowman.net> wrote: > * Peter Geoghegan (peter@2ndquadrant.com) wrote: >> That won't really help matters. There'd still be duplicate entries, >> from before and after the change, even if we make it immediately >> obvious which is which. The only reasonable solution in that scenario >> is to bump PGSS_FILE_HEADER, which will cause all existing entries to >> be invalidated. > > You're going to have to help me here, 'cause I don't see how there can > be duplicates if we include the PGSS_FILE_HEADER as part of the hash, > unless we're planning to keep PGSS_FILE_HEADER constant while we change > what the hash value is for a given query, yet that goes against the > assumptions that were laid out, aiui. Well, they wouldn't be duplicates if you think that the fact that one query was executed before some point release and another after ought to differentiate queries. I do not. > If there's a change that results in a given query X no longer hashing to > a value A, we need to change PGSS_FILE_HEADER to invalidate statistics > which were collected for value A (or else we risk an independent query Y > hashing to value A and ending up with completely invalid stats..). > Provided we apply that consistently and don't reuse PGSS_FILE_HEADER > values along the way, a combination of PGSS_FILE_HEADER and the hash > value for a given query should be unique over time. > > We do need to document that the hash value for a given query may > change.. By invalidate, I mean that when we go to open the saved file, if the header doesn't match, the file is considered corrupt, and we simply log that the file could not be read, before unlinking it. This would be necessary in the unlikely event of there being some substantive change in the representation of query trees in a point release. I am not aware of any precedent for this, though Tom said that there was one. Tom: Could you please describe the precedent you had in mind? I would like to better understand this risk. I don't want to get too hung up on what we'd do if this problem actually occurred, because that isn't what this thread is about. Exposing the hash is a completely unrelated question, except that to do so might make pg_stat_statements (including its limitations) better understood. In my opinion, the various log parsing tools have taught people to think about this in the wrong way - the query string is just a representation of a query (and an imperfect one at that). There are other, similar tools that exist in proprietary databases. They expose a hash value, which is subject to exactly the same caveats as our own. They explicitly encourage the type of aggregation by third-party tools that I anticipate will happen with pg_stat_statements. I want to facilitate this, but I'm confident that the use of (dbid, userid, querytext) as a "candidate key" by these tools is going to cause confusion, in subtle ways. By using the hash value, those tools are exactly as well off as pg_stat_statements is, which is to say, as well off as possible. I simply do not understand objections to the proposal. Have I missed something? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 02-10-2012 10:15, Peter Geoghegan wrote: > There are other, similar tools that exist in proprietary databases. > They expose a hash value, which is subject to exactly the same caveats > as our own. They explicitly encourage the type of aggregation by > third-party tools that I anticipate will happen with > pg_stat_statements. I want to facilitate this, but I'm confident that > the use of (dbid, userid, querytext) as a "candidate key" by these > tools is going to cause confusion, in subtle ways. By using the hash > value, those tools are exactly as well off as pg_stat_statements is, > which is to say, as well off as possible. > It depends on how you will use this information. If it is for rapid analysis, it is fine to use a hash, because the object internals won't change (I hope not). However, if you want to analyze query statistics for a long period of time (say, a few months), or your environment is distributed, you can't use the hash. There isn't a perfect solution, but I see both cases being useful for analysis tools. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
<p dir="ltr">P<br /> On Oct 2, 2012 5:04 PM, "Euler Taveira" <<a href="mailto:euler@timbira.com">euler@timbira.com</a>>wrote:<br /> ><br /> > On 02-10-2012 10:15, Peter Geogheganwrote:<br /> > > There are other, similar tools that exist in proprietary databases.<br /> > > Theyexpose a hash value, which is subject to exactly the same caveats<br /> > > as our own. They explicitly encouragethe type of aggregation by<br /> > > third-party tools that I anticipate will happen with<br /> > >pg_stat_statements. I want to facilitate this, but I'm confident that<br /> > > the use of (dbid, userid, querytext)as a "candidate key" by these<br /> > > tools is going to cause confusion, in subtle ways. By using the hash<br/> > > value, those tools are exactly as well off as pg_stat_statements is,<br /> > > which is to say,as well off as possible.<br /> > ><br /> > It depends on how you will use this information. If it is for a rapid<br/> > analysis, it is fine to use a hash because the object internals won't change<br /> > (I hope not). However,if you want to analyze query statistics for a long<br /> > period of time (say a few months) or your environmentis distributed, you<br /> > can't use the hash. There isn't a perfect solution but I see both cases being<br/> > useful for analysis tools.<br /><p dir="ltr">Having the hash available in no way prevents you from usingthe query string ad your key. Not having the hash certainly prevents you from using it. <p dir="ltr">/Magnus
* Peter Geoghegan (peter@2ndquadrant.com) wrote: > On 1 October 2012 18:05, Stephen Frost <sfrost@snowman.net> wrote: > > You're going to have to help me here, 'cause I don't see how there can > > be duplicates if we include the PGSS_FILE_HEADER as part of the hash, > > unless we're planning to keep PGSS_FILE_HEADER constant while we change > > what the hash value is for a given query, yet that goes against the > > assumptions that were laid out, aiui. > > Well, they wouldn't be duplicates if you think that the fact that one > query was executed before some point release and another after ought > to differentiate queries. I do not. This would only be if we happened to change what hash was generated for a given query during such a point release, which I share your feeling ought to be quite rare. I'm not suggesting we do this for every point release... > By invalidate, I mean that when we go to open the saved file, if the > header doesn't match, the file is considered corrupt, and we simply > log that the file could not be read, before unlinking it. This would > be necessary in the unlikely event of there being some substantive > change in the representation of query trees in a point release. I am > not aware of any precedent for this, though Tom said that there was > one. Right, and that's all I'm trying to address here - how do we provide a value for a given query which can be relied upon by outside sources, even in the face of a point release which changes what our internal hash value for a given query is. > I don't want to get too hung up on what we'd do if this problem > actually occurred, because that isn't what this thread is about. [...] > I simply do not understand objections to the proposal. Have I missed something? It was my impression that the concern is the stability of the hash value and ensuring that tools which operate on it don't mistakenly lump two different queries into one because they had the same hash value (caused by a change in our hashing algorithm or input into it over time, eg a point release). I was hoping to address that to allow this proposal to move forward.. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Peter Geoghegan (peter@2ndquadrant.com) wrote: >> I simply do not understand objections to the proposal. Have I missed something? > It was my impression that the concern is the stability of the hash value > and ensuring that tools which operate on it don't mistakenly lump two > different queries into one because they had the same hash value (caused > by a change in our hashing algorithm or input into it over time, eg a > point release). I was hoping to address that to allow this proposal to > move forward.. I think there are at least two questions that ought to be answered: 1. Why isn't something like md5() on the reported query text an equally good solution for users who want a query hash? 2. If people are going to accumulate stats on queries over a long period of time, is a 32-bit hash really good enough for the purpose? If I'm doing the math right, the chance of collision is already greater than 1% at 10000 queries, and rises to about 70% for 100000 queries; see http://en.wikipedia.org/wiki/Birthday_paradox We discussed this issue and decided it was okay for pg_stat_statements's internal hash table, but it's not at all clear to me that it's sensible to use 32-bit hashes for external accumulation of query stats. regards, tom lane
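Tom's figures follow from the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/2^33) for n distinct queries under a 32-bit hash, and can be checked directly:

    -- About 1.2% at n = 10,000 and about 69% at n = 100,000.
    SELECT n, 1 - exp(-(n * (n - 1)) / 2 / 4294967296) AS p_collision
    FROM (VALUES (10000.0), (100000.0)) AS t(n);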
On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote > 1. Why isn't something like md5() on the reported query text an equally > good solution for users who want a query hash? Because that does not uniquely identify the entry. The very first thing that the docs say on search_path is "Qualified names are tedious to write, and it's often best not to wire a particular schema name into applications anyway". Presumably, the reason it's best not to wire schema names into apps is that it might be useful to modify search_path in a way that dynamically made the same queries in some application reference what are technically distinct relations. If anyone does this, and it seems likely that many do for various reasons, they will be out of luck when using some kind of pg_stat_statements aggregation. This was the behaviour that I intended for pg_stat_statements all along, and I think it's better than a solution that matches query strings. > 2. If people are going to accumulate stats on queries over a long period > of time, is a 32-bit hash really good enough for the purpose? If I'm > doing the math right, the chance of collision is already greater than 1% > at 10000 queries, and rises to about 70% for 100000 queries; see > http://en.wikipedia.org/wiki/Birthday_paradox > We discussed this issue and decided it was okay for pg_stat_statements's > internal hash table, but it's not at all clear to me that it's sensible > to use 32-bit hashes for external accumulation of query stats. Well, forgive me for pointing this out, but I did propose that the hash be a 64-bit value (which would have necessitated adapting hash_any() to produce 64-bit values), but you rejected the proposal. I arrived at the same probability for a collision as you did and posted it to the list, in discussion shortly after the normalisation stuff was committed. A more sensible way of assessing the risk of a collision would be to try to come up with the probability of a collision that someone actually ends up caring about, which is considerably less than the 1% for 10,000 entries. I'm not being glib - people are very used to the idea that aggregating information on query costs is a lossy process. Prior to 9.2, the only way execution costs could reasonably be measured at the query granularity on a busy system was to set log_min_duration_statement to something like 1 second. I am also unconvinced by the idea that aggregating historical data (with the same hash value) in a separate application is likely to make the collision situation appreciably worse. People are going to be using something like an RRD circular buffer to aggregate the information, and I can't see anyone caring about detailed information that is more than a couple of weeks in the past. The point of aggregation isn't to store more queries, it's to construct time-series data from snapshots. Besides, do most applications really even have more than 10,000 distinct queries? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
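The search_path hazard Peter describes is easy to reproduce; the two SELECTs below have byte-identical text but fingerprint distinct relations, so a text hash such as md5() would wrongly merge entries that pg_stat_statements keeps apart:

    CREATE SCHEMA a; CREATE TABLE a.t (x int);
    CREATE SCHEMA b; CREATE TABLE b.t (x int);

    SET search_path = a;
    SELECT count(*) FROM t;   -- the fingerprinted tree references a.t
    SET search_path = b;
    SELECT count(*) FROM t;   -- same text, but the tree references b.t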
On 2 October 2012 17:58, Stephen Frost <sfrost@snowman.net> wrote: > Right, and that's all I'm trying to address here - how do we provide a > value for a given query which can be relied upon by outside sources, > even in the face of a point release which changes what our internal hash > value for a given query is. I don't know of a way. Presumably, we'd hope to avoid this, and would look for alternatives to anything that would necessitate bumping PGSS_FILE_HEADER, while not going so far as to let pg_stat_statements contort things in the core system. If I were aware of a case where this would have come up had pg_stat_statements fingerprinting been around at the time, perhaps I could give a better answer than that. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 02, 2012 at 12:58:15PM -0400, Stephen Frost wrote: > > I simply do not understand objections to the proposal. Have I missed something? > > It was my impression that the concern is the stability of the hash value > and ensuring that tools which operate on it don't mistakenly lump two > different queries into one because they had the same hash value (caused > by a change in our hashing algorithm or input into it over time, eg a > point release). I was hoping to address that to allow this proposal to > move forward.. That makes no sense though. The moment you talk about "hash" you consider the possibility of lumping together things that aren't the same. Any tools using these hashes must have realised this. Fortunately, the statistics are better than the birthday paradox suggests: the chance that the two most important queries in your system end up having the same hash is minuscule. As mentioned elsewhere, a system with more than 10,000 different queries sounds rare to me (once you exclude query parameters, of course). Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
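Martijn's pairwise point is easy to quantify: for any two particular queries, the chance of sharing a 32-bit hash is 1 in 2^32, regardless of how many other entries exist:

    SELECT 1.0 / 4294967296 AS p_pair;   -- roughly 2.3e-10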
On Mon, Oct 1, 2012 at 12:57 AM, Magnus Hagander <magnus@hagander.net> wrote: > Can we please expose the internal hash id of the statements in > pg_stat_statements? > > I know there were discussions about it earlier, and it wasn't done, > with the argument that it wouldn't be stable between releases (IIRC). > I think we can live with that drawback, assuming of course that we > document this properly. > > I've now run into multiple customer installations where it would be > very useful to have. The use case is mainly storing snapshots of the > pg_stat_statements output over time and analyzing those. Weird things > happen for example when the query text is the same, but the hash is > different (which can happen for example when a table is dropped and > recreated). And even without that, in order to do anything useful with > it, you end up hashing the query text anyway - so using the already > existing hash would be easier and more useful. I have a similar problem; however, I am not sure that the hash as generated is ideal. Putting aside the mechanical reasons given in the main branch of the thread (versioning, the shut-down/stats file, and so on), I also have this feeling that it is not what I want. Consider the following case: SELECT * FROM users WHERE id = ? <this query isn't seen for a while> SELECT * FROM users WHERE id = ? In the intervening time, an equivalent hash could still be evicted and reintroduced, and the statistics silently reset, and that'll befuddle principled tools. This is worse than merely less useful, because it can lead to drastic underestimations that otherwise pass inspection. Instead, I think it makes sense to assign a number -- arbitrarily, but uniquely -- to the generation of a new row in pg_stat_statements, and, on the flip side, whenever a row is retired its number should be eliminated, practically, for-ever. This way re-introductions between two samplings of pg_stat_statements cannot be confused for a contiguously maintained statistic on a query. -- fdr
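A sketch of how a sampling tool might use the number Daniel proposes, assuming a hypothetical entry_id column carrying it and snapshot_old/snapshot_new tables holding two successive captures; a delta is trusted only when both snapshots carry the same generation, so an evict-and-reintroduce cycle can no longer masquerade as continuous history:

    SELECT query, s2.calls - s1.calls AS calls_delta
    FROM snapshot_new s2
    JOIN snapshot_old s1 USING (query)
    WHERE s2.entry_id = s1.entry_id;   -- same generation: the stats are contiguous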
Daniel Farina <daniel@heroku.com> writes: > Instead, I think it makes sense to assign a number -- arbitrarily, but > uniquely -- to the generation of a new row in pg_stat_statements, and, > on the flip side, whenever a row is retired its number should be > eliminated, practically, for-ever. This way re-introductions between > two samplings of pg_stat_statements cannot be confused for a > contiguously maintained statistic on a query. This argument seems sensible to me. Is there any use-case where the proposed counter wouldn't do what people wished to do with an exposed hash value? regards, tom lane
On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Daniel Farina <daniel@heroku.com> writes: >> Instead, I think it makes sense to assign a number -- arbitrarily, but >> uniquely -- to the generation of a new row in pg_stat_statements, and, >> on the flip side, whenever a row is retired its number should be >> eliminated, practically, for-ever. This way re-introductions between >> two samplings of pg_stat_statements cannot be confused for a >> contiguously maintained statistic on a query. > > This argument seems sensible to me. Is there any use-case where the > proposed counter wouldn't do what people wished to do with an exposed > hash value? Yes. The hash could be used to aggregate query execution costs across entire WAL-based replication clusters. I'm not opposed to Daniel's suggestion, though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Oct 3, 2012 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Daniel Farina <daniel@heroku.com> writes: >> Instead, I think it makes sense to assign a number -- arbitrarily, but >> uniquely -- to the generation of a new row in pg_stat_statements, and, >> on the flip side, whenever a row is retired its number should be >> eliminated, practically, for-ever. This way re-introductions between >> two samplings of pg_stat_statements cannot be confused for a >> contiguously maintained statistic on a query. > > This argument seems sensible to me. Is there any use-case where the > proposed counter wouldn't do what people wished to do with an exposed > hash value? Yes: unless schemas are included in the canonicalized query text, two identical texts can actually mean different things. This is the only corner case (besides collision) I can think of. I also referenced an idea in a similar vein to the counter/arbitrary number: instead, one can perform a kind of error propagation from evicted entries in the hash table. This might avoid having to even bother saving a counter, which feels rather nice to me. I have a sketch of an implementation (I now know of two stupid bugs in it), and it comes out very neatly so far. I got this idea from a paper that I saw implemented once, with strikingly good effect: http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf Here they have a very specific rendition that is, for a couple of reasons, not quite what we want, I think. But the part I found very interesting was the simple propagation of error that made the technique possible. Inspired by this, I gave every hash entry a maximum under-estimation error. When a member of the hash table is evicted, the minimum of the metric gathered, plus its already accumulated under-estimation error, is propagated to the new entry. The general property is that hash table members that are frequently evicted will accumulate huge amounts of error. Those that are never evicted (thanks to constant use) never accumulate any error. A tool reading the table can take the error accumulation into account to determine whether there was an eviction between two samplings, as well as to bound the error accrued between two snapshots. I think there is a tiny bit of room for some sort of ambiguity in a corner case, but I'd have to think more about it. -- fdr
On 3 October 2012 19:54, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This argument seems sensible to me. Is there any use-case where the >> proposed counter wouldn't do what people wished to do with an exposed >> hash value? > > Yes. The hash could be used to aggregate query execution costs across > entire WAL-based replication clusters. I'm not opposed to Daniel's > suggestion, though. Could we please try and reach a consensus here? If you're still dead set against exposing the hash value, I think that just following what Daniel has suggested is a fair compromise. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 3 October 2012 19:54, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This argument seems sensible to me. Is there any use-case where the >> proposed counter wouldn't do what people wished to do with an exposed >> hash value? > > Yes. The hash could be used to aggregate query execution costs across > entire WAL-based replication clusters. I'm not opposed to Daniel's > suggestion, though. I failed to mention a more compelling use case. On systems that use roles extensively, there will naturally, from time to time, be a desire to see things at the query granularity, rather than at the (userid, queryid) granularity. Exposing the hash value allows the user to construct an aggregate query that groups by the hash value, rather than the query string; for all the reasons already mentioned, that is a more principled approach, as sketched below. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
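Assuming a hypothetical exposed queryid column, the aggregation Peter describes is a one-liner, with no reliance on the query text as a key:

    SELECT queryid, sum(calls) AS calls, sum(total_time) AS total_time
    FROM pg_stat_statements
    GROUP BY queryid;   -- collapses per-(userid, queryid) rows to query granularity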
On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Daniel Farina <daniel@heroku.com> writes: >> Instead, I think it makes sense to assign a number -- arbitrarily, but >> uniquely -- to the generation of a new row in pg_stat_statements, and, >> on the flip side, whenever a row is retired its number should be >> eliminated, practically, for-ever. This way re-introductions between >> two samplings of pg_stat_statements cannot be confused for a >> contiguously maintained statistic on a query. > > This argument seems sensible to me. Daniel: Could you please submit the patch that you were working on that does this to the next commitfest? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Oct 15, 2012 at 7:35 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Daniel Farina <daniel@heroku.com> writes: >>> Instead, I think it makes sense to assign a number -- arbitrarily, but >>> uniquely -- to the generation of a new row in pg_stat_statements, and, >>> on the flip side, whenever a row is retired its number should be >>> eliminated, practically, for-ever. This way re-introductions between >>> two samplings of pg_stat_statements cannot be confused for a >>> contiguously maintained statistic on a query. >> >> This argument seems sensible to me. > > Daniel: Could you please submit the patch that you were working on > that does this to the next commitfest? Yes. Sorry I haven't done that already. I'll clean it up and send it out Real Soon Now, thanks for the expression of interest. -- fdr
On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote >> 1. Why isn't something like md5() on the reported query text an equally >> good solution for users who want a query hash? > > Because that does not uniquely identify the entry. The very first > thing that the docs say on search_path is "Qualified names are tedious > to write, and it's often best not to wire a particular schema name > into applications anyway". Presumably, the reason it's best not to > wire schema names into apps is that it might be useful to modify > search_path in a way that dynamically made the same queries in some > application reference what are technically distinct relations. If > anyone does this, and it seems likely that many do for various > reasons, they will be out of luck when using some kind of > pg_stat_statements aggregation. > > This was the behaviour that I intended for pg_stat_statements all > along, and I think it's better than a solution that matches query > strings. > >> 2. If people are going to accumulate stats on queries over a long period >> of time, is a 32-bit hash really good enough for the purpose? If I'm >> doing the math right, the chance of collision is already greater than 1% >> at 10000 queries, and rises to about 70% for 100000 queries; see >> http://en.wikipedia.org/wiki/Birthday_paradox >> We discussed this issue and decided it was okay for pg_stat_statements's >> internal hash table, but it's not at all clear to me that it's sensible >> to use 32-bit hashes for external accumulation of query stats. > > Well, forgive me for pointing this out, but I did propose that the > hash be a 64-bit value (which would have necessitated adapting > hash_any() to produce 64-bit values), but you rejected the proposal. I > arrived at the same probability for a collision as you did and posted > it to the list, in discussion shortly after the normalisation stuff > was committed. What was the argument for rejecting it? Just that it required hash_any() to be adapted? Now that we have a very clear use case where this would help, perhaps it's time to revisit this proposal? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 15 November 2012 13:10, Magnus Hagander <magnus@hagander.net> wrote: >> Well, forgive me for pointing this out, but I did propose that the >> hash be a 64-bit value (which would have necessitated adapting >> hash_any() to produce 64-bit values), but you rejected the proposal. I >> arrived at the same probability for a collision as you did and posted >> it to the list, in discussion shortly after the normalisation stuff >> was committed. > > What was the argument for rejecting it? Just that it required > hash_any() to be adapted? I think so, yes. It just wasn't deemed necessary. Note that hash_any() has a comment that says something like "if you want to adapt this so that the Datum returned is a 64-bit integer, the way to do that is...". I followed this advice in a revision of the pg_stat_statements normalisation patch, and preserved compatibility by using a macro, which Tom objected to. > Now that we have a very clear use case where this would help, perhaps > it's time to revisit this proposal? Perhaps. I think that the issue of whether or not a 64-bit integer is necessary is totally orthogonal to the question of whether or not we should expose the hash. The need to aggregate historical statistics just doesn't appreciably alter things here, I feel. The number of discrete queries that an application will execute in a week just isn't that different from the number that it will ever execute, I suspect. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
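For comparison, repeating the earlier birthday computation with a 64-bit hash shows what the wider value would buy: even at 100,000 distinct queries, the collision probability is on the order of 10^-10:

    SELECT 1 - exp(-(100000.0 * 99999) / 2 / 18446744073709551616.0) AS p_collision;
    -- roughly 2.7e-10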
Peter Geoghegan <peter@2ndquadrant.com> writes: > should expose the hash. The need to aggregate historical statistics > just doesn't appreciably alter things here, I feel. The number of > discrete queries that an application will execute in a week just isn't > that different from the number that it will ever execute, I suspect. Please don't forget that some people won't write the most effective SQL from the get-go and will rewrite problematic queries. Some people will even roll out new features in their applications, with new queries to implement them. Or new applications on top of the existing database. So I don't think that the query list is that static. That said, I don't have any idea at all about the impact of what I'm saying on your analysis… Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support