Re: Patch: add timing of buffer I/O requests - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Patch: add timing of buffer I/O requests |
Date | |
Msg-id | CAEYLb_VtJtPa7SpxEaoqejGb0YWikHyAo68adQfByQFbFN6Gzw@mail.gmail.com Whole thread Raw |
In response to | Re: Patch: add timing of buffer I/O requests (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On 13 April 2012 20:15, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 11, 2012 at 9:02 AM, ktm@rice.edu <ktm@rice.edu> wrote: >> By using all 64-bits of the hash that we currently calculate, instead >> of the current use of 32-bits only, the collision probabilities are >> very low. > > That's probably true, but I'm not sure it's worth worrying about - > one-in-four-billion is a pretty small probability. That assumes that our hashing of query trees will exhibit uniform distribution. While it's easy enough to prove that hash_any does that, it would seem to me that that does not imply that we will exhibit perfectly uniform distribution within pg_stat_statements. The reason that I invented the jumble nomenclature to distinguish it from a straight serialisation is that, apart from the fact that many fields are simply ignored, we still couldn't deserialize what we do store from the jumble, because the correct way to serialise a tree entails serialising NULL pointers too - two non-equivalent jumbles could actually be bitwise identical. In practice, I think that adding NodeTags ought to be sufficient here, but I wouldn't like to bet my life on it. Actually, that is a point that perhaps should have been touched on in the comments at the top of the file. You may wish to take a look at my original analysis in the pg_stat_statements normalisation thread, which attempts to quantify the odds by drawing upon the mathematics of the birthday problem. > On the broader issue, Peter's argument here seems to boil down to > "there is probably a reason why this is useful" and Tom's argument > seems to boil down to "i don't want to expose it without knowing what > that reason is". Peter may well be right, but that doesn't make Tom > wrong. If we are going to expose this, we ought to be able to > document why we have it, and right now we can't, because we don't > *really* know what it's good for, other than raising awareness of the > theoretical possibility of collisions, which doesn't seem like quite > enough. Well, it's important to realise that the query string isn't really stable, to the extent that it could differ as the query is evicted and re-enters pg_stat_statements over time. The hash value is necessarily a stable identifier, as it is the very identifier used by pg_stat_statements. People are going to want to aggregate this information over long periods and across database clusters, and I would really like to facilitate that. To be honest, with the plans that we have for replication, with the addition of things like cascading replication, I think it will increasingly be the case that people prefer built-in replication. The fact that the actual hash value is affected by factors like the endianness of the architecture in question seems mostly irrelevant. If we were to expose this, I'd suggest that the documentation in the table describing each column say of the value: query_hash | stable identifier used by pg_stat_statements for the query There'd then be additional clarification after the existing reference to the hash value, which gave the required caveats about the hash value being subject to various implementation artefacts that effectively only make the values persistently indentify queries referencing particular objects in the same database (that is, the object OID values are used), or across databases that are binary clones, such as between a streaming replica master and its standby. The OID restriction alone effectively shadows all others, so there's no need to mention endianness or anything like that. Anyone who jumped to the conclusion that their aggregation tool would work fine with Slony or something based on the query_hash description alone would probably have bigger problems. > On another note, I think this is a sufficiently major change that we > ought to add Peter's name to the "Author" section of the > pg_stat_statements documentation. Thanks. I actually thought this myself, but didn't want to mention it because I didn't think that it was up to me to decide, or to attempt to influence that decision. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
pgsql-hackers by date: