Re: Patch: add timing of buffer I/O requests - Mailing list pgsql-hackers

From: ktm@rice.edu
Subject: Re: Patch: add timing of buffer I/O requests
Date: Wed, 11 Apr 2012 13:02:44
Msg-id: 20120411130244.GT24468@aart.rice.edu
In response to: Re: Patch: add timing of buffer I/O requests (Peter Geoghegan <peter@2ndquadrant.com>)
Responses: Re: Patch: add timing of buffer I/O requests (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Apr 11, 2012 at 01:53:06AM +0100, Peter Geoghegan wrote:
> On 11 April 2012 01:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Peter Geoghegan <peter@2ndquadrant.com> writes:
> >> On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote:
> >>> If people need something like that, couldn't they create it by hashing
> >>> the normalized query text with an arbitrary algorithm?
> >
> >> That supposes that the normalised query text is perfectly stable. It
> >> may well not be, particularly for things like ad-hoc queries or
> >> queries generated by ORMs, across database clusters and over long
> >> periods of time -
> >
> > Indeed, but the hash value isn't stable either given those sorts of
> > assumptions, so I'm not convinced that there's any advantage there.
>
> Isn't it? The hash captures the true meaning of the query, with the
> database server's platform as a usually irrelevant artefact. Another
> thing that I forgot to mention is client encoding - it may well be
> fairly inconvenient to have to use the same hash algorithm across
> applications. You also have to hash the query string yourself again
> and again, which is expensive to do from Python or something, and is
> often inconvenient - differences beyond track_activity_query_size
> bytes (default: 1024) are not recognised. Running the query text
> through an SQL code beautifier, so that even a single byte varies,
> now breaks everything, which developers don't expect at all (we've
> trained them not to). So in many ways you're back to the same
> limitations as classic pg_stat_statements if you attempt to aggregate
> queries over time and across machines, which is a very real use case.
>
> It's probably pretty annoying to have to get your Python app to use
> the same hash function as your Java app or whatever, unless you want
> to use something heavyweight like a cryptographic hash function. By
> doing it within Postgres, you avoid those headaches.
>
> I'm not asking you to very loudly proclaim that it should be used like
> this - just expose it, accurately document it, and I'm quite confident
> that it will be widely used and relied upon by those who are
> reasonably well informed and understand its limitations, which are
> really quite straightforward.
>
> > What I think people would actually like to know, if they're in a
> > situation where distinct query texts are getting hashed to the same
> > thing, is *which* different texts got hashed to the same thing.
> > But there's no good way to expose that given the pg_stat_statements
> > infrastructure, and exposing the hash value doesn't help.
>
> Apart from detecting the case where we get a straightforward
> collision, I don't expect that that would be useful. The whole point
> is that the user doesn't care about the difference, and I think we've
> specified a practical, widely useful standard for when queries should
> be considered equivalent.
> --
> Peter Geoghegan       http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training and Services
>
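
To make the fragility Peter describes concrete, here is a minimal,
purely illustrative sketch of the client-side approach Robert
suggests (Python chosen arbitrarily; the helper name is mine). A
single byte of formatting yields a different key, even though both
strings denote the same query:

    import hashlib

    def query_key(sql):
        # Every client (Python, Java, ...) must agree on one algorithm
        # and on the exact bytes fed to it; MD5 is an arbitrary choice.
        return hashlib.md5(sql.encode('utf-8')).hexdigest()

    q1 = "SELECT * FROM foo WHERE id = ?"
    q2 = "SELECT *  FROM foo WHERE id = ?"  # beautifier added one space

    print(query_key(q1) == query_key(q2))   # False: same query, new key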

If we use all 64 bits of the hash that we currently calculate,
instead of only the 32 bits we use now, the collision probability
becomes very low.
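
Under the standard birthday-bound approximation, P(collision) ~=
1 - exp(-n^2 / 2^(b+1)) for n distinct queries hashed to b bits. A
small sketch (Python, illustrative only) of what that works out to:

    import math

    def collision_probability(n, bits):
        # Birthday bound: P ~= 1 - exp(-n^2 / 2^(bits+1)).
        # Using -expm1 keeps precision when the probability is tiny.
        return -math.expm1(-(n * n) / 2.0 ** (bits + 1))

    for bits in (32, 64):
        for n in (1000, 10000, 100000):
            print("%2d-bit hash, %6d queries: P ~ %.3e"
                  % (bits, n, collision_probability(n, bits)))

At 32 bits, a hundred thousand distinct statements collide with
probability around 0.69; at 64 bits the same workload gives roughly
3e-10.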

Regards,
Ken

