Thread: Hash id in pg_stat_statements

Hash id in pg_stat_statements

From
Magnus Hagander
Date:
Can we please expose the internal hash id of the statements in
pg_stat_statements?

I know there were discussions about it earlier, and it wasn't done, with
the argument that it wouldn't be stable between releases (IIRC). I think we
can live with that drawback, assuming of course that we document this
properly.

I've now run into multiple customer installations where it would be
very useful to have. The usecase is mainly storing snapshots of the
pg_stat_statements output over time and analyzing those. Weird things
happen for example when the query text is the same, but the hash is
different (which can happen for example when a table is dropped and
recreated). And even without that, in order to do anything useful with
it, you end up hashing the query text anyway - so using the already
existing hash would be easier and more useful.
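
To make the use case concrete, the sort of thing these installations do is
roughly this (a sketch only: the snapshot table is something the monitoring
job creates for itself, and the queryid column is of course the thing I'm
asking for here):

    -- take a snapshot, e.g. from cron
    CREATE TABLE pgss_snapshot AS
      SELECT now() AS snap_time, userid, dbid, queryid, query, calls, total_time
      FROM pg_stat_statements;

    -- subsequent snapshots
    INSERT INTO pgss_snapshot
      SELECT now(), userid, dbid, queryid, query, calls, total_time
      FROM pg_stat_statements;

    -- what changed between two snapshots, keyed on the hash rather than the text
    SELECT queryid,
           max(cur.query)                         AS query,
           sum(cur.calls - prev.calls)            AS calls_delta,
           sum(cur.total_time - prev.total_time)  AS time_delta
    FROM pgss_snapshot cur
    JOIN pgss_snapshot prev USING (userid, dbid, queryid)
    WHERE cur.snap_time = '2012-10-01 12:00'   -- two snapshot times of interest
      AND prev.snap_time = '2012-10-01 11:00'
    GROUP BY queryid
    ORDER BY time_delta DESC;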

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 1 October 2012 08:57, Magnus Hagander <magnus@hagander.net> wrote:
> I know there were discussions about it earlier, and it wasn't done, with
> the argument that it wouldn't be stable between releases (IIRC). I think we
> can live with that drawback, assuming of course that we document this
> properly.

Well, I'll point out once again that the argument about its stability
is invalid, because we serialise the entries to disk. If a point
release changes the representation of the query tree such that the
hash values won't match, then we have no recourse but to bump
pg_stat_statements version number, and invalidate all existing
entries. I credit our users with the intelligence to not jump to any
rash conclusions about the hash if directly exposed, such as assuming
that it has any particular degree of stability with respect to the
queries that are fingerprinted, or any degree greater than the
self-evident bare minimum.

I'm pretty sure that the "stability among point releases in the face
of potential minor changes to query tree representation" thing was
something that I imagined as a reason for the proposal being rejected,
when I tried to read between the lines of a flat rejection. Perhaps I
should have asked for clarification on that point. Now that I think
about it, I'm pretty sure that the need to bump catversion for any
change in how the query tree struct is dumped into stored rules will
preclude that problem in practice.

> I've now run into multiple customer installations where it would be
> very useful to have. The usecase is mainly storing snapshots of the
> pg_stat_statements output over time and analyzing those. Weird things
> happen for example when the query text is the same, but the hash is
> different (which can happen for example when a table is dropped and
> recreated). And even without that, in order to do anything useful with
> it, you end up hashing the query text anyway - so using the already
> existing hash would be easier and more useful.

Yes, these are all arguments that I'm familiar with :-). I've always
thought of pg_stat_statements as a low-level statistical view that
people would naturally want to store snapshots of for analysis, in
much the same way as many do now with things like pg_stat_bgwriter
using tools like Munin. Who wouldn't want to know what queries were
running half an hour ago when the database server seemed slower than
usual, for example? Such tools should naturally have access to the
same "candidate key" for entries, rather than adding a subtle
impedance mismatch by using the string. That reminds me - when are you
writing the pg_stat_statements Postgres plugin for Munin?

I was disappointed that my proposal was shot down, despite the fact
that I independently raised it on list at least twice, and pushed as
hard as I felt that I could at the time.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Can we please expose the internal hash id of the statements in
> pg_stat_statements?

> I know there were discussions about it earlier, and it wasn't done, with
> the argument that it wouldn't be stable between releases (IIRC).

Worse than that: it could change across a minor version update.  These
are internal data structures we're hashing, and we've been known to
have to change them for bug-fix purposes.

> I've now run into multiple customer installations where it would be
> very useful to have. The usecase is mainly storing snapshots of the
> pg_stat_statements output over time and analyzing those.

Doesn't that immediately refute your claim that unstable hash values
would be okay?
        regards, tom lane



Re: Hash id in pg_stat_statements

From
Magnus Hagander
Date:
On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Can we please expose the internal hash id of the statements in
>> pg_stat_statements?
>
>> I know there were discussions about it earlier, and it wasn't done, with
>> the argument that it wouldn't be stable between releases (IIRC).
>
> Worse than that: it could change across a minor version update.  These
> are internal data structures we're hashing, and we've been known to
> have to change them for bug-fix purposes.

As Peter pointed out, when we do that we have to change the file
format anyway, and wipe the statistics. So changing that is something
we'd have to *note*.


>> I've now run into multiple customer installations where it would be
>> very useful to have. The usecase is mainly storing snapshots of the
>> pg_stat_statements output over time and analyzing those.
>
> Doesn't that immediately refute your claim that unstable hash values
> would be okay?

No. It means they need to know when it changes, if it changes in a
minor release. Which would obviously have to go in the release notes,
since it would also invalidate any stored statistics in the *general*
view.

As long as we *tell* them under what conditions it might change, I
think it's perfectly fine. Particularly those who are likely to use
this functionality should certainly be capable of understanding that.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: Hash id in pg_stat_statements

From
Euler Taveira
Date:
On 01-10-2012 11:22, Magnus Hagander wrote:
> As long as we *tell* them under what conditions it might change, I
> think it's perfectly fine. Particularly those who are likely to use
> this functionality should certainly be capable of understanding that.
> 
Even if we do that, it is too much work for those statistics tools, isn't it?
Instead of relying on a hash of internal structures, why don't you expose a
hash of the query text to such tools? If you solve the whitespace
normalization issue, it is an almost perfect solution for your use case.


--
Euler Taveira de Oliveira - Timbira
http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 1 October 2012 15:22, Magnus Hagander <magnus@hagander.net> wrote:
> On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Worse than that: it could change across a minor version update.  These
>> are internal data structures we're hashing, and we've been known to
>> have to change them for bug-fix purposes.
>
> As Peter pointed out, when we do that we have to change the file
> format anyway, and wipe the statistics. So changing that is something
> we'd have to *note*.

I think the need to not break stored rules pins down the actual
representation of the Query struct in release branches. I'm not
prepared to say that it does so absolutely, since of course certain
logic could be altered in a way that resulted in different values
within the struct or its children, but I'm curious as to when this
actually occurred in the past. Discussion about exposing this value
aside, it'd obviously be a bit disappointing if we had to invalidate
existing statistics in a point release. Still, that situation isn't
made any worse by exposing the value, and in fact doing so could help
users understand the issues involved.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Stephen Frost
Date:
Peter, all,

* Peter Geoghegan (peter@2ndquadrant.com) wrote:
> Well, I'll point out once again that the argument about its stability
> is invalid, because we serialise the entries to disk. If a point
> release changes the representation of the query tree such that the
> hash values won't match, then we have no recourse but to bump
> pg_stat_statements version number, and invalidate all existing
> entries.

What if we simply included the pg_stat_statements version number in
what's shown to the user as the 'hash'?  ver#.hash ?
Thanks,
    Stephen

Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 1 October 2012 17:12, Stephen Frost <sfrost@snowman.net> wrote:
> Peter, all,
>
> * Peter Geoghegan (peter@2ndquadrant.com) wrote:
>> Well, I'll point out once again that the argument about its stability
>> is invalid, because we serialise the entries to disk. If a point
>> release changes the representation of the query tree such that the
>> hash values won't match, then we have no recourse but to bump
>> pg_stat_statements version number, and invalidate all existing
>> entries.
>
> What if we simply included the pg_stat_statements version number in
> what's shown to the user as the 'hash'?  ver#.hash ?

That won't really help matters. There'd still be duplicate entries,
from before and after the change, even if we make it immediately
obvious which is which. The only reasonable solution in that scenario
is to bump PGSS_FILE_HEADER, which will cause all existing entries to
be invalidated.

This is a hypothetical scenario, and concerns about it are totally
orthogonal to the actual question of whether or not the queryid should
be exposed...

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Stephen Frost
Date:
Peter,

* Peter Geoghegan (peter@2ndquadrant.com) wrote:
> That won't really help matters. There'd still be duplicate entries,
> from before and after the change, even if we make it immediately
> obvious which is which. The only reasonable solution in that scenario
> is to bump PGSS_FILE_HEADER, which will cause all existing entries to
> be invalidated.

You're going to have to help me here, 'cause I don't see how there can
be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
unless we're planning to keep PGSS_FILE_HEADER constant while we change
what the hash value is for a given query, yet that goes against the
assumptions that were laid out, aiui.

If there's a change that results in a given query X no longer hashing to
a value A, we need to change PGSS_FILE_HEADER to invalidate statistics
which were collected for value A (or else we risk an independent query Y
hashing to value A and ending up with completely invalid stats..).
Provided we apply that consistently and don't reuse PGSS_FILE_HEADER
values along the way, a combination of PGSS_FILE_HEADER and the hash
value for a given query should be unique over time.

We do need to document that the hash value for a given query may
change..
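
For example (a sketch only; pgss_file_header() is a made-up function standing
in for however we'd expose the format version, and queryid for the hash
itself):

    INSERT INTO stats_snapshot (snap_time, pgss_version, queryid, calls, total_time)
      SELECT now(), pgss_file_header(), queryid, calls, total_time
      FROM pg_stat_statements;

A tool diffing two snapshots would only subtract counters where both
pgss_version and queryid match, so a point release that changed the hashing
would show up as a new set of keys rather than as bogus deltas.
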
Thanks,
    Stephen

Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 1 October 2012 18:05, Stephen Frost <sfrost@snowman.net> wrote:
> * Peter Geoghegan (peter@2ndquadrant.com) wrote:
>> That won't really help matters. There'd still be duplicate entries,
>> from before and after the change, even if we make it immediately
>> obvious which is which. The only reasonable solution in that scenario
>> is to bump PGSS_FILE_HEADER, which will cause all existing entries to
>> be invalidated.
>
> You're going to have to help me here, 'cause I don't see how there can
> be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
> unless we're planning to keep PGSS_FILE_HEADER constant while we change
> what the hash value is for a given query, yet that goes against the
> assumptions that were laid out, aiui.

Well, they wouldn't be duplicates if you think that the fact that one
query was executed before some point release and another after ought
to differentiate queries. I do not.

> If there's a change that results in a given query X no longer hashing to
> a value A, we need to change PGSS_FILE_HEADER to invalidate statistics
> which were collected for value A (or else we risk an independent query Y
> hashing to value A and ending up with completely invalid stats..).
> Provided we apply that consistently and don't reuse PGSS_FILE_HEADER
> values along the way, a combination of PGSS_FILE_HEADER and the hash
> value for a given query should be unique over time.
>
> We do need to document that the hash value for a given query may
> change..

By invalidate, I mean that when we go to open the saved file, if the
header doesn't match, the file is considered corrupt, and we simply
log that the file could not be read, before unlinking it. This would
be necessary in the unlikely event of there being some substantive
change in the representation of query trees in a point release. I am
not aware of any precedent for this, though Tom said that there was
one.

Tom: Could you please describe the precedent you had in mind? I would
like to better understand this risk.

I don't want to get too hung up on what we'd do if this problem
actually occurred, because that isn't what this thread is about.
Exposing the hash is a completely unrelated question, except that to
do so might make pg_stat_statements (including its limitations) better
understood. In my opinion, the various log parsing tools have taught
people to think about this in the wrong way - the query string is just
a representation of a query (and an imperfect one at that).

There are other, similar tools that exist in proprietary databases.
They expose a hash value, which is subject to exactly the same caveats
as our own. They explicitly encourage the type of aggregation by
third-party tools that I anticipate will happen with
pg_stat_statements. I want to facilitate this, but I'm confident that
the use of (dbid, userid, querytext) as a "candidate key" by these
tools is going to cause confusion, in subtle ways. By using the hash
value, those tools are exactly as well off as pg_stat_statements is,
which is to say, as well off as possible.

I simply do not understand objections to the proposal. Have I missed something?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Euler Taveira
Date:
On 02-10-2012 10:15, Peter Geoghegan wrote:
> There are other, similar tools that exist in proprietary databases.
> They expose a hash value, which is subject to exactly the same caveats
> as our own. They explicitly encourage the type of aggregation by
> third-party tools that I anticipate will happen with
> pg_stat_statements. I want to facilitate this, but I'm confident that
> the use of (dbid, userid, querytext) as a "candidate key" by these
> tools is going to cause confusion, in subtle ways. By using the hash
> value, those tools are exactly as well off as pg_stat_statements is,
> which is to say, as well off as possible.
> 
It depends on how you will use this information. If it is for rapid
analysis, it is fine to use a hash, because the object internals won't change
(I hope). However, if you want to analyze query statistics for a long
period of time (say a few months) or your environment is distributed, you
can't use the hash. There isn't a perfect solution, but I see both cases being
useful for analysis tools.


--
Euler Taveira de Oliveira - Timbira
http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training



Re: Hash id in pg_stat_statements

From
Magnus Hagander
Date:
<p dir="ltr">P<br /> On Oct 2, 2012 5:04 PM, "Euler Taveira" <<a
href="mailto:euler@timbira.com">euler@timbira.com</a>>wrote:<br /> ><br /> > On 02-10-2012 10:15, Peter
Geogheganwrote:<br /> > > There are other, similar tools that exist in proprietary databases.<br /> > >
Theyexpose a hash value, which is subject to exactly the same caveats<br /> > > as our own. They explicitly
encouragethe type of aggregation by<br /> > > third-party tools that I anticipate will happen with<br /> >
>pg_stat_statements. I want to facilitate this, but I'm confident that<br /> > > the use of (dbid, userid,
querytext)as a "candidate key" by these<br /> > > tools is going to cause confusion, in subtle ways. By using the
hash<br/> > > value, those tools are exactly as well off as pg_stat_statements is,<br /> > > which is to
say,as well off as possible.<br /> > ><br /> > It depends on how you will use this information. If it is for a
rapid<br/> > analysis, it is fine to use a hash because the object internals won't change<br /> > (I hope not).
However,if you want to analyze query statistics for a long<br /> > period of time (say a few months) or your
environmentis distributed, you<br /> > can't use the hash. There isn't a perfect solution but I see both cases
being<br/> > useful for analysis tools.<br /><p dir="ltr">Having the hash available in no way prevents you from
usingthe query string ad your key. Not having the hash certainly prevents you from using it. <p dir="ltr">/Magnus  

Re: Hash id in pg_stat_statements

From
Stephen Frost
Date:
* Peter Geoghegan (peter@2ndquadrant.com) wrote:
> On 1 October 2012 18:05, Stephen Frost <sfrost@snowman.net> wrote:
> > You're going to have to help me here, 'cause I don't see how there can
> > be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
> > unless we're planning to keep PGSS_FILE_HEADER constant while we change
> > what the hash value is for a given query, yet that goes against the
> > assumptions that were laid out, aiui.
>
> Well, they wouldn't be duplicates if you think that the fact that one
> query was executed before some point release and another after ought
> to differentiate queries. I do not.

This would only be if we happened to change what hash was generated for
a given query during such a point release, and I share your feeling
that it ought to be quite rare.  I'm not suggesting we do this for every
point release...

> By invalidate, I mean that when we go to open the saved file, if the
> header doesn't match, the file is considered corrupt, and we simply
> log that the file could not be read, before unlinking it. This would
> be necessary in the unlikely event of there being some substantive
> change in the representation of query trees in a point release. I am
> not aware of any precedent for this, though Tom said that there was
> one.

Right, and that's all I'm trying to address here- how do we provide a
value for a given query which can be relied upon by outside sources,
even in the face of a point release which changes what our internal hash
value for a given query is.

> I don't want to get too hung up on what we'd do if this problem
> actually occurred, because that isn't what this thread is about.

[...]

> I simply do not understand objections to the proposal. Have I missed something?

It was my impression that the concern is the stability of the hash value
and ensuring that tools which operate on it don't mistakenly lump two
different queries into one because they had the same hash value (caused
by a change in our hashing algorithm or input into it over time, eg a
point release).  I was hoping to address that to allow this proposal to
move forward..
Thanks,
    Stephen

Re: Hash id in pg_stat_statements

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Peter Geoghegan (peter@2ndquadrant.com) wrote:
>> I simply do not understand objections to the proposal. Have I missed something?

> It was my impression that the concern is the stability of the hash value
> and ensuring that tools which operate on it don't mistakenly lump two
> different queries into one because they had the same hash value (caused
> by a change in our hashing algorithm or input into it over time, eg a
> point release).  I was hoping to address that to allow this proposal to
> move forward..

I think there are at least two questions that ought to be answered:

1. Why isn't something like md5() on the reported query text an equally
good solution for users who want a query hash?

2. If people are going to accumulate stats on queries over a long period
of time, is a 32-bit hash really good enough for the purpose?  If I'm
doing the math right, the chance of collision is already greater than 1%
at 10000 queries, and rises to about 70% for 100000 queries; see
http://en.wikipedia.org/wiki/Birthday_paradox
We discussed this issue and decided it was okay for pg_stat_statements's
internal hash table, but it's not at all clear to me that it's sensible
to use 32-bit hashes for external accumulation of query stats.
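
(For reference, those figures fall out of the standard birthday-problem
approximation for a 32-bit hash space:

    p(n) ~= 1 - exp(-n*(n-1) / (2 * 2^32))

    p(10000)  ~= 1 - exp(-0.0116) ~= 1.2%
    p(100000) ~= 1 - exp(-1.16)   ~= 69%
)
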
        regards, tom lane



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Why isn't something like md5() on the reported query text an equally
> good solution for users who want a query hash?

Because that does not uniquely identify the entry. The very first
thing that the docs say on search_path is "Qualified names are tedious
to write, and it's often best not to wire a particular schema name
into applications anyway". Presumably, the reason it's best not to
wire schema names into apps is because it might be useful to modify
search_path in a way that dynamically made the same queries in some
application reference what are technically distinct relations. If
anyone does this, and it seems likely that many do for various
reasons, they will be out of luck when using some kind of
pg_stat_statements aggregation.

This was the behaviour that I intended for pg_stat_statements all
along, and I think it's better than a solution that matches query
strings.
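
A contrived illustration of what I mean (assuming the hash were exposed as a
queryid column; the schema and table names are made up):

    SET search_path = tenant_a;
    SELECT count(*) FROM orders;   -- fingerprints tenant_a.orders
    SET search_path = tenant_b;
    SELECT count(*) FROM orders;   -- same text, different relation, separate entry

Grouping snapshots by md5(query) lumps those two together; grouping by the
exposed hash keeps them apart, which is the behaviour I intended.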

> 2. If people are going to accumulate stats on queries over a long period
> of time, is a 32-bit hash really good enough for the purpose?  If I'm
> doing the math right, the chance of collision is already greater than 1%
> at 10000 queries, and rises to about 70% for 100000 queries; see
> http://en.wikipedia.org/wiki/Birthday_paradox
> We discussed this issue and decided it was okay for pg_stat_statements's
> internal hash table, but it's not at all clear to me that it's sensible
> to use 32-bit hashes for external accumulation of query stats.

Well, forgive me for pointing this out, but I did propose that the
hash be a 64-bit value (which would have necessitated adapting
hash_any() to produce 64-bit values), but you rejected the proposal. I
arrived at the same probability for a collision as you did and posted
it to the list, in discussion shortly after the normalisation stuff
was committed.

A more sensible way of assessing the risk of a collision would be to
try and come up with the probability of a collision that someone
actually ends up caring about, which is considerably less than the 1%
for 10,000 entries. I'm not being glib - people are very used to the
idea that aggregating information on query costs is a lossy process.
Prior to 9.2, the only way execution costs could reasonably be
measured at the query granularity on a busy system was to set
log_min_duration_statement to something like 1 second.

I am also unconvinced by the idea that aggregating historical data
(with the same hash value) in a separate application is likely to make
the collision situation appreciably worse. People are going to be
using something like an RRD circular buffer to aggregate the
information, and I can't see anyone caring about detailed information
that is more than a couple of weeks in the past. The point of
aggregation isn't to store more queries, it's to construct time-series
data from snapshots. Besides, do most applications really even have
more than 10,000 distinct queries?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 2 October 2012 17:58, Stephen Frost <sfrost@snowman.net> wrote:
> Right, and that's all I'm trying to address here- how do we provide a
> value for a given query which can be relied upon by outside sources,
> even in the face of a point release which changes what our internal hash
> value for a given query is.

I don't know of a way. Presumably, we'd hope to avoid this, and would
look for alternatives to anything that would necessitate bumping
PGSS_FILE_HEADER, while not going so far as to let pg_stat_statements
contort things in the core system. If I was aware of a case where this
would have come up had pg_stat_statements fingerprinting been around
at the time, perhaps I could give a better answer than that.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Martijn van Oosterhout
Date:
On Tue, Oct 02, 2012 at 12:58:15PM -0400, Stephen Frost wrote:
> > I simply do not understand objections to the proposal. Have I missed something?
>
> It was my impression that the concern is the stability of the hash value
> and ensuring that tools which operate on it don't mistakenly lump two
> different queries into one because they had the same hash value (caused
> by a change in our hashing algorithm or input into it over time, eg a
> point release).  I was hoping to address that to allow this proposal to
> move forward..

That makes no sense though. The moment you talk about "hash" you
consider the possibility of lumping together things that aren't the
same.  Any tools using these hashes must have realised this.

Fortunately, the statistics are better than the birthday paradox. The
chances that the two most important queries in your system end up
having the same hash are minuscule.

As mentioned elsewhere, a system with more than 10,000 different
queries sounds rare to me (once you exclude query parameters, of course).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Hash id in pg_stat_statements

From
Daniel Farina
Date:
On Mon, Oct 1, 2012 at 12:57 AM, Magnus Hagander <magnus@hagander.net> wrote:
> Can we please expose the internal hash id of the statements in
> pg_stat_statements?
>
> I know there were discussions about it earlier, and it wasn't done, with
> the argument that it wouldn't be stable between releases (IIRC). I think we
> can live with that drawback, assuming of course that we document this
> properly.
>
> I've now run into multiple customer installations where it would be
> very useful to have. The usecase is mainly storing snapshots of the
> pg_stat_statements output over time and analyzing those. Weird things
> happen for example when the query text is the same, but the hash is
> different (which can happen for example when a table is dropped and
> recreated). And even without that, in order to do anything useful with
> it, you end up hashing the query text anyway - so using the already
> existing hash would be easier and more useful.

I have a similar problem; however, I am not sure the generated hash
is ideal.  Putting aside the mechanical reasons given in the main branch
of the thread (versioning, shutdown/stats-file issues, and so on), I also
have the feeling that it is not what I want.
Consider the following case:

SELECT * FROM users WHERE id = ?

<this query isn't seen for a while>

SELECT * FROM users WHERE id = ?

In the intervening time, the entry for that hash could still be evicted and
reintroduced, with its statistics silently reset, and that'll befuddle
principled tools.  This is worse than merely less useful, because it
can lead to drastic underestimations that otherwise pass inspection.

Instead, I think it makes sense to assign a number -- arbitrarily, but
uniquely -- to the generation of a new row in pg_stat_statements, and,
on the flip side, whenever a row is retired its number should be
eliminated, practically, for-ever.  This way re-introductions between
two samplings of pg_stat_statements cannot be confused for a
contiguously maintained statistic on a query.
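
A sketch of how a tool might consume such a number, assuming it were exposed
as, say, an entry_id column (the snapshot tables here are hypothetical too):

    -- rows joined on the per-row identity, not on the query text
    SELECT cur.query,
           cur.calls - prev.calls AS calls_delta
    FROM snap_1030 cur
    JOIN snap_1000 prev USING (entry_id);

An entry that was evicted and re-created between the two samplings gets a new
entry_id, so it simply fails to join, rather than silently producing a bogus,
drastically underestimated delta.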

-- 
fdr



Re: Hash id in pg_stat_statements

From
Tom Lane
Date:
Daniel Farina <daniel@heroku.com> writes:
> Instead, I think it makes sense to assign a number -- arbitrarily, but
> uniquely -- to the generation of a new row in pg_stat_statements, and,
> on the flip side, whenever a row is retired its number should be
> eliminated, practically, for-ever.  This way re-introductions between
> two samplings of pg_stat_statements cannot be confused for a
> contiguously maintained statistic on a query.

This argument seems sensible to me.  Is there any use-case where the
proposed counter wouldn't do what people wished to do with an exposed
hash value?
        regards, tom lane



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Instead, I think it makes sense to assign a number -- arbitrarily, but
>> uniquely -- to the generation of a new row in pg_stat_statements, and,
>> on the flip side, whenever a row is retired its number should be
>> eliminated, practically, for-ever.  This way re-introductions between
>> two samplings of pg_stat_statements cannot be confused for a
>> contiguously maintained statistic on a query.
>
> This argument seems sensible to me.  Is there any use-case where the
> proposed counter wouldn't do what people wished to do with an exposed
> hash value?

Yes. The hash could be used to aggregate query execution costs across
entire WAL-based replication clusters. I'm not opposed to Daniel's
suggestion, though.
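
Something along these lines, say (a sketch; queryid is the hypothetical
exposed column, and collecting the rows from each node into one place is the
tool's problem):

    -- pg_stat_statements rows gathered from the primary and every standby
    SELECT queryid,
           min(query)      AS query,
           sum(calls)      AS calls,
           sum(total_time) AS total_time
    FROM pgss_all_nodes
    GROUP BY queryid
    ORDER BY sum(total_time) DESC;

Since the standbys in a WAL-based cluster share the primary's catalogs, the
same query should fingerprint to the same value on every node, which is what
makes that kind of roll-up possible.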

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Daniel Farina
Date:
On Wed, Oct 3, 2012 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Instead, I think it makes sense to assign a number -- arbitrarily, but
>> uniquely -- to the generation of a new row in pg_stat_statements, and,
>> on the flip side, whenever a row is retired its number should be
>> eliminated, practically, for-ever.  This way re-introductions between
>> two samplings of pg_stat_statements cannot be confused for a
>> contiguously maintained statistic on a query.
>
> This argument seems sensible to me.  Is there any use-case where the
> proposed counter wouldn't do what people wished to do with an exposed
> hash value?

Yes: unless schemas are included in the canonicalized query text,
two identical texts can actually mean different things.  This is the
only corner case (besides collision) I can think of.

I also referenced an idea in a similar vein to the counter/arbitrary
number: instead, one can perform a kind of error propagation from
evicted entries in the hash table.  This might avoid having to even
bother saving a counter, which feels rather nice to me.  I have a
sketch implementation (I now know of two stupid bugs in it), and it
comes out very neatly so far.

I got this idea from a paper that I saw implemented once, with
strikingly good effect:
http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Here they have a very specific rendition that is, for a couple of
reasons, not quite what we want, I think.  But the part I found very
interesting was the simple propagation of error that made the
technique possible.  Inspired by this, I gave every hash entry a
maximum-under-estimation error.  When members of the hash table are
evicted, the minimum of the metric gathered plus its already
accumulated under-estimation error are propagated to the new entry.

The general property is that hash table members who are frequently
evicted will accumulate huge amounts of error.  Those that are never
evicted (thanks to constant use) never accumulate any error.

A tool reading the table can take into account the error accumulation
to determine if there was an eviction between two samplings, as well
as bounding the error accrued between two snapshots.  I think there is
a tiny bit of room for some sort of ambiguity in a corner case, but
I'd have to think more about it.

-- 
fdr



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 3 October 2012 19:54, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This argument seems sensible to me.  Is there any use-case where the
>> proposed counter wouldn't do what people wished to do with an exposed
>> hash value?
>
> Yes. The hash could be used to aggregate query execution costs across
> entire WAL-based replication clusters. I'm not opposed to Daniel's
> suggestion, though.

Could we please try and reach a consensus here? If you're still dead
set against exposing the hash value, I think that just following what
Daniel has suggested is a fair compromise.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 3 October 2012 19:54, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This argument seems sensible to me.  Is there any use-case where the
>> proposed counter wouldn't do what people wished to do with an exposed
>> hash value?
>
> Yes. The hash could be used to aggregate query execution costs across
> entire WAL-based replication clusters. I'm not opposed to Daniel's
> suggestion, though.

I failed to mention a more compelling use-case. On systems that use
roles extensively, there will naturally from time to time be a desire
to see things at the query granularity, rather than at the (userid,
queryid) granularity. Exposing the hash value allows the user to
construct an aggregate query that groups by the hash value, rather
than the query string. For all the reasons already mentioned, this is
a better-principled approach than using the query string.
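
Something like this, say (again assuming the hash is exposed as a queryid
column):

    SELECT queryid,
           min(query)      AS query,
           sum(calls)      AS calls,
           sum(total_time) AS total_time
    FROM pg_stat_statements
    GROUP BY queryid
    ORDER BY sum(total_time) DESC
    LIMIT 20;

which rolls the per-(userid, dbid) entries for the same statement up into a
single line per query, without having to trust that the stored query texts
are byte-for-byte identical across roles.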

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Instead, I think it makes sense to assign a number -- arbitrarily, but
>> uniquely -- to the generation of a new row in pg_stat_statements, and,
>> on the flip side, whenever a row is retired its number should be
>> eliminated, practically, for-ever.  This way re-introductions between
>> two samplings of pg_stat_statements cannot be confused for a
>> contiguously maintained statistic on a query.
>
> This argument seems sensible to me.

Daniel: Could you please submit the patch that you were working on
that does this to the next commitfest?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Daniel Farina
Date:
On Mon, Oct 15, 2012 at 7:35 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 3 October 2012 19:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Daniel Farina <daniel@heroku.com> writes:
>>> Instead, I think it makes sense to assign a number -- arbitrarily, but
>>> uniquely -- to the generation of a new row in pg_stat_statements, and,
>>> on the flip side, whenever a row is retired its number should be
>>> eliminated, practically, for-ever.  This way re-introductions between
>>> two samplings of pg_stat_statements cannot be confused for a
>>> contiguously maintained statistic on a query.
>>
>> This argument seems sensible to me.
>
> Daniel: Could you please submit the patch that you were working on
> that does this to the next commitfest?

Yes. Sorry I haven't done that already.  I'll clean it up and send it
out Real Soon Now, thanks for the expression of interest.

-- 
fdr



Re: Hash id in pg_stat_statements

From
Magnus Hagander
Date:
On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote
>> 1. Why isn't something like md5() on the reported query text an equally
>> good solution for users who want a query hash?
>
> Because that does not uniquely identify the entry. The very first
> thing that the docs say on search_path is "Qualified names are tedious
> to write, and it's often best not to wire a particular schema name
> into applications anyway". Presumably, the reason it's best not to
> wire schema names into apps is because it might be useful to modify
> search_path in a way that dynamically made the same queries in some
> application reference what are technically distinct relations. If
> anyone does this, and it seems likely that many do for various
> reasons, they will be out of luck when using some kind of
> pg_stat_statements aggregation.
>
> This was the behaviour that I intended for pg_stat_statements all
> along, and I think it's better than a solution that matches query
> strings.
>
>> 2. If people are going to accumulate stats on queries over a long period
>> of time, is a 32-bit hash really good enough for the purpose?  If I'm
>> doing the math right, the chance of collision is already greater than 1%
>> at 10000 queries, and rises to about 70% for 100000 queries; see
>> http://en.wikipedia.org/wiki/Birthday_paradox
>> We discussed this issue and decided it was okay for pg_stat_statements's
>> internal hash table, but it's not at all clear to me that it's sensible
>> to use 32-bit hashes for external accumulation of query stats.
>
> Well, forgive me for pointing this out, but I did propose that the
> hash be a 64-bit value (which would have necessitated adapting
> hash_any() to produce 64-bit values), but you rejected the proposal. I
> arrived at the same probability for a collision as you did and posted
> it to the list, in discussion shortly after the normalisation stuff
> was committed.

What was the argument for rejecting it? Just that it required
hash_any() to be adapted?

Now that we have a very clear use case where this would help, perhaps
it's time to re-visit this proposal?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: Hash id in pg_stat_statements

From
Peter Geoghegan
Date:
On 15 November 2012 13:10, Magnus Hagander <magnus@hagander.net> wrote:
>> Well, forgive me for pointing this out, but I did propose that the
>> hash be a 64-bit value (which would have necessitated adapting
>> hash_any() to produce 64-bit values), but you rejected the proposal. I
>> arrived at the same probability for a collision as you did and posted
>> it to the list, in discussion shortly after the normalisation stuff
>> was committed.
>
> What was the argument for rejecting it? Just that it required
> hash_any() to be adapted?

I think so, yes. It just wasn't deemed necessary. Note that hash_any()
has a comment that says something like "if you want to adapt this so
that the Datum returned is a 64-bit integer, the way to do that
is...". I followed this advice in a revision of the pg_stat_statements
normalisation patch, and preserved compatibility by using a macro,
which Tom objected to.

> Now that we have a very clear use case where this would help, perhaps
> it's time to re-visit this proposal?

Perhaps. I think that the issue of whether or not a 64-bit integer is
necessary is totally orthogonal to the question of whether or not we
should expose the hash. The need to aggregate historical statistics
just doesn't appreciably alter things here, I feel. The number of
discrete queries that an application will execute in a week just isn't
that different from the number that it will ever execute, I suspect.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Hash id in pg_stat_statements

From
Dimitri Fontaine
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> should expose the hash. The need to aggregate historical statistics
> just doesn't appreciably alter things here, I feel. The number of
> discrete queries that an application will execute in a week just isn't
> that different from the number that it will ever execute, I suspect.

Please don't forget that some people won't write the most effective SQL
from the get-go and will rewrite problematic queries. Some people will
even roll out new features in their applications, with new queries to
implement them. Or new applications on top of the existing database.

So I don't think that the query list is that static. That said, I don't
have any idea at all about the impact of what I'm saying on your
analysis…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr
PostgreSQL: Expertise, Training and Support