Re: Hash id in pg_stat_statements - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Hash id in pg_stat_statements
Date	October 2, 2012 17:16:27
Msg-id	9844.1349198176@sss.pgh.pa.us
In response to	Re: Hash id in pg_stat_statements (Stephen Frost <sfrost@snowman.net>)
Responses	Re: Hash id in pg_stat_statements
List	pgsql-hackers

Stephen Frost <sfrost@snowman.net> writes:
> * Peter Geoghegan (peter@2ndquadrant.com) wrote:
>> I simply do not understand objections to the proposal. Have I missed something?

> It was my impression that the concern is the stability of the hash value
> and ensuring that tools which operate on it don't mistakenly lump two
> different queries into one because they had the same hash value (caused
> by a change in our hashing algorithm or input into it over time, eg a
> point release).  I was hoping to address that to allow this proposal to
> move forward..

I think there are at least two questions that ought to be answered:

1. Why isn't something like md5() on the reported query text an equally
good solution for users who want a query hash?

2. If people are going to accumulate stats on queries over a long period
of time, is a 32-bit hash really good enough for the purpose?  If I'm
doing the math right, the chance of collision is already greater than 1%
at 10000 queries, and rises to about 70% for 100000 queries; see
http://en.wikipedia.org/wiki/Birthday_paradox
We discussed this issue and decided it was okay for pg_stat_statements's
internal hash table, but it's not at all clear to me that it's sensible
to use 32-bit hashes for external accumulation of query stats.
        regards, tom lane
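
The collision figures in point 2 can be checked with the usual birthday-bound approximation, p ~= 1 - exp(-n(n-1) / (2 * 2^bits)). The following is only a rough sketch of that arithmetic, not anything taken from pg_stat_statements itself. The function names collision_probability and md5_query_id are hypothetical, and md5_query_id merely illustrates the idea from point 1 of hashing the reported query text on the client side.

```python
import hashlib
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Uses the birthday-bound approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    which assumes hash values are uniformly distributed.
    """
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


def md5_query_id(query_text: str) -> str:
    """Hypothetical client-side identifier: md5 of the reported query text."""
    return hashlib.md5(query_text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for n in (10_000, 100_000):
        p32 = collision_probability(n, 32)
        p128 = collision_probability(n, 128)
        print(f"{n} queries: 32-bit p={p32:.4f}, 128-bit p={p128:.3e}")
    print(md5_query_id("SELECT * FROM t WHERE id = ?"))
```

Under these assumptions the 32-bit case comes out a little above 1% for 10000 queries and a little under 70% for 100000, in line with the estimate above, while a 128-bit digest such as md5 leaves the collision chance negligible at the same counts.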
