Re: Hash id in pg_stat_statements - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Hash id in pg_stat_statements
Date
Msg-id CABUevEwW0iQs1kY8qfh6h=01AHWGeemEn=Cr_Wc6+PEKf6bS5Q@mail.gmail.com
Whole thread Raw
In response to Re: Hash id in pg_stat_statements  (Peter Geoghegan <peter@2ndquadrant.com>)
Responses Re: Hash id in pg_stat_statements  (Peter Geoghegan <peter@2ndquadrant.com>)
List pgsql-hackers
On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 2 October 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote
>> 1. Why isn't something like md5() on the reported query text an equally
>> good solution for users who want a query hash?
>
> Because that does not uniquely identify the entry. The very first
> thing that the docs say on search_path is "Qualified names are tedious
> to write, and it's often best not to wire a particular schema name
> into applications anyway". Presumably, the reason it's best not to
> wire schema names into apps is because it might be useful to modify
> search_path in a way that dynamically made the same queries in some
> application reference what are technically distinct relations. If
> anyone does this, and it seems likely that many do for various
> reasons, they will be out of luck when using some kind of
> pg_stat_statements aggregation.
>
> This was the behaviour that I intended for pg_stat_statements all
> along, and I think it's better than a solution that matches query
> strings.
>
>> 2. If people are going to accumulate stats on queries over a long period
>> of time, is a 32-bit hash really good enough for the purpose?  If I'm
>> doing the math right, the chance of collision is already greater than 1%
>> at 10000 queries, and rises to about 70% for 100000 queries; see
>> http://en.wikipedia.org/wiki/Birthday_paradox
>> We discussed this issue and decided it was okay for pg_stat_statements's
>> internal hash table, but it's not at all clear to me that it's sensible
>> to use 32-bit hashes for external accumulation of query stats.
>
> Well, forgive me for pointing this out, but I did propose that the
> hash be a 64-bit value (which would have necessitated adopting
> hash_any() to produce 64-bit values), but you rejected the proposal. I
> arrived at the same probability for a collision as you did and posted
> in to the list, in discussion shortly after the normalisation stuff
> was committed.

What was the argument for rejecting it? Just that it required
hash_any() to be adapted?

Now that we have a very clear use case where this would help, perhaps
it's time to re-visit this proposal?

--Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Switching timeline over streaming replication
Next
From: Peter Geoghegan
Date:
Subject: Re: Hash id in pg_stat_statements