Re: Hash id in pg_stat_statements - Mailing list pgsql-hackers

From Daniel Farina
Subject Re: Hash id in pg_stat_statements
Date
Msg-id CAAZKuFYmWEpGZjMMEcSjyB3_y--eCWZFJLo=sABz6cZ3gBtVmQ@mail.gmail.com
Whole thread Raw
In response to Re: Hash id in pg_stat_statements  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Oct 3, 2012 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Instead, I think it makes sense to assign a number -- arbitrarily, but
>> uniquely -- to the generation of a new row in pg_stat_statements, and,
>> on the flip side, whenever a row is retired its number should be
>> eliminated, practically, for-ever.  This way re-introductions between
>> two samplings of pg_stat_statements cannot be confused for a
>> contiguously maintained statistic on a query.
>
> This argument seems sensible to me.  Is there any use-case where the
> proposed counter wouldn't do what people wished to do with an exposed
> hash value?

Yes: unless schemas are included into the canonicalized query text,
two identical texts can actually mean different things.  This is the
only corner case (besides collision) I can think of.

I also referenced an idea in a similar vein to the counter/arbitrary
number: instead, one can perform a kind of error propagation from
evicted entries in the hash table.  This might avoid having to even
bother saving a counter, which feels rather nice to me.  I have a
sketch (I now know of two stupid bugs in this) in implementation and
it comes out very neatly so far.

I got this idea from a paper that I saw implemented once, with
strikingly good effect:
http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Here they have a very specific rendition that is, for a couple of
reasons, not quite what we want, I think.  But the part I found very
interesting was the simple propagation of error that made the
technique possible.  Inspired by this, I gave every hash entry a
maximum-under-estimation error.  When members of the hash table are
evicted, the minimum of the metric gathered plus its already
accumulated under-estimation error are propagated to the new entry.

The general property is that hash table members who are frequently
evicted will accumulate huge amounts of error.  Those that are never
evicted (thanks to constant use) never accumulate any error.

A tool reading the table can take into account the error accumulation
to determine if there was an eviction between two samplings, as well
as bounding the error accrued between two snapshots.  I think there is
a tiny bit of room for some sort of ambiguity in a corner case, but
I'd have to think more about it.

-- 
fdr



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: gistchoose vs. bloat
Next
From: Devrim GÜNDÜZ
Date:
Subject: Re: pg_upgrade does not completely honor --new-port