Re: Simple postgresql.conf wizard -- Statistics idea... - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: Simple postgresql.conf wizard -- Statistics idea...
Date
Msg-id D425483C2C5C9F49B5B7A41F89441547010012B5@postal.corporate.connx.com
Whole thread Raw
In response to Re: Simple postgresql.conf wizard  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Simple postgresql.conf wizard -- Statistics idea...
Re: Simple postgresql.conf wizard -- Statistics idea...
List pgsql-hackers
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, November 25, 2008 5:33 PM
> To: Dann Corbit
> Cc: Gregory Stark; Decibel!; Robert Haas; Bruce Momjian; Mark Wong;
> Heikki Linnakangas; Josh Berkus; Greg Smith; pgsql-
> hackers@postgresql.org
> Subject: Re: [HACKERS] Simple postgresql.conf wizard
>
> "Dann Corbit" <DCorbit@connx.com> writes:
> > I also do not believe that there is any value that will be the right
> > answer.  But a table of data might be useful both for people who
want
> to
> > toy with altering the values and also for those who want to set the
> > defaults.  I guess that at one time such a table was generated to
> > produce the initial estimates for default values.
>
> Sir, you credit us too much :-(.  The actual story is that the current
> default of 10 was put in when we first implemented stats histograms,
> replacing code that kept track of only a *single* most common value
> (and not very well, at that).  So it was already a factor of 10 more
> stats than we had experience with keeping, and accordingly
conservatism
> suggested not boosting the default much past that.
>
> So we really don't have any methodically-gathered evidence about the
> effects of different stats settings.  It wouldn't take a lot to
> convince
> us to switch to a different default, I think, but it would be nice to
> have more than none.

I do have a statistics idea/suggestion (possibly useful with some future
PostgreSQL 9.x or something):
It is a simple matter to calculate lots of interesting univarate summary
statistics with a single pass over the data (perhaps during a vacuum
full).
For instance with numerical columns, you can calculate mean, min, max,
standard deviation, skew, kurtosis and things like that with a single
pass over the data.
Here is a C++ template I wrote to do that:
http://cap.connx.com/tournament_software/STATS.HPP
It also uses this template:
http://cap.connx.com/tournament_software/Kahan.Hpp
which is a high-accuracy adder.  These things could easily be rewritten
in C instead of C++.

Now, if you store a few numbers calculated in this way, it can be used
to augment your histogram data when you want to estimate the volume of a
request. So (for instance) if someone asks for a scalar that is ">
value" you can look to see what percentage of the tail will hang out in
that neck of the woods using standard deviation and the mean.

I have another, similar idea (possibly useful someday far off in the
future) that I think may have some merit.  The idea is to create a
"statistical index".  This index is updated whenever data values are
modified in any way.  For scalar/ordinal values such as float, integer,
numeric it would simply store and update a statistics accumulator (a
small vector of a few items holding statistical moments, counts and
sums) for the column of interest.   These indexes would be very small
and inexpensive to {for instance} memory map.

For categorical values (things like 'color' or 'department') we might
store the count for the number of items that correspond to a hash in our
statistical index.  It would give you a crude "count distinct" for any
item -- the only caveat being that more than one item could possibly
have the same hash code (we would also keep a count of null items).  If
the count needed to be exact, we could generate a perfect hash for the
data or store each distinct column value in the categorical index with
its hash.  The size of such an index would depend on the data so that
bit or char='m'/'f' for male/female or 'y'/'n' for yes/no indexes would
contain just two counts and a column that is unique would have one hash
paired with the number one for each row of the table (a regular unique
index would clearly be better in such a case, but such distributions
could easily arise).  The value of an index like this is that it is good
where the normal index types are bad (e.g. it is useless to create a
btree index on a bit column or male/female, yes/no -- things of that
nature but a statistics counter would work nicely -- to give you whole
table measures only of course).  The thing that is odd about these
statistics indexes is that they do not even bother to point back to the
data that they represent -- they are just abstract measures of the
general contents.

It seems to me that this kind of information could be used to improve
the query plans for both categorical values and scalars and also be used
to generate instant answers for some kinds of statistical queries.  If
used only for query planning, the values would not need to be exact, and
could be updated only at vacuum full time or some other convenient time.
The notion behind creation of a stats index would be that we may not
need to maintain this detailed information for every column, but we can
maintain such data for columns that we often filter with in our queries
to get an idea of cardinality for a subset.


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Column reordering in pg_dump
Next
From: ITAGAKI Takahiro
Date:
Subject: Re: [PATCHES] Solve a problem of LC_TIME of windows.