Re: Simple postgresql.conf wizard -- Statistics idea... - Mailing list pgsql-hackers
From:           Dann Corbit
Subject:        Re: Simple postgresql.conf wizard -- Statistics idea...
Date:
Msg-id:         D425483C2C5C9F49B5B7A41F89441547010012B5@postal.corporate.connx.com
In response to: Re: Simple postgresql.conf wizard (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Simple postgresql.conf wizard -- Statistics idea...
                Re: Simple postgresql.conf wizard -- Statistics idea...
List:           pgsql-hackers
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, November 25, 2008 5:33 PM
> To: Dann Corbit
> Cc: Gregory Stark; Decibel!; Robert Haas; Bruce Momjian; Mark Wong;
> Heikki Linnakangas; Josh Berkus; Greg Smith; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Simple postgresql.conf wizard
>
> "Dann Corbit" <DCorbit@connx.com> writes:
> > I also do not believe that there is any value that will be the right
> > answer.  But a table of data might be useful both for people who want
> > to toy with altering the values and also for those who want to set
> > the defaults.  I guess that at one time such a table was generated to
> > produce the initial estimates for default values.
>
> Sir, you credit us too much :-(.  The actual story is that the current
> default of 10 was put in when we first implemented stats histograms,
> replacing code that kept track of only a *single* most common value
> (and not very well, at that).  So it was already a factor of 10 more
> stats than we had experience with keeping, and accordingly conservatism
> suggested not boosting the default much past that.
>
> So we really don't have any methodically-gathered evidence about the
> effects of different stats settings.  It wouldn't take a lot to
> convince us to switch to a different default, I think, but it would be
> nice to have more than none.

I do have a statistics idea/suggestion (possibly useful with some future
PostgreSQL 9.x or something):

It is a simple matter to calculate lots of interesting univariate summary
statistics with a single pass over the data (perhaps during a vacuum
full). For instance, with numerical columns you can calculate mean, min,
max, standard deviation, skewness, kurtosis, and things like that in a
single pass. Here is a C++ template I wrote to do that:

http://cap.connx.com/tournament_software/STATS.HPP

It also uses this template, which is a high-accuracy adder:

http://cap.connx.com/tournament_software/Kahan.Hpp

These things could easily be rewritten in C instead of C++.

Now, if you store the few numbers calculated in this way, they can be
used to augment your histogram data when you want to estimate the volume
of a request. So (for instance) if someone asks for a scalar that is
"> value", you can use the mean and standard deviation to estimate what
fraction of the distribution's tail lies beyond that value.
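As a rough illustration of the single-pass accumulator (a hypothetical
sketch, not the STATS.HPP/Kahan.Hpp code linked above; the Stats and
fraction_above names are invented here), something like the following
keeps Kahan-compensated power sums and answers the "> value" question
under a normal-distribution assumption:

// Hypothetical sketch only -- not the linked STATS.HPP/Kahan.Hpp code.
#include <cmath>
#include <cstdio>
#include <limits>

// Kahan-compensated adder: keeps a running correction term so that
// small addends are not lost against a large running sum.
struct KahanSum {
    double sum = 0.0, carry = 0.0;
    void add(double x) {
        double y = x - carry;   // apply the correction from last time
        double t = sum + y;     // low-order bits of y may be lost here
        carry = (t - sum) - y;  // recover exactly what was lost
        sum = t;
    }
};

// One-pass accumulator: count, min, max, and the first four power sums.
// Skewness and kurtosis can be derived from s3 and s4 (not shown).
struct Stats {
    long long n = 0;
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
    KahanSum s1, s2, s3, s4;    // sums of x, x^2, x^3, x^4

    void add(double x) {
        ++n;
        if (x < lo) lo = x;
        if (x > hi) hi = x;
        s1.add(x); s2.add(x * x); s3.add(x * x * x); s4.add(x * x * x * x);
    }
    double mean() const { return s1.sum / n; }
    double stddev() const {     // population standard deviation
        double m = mean();
        return std::sqrt(s2.sum / n - m * m);
    }
    // Estimated fraction of values above v, assuming the column is
    // roughly normal: the upper tail 1 - Phi((v - mean) / stddev).
    double fraction_above(double v) const {
        double z = (v - mean()) / stddev();
        return 0.5 * std::erfc(z / std::sqrt(2.0));
    }
};

int main() {
    Stats st;
    for (int i = 1; i <= 1000; ++i) st.add((double)i);
    // The planner-style question: what share of rows satisfy "col > 900"?
    std::printf("mean=%.1f sd=%.1f est(col > 900)=%.3f\n",
                st.mean(), st.stddev(), st.fraction_above(900.0));
    return 0;
}

For a uniform column of 1..1000, fraction_above(900.0) reports about
0.083 against a true selectivity of 0.100 -- crude, but far better than a
blind default, which is the point of keeping the moments around.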
I have another, similar idea (possibly useful someday far off in the
future) that I think may have some merit. The idea is to create a
"statistical index". This index would be updated whenever data values are
modified in any way. For scalar/ordinal values such as float, integer,
and numeric, it would simply store and update a statistics accumulator (a
small vector of a few items holding statistical moments, counts, and
sums) for the column of interest. These indexes would be very small and
inexpensive to (for instance) memory map.

For categorical values (things like 'color' or 'department') we might
store, in our statistical index, the count of items that correspond to
each hash value. That would give you a crude "count distinct" for any
item -- the only caveat being that more than one item could have the same
hash code (we would also keep a count of null items). If the count needed
to be exact, we could generate a perfect hash for the data, or store each
distinct column value in the categorical index along with its hash.

The size of such an index would depend on the data: a bit column, or a
char column holding 'm'/'f' for male/female or 'y'/'n' for yes/no, would
contain just two counts, while a column that is unique would have one
hash paired with a count of one for each row of the table (a regular
unique index would clearly be better in that case, but such distributions
could easily arise). The value of an index like this is that it is good
exactly where the normal index types are bad: it is useless to create a
btree index on a bit column, or on male/female or yes/no columns, but a
statistics counter would work nicely there -- giving you whole-table
measures only, of course. The odd thing about these statistical indexes
is that they do not even point back to the data they describe; they are
just abstract measures of the general contents.

It seems to me that this kind of information could be used to improve
query plans for both categorical values and scalars, and also to give
instant answers to some kinds of statistical queries. If used only for
query planning, the values would not need to be exact, and could be
updated only at vacuum full time or some other convenient moment. The
notion behind creating a stats index is that we may not need to maintain
this detailed information for every column, but we can maintain it for
the columns we often filter on in our queries, to get an idea of the
cardinality of a subset.
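To make the categorical case concrete, here is a hypothetical sketch of
such a hash-count index (the StatIndex name, its members, and the
selectivity() helper are all invented for illustration; real server code
would handle arbitrary datum types):

// Hypothetical sketch of a categorical "statistical index": one counter
// per distinct hash value, plus a null count. Count-distinct is only
// approximate because two values may share a hash.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>

struct StatIndex {
    std::unordered_map<std::uint64_t, std::uint64_t> counts; // hash -> count
    std::uint64_t nulls = 0;
    std::uint64_t rows = 0;

    void insert(const std::string *value) {   // nullptr stands in for NULL
        ++rows;
        if (!value) { ++nulls; return; }
        ++counts[std::hash<std::string>{}(*value)];
    }
    // Crude "count distinct": the number of distinct hashes seen.
    std::size_t distinct() const { return counts.size(); }
    // Estimated selectivity of "col = value", for the planner.
    double selectivity(const std::string &value) const {
        auto it = counts.find(std::hash<std::string>{}(value));
        return rows ? (it == counts.end() ? 0.0
                                          : (double)it->second / rows)
                    : 0.0;
    }
};

int main() {
    StatIndex idx;
    std::string m = "m", f = "f";
    for (int i = 0; i < 900; ++i) idx.insert(&m);
    for (int i = 0; i < 100; ++i) idx.insert(&f);
    idx.insert(nullptr);    // one NULL row
    std::printf("distinct=%zu sel(col = 'f')=%.3f nulls=%llu\n",
                idx.distinct(), idx.selectivity(f),
                (unsigned long long)idx.nulls);
    return 0;
}

Note that nothing here points back to a heap tuple: the whole structure
for an 'm'/'f' column is just two counters, which is why it stays cheap
even where a btree would be useless.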