Hi!
I've made a small (2 hours work) program to make "histograms" on data. Giving a table and column he tries to put in
bucketsspecial values
(special means the most/least used values). It tries to be a little smart,
ie, if there are lots of guys with most/least values it will not put them
in buckets.
Example: bucket size=10
$ ./a.out "dbname=teste" d_pags uid
Distinct values: 1028
Number of tuples: 6880
On a uniform distribution: 6.692607 tuples/value
# of values with more references
1 - 1
2 - 1
3 - 1
[...]
This means that there is only one value as the most referenced (110 times)
# of values with less references
1 - 253
2 - 153
3 - 109
[..]
This means that there ara 253 values that have the least references (once)
Best case buckets
1 - u805156 (110)
2 - u1503927 (103)
3 - u110525 (82)
4 - u91106009 (78)
5 - u1106837 (60)
6 - u1714112 (55)
7 - u1414335 (53)
8 - u1105732 (50)
9 - u302719 (49)
Worst case buckets
[there are so many guys with only one ref... nothing can be put in
buckets]
Removed values: 9
Removed tuples: 640
Expected tuples for each of unclassified values: 6.123651 tuples/value
If all values have equal prob of being chosen:
Error in normal case: 5.464981
Error in hist case: 4.905326
If my calculations are right:
For 9 of 1028 values (9.5% of relation) there is a very precise idea,
but in general (for a random value being selected) that's not a great
advance. At least for this case ...
Tiago