statistics - Mailing list pgsql-hackers

From Tiago Antão
Subject statistics
Date
Msg-id Pine.LNX.4.21.0008232034030.6540-100000@eros.si.fct.unl.pt
Whole thread Raw
List pgsql-hackers
Hi!
 I've made a small (2 hours work) program to make "histograms" on data. Giving a table and column he tries to put in
bucketsspecial values
 
(special means the most/least used values). It tries to be a little smart,
ie, if there are lots of guys with most/least values it will not put them
in buckets.
 Example: bucket size=10
$ ./a.out "dbname=teste"  d_pags uid
Distinct values: 1028
Number of tuples: 6880
On a uniform distribution: 6.692607 tuples/value

# of values with more references
1 - 1
2 - 1
3 - 1
[...]
This means that there is only one value as the most referenced (110 times)
# of values with less references
1 - 253
2 - 153
3 - 109
[..]
This means that there ara 253 values that have the least references (once)
Best case buckets
1 - u805156 (110) 
2 - u1503927 (103) 
3 - u110525 (82) 
4 - u91106009 (78) 
5 - u1106837 (60) 
6 - u1714112 (55) 
7 - u1414335 (53) 
8 - u1105732 (50) 
9 - u302719 (49) 
Worst case buckets
[there are so many guys with only one ref... nothing can be put in
buckets]
Removed values: 9
Removed tuples: 640
Expected tuples for each of unclassified values: 6.123651 tuples/value

If all values have equal prob of being chosen:
Error in normal case: 5.464981
Error in hist case: 4.905326

If my calculations are right:
For 9 of 1028 values (9.5% of relation) there is a very precise idea,
but in general (for a random value being selected) that's not a great
advance. At least for this case ...


Tiago



pgsql-hackers by date:

Previous
From: "Ross J. Reedstrom"
Date:
Subject: Re: when does CREATE VIEW not create a view?
Next
From: Vince Vielhaber
Date:
Subject: How do pronounce PostgreSQL - the final word.