Thread: best statistic target for boolean columns
Hi all,

Don't you think the best statistics target for a boolean column is
something like 2? Or, more generally, isn't it useless to have a
statistics target greater than the data type's cardinality?

Regards,
Gaetano Mendola
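For reference, the setting under discussion is the per-column statistics
target; a minimal sketch of what Gaetano is proposing, assuming a
hypothetical table t with a boolean column flag:

    ALTER TABLE t ALTER COLUMN flag SET STATISTICS 2;  -- one slot per boolean value
    ANALYZE t;  -- recompute the column's statistics with the new target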
Gaetano,

> Don't you think the best statistics target for a boolean column is
> something like 2? Or, more generally, isn't it useless to have a
> statistics target greater than the data type's cardinality?

It depends, really, on the proportions of the boolean values; if
they're about equal, I certainly wouldn't raise the statistics target
from the default of 10. If, however, they're very disproportionate --
like 2% true and 98% false -- then it may pay to have better statistics
so that the planner doesn't assume 50% hits, which it otherwise might.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
> Gaetano,
>
>> Don't you think the best statistics target for a boolean column is
>> something like 2? Or, more generally, isn't it useless to have a
>> statistics target greater than the data type's cardinality?
>
> It depends, really, on the proportions of the boolean values; if
> they're about equal, I certainly wouldn't raise the statistics target
> from the default of 10. If, however, they're very disproportionate --
> like 2% true and 98% false -- then it may pay to have better statistics
> so that the planner doesn't assume 50% hits, which it otherwise might.

No, actually the stats table keeps the n most common values and their
frequencies (as fractions of the total rows). So really a target of 2
ought to be enough for boolean values. In fact that's all I see in
pg_statistic; I'm assuming there's a full histogram somewhere, but I
don't see it. Where would it be?

However, the target also dictates how large a sample of the table to
take. A target of two implies a very small sample, so the estimates
could be quite far off.

I ran the experiment, and for a table with 2036 false rows out of
204,624 the estimate was 1720. Not bad. But then I did VACUUM FULL
ANALYZE and got an estimate of 688, which isn't so good.

--
greg
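To make the point about pg_statistic concrete, the most-common-values
data Greg describes can be inspected through the friendlier pg_stats
view; a sketch, again assuming the hypothetical table t and column flag:

    -- most_common_vals / most_common_freqs hold the MCV list for the
    -- column; for a boolean there can be at most two entries.
    SELECT most_common_vals, most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE tablename = 't' AND attname = 'flag';
    -- histogram_bounds comes back empty here, for the reason Tom gives
    -- below: the histogram only covers values not already in the MCV list.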
Gregory Stark <gsstark@mit.edu> writes:
> No, actually the stats table keeps the n most common values and their
> frequencies (as fractions of the total rows). So really a target of 2
> ought to be enough for boolean values. In fact that's all I see in
> pg_statistic; I'm assuming there's a full histogram somewhere, but I
> don't see it. Where would it be?

It's not going to be there. The histogram only covers values that are
not in the most-frequent-values list, and therefore it won't exist for a
column that is completely describable by most-frequent-values.

> However, the target also dictates how large a sample of the table to
> take. A target of two implies a very small sample, so the estimates
> could be quite far off.

Right. The real point of the stats target for such columns is that it
determines how many rows to sample, and thereby indirectly implies the
accuracy of the statistics. For a heavily skewed boolean column you'd
want a high target so that the number of occurrences of the infrequent
value would be estimated accurately.

It's also worth noting that the number of rows sampled is driven by the
largest per-column stats target in the table, so reducing the stats
target to 2 for a boolean column will save *zero* effort unless all the
columns in the table are booleans.

			regards, tom lane
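A sketch of the remedy Tom describes for a heavily skewed boolean
column (the target value is illustrative, names hypothetical):

    -- Raise the per-column target so ANALYZE samples more rows and the
    -- frequency of the rare value is pinned down accurately.
    ALTER TABLE t ALTER COLUMN flag SET STATISTICS 100;
    ANALYZE t;
    -- Conversely, lowering one column's target to 2 saves no ANALYZE
    -- effort: the sample size is driven by the largest target among
    -- the table's columns.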
Josh Berkus wrote:
| Gaetano,
|
|> Don't you think the best statistics target for a boolean column is
|> something like 2? Or, more generally, isn't it useless to have a
|> statistics target greater than the data type's cardinality?
|
| It depends, really, on the proportions of the boolean values; if
| they're about equal, I certainly wouldn't raise the statistics target
| from the default of 10. If, however, they're very disproportionate --
| like 2% true and 98% false -- then it may pay to have better statistics
| so that the planner doesn't assume 50% hits, which it otherwise might.

So I had misunderstood how the statistics histogram works. I'm going to
take a look at analyze.c.

Regards,
Gaetano Mendola
Tom Lane wrote:
> Gregory Stark <gsstark@mit.edu> writes:
>
>> No, actually the stats table keeps the n most common values and their
>> frequencies (as fractions of the total rows). So really a target of 2
>> ought to be enough for boolean values. In fact that's all I see in
>> pg_statistic; I'm assuming there's a full histogram somewhere, but I
>> don't see it. Where would it be?
>
> It's not going to be there. The histogram only covers values that are
> not in the most-frequent-values list, and therefore it won't exist for a
> column that is completely describable by most-frequent-values.
>
>> However, the target also dictates how large a sample of the table to
>> take. A target of two implies a very small sample, so the estimates
>> could be quite far off.
>
> Right. The real point of the stats target for such columns is that it
> determines how many rows to sample, and thereby indirectly implies the
> accuracy of the statistics. For a heavily skewed boolean column you'd
> want a high target so that the number of occurrences of the infrequent
> value would be estimated accurately.
>
> It's also worth noting that the number of rows sampled is driven by the
> largest per-column stats target in the table, so reducing the stats
> target to 2 for a boolean column will save *zero* effort unless all the
> columns in the table are booleans.

Thank you all, now it's much clearer to me how this works.

By the way, I've been wondering: during an EXPLAIN ANALYZE, why can't we
use the information about the rows actually returned vs. the rows
estimated?

I'm now reading an article, written by the same author who inspired the
magic "300" in analyze.c, about "Self-tuning Histograms". If this were
implemented, I understand we could get rid of VACUUM ANALYZE for keeping
the statistics up to date. Does anyone have plans to implement it?

After all, the idea is simple: during normal SELECTs, compare the
estimated row counts with the actual row counts, then use this "free"
information to refine the histograms.

Regards,
Gaetano Mendola
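The estimated-vs-actual discrepancy Gaetano wants to exploit is already
visible in EXPLAIN ANALYZE output; using Greg's figures from earlier in
the thread (plan line abbreviated, query hypothetical):

    EXPLAIN ANALYZE SELECT * FROM t WHERE NOT flag;
    --  Seq Scan on t  (cost=... rows=688 ...)
    --                 (actual time=... rows=2036 loops=1)
    -- A self-tuning scheme would feed the actual rows=2036 back into
    -- the stored statistics instead of discarding it after the query.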
On Tue, 2004-09-28 at 08:42, Gaetano Mendola wrote:
> I'm now reading an article, written by the same author who inspired the
> magic "300" in analyze.c, about "Self-tuning Histograms". If this were
> implemented, I understand we could get rid of VACUUM ANALYZE for keeping
> the statistics up to date. Does anyone have plans to implement it?

http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg17477.html

Tom's reply is salient. I still think self-tuning histograms would be
worth looking at for the multi-dimensional case.

-Neil
Neil Conway wrote:
> On Tue, 2004-09-28 at 08:42, Gaetano Mendola wrote:
>
>> I'm now reading an article, written by the same author who inspired the
>> magic "300" in analyze.c, about "Self-tuning Histograms". If this were
>> implemented, I understand we could get rid of VACUUM ANALYZE for keeping
>> the statistics up to date. Does anyone have plans to implement it?
>
> http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg17477.html
>
> Tom's reply is salient. I still think self-tuning histograms would be
> worth looking at for the multi-dimensional case.

I see.

Regards,
Gaetano Mendola