Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:
> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... I have to admit that I've failed to wrap my brain around exactly
>> why it's correct. The arguments that I've constructed so far seem to
>> point in the direction of applying the opposite correction, which is
>> demonstrably wrong. Perhaps someone whose college statistics class
>> wasn't quite so long ago can explain this satisfactorily?
> I guess that you're referring to the last case, i.e.
> explain analyze select * from tenk1 where thousand between 10 and 10;

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">". That seems weird and I can't
construct a plausible explanation for it. I think it might be a
result of the fact that, given a discrete distribution rather than
a continuous one, the histogram boundary values should be understood
as having some "width" rather than being zero-width points on the
distribution axis. But the arguments I tried to fashion on that
basis led to other rules that didn't actually work.

It's also possible that this logic is in fact wrong and it just happens
to give the right answer anyway for uniformly-distributed cases.
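
One way to probe the "width" idea is a toy model. The sketch below is
not PostgreSQL's selectivity code. The function names
(exact_selectivity, equi_depth_bounds, interpolated_fraction_below) are
hypothetical, the column is assumed to hold the values 0..999 ten times
each (meant to resemble tenk1.thousand), and the boundaries are assumed
to be picked at evenly spaced positions in the sorted data. It compares
the exact selectivity of each operator with a plain linear
interpolation that treats boundaries as zero-width points.

```python
from fractions import Fraction
import operator

# Hypothetical helper names; this is a toy model, not planner code.
OPS = {"<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}


def exact_selectivity(values, op, const):
    """Exact fraction of rows satisfying `value op const`."""
    matches = sum(1 for v in values if OPS[op](v, const))
    return Fraction(matches, len(values))


def equi_depth_bounds(values, nbuckets):
    """Pick nbuckets+1 boundaries at evenly spaced positions in the
    sorted data (an assumed, simplified equi-depth histogram)."""
    s = sorted(values)
    n = len(s)
    return [s[(i * (n - 1)) // nbuckets] for i in range(nbuckets + 1)]


def interpolated_fraction_below(bounds, const):
    """Continuous-style estimate of the fraction of rows below const,
    treating each boundary as a zero-width point."""
    nbuckets = len(bounds) - 1
    if const <= bounds[0]:
        return Fraction(0)
    if const >= bounds[-1]:
        return Fraction(1)
    for i in range(nbuckets):
        lo, hi = bounds[i], bounds[i + 1]
        if lo <= const < hi:
            # Whole buckets below, plus a linear share of this bucket.
            return (Fraction(i) + Fraction(const - lo, hi - lo)) / nbuckets
    raise ValueError("const not covered by histogram")


if __name__ == "__main__":
    # Assumed data: 1000 distinct values, ten rows each.
    values = [v for v in range(1000) for _ in range(10)]
    bounds = equi_depth_bounds(values, 100)
    const = 10
    est = interpolated_fraction_below(bounds, const)
    print("interpolated fraction below:", float(est))
    for op in ("<", "<=", ">", ">="):
        print(op, float(exact_selectivity(values, op, const)))
```

In this toy setup the interpolated value for the constant 10 coincides
with the exact "<=" selectivity, and the exact "<" and "<=" results
differ by one value's worth of rows (1/1000). That is consistent with
the pattern described above, but it does not by itself show that the
rule holds for other distributions.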

			regards, tom lane