"Joe Conway" <joseph.conway@home.com> writes:
> But in any case, for this type of data assuming a 0..255 range for any
> particular byte is completely appropriate. And in testing, I found that the
> current calculation is reasonably accurate.
Well, it *is* accurate, if and only if that assumption is correct.
Obviously it's correct for random-byte data.
> I think there are other scenarios in which you might want to use bytea where
> the distribution is less random (I'm thinking in terms of any compressible
> binary data like executables or some image types), but I can't think of any
> offhand where ordering of the data is all that meaningful.
Yeah. Compressed data would show a pretty even byte-value distribution
as well, but the real question is what sort of data might be found in a
bytea column for which "x < y" is an interesting comparison. I'm not
sure either.
> On the other hand, I suppose if you wanted to use bytea to store some sort
> of bitmapped data it might be highly skewed, and interesting to select
> distinct ranges from. Given that, it might make sense to leave the
> range estimate as-is.
I don't like the estimate as-is. For textual data it makes some sense
to classify characters into a small number of categories (letters,
digits, other), and with so few categories it's not completely
ridiculous to suppose that the three available strings (the comparison
value and the two bounds) might tell you which categories are present
in a column. For bytea data, there are
no natural categories and thus no justification for extrapolating
byte-value distribution from the info available to scalarltsel. So
I think there's no defensible argument for using anything but 0..255.
(I suppose we could consider adding more info to pg_statistic for these
types of columns, but I'm not eager to do that right at the moment.)
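
To make the fixed 0..255 assumption concrete, here is a minimal sketch of
the idea. It is not the actual backend code: the function names
bytes_to_scalar and fraction_below, and the 8-byte cutoff, are made up
for illustration. Each byte is treated as one base-256 digit, so a byte
string maps to a fraction in [0, 1), and the selectivity of "x < value"
is estimated by linear interpolation between the two bounds.

```c
#include <stddef.h>

/*
 * Hypothetical helper: map a byte string to a fraction in [0, 1),
 * treating each byte as a base-256 digit.  The digit range is fixed at
 * 0..255 and is not inferred from the data.
 */
double
bytes_to_scalar(const unsigned char *s, size_t len)
{
    double  num = 0.0;
    double  denom = 1.0;
    size_t  i;

    /* Assumed cutoff: bytes past the 8th add nothing a double can hold. */
    for (i = 0; i < len && i < 8; i++)
    {
        denom *= 256.0;
        num += s[i] / denom;
    }
    return num;
}

/*
 * Hypothetical helper: estimated fraction of the interval [lo, hi] that
 * lies below value, clamped to [0, 1].
 */
double
fraction_below(double value, double lo, double hi)
{
    if (hi <= lo)
        return 0.5;             /* degenerate interval: no information */
    if (value <= lo)
        return 0.0;
    if (value >= hi)
        return 1.0;
    return (value - lo) / (hi - lo);
}
```

With bounds that scale to 0.0 and 1.0, a comparison value whose first
byte is 0x40 scales to 0.25, so the sketch estimates that about a quarter
of the rows sort below it. That estimate is only as good as the
assumption that byte values are spread evenly over 0..255.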
regards, tom lane