Re: bytea_ops - Mailing list pgsql-patches

From Tom Lane
Subject Re: bytea_ops
Msg-id 21190.997637483@sss.pgh.pa.us
In response to Re: bytea_ops  ("Joe Conway" <joseph.conway@home.com>)
List pgsql-patches
"Joe Conway" <joseph.conway@home.com> writes:
>> The biggie is that you missed adding support for bytea to scalarltsel,
>> which puts a severe crimp on the optimizer's ability to make any
>> intelligent decisions about using your index.

> Hopefully done correctly ;-)

I'm inclined to think that convert_bytea_to_scalar should just always
assume that the appropriate base is 256, i.e., all possible byte values
can appear in the data.  The logic that convert_string_to_scalar uses
to guess at a suitable base depends strongly on the assumption that
ASCII text is much more likely than any other kind of data.  That
doesn't seem like the right assumption to make for bytea, I'd think.
You've removed the more blatant ASCII dependencies from the code,
but I still wonder whether it makes any sense to assume that the byte
values seen in the given strings should be used as predictors of the
overall distribution of byte values in the column.
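
To make the "always base 256" idea concrete, here is a minimal sketch. It is not the patch's code and not what is in the PostgreSQL source; the function name bytes_to_scalar_base256 and the 10-byte cap are assumptions made for illustration. It reads a byte string as a base-256 fraction in [0, 1), with no guessing about which byte values are likely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: map a byte string to a scalar in [0, 1) by
 * treating it as a base-256 fraction.  Every byte value 0..255 is
 * assumed possible, so no range is inferred from the data.
 */
double
bytes_to_scalar_base256(const unsigned char *data, size_t len)
{
    double      num = 0.0;
    double      denom = 256.0;
    size_t      i;

    assert(data != NULL || len == 0);

    /*
     * A double carries about 53 bits of mantissa, so bytes beyond the
     * first several cannot change the result.  The cap of 10 is an
     * arbitrary choice for this sketch.
     */
    if (len > 10)
        len = 10;

    for (i = 0; i < len; i++)
    {
        num += (double) data[i] / denom;
        denom *= 256.0;
    }
    return num;
}
```

With this mapping the single byte 0x80 comes out as exactly 0.5, and a longer string that shares a prefix with a shorter one never sorts below it.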

On the other hand, the reason that convert_string_to_scalar makes all
those difficult-to-justify assumptions is that using a large base leads
to overly optimistic selectivity estimates.  (Example: suppose the
histogram bounds are 'aa' and 'zz', and we are trying to estimate the
selectivity of the range 'bb' to 'cc'.  If we assume the data range is
'a'..'z' then we get scalar equivalents of aa = 0, zz = 0.9985, bb =
0.03994, cc = 0.079881 leading to a selectivity estimate of 0.03994.
If we use a data range of 0..255 then we get aa = 0.380386, bb =
0.384307, cc = 0.388229, zz = 0.478424 leading to selectivity = 0.00392,
more than a factor of 10 smaller.)  Depending on how you are using
bytea, 0..255 might be too large for its data range too.  Thoughts?
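
The arithmetic in that example can be checked with a short sketch. It is a simplified stand-in for the conversion, not the actual convert_string_to_scalar code: the name string_to_scalar and the simple clamping to [rangelo, rangehi] are assumptions for illustration. Each character contributes (ch - rangelo) / base^position, where base = rangehi - rangelo + 1.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch: convert a string to a scalar fraction, assuming
 * every character lies in rangelo..rangehi.  Characters outside the
 * range are clamped to the nearest bound.
 */
double
string_to_scalar(const char *s, int rangelo, int rangehi)
{
    double      base;
    double      num = 0.0;
    double      denom;
    size_t      i;
    size_t      len = strlen(s);

    assert(rangehi >= rangelo);

    base = (double) (rangehi - rangelo + 1);
    denom = base;

    for (i = 0; i < len; i++)
    {
        int         ch = (unsigned char) s[i];

        if (ch < rangelo)
            ch = rangelo;
        if (ch > rangehi)
            ch = rangehi;

        num += (double) (ch - rangelo) / denom;
        denom *= base;
    }
    return num;
}
```

Under these assumptions, string_to_scalar("cc", 'a', 'z') - string_to_scalar("bb", 'a', 'z') is 27/676, about 0.03994, while the same difference with the range 0..255 is 257/65536, about 0.00392. Those are the two figures quoted in the example, and their ratio is a little over 10.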

BTW, I think that convert_bytea_datum is probably unnecessary, and
definitely it's a waste of cycles to palloc a copy of the input values.
convert_string_datum exists to (a) unify the representations of the
different datatypes that we consider strings, and (b) apply strxfrm
if necessary.  Neither of those motivations will ever apply to bytea
AFAICS.  So you could just as easily pass the given Datums directly to
convert_bytea_to_scalar and let it work directly on them.

            regards, tom lane
