Re: bytea_ops - Mailing list pgsql-patches

From Joe Conway
Subject Re: bytea_ops
Date
Msg-id 008001c12367$38d8f1f0$0705a8c0@jecw2k1
Whole thread Raw
In response to bytea_ops  ("Joe Conway" <joseph.conway@home.com>)
Responses Re: bytea_ops
List pgsql-patches
> selectivity of the range 'bb' to 'cc'.  If we assume the data range is
> 'a'..'z' then we get scalar equivalents of aa = 0, zz = 0.9985, bb =
> 0.03994, cc = 0.079881 leading to a selectivity estimate of 0.03994.
> If we use a data range of 0..255 then we get aa = 0.380386, bb =
> 0.384307, cc = 0.388229, zz = 0.478424 leading to selectivity = 0.00392,
> more than a factor of 10 smaller.)  Depending on how you are using
> bytea, 0..255 might be too large for its data range too.  Thoughts?

The specific case that I am using bytea for is storage of completely random
binary data generated for use as cryptographic keys and account identifiers.
For this application, you generally are only interested in the selectivity
of an exact
match, i.e. "where vl_guid = 'bytea-data'".

But in any case, for this type of data assuming a 0..255 range for any
particular byte is completely appropriate. And in testing, I found that the
current calculation is reasonably accurate. For example, with some randomly
generated data:

vsreg_192=# select count(*) from vs_lkup;
 count
-------
 24950
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid <
'\\300\\300';
NOTICE:  QUERY PLAN:

Seq Scan on vs_lkup  (cost=0.00..727.88 rows=319 width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\300';
 count
-------
   287
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid <
'\\300\\377';
NOTICE:  QUERY PLAN:

Seq Scan on vs_lkup  (cost=0.00..727.88 rows=425 width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\377';
 count
-------
   390
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid <
'\\300\\010';
NOTICE:  QUERY PLAN:

Index Scan using vs_lkup_vl_guid_key on vs_lkup  (cost=0.00..43.80 rows=11
width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\010';
 count
-------
    14
(1 row)

(**Note that this database *only* has data >= \300 in vl_guid**)

I think there are other scenarios in which you might want to use bytea where
the distribution is less random (I'm thinking in terms of any compressible
binary data like executables or some image types), but I can't think of any
offhand where ordering of the data is all that meaningful.

On the other hand, I suppose if you wanted to use bytea to store some sort
of bitmapped data it might be highly skewed, and interesting to select
distinct ranges from. Given that, it might make sense to leave the
range estimate as-is.


> BTW, I think that convert_bytea_datum is probably unnecessary, and
> definitely it's a waste of cycles to palloc a copy of the input values.
> convert_string_datum exists to (a) unify the representations of the
> different datatypes that we consider strings, and (b) apply strxfrm
> if necessary.  Neither of those motivations will ever apply to bytea
> AFAICS.  So you could just as easily pass the given Datums directly to
> convert_bytea_to_scalar and let it work directly on them.

OK -- I almost did just that, but I was unsure of the reason for working
with copies. I'll fix it on the next round.

-- Joe






pgsql-patches by date:

Previous
From: Tom Lane
Date:
Subject: Re: bytea_ops
Next
From: Tom Lane
Date:
Subject: Re: bytea_ops