Re: bytea_ops - Mailing list pgsql-patches
From: Joe Conway
Subject: Re: bytea_ops
Msg-id: 008001c12367$38d8f1f0$0705a8c0@jecw2k1
In response to: bytea_ops ("Joe Conway" <joseph.conway@home.com>)
Responses: Re: bytea_ops
List: pgsql-patches
> selectivity of the range 'bb' to 'cc'.  If we assume the data range is
> 'a'..'z' then we get scalar equivalents of aa = 0, zz = 0.9985, bb =
> 0.03994, cc = 0.079881 leading to a selectivity estimate of 0.03994.
> If we use a data range of 0..255 then we get aa = 0.380386, bb =
> 0.384307, cc = 0.388229, zz = 0.478424 leading to selectivity = 0.00392,
> more than a factor of 10 smaller.)  Depending on how you are using
> bytea, 0..255 might be too large for its data range too.  Thoughts?

The specific case I am using bytea for is storage of completely random
binary data generated for use as cryptographic keys and account
identifiers. For this application, you are generally interested only in
the selectivity of an exact match, i.e. "where vl_guid = 'bytea-data'".
But in any case, for this type of data, assuming a 0..255 range for any
particular byte is entirely appropriate. And in testing I found the
current calculation to be reasonably accurate. For example, with some
randomly generated data:

vsreg_192=# select count(*) from vs_lkup;
 count
-------
 24950
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid < '\\300\\300';
NOTICE:  QUERY PLAN:

Seq Scan on vs_lkup  (cost=0.00..727.88 rows=319 width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\300';
 count
-------
   287
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid < '\\300\\377';
NOTICE:  QUERY PLAN:

Seq Scan on vs_lkup  (cost=0.00..727.88 rows=425 width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\377';
 count
-------
   390
(1 row)

vsreg_192=# explain select vl_guid from vs_lkup where vl_guid < '\\300\\010';
NOTICE:  QUERY PLAN:

Index Scan using vs_lkup_vl_guid_key on vs_lkup  (cost=0.00..43.80 rows=11 width=24)

EXPLAIN
vsreg_192=# select count(*) from vs_lkup where vl_guid < '\\300\\010';
 count
-------
    14
(1 row)

(**Note that this database *only* has data >= \300 in vl_guid**)

I think there are other scenarios in
which you might want to use bytea where the distribution is less random
(I'm thinking of compressible binary data such as executables or some
image types), but I can't think of any offhand where ordering of the
data is all that meaningful. On the other hand, if you wanted to use
bytea to store some sort of bitmapped data, it might be highly skewed
and interesting to select distinct ranges from. Given that, it might
make sense to leave the range estimate as-is.

> BTW, I think that convert_bytea_datum is probably unnecessary, and
> definitely it's a waste of cycles to palloc a copy of the input values.
> convert_string_datum exists to (a) unify the representations of the
> different datatypes that we consider strings, and (b) apply strxfrm
> if necessary.  Neither of those motivations will ever apply to bytea
> AFAICS.  So you could just as easily pass the given Datums directly to
> convert_bytea_to_scalar and let it work directly on them.

OK -- I almost did just that, but I was unsure of the reason for working
with copies. I'll fix it on the next round.

-- Joe
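[Editor's note: the scalar-conversion arithmetic quoted at the top of this message can be reproduced with a short Python sketch. This is a simplified model of the idea, not PostgreSQL's actual code (the planner's convert_string_to_scalar/convert_bytea_to_scalar also strip common prefixes and work from histogram bounds), and the function name below is made up for illustration: each byte is treated as a digit in base (hi - lo + 1) after the assumed low bound is subtracted.]

```python
def bytes_to_scalar(s: bytes, lo: int, hi: int) -> float:
    """Map a byte string into [0, 1), treating each byte as a digit
    in base (hi - lo + 1) of a fraction, given an assumed data range
    lo..hi for individual bytes."""
    base = hi - lo + 1
    value, denom = 0.0, 1.0
    for c in s:          # iterating over bytes yields ints in Python 3
        denom *= base
        value += (c - lo) / denom
    return value

# Assumed range 'a'..'z' (0x61..0x7A): matches the quoted figures
# bb -> ~0.03994, cc -> ~0.079881, zz -> ~0.9985
bb_az = bytes_to_scalar(b"bb", 0x61, 0x7A)
cc_az = bytes_to_scalar(b"cc", 0x61, 0x7A)

# Assumed range 0..255: bb -> ~0.384307, cc -> ~0.388229
bb_full = bytes_to_scalar(b"bb", 0, 255)
cc_full = bytes_to_scalar(b"cc", 0, 255)

# The range selectivity estimate is the difference of the endpoints:
# ~0.03994 under 'a'..'z', but only ~0.00392 under 0..255.
print(cc_az - bb_az, cc_full - bb_full)
```

This reproduces the more-than-a-factor-of-10 gap Tom describes: the wider the assumed byte range, the smaller the scalar distance between two nearby values, and hence the smaller the estimated selectivity of a range predicate between them.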