On 8/4/21 7:21 PM, Tomas Vondra wrote:
> On 8/5/21 1:05 AM, Andres Freund wrote:
>
>>
>>> The first one seems quite efficient in how it encodes the length
>>> into very
>>> few bits (which matters especially for small values). It's designed for
>>> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary
>>> lengths
>>> fairly easily, I think:
>>
>>> Look at the first byte, and
>>>
>>> 0 - 243 - encoded as is
>>> 244 - 1 byte
>>> 245 - 2 bytes
>>> 246 - 3 bytes
>>> 247 - 4 bytes
>>> 248 - 5 bytes
>>> 249 - 6 bytes
>>> 250 - 7 bytes
>>> 251 - 8 bytes
>>> 252 - next 1 byte is length
>>> 253 - next 2 bytes are length
>>> 254 - next 3 bytes are length
>>> 255 - next 4 bytes are length
>>
>>> If we want to support longer lengths, we'd have to reserve an extra
>>> value
>>> (which reduces the number of values that require a single byte).
>>
>> I think that's not a bad scheme. I think it may end up being a bit more
>> expensive to decode because you need more branches instead of using
>> find-first-set type instructions.
>>
>
> I don't think it requires many branches, because you can essentially do
>
> if (byte[0] <= 243)
> length = 0
> else if (byte[0] <= 251)
> length = byte[0] - 243
> else
> {
> length_bytes = byte[0] - 251;
> ... read length_bytes, decode length
> }
>
> but I haven't tried implementing it and maybe my intuition is wrong.
>
> Or maybe it'd be a good scheme for on-disk format, but poor for memory.
>
>
>
This seems like quite an elegant scheme. Certainly worth trying out. I
find it hard to believe that more than 4 length bytes would be needed
(although that reminds me of the famous and possibly apocryphal quote
"640K ought to be enough for anybody.")
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com