Re: A varint implementation for PG? - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: A varint implementation for PG?
Date:
Msg-id: 20210804190110.mus75wpe4nsopobs@alap3.anarazel.de
In response to: Re: A varint implementation for PG? (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: A varint implementation for PG?
           Re: A varint implementation for PG?
List: pgsql-hackers
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
>
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > > If baking a new variant integer format now, I think limiting it to 64 bits
> > > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > > can be to change things in the protocol, on disk, etc.
> >
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
>
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

I was thinking a bit about how to encode arbitrary length values for the
cases where that's interesting.

Currently what I proposed for 8 byte unsigned integers is that we encode the
length in unary unset bits, followed by a set bit. As a special case, a
prefix of 8 unset bits indicates a length of 9 bytes; the separator bit
isn't needed at that point.

Extending that to arbitrary lengths obviously makes the unary length
encoding wasteful at some point, and the benefit of having few branches
vanishes. So what I was thinking is that for variable length pieces of data
that are not limited to 8 bytes, we could replace the '8 0 bits' special
case with a new special case: the length in bytes follows as a max-8-byte
varint.

That'd leave us with the following overheads:
- 0 - 127: 0 bytes (i.e. 0 to 7 bits)
- 128 - 2^56 - 1: 1 byte (i.e. 7 bits - 7 bytes)
- 7 bytes - 127 bytes: 2 bytes
- 128 bytes - 16383 bytes: 3 bytes
- 16384 bytes - 2097151 bytes: 4 bytes
- 2097152 bytes - 268435455 bytes: 5 bytes
- 268435456 bytes - 34359738367 bytes: 6 bytes
- ...

The obvious alternative would of course be to just always store the length
prefix separately:
- 0 - 127 bytes: 1 byte
- 128 - 16383 bytes: 2 bytes
- 16384 - 2097151 bytes: 3 bytes
- ...

I do suspect that for a fair number of cases the "0 byte overhead" for very
small values would be worth the increase in overhead later. Particularly
because the decoding for values up to 7 bytes would be cheaper cpu-wise as
well.

Greetings,

Andres Freund
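To make the 8-byte unsigned scheme described above a bit more concrete, here is a rough sketch in C of one way to read it: the total length is given by the number of leading unset bits in the first byte followed by a set separator bit, and an all-zero first byte means the full 8-byte value follows with no separator. The function names, the big-endian layout of the data bits, and the GCC/Clang __builtin_clz* calls are assumptions made for illustration only, not what the actual patch in this thread does.

```c
#include <stdint.h>

/*
 * Illustrative sketch only: encode an unsigned 64-bit value with the total
 * length stored as (len - 1) unset bits followed by a set bit at the start
 * of the first byte. A first byte of all zeroes means 8 data bytes follow
 * (9 bytes total), with no separator bit. Returns the encoded length.
 */
static int
varint_encode_uint64(uint64_t value, uint8_t *buf)
{
	/* number of significant bits; treat 0 as needing one bit */
	int		bits = (value == 0) ? 1 : 64 - __builtin_clzll(value);

	if (bits > 7 * 8)
	{
		/* doesn't fit in 8 bytes once the prefix is included: 0x00 + 8 data bytes */
		buf[0] = 0;
		for (int i = 0; i < 8; i++)
			buf[1 + i] = (uint8_t) (value >> (8 * (7 - i)));
		return 9;
	}
	else
	{
		int		len = (bits + 6) / 7;	/* total bytes, 1..8 */

		/* write the value right-aligned into 'len' bytes, big-endian */
		for (int i = 0; i < len; i++)
			buf[i] = (uint8_t) (value >> (8 * (len - 1 - i)));
		/* (len - 1) leading zero bits, then the separator bit */
		buf[0] |= (uint8_t) (1 << (8 - len));
		return len;
	}
}

/* Decode the format produced above; returns the number of bytes consumed. */
static int
varint_decode_uint64(const uint8_t *buf, uint64_t *value)
{
	uint64_t	v;

	if (buf[0] == 0)
	{
		/* special case: 8 unset bits, full 8-byte value follows */
		v = 0;
		for (int i = 0; i < 8; i++)
			v = (v << 8) | buf[1 + i];
		*value = v;
		return 9;
	}
	else
	{
		/* leading zero bits in the first byte give len - 1 */
		int		len = (__builtin_clz((unsigned int) buf[0]) - 24) + 1;

		v = buf[0] & (0xff >> len);		/* strip the prefix bits */
		for (int i = 1; i < len; i++)
			v = (v << 8) | buf[i];
		*value = v;
		return len;
	}
}
```

Under this reading, values up to 127 take exactly one byte (0x80 | value), values up to 2^56 - 1 take at most 8 bytes, and anything larger takes 9; the first byte alone tells the decoder the full length, which is where the "few branches" benefit comes from.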