Thread: A varint implementation for PG?
Hi,

I have several times, most recently for the record format in the undo patchset, wished for a fast variable width integer implementation for postgres. Using very narrow integers, for space efficiency, solves the space usage problem, but leads to extensibility / generality problems. Other cases where I wanted a variable width integer encoding are, non-exhaustively: variable-width TIDs, on-disk representation for a smaller SQL level integer, WAL format, wire protocol, ...

I spent a few days experimenting with different variable length encodings for unsigned and signed integers. My goal is/was to find an encoding and a prototype that has reasonably low space overhead and that is very quick to encode/decode.

Probably the most common encoding is often called LEB128. It basically uses 7 bits in each byte for data, and one bit as a "continuation" bit. While not bad space-wise, it's not great for decoding (but also encoding) speed. The problem is a) that there's a branch for each byte, which will result in poorly predictable code, b) the 7 bit base needs to be changed to an 8 bit base (not as expensive as the branching issue, but quite noticeably).

I think a better encoding is something similar to UTF-8. That is, encoding the byte length in unary in the leading byte, with the data following that prefix (plus the separator ending the prefix). That allows avoiding per-byte branching during decoding. In contrast to UTF-8, in my prototype I decided to encode the prefix in unary as the number of leading 0 bits - the main advantage is that it makes it possible to use instructions like "count leading zeroes" to determine the length (otherwise one needs to invert the data first).

To encode negative numbers, I chose to use what's often called "zigzag" encoding. One cannot just naively use the above encoding for signed integers, as otherwise two's complement negative numbers would always be of the maximum length (due to the leading 1 bit(s)). Instead the to-be-encoded number is multiplied by two (i.e. <<1), and the sign bit is stored at the lowest bit; negative numbers are additionally stored with all bits inverted. The only disadvantage of that encoding that I am aware of is that encoded signed varints cannot just be memcmp'd (which they could with e.g. a single sign bit).

Alternatively one could store the sign directly after the "separator bit" (the one ending the unary length indicator). But then there's the problem that the number has leading data after the indicator. It's also a bit harder to write the code for that encoding. I wrote the code for that first, and I think it's very likely worth going for the simpler approach.

In my benchmarks encoding + decoding a varint in this way costs approximately 23 cycles on my Intel Skylake CPU. I think pipelining is hiding a bit of the latency, but not too much. As the benchmark currently is stupid and just encodes all 32bit integers, it's overly friendly to the branch predictor, however. Even with those caveats, I think that's a pretty good result. Other encodings were more expensive. And I think there's definitely some room for optimization left.

I've pushed a repo containing my initial implementation of the above to https://github.com/anarazel/varint_experiment and attached a tarball containing it. The header file contains a more detailed description of the format. So far I've only implemented uint64_t and int64_t encoding/decoding functions. It'd definitely make sense to implement a 4 byte variant, and potentially also one supporting larger lengths.
I think it'd make sense to have a variant supporting a unary length indicator of up to 64 bits (i.e. 64 bytes of data, potentially minus one bit). If data lengths longer than that are required for a use case, it probably is better to either a) use the max-representable 8 byte integer as an indicator that the length is stored or b) sacrifice another bit to represent whether the integer is the data itself or the length. Not sure if this is worthwhile.

Do others see use in this? If so, does the encoding outlined above sound reasonable? I'm using a variant of this for a proposal for a different undo record format, but I really hope this would be interesting for others too.

Greetings,

Andres Freund
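For illustration, a minimal C sketch of the two building blocks described in the message above - the zigzag mapping and a "count leading zeroes" based length computation - might look as follows. The function names are hypothetical and this is not the code from the linked repository; it assumes a GCC/Clang-style __builtin_clzll().

#include <stdint.h>

/* zigzag: move the sign to bit 0; negative values end up with bits inverted */
static inline uint64_t
zigzag_encode64(int64_t v)
{
    return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

static inline int64_t
zigzag_decode64(uint64_t u)
{
    return (int64_t) (u >> 1) ^ -(int64_t) (u & 1);
}

/*
 * Encoded size under the unary-length-prefix scheme: an n-byte encoding
 * spends n bits on the prefix (n - 1 zero bits plus the separator), leaving
 * 7 * n data bits; 9 bytes cover the full 64 bit range.
 */
static inline int
varint_size64(uint64_t u)
{
    int databits = 64 - __builtin_clzll(u | 1);   /* significant bits, >= 1 */

    return databits <= 56 ? (databits + 6) / 7 : 9;
}

For example, zigzag_encode64(-2) yields 3, which varint_size64() then reports as fitting into a single encoded byte.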
On Tue, 10 Dec 2019 at 09:51, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I have several times, most recently for the record format in the undo
> patchset, wished for a fast variable width integer implementation for
> postgres. Using very narrow integers, for space efficiency, solves the
> space usage problem, but leads to extensibility / generality problems.
Yes. I've wanted flexible but efficiently packed integers quite a bit too, especially when working with wire protocols.
Am I stabbing completely in the dark when wondering if this might be a step towards a way to lift the size limit on VARLENA Datums like bytea?
There are obvious practical concerns with doing so, given that our protocol offers no handle-based lazy fetching for big VARLENA values, but that too needs a way to represent sizes sensibly and flexibly.
> Even with those caveats, I think that's a pretty good result. Other
> encodings were more expensive. And I think there's definitely some room
> for optimization left.
I don't feel at all qualified to question your analysis of the appropriate representation. But your explanation certainly makes a lot of sense to me as someone approaching the topic mostly fresh - I've done a bit with BCD but not much else.
I assume we'd be paying a price in padding and alignment in most cases, and probably more memory copying, but these representations would likely be appearing mostly in places where other costs are overwhelmingly greater like network or disk I/O.
> If data lengths longer than that are required for a use case
If baking a new variant integer format now, I think limiting it to 64 bits is probably a mistake given how long-lived PostgreSQL is, and how hard it can be to change things in the protocol, on disk, etc.
> it probably is better to either a) use the max-representable 8 byte integer
> as an indicator that the length is stored or b) sacrifice another bit to
> represent whether the integer is the data itself or the length.
I'd be inclined to suspect that (b) is likely worth doing. If nothing else because not being able to represent the full range of a 64-bit integer in the variant type is potentially going to be a seriously annoying hassle at points where we're interacting with places that could use the full width. We'd then have the potential for variant integers of > 2^64 but at least that's wholly under our control.
I also routinely underestimate how truly huge a 64-bit integer really is. But even now 8 petabytes isn't as inconceivable as it used to be....
It mostly depends on how often you expect you'd be coming up on the boundaries where the extra bit would push you up a variant size.
> Do others see use in this?
Yes. Very, very much yes.
I'd be quick to want to expose it to SQL too.
Hi,

On 2019-12-13 13:31:55 +0800, Craig Ringer wrote:
> Am I stabbing completely in the dark when wondering if this might be a step
> towards a way to lift the size limit on VARLENA Datums like bytea?

It could be - but I think it'd be a pretty small piece of it. But yes, I have mused about that.

> > Even with those caveats, I think that's a pretty good result. Other
> > encodings were more expensive. And I think there's definitely some room
> > for optimization left.
>
> I don't feel at all qualified to question your analysis of the appropriate
> representation. But your explanation certainly makes a lot of sense to me as
> someone approaching the topic mostly fresh - I've done a bit with BCD but
> not much else.
>
> I assume we'd be paying a price in padding and alignment in most cases, and
> probably more memory copying, but these representations would likely be
> appearing mostly in places where other costs are overwhelmingly greater
> like network or disk I/O.

I don't really see where padding/alignment costs come into play here?

> > If data lengths longer than that are required for a use case
>
> If baking a new variant integer format now, I think limiting it to 64 bits
> is probably a mistake given how long-lived PostgreSQL is, and how hard it
> can be to change things in the protocol, on disk, etc.

I don't think it's ever going to be sensible to transport 64bit quanta of data. Also, uh, it'd be larger than the data a postgres instance could really contain, given LSNs are 64 bit.

> > it probably is better to either a) use the max-representable 8 byte integer
> > as an indicator that the length is stored or b) sacrifice another bit to
> > represent whether the integer is the data itself or the length.
>
> I'd be inclined to suspect that (b) is likely worth doing. If nothing else
> because not being able to represent the full range of a 64-bit integer in
> the variant type is potentially going to be a seriously annoying hassle at
> points where we're interacting with places that could use the full width.
> We'd then have the potential for variant integers of > 2^64 but at least
> that's wholly under our control.

I'm very very staunchly against doing either of these for the varints used widely. Throwing away even a bit is quite painful, as it e.g. reduces the range representable in a single byte from 0 - 127 / -64 - 63 to 0 - 63 / -32 - 31 - without ever being useful, given what kind of things varints are commonly going to describe. There's e.g. simply no practical use of describing a single WAL record length that's bigger than 63 bit can represent.

I *can* see a separate varint type, probably sharing some code, that supports storing arbitrarily large numbers. But using that everywhere would be pointless.

> I'd be quick to want to expose it to SQL too.

It'll be a bit problematic to deal with all the casting necessary, and with the likely resulting overload resolution issues. I'm wondering whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option that encodes int2/4/8 as varints when inside a tuple, but otherwise just let it be a normal integer.

Greetings,

Andres Freund
[ resurrecting this 2-year-old thread ]

On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > If baking a new variant integer format now, I think limiting it to 64 bits
> > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > can be to change things in the protocol, on disk, etc.
>
> I don't think it's ever going to be sensible to transport 64bit quanta
> of data. Also, uh, it'd be larger than the data a postgres instance
> could really contain, given LSNs are 64 bit.

We already use 128-bit integers in some places in the source code, and it seems more likely than not that use of 128-bit integer data types will continue to expand over time. If we want to represent 64 bit integers in a variable number of bits today, it does seem pretty likely that some day we will want to do the same thing with 128 bit integers, and maybe eventually larger. And I think that all of that is true even though exhausting a 64-bit LSN space looks difficult today and will likely remain difficult for the foreseeable future.

So to me, the right thing to do here probably depends a bit on the use case. If we are talking about using this for strictly internal purposes - e.g. cramming stuff into an internal data structure into which users have no direct visibility - then I think it's probably best to pick a representation that corresponds exactly to the C data type that we're trying to store. If we're starting out with an int64 and we want to byte-squeeze it, the representation you've chosen here seems like just the thing. And it's easy to imagine why we might also want to have similar transformations for uint64, int32, and uint32, and maybe eventually int128 and uint128. And if the details of the format are a little different in each case that's fine; for these sorts of use cases we're always going to unpack into the same data type that we packed originally, so it's not important to have exact compatibility across differences in signedness or bit count.

However, I suspect that the whole approach should be completely revised for a user-visible data type. On the one hand, there's no telling how large a value some user will want to represent, so limiting ourselves to 64 bits does seem shortsighted. And on the other hand, if we've got a varlena, we already know the length, so it seems like we shouldn't also encode the length in the value. Maybe there's a more efficient way, but the first thing that occurs to me is to just discard high order bytes that are all zeroes or all ones until the high order bit of the next byte doesn't match and plonk the remaining bytes into the varlena. To decompress, just sign-extend out to the target length. Really, this kind of representation can be extended to represent arbitrarily large integers, even bigger than what we can currently do with numeric, which is already crazy huge, and it seems to have some advantage in that every payload byte contains exactly 8 data bits, so we don't need to shift or mask while encoding and decoding.

Now, we could think of introducing a new format for variable-length datums, essentially making this a new typlen rather than a new kind of varlena. That might be worth it, because if you are storing a lot of values that are small enough that this format could represent them in 3 bytes or less, which I think would be everything up to +/- 2^20, you could save a pretty significant amount of space even if your table was laid out to avoid alignment padding.
However, there would be some distributed overhead to this, because all the places that have special handling for typlen = -1 and typlen = -2 would need to grow new cases for this. I'm not sure how much of a problem that is, really, but I don't think we can go nuts with adding new typlen values.

> It'll be a bit problematic to deal with all the casting necessary, and
> with the likely resulting overload resolution issues. I'm wondering
> whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
> that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> let it be a normal integer.

I don't see how a STORAGE option could work. We tend to treat those as hints, rather than critical data, which wouldn't work here. I think there are a number of problems that crop up, but one of them is the same thing we hit with the custom TOAST compression stuff. If you need to make a value of some row type out of a HeapTuple, you're not going to know which settings were used to create that heap tuple, and you're certainly not going to know that when you go to deform that tuple. The only thing you're going to have is the tuple descriptor, which AFAICS means that the representation needs to be a property of the type, not where the value is stored. Maybe you have a clever idea I'm not seeing.

As for the casting and overloading issues, that's tripped up quite a few people now when trying to add new numeric data types - unsigned integers being a case that has come up a few times now, I think. I don't think it has to be a blocker. I think the solution is probably to accept that using unsigned or variable-width data types will require inserting more casts than would be required for integer datatypes and numeric. That may not thrill everybody, but it may still be better than deciding we're never ever ever adding any more data types for storing numbers than we have today.

--
Robert Haas
EDB: http://www.enterprisedb.com
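A minimal sketch of the byte-trimming idea described in the message above (hypothetical helper names, not from any patch; this version always keeps at least one byte, though the value zero could in principle be stored with no payload at all):

#include <stdint.h>
#include <string.h>

/* strip leading bytes that are pure sign extension; returns payload length */
static int
trim_int64(int64_t value, uint8_t *out)
{
    uint8_t bytes[8];
    int     nsig = 8;

    /* lay the value out most-significant byte first */
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t) ((uint64_t) value >> (8 * (7 - i)));

    /* drop 0x00/0xFF bytes while the next byte's high bit matches that sign */
    while (nsig > 1 &&
           ((bytes[8 - nsig] == 0x00 && (bytes[9 - nsig] & 0x80) == 0) ||
            (bytes[8 - nsig] == 0xFF && (bytes[9 - nsig] & 0x80) != 0)))
        nsig--;

    memcpy(out, bytes + (8 - nsig), nsig);
    return nsig;
}

/* "decompress": sign-extend the stored bytes back out to 64 bits */
static int64_t
untrim_int64(const uint8_t *in, int len)
{
    uint64_t value = (len > 0 && (in[0] & 0x80)) ? ~UINT64_C(0) : 0;

    for (int i = 0; i < len; i++)
        value = (value << 8) | in[i];
    return (int64_t) value;
}

With that, 1000000 packs into 3 payload bytes (0x0F 0x42 0x40) and -1 into a single 0xFF byte; every payload byte carries exactly 8 data bits, as described above.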
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
>
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

Yea, my answer here was intended to be about areas where we "bake in", to quote Craig, the varint format, like varlena tags or protocol messages. For those a 64bit length tag seems entirely sufficient.

I think e.g. adding an SQL level variable width integer type would potentially come to a different outcome. But even for things like that I suspect that more often than not you'd be better off with a variable length length value, rather than encoding the whole "huge" value as a single variable width integer.

> So to me, the right thing to do here probably depends a bit on the use
> case. If we are talking about using this for strictly internal
> purposes - e.g. cramming stuff into an internal data structure into
> which users have no direct visibility - then I think it's probably
> best to pick a representation that corresponds exactly to the C data
> type that we're trying to store. If we're starting out with an int64
> and we want to byte-squeeze it, the representation you've chosen here
> seems like just the thing. And it's easy to imagine why we might also
> want to have similar transformations for uint64, int32, and uint32,
> and maybe eventually int128 and uint128. And if the details of the
> format are a little different in each case that's fine; for these
> sorts of use cases we're always going to unpack into the same data
> type that we packed originally, so it's not important to have exact
> compatibility across differences in signedness or bit count.

Agreed.

> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value.

If we're talking varlenas, then I think using embedded varints only really makes sense if they're "sub-components" of that varlena, not the entire value...

> Now, we could think of introducing a new format for variable-length
> datums, essentially making this a new typlen rather than a new kind of
> varlena. That might be worth it, because if you are storing a lot of
> values that are small enough that this format could represent them in
> 3 bytes or less, which I think would be everything up to +/- 2^20, you
> could save a pretty significant amount of space even if your table was
> laid out to avoid alignment padding.
> However, there would be some
> distributed overhead to this, because all the places that have special
> handling for typlen = -1 and typlen = -2 would need to grow new cases
> for this. I'm not sure how much of a problem that is, really, but I
> don't think we can go nuts with adding new typlen values.

Yes - the branches for the different typlens already are quite visible in profiles... It might be that we could compensate for that without too much difficulty (e.g. by having a field in a tupdesc indicating what kind of types are in use in a tuple to be formed/deformed, and dispatching to different form/deform routines based on that), but it's not obviously a win.

> > It'll be a bit problematic to deal with all the casting necessary, and
> > with the likely resulting overload resolution issues. I'm wondering
> > whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
> > that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> > let it be a normal integer.
>
> I don't see how a STORAGE option could work. We tend to treat those as
> hints, rather than critical data, which wouldn't work here.

I wasn't thinking that this would be something that could be changed without a table rewrite - so maybe STORAGE would be the wrong place to put it, given how it's currently used. OTOH, I don't think it'd be that big a problem to have a rewrite for some, but not all STORAGE options...

> I think there are a number of problems that crop up, but one of them is the
> same thing we hit with the custom TOAST compression stuff. If you need to
> make a value of some row type out of a HeapTuple, you're not going to know
> which settings were used to create that heap tuple, and you're certainly not
> going to know that when you go to deform that tuple. The only thing you're
> going to have is the tuple descriptor, which AFAICS means that the
> representation needs to be a property of the type, not where the value is
> stored. Maybe you have a clever idea I'm not seeing.

All the information needed for deforming fields is a property of pg_attribute, not pg_type (because the type could have been dropped after the column was dropped, but we still need to skip over the column). So if one accepts needing rewrites for changing the data encoding, I don't think there'd be a huge issue here.

I think the compression stuff is a bit different because you want to prevent compressed values of one type wandering into another table with a different storage type. That problem wouldn't really exist for an SQL level varint as I was imagining them - they'd always be deformed into the "normal" int2/4/8 Datum representation outside of a HeapTuple. We obviously don't want to do that for varlenas, because they need to be fetched from out-of-line, which is several orders of magnitude more expensive than decoding a varint.

I am now wondering if what we're talking about here would best be thought of not as a variable width integer type, but a variable-width encoding for all pass-by-value types.

Leaving on-disk compatibility aside (:)), ISTM that we by default could use the following heuristic to decide how to encode pass-by-value types: If it's a leading fixed-width NOT NULL type, store it in fixed-length encoding. Otherwise use a variable-length encoding.

Greetings,

Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value. Maybe there's a
> more efficient way, but the first thing that occurs to me is to just
> discard high order bytes that are all zeroes or all ones until the
> high order bit of the next byte doesn't match and plonk the remaining
> bytes into the varlena. To decompress, just sign-extend out to the
> target length. Really, this kind of representation can be extended to
> represent arbitrarily large integers, even bigger than what we can
> currently do with numeric, which is already crazy huge, and it seems
> to have some advantage in that every payload byte contains exactly 8
> data bits, so we don't need to shift or mask while encoding and
> decoding.

+1. I think this, together with our existing rules for varlena headers, would address the issue quite nicely. Any sanely-sized integer would require only a one-byte header, so the minimum on-disk size is 2 bytes (with no alignment padding required). I don't buy that there's enough need to justify inventing a new typlen code, since even if you did it wouldn't improve things all that much compared to this design.

(Oh ... actually the minimum on-disk size is one byte, since value zero would require no payload bytes.)

regards, tom lane
On Tue, Aug 3, 2021 at 3:32 PM Andres Freund <andres@anarazel.de> wrote:
> I am now wondering if what we're talking about here would best be thought of
> not as a variable width integer type, but a variable-width encoding for all
> pass-by-value types.
>
> Leaving on-disk compatibility aside (:)), ISTM that we by default could use
> the following heuristic to decide how to encode pass-by-value types: If it's a
> leading fixed-width NOT NULL type, store it in fixed-length
> encoding. Otherwise use a variable-length encoding.

This is pretty integer-centric, though. If your pass-by-value type is storing timestamps, for example, they're not likely to be especially close to zero. Since a 64-bit address is pretty big, perhaps they're still close enough to zero that this will work out to a win, but I don't know, that seems a bit cheesy. I grant that it could work out to a win -- pass-by-value data types whose distribution is very different from what's typical for integers, or for that matter columns full of integers that all happen to be toward the extreme values the data type can store, are probably not that common. I just don't really like making such assumptions on a system-wide basis (as opposed to a per-datatype basis where it's easier to reason about the consequences).

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-08-04 09:31:25 -0400, Robert Haas wrote:
> This is pretty integer-centric, though. If your pass-by-value type is
> storing timestamps, for example, they're not likely to be especially
> close to zero. Since a 64-bit address is pretty big, perhaps they're
> still close enough to zero that this will work out to a win, but I
> don't know, that seems a bit cheesy.

Yea, that's fair. The really bad™ example probably is negative numbers - which wouldn't be easy to do something about in a datatype agnostic way.

> I grant that it could work out to a win -- pass-by-value data types whose
> distribution is very different from what's typical for integers, or for that
> matter columns full of integers that all happen to be toward the extreme
> values the data type can store, are probably not that common.

It'd work out as a wash for common timestamps:

./varint_test -u 681413261095983
processing unsigned
unsigned: 681413261095983
input bytes: 00 02 6b bd e3 5f 74 2f
8 output bytes: 01 02 6b bd e3 5f 74 2f
decoded: 681413261095983

I don't think there's many workloads where plain integers would skew extreme enough for it to work out to a loss often enough to matter. But:

> I just don't really like making such assumptions on a system-wide basis (as
> opposed to a per-datatype basis where it's easier to reason about the
> consequences).

I'd not at all be opposed to datatypes having influence over the on-disk encoding. I was just musing about a default heuristic that could make sense. I do think you'd want something that chooses the encoding for one pg_attribute's values based on preceding columns.

Greetings,

Andres Freund
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
>
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > > If baking a new variant integer format now, I think limiting it to 64 bits
> > > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > > can be to change things in the protocol, on disk, etc.
> >
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
>
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

I was thinking a bit about how to encode arbitrary length values for the cases where that's interesting.

Currently what I proposed for 8 byte unsigned integers is that we encode the length in unary unset bits, followed by a set bit. As a special case, a prefix of 8 unset bits indicates a length of 9 bytes, without needing the separator bit.

Extending that to arbitrary lengths obviously at some point makes the encoding in unary wasteful, and the benefit of few branches vanishes. So what I was thinking is that for variable length pieces of data that are not limited to 8 bytes, we could replace the '8 0 bits' special case with a new special case: The length in bytes follows as a max-8-byte varint.

That'd leave us with the following overheads:
- 0 - 127: 0 bytes (i.e. 0 to 7 bits)
- 128 - 2^56 - 1: 1 byte (i.e. 7 bits - 7 bytes)
- 7 bytes - 127 bytes: 2 bytes
- 128 bytes - 16383 bytes: 3 bytes
- 16384 bytes - 2097151 bytes: 4 bytes
- 2097152 bytes - 268435455 bytes: 5 bytes
- 268435456 bytes - 34359738367 bytes: 6 bytes
- ...

The obvious alternative would of course be to just always store the length prefix separately:
- 0 - 127 bytes: 1 byte
- 128 - 16383 bytes: 2 bytes
- 16384 - 2097151 bytes: 3 bytes
- ...

I do suspect that for a fair number of cases the "0 byte overhead" for very small values would be worth the increase in overhead later. Particularly because the decoding for values up to 7 bytes would be cheaper cpu-wise as well.

Greetings,

Andres Freund
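A small sketch of the framing just described, to make the overhead table above easy to check (hypothetical helper names, GCC/Clang __builtin_clzll assumed; not from any patch):

#include <stdint.h>

/* bytes needed to encode v under the unary-prefix varint format (1..9) */
static int
varint_size(uint64_t v)
{
    int bits = 64 - __builtin_clzll(v | 1);

    return bits <= 56 ? (bits + 6) / 7 : 9;
}

/* non-payload bytes for a datum carrying data_bits bits of actual data */
static int
framing_overhead(uint64_t data_bits)
{
    if (data_bits <= 7)
        return 0;               /* single byte, no extra overhead */
    if (data_bits <= 56)
        return 1;               /* plain varint: unary length prefix */
    /* 0x00 marker byte, then the byte length itself stored as a varint */
    return 1 + varint_size((data_bits + 7) / 8);
}

For instance, framing_overhead(8 * 127) = 2, framing_overhead(8 * 128) = 3 and framing_overhead(8 * 16384) = 4, matching the table above.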
On Wed, Aug 4, 2021 at 3:01 PM Andres Freund <andres@anarazel.de> wrote:
> Extending that to arbitrary lengths obviously at some point makes the encoding
> in unary wasteful, and the benefit of few branches vanishes. So what I was
> thinking is that for variable length pieces of data that are not limited to 8
> bytes, we could replace the '8 0 bits' special case with a new special case:
> The length in bytes follows as a max-8-byte varint.

But what if I have a machine with more than 16 exabytes of RAM and I want to use all of its memory to store one really big integer?

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2021-08-04 15:37:36 -0400, Robert Haas wrote:
> On Wed, Aug 4, 2021 at 3:01 PM Andres Freund <andres@anarazel.de> wrote:
> > Extending that to arbitrary lengths obviously at some point makes the encoding
> > in unary wasteful, and the benefit of few branches vanishes. So what I was
> > thinking is that for variable length pieces of data that are not limited to 8
> > bytes, we could replace the '8 0 bits' special case with a new special case:
> > The length in bytes follows as a max-8-byte varint.
>
> But what if I have a machine with more than 16 exabytes of RAM and I
> want to use all of its memory to store one really big integer?

Then the embedded 8 byte length value would just have to do the same thing recursively to store that huge length header :)
On Wed, Aug 4, 2021 at 3:46 PM Andres Freund <andres@anarazel.de> wrote:
> > But what if I have a machine with more than 16 exabytes of RAM and I
> > want to use all of its memory to store one really big integer?
>
> Then the embedded 8 byte length value would just have to do the same thing
> recursively to store that huge length header :)

Well, yes. But more seriously, my point is that I can't imagine why we would need an object with a length bounded by 2^64. I mean I suppose there's no harm in looking to the future, but that's *really big*.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 8/4/21 9:01 PM, Andres Freund wrote:
> Hi,
>
> On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
>> [ resurrecting this 2-year-old thread ]
>>
>> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
>>>> If baking a new variant integer format now, I think limiting it to 64 bits
>>>> is probably a mistake given how long-lived PostgreSQL is, and how hard it
>>>> can be to change things in the protocol, on disk, etc.
>>>
>>> I don't think it's ever going to be sensible to transport 64bit quanta
>>> of data. Also, uh, it'd be larger than the data a postgres instance
>>> could really contain, given LSNs are 64 bit.
>>
>> We already use 128-bit integers in some places in the source code, and
>> it seems more likely than not that use of 128-bit integer data types
>> will continue to expand over time. If we want to represent 64 bit
>> integers in a variable number of bits today, it does seem pretty
>> likely that some day we will want to do the same thing with 128 bit
>> integers, and maybe eventually larger. And I think that all of that is
>> true even though exhausting a 64-bit LSN space looks difficult today
>> and will likely remain difficult for the foreseeable future.
>
> I was thinking a bit about how to encode arbitrary length values for the cases
> where that's interesting.
>
> Currently what I proposed for 8 byte unsigned integers is that we encode the
> length in unary unset bits, followed by a set bit. As a special case, a prefix
> of 8 unset bits indicates a length of 9 bytes, without needing the separator
> bit.

How is that better than the two varint flavors that are already out there, i.e. the bitcoin [1] and protocol buffers [2]?

The first one seems quite efficient in how it encodes the length into very few bits (which matters especially for small values). It's designed for integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths fairly easily, I think:

Look at the first byte, and

0 - 243 - encoded as is
244 - 1 byte
245 - 2 bytes
246 - 3 bytes
247 - 4 bytes
248 - 5 bytes
249 - 6 bytes
250 - 7 bytes
251 - 8 bytes
252 - next 1 byte is length
253 - next 2 bytes are length
254 - next 3 bytes are length
255 - next 4 bytes are length

If we want to support longer lengths, we'd have to reserve an extra value (which reduces the number of values that require a single byte).

The [2] is a bit more problematic, as it's tailored for very short values (essentially wasting 1/8 of bits) and you have to parse the whole value to determine the length.

[1] https://wiki.bitcoinsv.io/index.php/VarInt
[2] https://developers.google.com/protocol-buffers/docs/encoding

> Extending that to arbitrary lengths obviously at some point makes the encoding
> in unary wasteful, and the benefit of few branches vanishes. So what I was
> thinking is that for variable length pieces of data that are not limited to 8
> bytes, we could replace the '8 0 bits' special case with a new special case:
> The length in bytes follows as a max-8-byte varint.
>
> That'd leave us with the following overheads:
> - 0 - 127: 0 bytes (i.e. 0 to 7 bits)
> - 128 - 2^56 - 1: 1 byte (i.e. 7 bits - 7 bytes)
> - 7 bytes - 127 bytes: 2 bytes
> - 128 bytes - 16383 bytes: 3 bytes
> - 16384 bytes - 2097151 bytes: 4 bytes
> - 2097152 bytes - 268435455 bytes: 5 bytes
> - 268435456 bytes - 34359738367 bytes: 6 bytes
> - ...
>
> The obvious alternative would of course be to just always store the length
> prefix separately:
> - 0 - 127 bytes: 1 byte
> - 128 - 16383 bytes: 2 bytes
> - 16384 - 2097151 bytes: 3 bytes
> - ...
>
> I do suspect that for a fair number of cases the "0 byte overhead" for very
> small values would be worth the increase in overhead later. Particularly
> because the decoding for values up to 7 bytes would be cheaper cpu-wise as
> well.

Yeah. Especially if the long values can be compressed, which probably applies to most real-world data sets. IMHO the efficiency for short values is the more important thing.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
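To make the tag-byte scheme above concrete, a hypothetical encode-side sketch for the integer cases might look like this (tags 0 - 251 only; the 252 - 255 long-length tags are omitted, and the function name and little-endian payload order are assumptions, not part of the cited formats):

#include <stdint.h>

/* returns the number of bytes written to out (1..9) */
static int
tagvarint_encode(uint64_t v, uint8_t *out)
{
    int nbytes;

    if (v <= 243)
    {
        out[0] = (uint8_t) v;   /* small values live directly in the tag */
        return 1;
    }

    /* how many whole bytes does the value itself need? */
    nbytes = (64 - __builtin_clzll(v) + 7) / 8;

    out[0] = (uint8_t) (243 + nbytes);          /* 244..251: payload length */
    for (int i = 0; i < nbytes; i++)
        out[1 + i] = (uint8_t) (v >> (8 * i));  /* little-endian payload */
    return 1 + nbytes;
}

Decoding is the mirror image; it is essentially the three-branch sketch Tomas gives a bit further down the thread.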
Hi,

On 2021-08-04 23:44:10 +0200, Tomas Vondra wrote:
> How is that better than the two varint flavors that are already out there,
> i.e. the bitcoin [1] and protocol buffers [2]?

The protobuf one is *terrible* for CPU efficiency. You need to go through each byte, do masking and shifting for each byte and then have a conditional branch. That's bad from the amount of instructions you need to execute, and *really* bad for the branch predictor.

> The first one seems quite efficient in how it encodes the length into very
> few bits (which matters especially for small values). It's designed for
> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
> fairly easily, I think:
>
> Look at the first byte, and
>
> 0 - 243 - encoded as is
> 244 - 1 byte
> 245 - 2 bytes
> 246 - 3 bytes
> 247 - 4 bytes
> 248 - 5 bytes
> 249 - 6 bytes
> 250 - 7 bytes
> 251 - 8 bytes
> 252 - next 1 byte is length
> 253 - next 2 bytes are length
> 254 - next 3 bytes are length
> 255 - next 4 bytes are length
>
> If we want to support longer lengths, we'd have to reserve an extra value
> (which reduces the number of values that require a single byte).

I think that's not a bad scheme. I think it may end up being a bit more expensive to decode because you need more branches instead of using find-first-set type instructions.

Greetings,

Andres Freund
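For reference, a minimal LEB128 / protocol-buffers-style decoder, just to show the per-byte mask, shift and branch pattern being criticized above (a generic sketch, not code from any particular library, and with no validation of malformed input):

#include <stdint.h>

static uint64_t
leb128_decode(const uint8_t *in, int *consumed)
{
    uint64_t v = 0;
    int      shift = 0;
    int      i = 0;

    for (;;)
    {
        uint8_t b = in[i++];

        v |= (uint64_t) (b & 0x7f) << shift;    /* mask out 7 data bits */
        if ((b & 0x80) == 0)                    /* continuation bit clear? */
            break;                              /* <- one branch per byte */
        shift += 7;
    }

    *consumed = i;
    return v;
}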
On 8/5/21 1:05 AM, Andres Freund wrote:
> Hi,
>
> On 2021-08-04 23:44:10 +0200, Tomas Vondra wrote:
>> How is that better than the two varint flavors that are already out there,
>> i.e. the bitcoin [1] and protocol buffers [2]?
>
> The protobuf one is *terrible* for CPU efficiency. You need to go through each
> byte, do masking and shifting for each byte and then have a conditional
> branch. That's bad from the amount of instructions you need to execute,
> and *really* bad for the branch predictor.

Yeah, probably true - particularly for longer values. No argument here.

>> The first one seems quite efficient in how it encodes the length into very
>> few bits (which matters especially for small values). It's designed for
>> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
>> fairly easily, I think:
>
>> Look at the first byte, and
>>
>> 0 - 243 - encoded as is
>> 244 - 1 byte
>> 245 - 2 bytes
>> 246 - 3 bytes
>> 247 - 4 bytes
>> 248 - 5 bytes
>> 249 - 6 bytes
>> 250 - 7 bytes
>> 251 - 8 bytes
>> 252 - next 1 byte is length
>> 253 - next 2 bytes are length
>> 254 - next 3 bytes are length
>> 255 - next 4 bytes are length
>
>> If we want to support longer lengths, we'd have to reserve an extra value
>> (which reduces the number of values that require a single byte).
>
> I think that's not a bad scheme. I think it may end up being a bit more
> expensive to decode because you need more branches instead of using
> find-first-set type instructions.

I don't think it requires many branches, because you can essentially do

if (byte[0] <= 243)
    length = 0
else if (byte[0] <= 251)
    length = byte[0] - 243
else
{
    length_bytes = byte[0] - 251;
    ... read length_bytes, decode length
}

but I haven't tried implementing it and maybe my intuition is wrong.

Or maybe it'd be a good scheme for on-disk format, but poor for memory.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/4/21 7:21 PM, Tomas Vondra wrote:
> On 8/5/21 1:05 AM, Andres Freund wrote:
>>
>>> The first one seems quite efficient in how it encodes the length into very
>>> few bits (which matters especially for small values). It's designed for
>>> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
>>> fairly easily, I think:
>>
>>> Look at the first byte, and
>>>
>>> 0 - 243 - encoded as is
>>> 244 - 1 byte
>>> 245 - 2 bytes
>>> 246 - 3 bytes
>>> 247 - 4 bytes
>>> 248 - 5 bytes
>>> 249 - 6 bytes
>>> 250 - 7 bytes
>>> 251 - 8 bytes
>>> 252 - next 1 byte is length
>>> 253 - next 2 bytes are length
>>> 254 - next 3 bytes are length
>>> 255 - next 4 bytes are length
>>
>>> If we want to support longer lengths, we'd have to reserve an extra value
>>> (which reduces the number of values that require a single byte).
>>
>> I think that's not a bad scheme. I think it may end up being a bit more
>> expensive to decode because you need more branches instead of using
>> find-first-set type instructions.
>>
>
> I don't think it requires many branches, because you can essentially do
>
> if (byte[0] <= 243)
>     length = 0
> else if (byte[0] <= 251)
>     length = byte[0] - 243
> else
> {
>     length_bytes = byte[0] - 251;
>     ... read length_bytes, decode length
> }
>
> but I haven't tried implementing it and maybe my intuition is wrong.
>
> Or maybe it'd be a good scheme for on-disk format, but poor for memory.

This seems like quite an elegant scheme. Certainly worth trying out. I find it hard to believe that more than 4 length bytes would be needed (although that reminds me of the famous and possibly apocryphal quote "640K ought to be enough for anybody.")

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andres asked me off-list if I could take another look at this. So here's a bit of review:

- The header comment at the top of the file gives some examples of how the encoding works, and then basically says, oh wait, there's also a sign bit at the end, so all those examples are actually wrong. It's really this other thing. Perhaps it's possible to reword things to avoid that.

- The XXX comment in the file header says "variants" where it probably means "varints". A later comment mentions "lenght" instead of "length".

- In pg_varint_encode_uint64, where it says "XXX: I'm sure there's a neater way to do this," I'm not sure exactly what your parameters for neater are, but you could avoid the conditional if you just did:

bits_of_output_data = bits_of_input_data + bytes_of_input_data;
bytes_of_output_data = (bits_of_output_data + BITS_PER_BYTE - 1) / BITS_PER_BYTE;

I think the comment implies that you thought of that and discarded it on performance grounds, but I'm not quite sure that I'm right about that, so I mention it here. I also thought of another approach that doesn't require computing bytes_of_input_data:

bytes_of_output_data = bits_of_input_data / 7 + 1;

The intuition here is that every byte you add to the output gives you room for 7 additional data bits, because you gain 8 for the new byte but also have to consume 1 bit from the first byte, for a net gain of 7. This formula gives the wrong answer when bits_of_input_data is 63 or 64, but you don't need to use it for values greater than PG_VARINT_UINT64_MAX_8BYTE_VAL, so that doesn't matter.

- It's a bit surprising that the length argument to memcpy() is a constant in the 8-bytes-or-less case. It should be fine, because the output buffer must be big enough to hold at least 9 more bytes, so all that can happen is we write unnecessary bytes that the caller can later overwrite, or not. But it would be worth a comment, I think. In particular, we should note that you should always have enough allocated space in the output buffer for the maximum width of 9 bytes even if you know the number you're actually encoding is small. You do mention this in the decoding function, but not on the encoding side.

- The FIXME comment for pg_varint_decode_uint64 no verb.

- ret <<= BITS_PER_BYTE * (BITS_PER_BYTE - bytes) is awfully hard to reason about. My first thought was that it had to be wrong: if we are shifting left during encoding, how can we also be shifting left during decoding? Actually I think I see, now, how it works on a little-Endian machine. Logically, the encoded bytes are the least-significant bytes. We copy them to the least-significant bytes of "ret," but they're in the wrong order, with the most significant byte of the encoded representation in the least significant byte of "ret". By shifting left, we move all the bytes to the other "end" of "ret", and then the byte swap puts them back where they were, but now in the right order. Along the way, the garbage bits we're supposed to be ignoring get thrown away and replaced with zeroes. But I still don't see how it works on a big-Endian machine. On such a machine, we copy the encoded bytes, which are still the least significant bytes of the original value, to the most significant bytes of "ret". The byte swap isn't going to do anything here, so in this case it feels like the shift needs to be in the other direction -- a shift right. But maybe not, because originally I thought it should be a shift right in both cases, and now I don't think that's right.
- Re "mask out length indicator bits & separator bit", only the separator bit actually is being or needs to be masked out. - I think the overall interface is fine. I think it might be useful to add a function that returns the length to which something would encode without actually encoding it, for cases where you want to estimate how much space you're going to need for something so that you can allocate space for it, and then only afterwards do the encoding for real. That's all I've got. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 05, 2023 at 06:36:15PM -0500, Robert Haas wrote:
> Andres asked me off-list if I could take another look at this.

I'm curious whether there are plans to pick this up again. IMHO it seems like a generally good idea. AFAICT the newest version of the patch is in a separate thread [0], which I just wanted to link here for posterity. If there are no plans, I might give it a try, but otherwise I'm happy to help review.

[0] https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com