Re: A varint implementation for PG? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: A varint implementation for PG? |
Date | |
Msg-id | CA+TgmoZ8NXJ_phQc=AYOS-WBuEVcgKXQZY7MT5jBdFf8=YiTCQ@mail.gmail.com Whole thread Raw |
In response to | A varint implementation for PG? (Andres Freund <andres@anarazel.de>) |
Responses |
Re: A varint implementation for PG?
|
List | pgsql-hackers |
Andres asked me off-list if I could take another look at this. So here's a bit of review: - The header comment at the top of the file gives some examples of how the encoding works, and then basically says, oh wait, there's also a sign bit at the end, so all those examples are actually wrong. It's really this other thing. Perhaps it's possible to reword things to avoid that. - The XXX comment in the file header says "variants" where it probably means "varints". A later comment mentions "lenght" instead of "length". - In pg_varint_encode_uint64, where it says "XXX: I'm sure there's a neater way to do this," I'm not sure exactly what your parameters for neater are, but you could avoid the conditional if you just did: bits_of_output_data = bits_of_input_data + bytes_of_input_data; bytes_of_output_data = (bits_of_output_data + BITS_PER_BYTE - 1) / BITS_PER_BYTE; I think the comment implies that you thought of that and discarded it on performance grounds, but I'm not quite sure that I'm right about that, so I mention it here. I also thought of another approach that doesn't require computing bytes_of_input_data: bytes_of_output_data = bits_of_input_data / 7 + 1; The intuition here is that every byte you add to the output gives you room for 7 additional data bits, because you gain 8 for the new byte but also have to consume 1 bit from the first byte, for a net gain of 7. This formula gives the wrong answer when bits_of_input_data is 63 or 64, but you don't need to use it for values greater than PG_VARINT_UINT64_MAX_8BYTE_VAL, so that doesn't matter. - It's a bit surprising that the length argument to memcpy() is a constant in the 8-bytes-or-less case. It should be fine, because the output buffer must be big enough to hold at least 9 more bytes, so all that can happen is we write unnecessary bytes that the caller can later overwrite, or not. But it would be worth a comment, I think. In particular, we should note that you should always have enough allocated space in the output buffer for the maximum width of 9 bytes even if you know the number you're actually encoding is small. You do mention this in the decoding function, but not on the encoding side. - The FIXME comment for pg_varint_decode_uint64 no verb. - ret <<= BITS_PER_BYTE * (BITS_PER_BYTE - bytes) is awfully hard to reason about. My first thought was that it had to be wrong: if we are shifting left during encoding, how can we also be shifting left during decoding? Actually I think I see, now, how it works on a little-Endian machine. Logically, the encoded bytes are the least-significant bytes. We copy them to the least-significant bytes of "ret," but they're in the wrong order, with the most significant byte of the encoded representation in the last significant byte of "ret". By shifting left, we move all the bytes to the other "end" of "ret", and then the byte swap puts them back where they were, but now in the right order. Along the way, the garbage bits we're supposed to be ignoring get thrown away and replaced with zeroes. But I still don't see how it works on a big-Endian machine. On such a machine, we copy the encoded bytes, which are still the least significant bytes of the original value, to the most significant bytes of "ret". The byte swap isn't going to do anything here, so in this case it feels like the shift needs to be in the other direction -- a shift right. But maybe not, because originally I thought it should be a shift right in both cases, and now I don't think that's right. - Re "mask out length indicator bits & separator bit", only the separator bit actually is being or needs to be masked out. - I think the overall interface is fine. I think it might be useful to add a function that returns the length to which something would encode without actually encoding it, for cases where you want to estimate how much space you're going to need for something so that you can allocate space for it, and then only afterwards do the encoding for real. That's all I've got. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: