Thread: [PATCH] Add get_bytes() and set_bytes() functions
Hi, While discussing another patch [1] it was discovered that we don't have a convenient way of casting a bytea to an integer / bigint and vice versa, extracting integers larger than one byte from byteas, etc. For instance, casting '\x11223344' to 0x11223344 may look like this: ``` WITH vals AS ( SELECT '\x11223344'::bytea AS x ) SELECT (get_byte(x, 0) :: bigint << 24) | (get_byte(x, 1) << 16) | (get_byte(x, 2) << 8) | get_byte(x, 3) FROM vals; ``` There seems to be a demand for this functionality [2][3] and it costs us nothing to maintain it, so I propose adding it. The proposed patch adds get_bytes() and set_bytes() functions. The semantics is similar to get_byte() and set_byte() we already have but the functions operate with bigints rather than bytes and the user can specify the size of the integer. This allows working with int2s, int4s, int8s or even int5s if needed. Examples: ``` SELECT get_bytes('\x1122334455667788'::bytea, 1, 2) = 0x2233; ?column? ---------- t SELECT set_bytes('\x1122334455667788'::bytea, 1, 2, 0xAABB); set_bytes -------------------- \x11aabb4455667788 ``` Thoughts? [1]: https://postgr.es/m/CAJ7c6TNMTGnqnG%3DyXXUQh9E88JDckmR45H2Q%2B%3DucaCLMOW1QQw%40mail.gmail.com [2]: https://stackoverflow.com/questions/32944267/postgresql-converting-bytea-to-bigint [3]: https://postgr.es/m/AANLkTikip9xs8iXc8e%2BMgz1T1701i8Xk6QtbVB3KJQzX%40mail.gmail.com -- Best regards, Aleksander Alekseev
Attachment
On Wed, Aug 14, 2024, at 13:01, Aleksander Alekseev wrote: > The proposed patch adds get_bytes() and set_bytes() functions. The > semantics is similar to get_byte() and set_byte() we already have but > the functions operate with bigints rather than bytes and the user can > specify the size of the integer. This allows working with int2s, > int4s, int8s or even int5s if needed. +1 I wanted this myself many times. I wonder if get_bytes() and set_bytes() will behave differently on little-endian vs big-endian systems? If so, then I think it would be nice to enforce a consistent byte order (e.g., big-endian), to ensure consistent behavior across platforms. Regards, Joel
Hi, > +1 > > I wanted this myself many times. > > I wonder if get_bytes() and set_bytes() will behave differently > on little-endian vs big-endian systems? > > If so, then I think it would be nice to enforce a consistent byte order > (e.g., big-endian), to ensure consistent behavior across platforms. No, the returned value will not depend on the CPU endiness. Current implementation uses big-endian / network order which in my humble opinion is what most users would expect. I believe we also need reverse(bytea) and repeat(bytea, integer) functions e.g. for those who want little-endian. However I want to propose them separately when we are done with this patch. -- Best regards, Aleksander Alekseev
On Wed, Aug 14, 2024, at 13:31, Aleksander Alekseev wrote: >> I wonder if get_bytes() and set_bytes() will behave differently >> on little-endian vs big-endian systems? > No, the returned value will not depend on the CPU endiness. Current > implementation uses big-endian / network order which in my humble > opinion is what most users would expect. Nice. I've reviewed and tested the patch. It looks straight-forward to me. I don't see any potential problems. I've marked it Ready for Committer. > I believe we also need reverse(bytea) and repeat(bytea, integer) > functions e.g. for those who want little-endian. However I want to > propose them separately when we are done with this patch. I agree those functions would be nice too. I also think it would be nice to provide these convenience functions: to_bytes(bigint) -> bytea from_bytes(bytea) -> bigint Since if not having a current bytea value, and just wanting to convert a bigint to bytea, then one would need to construct an zeroed bytea of the proper size first, to then use set_bytes(). And if just wanting to convert the entire bytea to a bigint, then one would need to pass 0 as offset and the length of the bytea as size. Regards, Joel
On Wed, Aug 14, 2024 at 02:34:06PM +0200, Joel Jacobson wrote: > On Wed, Aug 14, 2024, at 13:31, Aleksander Alekseev wrote: > >> I wonder if get_bytes() and set_bytes() will behave differently > >> on little-endian vs big-endian systems? > > No, the returned value will not depend on the CPU endiness. Current > > implementation uses big-endian / network order which in my humble > > opinion is what most users would expect. > > Nice. Indeed! > I've reviewed and tested the patch. > It looks straight-forward to me. > I don't see any potential problems. > I've marked it Ready for Committer. > > > I believe we also need reverse(bytea) and repeat(bytea, integer) > > functions e.g. for those who want little-endian. However I want to > > propose them separately when we are done with this patch. > > I agree those functions would be nice too. > > I also think it would be nice to provide these convenience functions: > to_bytes(bigint) -> bytea > from_bytes(bytea) -> bigint Along with these, would it make sense to have other forms of these that won't choke at 63 bits, e.g. NUMERIC or TEXT? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778
On Wed, Aug 14, 2024, at 16:43, David Fetter wrote: >> I also think it would be nice to provide these convenience functions: >> to_bytes(bigint) -> bytea >> from_bytes(bytea) -> bigint > > Along with these, would it make sense to have other forms of these > that won't choke at 63 bits, e.g. NUMERIC or TEXT? I wonder what would be good names for such functions though? Since NUMERIC can have decimal digits, and since BYTEA can already be casted to TEXT, which will just be \x followed by the hex digits, maybe such names should include "integer" or some other word, to indicate what is being returned? It's already quite easy to convert to NUMERIC though, for users who are aware of tricks like this: SELECT ('0x'||encode('\xCAFEBABEDEADBEEF'::bytea,'hex'))::numeric; numeric ---------------------- 14627333968688430831 (1 row) But, I think it would be better to provide functions, since many users probably have to google+stackoverflow or gpt to learn such tricks, which are not in the official documentation. Regards, Joel
On Wed, Aug 14, 2024 at 05:39:32PM +0200, Joel Jacobson wrote: > On Wed, Aug 14, 2024, at 16:43, David Fetter wrote: > >> I also think it would be nice to provide these convenience functions: > >> to_bytes(bigint) -> bytea > >> from_bytes(bytea) -> bigint > > > > Along with these, would it make sense to have other forms of these > > that won't choke at 63 bits, e.g. NUMERIC or TEXT? > > I wonder what would be good names for such functions though? decimal_to_bytes(numeric), decimal_to_bytes(text) on one side decimal_from_bytes(bytea, typeoid) Naming Things™ is one of the hard problems in computer science. Bad joke that includes cache coherency and off-by-one included by reference. > Since NUMERIC can have decimal digits, and since BYTEA can already be casted to > TEXT, which will just be \x followed by the hex digits, maybe such names should > include "integer" or some other word, to indicate what is being returned? > > It's already quite easy to convert to NUMERIC though, > for users who are aware of tricks like this: > SELECT ('0x'||encode('\xCAFEBABEDEADBEEF'::bytea,'hex'))::numeric; > numeric > ---------------------- > 14627333968688430831 > (1 row) > > But, I think it would be better to provide functions, > since many users probably have to google+stackoverflow or gpt > to learn such tricks, which are not in the official documentation. As usual, I see "official documentation lacks helpful and/or non-obvious examples" as a problem best approached by making good the lack. I am aware that my ideas about pedagogy, documentation, etc. are not shared universally, but they're widely shared by people whose main interaction with documents is trying to get help from them. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778
On Wed, Aug 14, 2024, at 18:31, David Fetter wrote: > On Wed, Aug 14, 2024 at 05:39:32PM +0200, Joel Jacobson wrote: >> On Wed, Aug 14, 2024, at 16:43, David Fetter wrote: >> >> I also think it would be nice to provide these convenience functions: >> >> to_bytes(bigint) -> bytea >> >> from_bytes(bytea) -> bigint >> > >> > Along with these, would it make sense to have other forms of these >> > that won't choke at 63 bits, e.g. NUMERIC or TEXT? >> >> I wonder what would be good names for such functions though? > > decimal_to_bytes(numeric), decimal_to_bytes(text) on one side > decimal_from_bytes(bytea, typeoid) I assume decimal_to_bytes() will only accept integer numerics, that is, that don't have a decimal digits part? Hmm, it's perhaps then a bit counter intuitive that the name contains "decimal", since some people might associate the word "decimal" stronger with "decimal digits" rather than the radix/base 10. What do we want to happen if passing a numeric with decimal digits, to decimal_to_bytes()? It must be an error, right? Example: SELECT decimal_to_bytes(1.23); > Naming Things™ is one of the hard problems in computer science. Bad > joke that includes cache coherency and off-by-one included by > reference. So true :) > As usual, I see "official documentation lacks helpful and/or > non-obvious examples" as a problem best approached by making good the > lack. I am aware that my ideas about pedagogy, documentation, etc. are > not shared universally, but they're widely shared by people whose main > interaction with documents is trying to get help from them. Well spoken. Regards, Joel
On Wed, Aug 14, 2024, at 19:25, Joel Jacobson wrote: > What do we want to happen if passing a numeric with decimal digits, > to decimal_to_bytes()? It must be an error, right? > > Example: SELECT decimal_to_bytes(1.23); Hmm, an error feels quite ugly on second thought. Would be nicer if all numerics could be represented, in a bytea representation that is meaningful to other systems. I think we need a tuple somehow. Example: SELECT numeric_to_bytes(223195403574957); (\xcafebabedeadbeef,0,false) SELECT numeric_to_bytes(-223195403574957); (\xcafebabedeadbeef,0,true) SELECT numeric_to_bytes(2231954035749.57); (\xcafebabedeadbeef,2,false) SELECT numeric_from_bytes('\xcafebabedeadbeef'::bytea,0,false); 223195403574957 SELECT numeric_from_bytes('\xcafebabedeadbeef'::bytea,0,true); -223195403574957 SELECT numeric_from_bytes('\xcafebabedeadbeef'::bytea,2,false); 2231954035749.57 But then what about Inf,-Inf,NaN? Regards, Joel
On Thu, 15 Aug 2024 at 05:20, Joel Jacobson <joel@compiler.org> wrote: > > On Wed, Aug 14, 2024, at 19:25, Joel Jacobson wrote: > > What do we want to happen if passing a numeric with decimal digits, > > to decimal_to_bytes()? It must be an error, right? > > > > Example: SELECT decimal_to_bytes(1.23); > > Hmm, an error feels quite ugly on second thought. > Would be nicer if all numerics could be represented, > > But then what about Inf,-Inf,NaN? > Perhaps we should also add casts between bytea and the integer/numeric types. That might be easier to use than functions in some circumstances. When casting a numeric to an integer, the result is rounded to the nearest integer, and NaN/Inf generate errors, so we should probably do the same here. Regards, Dean
Hi, > Perhaps we should also add casts between bytea and the integer/numeric > types. That might be easier to use than functions in some > circumstances. > > When casting a numeric to an integer, the result is rounded to the > nearest integer, and NaN/Inf generate errors, so we should probably do > the same here. Yes, I was also thinking about adding NUMERIC versions of get_bytes() / set_bytes(). This would allow converting more than 8 bytes to/from an integer. I dropped this idea because I thought there would be not much practical use for it. On the flip side you never know who uses Postgres and for what purpose. I will add corresponding casts unless the idea will get a push-back from the community. IMO the existence of these casts will at least not make things worse. -- Best regards, Aleksander Alekseev
On Thu, 15 Aug 2024 13:58:03 +0300 Aleksander Alekseev <aleksander@timescale.com> wrote: > Hi, > > > Perhaps we should also add casts between bytea and the integer/numeric > > types. That might be easier to use than functions in some > > circumstances. > > > > When casting a numeric to an integer, the result is rounded to the > > nearest integer, and NaN/Inf generate errors, so we should probably do > > the same here. > > Yes, I was also thinking about adding NUMERIC versions of get_bytes() > / set_bytes(). This would allow converting more than 8 bytes to/from > an integer. I dropped this idea because I thought there would be not > much practical use for it. On the flip side you never know who uses > Postgres and for what purpose. > > I will add corresponding casts unless the idea will get a push-back > from the community. IMO the existence of these casts will at least not > make things worse. When we add such casts between bytea and the integer/numeric types, one of the problems mentioned the first of the thread, that is, "we don't have a convenient way of casting a bytea to an integer / bigint and vice versa", would seem be resolved. On the other hand, I suppose get_bytes() and set_bytes() are still useful for extracting bytes from byteas, etc. If casting is no longer the main purpose of these functions, are variations that get_bytes returns bytea instead of bigint, and set_bytes receives bytea as the newvalue argument useful? I wonder it would eliminate the restrict that size cannot be larger than 8. Here are my very trivial comments on the patch. + * this routine treats "bytea" as an array of bytes. Maybe, the sentence should start with "This ... ". + while(size) + { I wonder inserting a space after "while" is the standard style. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
On 14.08.24 13:01, Aleksander Alekseev wrote: > The proposed patch adds get_bytes() and set_bytes() functions. The > semantics is similar to get_byte() and set_byte() we already have but > the functions operate with bigints rather than bytes and the user can > specify the size of the integer. This allows working with int2s, > int4s, int8s or even int5s if needed. > > Examples: > > ``` > SELECT get_bytes('\x1122334455667788'::bytea, 1, 2) = 0x2233; > ?column? > ---------- > t > > SELECT set_bytes('\x1122334455667788'::bytea, 1, 2, 0xAABB); > set_bytes > -------------------- > \x11aabb4455667788 > ``` I think these functions do about three things at once, and I don't think they address the originally requested purpose very well. Converting between integers and byte arrays of matching size seems like reasonable functionality. (You can already do one half of that by calling int2send(), int4send(), and int8send(), but the other direction (intXrecv()) is not user-callable). The other things are extracting that byte array from a larger byte array and sticking it back into a larger byte array; those seem like separate operations. There is already substr() for bytea for the first part, and there might be another string-like operationg for the second part, or maybe we could add one.
On Fri, 16 Aug 2024 11:41:37 +0300 Aleksander Alekseev <aleksander@timescale.com> wrote: > Hi, > > > When we add such casts between bytea and the integer/numeric types, > > one of the problems mentioned the first of the thread, that is, > > "we don't have a convenient way of casting a bytea to an integer / bigint > > and vice versa", would seem be resolved. > > > > On the other hand, I suppose get_bytes() and set_bytes() are still useful > > for extracting bytes from byteas, etc. If casting is no longer the main > > purpose of these functions, are variations that get_bytes returns bytea > > instead of bigint, and set_bytes receives bytea as the newvalue argument > > useful? I wonder it would eliminate the restrict that size cannot be larger > > than 8. > > No, casting between bytea and numeric will not replace get_bytes() / > set_bytes() for performance reasons. > > Consider the case when you want to extract an int4 from a bytea. > get_bytes() is going to be very fast while substr() -> ::numeric -> > ::integer chain will require unnecessary copying and conversions. > Casting between bytea and numeric is only useful when one has to deal > with integers larger than 8 bytes. Whether this happens often is a > debatable question. Thank you for explanation. I understood the performance drawback. I supposed interfaces similar to lo_get, lo_put, loread, lowrite of large objects since they might be useful to access or modify a part of bytea like a binary file read by pg_read_binary_file. > > > Here are my very trivial comments on the patch. > > > > + * this routine treats "bytea" as an array of bytes. > > > > Maybe, the sentence should start with "This ... ". > > > > + while(size) > > + { > > > > I wonder inserting a space after "while" is the standard style. > > Thanks, fixed. Should we fix the comment on byteaGetByte in passing, too? Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Hi Peter, Thanks for the feedback. > I think these functions do about three things at once, and I don't think > they address the originally requested purpose very well. The amount of things that the function does is a matter of interpretation. I can claim that it does one thing ("extracts an integer from a bytea"), or as many things as there are lines of code. IMO the actual question is whether this is a good user interface or not. Since we already have get_byte() / set_byte() the interface is arguably OK. > Converting between integers and byte arrays of matching size seems like > reasonable functionality. (You can already do one half of that by > calling int2send(), int4send(), and int8send(), but the other direction > (intXrecv()) is not user-callable). > > The other things are extracting that byte array from a larger byte array > and sticking it back into a larger byte array; those seem like separate > operations. There is already substr() for bytea for the first part, and > there might be another string-like operationg for the second part, or > maybe we could add one. If I understand correctly, you propose doing (1): ``` SELECT substr('\x1122334455667788'::bytea, 2, 2) :: int2; ``` ... instead of: ``` SELECT get_bytes('\x1122334455667788'::bytea, 1, 2) ``` ... and (2): ``` WITH vals AS ( SELECT '\x1122334455667788'::bytea AS x ) SELECT substr(x, 1, 1) || int2send(1234::int2) || substr(x, 4, 5) FROM vals; ``` ... instead of: ``` SELECT set_bytes('\x1122334455667788'::bytea, 1, 2, 0xAABB); ``` There is nothing to do for query (2), it already works. It's not much better than the query from my first email though. To clarify, supporting bytea<->integer (and/or bytea<->numeric) casts doesn't strike me as a terrible idea but it doesn't address the issue I'm proposing to solve. -- Best regards, Aleksander Alekseev
On 19.08.24 16:10, Aleksander Alekseev wrote: > To clarify, supporting bytea<->integer (and/or bytea<->numeric) casts > doesn't strike me as a terrible idea but it doesn't address the issue > I'm proposing to solve. What is the issue you are proposing to solve? You linked to a couple of threads and stackoverflow pages, and you concluded from that that we should add get_bytes() and set_bytes() functions. It's not obvious how you get from the former to the latter, and I don't think the functions you propose are well-designed in isolation.
On Fri, 24 Jan 2025 at 13:00, Aleksander Alekseev <aleksander@timescale.com> wrote: > > Thank you. Here is the corrected patch. > This looks pretty good to me. I have a just a couple of minor comments: * The loop in bytea_integer() can be written more simply as a "for" loop. Given that it's only a few lines of code, it might as well just be coded directly in each cast function, which avoids the need to go via a 64-bit integer for every case. In addition, it should use the BITS_PER_BYTE macro rather than "8". Doing that leads to code that's consistent with bittoint4() and bittoint8(). * In pg_proc.dat, it's not necessary to write "proleakproof => 'f'", because that's the default, and no other function does that. * I think it's worth using slightly more non-trivial doc examples, including positive and negative cases, to make the behaviour more obvious. * I also tweaked the regression tests a bit, and copied the existing test style which displays both the expected and actual results from each test. With those updates, I think this is ready for commit, which I'll do in a day or two, if there are no further comments. Regards, Dean