Thread: [HACKERS] bytea_output output of base64
Currently bytea_output supports values of 'hex' (the default) and 'escape' (octal). 'hex' uses two characters per byte, while escape uses three (ignoring the prefix overhead of \x or \[0-9].) It is my understanding that base64 uses 1.37 characters per byte: https://en.wikipedia.org/wiki/Base64 Is there a reason we don't support base64 as a bytea_output output option, except that no one has implemented it? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes: > Is there a reason we don't support base64 as a bytea_output output > option, except that no one has implemented it? How about "we already have one too many bytea output formats"? I don't think forcing code to try to support still another one is a great thing ... especially not if it couldn't be reliably distinguished from the hex format. regards, tom lane
On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Is there a reason we don't support base64 as a bytea_output output > > option, except that no one has implemented it? > > How about "we already have one too many bytea output formats"? > I don't think forcing code to try to support still another one > is a great thing ... especially not if it couldn't be reliably > distinguished from the hex format. Is there a reason we chose hex over base64? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > Is there a reason we don't support base64 as a bytea_output output > > > option, except that no one has implemented it? > > > > How about "we already have one too many bytea output formats"? > > I don't think forcing code to try to support still another one > > is a great thing ... especially not if it couldn't be reliably > > distinguished from the hex format. > > Is there a reason we chose hex over base64? Whether there was or not, there's not a compelling reason now to break people's software. When people want compression, methods a LOT more effective than base64 are common. Gzip, for example. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: >> Is there a reason we chose hex over base64? > Whether there was or not, there's not a compelling reason now to break > people's software. When people want compression, methods a LOT more > effective than base64 are common. Gzip, for example. Yeah. I think the argument for hex was about readability. It's certainly not more compact than the traditional "escape" format --- and depending on what your data is like, base64 probably wouldn't be either. If you want small, you'd go for binary transmission first, and then maybe gzip it. I really don't see any compelling argument for base64 except possibly exchangability with other systems. But if we stuck a header on it, which I think we'd have to do to make it reliably distinguishable from the other bytea formats, that argument goes down the drain. regards, tom lane
On 02/23/2017 06:52 PM, David Fetter wrote: > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: >> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: >>> Bruce Momjian <bruce@momjian.us> writes: >>>> Is there a reason we don't support base64 as a bytea_output output >>>> option, except that no one has implemented it? >>> How about "we already have one too many bytea output formats"? >>> I don't think forcing code to try to support still another one >>> is a great thing ... especially not if it couldn't be reliably >>> distinguished from the hex format. >> Is there a reason we chose hex over base64? > Whether there was or not, there's not a compelling reason now to break > people's software. When people want compression, methods a LOT more > effective than base64 are common. Gzip, for example. > What's the use case anyway? It's already supported by the encode() and decode() functions if you need that format. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 23, 2017 at 07:09:57PM -0500, Andrew Dunstan wrote: > > > On 02/23/2017 06:52 PM, David Fetter wrote: > > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: > >> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > >>> Bruce Momjian <bruce@momjian.us> writes: > >>>> Is there a reason we don't support base64 as a bytea_output output > >>>> option, except that no one has implemented it? > >>> How about "we already have one too many bytea output formats"? > >>> I don't think forcing code to try to support still another one > >>> is a great thing ... especially not if it couldn't be reliably > >>> distinguished from the hex format. > >> Is there a reason we chose hex over base64? > > Whether there was or not, there's not a compelling reason now to break > > people's software. When people want compression, methods a LOT more > > effective than base64 are common. Gzip, for example. > > > > > What's the use case anyway? It's already supported by the encode() and > decode() functions if you need that format. I was just curious because it seems more compact than hex and many exchange formats use it, like SSL certificates and keys. I know you can encode() but I thought it might help make pg_dump output smaller. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 2/23/17 8:22 PM, Bruce Momjian wrote: > I was just curious because it seems more compact than hex and many > exchange formats use it, like SSL certificates and keys. I know you can > encode() but I thought it might help make pg_dump output smaller. It undoubtedly would make pg_dump smaller, though I'm not sure how much that's worth since if you care at all about that you'll gzip it. But, the other thing it might do is speed up COPY, especially on input. Some performance tests of that might be interesting. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
> It undoubtedly would make pg_dump smaller, though I'm not sure how much > that's worth since if you care at all about that you'll gzip it. > > But, the other thing it might do is speed up COPY, especially on input. Some > performance tests of that might be interesting. For what it is worth: Ascii85 (aka Base85) is used in PDF by Adobe, with a prefix "<~" and a suffix "~>". It codes 4 bytes as 5 ascii characters, i.e. a 25% loss. There is also Z85 which avoids some special characters: backslash, single quote, double quote. -- Fabien.
On Thu, Feb 23, 2017 at 03:52:46PM -0800, David Fetter wrote: > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: > > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > Is there a reason we don't support base64 as a bytea_output output > > > > option, except that no one has implemented it? > > > > > > How about "we already have one too many bytea output formats"? > > > I don't think forcing code to try to support still another one > > > is a great thing ... especially not if it couldn't be reliably > > > distinguished from the hex format. > > > > Is there a reason we chose hex over base64? > > Whether there was or not, there's not a compelling reason now to break > people's software. When people want compression, methods a LOT more > effective than base64 are common. Gzip, for example. > > Best, > David. First, hex encoding is very simple to perform. Second, most applications have routines to handle it trivially. And third, base64 encoding has some padding constraints that can complicate is processing. Like David suggests, if you want compact, run it through lz4/gzip/lzop...for a much better size return. Regards, Ken
On 2/24/17 7:44 AM, Kenneth Marshall wrote: > Like David suggests, > if you want compact, run it through lz4/gzip/lzop...for a much better size > return. Speaking of which; any bytea where you care about this is likely to live in an already compressed state in toast. ISTM it would be valuable if we had a way to just spit out the raw compressed data (or a text-safe version of that), at least for COPY's purposes... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
On 2/23/17 17:55, Bruce Momjian wrote: > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> Is there a reason we don't support base64 as a bytea_output output >>> option, except that no one has implemented it? >> >> How about "we already have one too many bytea output formats"? >> I don't think forcing code to try to support still another one >> is a great thing ... especially not if it couldn't be reliably >> distinguished from the hex format. > > Is there a reason we chose hex over base64? The reason we changed from the old format to hex was for performance. We didn't consider base64 at the time, but hex would probably still have been faster. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Feb 25, 2017 at 06:49:01PM -0500, Peter Eisentraut wrote: > On 2/23/17 17:55, Bruce Momjian wrote: > > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > >> Bruce Momjian <bruce@momjian.us> writes: > >>> Is there a reason we don't support base64 as a bytea_output output > >>> option, except that no one has implemented it? > >> > >> How about "we already have one too many bytea output formats"? > >> I don't think forcing code to try to support still another one > >> is a great thing ... especially not if it couldn't be reliably > >> distinguished from the hex format. > > > > Is there a reason we chose hex over base64? > > The reason we changed from the old format to hex was for performance. > We didn't consider base64 at the time, but hex would probably still have > been faster. OK, good to know. I was just asking. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sun, Feb 26, 2017 at 5:19 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: >> Is there a reason we chose hex over base64? > > The reason we changed from the old format to hex was for performance. > We didn't consider base64 at the time, but hex would probably still have > been faster. I might be remembering the discussion incorrectly, but I thought the issue was that it was really hard to predict how large a buffer we needed for the "escape" format, because the answer depends on the number of backslashes in the string. With the "hex" format, we could predict the size of the output just based on the size of the input (which we know) rather than having it depend on the contents of the input (which we don't). So I would think that any format that has the property that we can know the exact output size based on the input size would be equally good. And base64 would seem to have some advantage because the output would be more substantially more compact. That having been said, I do agree with Tom's point that we already have one more bytea_output format than would be ideal. To justify implementing base64 as a third choice, it would have to not only be better than hex, but enough better to justify the migration pain. I'm not sure whether it could clear that bar. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/26/17 05:05, Robert Haas wrote: > That having been said, I do agree with Tom's point that we already > have one more bytea_output format than would be ideal. To justify > implementing base64 as a third choice, it would have to not only be > better than hex, but enough better to justify the migration pain. I'm > not sure whether it could clear that bar. Another point is that an "output" format is not the same as an "encoding" format. An output format should be somewhat human-readable. Otherwise we could find all kinds of more compact output formats for different data types. The hex format satisfies all of performance, ease of use, and readability pretty well, I think. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 27, 2017 at 7:08 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2/26/17 05:05, Robert Haas wrote: >> That having been said, I do agree with Tom's point that we already >> have one more bytea_output format than would be ideal. To justify >> implementing base64 as a third choice, it would have to not only be >> better than hex, but enough better to justify the migration pain. I'm >> not sure whether it could clear that bar. > > Another point is that an "output" format is not the same as an > "encoding" format. An output format should be somewhat human-readable. > Otherwise we could find all kinds of more compact output formats for > different data types. The hex format satisfies all of performance, ease > of use, and readability pretty well, I think. Depends. I'd argue that 'dog' is more readable than '\x646f67', although I concede that the latter is easier for a human to decode than 'ZG9n'. The Matrix nonwithstanding, if your binary blobs contain things like JPG images, you probably want to feed the blobs to something that can render them as a picture, rather than trying to guess from either hex or escape output whether you've got a blond, brunette, or redhead. In the case where you do happen to want your binary blob rendered as ASCII, the optimal format probably depends on some mix of what the underlying data is, what you plan to do with the result, and personal preference. I think we might as well argue about whether green is better than yellow. I don't think Bruce was seriously proposing a change in this area anyway. I think he was just asking a question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 27, 2017 at 09:28:10PM +0530, Robert Haas wrote: > I don't think Bruce was seriously proposing a change in this area > anyway. I think he was just asking a question. That is correct. I was asking if we made an obvious mistake, and most people are saying no. Also, base64 is less easy to compress because input bytes span base64-bytes, so "dog" might encode differently depending on where the two high bits are stored, while hex alway encodes "dog" the same way. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +