Thread: [HACKERS] bytea_output output of base64

[HACKERS] bytea_output output of base64

From
Bruce Momjian
Date:
Currently bytea_output supports values of 'hex' (the default) and
'escape' (octal).  'hex' uses two characters per byte, while escape uses
three (ignoring the prefix overhead of \x or \[0-9].)

It is my understanding that base64 uses 1.37 characters per byte:
https://en.wikipedia.org/wiki/Base64

Is there a reason we don't support base64 as a bytea_output output
option, except that no one has implemented it?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] bytea_output output of base64

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Is there a reason we don't support base64 as a bytea_output output
> option, except that no one has implemented it?

How about "we already have one too many bytea output formats"?
I don't think forcing code to try to support still another one
is a great thing ... especially not if it couldn't be reliably
distinguished from the hex format.
        regards, tom lane



Re: [HACKERS] bytea_output output of base64

From
Bruce Momjian
Date:
On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is there a reason we don't support base64 as a bytea_output output
> > option, except that no one has implemented it?
> 
> How about "we already have one too many bytea output formats"?
> I don't think forcing code to try to support still another one
> is a great thing ... especially not if it couldn't be reliably
> distinguished from the hex format.

Is there a reason we chose hex over base64?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] bytea_output output of base64

From
David Fetter
Date:
On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is there a reason we don't support base64 as a bytea_output output
> > > option, except that no one has implemented it?
> > 
> > How about "we already have one too many bytea output formats"?
> > I don't think forcing code to try to support still another one
> > is a great thing ... especially not if it couldn't be reliably
> > distinguished from the hex format.
> 
> Is there a reason we chose hex over base64?

Whether there was or not, there's not a compelling reason now to break
people's software.  When people want compression, methods a LOT more
effective than base64 are common.  Gzip, for example.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] bytea_output output of base64

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
>> Is there a reason we chose hex over base64?

> Whether there was or not, there's not a compelling reason now to break
> people's software.  When people want compression, methods a LOT more
> effective than base64 are common.  Gzip, for example.

Yeah.  I think the argument for hex was about readability.  It's certainly
not more compact than the traditional "escape" format --- and depending on
what your data is like, base64 probably wouldn't be either.

If you want small, you'd go for binary transmission first, and then maybe
gzip it.

I really don't see any compelling argument for base64 except possibly
exchangability with other systems.  But if we stuck a header on it, which
I think we'd have to do to make it reliably distinguishable from the other
bytea formats, that argument goes down the drain.
        regards, tom lane



Re: [HACKERS] bytea_output output of base64

From
Andrew Dunstan
Date:

On 02/23/2017 06:52 PM, David Fetter wrote:
> On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
>> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
>>> Bruce Momjian <bruce@momjian.us> writes:
>>>> Is there a reason we don't support base64 as a bytea_output output
>>>> option, except that no one has implemented it?
>>> How about "we already have one too many bytea output formats"?
>>> I don't think forcing code to try to support still another one
>>> is a great thing ... especially not if it couldn't be reliably
>>> distinguished from the hex format.
>> Is there a reason we chose hex over base64?
> Whether there was or not, there's not a compelling reason now to break
> people's software.  When people want compression, methods a LOT more
> effective than base64 are common.  Gzip, for example.
>


What's the use case anyway? It's already supported by the encode() and
decode() functions if you need that format.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] bytea_output output of base64

From
Bruce Momjian
Date:
On Thu, Feb 23, 2017 at 07:09:57PM -0500, Andrew Dunstan wrote:
> 
> 
> On 02/23/2017 06:52 PM, David Fetter wrote:
> > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
> >> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> >>> Bruce Momjian <bruce@momjian.us> writes:
> >>>> Is there a reason we don't support base64 as a bytea_output output
> >>>> option, except that no one has implemented it?
> >>> How about "we already have one too many bytea output formats"?
> >>> I don't think forcing code to try to support still another one
> >>> is a great thing ... especially not if it couldn't be reliably
> >>> distinguished from the hex format.
> >> Is there a reason we chose hex over base64?
> > Whether there was or not, there's not a compelling reason now to break
> > people's software.  When people want compression, methods a LOT more
> > effective than base64 are common.  Gzip, for example.
> >
> 
> 
> What's the use case anyway? It's already supported by the encode() and
> decode() functions if you need that format.

I was just curious because it seems more compact than hex and many
exchange formats use it, like SSL certificates and keys.  I know you can
encode() but I thought it might help make pg_dump output smaller. 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] bytea_output output of base64

From
Jim Nasby
Date:
On 2/23/17 8:22 PM, Bruce Momjian wrote:
> I was just curious because it seems more compact than hex and many
> exchange formats use it, like SSL certificates and keys.  I know you can
> encode() but I thought it might help make pg_dump output smaller.

It undoubtedly would make pg_dump smaller, though I'm not sure how much 
that's worth since if you care at all about that you'll gzip it.

But, the other thing it might do is speed up COPY, especially on input. 
Some performance tests of that might be interesting.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] bytea_output output of base64

From
Fabien COELHO
Date:
> It undoubtedly would make pg_dump smaller, though I'm not sure how much 
> that's worth since if you care at all about that you'll gzip it.
>
> But, the other thing it might do is speed up COPY, especially on input. Some 
> performance tests of that might be interesting.

For what it is worth:

Ascii85 (aka Base85) is used in PDF by Adobe, with a prefix "<~" and a 
suffix "~>". It codes 4 bytes as 5 ascii characters, i.e. a 25% loss. 
There is also Z85 which avoids some special characters: backslash, single 
quote, double quote.

-- 
Fabien.



Re: [HACKERS] bytea_output output of base64

From
Kenneth Marshall
Date:
On Thu, Feb 23, 2017 at 03:52:46PM -0800, David Fetter wrote:
> On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote:
> > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > > > Is there a reason we don't support base64 as a bytea_output output
> > > > option, except that no one has implemented it?
> > > 
> > > How about "we already have one too many bytea output formats"?
> > > I don't think forcing code to try to support still another one
> > > is a great thing ... especially not if it couldn't be reliably
> > > distinguished from the hex format.
> > 
> > Is there a reason we chose hex over base64?
> 
> Whether there was or not, there's not a compelling reason now to break
> people's software.  When people want compression, methods a LOT more
> effective than base64 are common.  Gzip, for example.
> 
> Best,
> David.

First, hex encoding is very simple to perform. Second, most applications
have routines to handle it trivially. And third, base64 encoding has some
padding constraints that can complicate is processing. Like David suggests,
if you want compact, run it through lz4/gzip/lzop...for a much better size
return.

Regards,
Ken



Re: [HACKERS] bytea_output output of base64

From
Jim Nasby
Date:
On 2/24/17 7:44 AM, Kenneth Marshall wrote:
> Like David suggests,
> if you want compact, run it through lz4/gzip/lzop...for a much better size
> return.

Speaking of which; any bytea where you care about this is likely to live 
in an already compressed state in toast. ISTM it would be valuable if we 
had a way to just spit out the raw compressed data (or a text-safe 
version of that), at least for COPY's purposes...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] bytea_output output of base64

From
Peter Eisentraut
Date:
On 2/23/17 17:55, Bruce Momjian wrote:
> On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Is there a reason we don't support base64 as a bytea_output output
>>> option, except that no one has implemented it?
>>
>> How about "we already have one too many bytea output formats"?
>> I don't think forcing code to try to support still another one
>> is a great thing ... especially not if it couldn't be reliably
>> distinguished from the hex format.
> 
> Is there a reason we chose hex over base64?

The reason we changed from the old format to hex was for performance.
We didn't consider base64 at the time, but hex would probably still have
been faster.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] bytea_output output of base64

From
Bruce Momjian
Date:
On Sat, Feb 25, 2017 at 06:49:01PM -0500, Peter Eisentraut wrote:
> On 2/23/17 17:55, Bruce Momjian wrote:
> > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote:
> >> Bruce Momjian <bruce@momjian.us> writes:
> >>> Is there a reason we don't support base64 as a bytea_output output
> >>> option, except that no one has implemented it?
> >>
> >> How about "we already have one too many bytea output formats"?
> >> I don't think forcing code to try to support still another one
> >> is a great thing ... especially not if it couldn't be reliably
> >> distinguished from the hex format.
> > 
> > Is there a reason we chose hex over base64?
> 
> The reason we changed from the old format to hex was for performance.
> We didn't consider base64 at the time, but hex would probably still have
> been faster.

OK, good to know. I was just asking.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] bytea_output output of base64

From
Robert Haas
Date:
On Sun, Feb 26, 2017 at 5:19 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>> Is there a reason we chose hex over base64?
>
> The reason we changed from the old format to hex was for performance.
> We didn't consider base64 at the time, but hex would probably still have
> been faster.

I might be remembering the discussion incorrectly, but I thought the
issue was that it was really hard to predict how large a buffer we
needed for the "escape" format, because the answer depends on the
number of backslashes in the string.  With the "hex" format, we could
predict the size of the output just based on the size of the input
(which we know) rather than having it depend on the contents of the
input (which we don't).  So I would think that any format that has the
property that we can know the exact output size based on the input
size would be equally good.  And base64 would seem to have some
advantage because the output would be more substantially more compact.

That having been said, I do agree with Tom's point that we already
have one more bytea_output format than would be ideal.  To justify
implementing base64 as a third choice, it would have to not only be
better than hex, but enough better to justify the migration pain.  I'm
not sure whether it could clear that bar.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] bytea_output output of base64

From
Peter Eisentraut
Date:
On 2/26/17 05:05, Robert Haas wrote:
> That having been said, I do agree with Tom's point that we already
> have one more bytea_output format than would be ideal.  To justify
> implementing base64 as a third choice, it would have to not only be
> better than hex, but enough better to justify the migration pain.  I'm
> not sure whether it could clear that bar.

Another point is that an "output" format is not the same as an
"encoding" format.  An output format should be somewhat human-readable.
Otherwise we could find all kinds of more compact output formats for
different data types.  The hex format satisfies all of performance, ease
of use, and readability pretty well, I think.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] bytea_output output of base64

From
Robert Haas
Date:
On Mon, Feb 27, 2017 at 7:08 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/26/17 05:05, Robert Haas wrote:
>> That having been said, I do agree with Tom's point that we already
>> have one more bytea_output format than would be ideal.  To justify
>> implementing base64 as a third choice, it would have to not only be
>> better than hex, but enough better to justify the migration pain.  I'm
>> not sure whether it could clear that bar.
>
> Another point is that an "output" format is not the same as an
> "encoding" format.  An output format should be somewhat human-readable.
> Otherwise we could find all kinds of more compact output formats for
> different data types.  The hex format satisfies all of performance, ease
> of use, and readability pretty well, I think.

Depends. I'd argue that 'dog' is more readable than '\x646f67',
although I concede that the latter is easier for a human to decode
than 'ZG9n'.

The Matrix nonwithstanding, if your binary blobs contain things like
JPG images, you probably want to feed the blobs to something that can
render them as a picture, rather than trying to guess from either hex
or escape output whether you've got a blond, brunette, or redhead.  In
the case where you do happen to want your binary blob rendered as
ASCII, the optimal format probably depends on some mix of what the
underlying data is, what you plan to do with the result, and personal
preference.  I think we might as well argue about whether green is
better than yellow.

I don't think Bruce was seriously proposing a change in this area
anyway.  I think he was just asking a question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] bytea_output output of base64

From
Bruce Momjian
Date:
On Mon, Feb 27, 2017 at 09:28:10PM +0530, Robert Haas wrote:
> I don't think Bruce was seriously proposing a change in this area
> anyway.  I think he was just asking a question.

That is correct.  I was asking if we made an obvious mistake, and most
people are saying no.  Also, base64 is less easy to compress because
input bytes span base64-bytes, so "dog" might encode differently
depending on where the two high bits are stored, while hex alway encodes
"dog" the same way.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +