Re: Re: Toast,bytea, Text -blob all confusing - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Re: Toast,bytea, Text -blob all confusing
Date
Msg-id 200109040319.f843JJ016358@candle.pha.pa.us
Whole thread Raw
In response to Re: Re: Toast,bytea, Text -blob all confusing  ("Joe Conway" <joseph.conway@home.com>)
Responses Re: Re: Toast,bytea, Text -blob all confusing
List pgsql-hackers
Thanks you your description, I have added a bytea section to the docs.

Patch attached.


> > > >For bytea, follow this rule: to escape a null character, use this:
> > > >'\\0'. To escape a backslash, use this: '\\\\'.
> > > >
> > > >Same idea to unescape data.
> > >
> > > Are there other characters that need to be escaped? I suspect there are
> > > more characters that need to be escaped - ctrl chars? single quotes?.
> Why
> > > four backslashes for one? Is there a definitive documentation anywhere
> for
> > > what bytea is _supposed_ (not what it might actually be) to be and how
> it
> > > is to be handled?
> >
> > Yes, sorry for being unclear on this one. Here's a more detailed
> > explanation: Bytea is just a stream of data. On input, it must follow C
> > escaping conventions, on output, it will be escaped using C escaping
> > conventions.
> >
> > However, there's a trap: before things get to bytea input handler, they
> > are first processed by postgresql general parser. Hence, the string \\0
> > given from psql will become \0 when it gets to bytea input handler. String
> > \\\\ will become \\. All non-printable characters must be escaped like
> > this: \\(octal of character), for ex, chr(255) must be presented as \\377.
> > (If you want, you can also use this as an alternative and more generic way
> > to escape a backslash, \\134). Single quote must be escaped either as \\47
> > or as \'. Note the single backslash. Why only one? Because bytea parser
> > doesn't care about single quotes and you only need to escape it for the
> > postgresql parser...
> >
> > So, just keep in mind the double-parsing of input and you'll be safe.
> >
> > > Also why wouldn't escaping stuff like this work with TEXT then? If a
> null
> > > is going to be backslash backslash zero, and come out the same way, it
> sure
> > > looks like TEXT to me :). OK so there's this thing about storage. So
> maybe
> > Because text is null-terminated, can't have a null inside.
> >
> > > I could save a byte by just converting nulls to backslash zero and real
> > > backslashes to backslash backslash. Tada.
> > If you do that, you'll break ordering/comparison. Bytea in memory is
> > stored EXACTLY the way input string was, without any escaping, hence, all
> > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).
> >
> > With your representation, comparisons will fail, because in memory, data
> > is escaped using some escaping convention that you made up.
> >
> > > OK it's probably not the same, but having to put four backslashes when
> two
> > > should be enough to quote one makes me rather puzzled and uneasy.
> > Double parsing, hence double escaping.
>
> Great explanation Alex --thanks! I'll add a bit:
>
> I've done about 400,000 inserts and subsequent queries to verify that, from
> PHP at least, only four charaters need to be escaped. The inserts were 20
> byte strings gotten by concatenating some known text with a counter in a
> loop, and then producing  a SHA-1 hash. This produces very uniformly
> distributed binary data. Then I ran the same loop, except I queried for the
> inserted strings. I'm quite confident from this experiment that binary can
> reliably be inserted via standard SQL when these four characters are
> escaped. Here's the run down:
>
> \\000    First slash is consumed by the general parser, leaving \000 for the
> byteain function. If you only use one '\', the general parser converts the
> character into a true '\0' byte, and the byteain function sees this byte as
> the string terminator and stops. This causes the input string to be
> truncated (which seems to confuse many people).
>
> \\012    In my early tests 0x0a (LF) was getting converted to 0x20 (space).
> I think this was happening during PHP's parsing, but I'm still not sure.
> I'll dig into this some more later.
>
> \\047    As Alex mentioned, the byteain function doesn't treat this as a
> special character, but of course the general parser does as this is a single
> quote. It also works fine to escape it as \', I just prefer to use all
> octals.
>
> \\134    Both the general parser and the byteain function see this (a single
> \) as the special escape character. Therefore the general parser turns \\\\
> into \\, and the byteain function turns \\ into \ for actual storage. Again,
> I prefer to use the octal representation instead.
>
> I hope this helps reduce the concerns and confusion over bytea. If anyone
> can help explain why my linefeeds were getting converted to spaces, all the
> mysteries would be explained ;-)
>
> -- Joe
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Index: doc/src/sgml/datatype.sgml
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/doc/src/sgml/datatype.sgml,v
retrieving revision 1.60
diff -c -r1.60 datatype.sgml
*** doc/src/sgml/datatype.sgml    2001/08/31 01:55:25    1.60
--- doc/src/sgml/datatype.sgml    2001/09/04 03:15:49
***************
*** 84,89 ****
--- 84,95 ----
        </row>

        <row>
+        <entry><type>bytea</type></entry>
+        <entry></entry>
+        <entry>binary data</entry>
+       </row>
+
+       <row>
         <entry><type>character(<replaceable>n</replaceable>)</type></entry>
         <entry><type>char(<replaceable>n</replaceable>)</type></entry>
         <entry>fixed-length character string</entry>
***************
*** 782,788 ****
      <entry>text</entry>
      <entry>Variable unlimited length</entry>
         </row>
!       </tbody>
       </tgroup>
      </table>

--- 788,798 ----
      <entry>text</entry>
      <entry>Variable unlimited length</entry>
         </row>
!         <row>
!     <entry>bytea</entry>
!     <entry>binary data</entry>
!        </row>
!      </tbody>
       </tgroup>
      </table>

***************
*** 827,832 ****
--- 837,855 ----
      does not require an explicit declared upper limit on the size of
      the string.  Although the type <type>text</type> is not in the SQL
      standard, many other RDBMS packages have it as well.
+    </para>
+
+    <para>
+     The <type>bytea</type> data type allows storage of binary data,
+     specifically allowing storage of NULLs which are entered as
+     <literal>'\\000'</>. The first backslash is interpreted by the
+     single quotes, and the second is recognized by <type>bytea</> and
+     preceeds a three digit octal value. For a similar reason, a
+     backslash must be entered into a field as <literal>'\\\\'</> or
+     <literal>'\\134'</>. You may also have to escape line feeds and
+     carriage return if your interface automatically translates these. It
+     can store values of any length. <type>Bytea</> is a non-standard
+     data type.
     </para>

     <para>

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Another pgindent request
Next
From: "Joe Conway"
Date:
Subject: Re: Bytea/Base64 encoders for libpq - interested?