Thread: Toast,bytea, Text -blob all confusing

Toast,bytea, Text -blob all confusing

From
jason.ory@ndchealth.com
Date:
I'm trying my best to convert from MySQL to PgSQL but I cant get a good
answer about 
certian questions. It was an easy task in mysql but all this talk about
, text , toast and bytea is just confusing me.
I cant get a clear picture of any of this,from the book from Bruce, the
e-mail archives. Ive looked all i can,
with every keyword i can think of from years past. Here is my situation.


WHAT I WAS DOING IN MYSQL
Via the web my clients are uploading basic  text/data files, sometimes >
than 30MB. In the past ,via CGI I have been parsing the file
into one STL string, using mysql_escape_string to escape it and then using
an INSERT  to place the   ,\'+"stlstring+"\' ,    into a BLOB column. 
I dont want to use a temp. file anywhere. The data will always be passed via
the database and buffers for certian reasons.


THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
1. I cant get a clear answer on what kind of data type to use for my large
text string?  TEXT, ???, ??? or something about TOAST
I have seen in the e-mail archive but cant find any documentaion?

2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
don't know what 
to escape and not to escape. So it keeps failing. I cand find any docs. on
what to escape either?


SUMMARY
What datatype do I use? How Im a supposed to escape it and get it though the
parser correctly so i can
retrieve it correctly?


Thnks for your time.

PS: Using RedHat.



Jason H. Ory
Medprint+
Software Developer
(205) 989-4617
jason.ory@ndchealth.com



Re: Toast,bytea, Text -blob all confusing

From
Frank Joerdens
Date:
On Thu, Aug 23, 2001 at 01:09:14PM -0400, jason.ory@ndchealth.com wrote:
> I'm trying my best to convert from MySQL to PgSQL but I cant get a good
> answer about 
> certian questions. It was an easy task in mysql but all this talk about
> , text , toast and bytea is just confusing me.
> I cant get a clear picture of any of this,from the book from Bruce, the
> e-mail archives. Ive looked all i can,
> with every keyword i can think of from years past. Here is my situation.
> 
> 
> WHAT I WAS DOING IN MYSQL
> Via the web my clients are uploading basic  text/data files, sometimes >
> than 30MB. In the past ,via CGI I have been parsing the file
> into one STL string, using mysql_escape_string to escape it and then using
> an INSERT  to place the 
>    ,\'+"stlstring+"\' ,    into a BLOB column. 
> I dont want to use a temp. file anywhere. The data will always be passed via
> the database and buffers for certian reasons.
> 
> 
> THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
> 1. I cant get a clear answer on what kind of data type to use for my large
> text string?  TEXT, ???, ??? or something about TOAST
> I have seen in the e-mail archive but cant find any documentaion?

TOAST is just a name for the mechanism/feature that is used in Postgres
>= 7.1 to overcome the (32 KB if I recall correctly) limit on the row
size in previous versions. It's completely transparent to the
programmer, i.e. if you use TEXT, for instance, you can have a row up to
1 GB in size (which is probably not practical) theoretically. The
advantage over using BLOBs is that you can search this field (with 30 MB
fieldis, if you have a few of them, this is probably not practical
either so you'd probably want to consider some full text indexing
mechanism). I'd use TEXT for this reason.

> 
> 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
> don't know what 
> to escape and not to escape. So it keeps failing. I cand find any docs. on
> what to escape either?

Hm. I don't understand why the database (using MySQL or Postgres) would
make any difference there.

Hope it helps,

Frank


Re: Toast,bytea, Text -blob all confusing

From
Alex Pilosov
Date:
On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote:

> THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
> 1. I cant get a clear answer on what kind of data type to use for my large
> text string?  TEXT, ???, ??? or something about TOAST
> I have seen in the e-mail archive but cant find any documentaion?
I would suggest bytea or blob. Blobs are well-documented in normal
documentation and in documentation of your favorite interface, so I'll
just talk about bytea.

> 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
> don't know what 
> to escape and not to escape. So it keeps failing. I cand find any docs. on
> what to escape either?
For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.




Re: Toast,bytea, Text -blob all confusing

From
Lincoln Yeoh
Date:
At 03:05 PM 27-08-2001 -0400, Alex Pilosov wrote:
>On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote:
>
>> THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
>> 1. I cant get a clear answer on what kind of data type to use for my large
>> text string?  TEXT, ???, ??? or something about TOAST
>> I have seen in the e-mail archive but cant find any documentaion?
>I would suggest bytea or blob. Blobs are well-documented in normal
>documentation and in documentation of your favorite interface, so I'll
>just talk about bytea.
>
>> 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
>> don't know what 
>> to escape and not to escape. So it keeps failing. I cand find any docs. on
>> what to escape either?
>For bytea, follow this rule: to escape a null character, use this:
>'\\0'. To escape a backslash, use this: '\\\\'.
>
>Same idea to unescape data.

Are there other characters that need to be escaped? I suspect there are
more characters that need to be escaped - ctrl chars? single quotes?. Why
four backslashes for one? Is there a definitive documentation anywhere for
what bytea is _supposed_ (not what it might actually be) to be and how it
is to be handled?

Also why wouldn't escaping stuff like this work with TEXT then? If a null
is going to be backslash backslash zero, and come out the same way, it sure
looks like TEXT to me :). OK so there's this thing about storage. So maybe
I could save a byte by just converting nulls to backslash zero and real
backslashes to backslash backslash. Tada.  

OK it's probably not the same, but having to put four backslashes when two
should be enough to quote one makes me rather puzzled and uneasy. 

Cheerio,
Link.



Re: Toast,bytea, Text -blob all confusing

From
Alex Pilosov
Date:
On Tue, 28 Aug 2001, Lincoln Yeoh wrote:

> >For bytea, follow this rule: to escape a null character, use this:
> >'\\0'. To escape a backslash, use this: '\\\\'.
> >
> >Same idea to unescape data.
> 
> Are there other characters that need to be escaped? I suspect there are
> more characters that need to be escaped - ctrl chars? single quotes?. Why
> four backslashes for one? Is there a definitive documentation anywhere for
> what bytea is _supposed_ (not what it might actually be) to be and how it
> is to be handled?

Yes, sorry for being unclear on this one. Here's a more detailed
explanation: Bytea is just a stream of data. On input, it must follow C
escaping conventions, on output, it will be escaped using C escaping
conventions. 

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser. Hence, the string \\0
given from psql will become \0 when it gets to bytea input handler. String
\\\\ will become \\. All non-printable characters must be escaped like
this: \\(octal of character), for ex, chr(255) must be presented as \\377.
(If you want, you can also use this as an alternative and more generic way
to escape a backslash, \\134). Single quote must be escaped either as \\47 
or as \'. Note the single backslash. Why only one? Because bytea parser
doesn't care about single quotes and you only need to escape it for the
postgresql parser...

So, just keep in mind the double-parsing of input and you'll be safe.

> Also why wouldn't escaping stuff like this work with TEXT then? If a null
> is going to be backslash backslash zero, and come out the same way, it sure
> looks like TEXT to me :). OK so there's this thing about storage. So maybe
Because text is null-terminated, can't have a null inside.

> I could save a byte by just converting nulls to backslash zero and real
> backslashes to backslash backslash. Tada.  
If you do that, you'll break ordering/comparison. Bytea in memory is
stored EXACTLY the way input string was, without any escaping, hence, all
comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). 

With your representation, comparisons will fail, because in memory, data
is escaped using some escaping convention that you made up.

> OK it's probably not the same, but having to put four backslashes when two
> should be enough to quote one makes me rather puzzled and uneasy. 
Double parsing, hence double escaping.

--
Alex Pilosov            | http://www.acedsl.com/home.html
CTO - Acecape, Inc.     | AceDSL:The best ADSL in the world
325 W 38 St. Suite 1005 | (Stealth Marketing Works! :)
New York, NY 10018      |



Re: Re: Toast,bytea, Text -blob all confusing

From
"Joe Conway"
Date:
> > >For bytea, follow this rule: to escape a null character, use this:
> > >'\\0'. To escape a backslash, use this: '\\\\'.
> > >
> > >Same idea to unescape data.
> >
> > Are there other characters that need to be escaped? I suspect there are
> > more characters that need to be escaped - ctrl chars? single quotes?.
Why
> > four backslashes for one? Is there a definitive documentation anywhere
for
> > what bytea is _supposed_ (not what it might actually be) to be and how
it
> > is to be handled?
>
> Yes, sorry for being unclear on this one. Here's a more detailed
> explanation: Bytea is just a stream of data. On input, it must follow C
> escaping conventions, on output, it will be escaped using C escaping
> conventions.
>
> However, there's a trap: before things get to bytea input handler, they
> are first processed by postgresql general parser. Hence, the string \\0
> given from psql will become \0 when it gets to bytea input handler. String
> \\\\ will become \\. All non-printable characters must be escaped like
> this: \\(octal of character), for ex, chr(255) must be presented as \\377.
> (If you want, you can also use this as an alternative and more generic way
> to escape a backslash, \\134). Single quote must be escaped either as \\47
> or as \'. Note the single backslash. Why only one? Because bytea parser
> doesn't care about single quotes and you only need to escape it for the
> postgresql parser...
>
> So, just keep in mind the double-parsing of input and you'll be safe.
>
> > Also why wouldn't escaping stuff like this work with TEXT then? If a
null
> > is going to be backslash backslash zero, and come out the same way, it
sure
> > looks like TEXT to me :). OK so there's this thing about storage. So
maybe
> Because text is null-terminated, can't have a null inside.
>
> > I could save a byte by just converting nulls to backslash zero and real
> > backslashes to backslash backslash. Tada.
> If you do that, you'll break ordering/comparison. Bytea in memory is
> stored EXACTLY the way input string was, without any escaping, hence, all
> comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).
>
> With your representation, comparisons will fail, because in memory, data
> is escaped using some escaping convention that you made up.
>
> > OK it's probably not the same, but having to put four backslashes when
two
> > should be enough to quote one makes me rather puzzled and uneasy.
> Double parsing, hence double escaping.

Great explanation Alex --thanks! I'll add a bit:

I've done about 400,000 inserts and subsequent queries to verify that, from
PHP at least, only four charaters need to be escaped. The inserts were 20
byte strings gotten by concatenating some known text with a counter in a
loop, and then producing  a SHA-1 hash. This produces very uniformly
distributed binary data. Then I ran the same loop, except I queried for the
inserted strings. I'm quite confident from this experiment that binary can
reliably be inserted via standard SQL when these four characters are
escaped. Here's the run down:

\\000    First slash is consumed by the general parser, leaving \000 for the
byteain function. If you only use one '\', the general parser converts the
character into a true '\0' byte, and the byteain function sees this byte as
the string terminator and stops. This causes the input string to be
truncated (which seems to confuse many people).

\\012    In my early tests 0x0a (LF) was getting converted to 0x20 (space).
I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

\\047    As Alex mentioned, the byteain function doesn't treat this as a
special character, but of course the general parser does as this is a single
quote. It also works fine to escape it as \', I just prefer to use all
octals.

\\134    Both the general parser and the byteain function see this (a single
\) as the special escape character. Therefore the general parser turns \\\\
into \\, and the byteain function turns \\ into \ for actual storage. Again,
I prefer to use the octal representation instead.

I hope this helps reduce the concerns and confusion over bytea. If anyone
can help explain why my linefeeds were getting converted to spaces, all the
mysteries would be explained ;-)

-- Joe




RE: Re: Toast,bytea, Text -blob all confusing

From
"Zeugswetter Andreas SB SD"
Date:
> > > >For bytea, follow this rule: to escape a null character, use
this:
> > > >'\\0'. To escape a backslash, use this: '\\\\'.

Can anybody explain in technical terms why this is implemented 
so inconveniently ?

Since bytea is probably not very common among users yet
we could imho still change it to not do double escapes.

Imho we need to decide where to do the escaping,
eighter in the parser or in the input functions.

I think actually the backend parser has no business changing
constants, he is imho only allowed to parse it, so he knows 
where a constant begins, and where it ends.    

Andreas


Re: Re: Toast,bytea, Text -blob all confusing

From
Hannu Krosing
Date:
Zeugswetter Andreas SB SD wrote:
> 
> > > > >For bytea, follow this rule: to escape a null character, use
> this:
> > > > >'\\0'. To escape a backslash, use this: '\\\\'.
> 
> Can anybody explain in technical terms why this is implemented
> so inconveniently ?

I think that this has to to with making textin and textout behave 
symmetrically, and the requirement that textout must produce a 
valid C-string for ASCII transfer format.

> Since bytea is probably not very common among users yet
> we could imho still change it to not do double escapes.

But how ?

> Imho we need to decide where to do the escaping,
> eighter in the parser or in the input functions.

It would be probably hard to make the parser to _not_ unescape some 
types, as it does not yet know it

> I think actually the backend parser has no business changing
> constants, he is imho only allowed to parse it, so he knows
> where a constant begins, and where it ends.

If it is any consolation then you have to write the inset of 
a single \ from shell command so:

> psql -c "insert into t values('\\\\\\\\')"

;)

------------------
Hannu


Re: Re: Toast,bytea, Text -blob all confusing

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> I think actually the backend parser has no business changing
> constants, he is imho only allowed to parse it, so he knows 
> where a constant begins, and where it ends.    

How do you propose to handle embedded quote marks in literals,
if there is no parser-level escape convention?

Don't suggest a type-specific escape convention; at the time the
parser runs, it's impossible to know what type the literal will
turn out to be.
        regards, tom lane


Re: Re: Toast,bytea, Text -blob all confusing

From
Bruce Momjian
Date:
Thanks you your description, I have added a bytea section to the docs.

Patch attached.


> > > >For bytea, follow this rule: to escape a null character, use this:
> > > >'\\0'. To escape a backslash, use this: '\\\\'.
> > > >
> > > >Same idea to unescape data.
> > >
> > > Are there other characters that need to be escaped? I suspect there are
> > > more characters that need to be escaped - ctrl chars? single quotes?.
> Why
> > > four backslashes for one? Is there a definitive documentation anywhere
> for
> > > what bytea is _supposed_ (not what it might actually be) to be and how
> it
> > > is to be handled?
> >
> > Yes, sorry for being unclear on this one. Here's a more detailed
> > explanation: Bytea is just a stream of data. On input, it must follow C
> > escaping conventions, on output, it will be escaped using C escaping
> > conventions.
> >
> > However, there's a trap: before things get to bytea input handler, they
> > are first processed by postgresql general parser. Hence, the string \\0
> > given from psql will become \0 when it gets to bytea input handler. String
> > \\\\ will become \\. All non-printable characters must be escaped like
> > this: \\(octal of character), for ex, chr(255) must be presented as \\377.
> > (If you want, you can also use this as an alternative and more generic way
> > to escape a backslash, \\134). Single quote must be escaped either as \\47
> > or as \'. Note the single backslash. Why only one? Because bytea parser
> > doesn't care about single quotes and you only need to escape it for the
> > postgresql parser...
> >
> > So, just keep in mind the double-parsing of input and you'll be safe.
> >
> > > Also why wouldn't escaping stuff like this work with TEXT then? If a
> null
> > > is going to be backslash backslash zero, and come out the same way, it
> sure
> > > looks like TEXT to me :). OK so there's this thing about storage. So
> maybe
> > Because text is null-terminated, can't have a null inside.
> >
> > > I could save a byte by just converting nulls to backslash zero and real
> > > backslashes to backslash backslash. Tada.
> > If you do that, you'll break ordering/comparison. Bytea in memory is
> > stored EXACTLY the way input string was, without any escaping, hence, all
> > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).
> >
> > With your representation, comparisons will fail, because in memory, data
> > is escaped using some escaping convention that you made up.
> >
> > > OK it's probably not the same, but having to put four backslashes when
> two
> > > should be enough to quote one makes me rather puzzled and uneasy.
> > Double parsing, hence double escaping.
>
> Great explanation Alex --thanks! I'll add a bit:
>
> I've done about 400,000 inserts and subsequent queries to verify that, from
> PHP at least, only four charaters need to be escaped. The inserts were 20
> byte strings gotten by concatenating some known text with a counter in a
> loop, and then producing  a SHA-1 hash. This produces very uniformly
> distributed binary data. Then I ran the same loop, except I queried for the
> inserted strings. I'm quite confident from this experiment that binary can
> reliably be inserted via standard SQL when these four characters are
> escaped. Here's the run down:
>
> \\000    First slash is consumed by the general parser, leaving \000 for the
> byteain function. If you only use one '\', the general parser converts the
> character into a true '\0' byte, and the byteain function sees this byte as
> the string terminator and stops. This causes the input string to be
> truncated (which seems to confuse many people).
>
> \\012    In my early tests 0x0a (LF) was getting converted to 0x20 (space).
> I think this was happening during PHP's parsing, but I'm still not sure.
> I'll dig into this some more later.
>
> \\047    As Alex mentioned, the byteain function doesn't treat this as a
> special character, but of course the general parser does as this is a single
> quote. It also works fine to escape it as \', I just prefer to use all
> octals.
>
> \\134    Both the general parser and the byteain function see this (a single
> \) as the special escape character. Therefore the general parser turns \\\\
> into \\, and the byteain function turns \\ into \ for actual storage. Again,
> I prefer to use the octal representation instead.
>
> I hope this helps reduce the concerns and confusion over bytea. If anyone
> can help explain why my linefeeds were getting converted to spaces, all the
> mysteries would be explained ;-)
>
> -- Joe
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Index: doc/src/sgml/datatype.sgml
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/doc/src/sgml/datatype.sgml,v
retrieving revision 1.60
diff -c -r1.60 datatype.sgml
*** doc/src/sgml/datatype.sgml    2001/08/31 01:55:25    1.60
--- doc/src/sgml/datatype.sgml    2001/09/04 03:15:49
***************
*** 84,89 ****
--- 84,95 ----
        </row>

        <row>
+        <entry><type>bytea</type></entry>
+        <entry></entry>
+        <entry>binary data</entry>
+       </row>
+
+       <row>
         <entry><type>character(<replaceable>n</replaceable>)</type></entry>
         <entry><type>char(<replaceable>n</replaceable>)</type></entry>
         <entry>fixed-length character string</entry>
***************
*** 782,788 ****
      <entry>text</entry>
      <entry>Variable unlimited length</entry>
         </row>
!       </tbody>
       </tgroup>
      </table>

--- 788,798 ----
      <entry>text</entry>
      <entry>Variable unlimited length</entry>
         </row>
!         <row>
!     <entry>bytea</entry>
!     <entry>binary data</entry>
!        </row>
!      </tbody>
       </tgroup>
      </table>

***************
*** 827,832 ****
--- 837,855 ----
      does not require an explicit declared upper limit on the size of
      the string.  Although the type <type>text</type> is not in the SQL
      standard, many other RDBMS packages have it as well.
+    </para>
+
+    <para>
+     The <type>bytea</type> data type allows storage of binary data,
+     specifically allowing storage of NULLs which are entered as
+     <literal>'\\000'</>. The first backslash is interpreted by the
+     single quotes, and the second is recognized by <type>bytea</> and
+     preceeds a three digit octal value. For a similar reason, a
+     backslash must be entered into a field as <literal>'\\\\'</> or
+     <literal>'\\134'</>. You may also have to escape line feeds and
+     carriage return if your interface automatically translates these. It
+     can store values of any length. <type>Bytea</> is a non-standard
+     data type.
     </para>

     <para>

Re: Re: Toast,bytea, Text -blob all confusing

From
"Joe Conway"
Date:
> >
> > \\012    In my early tests 0x0a (LF) was getting converted to 0x20
(space).
> > I think this was happening during PHP's parsing, but I'm still not sure.
> > I'll dig into this some more later.
> >

<redfaced>   The script I was using in PHP *explicitly* converted all linefeeds to
spaces. Mystery solved.
</redfaced>

I think Bruce's text still works though.

-- Joe




Re: Re: Toast,bytea, Text -blob all confusing

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> However, there's a trap: before things get to bytea input handler, they
> are first processed by postgresql general parser.

This description fails to make clear that the two levels of parsing only
apply for datums that are written as string literals in SQL commands.
An example where this doesn't apply is COPY input data.
        regards, tom lane


Re: Re: Toast,bytea, Text -blob all confusing

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > However, there's a trap: before things get to bytea input handler, they
> > are first processed by postgresql general parser.
> 
> This description fails to make clear that the two levels of parsing only
> apply for datums that are written as string literals in SQL commands.
> An example where this doesn't apply is COPY input data.

Are you talking about my SGML changes? I clearly mention quote-handling
and bytea handling, which pretty clearly not apply in COPY.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Re: Toast,bytea, Text -blob all confusing

From
Bruce Momjian
Date:
> > >
> > > \\012    In my early tests 0x0a (LF) was getting converted to 0x20
> (space).
> > > I think this was happening during PHP's parsing, but I'm still not sure.
> > > I'll dig into this some more later.
> > >
> 
> <redfaced>
>     The script I was using in PHP *explicitly* converted all linefeeds to
> spaces. Mystery solved.
> </redfaced>
> 
> I think Bruce's text still works though.

I can see other interfaces doing fancy things with newlines and carriage
returns so I added it to the docs.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026