Thread: Toast,bytea, Text -blob all confusing
I'm trying my best to convert from MySQL to PgSQL but I cant get a good answer about certian questions. It was an easy task in mysql but all this talk about , text , toast and bytea is just confusing me. I cant get a clear picture of any of this,from the book from Bruce, the e-mail archives. Ive looked all i can, with every keyword i can think of from years past. Here is my situation. WHAT I WAS DOING IN MYSQL Via the web my clients are uploading basic text/data files, sometimes > than 30MB. In the past ,via CGI I have been parsing the file into one STL string, using mysql_escape_string to escape it and then using an INSERT to place the ,\'+"stlstring+"\' , into a BLOB column. I dont want to use a temp. file anywhere. The data will always be passed via the database and buffers for certian reasons. THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL 1. I cant get a clear answer on what kind of data type to use for my large text string? TEXT, ???, ??? or something about TOAST I have seen in the e-mail archive but cant find any documentaion? 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i don't know what to escape and not to escape. So it keeps failing. I cand find any docs. on what to escape either? SUMMARY What datatype do I use? How Im a supposed to escape it and get it though the parser correctly so i can retrieve it correctly? Thnks for your time. PS: Using RedHat. Jason H. Ory Medprint+ Software Developer (205) 989-4617 jason.ory@ndchealth.com
On Thu, Aug 23, 2001 at 01:09:14PM -0400, jason.ory@ndchealth.com wrote: > I'm trying my best to convert from MySQL to PgSQL but I cant get a good > answer about > certian questions. It was an easy task in mysql but all this talk about > , text , toast and bytea is just confusing me. > I cant get a clear picture of any of this,from the book from Bruce, the > e-mail archives. Ive looked all i can, > with every keyword i can think of from years past. Here is my situation. > > > WHAT I WAS DOING IN MYSQL > Via the web my clients are uploading basic text/data files, sometimes > > than 30MB. In the past ,via CGI I have been parsing the file > into one STL string, using mysql_escape_string to escape it and then using > an INSERT to place the > ,\'+"stlstring+"\' , into a BLOB column. > I dont want to use a temp. file anywhere. The data will always be passed via > the database and buffers for certian reasons. > > > THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL > 1. I cant get a clear answer on what kind of data type to use for my large > text string? TEXT, ???, ??? or something about TOAST > I have seen in the e-mail archive but cant find any documentaion? TOAST is just a name for the mechanism/feature that is used in Postgres >= 7.1 to overcome the (32 KB if I recall correctly) limit on the row size in previous versions. It's completely transparent to the programmer, i.e. if you use TEXT, for instance, you can have a row up to 1 GB in size (which is probably not practical) theoretically. The advantage over using BLOBs is that you can search this field (with 30 MB fieldis, if you have a few of them, this is probably not practical either so you'd probably want to consider some full text indexing mechanism). I'd use TEXT for this reason. > > 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i > don't know what > to escape and not to escape. So it keeps failing. I cand find any docs. on > what to escape either? Hm. I don't understand why the database (using MySQL or Postgres) would make any difference there. Hope it helps, Frank
On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote: > THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL > 1. I cant get a clear answer on what kind of data type to use for my large > text string? TEXT, ???, ??? or something about TOAST > I have seen in the e-mail archive but cant find any documentaion? I would suggest bytea or blob. Blobs are well-documented in normal documentation and in documentation of your favorite interface, so I'll just talk about bytea. > 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i > don't know what > to escape and not to escape. So it keeps failing. I cand find any docs. on > what to escape either? For bytea, follow this rule: to escape a null character, use this: '\\0'. To escape a backslash, use this: '\\\\'. Same idea to unescape data.
At 03:05 PM 27-08-2001 -0400, Alex Pilosov wrote: >On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote: > >> THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL >> 1. I cant get a clear answer on what kind of data type to use for my large >> text string? TEXT, ???, ??? or something about TOAST >> I have seen in the e-mail archive but cant find any documentaion? >I would suggest bytea or blob. Blobs are well-documented in normal >documentation and in documentation of your favorite interface, so I'll >just talk about bytea. > >> 2. I've written my own escape method ,("cant find one for Pgsql") , BUT i >> don't know what >> to escape and not to escape. So it keeps failing. I cand find any docs. on >> what to escape either? >For bytea, follow this rule: to escape a null character, use this: >'\\0'. To escape a backslash, use this: '\\\\'. > >Same idea to unescape data. Are there other characters that need to be escaped? I suspect there are more characters that need to be escaped - ctrl chars? single quotes?. Why four backslashes for one? Is there a definitive documentation anywhere for what bytea is _supposed_ (not what it might actually be) to be and how it is to be handled? Also why wouldn't escaping stuff like this work with TEXT then? If a null is going to be backslash backslash zero, and come out the same way, it sure looks like TEXT to me :). OK so there's this thing about storage. So maybe I could save a byte by just converting nulls to backslash zero and real backslashes to backslash backslash. Tada. OK it's probably not the same, but having to put four backslashes when two should be enough to quote one makes me rather puzzled and uneasy. Cheerio, Link.
On Tue, 28 Aug 2001, Lincoln Yeoh wrote: > >For bytea, follow this rule: to escape a null character, use this: > >'\\0'. To escape a backslash, use this: '\\\\'. > > > >Same idea to unescape data. > > Are there other characters that need to be escaped? I suspect there are > more characters that need to be escaped - ctrl chars? single quotes?. Why > four backslashes for one? Is there a definitive documentation anywhere for > what bytea is _supposed_ (not what it might actually be) to be and how it > is to be handled? Yes, sorry for being unclear on this one. Here's a more detailed explanation: Bytea is just a stream of data. On input, it must follow C escaping conventions, on output, it will be escaped using C escaping conventions. However, there's a trap: before things get to bytea input handler, they are first processed by postgresql general parser. Hence, the string \\0 given from psql will become \0 when it gets to bytea input handler. String \\\\ will become \\. All non-printable characters must be escaped like this: \\(octal of character), for ex, chr(255) must be presented as \\377. (If you want, you can also use this as an alternative and more generic way to escape a backslash, \\134). Single quote must be escaped either as \\47 or as \'. Note the single backslash. Why only one? Because bytea parser doesn't care about single quotes and you only need to escape it for the postgresql parser... So, just keep in mind the double-parsing of input and you'll be safe. > Also why wouldn't escaping stuff like this work with TEXT then? If a null > is going to be backslash backslash zero, and come out the same way, it sure > looks like TEXT to me :). OK so there's this thing about storage. So maybe Because text is null-terminated, can't have a null inside. > I could save a byte by just converting nulls to backslash zero and real > backslashes to backslash backslash. Tada. If you do that, you'll break ordering/comparison. Bytea in memory is stored EXACTLY the way input string was, without any escaping, hence, all comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). With your representation, comparisons will fail, because in memory, data is escaped using some escaping convention that you made up. > OK it's probably not the same, but having to put four backslashes when two > should be enough to quote one makes me rather puzzled and uneasy. Double parsing, hence double escaping. -- Alex Pilosov | http://www.acedsl.com/home.html CTO - Acecape, Inc. | AceDSL:The best ADSL in the world 325 W 38 St. Suite 1005 | (Stealth Marketing Works! :) New York, NY 10018 |
> > >For bytea, follow this rule: to escape a null character, use this: > > >'\\0'. To escape a backslash, use this: '\\\\'. > > > > > >Same idea to unescape data. > > > > Are there other characters that need to be escaped? I suspect there are > > more characters that need to be escaped - ctrl chars? single quotes?. Why > > four backslashes for one? Is there a definitive documentation anywhere for > > what bytea is _supposed_ (not what it might actually be) to be and how it > > is to be handled? > > Yes, sorry for being unclear on this one. Here's a more detailed > explanation: Bytea is just a stream of data. On input, it must follow C > escaping conventions, on output, it will be escaped using C escaping > conventions. > > However, there's a trap: before things get to bytea input handler, they > are first processed by postgresql general parser. Hence, the string \\0 > given from psql will become \0 when it gets to bytea input handler. String > \\\\ will become \\. All non-printable characters must be escaped like > this: \\(octal of character), for ex, chr(255) must be presented as \\377. > (If you want, you can also use this as an alternative and more generic way > to escape a backslash, \\134). Single quote must be escaped either as \\47 > or as \'. Note the single backslash. Why only one? Because bytea parser > doesn't care about single quotes and you only need to escape it for the > postgresql parser... > > So, just keep in mind the double-parsing of input and you'll be safe. > > > Also why wouldn't escaping stuff like this work with TEXT then? If a null > > is going to be backslash backslash zero, and come out the same way, it sure > > looks like TEXT to me :). OK so there's this thing about storage. So maybe > Because text is null-terminated, can't have a null inside. > > > I could save a byte by just converting nulls to backslash zero and real > > backslashes to backslash backslash. Tada. > If you do that, you'll break ordering/comparison. Bytea in memory is > stored EXACTLY the way input string was, without any escaping, hence, all > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). > > With your representation, comparisons will fail, because in memory, data > is escaped using some escaping convention that you made up. > > > OK it's probably not the same, but having to put four backslashes when two > > should be enough to quote one makes me rather puzzled and uneasy. > Double parsing, hence double escaping. Great explanation Alex --thanks! I'll add a bit: I've done about 400,000 inserts and subsequent queries to verify that, from PHP at least, only four charaters need to be escaped. The inserts were 20 byte strings gotten by concatenating some known text with a counter in a loop, and then producing a SHA-1 hash. This produces very uniformly distributed binary data. Then I ran the same loop, except I queried for the inserted strings. I'm quite confident from this experiment that binary can reliably be inserted via standard SQL when these four characters are escaped. Here's the run down: \\000 First slash is consumed by the general parser, leaving \000 for the byteain function. If you only use one '\', the general parser converts the character into a true '\0' byte, and the byteain function sees this byte as the string terminator and stops. This causes the input string to be truncated (which seems to confuse many people). \\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space). I think this was happening during PHP's parsing, but I'm still not sure. I'll dig into this some more later. \\047 As Alex mentioned, the byteain function doesn't treat this as a special character, but of course the general parser does as this is a single quote. It also works fine to escape it as \', I just prefer to use all octals. \\134 Both the general parser and the byteain function see this (a single \) as the special escape character. Therefore the general parser turns \\\\ into \\, and the byteain function turns \\ into \ for actual storage. Again, I prefer to use the octal representation instead. I hope this helps reduce the concerns and confusion over bytea. If anyone can help explain why my linefeeds were getting converted to spaces, all the mysteries would be explained ;-) -- Joe
> > > >For bytea, follow this rule: to escape a null character, use this: > > > >'\\0'. To escape a backslash, use this: '\\\\'. Can anybody explain in technical terms why this is implemented so inconveniently ? Since bytea is probably not very common among users yet we could imho still change it to not do double escapes. Imho we need to decide where to do the escaping, eighter in the parser or in the input functions. I think actually the backend parser has no business changing constants, he is imho only allowed to parse it, so he knows where a constant begins, and where it ends. Andreas
Zeugswetter Andreas SB SD wrote: > > > > > >For bytea, follow this rule: to escape a null character, use > this: > > > > >'\\0'. To escape a backslash, use this: '\\\\'. > > Can anybody explain in technical terms why this is implemented > so inconveniently ? I think that this has to to with making textin and textout behave symmetrically, and the requirement that textout must produce a valid C-string for ASCII transfer format. > Since bytea is probably not very common among users yet > we could imho still change it to not do double escapes. But how ? > Imho we need to decide where to do the escaping, > eighter in the parser or in the input functions. It would be probably hard to make the parser to _not_ unescape some types, as it does not yet know it > I think actually the backend parser has no business changing > constants, he is imho only allowed to parse it, so he knows > where a constant begins, and where it ends. If it is any consolation then you have to write the inset of a single \ from shell command so: > psql -c "insert into t values('\\\\\\\\')" ;) ------------------ Hannu
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > I think actually the backend parser has no business changing > constants, he is imho only allowed to parse it, so he knows > where a constant begins, and where it ends. How do you propose to handle embedded quote marks in literals, if there is no parser-level escape convention? Don't suggest a type-specific escape convention; at the time the parser runs, it's impossible to know what type the literal will turn out to be. regards, tom lane
Thanks you your description, I have added a bytea section to the docs. Patch attached. > > > >For bytea, follow this rule: to escape a null character, use this: > > > >'\\0'. To escape a backslash, use this: '\\\\'. > > > > > > > >Same idea to unescape data. > > > > > > Are there other characters that need to be escaped? I suspect there are > > > more characters that need to be escaped - ctrl chars? single quotes?. > Why > > > four backslashes for one? Is there a definitive documentation anywhere > for > > > what bytea is _supposed_ (not what it might actually be) to be and how > it > > > is to be handled? > > > > Yes, sorry for being unclear on this one. Here's a more detailed > > explanation: Bytea is just a stream of data. On input, it must follow C > > escaping conventions, on output, it will be escaped using C escaping > > conventions. > > > > However, there's a trap: before things get to bytea input handler, they > > are first processed by postgresql general parser. Hence, the string \\0 > > given from psql will become \0 when it gets to bytea input handler. String > > \\\\ will become \\. All non-printable characters must be escaped like > > this: \\(octal of character), for ex, chr(255) must be presented as \\377. > > (If you want, you can also use this as an alternative and more generic way > > to escape a backslash, \\134). Single quote must be escaped either as \\47 > > or as \'. Note the single backslash. Why only one? Because bytea parser > > doesn't care about single quotes and you only need to escape it for the > > postgresql parser... > > > > So, just keep in mind the double-parsing of input and you'll be safe. > > > > > Also why wouldn't escaping stuff like this work with TEXT then? If a > null > > > is going to be backslash backslash zero, and come out the same way, it > sure > > > looks like TEXT to me :). OK so there's this thing about storage. So > maybe > > Because text is null-terminated, can't have a null inside. > > > > > I could save a byte by just converting nulls to backslash zero and real > > > backslashes to backslash backslash. Tada. > > If you do that, you'll break ordering/comparison. Bytea in memory is > > stored EXACTLY the way input string was, without any escaping, hence, all > > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). > > > > With your representation, comparisons will fail, because in memory, data > > is escaped using some escaping convention that you made up. > > > > > OK it's probably not the same, but having to put four backslashes when > two > > > should be enough to quote one makes me rather puzzled and uneasy. > > Double parsing, hence double escaping. > > Great explanation Alex --thanks! I'll add a bit: > > I've done about 400,000 inserts and subsequent queries to verify that, from > PHP at least, only four charaters need to be escaped. The inserts were 20 > byte strings gotten by concatenating some known text with a counter in a > loop, and then producing a SHA-1 hash. This produces very uniformly > distributed binary data. Then I ran the same loop, except I queried for the > inserted strings. I'm quite confident from this experiment that binary can > reliably be inserted via standard SQL when these four characters are > escaped. Here's the run down: > > \\000 First slash is consumed by the general parser, leaving \000 for the > byteain function. If you only use one '\', the general parser converts the > character into a true '\0' byte, and the byteain function sees this byte as > the string terminator and stops. This causes the input string to be > truncated (which seems to confuse many people). > > \\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space). > I think this was happening during PHP's parsing, but I'm still not sure. > I'll dig into this some more later. > > \\047 As Alex mentioned, the byteain function doesn't treat this as a > special character, but of course the general parser does as this is a single > quote. It also works fine to escape it as \', I just prefer to use all > octals. > > \\134 Both the general parser and the byteain function see this (a single > \) as the special escape character. Therefore the general parser turns \\\\ > into \\, and the byteain function turns \\ into \ for actual storage. Again, > I prefer to use the octal representation instead. > > I hope this helps reduce the concerns and confusion over bytea. If anyone > can help explain why my linefeeds were getting converted to spaces, all the > mysteries would be explained ;-) > > -- Joe > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 Index: doc/src/sgml/datatype.sgml =================================================================== RCS file: /home/projects/pgsql/cvsroot/pgsql/doc/src/sgml/datatype.sgml,v retrieving revision 1.60 diff -c -r1.60 datatype.sgml *** doc/src/sgml/datatype.sgml 2001/08/31 01:55:25 1.60 --- doc/src/sgml/datatype.sgml 2001/09/04 03:15:49 *************** *** 84,89 **** --- 84,95 ---- </row> <row> + <entry><type>bytea</type></entry> + <entry></entry> + <entry>binary data</entry> + </row> + + <row> <entry><type>character(<replaceable>n</replaceable>)</type></entry> <entry><type>char(<replaceable>n</replaceable>)</type></entry> <entry>fixed-length character string</entry> *************** *** 782,788 **** <entry>text</entry> <entry>Variable unlimited length</entry> </row> ! </tbody> </tgroup> </table> --- 788,798 ---- <entry>text</entry> <entry>Variable unlimited length</entry> </row> ! <row> ! <entry>bytea</entry> ! <entry>binary data</entry> ! </row> ! </tbody> </tgroup> </table> *************** *** 827,832 **** --- 837,855 ---- does not require an explicit declared upper limit on the size of the string. Although the type <type>text</type> is not in the SQL standard, many other RDBMS packages have it as well. + </para> + + <para> + The <type>bytea</type> data type allows storage of binary data, + specifically allowing storage of NULLs which are entered as + <literal>'\\000'</>. The first backslash is interpreted by the + single quotes, and the second is recognized by <type>bytea</> and + preceeds a three digit octal value. For a similar reason, a + backslash must be entered into a field as <literal>'\\\\'</> or + <literal>'\\134'</>. You may also have to escape line feeds and + carriage return if your interface automatically translates these. It + can store values of any length. <type>Bytea</> is a non-standard + data type. </para> <para>
> > > > \\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space). > > I think this was happening during PHP's parsing, but I'm still not sure. > > I'll dig into this some more later. > > <redfaced> The script I was using in PHP *explicitly* converted all linefeeds to spaces. Mystery solved. </redfaced> I think Bruce's text still works though. -- Joe
Bruce Momjian <pgman@candle.pha.pa.us> writes: > However, there's a trap: before things get to bytea input handler, they > are first processed by postgresql general parser. This description fails to make clear that the two levels of parsing only apply for datums that are written as string literals in SQL commands. An example where this doesn't apply is COPY input data. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > However, there's a trap: before things get to bytea input handler, they > > are first processed by postgresql general parser. > > This description fails to make clear that the two levels of parsing only > apply for datums that are written as string literals in SQL commands. > An example where this doesn't apply is COPY input data. Are you talking about my SGML changes? I clearly mention quote-handling and bytea handling, which pretty clearly not apply in COPY. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > > > > > \\012 In my early tests 0x0a (LF) was getting converted to 0x20 > (space). > > > I think this was happening during PHP's parsing, but I'm still not sure. > > > I'll dig into this some more later. > > > > > <redfaced> > The script I was using in PHP *explicitly* converted all linefeeds to > spaces. Mystery solved. > </redfaced> > > I think Bruce's text still works though. I can see other interfaces doing fancy things with newlines and carriage returns so I added it to the docs. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026