Re: Re: Toast,bytea, Text -blob all confusing - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Re: Toast,bytea, Text -blob all confusing |
Date | |
Msg-id | 200109040319.f843JJ016358@candle.pha.pa.us Whole thread Raw |
In response to | Re: Re: Toast,bytea, Text -blob all confusing ("Joe Conway" <joseph.conway@home.com>) |
Responses |
Re: Re: Toast,bytea, Text -blob all confusing
|
List | pgsql-hackers |
Thanks you your description, I have added a bytea section to the docs. Patch attached. > > > >For bytea, follow this rule: to escape a null character, use this: > > > >'\\0'. To escape a backslash, use this: '\\\\'. > > > > > > > >Same idea to unescape data. > > > > > > Are there other characters that need to be escaped? I suspect there are > > > more characters that need to be escaped - ctrl chars? single quotes?. > Why > > > four backslashes for one? Is there a definitive documentation anywhere > for > > > what bytea is _supposed_ (not what it might actually be) to be and how > it > > > is to be handled? > > > > Yes, sorry for being unclear on this one. Here's a more detailed > > explanation: Bytea is just a stream of data. On input, it must follow C > > escaping conventions, on output, it will be escaped using C escaping > > conventions. > > > > However, there's a trap: before things get to bytea input handler, they > > are first processed by postgresql general parser. Hence, the string \\0 > > given from psql will become \0 when it gets to bytea input handler. String > > \\\\ will become \\. All non-printable characters must be escaped like > > this: \\(octal of character), for ex, chr(255) must be presented as \\377. > > (If you want, you can also use this as an alternative and more generic way > > to escape a backslash, \\134). Single quote must be escaped either as \\47 > > or as \'. Note the single backslash. Why only one? Because bytea parser > > doesn't care about single quotes and you only need to escape it for the > > postgresql parser... > > > > So, just keep in mind the double-parsing of input and you'll be safe. > > > > > Also why wouldn't escaping stuff like this work with TEXT then? If a > null > > > is going to be backslash backslash zero, and come out the same way, it > sure > > > looks like TEXT to me :). OK so there's this thing about storage. So > maybe > > Because text is null-terminated, can't have a null inside. > > > > > I could save a byte by just converting nulls to backslash zero and real > > > backslashes to backslash backslash. Tada. > > If you do that, you'll break ordering/comparison. Bytea in memory is > > stored EXACTLY the way input string was, without any escaping, hence, all > > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). > > > > With your representation, comparisons will fail, because in memory, data > > is escaped using some escaping convention that you made up. > > > > > OK it's probably not the same, but having to put four backslashes when > two > > > should be enough to quote one makes me rather puzzled and uneasy. > > Double parsing, hence double escaping. > > Great explanation Alex --thanks! I'll add a bit: > > I've done about 400,000 inserts and subsequent queries to verify that, from > PHP at least, only four charaters need to be escaped. The inserts were 20 > byte strings gotten by concatenating some known text with a counter in a > loop, and then producing a SHA-1 hash. This produces very uniformly > distributed binary data. Then I ran the same loop, except I queried for the > inserted strings. I'm quite confident from this experiment that binary can > reliably be inserted via standard SQL when these four characters are > escaped. Here's the run down: > > \\000 First slash is consumed by the general parser, leaving \000 for the > byteain function. If you only use one '\', the general parser converts the > character into a true '\0' byte, and the byteain function sees this byte as > the string terminator and stops. This causes the input string to be > truncated (which seems to confuse many people). > > \\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space). > I think this was happening during PHP's parsing, but I'm still not sure. > I'll dig into this some more later. > > \\047 As Alex mentioned, the byteain function doesn't treat this as a > special character, but of course the general parser does as this is a single > quote. It also works fine to escape it as \', I just prefer to use all > octals. > > \\134 Both the general parser and the byteain function see this (a single > \) as the special escape character. Therefore the general parser turns \\\\ > into \\, and the byteain function turns \\ into \ for actual storage. Again, > I prefer to use the octal representation instead. > > I hope this helps reduce the concerns and confusion over bytea. If anyone > can help explain why my linefeeds were getting converted to spaces, all the > mysteries would be explained ;-) > > -- Joe > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 Index: doc/src/sgml/datatype.sgml =================================================================== RCS file: /home/projects/pgsql/cvsroot/pgsql/doc/src/sgml/datatype.sgml,v retrieving revision 1.60 diff -c -r1.60 datatype.sgml *** doc/src/sgml/datatype.sgml 2001/08/31 01:55:25 1.60 --- doc/src/sgml/datatype.sgml 2001/09/04 03:15:49 *************** *** 84,89 **** --- 84,95 ---- </row> <row> + <entry><type>bytea</type></entry> + <entry></entry> + <entry>binary data</entry> + </row> + + <row> <entry><type>character(<replaceable>n</replaceable>)</type></entry> <entry><type>char(<replaceable>n</replaceable>)</type></entry> <entry>fixed-length character string</entry> *************** *** 782,788 **** <entry>text</entry> <entry>Variable unlimited length</entry> </row> ! </tbody> </tgroup> </table> --- 788,798 ---- <entry>text</entry> <entry>Variable unlimited length</entry> </row> ! <row> ! <entry>bytea</entry> ! <entry>binary data</entry> ! </row> ! </tbody> </tgroup> </table> *************** *** 827,832 **** --- 837,855 ---- does not require an explicit declared upper limit on the size of the string. Although the type <type>text</type> is not in the SQL standard, many other RDBMS packages have it as well. + </para> + + <para> + The <type>bytea</type> data type allows storage of binary data, + specifically allowing storage of NULLs which are entered as + <literal>'\\000'</>. The first backslash is interpreted by the + single quotes, and the second is recognized by <type>bytea</> and + preceeds a three digit octal value. For a similar reason, a + backslash must be entered into a field as <literal>'\\\\'</> or + <literal>'\\134'</>. You may also have to escape line feeds and + carriage return if your interface automatically translates these. It + can store values of any length. <type>Bytea</> is a non-standard + data type. </para> <para>
pgsql-hackers by date: