Re: Re: Toast,bytea, Text -blob all confusing - Mailing list pgsql-hackers
From | Joe Conway |
---|---|
Subject | Re: Re: Toast,bytea, Text -blob all confusing |
Date | |
Msg-id | 013201c130b1$20d7e910$0705a8c0@jecw2k1 Whole thread Raw |
In response to | Re: Toast,bytea, Text -blob all confusing (Alex Pilosov <alex@acecape.com>) |
Responses |
Re: Re: Toast,bytea, Text -blob all confusing
|
List | pgsql-hackers |
> > >For bytea, follow this rule: to escape a null character, use this: > > >'\\0'. To escape a backslash, use this: '\\\\'. > > > > > >Same idea to unescape data. > > > > Are there other characters that need to be escaped? I suspect there are > > more characters that need to be escaped - ctrl chars? single quotes?. Why > > four backslashes for one? Is there a definitive documentation anywhere for > > what bytea is _supposed_ (not what it might actually be) to be and how it > > is to be handled? > > Yes, sorry for being unclear on this one. Here's a more detailed > explanation: Bytea is just a stream of data. On input, it must follow C > escaping conventions, on output, it will be escaped using C escaping > conventions. > > However, there's a trap: before things get to bytea input handler, they > are first processed by postgresql general parser. Hence, the string \\0 > given from psql will become \0 when it gets to bytea input handler. String > \\\\ will become \\. All non-printable characters must be escaped like > this: \\(octal of character), for ex, chr(255) must be presented as \\377. > (If you want, you can also use this as an alternative and more generic way > to escape a backslash, \\134). Single quote must be escaped either as \\47 > or as \'. Note the single backslash. Why only one? Because bytea parser > doesn't care about single quotes and you only need to escape it for the > postgresql parser... > > So, just keep in mind the double-parsing of input and you'll be safe. > > > Also why wouldn't escaping stuff like this work with TEXT then? If a null > > is going to be backslash backslash zero, and come out the same way, it sure > > looks like TEXT to me :). OK so there's this thing about storage. So maybe > Because text is null-terminated, can't have a null inside. > > > I could save a byte by just converting nulls to backslash zero and real > > backslashes to backslash backslash. Tada. > If you do that, you'll break ordering/comparison. Bytea in memory is > stored EXACTLY the way input string was, without any escaping, hence, all > comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea). > > With your representation, comparisons will fail, because in memory, data > is escaped using some escaping convention that you made up. > > > OK it's probably not the same, but having to put four backslashes when two > > should be enough to quote one makes me rather puzzled and uneasy. > Double parsing, hence double escaping. Great explanation Alex --thanks! I'll add a bit: I've done about 400,000 inserts and subsequent queries to verify that, from PHP at least, only four charaters need to be escaped. The inserts were 20 byte strings gotten by concatenating some known text with a counter in a loop, and then producing a SHA-1 hash. This produces very uniformly distributed binary data. Then I ran the same loop, except I queried for the inserted strings. I'm quite confident from this experiment that binary can reliably be inserted via standard SQL when these four characters are escaped. Here's the run down: \\000 First slash is consumed by the general parser, leaving \000 for the byteain function. If you only use one '\', the general parser converts the character into a true '\0' byte, and the byteain function sees this byte as the string terminator and stops. This causes the input string to be truncated (which seems to confuse many people). \\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space). I think this was happening during PHP's parsing, but I'm still not sure. I'll dig into this some more later. \\047 As Alex mentioned, the byteain function doesn't treat this as a special character, but of course the general parser does as this is a single quote. It also works fine to escape it as \', I just prefer to use all octals. \\134 Both the general parser and the byteain function see this (a single \) as the special escape character. Therefore the general parser turns \\\\ into \\, and the byteain function turns \\ into \ for actual storage. Again, I prefer to use the octal representation instead. I hope this helps reduce the concerns and confusion over bytea. If anyone can help explain why my linefeeds were getting converted to spaces, all the mysteries would be explained ;-) -- Joe
pgsql-hackers by date: