Thread: PQescapeBytea is not multibyte aware
PQescapebytea() is not multibyte aware and will produce bad multibyte character sequences. Example: INSERT INTO t1(bytea_col) VALUES('characters produced by PQescapebytea'); ERROR: Invalid EUC_JP character sequence found (0x8950) I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is considered as "non printable characters". Any objection? -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > PQescapebytea() is not multibyte aware and will produce bad multibyte > character sequences. Example: > I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is > considered as "non printable characters". Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent them from being mistaken for parts of multibyte sequences? Ugh, but you're probably right. It looks to me like byteaout does the reverse already. regards, tom lane
Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >>PQescapebytea() is not multibyte aware and will produce bad multibyte >>character sequences. Example: >>I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is >>considered as "non printable characters". > > > Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent > them from being mistaken for parts of multibyte sequences? Ugh, but > you're probably right. It looks to me like byteaout does the reverse > already. > But the error comes from pg_verifymbstr. Since bytea has no encoding (it's just an array of bytes afterall), why does pg_verifymbstr get applied at all to bytea data? pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it help to rewrite this as: INSERT INTO t1(bytea_col) VALUES('characters produced by PQescapebytea'::bytea); ? Joe
Joe Conway <mail@joeconway.com> writes: > But the error comes from pg_verifymbstr. Since bytea has no encoding > (it's just an array of bytes afterall), why does pg_verifymbstr get > applied at all to bytea data? Because textin() is used for the initial conversion to an "unknown" constant --- see make_const() in parse_node.c. > pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it > help to rewrite this as: > INSERT INTO t1(bytea_col) VALUES('characters produced by > PQescapebytea'::bytea); Probably that would cause the error to disappear, but it's hardly a desirable answer. I wonder whether this says that TEXT is not a good implementation of type UNKNOWN. That choice was made on the assumption that TEXT would faithfully preserve the contents of a C string ... but it seems that in the multibyte world it ain't so. It would not be a huge amount of work to write a couple more I/O routines and give UNKNOWN its own I/O behavior. OTOH, I was surprised to read your message because I had assumed the damage was being done much further upstream, viz during collection of the query string by pq_getstr(). Do we need to think twice about that processing, as well? regards, tom lane
Tom Lane wrote: >>INSERT INTO t1(bytea_col) VALUES('characters produced by >>PQescapebytea'::bytea); > > > Probably that would cause the error to disappear, but it's hardly a > desirable answer. > > I wonder whether this says that TEXT is not a good implementation of > type UNKNOWN. That choice was made on the assumption that TEXT would > faithfully preserve the contents of a C string ... but it seems that in > the multibyte world it ain't so. It would not be a huge amount of work > to write a couple more I/O routines and give UNKNOWN its own I/O > behavior. I could take a look at this. Any guidance other than "faithfully preserving the contents of a C string"? > > OTOH, I was surprised to read your message because I had assumed the > damage was being done much further upstream, viz during collection of > the query string by pq_getstr(). Do we need to think twice about that > processing, as well? I'll take a look at this as well. Joe
Joe Conway <mail@joeconway.com> writes: > I could take a look at this. Any guidance other than "faithfully > preserving the contents of a C string"? Take textin/textout, remove multibyte awareness? Actually the hard part is to figure out which of the existing hardwired calls of textin and textout would need to be replaced by calls to unknownin/unknownout. I think the assumption UNKNOWN == TEXT has crept into a fair number of places by now. regards, tom lane
Tom Lane wrote: > > OTOH, I was surprised to read your message because I had assumed the > damage was being done much further upstream, viz during collection of > the query string by pq_getstr(). Do we need to think twice about that > processing, as well? > I just looked in pq_getstr() I see: #ifdef MULTIBYTE p = (char *) pg_client_to_server((unsigned char *) s->data, s->len); if (p != s->data) /* actual conversion has been done? */ and in pg_client_to_server I see: if (ClientEncoding->encoding == DatabaseEncoding->encoding)return s; So I'm guessing that in Tatsuo's case, both client and database encoding are the same, and therefore the string was passed as-is downstream. I think you're correct that in a client/database encoding mismatch scenario, there would be bigger problems. Thoughts on this? Joe
Joe Conway <mail@joeconway.com> writes: > I think you're correct that in a client/database encoding mismatch > scenario, there would be bigger problems. Thoughts on this? This scenario is probably why Tatsuo wants PQescapeBytea to octalize everything with the high bit set; I'm not sure there's any lesser way out. Nonetheless, if UNKNOWN conversion introduces additional failures then it makes sense to fix that. regards, tom lane
Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: > >>I think you're correct that in a client/database encoding mismatch >>scenario, there would be bigger problems. Thoughts on this? > > > This scenario is probably why Tatsuo wants PQescapeBytea to octalize > everything with the high bit set; I'm not sure there's any lesser way Yuck! At that point you're no better off than converting to hex (and worse off than converting to base64) for storage. SQL99 actually defines BLOB as a binary string literal comprised of an even number of hexadecimal digits, in single quotes, preceded by "X", e.g. X'1a43fe'. Should we be looking at implementing the standard instead of, or in addition to, octalizing? Maybe it is possible to do this by creating a new datatype, BLOB, which uses new IN/OUT functions, but otherwise uses the various bytea functions? > out. Nonetheless, if UNKNOWN conversion introduces additional failures > then it makes sense to fix that. I'll follow up on this then. Joe
Joe Conway <mail@joeconway.com> writes: >> This scenario is probably why Tatsuo wants PQescapeBytea to octalize >> everything with the high bit set; I'm not sure there's any lesser way > Yuck! At that point you're no better off than converting to hex (and > worse off than converting to base64) for storage. No; the *storage* is still compact, it's just the I/O representation that's not. > SQL99 actually defines BLOB as a binary string literal comprised of an > even number of hexadecimal digits, in single quotes, preceded by "X", > e.g. X'1a43fe'. Should we be looking at implementing the standard > instead of, or in addition to, octalizing? Perhaps we should cause the system to regard hex-strings as literals of type bytea. Right now I think they're taken to be integer constants, which is clearly not per spec. regards, tom lane
Tom Lane wrote:>> Yuck! At that point you're no better off than converting to hex>> (and worse off than converting to base64)for storage.>>> No; the *storage* is still compact, it's just the I/O representation> that's not. Yeah, I realized that after I pushed send ;) But still, doesn't that mean roughly twice as much memory usage for each copy of the string? And I seem to remember Jan saying that each datum winds up having 4 copies in memory. It ends up impacting the practical length limit for a bytea value. >>>> SQL99 actually defines BLOB as a binary string literal comprised>> of an even number of hexadecimal digits, in singlequotes,>> preceded by "X", e.g. X'1a43fe'. Should we be looking at>> implementing the standard instead of, or in additionto,>> octalizing?>>> Perhaps we should cause the system to regard hex-strings as literals> of type bytea. Rightnow I think they're taken to be integer> constants, which is clearly not per spec. Wow. I didn't realize this was possible: test=# select X'ffff'; ?column? ---------- 65535 (1 row) This does clearly conflict with the spec, but what about backward compatibility? Do you think many people use this capability? Joe
Joe Conway <mail@joeconway.com> writes: > But still, doesn't that mean roughly twice as much memory usage for each > copy of the string? And I seem to remember Jan saying that each datum > winds up having 4 copies in memory. It ends up impacting the practical > length limit for a bytea value. Well, once the data actually reaches Datum form it'll be in internal representation, hence compact. I'm not sure how many copies the parser will make in the process of casting to UNKNOWN and then to bytea, but I'm not terribly concerned by the above argument. > Wow. I didn't realize this was possible: > test=# select X'ffff'; > ?column? > ---------- > 65535 > (1 row) > This does clearly conflict with the spec, but what about backward > compatibility? Do you think many people use this capability? No idea. I don't think it's documented anywhere, though... regards, tom lane
> Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent > them from being mistaken for parts of multibyte sequences? Yes. > Ugh, but > you're probably right. It looks to me like byteaout does the reverse > already. As for the new UNKNOWN data type, that seems a good thing for me. However, I think more aggressive soultion would be having an encoding info in the text data type itself. This would also opens the way to implement SQL99's CREATE CHARACTER SET stuffs. I have been thinking about this for a while and want to make a RFC in the future(I need to rethink my idea to adopt the SCHEMA you introduced). BTW, for the 7.2.x tree we need a solution with lesser impact. For this purpose, I would like to change PQescapeBytea as I stated in the previous mail. Objection? -- Tatsuo Ishii
Tatsuo Ishii wrote: > BTW, for the 7.2.x tree we need a solution with lesser impact. > For this purpose, I would like to change PQescapeBytea as I stated in > the previous mail. Objection? > -- > Tatsuo Ishii No objection here, but can we wrap the change in #ifdef MULTIBYTE so there's no effect for people who don't use MULTIBYTE? Joe
Joe Conway <mail@joeconway.com> writes: > No objection here, but can we wrap the change in #ifdef MULTIBYTE so > there's no effect for people who don't use MULTIBYTE? That opens up the standard set of issues about "what if your server is MULTIBYTE but your libpq is not?" It seems risky to me. regards, tom lane
> Joe Conway <mail@joeconway.com> writes: > > No objection here, but can we wrap the change in #ifdef MULTIBYTE so > > there's no effect for people who don't use MULTIBYTE? > > That opens up the standard set of issues about "what if your server is > MULTIBYTE but your libpq is not?" It seems risky to me. I have committed changes to the current source (without MULTIBYTE ifdes). Will change t.2-stable tree soon. I also added some careful handlings for memory allocation errors and changed some questionable codes useing direct ASCII values 92 instead of '\\' for example. -- Tatsuo Ishii