Thread: PQescapeBytea is not multibyte aware

PQescapeBytea is not multibyte aware

From
Tatsuo Ishii
Date:
PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea');
ERROR:  Invalid EUC_JP character sequence found (0x8950)

I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".

Any objection?
--
Tatsuo Ishii


Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> PQescapebytea() is not multibyte aware and will produce bad multibyte
> character sequences. Example:
> I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
> considered as "non printable characters".

Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences?  Ugh, but
you're probably right.  It looks to me like byteaout does the reverse
already.
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> 
>>PQescapebytea() is not multibyte aware and will produce bad multibyte
>>character sequences. Example:
>>I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
>>considered as "non printable characters".
> 
> 
> Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
> them from being mistaken for parts of multibyte sequences?  Ugh, but
> you're probably right.  It looks to me like byteaout does the reverse
> already.
> 

But the error comes from pg_verifymbstr. Since bytea has no encoding 
(it's just an array of bytes afterall), why does pg_verifymbstr get 
applied at all to bytea data?

pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it 
help to rewrite this as:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);
?

Joe





Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> But the error comes from pg_verifymbstr. Since bytea has no encoding 
> (it's just an array of bytes afterall), why does pg_verifymbstr get 
> applied at all to bytea data?

Because textin() is used for the initial conversion to an "unknown"
constant --- see make_const() in parse_node.c.

> pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it 
> help to rewrite this as:

> INSERT INTO t1(bytea_col) VALUES('characters produced by
> PQescapebytea'::bytea);

Probably that would cause the error to disappear, but it's hardly a
desirable answer.

I wonder whether this says that TEXT is not a good implementation of
type UNKNOWN.  That choice was made on the assumption that TEXT would
faithfully preserve the contents of a C string ... but it seems that in
the multibyte world it ain't so.  It would not be a huge amount of work
to write a couple more I/O routines and give UNKNOWN its own I/O
behavior.

OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr().  Do we need to think twice about that
processing, as well?
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tom Lane wrote:
>>INSERT INTO t1(bytea_col) VALUES('characters produced by
>>PQescapebytea'::bytea);
> 
> 
> Probably that would cause the error to disappear, but it's hardly a
> desirable answer.
> 
> I wonder whether this says that TEXT is not a good implementation of
> type UNKNOWN.  That choice was made on the assumption that TEXT would
> faithfully preserve the contents of a C string ... but it seems that in
> the multibyte world it ain't so.  It would not be a huge amount of work
> to write a couple more I/O routines and give UNKNOWN its own I/O
> behavior.


I could take a look at this. Any guidance other than "faithfully 
preserving the contents of a C string"?

> 
> OTOH, I was surprised to read your message because I had assumed the
> damage was being done much further upstream, viz during collection of
> the query string by pq_getstr().  Do we need to think twice about that
> processing, as well?

I'll take a look at this as well.

Joe





Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> I could take a look at this. Any guidance other than "faithfully 
> preserving the contents of a C string"?

Take textin/textout, remove multibyte awareness?  Actually the hard
part is to figure out which of the existing hardwired calls of textin
and textout would need to be replaced by calls to unknownin/unknownout.
I think the assumption UNKNOWN == TEXT has crept into a fair number of
places by now.
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tom Lane wrote:
> 
> OTOH, I was surprised to read your message because I had assumed the
> damage was being done much further upstream, viz during collection of
> the query string by pq_getstr().  Do we need to think twice about that
> processing, as well?
> 

I just looked in pq_getstr() I see:

#ifdef MULTIBYTE
p = (char *) pg_client_to_server((unsigned char *) s->data, s->len);
if (p != s->data)          /* actual conversion has been done? */


and in pg_client_to_server I see:

if (ClientEncoding->encoding == DatabaseEncoding->encoding)return s;


So I'm guessing that in Tatsuo's case, both client and database encoding 
are the same, and therefore the string was passed as-is downstream. I 
think you're correct that in a client/database encoding mismatch 
scenario, there would be bigger problems. Thoughts on this?

Joe





Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> I think you're correct that in a client/database encoding mismatch 
> scenario, there would be bigger problems. Thoughts on this?

This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
out.  Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
> 
>>I think you're correct that in a client/database encoding mismatch 
>>scenario, there would be bigger problems. Thoughts on this?
> 
> 
> This scenario is probably why Tatsuo wants PQescapeBytea to octalize
> everything with the high bit set; I'm not sure there's any lesser way

Yuck! At that point you're no better off than converting to hex (and 
worse off than converting to base64) for storage.

SQL99 actually defines BLOB as a binary string literal comprised of an 
even number of hexadecimal digits, in single quotes, preceded by "X", 
e.g. X'1a43fe'. Should we be looking at implementing the standard 
instead of, or in addition to, octalizing? Maybe it is possible to do 
this by creating a new datatype, BLOB, which uses new IN/OUT functions, 
but otherwise uses the various bytea functions?


> out.  Nonetheless, if UNKNOWN conversion introduces additional failures
> then it makes sense to fix that.

I'll follow up on this then.

Joe




Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
>> This scenario is probably why Tatsuo wants PQescapeBytea to octalize
>> everything with the high bit set; I'm not sure there's any lesser way

> Yuck! At that point you're no better off than converting to hex (and 
> worse off than converting to base64) for storage.

No; the *storage* is still compact, it's just the I/O representation
that's not.

> SQL99 actually defines BLOB as a binary string literal comprised of an 
> even number of hexadecimal digits, in single quotes, preceded by "X", 
> e.g. X'1a43fe'. Should we be looking at implementing the standard 
> instead of, or in addition to, octalizing?

Perhaps we should cause the system to regard hex-strings as literals of
type bytea.  Right now I think they're taken to be integer constants,
which is clearly not per spec.
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tom Lane wrote:>> Yuck! At that point you're no better off than converting to hex>> (and worse off than converting to
base64)for storage.>>> No; the *storage* is still compact, it's just the I/O representation> that's not.
 

Yeah, I realized that after I pushed send ;)

But still, doesn't that mean roughly twice as much memory usage for each
copy of the string? And I seem to remember Jan saying that each datum
winds up having 4 copies in memory. It ends up impacting the practical
length limit for a bytea value.
>>>> SQL99 actually defines BLOB as a binary string literal comprised>> of an even number of hexadecimal digits, in
singlequotes,>> preceded by "X", e.g. X'1a43fe'. Should we be looking at>> implementing the standard instead of, or in
additionto,>> octalizing?>>> Perhaps we should cause the system to regard hex-strings as literals> of type bytea.
Rightnow I think they're taken to be integer> constants, which is clearly not per spec.
 

Wow. I didn't realize this was possible:

test=# select X'ffff'; ?column?
----------    65535
(1 row)

This does clearly conflict with the spec, but what about backward
compatibility? Do you think many people use this capability?


Joe



Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> But still, doesn't that mean roughly twice as much memory usage for each
> copy of the string? And I seem to remember Jan saying that each datum
> winds up having 4 copies in memory. It ends up impacting the practical
> length limit for a bytea value.

Well, once the data actually reaches Datum form it'll be in internal
representation, hence compact.  I'm not sure how many copies the parser
will make in the process of casting to UNKNOWN and then to bytea, but
I'm not terribly concerned by the above argument.

> Wow. I didn't realize this was possible:

> test=# select X'ffff';
>   ?column?
> ----------
>      65535
> (1 row)

> This does clearly conflict with the spec, but what about backward
> compatibility? Do you think many people use this capability?

No idea.  I don't think it's documented anywhere, though...
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Tatsuo Ishii
Date:
> Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
> them from being mistaken for parts of multibyte sequences?

Yes.

> Ugh, but
> you're probably right.  It looks to me like byteaout does the reverse
> already.

As for the new UNKNOWN data type, that seems a good thing for
me. However, I think more aggressive soultion would be having an
encoding info in the text data type itself. This would also opens the
way to implement SQL99's CREATE CHARACTER SET stuffs. I have been
thinking about this for a while and want to make a RFC in the future(I
need to rethink my idea to adopt the SCHEMA you introduced).

BTW, for the 7.2.x tree we need a solution with lesser impact. 
For this purpose, I would like to change PQescapeBytea as I stated in
the previous mail. Objection?
--
Tatsuo Ishii


Re: PQescapeBytea is not multibyte aware

From
Joe Conway
Date:
Tatsuo Ishii wrote:
> BTW, for the 7.2.x tree we need a solution with lesser impact. 
> For this purpose, I would like to change PQescapeBytea as I stated in
> the previous mail. Objection?
> --
> Tatsuo Ishii

No objection here, but can we wrap the change in #ifdef MULTIBYTE so 
there's no effect for people who don't use MULTIBYTE?

Joe




Re: PQescapeBytea is not multibyte aware

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> No objection here, but can we wrap the change in #ifdef MULTIBYTE so 
> there's no effect for people who don't use MULTIBYTE?

That opens up the standard set of issues about "what if your server is
MULTIBYTE but your libpq is not?"  It seems risky to me.
        regards, tom lane


Re: PQescapeBytea is not multibyte aware

From
Tatsuo Ishii
Date:
> Joe Conway <mail@joeconway.com> writes:
> > No objection here, but can we wrap the change in #ifdef MULTIBYTE so 
> > there's no effect for people who don't use MULTIBYTE?
> 
> That opens up the standard set of issues about "what if your server is
> MULTIBYTE but your libpq is not?"  It seems risky to me.

I have committed changes to the current source (without MULTIBYTE
ifdes). Will change t.2-stable tree soon.

I also added some careful handlings for memory allocation errors and
changed some questionable codes useing direct ASCII values 92 instead
of '\\' for example.
--
Tatsuo Ishii