Thread: invalidly encoded strings
I have been looking at fixing the issue of accepting strings that are not valid in the database encoding. It appears from previous discussion that we need to add a call to pg_verifymbstr() to the relevant input routines and ensure that the chr() function returns a valid string. That leaves several issues:

. which are the relevant input routines? I have identified the following as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(), namein(). Do we also need one for cstring_in()? Does the xml code handle this as part of xml validation?

. what do we need to do to make the verification code more efficient? I think we need to address the correctness issue first, but doing so should certainly make us want to improve the verification code. For example, I'm wondering if it might benefit from having a tiny cache.

. for chr() under UTF8, it seems to be generally agreed that the argument should represent the codepoint and the function should return the correspondingly encoded character. If so, possibly the argument should be a bigint to accommodate the full range of possible code points. It is not clear what the argument should represent for other multi-byte encodings for any argument higher than 127. Similarly, it is not clear what ascii() should return in such cases. I would be inclined just to error out.

cheers

andrew
On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote: > . what do we need to do to make the verification code more efficient? I > think we need to address the correctness issue first, but doing so > should certainly make us want to improve the verification code. For > example, I'm wondering if it might benefit from having a tiny cache. It has been pointed out that the verification for UTF-8 is very inefficient, involving several function calls to first get the length, then check characters, etc. It could be significantly improved. I don't know whether a cache would make any useful difference. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote: > >> . what do we need to do to make the verification code more efficient? I >> think we need to address the correctness issue first, but doing so >> should certainly make us want to improve the verification code. For >> example, I'm wondering if it might benefit from having a tiny cache. >> > > It has been pointed out the the verification for UTF-8 is very > inefficient, involving several function calls to first get the length, > then check characters, etc. It could be significantly improved. I don't > know whether a cache would make any useful difference. > > > Well, it looks to me like "several" = 2. If function call overhead is the worry, we could either create inline versions of pg_utf_mblen and pg_utf8_islegal and rely on the compiler to make things good for us, or inline the calls directly ourselves. Either way there's some duplication, since the whole "extern inline" thing is still a mess. I'd be inclined to do the inlining by hand since it's not that much code and it would be more portable. Unless I'm missing something it should be a 10 minute c&p job. Is that all we're worried about? cheers andrew
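For illustration, a standalone, single-pass UTF-8 legality check along the lines being discussed might look like the sketch below. This is not the backend's pg_utf_mblen()/pg_utf8_islegal() code; the function name is invented for the example, and the rules simply follow RFC 3629 (reject bad continuation bytes, overlong forms, UTF-16 surrogates, and code points above U+10FFFF). The point is that the length determination and the byte checks fold into one loop with no extra function calls.

/*
 * Minimal sketch (not PostgreSQL source): a single-pass UTF-8 validity
 * check that folds the "get length" and "check bytes" steps into one loop.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
utf8_string_is_legal(const unsigned char *s, int len)
{
    int i = 0;

    while (i < len)
    {
        unsigned char b = s[i];

        if (b < 0x80)                       /* ASCII */
            i += 1;
        else if (b >= 0xC2 && b <= 0xDF)    /* 2-byte sequence */
        {
            if (i + 1 >= len || (s[i + 1] & 0xC0) != 0x80)
                return false;
            i += 2;
        }
        else if (b >= 0xE0 && b <= 0xEF)    /* 3-byte sequence */
        {
            if (i + 2 >= len ||
                (s[i + 1] & 0xC0) != 0x80 || (s[i + 2] & 0xC0) != 0x80)
                return false;
            if (b == 0xE0 && s[i + 1] < 0xA0)   /* overlong */
                return false;
            if (b == 0xED && s[i + 1] > 0x9F)   /* UTF-16 surrogate */
                return false;
            i += 3;
        }
        else if (b >= 0xF0 && b <= 0xF4)    /* 4-byte sequence */
        {
            if (i + 3 >= len ||
                (s[i + 1] & 0xC0) != 0x80 ||
                (s[i + 2] & 0xC0) != 0x80 || (s[i + 3] & 0xC0) != 0x80)
                return false;
            if (b == 0xF0 && s[i + 1] < 0x90)   /* overlong */
                return false;
            if (b == 0xF4 && s[i + 1] > 0x8F)   /* above U+10FFFF */
                return false;
            i += 4;
        }
        else                                /* 0x80-0xC1, 0xF5-0xFF: never a lead byte */
            return false;
    }
    return true;
}

int
main(void)
{
    printf("%d\n", utf8_string_is_legal((const unsigned char *) "caf\xC3\xA9", 5)); /* 1: valid */
    printf("%d\n", utf8_string_is_legal((const unsigned char *) "\xF6\x36", 2));    /* 0: invalid */
    return 0;
}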
Andrew Dunstan <andrew@dunslane.net> writes: > I have been looking at fixing the issue of accepting strings that are > not valid in the database encoding. It appears from previous discussion > that we need to add a call to pg_verifymbstr() to the relevant input > routines and ensure that the chr() function returns a valid string. That > leaves several issues: > . which are the relevant input routines? I have identified the following > as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(), > namein(). Do we also need one for cstring_in()? Does the xml code > handle this as part of xml validation? This seems entirely the wrong approach, because 99% of the time a check placed here will be redundant with the one in the main client-input logic. (That was, indeed, the reason I took such checks out of these places in the first place.) Moreover, as you've already found out there are N places that would have to be fixed and we'd face a constant hazard of new datatypes introducing new holes. AFAICS the risk comes only from chr() and the possibility of numeric backslash-escapes in SQL string literals, and we ought to think about fixing it in those places. A possible answer is to add a verifymbstr to the string literal converter anytime it has processed a numeric backslash-escape in the string. Open questions for that are (1) does it have negative effects for bytea, and if so is there any hope of working around it? (2) how can we postpone the conversion/test to the parse analysis phase? (To the extent that database encoding is frozen it'd probably be OK to do it in the scanner, but such a choice will come back to bite us eventually.) > . for chr() under UTF8, it seems to be generally agreed that the > argument should represent the codepoint and the function should return > the correspondingly encoded character. If so, possible the argument > should be a bigint to accommodate the full range of possible code > points. It is not clear what the argument should represent for other > multi-byte encodings for any argument higher than 127. Similarly, it is > not clear what ascii() should return in such cases. I would be inclined > just to error out. In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, OK with throwing error. regards, tom lane
Tom Lane wrote: > > A possible answer is to add a verifymbstr to the string literal > converter anytime it has processed a numeric backslash-escape in the > string. Open questions for that are (1) does it have negative effects > for bytea, and if so is there any hope of working around it? (2) how > can we postpone the conversion/test to the parse analysis phase? > Finding out how to do (2) seems to me the only possible answer to (1). I'll have a look. Is that going to cover data coming in via COPY? and parameters for prepared statements? >> . for chr() under UTF8, it seems to be generally agreed that the >> argument should represent the codepoint and the function should return >> the correspondingly encoded character. If so, possible the argument >> should be a bigint to accommodate the full range of possible code >> points. It is not clear what the argument should represent for other >> multi-byte encodings for any argument higher than 127. Similarly, it is >> not clear what ascii() should return in such cases. I would be inclined >> just to error out. >> > > In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, > OK with throwing error. > > > I was planning on allowing up to 255 for all single byte encodings too. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Is that going to cover data coming in via COPY? and parameters for > prepared statements? Those should be checked already --- if not, the right fix is still to fix it there, not in per-datatype code. I think we are OK though, eg see "need_transcoding" logic in copy.c. >> In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, >> OK with throwing error. > I was planning on allowing up to 255 for all single byte encodings too. OK, that sounds fine. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Is that going to cover data coming in via COPY? and parameters for >> prepared statements? >> > > Those should be checked already --- if not, the right fix is still to > fix it there, not in per-datatype code. I think we are OK though, > eg see "need_transcoding" logic in copy.c. >

Well, a little experimentation shows that we currently are not OK:

in foo.data: \366\66

utf8test=# \copy xx from foo.data
utf8test=# select encode(t::bytea,'hex') from xx;
 encode
--------
 f636
(1 row)

utf8test=# \copy xx to bb.data
utf8test=# \copy xx from bb.data
ERROR:  invalid byte sequence for encoding "UTF8": 0xf636
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY xx, line 1
utf8test=#

BTW, all the foo_recv functions that call pq_getmsgtext or pq_getmsgstring are thereby calling pg_verify_mbstr already (via pg_client_to_server). So I am still not 100% convinced that doing the same directly in the corresponding foo_in functions is such a bad idea.

cheers

andrew
On Sun, 2007-09-09 at 10:51 -0400, Tom Lane wrote: > A possible answer is to add a verifymbstr to the string literal > converter anytime it has processed a numeric backslash-escape in the > string. Open questions for that are (1) does it have negative effects > for bytea, and if so is there any hope of working around it? (2) how > can we postpone the conversion/test to the parse analysis phase? > (To the extent that database encoding is frozen it'd probably be OK > to do it in the scanner, but such a choice will come back to bite > us eventually.) Regarding #1: Currently, you can pass a bytea literal as either: E'\377\377\377' or E'\\377\\377\\377'. The first strategy (single backslash) is not correct, because if you do E'\377\000\377', the embedded null character counts as the end of the cstring, even though there are bytes after it. Similar strange things happen if you have a E'\134' (backslash) somewhere in the string. However, I have no doubt that there are people who use the first strategy anyway, and the proposed change would break that for them. There may be more issues, too. Regards,Jeff Davis
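To make the embedded-null point concrete: once the literal E'\377\000\377' has been de-escaped, anything downstream that receives the result as a plain cstring can only see up to the first NUL. The following is a tiny standalone illustration, not backend code; it just simulates the bytes the scanner would hand onward under the single-backslash strategy.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* the bytes produced by de-escaping E'\377\000\377' */
    const char deescaped[] = {'\377', '\0', '\377', '\0'};

    /* a cstring consumer stops at the embedded NUL */
    printf("bytes intended: 3, bytes a cstring consumer sees: %zu\n",
           strlen(deescaped));              /* prints 1 */
    return 0;
}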
Andrew Dunstan <andrew@dunslane.net> writes: > Well, a little experimentation shows that we currently are not OK: This experiment is inadequately described. What is the type of the column involved? regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > Currently, you can pass a bytea literal as either: E'\377\377\377' or > E'\\377\\377\\377'. > The first strategy (single backslash) is not correct, because if you do > E'\377\000\377', the embedded null character counts as the end of the > cstring, even though there are bytes after it. Similar strange things > happen if you have a E'\134' (backslash) somewhere in the string. > However, I have no doubt that there are people who use the first > strategy anyway, and the proposed change would break that for them. If their code is broken anyway, calling their attention to it would be a good idea, hm? If we are not going to reject the embedded-null case then there is hardly any point in considering any behavioral change at all here. I want to point out in particular that Andrew's proposal of making datatype input functions responsible for encoding verification cannot possibly handle this, since they have to take the "terminating" null at face value. regards, tom lane
On Sun, 2007-09-09 at 17:09 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Currently, you can pass a bytea literal as either: E'\377\377\377' or > > E'\\377\\377\\377'. > > > The first strategy (single backslash) is not correct, because if you do > > E'\377\000\377', the embedded null character counts as the end of the > > cstring, even though there are bytes after it. Similar strange things > > happen if you have a E'\134' (backslash) somewhere in the string. > > However, I have no doubt that there are people who use the first > > strategy anyway, and the proposed change would break that for them. > > If their code is broken anyway, calling their attention to it would be a > good idea, hm? > Agreed. > If we are not going to reject the embedded-null case then there is > hardly any point in considering any behavioral change at all here. > I want to point out in particular that Andrew's proposal of making > datatype input functions responsible for encoding verification cannot > possibly handle this, since they have to take the "terminating" null > at face value. Would stringTypeDatum() in parse_type.c be a good place to put the pg_verifymbstr()? Regards,Jeff Davis
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Well, a little experimentation shows that we currently are not OK: >> > > This experiment is inadequately described. > What is the type of the column involved? > > > Sorry. It's text. cheers andrew
Jeff Davis <pgsql@j-davis.com> writes: > Would stringTypeDatum() in parse_type.c be a good place to put the > pg_verifymbstr()? Probably not, in its current form, since it hasn't got any idea where the "char *string" came from; moreover it is not in any better position than the typinput function to determine whether there was a bogus embedded null. OTOH, there may be no decent way to fix the embedded-null problem other than by hacking the scanner to reject \0 immediately. If we did that it would give us more flexibility about where to put the encoding validity checks. In any case, I feel dubious that checking in stringTypeDatum will cover every code path. Somewhere around where A_Const gets transformed to Const seems like it'd be a better plan. (But I think that in most utility statement parsetrees, A_Const never does get transformed to Const; and there seem to be a few places in gram.y where an SCONST gives rise to something other than A_Const; so this is still not a bulletproof choice, at least not without additional changes.) In the short run it might be best to do it in scan.l after all. A few minutes' thought about what it'd take to delay the decisions till later yields a depressingly large number of changes; and we do not have time to be developing mostly-cosmetic patches for 8.3. Given that database_encoding is frozen for any one DB at the moment, and that that is unlikely to change in the near future, insisting on a solution that allows it to vary is probably unreasonable at this stage of the game. regards, tom lane
Tom Lane wrote: > In the short run it might be best to do it in scan.l after all. > I have not come up with a way of doing that and handling the bytea case. If you have I'm all ears. And then I am still worried about COPY. cheers andrew
On Sun, 2007-09-09 at 23:22 -0400, Tom Lane wrote: > In the short run it might be best to do it in scan.l after all. A few > minutes' thought about what it'd take to delay the decisions till later > yields a depressingly large number of changes; and we do not have time > to be developing mostly-cosmetic patches for 8.3. Given that > database_encoding is frozen for any one DB at the moment, and that that > is unlikely to change in the near future, insisting on a solution that > allows it to vary is probably unreasonable at this stage of the game. > Sounds reasonable to me. Regards,Jeff Davis
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> In the short run it might be best to do it in scan.l after all. > I have not come up with a way of doing that and handling the bytea case. AFAICS we have no realistic choice other than to reject \0 in SQL literals; to do otherwise requires API changes throughout that stack of modules. And once you admit \0 is no good, it's not clear that \somethingelse is any better for bytea-using clients. Moreover, given that we are moving away from backslash escapes as fast as we can sanely travel, expending large amounts of energy to make them work better doesn't seem like a good use of development manpower. > If you have I'm all ears. And then I am still worried about COPY. Haven't looked at your test case yet... but it sure looks to me like the COPY code *ought* to cover this. regards, tom lane
On Sun, 2007-09-09 at 23:33 -0400, Andrew Dunstan wrote: > Tom Lane wrote: > > In the short run it might be best to do it in scan.l after all. > > > > I have not come up with a way of doing that and handling the bytea case. > If you have I'm all ears. And then I am still worried about COPY. If it's done in the scanner it should still accept things like: E'\\377\\000\\377'::bytea right? I think the concern is when they use only one slash, like: E'\377\000\377'::bytea which, as I mentioned before, is not correct anyway. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > If it's done in the scanner it should still accept things like: > E'\\377\\000\\377'::bytea > right? Right, that will work, because the transformed literal is '\377\000\377' (no strange characters there, just what it says) and that has not got any encoding violations. > I think the concern is when they use only one slash, like: > E'\377\000\377'::bytea > which, as I mentioned before, is not correct anyway. Note also that if you have standard_conforming_strings set to ON, this gives the very same result: '\377\000\377'::bytea I'm not convinced that we need to move heaven and earth to make this work the same with or without E''. Basically what I'm thinking is we should just decree that the de-escaped string literal still has to be valid per the database encoding. If you don't want to be bound by that, use the right number of backslashes to get the thing passed through to the bytea input routine. regards, tom lane
Tom Lane wrote: >> . for chr() under UTF8, it seems to be generally agreed >> that the argument should represent the codepoint and the >> function should return the correspondingly encoded character. >> If so, possible the argument should be a bigint to >> accommodate the full range of possible code points. >> It is not clear what the argument should represent for other >> multi-byte encodings for any argument higher than 127. >> Similarly, it is not clear what ascii() should return in >> such cases. I would be inclined just to error out. > > In SQL_ASCII I'd argue for allowing 0..255. In actual MB > encodings, OK with throwing error. I'd like to repeat my suggestion for chr() and ascii(). Instead of the code point, I'd prefer the actual encoding of the character as argument to chr() and return value of ascii(). The advantage I see is the following: - It would make these functions from oracle_compat.c compatible with Oracle (Oracle's chr() and ascii() work the way I suggest). I agree with Tom's earlier suggestion to throw an error for chr(0), although this is not what Oracle does. Of course, if it is generally perceived that the code point is more useful than the encoding, then Oracle compliance is probably secondary. Yours, Laurenz Albe
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Jeff Davis <pgsql@j-davis.com> writes: > >> I think the concern is when they use only one slash, like: >> E'\377\000\377'::bytea >> which, as I mentioned before, is not correct anyway. Wait, why would this be wrong? How would you enter the three byte bytea of consisting of those three bytes described above? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
>>> I think the concern is when they use only one slash, like: >>> E'\377\000\377'::bytea >>> which, as I mentioned before, is not correct anyway. > > Wait, why would this be wrong? How would you enter the three byte bytea of > consisting of those three bytes described above? Either as E'\\377\\000\\377' or '\377\000\377' /Dennis
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> In the short run it might be best to do it in scan.l after all. >>> > > >> I have not come up with a way of doing that and handling the bytea case. >> > > AFAICS we have no realistic choice other than to reject \0 in SQL > literals; to do otherwise requires API changes throughout that stack of > modules. And once you admit \0 is no good, it's not clear that > \somethingelse is any better for bytea-using clients. Moreover, given > that we are moving away from backslash escapes as fast as we can sanely > travel, expending large amounts of energy to make them work better > doesn't seem like a good use of development manpower. > > > Perhaps we're talking at cross purposes. The problem with doing encoding validation in scan.l is that it lacks context. Null bytes are only the tip of the bytea iceberg, since any arbitrary sequence of bytes can be valid for a bytea. So we can only do validation of encoding on a literal when we know it isn't destined for a bytea. That's what I haven't come up with a way of doing in the scanner (and as you noted upthread it's getting pretty darn late in the cycle for us to be looking for ways to do things). I still don't see why it's OK for us to do validation from the foo_recv functions but not the corresponding foo_in functions. At least in the short term that would provide us with fairly complete protection against accepting invalidly encoded data into the database, once we fix up chr(), without having to mess with the scanner, parser, COPY code etc. We could still get corruption from UDFs and UDTs - it's hard to see how we can avoid that danger. Yes, we would need to make sure that any additions to the type system acted properly, and yes we should fix up any validation inefficiencies (like possibly inlining calls in the UTF8 validation code). Personally, I don't see those as killer objections. cheers andrew
Albe Laurenz wrote: > I'd like to repeat my suggestion for chr() and ascii(). > > Instead of the code point, I'd prefer the actual encoding of > the character as argument to chr() and return value of ascii(). > > > [snip] > Of course, if it is generally perceived that the code point > is more useful than the encoding, then Oracle compliance > is probably secondary. > > > Last time this was discussed, you were the only person arguing for that behaviour, IIRC. And frankly, I don't know how to do it sanely anyway. A character encoding has a fixed byte pattern, but a given byte pattern doesn't have a single universal number value. I really don't think we want to have the value of chr(n) depend on the endianness of the machine, do we? The reason we are prepared to make an exception for Unicode is precisely because the code point maps to an encoding pattern independently of architecture, ISTM. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Perhaps we're talking at cross purposes. > The problem with doing encoding validation in scan.l is that it lacks > context. Null bytes are only the tip of the bytea iceberg, since any > arbitrary sequence of bytes can be valid for a bytea. If you think that, then we're definitely talking at cross purposes. I assert we should require the post-scanning value of a string literal to be valid in the database encoding. If you want to produce an arbitrary byte sequence within a bytea value, the way to get there is for the bytea input function to do the de-escaping, not for the string literal parser to do it. The current situation where there is overlapping functionality is a bit unfortunate, but once standard_conforming_strings is on by default, it'll get a lot easier to work with. I'm not eager to contort APIs throughout the backend in order to produce a more usable solution for the standard_conforming_strings = off case, given the expected limited lifespan of that usage. The only reason I was considering not doing it in scan.l is that scan.l's behavior ideally shouldn't depend on any changeable variables. But until there's some prospect of database_encoding actually being mutable at run time, there's not much point in investing a lot of sweat on that either. > I still don't see why it's OK for us to do validation from the foo_recv > functions but not the corresponding foo_in functions. Basically, a CSTRING handed to an input function should already have been encoding-validated by somebody. The most obvious reason why this must be so is the embedded-null problem, but in general it will already have been validated (typically as part of a larger string such as the whole SQL query or whole COPY data line), and doing that over is pointless and expensive. On the other hand, the entire point of a recv function is that it gets raw data that no one else in the backend knows the format of; so if the data is to be considered textual, the recv function has to be the one that considers it so and invokes appropriate conversion or validation. The reason backslash escapes in string literals are a problem is that they can produce incorrect-encoding results from what had been a validated string. > At least in the > short term that would provide us with fairly complete protection against > accepting invalidly encoded data into the database, once we fix up > chr(), without having to mess with the scanner, parser, COPY code etc. Instead, we have to mess with an unknown number of UDTs ... regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > The reason we are prepared to make an exception for Unicode is precisely > because the code point maps to an encoding pattern independently of > architecture, ISTM. Right --- there is a well-defined standard for the numerical value of each character in Unicode. And it's also clear what to do in single-byte encodings. It's not at all clear what the representation ought to be for other multibyte encodings. A direct transliteration of the byte sequence not only has endianness issues, but will have a weird non-dense set of valid values because of the restrictions on valid multibyte characters. Given that chr() has never before behaved sanely for multibyte values at all, extending it to Unicode code points is a reasonable extension, and throwing error for other encodings is reasonable too. If we ever do come across code-point standards for other encodings we can adopt 'em at that time. regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Those should be checked already --- if not, the right fix is still to >> fix it there, not in per-datatype code. I think we are OK though, >> eg see "need_transcoding" logic in copy.c. > Well, a little experimentation shows that we currently are not OK: > in foo.data: > \366\66 I looked at this and found that the problem is that CopyReadAttributesText() does backslash conversions and doesn't validate the result. My proposal at this point is that both scan.l and CopyReadAttributesText need to guarantee that the result of de-escaping is valid in the database encoding: they should reject \0 and check validity of multibyte characters. You could optimize this fairly well (important in the COPY case) by not redoing verifymbstr unless you'd seen at least one numerical backslash-escape that set the high bit. regards, tom lane
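As an illustration of the optimization described above (only re-verify when a numeric escape has produced a NUL or a high-bit byte), here is a standalone sketch. It is not the actual CopyReadAttributesText() code: it handles only octal escapes, the function names are invented for the example, and the place where the backend would call pg_verifymbstr() is marked with a comment.

/*
 * Sketch: de-escape octal \ooo sequences, remembering whether any escape
 * produced a NUL or a byte with the high bit set.  Only in that case does
 * the result need re-verification against the database encoding; otherwise
 * the input line was already checked once and cannot have been broken.
 */
#include <stdbool.h>
#include <stdio.h>

/* De-escape octal \ooo sequences; returns new length, sets *needs_verify. */
static int
deescape_octal(const char *in, char *out, bool *needs_verify)
{
    int outlen = 0;

    *needs_verify = false;
    while (*in)
    {
        if (*in == '\\' && in[1] >= '0' && in[1] <= '7')
        {
            int val = 0, digits = 0;

            in++;
            while (digits < 3 && *in >= '0' && *in <= '7')
            {
                val = (val << 3) + (*in - '0');
                in++;
                digits++;
            }
            if (val == 0 || val >= 0x80)
                *needs_verify = true;   /* could break encoding validity */
            out[outlen++] = (char) val;
        }
        else
            out[outlen++] = *in++;
    }
    out[outlen] = '\0';
    return outlen;
}

int
main(void)
{
    char buf[64];
    bool needs_verify;
    int len;

    len = deescape_octal("\\141bc", buf, &needs_verify);    /* "abc" */
    printf("len=%d verify=%d\n", len, needs_verify);        /* 3, 0 */

    len = deescape_octal("\\366\\66", buf, &needs_verify);  /* 0xF6 '6' */
    printf("len=%d verify=%d\n", len, needs_verify);        /* 2, 1 */
    /* here the backend would call pg_verifymbstr(buf, len, false) */
    return 0;
}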
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Perhaps we're talking at cross purposes. >> > > >> The problem with doing encoding validation in scan.l is that it lacks >> context. Null bytes are only the tip of the bytea iceberg, since any >> arbitrary sequence of bytes can be valid for a bytea. >> > > If you think that, then we're definitely talking at cross purposes. > I assert we should require the post-scanning value of a string literal > to be valid in the database encoding. If you want to produce an > arbitrary byte sequence within a bytea value, the way to get there is > for the bytea input function to do the de-escaping, not for the string > literal parser to do it. >

[looks again ... thinks ...] Ok, I'm sold.

> > The only reason I was considering not doing it in scan.l is that > scan.l's behavior ideally shouldn't depend on any changeable variables. > But until there's some prospect of database_encoding actually being > mutable at run time, there's not much point in investing a lot of sweat > on that either. >

Agreed. This would just be one item of many to change if/when we ever come to that.

> Instead, we have to mess with an unknown number of UDTs ... > > >

We're going to have that danger anyway, aren't we, unless we check the encoding validity of the result of every UDF that returns some text type? I'm not going to lose sleep over something I can't cure but the user can - what concerns me is that our own code ensures data integrity, including for encoding.

Anyway, it looks to me like we have the following items to do:

. add validity checking to the scanner
. fix COPY code
. fix chr()
. improve efficiency of validity checks, at least for UTF8

cheers

andrew
> Andrew Dunstan <andrew@dunslane.net> writes: > > The reason we are prepared to make an exception for Unicode is precisely > > because the code point maps to an encoding pattern independently of > > architecture, ISTM. > > Right --- there is a well-defined standard for the numerical value of > each character in Unicode. And it's also clear what to do in > single-byte encodings. It's not at all clear what the representation > ought to be for other multibyte encodings. A direct transliteration > of the byte sequence not only has endianness issues, but will have > a weird non-dense set of valid values because of the restrictions on > valid multibyte characters. > > Given that chr() has never before behaved sanely for multibyte values at > all, extending it to Unicode code points is a reasonable extension, > and throwing error for other encodings is reasonable too. If we ever do > come across code-point standards for other encodings we can adopt 'em at > that time. I don't understand the whole discussion. Why do you think that employing the Unicode code point as the chr() argument could avoid endianness issues? Are you going to represent the Unicode code point as UCS-4? Then you have to specify the endianness anyway. (see the UCS-4 standard for more details) Or are you going to represent the Unicode code point as a character string such as 'U+0259'? Then representing any encoding as a string could avoid endianness issues anyway, and I don't see that the Unicode code point is any better than others. Also I'd like to point out that every encoding has its own code point system as far as I know. For example, EUC-JP has its corresponding code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see why we can't use "code point" as chr()'s argument for other encodings (of course we need an optional parameter specifying which character set is intended). -- Tatsuo Ishii SRA OSS, Inc. Japan
BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. regards, tom lane
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote: > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 standard for more details) Because the argument to chr() is an integer, which has no endian-ness. You only get into endian-ness if you look at how you store the resulting string. > Also I'd like to point out all encodings has its own code point > systems as far as I know. For example, EUC-JP has its corresponding > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > we can't use "code point" as chr()'s argument for othe encodings(of > course we need optional parameter specifying which character set is > supposed). Oh, the last discussion on this didn't answer this question. Is there a standard somewhere that maps integers to characters in EUC-JP? If so, how can I find out what character 512 is? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. > > > I think it should, yes. cheers andrew
Tatsuo Ishii wrote: > > I don't understand whole discussion. > > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 standard for more details) > The code point is simply a number. The result of chr() will be a text value one char (not one byte) wide, in the relevant database encoding. U+nnnn maps to the same Unicode char and hence the same UTF8 encoding pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on all machines. So to get this char in a UTF8 database you could call "select chr(169)" and get back the byte pattern \xC2A9. > Or are you going to represent Unicode point as a character string such > as 'U+0259'? Then representing any encoding as a string could avoid > endianness issues anyway, and I don't see Unicode code point is any > better than others. > The argument will be a number, as now. > Also I'd like to point out all encodings has its own code point > systems as far as I know. For example, EUC-JP has its corresponding > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > we can't use "code point" as chr()'s argument for othe encodings(of > course we need optional parameter specifying which character set is > supposed). > Where can I find the tables that map code points (as opposed to encodings) to characters for these others? cheers andrew
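A sketch of the mapping being described here: the chr() argument is a Unicode code point (just a number, so no endianness enters into it), and the result is the corresponding UTF-8 byte sequence in the database encoding. This is standalone illustration code, not the proposed patch; the function name is made up, and a real implementation would also reject 0 and the surrogate range U+D800..U+DFFF.

/*
 * Sketch of the code-point-to-UTF-8 mapping discussed for chr() under UTF8.
 */
#include <stdio.h>

/* Encode one code point as UTF-8; returns number of bytes, 0 on error. */
static int
unicode_to_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;                   /* beyond the 21-bit Unicode range */
}

int
main(void)
{
    unsigned char buf[4];
    int n = unicode_to_utf8(169, buf);      /* U+00A9, the copyright sign */

    for (int i = 0; i < n; i++)
        printf("%02X", buf[i]);             /* prints C2A9 */
    printf("\n");
    return 0;
}

This also shows why a plain int argument is enough: valid code points stop at U+10FFFF, which fits comfortably in 21 bits.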
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> BTW, I'm sure this was discussed but I forgot the conclusion: should >> chr(0) throw an error? > I think it should, yes. OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doing that. We should not need it since we only support 4-byte UTF8, hence code points only up to 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, no?). If Tatsuo can find official code-point mappings for any other MB encodings, +1 on supporting those too. regards, tom lane
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. It is pointed out on Wikipedia that Java sometimes uses the byte pair C0 80 to represent the NUL character to allow it to be embedded in C strings without tripping anything up. It is however technically an illegal representation (violates the minimal representation rule) and thus rejected by postgres. I'm not suggesting we copy this, but it does show there are other ways to deal with this. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote: > OK. Looking back, there was also some mention of changing chr's > argument to bigint, but I'd counsel against doing that. We should not > need it since we only support 4-byte UTF8, hence code points only up to > 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, > no?). > > > Yes, that was a thinko on my part. I realised that yesterday. cheers andrew
> Tatsuo Ishii wrote: > > > > I don't understand whole discussion. > > > > Why do you think that employing the Unicode code point as the chr() > > argument could avoid endianness issues? Are you going to represent > > Unicode code point as UCS-4? Then you have to specify the endianness > > anyway. (see the UCS-4 standard for more details) > > > > The code point is simply a number. The result of chr() will be a text > value one char (not one byte) wide, in the relevant database encoding. > > U+nnnn maps to the same Unicode char and hence the same UTF8 encoding > pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on > all machines. So to get this char in a UTF8 database you could call > "select chr(169)" and get back the byte pattern \xC2A9. If you regard the Unicode code point as simply a number, why not regard the multibyte characters as a number too? I mean, since 0xC2A9 = 49833, "select chr(49833)" should work fine, no? Also I'm wondering what we should do with a different backend/frontend encoding combo. For example, if your database is in UTF-8, and your client encoding is LATIN2, what integer value should be passed to chr()? LATIN2 or Unicode code point? > > Or are you going to represent Unicode point as a character string such > > as 'U+0259'? Then representing any encoding as a string could avoid > > endianness issues anyway, and I don't see Unicode code point is any > > better than others. > > The argument will be a number, as now. > > > Also I'd like to point out all encodings has its own code point > > systems as far as I know. For example, EUC-JP has its corresponding > > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > > we can't use "code point" as chr()'s argument for othe encodings(of > > course we need optional parameter specifying which character set is > > supposed). > > > > Where can I find the tables that map code points (as opposed to > encodings) to characters for these others? You mean the code point table of a character set? The actual standard is not on the web since it is copyrighted by the Japanese government (you need to buy it as a book or a pdf file). However you could find many code point tables on the web. For example, JIS X 0208 code points can be found on: http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html (you need to have a Japanese font and set page encoding to Shift JIS) BTW, if you want to pass "code point of character set" rather than the encoding value, you need to give chr() what character set you are referring to. So we need to have two arguments, one for the code point, the other for the character set specification. What do you think? -- Tatsuo Ishii SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes: > If you regard the unicode code point as simply a number, why not > regard the multibyte characters as a number too? Because there's a standard specifying the Unicode code points *as numbers*. The mapping from those numbers to UTF8 strings (and other representations) is well-defined by the standard. > Also I'm wondering you what we should do with different > backend/frontend encoding combo. Nothing. chr() has always worked with reference to the database encoding, and we should keep it that way. BTW, it strikes me that there is another hole that we need to plug in this area, and that's the convert() function. Being able to create a value of type text that is not in the database encoding is simply broken. Perhaps we could make it work on bytea instead (providing a cast from text to bytea but not vice versa), or maybe we should just forbid the whole thing if the database encoding isn't SQL_ASCII. regards, tom lane
Tatsuo Ishii wrote: > > If you regard the unicode code point as simply a number, why not > regard the multibyte characters as a number too? I mean, since 0xC2A9 > = 49833, "select chr(49833)" should work fine no? > > No. The number corresponding to a given byte pattern depends on the endianness of the architecture. That's exactly why we can't sanely use the byte pattern of the encoded characters as numbers. cheers andrew
> Tatsuo Ishii <ishii@postgresql.org> writes: > > If you regard the unicode code point as simply a number, why not > > regard the multibyte characters as a number too? > > Because there's a standard specifying the Unicode code points *as > numbers*. The mapping from those numbers to UTF8 strings (and other > representations) is well-defined by the standard. > > > Also I'm wondering you what we should do with different > > backend/frontend encoding combo. > > Nothing. chr() has always worked with reference to the database > encoding, and we should keep it that way. Where is it documented? > BTW, it strikes me that there is another hole that we need to plug in > this area, and that's the convert() function. Being able to create > a value of type text that is not in the database encoding is simply > broken. Perhaps we could make it work on bytea instead (providing > a cast from text to bytea but not vice versa), or maybe we should just > forbid the whole thing if the database encoding isn't SQL_ASCII. Please don't do that. It will break a useful use case of convert(). A user has a database encoded in UTF-8. He has English, French, Chinese and Japanese data in tables. To sort the tables in the language order, he will do like this: SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); Without using convert(), he will get a random order of data. This is because Kanji characters are in random order in UTF-8, while Kanji characters are reasonably ordered in EUC_JP. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > BTW, it strikes me that there is another hole that we need to plug in > > this area, and that's the convert() function. Being able to create > > a value of type text that is not in the database encoding is simply > > broken. Perhaps we could make it work on bytea instead (providing > > a cast from text to bytea but not vice versa), or maybe we should just > > forbid the whole thing if the database encoding isn't SQL_ASCII. > > Please don't do that. It will break an usefull use case of convert(). > > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. Isn't the collation a locale issue, not an encoding issue? Is there a ja_JP.UTF-8 that defines the proper order? Regards,Jeff Davis
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > > BTW, it strikes me that there is another hole that we need to plug in > > > this area, and that's the convert() function. Being able to create > > > a value of type text that is not in the database encoding is simply > > > broken. Perhaps we could make it work on bytea instead (providing > > > a cast from text to bytea but not vice versa), or maybe we should just > > > forbid the whole thing if the database encoding isn't SQL_ASCII. > > > > Please don't do that. It will break an usefull use case of convert(). > > > > A user has a database encoded in UTF-8. He has English, French, > > Chinese and Japanese data in tables. To sort the tables in the > > language order, he will do like this: > > > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > > > Without using convert(), he will get random order of data. This is > > because Kanji characters are in random order in UTF-8, while Kanji > > characters are reasonably ordered in EUC_JP. > > Isn't the collation a locale issue, not an encoding issue? Is there a > ja_JP.UTF-8 that defines the proper order? I don't think it helps. The point is, he needs different languages' collations, while PostgreSQL allows only one collation (locale) per database cluster. -- Tatsuo Ishii SRA OSS, Inc. Japan
Tatsuo Ishii wrote: >> BTW, it strikes me that there is another hole that we need to plug in >> this area, and that's the convert() function. Being able to create >> a value of type text that is not in the database encoding is simply >> broken. Perhaps we could make it work on bytea instead (providing >> a cast from text to bytea but not vice versa), or maybe we should just >> forbid the whole thing if the database encoding isn't SQL_ASCII. >> > > Please don't do that. It will break an usefull use case of convert(). > > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. > > Tatsuo-san, would not this case be at least as well met by an operator supplying the required ordering? The operator of course would not have the danger of supplying values that are invalid in the database encoding. Admittedly, the user might need several operators for the case you describe. I'm not sure we are going to be able to catch every path by which invalid data can get into the database in one release. I suspect we might need two or three goes at this. (I'm just wondering if the routines that return cstrings are a possible vector). cheers andrew
Jeff Davis wrote: > On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > >>> BTW, it strikes me that there is another hole that we need to plug in >>> this area, and that's the convert() function. Being able to create >>> a value of type text that is not in the database encoding is simply >>> broken. Perhaps we could make it work on bytea instead (providing >>> a cast from text to bytea but not vice versa), or maybe we should just >>> forbid the whole thing if the database encoding isn't SQL_ASCII. >>> >> Please don't do that. It will break an usefull use case of convert(). >> >> A user has a database encoded in UTF-8. He has English, French, >> Chinese and Japanese data in tables. To sort the tables in the >> language order, he will do like this: >> >> SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); >> >> Without using convert(), he will get random order of data. This is >> because Kanji characters are in random order in UTF-8, while Kanji >> characters are reasonably ordered in EUC_JP. >> > > Isn't the collation a locale issue, not an encoding issue? Is there a > ja_JP.UTF-8 that defines the proper order? > > > That won't help you much if you have the whole collection of languages mentioned above. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I'm not sure we are going to be able to catch every path by which > invalid data can get into the database in one release. I suspect we > might need two or three goes at this. (I'm just wondering if the > routines that return cstrings are a possible vector). We need to have a policy that cstrings are in the database encoding. regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes: >> BTW, it strikes me that there is another hole that we need to plug in >> this area, and that's the convert() function. Being able to create >> a value of type text that is not in the database encoding is simply >> broken. Perhaps we could make it work on bytea instead (providing >> a cast from text to bytea but not vice versa), or maybe we should just >> forbid the whole thing if the database encoding isn't SQL_ASCII. > Please don't do that. It will break an usefull use case of convert(). The reason we have a problem here is that we've been choosing convenience over safety in encoding-related issues. I wonder if we must stoop to having a "strict_encoding_checks" GUC variable to satisfy everyone. > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > Without using convert(), he will get random order of data. I'd say that *with* convert() he will get a random order of data. This is making a boatload of unsupportable assumptions about the locale and encoding of the surrounding database. There are a lot of bad-encoding situations for which strcoll() simply breaks down completely and can't even deliver self-consistent answers. It might work the way you are expecting if the database uses SQL_ASCII encoding and C locale --- and I'd be fine with allowing convert() only when the database encoding is SQL_ASCII. regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes: > >> BTW, it strikes me that there is another hole that we need to plug in > >> this area, and that's the convert() function. Being able to create > >> a value of type text that is not in the database encoding is simply > >> broken. Perhaps we could make it work on bytea instead (providing > >> a cast from text to bytea but not vice versa), or maybe we should just > >> forbid the whole thing if the database encoding isn't SQL_ASCII. > > > Please don't do that. It will break an usefull use case of convert(). > > The reason we have a problem here is that we've been choosing > convenience over safety in encoding-related issues. I wonder if we must > stoop to having a "strict_encoding_checks" GUC variable to satisfy > everyone. Please show me concrete examples of how I could introduce a vulnerability using this kind of convert() usage. > > A user has a database encoded in UTF-8. He has English, French, > > Chinese and Japanese data in tables. To sort the tables in the > > language order, he will do like this: > > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > > Without using convert(), he will get random order of data. > > I'd say that *with* convert() he will get a random order of data. This > is making a boatload of unsupportable assumptions about the locale and > encoding of the surrounding database. There are a lot of bad-encoding > situations for which strcoll() simply breaks down completely and can't > even deliver self-consistent answers. > > It might work the way you are expecting if the database uses SQL_ASCII > encoding and C locale --- and I'd be fine with allowing convert() only > when the database encoding is SQL_ASCII. I don't believe that. With the C locale, convert() works fine as I described. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote: > > Isn't the collation a locale issue, not an encoding issue? Is there a > > ja_JP.UTF-8 that defines the proper order? > > I don't think it helps. The point is, he needs different language's > collation, while PostgreSQL allows only one collation(locale) per > database cluster. My thought was: if we had some function that treated a string as a different locale, it might solve the problem without violating the database encoding. Converting to a different encoding, and producing text result, is the source of the problem. Regards,Jeff Davis
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > Please show me concrete examples how I could introduce a vulnerability > using this kind of convert() usage.

Try the sequence below. Then, try to dump and then reload the database. When you try to reload it, you will get an error:

ERROR: invalid byte sequence for encoding "UTF8": 0xbd

Regards, Jeff Davis

test=> select version();
                                                          version
--------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
(1 row)

test=> show lc_collate;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

test=> create table encoding_test(t text);
CREATE TABLE
test=> insert into encoding_test values('初');
INSERT 0 1
test=> insert into encoding_test values(convert('初' using utf8_to_euc_jp));
INSERT 0 1
> On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > Please show me concrete examples how I could introduce a vulnerability > > using this kind of convert() usage. > > Try the sequence below. Then, try to dump and then reload the database. > When you try to reload it, you will get an error: > > ERROR: invalid byte sequence for encoding "UTF8": 0xbd I know this could be a problem (like chr() with an invalid byte pattern). What I really want to know is whether a read query, something like this: SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); could be a problem (I assume we use the C locale). -- Tatsuo Ishii SRA OSS, Inc. Japan > Regards, > Jeff Davis > > test=> select version(); > > version > -------------------------------------------------------------------------------------------------------------------------- > PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc > (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12) > (1 row) > > test=> show lc_collate; > lc_collate > ------------- > en_US.UTF-8 > (1 row) > > test=> create table encoding_test(t text); > CREATE TABLE > test=> insert into encoding_test values('初'); > INSERT 0 1 > test=> insert into encoding_test values(convert('初' using > utf8_to_euc_jp)); > INSERT 0 1 > >
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote: > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. The usual way to approach this is to make convert return bytea instead of text. Then your problems vanish. Bytea can still be sorted, but it won't be treated as a text string and thus does not need to conform to the requirements of a text string. Languages like Perl distinguish between "encode" which is text->bytea and "decode" which is bytea->text. We've got "convert" for both and that causes problems. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote: > > > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > > Please show me concrete examples how I could introduce a > vulnerability > > > using this kind of convert() usage. > > > > Try the sequence below. Then, try to dump and then reload the > database. > > When you try to reload it, you will get an error: > > > > ERROR: invalid byte sequence for encoding "UTF8": 0xbd > > I know this could be a problem (like chr() with invalid byte pattern). > What I really want to know is, read query something like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using > utf8_to_euc_jp); I guess I don't quite understand the question. I agree that ORDER BY convert() must be safe in the C locale, because it just passes the strings to strcmp(). Are you saying that we should not remove convert() until we can support multiple locales in one database? If we make convert() operate on bytea and return bytea, as Tom suggested, would that solve your use case? Regards,Jeff Davis
>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding then it's broken. Imho convert() should
return a bytea value. If we had good encoding/charset support we could
do better, but we can't today.

The above example would work fine if convert() returned a bytea. In the
C locale the string would be compared byte for byte, and that's what you
get with bytea values as well.

Strings are not sequences of bytes that can be interpreted in different
ways; that's what bytea values are. Strings are always in a specific
encoding, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis
> On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote:
> > > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > > > Please show me concrete examples how I could introduce a vulnerability
> > > > using this kind of convert() usage.
> > >
> > > Try the sequence below. Then, try to dump and then reload the database.
> > > When you try to reload it, you will get an error:
> > >
> > > ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
> >
> > I know this could be a problem (like chr() with invalid byte pattern).
> > What I really want to know is, read query something like this:
> >
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> > utf8_to_euc_jp);
>
> I guess I don't quite understand the question.
>
> I agree that ORDER BY convert() must be safe in the C locale, because it
> just passes the strings to strcmp().
>
> Are you saying that we should not remove convert() until we can support
> multiple locales in one database?
>
> If we make convert() operate on bytea and return bytea, as Tom
> suggested, would that solve your use case?

The problem is, the above use case is just one of the ones I can think of.
Another use case is something like this:

SELECT sum(octet_length(convert(text_column using utf8_to_euc_jp))) FROM mytable;

to know the total byte length of a text column if it's encoded in EUC_JP.
So I'm not sure we could change convert() to return bytea without
complaints from users...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
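[Illustrative sketch, not part of the original message: the byte-counting
use case survives a bytea-returning conversion, since octet_length() is
defined for bytea as well. convert_to() here stands in for the
bytea-returning form under discussion; the table and column names are
Tatsuo's hypothetical ones.]

    SELECT sum(octet_length(convert_to(text_column, 'EUC_JP')))
    FROM mytable;   -- total byte length of the column when encoded in EUC_JP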
Andrew Dunstan wrote:
>> Instead of the code point, I'd prefer the actual encoding of
>> the character as argument to chr() and return value of ascii().
>
> And frankly, I don't know how to do it sanely anyway. A character
> encoding has a fixed byte pattern, but a given byte pattern doesn't
> have a single universal number value. I really don't think we want to
> have the value of chr(n) depend on the endianness of the machine, do we?
>
> The reason we are prepared to make an exception for Unicode is
> precisely because the code point maps to an encoding pattern
> independently of architecture, ISTM.

Point taken.

I only wanted to make sure that there are good reasons to differ from
Oracle. Oracle's chr() is big-endian on all platforms, BTW.

Yours,
Laurenz Albe
On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> The reason we have a problem here is that we've been choosing
> convenience over safety in encoding-related issues. I wonder if we must
> stoop to having a "strict_encoding_checks" GUC variable to satisfy
> everyone.

That would be satisfactory to me. However, I'm sure some will cringe at
a MySQL-like configurable integrity option. Might it be better as an
initdb-time option? I can't think why anyone would want to change it
later.

> It might work the way you are expecting if the database uses SQL_ASCII
> encoding and C locale --- and I'd be fine with allowing convert() only
> when the database encoding is SQL_ASCII.

I prefer this option. It's consistent with the docs, which say: "The
SQL_ASCII setting behaves considerably differently from the other
settings," and also "One way to use multiple encodings safely is to set
the locale to C or POSIX during initdb, thus disabling any real locale
awareness."

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
>> It might work the way you are expecting if the database uses SQL_ASCII
>> encoding and C locale --- and I'd be fine with allowing convert() only
>> when the database encoding is SQL_ASCII.

> I prefer this option.

I think really the technically cleanest solution would be to make
convert() return bytea instead of text; then we'd not have to put
restrictions on what encoding or locale it's working inside of.
However, it's not clear to me whether there are valid usages that
that would foreclose. Tatsuo mentioned length() but bytea has that.

What I think we'd need to have a complete solution is

convert(text, name) returns bytea
	-- convert from DB encoding to arbitrary encoding

convert(bytea, name, name) returns bytea
	-- convert between any two encodings

convert(bytea, name) returns text
	-- convert from arbitrary encoding to DB encoding

The second and third would need to do a verify step before
converting, of course.

Comments?

			regards, tom lane
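[Illustrative sketch of how the proposed overloads would read, not part
of the original message. Of these spellings only the three-argument
bytea form was ultimately shipped under the name convert(); the other
two were renamed later in the thread (convert_to/convert_from). The byte
values are arbitrary valid EUC_JP data.]

    SELECT convert('初', 'EUC_JP');                          -- text -> bytea (proposed form)
    SELECT convert(decode('a4cb', 'hex'), 'EUC_JP', 'UTF8'); -- bytea -> bytea
    SELECT convert(decode('a4cb', 'hex'), 'EUC_JP');         -- bytea -> text in the DB
                                                             -- encoding (proposed form)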
On Tue, 2007-09-11 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> >> It might work the way you are expecting if the database uses SQL_ASCII
> >> encoding and C locale --- and I'd be fine with allowing convert() only
> >> when the database encoding is SQL_ASCII.
>
> > I prefer this option.
>
> I think really the technically cleanest solution would be to make
> convert() return bytea instead of text; then we'd not have to put
> restrictions on what encoding or locale it's working inside of.
> However, it's not clear to me whether there are valid usages that
> that would foreclose. Tatsuo mentioned length() but bytea has that.

Once it's in bytea, you can make operators to achieve the old
functionality. If I understood correctly, he was making a backwards
compatibility argument, not a functionality argument. I can't think of a
problem without a workaround, but maybe there are some.

> What I think we'd need to have a complete solution is
>
> convert(text, name) returns bytea
> 	-- convert from DB encoding to arbitrary encoding
>
> convert(bytea, name, name) returns bytea
> 	-- convert between any two encodings
>
> convert(bytea, name) returns text
> 	-- convert from arbitrary encoding to DB encoding
>
> The second and third would need to do a verify step before
> converting, of course.

I like it.

Regards,
	Jeff Davis
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> >> It might work the way you are expecting if the database uses SQL_ASCII
> >> encoding and C locale --- and I'd be fine with allowing convert() only
> >> when the database encoding is SQL_ASCII.
>
> > I prefer this option.
>
> I think really the technically cleanest solution would be to make
> convert() return bytea instead of text; then we'd not have to put
> restrictions on what encoding or locale it's working inside of.
> However, it's not clear to me whether there are valid usages that
> that would foreclose. Tatsuo mentioned length() but bytea has that.

But length(bytea) cannot count characters, only bytes.

> What I think we'd need to have a complete solution is
>
> convert(text, name) returns bytea
> 	-- convert from DB encoding to arbitrary encoding
>
> convert(bytea, name, name) returns bytea
> 	-- convert between any two encodings
>
> convert(bytea, name) returns text
> 	-- convert from arbitrary encoding to DB encoding

That seems good. This is the encode/decode that other systems have.

However ISTM we would also need something like

length(bytea, name) returns int
	-- counts the number of characters assuming that the bytea is in
	-- the given encoding

Hmm, I wonder if counting chars is consistent regardless of the
encoding the string is in. To me it sounds like it should, in which
case it works to convert to the DB encoding and count chars there.

--
Alvaro Herrera              Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right?
<crab> i.e., "deadly poison"
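[Hedged illustration, not part of the original message: the workaround
Alvaro alludes to can be written by converting the bytea into the
database encoding and counting characters there. It assumes a UTF8
database and the bytea-to-text conversion proposed above (convert_from()
in the form that was eventually committed); the hex string is simply two
valid EUC_JP characters.]

    SELECT char_length(convert_from(decode('a4cba4db', 'hex'), 'EUC_JP'));
    -- four EUC_JP bytes, counted as 2 characters after conversion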
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I think really the technically cleanest solution would be to make >> convert() return bytea instead of text; then we'd not have to put >> restrictions on what encoding or locale it's working inside of. >> However, it's not clear to me whether there are valid usages that >> that would foreclose. Tatsuo mentioned length() but bytea has that. > But length(bytea) cannot count characters, only bytes. So what? If you want characters, just count the original text string. Encoding conversion won't change that. > Hmm, I wonder if counting chars is consistent regardless of the > encoding the string is in. To me it sounds like it should, in which > case it works to convert to the DB encoding and count chars there. A conversion that isn't one-for-one is not merely an encoding conversion IMHO. regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes: >> If we make convert() operate on bytea and return bytea, as Tom >> suggested, would that solve your use case? > The problem is, the above use case is just one of what I can think of. > Another use case is, something like this: > SELECT sum(octet_length(convert(text_column using utf8_to_euc_jp))) FROM mytable; > to know the total byte length of text column if it's encoded in > EUC_JP. Since octet_length exists and produces identical results for both text and bytea, this example is hardly very convincing... regards, tom lane
> However ISTM we would also need something like
>
> length(bytea, name) returns int
> 	-- counts the number of characters assuming that the bytea is in
> 	-- the given encoding
>
> Hmm, I wonder if counting chars is consistent regardless of the
> encoding the string is in. To me it sounds like it should, in which
> case it works to convert to the DB encoding and count chars there.

Not necessarily. It's possible that the number of chars is different
before and after an encoding conversion. An example is UTF-8 and
EUC_JIS_2004:

0xa4f7 (EUC_JIS_2004) <--> U+304B *and* U+309A (Unicode)

This is defined in the Japanese government's standard.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tom Lane wrote: > I think really the technically cleanest solution would be to make > convert() return bytea instead of text; then we'd not have to put > restrictions on what encoding or locale it's working inside of. > However, it's not clear to me whether there are valid usages that > that would foreclose. Tatsuo mentioned length() but bytea has that. > > What I think we'd need to have a complete solution is > > convert(text, name) returns bytea > -- convert from DB encoding to arbitrary encoding > > convert(bytea, name, name) returns bytea > -- convert between any two encodings > > convert(bytea, name) returns text > -- convert from arbitrary encoding to DB encoding > > The second and third would need to do a verify step before > converting, of course. > > > Are you wanting this done for 8.3? If so, by whom? :-) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Are you wanting this done for 8.3? If so, by whom? :-) [ shrug... ] I'm not the one who's worried about closing all the holes leading to encoding problems. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Are you wanting this done for 8.3? If so, by whom? :-) >> > > [ shrug... ] I'm not the one who's worried about closing all the holes > leading to encoding problems. > > > I can certainly have a go at it. Are we still talking about Oct 1 for a possible beta? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I can certainly have a go at it. Are we still talking about Oct 1 for a > possible beta? Yeah, there's still a little time left --- HOT will take at least a few more days. regards, tom lane
Tom Lane wrote: > What I think we'd need to have a complete solution is > > convert(text, name) returns bytea > -- convert from DB encoding to arbitrary encoding > > convert(bytea, name, name) returns bytea > -- convert between any two encodings > > convert(bytea, name) returns text > -- convert from arbitrary encoding to DB encoding > > The second and third would need to do a verify step before > converting, of course. > > > Is there any reason these functions shouldn't be marked immutable? Also, I'm wondering if we should give them disambiguating names, rather than call them all convert. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> What I think we'd need to have a complete solution is >> >> convert(text, name) returns bytea >> -- convert from DB encoding to arbitrary encoding >> >> convert(bytea, name, name) returns bytea >> -- convert between any two encodings >> >> convert(bytea, name) returns text >> -- convert from arbitrary encoding to DB encoding >> >> The second and third would need to do a verify step before >> converting, of course. > Is there any reason these functions shouldn't be marked immutable? No, not as long as DB encoding is frozen... > I'm wondering if we should give them disambiguating names, rather than > call them all convert. No. We have a function overloading system, we should use it. regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Tom Lane wrote:
>>> What I think we'd need to have a complete solution is
>>>
>>> convert(text, name) returns bytea
>>> 	-- convert from DB encoding to arbitrary encoding
>>>
>>> convert(bytea, name, name) returns bytea
>>> 	-- convert between any two encodings
>>>
>>> convert(bytea, name) returns text
>>> 	-- convert from arbitrary encoding to DB encoding
>>>
>>> The second and third would need to do a verify step before
>>> converting, of course.
>>
>> I'm wondering if we should give them disambiguating names, rather than
>> call them all convert.
>
> No. We have a function overloading system, we should use it.

In general I agree with you.

What's bothering me here though is that in the two argument forms, if
the first argument is text the second argument is the destination
encoding, but if the first argument is a bytea the second argument is
the source encoding. That strikes me as likely to be quite confusing,
and we might alleviate that with something like:

text convert_from(bytea, name)
bytea convert_to(text, name)

But if I'm the only one bothered by it I won't worry.

cheers

andrew
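[For illustration, not part of the original message: the proposed names
in use. These are the signatures that were ultimately committed later in
the thread. Note how the second argument names the destination encoding
in one case and the source encoding in the other, which is the asymmetry
being discussed.]

    SELECT convert_to('初', 'EUC_JP');
        -- text in the DB encoding -> bytea; 'EUC_JP' is the destination encoding

    SELECT convert_from(convert_to('初', 'EUC_JP'), 'EUC_JP');
        -- bytea -> text in the DB encoding; 'EUC_JP' is the source encoding
        -- (round-trips to '初')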
On Tue, 2007-09-18 at 08:08, Andrew Dunstan wrote:
> Tom Lane wrote:
> > Andrew Dunstan <andrew@dunslane.net> writes:
> >> Tom Lane wrote:
> >>> What I think we'd need to have a complete solution is
> >>>
> >>> convert(text, name) returns bytea
> >>> 	-- convert from DB encoding to arbitrary encoding
> >>>
> >>> convert(bytea, name, name) returns bytea
> >>> 	-- convert between any two encodings
> >>>
> >>> convert(bytea, name) returns text
> >>> 	-- convert from arbitrary encoding to DB encoding
> >>>
> >>> The second and third would need to do a verify step before
> >>> converting, of course.
> >
> >> I'm wondering if we should give them disambiguating names, rather than
> >> call them all convert.
> >
> > No. We have a function overloading system, we should use it.
>
> In general I agree with you.
>
> What's bothering me here though is that in the two argument forms, if
> the first argument is text the second argument is the destination
> encoding, but if the first argument is a bytea the second argument is
> the source encoding. That strikes me as likely to be quite confusing,
> and we might alleviate that with something like:
>
> text convert_from(bytea, name)
> bytea convert_to(text, name)

How is this fundamentally different from the encode/decode functions we
have now?

> But if I'm the only one bothered by it I won't worry.
>
> cheers
>
> andrew
Hannu Krosing wrote:
> On Tue, 2007-09-18 at 08:08, Andrew Dunstan wrote:
>> Tom Lane wrote:
>>> Andrew Dunstan <andrew@dunslane.net> writes:
>>>> Tom Lane wrote:
>>>>> What I think we'd need to have a complete solution is
>>>>>
>>>>> convert(text, name) returns bytea
>>>>> 	-- convert from DB encoding to arbitrary encoding
>>>>>
>>>>> convert(bytea, name, name) returns bytea
>>>>> 	-- convert between any two encodings
>>>>>
>>>>> convert(bytea, name) returns text
>>>>> 	-- convert from arbitrary encoding to DB encoding
>>>>>
>>>>> The second and third would need to do a verify step before
>>>>> converting, of course.
>>>>
>>>> I'm wondering if we should give them disambiguating names, rather than
>>>> call them all convert.
>>>
>>> No. We have a function overloading system, we should use it.
>>
>> In general I agree with you.
>>
>> What's bothering me here though is that in the two argument forms, if
>> the first argument is text the second argument is the destination
>> encoding, but if the first argument is a bytea the second argument is
>> the source encoding. That strikes me as likely to be quite confusing,
>> and we might alleviate that with something like:
>>
>> text convert_from(bytea, name)
>> bytea convert_to(text, name)
>
> How is this fundamentally different from the encode/decode functions we
> have now?

They are in effect reversed. encode() applies the named encoding to a
bytea; convert_from() above unapplies the named encoding (i.e. converts
the bytea to text in the database encoding).

cheers

andrew
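[A brief illustrative contrast, not part of the original message:
encode()/decode() move between bytea and a textual representation of the
raw bytes (hex, base64, escape), with no character-set conversion, while
convert_from()/convert_to() convert between text in the database encoding
and bytea in a named character set. The byte values are arbitrary valid
EUC_JP data.]

    SELECT encode(decode('a4cb', 'hex'), 'base64');
        -- bytea -> 'pMs=': just a different textual representation of the bytes

    SELECT convert_from(decode('a4cb', 'hex'), 'EUC_JP');
        -- bytea assumed to be EUC_JP -> text in the database encoding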
Andrew Dunstan <andrew@dunslane.net> writes: > What's bothering me here though is that in the two argument forms, if > the first argument is text the second argument is the destination > encoding, but if the first argument is a bytea the second argument is > the source encoding. That strikes me as likely to be quite confusing, > and we might alleviate that with something like: > text convert_from(bytea, name) > bytea convert_to(text, name) In a green field this would be fine, but what will you do with the existing 2-argument form of convert()? [ pokes around... ] Actually, on looking into it, it seems the 2-argument form of convert() is there because somebody thought it would be a suitable approximation to SQL99's <form-of-use conversion>, which in reality I fear means something different entirely. I might not be grasping what the spec is trying to say, but I *think* that what they intend is that the argument of CONVERT(x USING encoding) has to be a client-side variable in embedded SQL, and that the intent is that x is in the specified encoding and it's brought into the DB encoding by the function. Or maybe I'm all wrong. Anyway, on the strength of that, these functions are definitely best named to stay away from the spec syntax, so +1 for your proposal above. regards, tom lane
Tom Lane wrote:
> Anyway, on the strength of that, these functions are definitely
> best named to stay away from the spec syntax, so +1 for your
> proposal above.

OK, I have committed this and the other functional changes that should
close the encoding holes. Catalog version bumped.

cheers

andrew
"Andrew Dunstan" <andrew@dunslane.net> writes: > Tom Lane wrote: > >> No. We have a function overloading system, we should use it. >> > In general I agree with you. > > What's bothering me here though is that in the two argument forms, if the first > argument is text the second argument is the destination encoding, but if the > first argument is a bytea the second argument is the source encoding. That > strikes me as likely to be quite confusing, and we might alleviate that with > something like: > > text convert_from(bytea, name) > bytea convert_to(text, name) > > But if I'm the only one bothered by it I won't worry. I tend to agree with you. We should only use overloading when the function is essentially the same just tweaked as appropriate for the datatype, not when the meaning is different. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com