Thread: Bug in UTF8-Validation Code?

Bug in UTF8-Validation Code?

From: Mario Weilguni
Hi,

I have a problem with a database: I can dump the database to a file, but
restoration fails. This happens with 8.1.4.

Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.

pg_dump testdb -f testdb.dump -Fc
pg_restore -f testdb.dump -d testdb => fails with an error: 
ERROR:  invalid byte sequence for encoding "UTF8": 0xa4

The problem itself comes from a CSV file, which is imported with \copy without 
proper quoting (so I have to fix this anyway), but I still think this is an 
error, making restoration very complicated in such cases...

Or am I doing something completely wrong here?

Best regards,
Mario Weilguni



Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz"
Mario Weilguni wrote:

> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.

This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

"It is your responsibility that the byte sequences you create are valid
characters in the server character set encoding."

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From: Mario Weilguni
On Tuesday, 13 March 2007 at 14:46, Albe Laurenz wrote:
> Mario Weilguni wrote:
> > Steps to reproduce:
> > create database testdb with encoding='UTF8';
> > \c testdb
> > create table test(x text);
> > insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
>
> This is working as expected, see the remark in
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>
> "It is your responsibility that the byte sequences you create
>  are valid characters in the server character set encoding."

In that case, pg_dump is doing the wrong thing here and should quote the
output. IMO it cannot be called working as expected when this makes database
dumps worthless, without any warning at dump time.

pg_dump should output \244 itself in that case.

Best regards,
Mario Weilguni


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan
Mario Weilguni wrote:
> On Tuesday, 13 March 2007 at 14:46, Albe Laurenz wrote:
>   
>> Mario Weilguni wrote:
>>     
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
>>>       
>> This is working as expected, see the remark in
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>>  are valid characters in the server character set encoding."
>>     
>
> In that case, pg_dump is doing the wrong thing here and should quote the
> output. IMO it cannot be called working as expected when this makes database
> dumps worthless, without any warning at dump time.
>
> pg_dump should output \244 itself in that case.
>
>   

The sentence quoted from the docs is perhaps less than a model of 
clarity. I would take it to mean that no client-encoding -> 
server-encoding translation will take place. Does it really mean that 
the server will happily accept any escaped byte sequence, whether or not 
it is valid for the server encoding? If so that seems ... odd.

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Mario Weilguni
On Tuesday, 13 March 2007 at 15:12, Andrew Dunstan wrote:
> The sentence quoted from the docs is perhaps less than a model of
> clarity. I would take it to mean that no client-encoding ->
> server-encoding translation will take place. Does it really mean that
> the server will happily accept any escaped byte sequence, whether or not
> it is valid for the server encoding? If so that seems ... odd.

Yes, \octal sequences are accepted even if invalid. The problem is, pg_dump
will happily dump those sequences as is, so in this case a raw 0xa4 byte is
emitted, and the dump cannot be restored with pg_restore.

A dangerous feature IMO, one that will make a lot of DB admins very unhappy
if they have to verify every day whether their precious database dumps can
be restored in case of an error.

Best regards,
Mario Weilguni




Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz"
Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
>>
>> This is working as expected, see the remark in
>>
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>>  are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing the wrong thing here and should quote the
> output. IMO it cannot be called working as expected when this makes database
> dumps worthless, without any warning at dump time.
>
> pg_dump should output \244 itself in that case.

True. Here is a test case on 8.2.3
(OS, database and client all use UTF8):

test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q
laurenz:~> pg_dump -d -t test -f test.sql

Here is an excerpt from 'od -c test.sql':

0001040   e   n   z  \n   -   -  \n  \n   I   N   S   E   R   T       I
0001060   N   T   O       t   e   s   t       V   A   L   U   E   S
0001100   (   '   c   o   r   r   e   c   t   :     303 244   '   )   ;
0001120  \n   I   N   S   E   R   T       I   N   T   O       t   e   s
0001140   t       V   A   L   U   E   S       (   '   i   n   c   o   r
0001160   r   e   c   t   :     244   '   )   ;  \n  \n  \n   -   -  \n

The invalid character (octal 244) is in the INSERT statement!

This makes psql gag:

test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR:  invalid byte sequence for encoding "UTF8": 0xa4
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".

A fix could be either that the server checks escape sequences for validity
or that pg_dump outputs invalid bytes as escape sequences.
Or pg_dump could stop with an error.
I think that the cleanest way would be the first.

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan
Albe Laurenz wrote:
> A fix could be either that the server checks escape sequences for validity
>   

This strikes me as essential. If the db has a certain encoding ISTM we 
are promising that all the text data is valid for that encoding.

The question in my mind is how we help people to recover from the fact 
that we haven't done that.

cheers

andrew





Re: Bug in UTF8-Validation Code?

From: "Joshua D. Drake"
Andrew Dunstan wrote:
> Albe Laurenz wrote:
>> A fix could be either that the server checks escape sequences for
>> validity
>>   
> 
> This strikes me as essential. If the db has a certain encoding ISTM we
> are promising that all the text data is valid for that encoding.
> 
> The question in my mind is how we help people to recover from the fact
> that we haven't done that.

Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
we had to use iconv?

Joshua D. Drake

> 
> cheers
> 
> andrew
> 


-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Bug in UTF8-Validation Code?

From: Mario Weilguni
On Tuesday, 13 March 2007 at 16:38, Joshua D. Drake wrote:
> Andrew Dunstan wrote:
> > Albe Laurenz wrote:
> >> A fix could be either that the server checks escape sequences for
> >> validity
> >
> > This strikes me as essential. If the db has a certain encoding ISTM we
> > are promising that all the text data is valid for that encoding.
> >
> > The question in my mind is how we help people to recover from the fact
> > that we haven't done that.
>
> Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
> we had to use iconv?
>

What issues? I've upgraded several 8.0 databases to 8.1 without having to use
iconv. Did I miss something?

Regards,
Mario Weilguni


Re: Bug in UTF8-Validation Code?

From: Michael Paesold
Andrew Dunstan wrote:
> Albe Laurenz wrote:
>> A fix could be either that the server checks escape sequences for 
>> validity
>>   
> 
> This strikes me as essential. If the db has a certain encoding ISTM we 
> are promising that all the text data is valid for that encoding.
> 
> The question in my mind is how we help people to recover from the fact 
> that we haven't done that.

I would also say that it's a bug that escape sequences can get characters 
into the database that are not valid in the specified encoding. If you 
compare the encoding to table constraints, there is no way to simply 
"escape" a constraint check.

This seems to violate the principle of consistency in ACID. Additionally, 
if you include pg_dump into ACID, it also violates durability, since it 
cannot restore what it wrote itself.
Is there anything in the SQL spec that asks for such a behaviour? I guess not.

A DBA will usually not even learn about this issue until they are presented 
with a failing restore.

Best Regards,
Michael Paesold


Re: Bug in UTF8-Validation Code?

From: Michael Fuhr
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
> On Tuesday, 13 March 2007 at 16:38, Joshua D. Drake wrote:
> > Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
> > we had to use iconv?
> 
> What issues? I've upgraded several 8.0 databases to 8.1 without having to use
> iconv. Did I miss something?

http://www.postgresql.org/docs/8.1/interactive/release-8-1.html

"Some users are having problems loading UTF-8 data into 8.1.X.  This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."

-- 
Michael Fuhr


Re: Bug in UTF8-Validation Code?

From: Peter Eisentraut
On Wednesday, 14 March 2007 at 08:01, Michael Paesold wrote:
> Is there anything in the SQL spec that asks for such a behaviour? I guess
> not.

I think that the octal escapes are a holdover from the single-byte days where 
they were simply a way to enter characters that are difficult to find on a 
keyboard.  In today's multi-encoding world, it would make more sense if there 
were an escape sequence for a *codepoint* which is then converted to the 
actual encoding (if possible and valid) in the server.  The meaning of 
codepoint is, however, character set dependent as well.

The SQL standard supports escape sequences for Unicode codepoints, which I 
think would be a very useful feature (try entering a UTF-8 character 
bytewise ...), but it's a bit weird to implement and it's not clear how to 
handle character sets other than Unicode.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Bug in UTF8-Validation Code?

From: Mario Weilguni
On Wednesday, 14 March 2007 at 08:01, Michael Paesold wrote:
> Andrew Dunstan wrote:
> >
> > This strikes me as essential. If the db has a certain encoding ISTM we
> > are promising that all the text data is valid for that encoding.
> >
> > The question in my mind is how we help people to recover from the fact
> > that we haven't done that.
>
> I would also say that it's a bug that escape sequences can get characters
> into the database that are not valid in the specified encoding. If you
> compare the encoding to table constraints, there is no way to simply
> "escape" a constraint check.
>
> This seems to violate the principle of consistency in ACID. Additionally,
> if you include pg_dump into ACID, it also violates durability, since it
> cannot restore what it wrote itself.
> Is there anything in the SQL spec that asks for such a behaviour? I guess
> not.
>
> A DBA will usually not even learn about this issue until they are presented
> with a failing restore.

Is there anything I can do to help with this problem? Maybe implementing a new
GUC variable that turns off accepting wrongly encoded sequences (so DBAs still
can turn it on if they really depend on it)?
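
Something like this, perhaps -- a purely hypothetical guc.c entry (the name
is invented, and it may well be better to reject such data unconditionally):

/* hypothetical boolean GUC; default off, i.e. invalid sequences rejected */
{
    {"accept_invalid_byte_sequences", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
        gettext_noop("Accepts escape sequences that are invalid in the server encoding."),
        NULL
    },
    &accept_invalid_byte_sequences,
    false, NULL, NULL
},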


Best regards,
Mario Weilguni


Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz"
Mario Weilguni wrote:
> Is there anything I can do to help with this problem? Maybe implementing
> a new GUC variable that turns off accepting wrongly encoded sequences
> (so DBAs still can turn it on if they really depend on it)?

I think that this should be done away with unconditionally.
Or does anybody have a good point for allowing corrupt data
in text columns?
Maybe it is the way it is now because nobody could be bothered
to add the appropriate checks...

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan
Albe Laurenz wrote:
> Mario Weilguni wrote:
>> Is there anything I can do to help with this problem? Maybe implementing
>> a new GUC variable that turns off accepting wrongly encoded sequences
>> (so DBAs still can turn it on if they really depend on it)?
>
> I think that this should be done away with unconditionally.
> Or does anybody have a good point for allowing corrupt data
> in text columns?
> Maybe it is the way it is now because nobody could be bothered
> to add the appropriate checks...
>
>   

I agree. It's more or less an integrity violation, IMNSHO.

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Jeff Davis
On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote:
> Hi,
> 
> I have a problem with a database: I can dump the database to a file, but
> restoration fails. This happens with 8.1.4.

I reported the same problem a while back:

http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php

Some people think it's a bug, some people don't. It is technically
documented behavior, but I don't think the documentation is clear
enough. I think it is a bug that should be fixed, and here's another
message in the thread that expresses my opinion:

http://archives.postgresql.org/pgsql-bugs/2006-11/msg00033.php

If you look at that email, it includes some examples of surprising
behaviors caused by that bug, particularly with bytea.

In some applications (for which it's impractical to change the source
code), I actually use a CHECK constraint (which raises an exception on
invalid utf8 data) on every text column so that some binary data doesn't
break my slony replication. 

I'd like to see this fixed.

Regards,
Jeff Davis



Re: Bug in UTF8-Validation Code?

From: Jeff Davis
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
> On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
> > On Tuesday, 13 March 2007 at 16:38, Joshua D. Drake wrote:
> > > Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
> > > we had to use iconv?
> >
> > What issues? I've upgraded several 8.0 databases to 8.1 without having to use
> > iconv. Did I miss something?
>
> http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
>
> "Some users are having problems loading UTF-8 data into 8.1.X.  This
> is because previous versions allowed invalid UTF-8 byte sequences
> to be entered into the database, and this release properly accepts
> only valid UTF-8 sequences. One way to correct a dumpfile is to run
> the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
>

If the above quote were actually true, then Mario wouldn't be having a
problem. Instead, it's half-true: Invalid byte sequences are rejected in
some situations and accepted in others. If postgresql consistently
rejected or consistently accepted invalid byte sequences, that would not
cause problems with COPY (meaning problems with pg_dump, slony, etc.).

Regards,
Jeff Davis



Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Jeff Davis wrote:
> On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
>   
>> On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
>>     
>>> On Tuesday, 13 March 2007 at 16:38, Joshua D. Drake wrote:
>>>       
>>>> Is this any different from the issues of moving 8.0.x to 8.1 UTF8, where
>>>> we had to use iconv?
>>>>         
>>> What issues? I've upgraded several 8.0 databases to 8.1 without having to use
>>> iconv. Did I miss something?
>>>       
>> http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
>>
>> "Some users are having problems loading UTF-8 data into 8.1.X.  This
>> is because previous versions allowed invalid UTF-8 byte sequences
>> to be entered into the database, and this release properly accepts
>> only valid UTF-8 sequences. One way to correct a dumpfile is to run
>> the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
>>
>>     
>
> If the above quote were actually true, then Mario wouldn't be having a
> problem. Instead, it's half-true: Invalid byte sequences are rejected in
> some situations and accepted in others. If postgresql consistently
> rejected or consistently accepted invalid byte sequences, that would not
> cause problems with COPY (meaning problems with pg_dump, slony, etc.).
>
>
>   

How can we fix this? Frankly, the statement in the docs warning about 
making sure that escaped sequences are valid in the server encoding is a 
cop-out. We don't accept invalid data elsewhere, and this should be no 
different IMNSHO. I don't see why this should be any different from, 
say, date or numeric data. For years people have sneered at MySQL 
because it accepted dates like Feb 31st, and rightly so. But this seems 
to me to be like our own version of the same problem.

Last year Jeff suggested adding something like:
   pg_verifymbstr(string,strlen(string),0);

to each relevant input routine. Would that be an acceptable solution? If 
not, what would be?
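
For concreteness, a rough sketch of what that check might look like in
textin() -- illustrative only, not a tested patch; it simply puts the
verify call in front of what textin() already does:

Datum
textin(PG_FUNCTION_ARGS)
{
    char       *inputText = PG_GETARG_CSTRING(0);
    int         len = strlen(inputText);
    text       *result;

    /* reject anything that is not valid in the server encoding */
    pg_verifymbstr(inputText, len, false);

    result = (text *) palloc(len + VARHDRSZ);
    SET_VARSIZE(result, len + VARHDRSZ);
    memcpy(VARDATA(result), inputText, len);

    PG_RETURN_TEXT_P(result);
}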

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Last year Jeff suggested adding something like:
>     pg_verifymbstr(string,strlen(string),0);
> to each relevant input routine. Would that be an acceptable solution?

The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the data's already been validated.  I'm not sure
how to fix that, but I think you'll get some push-back if you double
the encoding verification work in COPY for nothing.

Given that we are moving away from backslash-enabled literals, I'm
not as convinced as some that this must be fixed...
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Last year Jeff suggested adding something like:
>>     pg_verifymbstr(string,strlen(string),0);
>> to each relevant input routine. Would that be an acceptable solution?
>>     
>
> The problem with that is that it duplicates effort: in many cases
> (especially COPY IN) the data's already been validated.  I'm not sure
> how to fix that, but I think you'll get some push-back if you double
> the encoding verification work in COPY for nothing.
>
> Given that we are moving away from backslash-enabled literals, I'm
> not as convinced as some that this must be fixed...
>
>
>
>   

They will still be available in E'\nn' form, won't they?

One thought I had was that it might make sense to have a flag that would 
inhibit the check, that could be set (and reset) by routines that check 
for themselves, such as COPY IN. Then bulk load performance should not 
be hit much.

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> The problem with that is that it duplicates effort: in many cases
>> (especially COPY IN) the data's already been validated.

> One thought I had was that it might make sense to have a flag that would 
> inhibit the check, that could be set (and reset) by routines that check 
> for themselves, such as COPY IN. Then bulk load performance should not 
> be hit much.

Actually, I have to take back that objection: on closer look, COPY
validates the data only once and does so before applying its own
backslash-escaping rules.  So there is a risk in that path too.

It's still pretty annoying to be validating the data twice in the
common case where no backslash reduction occurred, but I'm not sure
I see any good way to avoid it.  I don't much want to add another
argument to input functions, and the global flag that you suggest
above seems too ugly/risky.

Would someone do some performance checking on the cost of adding
mbverify to textin()?  If it could be shown that it adds only
negligible overhead to COPY, on say hundred-byte-wide text fields,
then we could decide that this isn't worth worrying about.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Tom Lane
I wrote:
> Actually, I have to take back that objection: on closer look, COPY
> validates the data only once and does so before applying its own
> backslash-escaping rules.  So there is a risk in that path too.

> It's still pretty annoying to be validating the data twice in the
> common case where no backslash reduction occurred, but I'm not sure
> I see any good way to avoid it.

Further thought here: if we put encoding verification into textin()
and related functions, could we *remove* it from COPY IN, in the common
case where client and server encodings are the same?  Currently, copy.c
forces a trip through pg_client_to_server for multibyte encodings
even when the encodings are the same, so as to perform validation.
But I'm wondering whether we'd still need that.  There's no risk of
SQL injection in COPY data.  Bogus input encoding could possibly
make for confusion about where the field boundaries are, but bad
data is bad data in any case.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Tom Lane wrote:
> I wrote:
>   
>> Actually, I have to take back that objection: on closer look, COPY
>> validates the data only once and does so before applying its own
>> backslash-escaping rules.  So there is a risk in that path too.
>>     
>
>   
>> It's still pretty annoying to be validating the data twice in the
>> common case where no backslash reduction occurred, but I'm not sure
>> I see any good way to avoid it.
>>     
>
> Further thought here: if we put encoding verification into textin()
> and related functions, could we *remove* it from COPY IN, in the common
> case where client and server encodings are the same?  Currently, copy.c
> forces a trip through pg_client_to_server for multibyte encodings
> even when the encodings are the same, so as to perform validation.
> But I'm wondering whether we'd still need that.  There's no risk of
> SQL injection in COPY data.  Bogus input encoding could possibly
> make for confusion about where the field boundaries are, but bad
> data is bad data in any case.
>
>             regards, tom lane
>
>   


Here are some timing tests on 1m rows of random utf8-encoded 100-char
data. It doesn't look to me like the saving you're suggesting is worth 
the trouble.

baseline:

Time: 28228.325 ms
Time: 25987.740 ms
Time: 25950.707 ms
Time: 25756.371 ms
Time: 27589.719 ms
Time: 25774.417 ms


after adding suggested extra test to textin():


Time: 26722.376 ms
Time: 28343.226 ms
Time: 26529.364 ms
Time: 28020.140 ms
Time: 24836.853 ms
Time: 24860.530 ms


Script is:

\timing
create table xyz (x text);
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
drop table xyz;


Test platform: FC6, Athlon64.


cheers

andrew



Re: Bug in UTF8-Validation Code?

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Here are some timing tests in 1m rows of random utf8 encoded 100 char 
> data. It doesn't look to me like the saving you're suggesting is worth 
> the trouble.

Hmm ... not sure I believe your numbers.  Using a test file of 1m lines
of 100 random latin1 characters converted to utf8 (thus, about half and
half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
encoding:

regression=# \timing
Timing is on.
regression=# create temp table test(f1 text);
CREATE TABLE
Time: 5.047 ms
regression=# copy test from '/home/tgl/zzz1m';
COPY 1000000
Time: 4337.089 ms

and this in UTF8 encoding:

utf8=# \timing
Timing is on.
utf8=# create temp table test(f1 text);
CREATE TABLE
Time: 5.108 ms
utf8=# copy test from '/home/tgl/zzz1m';
COPY 1000000
Time: 7776.583 ms

The numbers aren't super repeatable, but it sure looks to me like the
encoding check adds at least 50% to the runtime in this example; so
doing it twice seems unpleasant.

(This is CVS HEAD, compiled without assert checking, on an x86_64
Fedora Core 6 box.)
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char 
>> data. It doesn't look to me like the saving you're suggesting is worth 
>> the trouble.
>>     
>
> Hmm ... not sure I believe your numbers.  Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>  
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>
>   

Are you comparing apples with apples? The db is utf8 in both of my cases.

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char 
>> data. It doesn't look to me like the saving you're suggesting is worth 
>> the trouble.
>>     
>
> Hmm ... not sure I believe your numbers.  Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>
>     
>   

Here are some test results that are closer to yours. I used a temp table 
and had cassert off and fsync off, and tried with several encodings.

The additional load from the test isn't 50% (I think you have added the
cost of going from ascii to utf8 to the cost of the test to get that
50%), but it is nevertheless appreciable.

I agree that we should look at not testing if the client and server 
encodings are the same, so we can reduce the difference.

cheers

andrew
                   Run  SQL_ASCII   LATIN1      UTF8

                     1    4659.38  4766.07   9134.53
                     2    7999.64  4003.13   6231.41
                     3    4178.46  6178.89   7266.39
  Without test       4    4201.70  3930.84  10154.38
                     5    4092.44  4444.52   9438.24
                     6    3977.34  4197.09   8866.56
               Average    4851.49  4586.76   8515.25


                     1   11993.86 12625.80  10109.89
                     2    4647.16  9192.53  11251.27
  With test          3    4211.02  9903.77  10097.37
                     4    9203.62  7045.06  10372.25
                     5    4121.39  4138.78  10386.92
                     6    3722.73  4552.09   7432.56
               Average    6316.63  7909.67   9941.71






Re: Bug in UTF8-Validation Code?

From: Grzegorz Jaskiewicz
<evil mode 1>
Maybe we should add a resource-intensive check to ascii encoding(s);
that would even the score ;p
</evil mode 1>

<evil mode 2>
let's test mysql on this and see how much worse it performs.
</evil mode 2>

-- 
Grzegorz 'the evil' Jaskiewicz

evil C/evil C++ developer for hire



Re: Bug in UTF8-Validation Code?

From: Martijn van Oosterhout
On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
> How can we fix this? Frankly, the statement in the docs warning about
> making sure that escaped sequences are valid in the server encoding is a
> cop-out. We don't accept invalid data elsewhere, and this should be no
> different IMNSHO. I don't see why this should be any different from,
> say, date or numeric data. For years people have sneered at MySQL
> because it accepted dates like Feb 31st, and rightly so. But this seems
> to me to be like our own version of the same problem.

It seems to me that the easiest solution would be to forbid \x?? escape
sequences where it's greater than \x7F for UTF-8 server encodings.
Instead introduce a \u escape for specifying the unicode character
directly. Under the basic principle that any escape sequence still has
to represent a single character. The result can be multiple bytes, but
you don't have to check for consistency anymore.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan
Martijn van Oosterhout wrote:
> On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
>   
>> How can we fix this? Frankly, the statement in the docs warning about 
>> making sure that escaped sequences are valid in the server encoding is a 
>> cop-out. We don't accept invalid data elsewhere, and this should be no 
>> different IMNSHO. I don't see why this should be any different from, 
>> say, date or numeric data. For years people have sneered at MySQL 
>> because it accepted dates like Feb 31st, and rightly so. But this seems 
>> to me to be like our own version of the same problem.
>>     
>
> It seems to me that the easiest solution would be to forbid \x?? escape
> sequences where it's greater than \x7F for UTF-8 server encodings.
> Instead introduce a \u escape for specifying the unicode character
> directly. Under the basic principle that any escape sequence still has
> to represent a single character. The result can be multiple bytes, but
> you don't have to check for consistency anymore.
>
> Have a nice day,
>   

The escape processing is actually done in the lexer in the case of 
literals. We have to allow for bytea literals there too, regardless of 
encoding. The lexer naturally has no notion of the intended destination 
of the literal, so we need to defer the validity check to the *in
functions for encoding-aware types. And, as Tom has noted, COPY does
its own escape processing but does it before the transcoding.

So ISTM that any solution other than something like I have proposed will 
probably involve substantial surgery.

It does also seem from my test results that transcoding to MB charsets 
(or at least to utf-8) is surprisingly expensive, and that this would be 
a good place to look at optimisation possibilities. The validity tests 
can also be somewhat expensive.

But correctness matters most, IMNSHO.

cheers

andrew





Re: Bug in UTF8-Validation Code?

From: Martijn van Oosterhout
On Sun, Mar 18, 2007 at 08:25:56AM -0400, Andrew Dunstan wrote:
> It does also seem from my test results that transcoding to MB charsets
> (or at least to utf-8) is surprisingly expensive, and that this would be
> a good place to look at optimisation possibilities. The validity tests
> can also be somewhat expensive.

Hmm, I just noticed that the string verifier works one character at a
time; at least that part could be dramatically optimised.
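
Something like this, say -- a rough standalone sketch of the kind of fast
path I mean, not the actual pg_wchar code (and it omits the overlong and
surrogate checks a real UTF-8 verifier needs for 3- and 4-byte sequences):

#include <stddef.h>
#include <stdbool.h>

static bool
utf8_string_is_valid(const unsigned char *s, size_t len)
{
    const unsigned char *end = s + len;

    while (s < end)
    {
        if (*s < 0x80)                  /* ASCII: the common case, no per-char call */
        {
            s++;
            continue;
        }
        if ((*s & 0xE0) == 0xC0)        /* 2-byte sequence */
        {
            if (end - s < 2 || (s[1] & 0xC0) != 0x80 || *s < 0xC2)
                return false;           /* truncated, bad continuation, or overlong */
            s += 2;
        }
        else if ((*s & 0xF0) == 0xE0)   /* 3-byte sequence */
        {
            if (end - s < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return false;
            s += 3;
        }
        else if ((*s & 0xF8) == 0xF0)   /* 4-byte sequence */
        {
            if (end - s < 4 || (s[1] & 0xC0) != 0x80 ||
                (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
                return false;
            s += 4;
        }
        else
            return false;               /* bare continuation or invalid start byte */
    }
    return true;
}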

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

I wrote: 
>
> The escape processing is actually done in the lexer in the case of 
> literals. We have to allow for bytea literals there too, regardless of 
> encoding. The lexer naturally has no notion of the intended 
> destination of the literal, so we need to defer the validity check to
> the *in functions for encoding-aware types. And, as Tom has noted,
> COPY does its own escape processing but does it before the transcoding.
>
> So ISTM that any solution other than something like I have proposed 
> will probably involve substantial surgery.

Below is a list of the input routines in the adt directory, courtesy of 
grep.

I'm thinking we will need to put checks in:
   varcharin   bpcharin   textin   unknownin (?)   namein (?)

Any others?

cheers

andrew


acl.c:aclitemin
bool.c:boolin
char.c:charin
date.c:timetypmodin
date.c:timetztypmodin
float.c:dasin
float.c:dsin
nabstime.c:abstimein
nabstime.c:reltimein
nabstime.c:tintervalin
name.c:namein
not_in.c:oidnotin
numeric.c:numerictypmodin
oid.c:oidin
oid.c:oidvectorin
regproc.c:regprocin
regproc.c:regprocedurein
regproc.c:regoperin
regproc.c:regoperatorin
regproc.c:regclassin
regproc.c:regtypein
tid.c:tidin
timestamp.c:timestamptypmodin
timestamp.c:timestamptztypmodin
timestamp.c:intervaltypmodin
varbit.c:bittypmodin
varbit.c:varbittypmodin
varchar.c:bpcharin
varchar.c:bpchartypmodin
varchar.c:varcharin
varchar.c:varchartypmodin
varlena.c:byteain
varlena.c:textin
varlena.c:unknownin
xid.c:xidin
xid.c:cidin



Re: Bug in UTF8-Validation Code?

From: Gregory Stark
"Andrew Dunstan" <andrew@dunslane.net> writes:

> Below is a list of the input routines in the adt directory, courtesy of grep.

Grep isn't a good way to get these; your list missed a bunch.

postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type);
        prosrc
---------------------
 abstimein
 aclitemin
 any_in
 anyarray_in
 anyelement_in
 array_in
 bit_in
 boolin
 box_in
 bpcharin
 byteain
 cash_in
 charin
 cidin
 cidr_in
 circle_in
 cstring_in
 date_in
 domain_in
 float4in
 float8in
 inet_in
 int2in
 int2vectorin
 int4in
 int8in
 internal_in
 interval_in
 language_handler_in
 line_in
 lseg_in
 macaddr_in
 namein
 numeric_in
 oidin
 oidvectorin
 opaque_in
 path_in
 point_in
 poly_in
 record_in
 regclassin
 regoperatorin
 regoperin
 regprocedurein
 regprocin
 regtypein
 reltimein
 smgrin
 textin
 tidin
 time_in
 timestamp_in
 timestamptz_in
 timetz_in
 tintervalin
 trigger_in
 unknownin
 uuid_in
 varbit_in
 varcharin
 void_in
 xidin
 xml_in
(64 rows)

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan

Gregory Stark wrote:
> "Andrew Dunstan" <andrew@dunslane.net> writes:
>
>   
>> Below is a list of the input routines in the adt directory, courtesy of grep.
>>     
>
> Grep isn't a good way to get these, your list missed a bunch.
>
> postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type);
>
>   
[snip]
> (64 rows)
>
>   

Ok, good point. Now, which of those need to have a check for valid encoding?

cheers

andrew


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Ok, good point. Now, which of those need to have a check for valid encoding?

The vast majority will barf on any non-ASCII character anyway ... only
the ones that don't will need a check.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: ITAGAKI Takahiro
Jeff Davis <pgsql@j-davis.com> wrote:

> Some people think it's a bug, some people don't. It is technically
> documented behavior, but I don't think the documentation is clear
> enough. I think it is a bug that should be fixed, and here's another
> message in the thread that expresses my opinion:

Agreed. I think it is a bug, too.
Insertion of invalid characters breaks subsequent read queries.

$ initdb --encoding=utf8
# CREATE TABLE test (t text);
# INSERT INTO test VALUES('A');
# SELECT * FROM test;
 t
---
 A
(1 row)

# INSERT INTO test VALUES(E'\200');
# SELECT * FROM test;
ERROR:  invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".


Could it lead to DoS?
http://www.postgresql.org/support/security
| [D] A vulnerability that is exploitable for denial-of-service,
| but requiring a valid prior login. 

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Bug in UTF8-Validation Code?

From: Mario Weilguni
On Sunday, 18 March 2007 at 12:36, Martijn van Oosterhout wrote:
>
> It seems to me that the easiest solution would be to forbid \x?? escape
> sequences where it's greater than \x7F for UTF-8 server encodings.
> Instead introduce a \u escape for specifying the unicode character
> directly. Under the basic principle that any escape sequence still has
> to represent a single character. The result can be multiple bytes, but
> you don't have to check for consistancy anymore.

Wouldn't the better solution be to change the order of validation and
backslash-sequence conversion? First convert the sequences, then reject
invalid bytes in the validation stage?

Regards
Mario Weilguni


Re: Bug in UTF8-Validation Code?

From: Bruce Momjian
Added to TODO:

    * Fix cases where invalid byte encodings are accepted by the database,
      but throw an error on SELECT

      http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php

Is anyone working on fixing this bug?

---------------------------------------------------------------------------

Mario Weilguni wrote:
> Hi,
> 
> I have a problem with a database: I can dump the database to a file, but
> restoration fails. This happens with 8.1.4.
> 
> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
> 
> pg_dump testdb -f testdb.dump -Fc
> pg_restore -f testdb.dump -d testdb => fails with an error: 
> ERROR:  invalid byte sequence for encoding "UTF8": 0xa4
> 
> The problem itself comes from a CSV file, which is imported with \copy without 
> proper quoting (so I have to fix this anyway), but I still think this is an 
> error, making restoration very complicated in such cases...
> 
> Or am I doing something completely wrong here?
> 
> Best regards,
> Mario Weilguni
> 

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Bruce Momjian wrote:
> Added to TODO:
> 
>     * Fix cases where invalid byte encodings are accepted by the database,
>       but throw an error on SELECT
>     
>       http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
> 
> Is anyone working on fixing this bug?

Hi, has anyone volunteered to fix this bug?  I did not see any reply on the 
mailing list to your question above.

mark


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Mark Dilger wrote:
> Bruce Momjian wrote:
>> Added to TODO:
>>
>>     * Fix cases where invalid byte encodings are accepted by the 
>> database,
>>       but throw an error on SELECT
>>     
>>       http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>>
>> Is anyone working on fixing this bug?
> 
> Hi, has anyone volunteered to fix this bug?  I did not see any reply on 
> the mailing list to your question above.
> 
> mark

OK, I can take a stab at fixing this.  I'd like to state some assumptions so 
people can comment and reply:

I assume that I need to fix *all* cases where invalid byte encodings get into 
the database through functions shipped in the core distribution.

I assume I do not need to worry about people getting bad data into the system 
through their own database extensions.

I assume that the COPY problem discussed up-thread goes away once you eliminate 
all the paths by which bad data can get into the system.  However, existing 
database installations with bad data already loaded will not be magically fixed 
with these code patches.

Do any of the string functions (see 
http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the 
risk of generating invalid utf8 encoded strings?  Do I need to add checks?  Are 
there known bugs with these functions in this regard?

If not, I assume I can add mbverify calls to the various input routines (textin, 
varcharin, etc) where invalid utf8 could otherwise enter the system.

I assume that this work can be limited to HEAD and that I don't need to 
back-patch it.  (I suspect this assumption is a contentious one.)

Advice and comments are welcome,

mark


Re: Bug in UTF8-Validation Code?

From: Martijn van Oosterhout
On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote:
> OK, I can take a stab at fixing this.  I'd like to state some assumptions
> so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings get
> into the database through functions shipped in the core distribution.

Yes.

> I assume I do not need to worry about people getting bad data into the
> system through their own database extensions.

That'd be rather difficult :)

> I assume that the COPY problem discussed up-thread goes away once you
> eliminate all the paths by which bad data can get into the system.
> However, existing database installations with bad data already loaded will
> not be magically fixed with these code patches.

Correct.

> Do any of the string functions (see
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run
> the risk of generating invalid utf8 encoded strings?  Do I need to add
> checks?  Are there known bugs with these functions in this regard?

I don't think so. They'd be bugs if they were...

> If not, I assume I can add mbverify calls to the various input routines
> (textin, varcharin, etc) where invalid utf8 could otherwise enter the
> system.

The only hard part is handling where the escaping and unescaping is
happening...

> I assume that this work can be limited to HEAD and that I don't need to
> back-patch it.  (I suspect this assumption is a contentious one.)

At the very least I'd start with HEAD. Whether it gets backpatched
probably depends on how invasive it ends up being...

There's also the performance angle. The current mbverify is very
inefficient for encodings like UTF-8. You might need to refactor a bit
there...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From: Andrew - Supernews
On 2007-04-01, Mark Dilger <pgsql@markdilger.com> wrote:
> Do any of the string functions (see 
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the 
> risk of generating invalid utf8 encoded strings?  Do I need to add checks?
> Are there known bugs with these functions in this regard?

The chr() function returns an octet, rather than a character; this is clearly
wrong and needs fixing.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Martijn van Oosterhout wrote:
> There's also the performance angle. The current mbverify is very
> inefficient for encodings like UTF-8. You might need to refactor a bit
> there...

There appears to be a lot of function call overhead in the current 
implementation.  In pg_verify_mbstr, the function pointer 
pg_wchar_table.mbverify is called for each multibyte character in a multibyte 
string.

Refactoring the way these table-driven functions work would impact lots of other
code.  Just grep for all files #including mb/pg_wchar.h for the list of them. 
The list includes interfaces/libpq, and I'm wondering if software that links 
against postgres might rely on these function prototypes?

mark


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Mark Dilger <pgsql@markdilger.com> writes:
> Refactoring the way these table driven functions work would impact
> lots of other code.  Just grep for all files #including mb/pg_wchar.h
> for the list of them.  The list includes interfaces/libpq, and I'm
> wondering if software that links against postgres might rely on these
> function prototypes?

No, we've never exported those with the intent that client code should
use 'em.  Doing so would require importing non-public headers, and
anyone who does that can have no grounds for complaining if the headers
change incompatibly.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Tatsuo Ishii
> Mark Dilger <pgsql@markdilger.com> writes:
> > Refactoring the way these table driven functions work would impact
> > lots of other code.  Just grep for all files #including mb/pg_wchar.h
> > for the list of them.  The list includes interfaces/libpq, and I'm
> > wondering if software that links against postgres might rely on these
> > function prototypes?
> 
> No, we've never exported those with the intent that client code should
> use 'em.  Doing so would require importing non-public headers, and
> anyone who does that can have no grounds for complaining if the headers
> change incompatibly.

I thought PQescapeString() of 8.3 uses mbverify functions to make sure
that a user-supplied multibyte string is valid.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Tatsuo Ishii <ishii@postgresql.org> writes:
>> No, we've never exported those with the intent that client code should
>> use 'em.

> I thought PQescapeString() of 8.3 uses mbverify functions to make sure
> that user supplied multibyte string is valid.

Certainly --- but we can change PQescapeString to match whatever we do
with the pg_wchar functions.  The question was whether we intend to
support client application code (outside libpq) using those functions.
That's definitely not the intent.  exports.txt lists only PQmblen and
pg_utf_mblen as exported (and I have to wonder why the latter is
separately exported...), which means that client code on modern
platforms isn't even capable of getting at the others.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Andrew - Supernews wrote:
> On 2007-04-01, Mark Dilger <pgsql@markdilger.com> wrote:
>> Do any of the string functions (see 
>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the 
>> risk of generating invalid utf8 encoded strings?  Do I need to add checks?
>> Are there known bugs with these functions in this regard?
> 
> The chr() function returns an octet, rather than a character; this is clearly
> wrong and needs fixing.
> 

Ok, I've altered the chr() function.  I am including a transcript from psql 
below.  There are several design concerns:

1) In the current implementation, chr(0) returns a 5-byte text object (4-bytes 
of overhead plus one byte of data) containing a null.  In the new 
implementation, this returns an error.  I don't know, but it is possible that 
people currently use things like "SELECT chr(0) || chr(0) || ..." to build up 
strings of nulls.

2) Under utf8, chr(X) fails for X = 128..255.  This may also break current
users' expectations.

3) The implicit modulus operation that was being performed by chr() is now gone, 
which might break some users.

4) You can't represent the high end of the astral plane with type INTEGER
unless you pass in a negative value, which is somewhat unintuitive.  Since chr()
expects an integer (and not a bigint) the user needs to handle the sign bit
correctly.

mark

---------------------




Welcome to psql 8.3devel, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit


pgsql=# select chr(0);
ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(65);
 chr
-----
 A
(1 row)

pgsql=# select chr(128);
ERROR:  character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(53398);
 chr
-----
 Ж
(1 row)

pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)

pgsql=# select chr(4036005254);
ERROR:  function chr(bigint) does not exist
LINE 1: select chr(4036005254);
               ^
HINT:  No function matches the given name and argument types. You might need
to add explicit type casts.


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Mark Dilger wrote:
> Andrew - Supernews wrote:
>> On 2007-04-01, Mark Dilger <pgsql@markdilger.com> wrote:
>>> Do any of the string functions (see 
>>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) 
>>> run the risk of generating invalid utf8 encoded strings?  Do I need 
>>> to add checks?
>>> Are there known bugs with these functions in this regard?
>>
>> The chr() function returns an octet, rather than a character; this is 
>> clearly
>> wrong and needs fixing.
>>
> 
> Ok, I've altered the chr() function.  I am including a transcript from 
> psql below.  There are several design concerns:
> 
> 1) In the current implementation, chr(0) returns a 5-byte text object 
> (4-bytes of overhead plus one byte of data) containing a null.  In the 
> new implementation, this returns an error.  I don't know, but it is 
> possible that people currently use things like "SELECT chr(0) || chr(0) 
> || ..." to build up strings of nulls.
> 
> 2) Under utf8, chr(X) fails for X = 128..255.  This may also break
> current users' expectations.
> 
> 3) The implicit modulus operation that was being performed by chr() is 
> now gone, which might break some users.
> 
> 4) You can't represent the high end of the astral plane with type
> INTEGER unless you pass in a negative value, which is somewhat
> unintuitive.  Since chr() expects an integer (and not a bigint) the user
> needs to handle the sign bit correctly.
> 
> mark
> 
> ---------------------
> 
> 
> 
> 
> Welcome to psql 8.3devel, the PostgreSQL interactive terminal.
> 
> Type:  \copyright for distribution terms
>        \h for help with SQL commands
>        \? for help with psql commands
>        \g or terminate with semicolon to execute query
>        \q to quit
> 
> pgsql=# select chr(0);
> ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(65);
>  chr
> -----
>  A
> (1 row)
> 
> pgsql=# select chr(128);
> ERROR:  character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(53398);
>  chr
> -----
>  Ж
> (1 row)
> 
> pgsql=# select chr(14989485);
>  chr
> -----
>  中
> (1 row)
> 
> pgsql=# select chr(4036005254);
> ERROR:  function chr(bigint) does not exist
> LINE 1: select chr(4036005254);
>                ^
> HINT:  No function matches the given name and argument types. You might 
> need to add explicit type casts.


Here's the code for the new chr() function:

Datum
chr(PG_FUNCTION_ARGS)
{
    int32       cvalue = PG_GETARG_INT32(0);
    text       *result;

    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
    {
        int         encoding,
                    len,
                    byteoff;
        uint32      buf[2];
        const char *bufptr;

        encoding = GetDatabaseEncoding();
        buf[0] = htonl(cvalue);
        buf[1] = 0;
        bufptr = (const char *) &buf;
        for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr; ++byteoff, ++bufptr)
            ;
        len = pg_encoding_mblen(encoding, bufptr);
        if (byteoff + len != sizeof(uint32) ||
            !pg_verify_mbstr(encoding, bufptr, len, true /* noError */))
            report_untranslatable_char(PG_SQL_ASCII, encoding, bufptr, sizeof(int32));
        result = (text *) palloc(VARHDRSZ + len);
        SET_VARSIZE(result, VARHDRSZ + len);
        memcpy(VARDATA(result), bufptr, len);
    }
    else
    {
        result = (text *) palloc(VARHDRSZ + 1);
        SET_VARSIZE(result, VARHDRSZ + 1);
        *VARDATA(result) = (char) cvalue;
    }

    PG_RETURN_TEXT_P(result);
}


Re: Bug in UTF8-Validation Code?

From: Tom Lane
Mark Dilger <pgsql@markdilger.com> writes:
>> pgsql=# select chr(14989485);
>> chr
>> -----
>> 中
>> (1 row)

Is there a principled rationale for this particular behavior as
opposed to any other?

In particular, in UTF8 land I'd have expected the argument of chr()
to be interpreted as a Unicode code point, not as actual UTF8 bytes
with a randomly-chosen endianness.

Not sure what to do in other multibyte encodings.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
>>> pgsql=# select chr(14989485);
>>> chr
>>> -----
>>> 中
>>> (1 row)
> 
> Is there a principled rationale for this particular behavior as
> opposed to any other?
> 
> In particular, in UTF8 land I'd have expected the argument of chr()
> to be interpreted as a Unicode code point, not as actual UTF8 bytes
> with a randomly-chosen endianness.
> 
> Not sure what to do in other multibyte encodings.

"Not sure what to do in other multibyte encodings" was pretty much my rationale 
for this particular behavior.  I standardized on network byte order because 
there are only two endiannesses to choose from, and the other seems to be a more
surprising choice.

I looked around on the web for a standard for how to convert an integer into a 
valid multibyte character and didn't find anything.  Andrew - Supernews has
said upthread that chr() is clearly wrong and needs to be fixed. If so, we
need a clear definition of what "fixed" means.

Any suggestions?

mark


Re: Bug in UTF8-Validation Code?

From: Mark Dilger
Mark Dilger wrote:
> Tom Lane wrote:
>> Mark Dilger <pgsql@markdilger.com> writes:
>>>> pgsql=# select chr(14989485);
>>>> chr
>>>> -----
>>>> 中
>>>> (1 row)
>>
>> Is there a principled rationale for this particular behavior as
>> opposed to any other?
>>
>> In particular, in UTF8 land I'd have expected the argument of chr()
>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>> with a randomly-chosen endianness.
>>
>> Not sure what to do in other multibyte encodings.
> 
> "Not sure what to do in other multibyte encodings" was pretty much my 
> rationale for this particular behavior.  I standardized on network byte 
> order because there are only two endiannesses to choose from, and the
> other seems to be a more surprising choice.
> 
> I looked around on the web for a standard for how to convert an integer 
> into a valid multibyte character and didn't find anything.  Andrew, 
> Supernews has said upthread that chr() is clearly wrong and needs to be 
> fixed. If so, we need some clear definition what "fixed" means.
> 
> Any suggestions?
> 
> mark

Another issue to consider when thinking about the correct definition of chr() is 
the expectation that ascii(chr(X)) = X.  This gets weird if X is greater than 255.  
If nothing else, the name "ascii" is no longer appropriate.

mark


Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Mark Dilger wrote:
> Tom Lane wrote:
>> Mark Dilger <pgsql@markdilger.com> writes:
>>>> pgsql=# select chr(14989485);
>>>> chr
>>>> -----
>>>> 中
>>>> (1 row)
>>
>> Is there a principled rationale for this particular behavior as
>> opposed to any other?
>>
>> In particular, in UTF8 land I'd have expected the argument of chr()
>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>> with a randomly-chosen endianness.
>>
>> Not sure what to do in other multibyte encodings.
> 
> "Not sure what to do in other multibyte encodings" was pretty much my 
> rationale for this particular behavior.  I standardized on network byte 
> order because there are only two endiannesses to choose from, and the 
> other seems to be the more surprising choice.
> 
> I looked around on the web for a standard for how to convert an integer 
> into a valid multibyte character and didn't find anything.  Andrew, 
> Supernews has said upthread that chr() is clearly wrong and needs to be 
> fixed. If so, we need a clear definition of what "fixed" means.
> 
> Any suggestions?
> 
> mark

Since chr() is defined in oracle_compat.c, I decided to look at what Oracle 
might do.  See 
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm

It looks to me like they are doing the same thing that I did, though I don't 
have Oracle installed anywhere to verify that.  Is there a difference?

mark


Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Mark Dilger wrote:

> Since chr() is defined in oracle_compat.c, I decided to look at what 
> Oracle might do.  See 
> http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm 
> 
> 
> It looks to me like they are doing the same thing that I did, though I 
> don't have Oracle installed anywhere to verify that.  Is there a 
> difference?

Reading that page again, I think I'd have to use mbrtowc() or similar in the 
spot where I'm currently just using the literal utf8 string.


Re: Bug in UTF8-Validation Code?

From
Andrew - Supernews
Date:
On 2007-04-02, Mark Dilger <pgsql@markdilger.com> wrote:
> Here's the code for the new chr() function:
>
>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Clearly wrong - this allows returning invalid UTF8 data in locale C, which
is not an uncommon setting to use.

Treating the parameter as bytes is wrong too - it should correspond to
whatever the natural character numbering for the encoding is; for utf8
that is the Unicode code point.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
Mark Dilger wrote:
>>> In particular, in UTF8 land I'd have expected the argument of chr()
>>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>>> with a randomly-chosen endianness.
>>>
>>> Not sure what to do in other multibyte encodings.
>>
>> "Not sure what to do in other multibyte encodings" was pretty much my

>> rationale for this particular behavior.  I standardized on network
byte
>> order because there are only two endianesses to choose from, and the
>> other seems to be a more surprising choice.
>
> Since chr() is defined in oracle_compat.c, I decided to look
> at what Oracle might do.  See
>
> http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm
>
> It looks to me like they are doing the same thing that I did,
> though I don't have Oracle installed anywhere to verify that.
> Is there a difference?

This is Oracle 10.2.0.3.0 ("latest and greatest") with UTF-8 encoding
(actually, Oracle chooses to call this encoding AL32UTF8):

SQL> SELECT ASCII('€') AS DEC,
  2         TO_CHAR(ASCII('€'), 'XXXXXX') AS HEX
  3  FROM DUAL;

       DEC HEX
---------- ----------------------------
  14844588  E282AC

SQL> SELECT CHR(14844588) AS EURO FROM DUAL;

EURO
----
€

I don't see how endianness enters into this at all - isn't that just
the question of how a byte is stored physically?

According to RFC 2279, the Euro,
Unicode code point 0x20AC = 0010 0000 1010 1100,
will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
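A quick standalone check of that bit arithmetic (plain C, illustrative
only; it simply applies the three-byte rule 1110xxxx 10xxxxxx 10xxxxxx):

#include <stdio.h>

/* Apply the RFC 2279 three-byte rule to code point U+20AC. */
int
main(void)
{
    unsigned    cp = 0x20AC;    /* Euro sign */
    unsigned char b0 = 0xE0 | (cp >> 12);
    unsigned char b1 = 0x80 | ((cp >> 6) & 0x3F);
    unsigned char b2 = 0x80 | (cp & 0x3F);

    printf("U+%04X -> %02X%02X%02X\n", cp, b0, b1, b2);
    /* prints: U+20AC -> E282AC */
    return 0;
}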

IMHO this is the only good and intuitive way for CHR() and ASCII().

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
Andrew - Supernews
Date:
On 2007-04-03, "Albe Laurenz" <all@adv.magwien.gv.at> wrote:
> According to RFC 2279, the Euro,
> Unicode code point 0x20AC = 0010 0000 1010 1100,
> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
>
> IMHO this is the only good and intuitive way for CHR() and ASCII().

It is beyond ludicrous for functions like chr() or ascii() to convert a
Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There is _NO SUCH
THING_ as 0xE282AC as a representation of a Unicode character - there is
either the code point, 0x20AC (which is a _number_), or the sequences of
_bytes_ that represent that code point in various encodings, of which the
three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8.

Functions like chr() and ascii() should be dealing with the _number_ of the
code point, not with its representation in transfer encodings.
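To make the distinction concrete, a minimal sketch (plain C, illustrative
only) going the other way, from the UTF-8 byte sequence back to the
code-point number:

#include <stdio.h>

/* Decode the three UTF-8 bytes E2 82 AC back to the single
 * code-point number 0x20AC. */
int
main(void)
{
    const unsigned char s[] = {0xE2, 0x82, 0xAC};
    unsigned    cp = ((s[0] & 0x0F) << 12)
                   | ((s[1] & 0x3F) << 6)
                   |  (s[2] & 0x3F);

    printf("U+%04X\n", cp);     /* prints: U+20AC */
    return 0;
}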

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Bug in UTF8-Validation Code?

From
Martijn van Oosterhout
Date:
On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
> IMHO this is the only good and intuitive way for CHR() and ASCII().

Hardly. The comment earlier about mbtowc was much closer to the mark.
And wide characters are defined as Unicode points.

Basically, CHR() takes a unicode point and returns that character
in a string appropriately encoded. ASCII() does the reverse.

Just about every multibyte encoding other than Unicode has the problem
of not distinguishing between the code point and the encoding of it.
Unicode is a collection of encodings based on the same set.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
Andrew wrote:
>> According to RFC 2279, the Euro,
>> Unicode code point 0x20AC = 0010 0000 1010 1100,
>> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
>>
>> IMHO this is the only good and intuitive way for CHR() and ASCII().
>
> It is beyond ludicrous for functions like chr() or ascii() to
> convert a Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There
> is _NO SUCH THING_ as 0xE282AC as a representation of a Unicode character
> - there is either the code point, 0x20AC (which is a _number_), or the
> sequences of _bytes_ that represent that code point in various encodings,
> of which the three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8.

Yes, 0xE2 0x82 0xAC is the representation in UTF-8, and UTF-8 is the
database encoding in use.

> Functions like chr() and ascii() should be dealing with the _number_
> of the code point, not with its representation in transfer encodings.

I think that we have a fundamental difference.

As far as I know, the term "code point" is only used in UNICODE and
is the first column in the list
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

So, if I understand you correctly, you want CHR() and ASCII()
to convert between characters (in the current database encoding)
and UNICODE code points (independent of database encoding).

What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
partly for Oracle compatibility?) is that CHR() and ASCII()
convert between a character (in database encoding) and
that database encoding in numeric form.

I think that what you suggest would be a useful function too,
but I certainly wouldn't call such a function ASCII() :^)

The current implementation seems closer to my idea of ASCII(),
only incomplete:

test=> select to_hex(ascii('€'));
 to_hex
--------
 e2
(1 row)

What do others think? Should the argument to CHR() be a Unicode
code point or the numeric representation of the database encoding?

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Martijn van Oosterhout wrote:
> On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
>> IMHO this is the only good and intuitive way for CHR() and ASCII().
> 
> Hardly. The comment earlier about mbtowc was much closer to the mark.
> And wide characters are defined as Unicode points.
> 
> Basically, CHR() takes a unicode point and returns that character
> in a string appropriately encoded. ASCII() does the reverse.
> 
> Just about every multibyte encoding other than Unicode has the problem
> of not distinguishing between the code point and the encoding of it.
> Unicode is a collection of encodings based on the same set.
> 
> Have a nice day,

Thanks for the feedback.  Would you say that the way I implemented things in the 
example code would be correct for multibyte non-Unicode encodings?  I don't see 
how to avoid the endianness issue for those encodings.

mark


Re: Bug in UTF8-Validation Code?

From
Tom Lane
Date:
Mark Dilger <pgsql@markdilger.com> writes:
> Martijn van Oosterhout wrote:
>> Just about every multibyte encoding other than Unicode has the problem
>> of not distinguishing between the code point and the encoding of it.

> Thanks for the feedback.  Would you say that the way I implemented things in the 
> example code would be correct for multibyte non-Unicode encodings?

I think it's probably defensible for non-Unicode encodings.  To do
otherwise would require (a) figuring out what the equivalent concept to
"code point" is for each encoding, and (b) having a separate code path
for each encoding to perform the mapping.  It's not clear that there
even is an answer to (a), and (b) seems like more work than chr() is
worth.  But we know what the right way is for Unicode, so we should
special case that one.

Note the points made that in all cases ascii() and chr() should be
inverses, and that you shouldn't just fall back to the old behavior
in SQL_ASCII encoding.  (My vote for SQL_ASCII would be to reject
values > 255.)
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Albe Laurenz wrote:
> What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
> partly for Oracle compatibility?) is that CHR() and ASCII()
> convert between a character (in database encoding) and
> that database encoding in numeric form.

Looking at Oracle documentation, it appears that you get different 
behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it 
with the argument USING NCHAR_CS.  Oracle 9i and higher have an 
additional function called NCHR(X) which is supposed to be the same as 
CHR(X USING NCHAR_CS).

On http://www.oraclehome.co.uk/chr-function.htm it says that "To use 
UTF8, you specify using nchar_cs in the argument list".  Does this mean 
that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe 
Laurenz wants?  Vice versa?

I'm not saying that Oracle compatibility is paramount.  But if we can 
get compatibility and a reasonable implementation at the same time, that 
seems like a bonus.

Once again, I don't have Oracle installed and cannot test this :(

mark


Re: Bug in UTF8-Validation Code?

From
Martijn van Oosterhout
Date:
On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
> I think it's probably defensible for non-Unicode encodings.  To do
> otherwise would require (a) figuring out what the equivalent concept to
> "code point" is for each encoding, and (b) having a separate code path
> for each encoding to perform the mapping.  It's not clear that there
> even is an answer to (a), and (b) seems like more work than chr() is
> worth.  But we know what the right way is for Unicode, so we should
> special case that one.

I dunno. I find it odd that if I want a pl/pgsql function to return a
Euro symbol, it has to know what encoding the DB is in. Though I
suppose that would call for a unicode_chr() function.

Is there any multibyte mapping other than unicode that distinguishes
between the character set and the encoding thereof?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
Mark Dilger wrote:
>> What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
>> partly for Oracle compatibility?) is that CHR() and ASCII()
>> convert between a character (in database encoding) and
>> that database encoding in numeric form.
>
> Looking at Oracle documentation, it appears that you get different
> behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it
> with the argument USING NCHAR_CS.  Oracle 9i and higher have an
> additional function called NCHR(X) which is supposed to be the same as
> CHR(X USING NCHAR_CS).
>
> On http://www.oraclehome.co.uk/chr-function.htm it says that "To use
> UTF8, you specify using nchar_cs in the argument list".  Does this mean
> that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe
> Laurenz wants?  Vice versa?

That web page is misleading at least, if not downright wrong.

It's just that an Oracle database has 2 character sets, a "database
character set" and a "national character set", the latter always being a
UNICODE encoding (the name "national character set" is somewhat
misleading).

This baroque concept is from those days when nobody had a UNICODE
database, but people still wanted to store characters not supported
by the "database character set" - in that case you could define a column
to be in the "national character set".

CHR(n) and CHR(n USING NCHAR_CS) = NCHR(n) are the same function, only
that the first one uses the "database character set" and the latter ones
the "national character set".

Nowadays this Oracle concept of "national character set" is nearly
obsolete, one normally uses a UNICODE "database character set".

Oracle has two things to say about CHR():
 "For single-byte character sets, if n > 256, then Oracle Database  returns the binary equivalent of n mod 256. For
multibytecharacter  sets, n must resolve to one entire code point. Invalid code points  are not validated, and the
resultof specifying invalid code points  is indeterminate." 

It seems that Oracle means "encoding" when it says "code point" :^)
We should of course reject invalid arguments!
I don't know if I like this modulus thing for single byte encodings
or not...
 "Use of the CHR function (either with or without the optional USING  NCHAR_CS clause) results in code that is not
portablebetween ASCII-  and EBCDIC-based machine architectures." 

There's one thing that strikes me as weird in your implementation:

> pgsql=# select chr(0);
> ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"

0x00 is a valid UNICODE code point and also a valid UTF-8 character!

To me (maybe only to me) CHR() and ASCII() have always had the look
and feel of "type casts" between "char" and integer, with all the lack
of portability this might imply.

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
"Zeugswetter Andreas ADI SD"
Date:
> What do others think? Should the argument to CHR() be a
> Unicode code point or the numeric representation of the
> database encoding?

When the database uses a single byte encoding, the chr function takes
the binary byte representation as an integer number between 0 and 255
(e.g. ascii code).
When the database encoding is one of the unicode encodings it takes a
unicode code point.
This is also what Oracle does.

Not sure what to do with other multibyte encodings.
Oracle only states that the numeric argument must resolve to one entire
code point,
whatever that is.

Andreas


Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
> When the database uses a single byte encoding, the chr function takes
> the binary byte representation as an integer number between 0 and 255
> (e.g. ascii code).
> When the database encoding is one of the unicode encodings it takes a
> unicode code point.
> This is also what Oracle does.

Sorry, but this is *NOT* what Oracle does.
At least if we can agree that the code point for the Euro sign
is 0x20AC.

SQL> SELECT ASCII('€') AS DEC,
  2         TO_CHAR(ASCII('€'), 'XXXXXX') AS HEX
  3  FROM DUAL;

       DEC HEX
---------- ----------------------------
  14844588  E282AC

The encoding in this example is AL32UTF8, which corresponds to
our UTF8.

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
"Zeugswetter Andreas ADI SD"
Date:
> > When the database uses a single byte encoding, the chr function takes
> > the binary byte representation as an integer number between 0 and 255
> > (e.g. ascii code).
> > When the database encoding is one of the unicode encodings it takes a
> > unicode code point.
> > This is also what Oracle does.
>
> Sorry, but this is *NOT* what Oracle does.
> At least if we can agree that the code point for the Euro
> sign is 0x20AC.

yes

>
> SQL> SELECT ASCII('€') AS DEC,
>   2         TO_CHAR(ASCII('€'), 'XXXXXX') AS HEX
>   3  FROM DUAL;
>
>        DEC HEX
> ---------- ----------------------------
>   14844588  E282AC
>
> The encoding in this example is AL32UTF8, which corresponds
> to our UTF8.

You are right, I am sorry. My test was broken.

To get the Euro symbol in Oracle with an AL32UTF8 encoding you use
chr(14844588).

Andreas


Re: Bug in UTF8-Validation Code?

From
Alvaro Herrera
Date:
Martijn van Oosterhout wrote:
> On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
> > I think it's probably defensible for non-Unicode encodings.  To do
> > otherwise would require (a) figuring out what the equivalent concept to
> > "code point" is for each encoding, and (b) having a separate code path
> > for each encoding to perform the mapping.  It's not clear that there
> > even is an answer to (a), and (b) seems like more work than chr() is
> > worth.  But we know what the right way is for Unicode, so we should
> > special case that one.
> 
> I dunno. I find it odd that if I want a pl/pgsql function to return a
> Euro symbol, it has to know what encoding the DB is in. Though I
> suppose that would call for a unicode_chr() function.

Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe > 255), and create a separate function
(unicode_char() sounds good) to get a Unicode character from a code
point, converted to the local client_encoding per conversion_procs.

So if I'm in Latin-1 and ask for the Euro sign, this should fail because
Latin-1 does not have the euro sign.  If I'm in Latin-9 I should get the
Euro.
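A toy sketch of that intended behavior (hand-rolled lookup for
illustration only -- not how conversion_procs actually work, and only
the Euro case is spelled out):

#include <stdio.h>

/* U+20AC has no LATIN1 equivalent, so that case should error out,
 * while LATIN9 put the Euro at byte 0xA4. */
static int
unicode_to_latin1(unsigned cp)
{
    return cp <= 0xFF ? (int) cp : -1;  /* no Euro anywhere in LATIN1 */
}

static int
unicode_to_latin9(unsigned cp)
{
    if (cp == 0x20AC)
        return 0xA4;    /* the Euro replaced the currency sign here */
    /* the other LATIN9 deviations from LATIN1 are omitted */
    return unicode_to_latin1(cp);
}

int
main(void)
{
    printf("LATIN9: %#x\n", unicode_to_latin9(0x20AC)); /* 0xa4 */
    printf("LATIN1: %d\n", unicode_to_latin1(0x20AC));  /* -1, i.e. error */
    return 0;
}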

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Bug in UTF8-Validation Code?

From
Andrew - Supernews
Date:
On 2007-04-04, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Right -- IMHO what we should be doing is reject any input to chr() which
> is beyond plain ASCII (or maybe > 255), and create a separate function
> (unicode_char() sounds good) to get a Unicode character from a code
> point, converted to the local client_encoding per conversion_procs.

Thinking about this made me realize that there's another, ahem, elephant
in the room here: convert().

By definition convert() returns text strings which are not valid in the
server encoding. How can this be addressed?

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Bug in UTF8-Validation Code?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Right -- IMHO what we should be doing is reject any input to chr() which
> is beyond plain ASCII (or maybe > 255), and create a separate function
> (unicode_char() sounds good) to get a Unicode character from a code
> point, converted to the local client_encoding per conversion_procs.

Hm, I hadn't thought of that approach, but another idea is that the
argument of chr() is *always* a unicode code point, and it converts
to the current encoding.  Do we really need a separate function?
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From
Tom Lane
Date:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
> Thinking about this made me realize that there's another, ahem, elephant
> in the room here: convert().
> By definition convert() returns text strings which are not valid in the
> server encoding. How can this be addressed?

Remove convert().  Or at least redefine it as dealing in bytea not text.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From
Martijn van Oosterhout
Date:
On Wed, Apr 04, 2007 at 10:22:28AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get a Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding.  Do we really need a separate function?

That's what I'd advocate, but then we're not Oracle compatible...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From
Tatsuo Ishii
Date:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get a Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
> 
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding.  Do we really need a separate function?

To be honest, I don't really see why we need to rush to add such
Unicode (I assume we are referring to "Unicode" as ISO 10646)
specialized functions at this point. Limiting chr() to the ASCII range
is enough, I think.

BTW, every encoding has its own charset. However, the relationship
between encoding and charset is not as simple as in Unicode. For
example, encoding EUC_JP corresponds to multiple charsets, namely
ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
returns a "code point" is not quite useful since it lacks the charset
info. I think we need to continue the design discussion, probably
targeting 8.4, not 8.3.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Bug in UTF8-Validation Code?

From
Alvaro Herrera
Date:
Tatsuo Ishii wrote:

> BTW, every encoding has its own charset. However, the relationship
> between encoding and charset is not as simple as in Unicode. For
> example, encoding EUC_JP corresponds to multiple charsets, namely
> ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> returns a "code point" is not quite useful since it lacks the charset
> info. I think we need to continue the design discussion, probably
> targeting 8.4, not 8.3.

Is Unicode complete as far as Japanese chars go?  I mean, is there a
character in EUC_JP that is not representable in Unicode?

Because if Unicode is complete, ISTM it makes perfect sense to have a
unicode_char() (or whatever we end up calling it) that takes a Unicode
code point and returns a character in whatever JIS set you want
(specified by setting client_encoding to that).  Because then you have
solved the problem nicely.


One thing that I find confusing in your text above is whether EUC_JP is
an encoding or a charset.  I would think that the various JIS X are
encodings, and EUC_JP is the charset; or is it the other way around?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Bug in UTF8-Validation Code?

From
Peter Eisentraut
Date:
On Wednesday, 4 April 2007 16:22, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get a Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding.  Do we really need a separate function?

The SQL standard has a "Unicode character string literal", which looks like 
this:

U&'The price is 100 \20AC.'

This is similar in spirit to our current escape mechanism available via E'...' 
which, however, produces bytes.  It has the advantage over a chr()-based 
mechanism that the composition of strings doesn't require an ugly chain of 
literals, functions, and concatenations.

Implementing this would, however, be a bit tricky because you don't have 
access to the encoding conversion functions in the lexer.  You would probably 
have to map that to a function call and evaluate it later.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Albe Laurenz wrote:
> There's one thing that strikes me as weird in your implementation:
> 
>> pgsql=# select chr(0);
>> ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> 
> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!

It's not my code that rejects this.  I'm passing the resultant string to 
the pg_verify_mbstr(...) function and it is rejecting a null.  I could 
change that, of course, but if other parts of the system found it 
reasonable to reject null, why should chr() be different?

This was also one of the questions I asked upthread.  Does changing the 
behavior of chr() break people's reasonable expectations?

mark



Re: Bug in UTF8-Validation Code?

From
Mark Dilger
Date:
Tatsuo Ishii wrote:

> <SNIP>. I think we need to continue the design discussion, probably
> targeting 8.4, not 8.3.

The discussion came about because Andrew - Supernews noticed that chr() 
returns invalid utf8, and we're trying to fix all the bugs with invalid 
utf8 in the system.  Something needs to be done, even if we just check 
the result of the current chr() implementation and throw an error on 
invalid results.  But do we want to make this minor change for 8.3 and 
then change it again for 8.4?

Here's an example of the current problem.  It's an 8.2.3 database with 
UTF8 encoding and an en_US locale:


mark=# create table testutf8 (t text);
CREATE TABLE
mark=# insert into testutf8 (t) (select chr(gs) from 
generate_series(0,255) as gs);
INSERT 0 256
mark=# \copy testutf8 to testutf8.data
mark=# truncate testutf8;
TRUNCATE TABLE
mark=# \copy testutf8 from testutf8.data
ERROR:  invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match 
the encoding expected by the server, which is controlled by 
"client_encoding".
CONTEXT:  COPY testutf8, line 129




Re: Bug in UTF8-Validation Code?

From
Tom Lane
Date:
Mark Dilger <pgsql@markdilger.com> writes:
> Albe Laurenz wrote:
>> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!

> It's not my code that rejects this.  I'm passing the resultant string to 
> the pg_verify_mbstr(...) function and it is rejecting a null.  I could 
> change that, of course, but if other parts of the system found it 
> reasonable to reject null, why should chr() be different?

chr() really should reject that if it's going to do any error checking
at all, because text operations tend to misbehave on embedded nulls.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> 
> > <SNIP>. I think we need to continue the design discussion, probably
> > targeting 8.4, not 8.3.
> 
> The discussion came about because Andrew - Supernews noticed that chr() 
> returns invalid utf8, and we're trying to fix all the bugs with invalid 
> utf8 in the system.  Something needs to be done, even if we just check 
> the result of the current chr() implementation and throw an error on 
> invalid results.  But do we want to make this minor change for 8.3 and 
> then change it again for 8.4?

My opinion was in the part you snipped in the previous mail --
limiting chr() to the ASCII range.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Here's an example of the current problem.  It's an 8.2.3 database with 
> UTF8 encoding and an en_US locale:
> 
> 
> mark=# create table testutf8 (t text);
> CREATE TABLE
> mark=# insert into testutf8 (t) (select chr(gs) from 
> generate_series(0,255) as gs);
> INSERT 0 256
> mark=# \copy testutf8 to testutf8.data
> mark=# truncate testutf8;
> TRUNCATE TABLE
> mark=# \copy testutf8 from testutf8.data
> ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> HINT:  This error can also happen if the byte sequence does not match 
> the encoding expected by the server, which is controlled by 
> "client_encoding".
> CONTEXT:  COPY testutf8, line 129


Re: Bug in UTF8-Validation Code?

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> 
> > BTW, every encoding has its own charset. However, the relationship
> > between encoding and charset is not as simple as in Unicode. For
> > example, encoding EUC_JP corresponds to multiple charsets, namely
> > ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> > returns a "code point" is not quite useful since it lacks the charset
> > info. I think we need to continue the design discussion, probably
> > targeting 8.4, not 8.3.
> 
> Is Unicode complete as far as Japanese chars go?  I mean, is there a
> character in EUC_JP that is not representable in Unicode?

I don't think Unicode is "complete" in this case. Problems are: EUC_JP
allows user-defined characters which are not mapped to Unicode. Also,
some characters in EUC_JP correspond to multiple Unicode points.

> Because if Unicode is complete, ISTM it makes perfect sense to have a
> unicode_char() (or whatever we end up calling it) that takes a Unicode
> code point and returns a character in whatever JIS set you want
> (specified by setting client_encoding to that).  Because then you have
> solved the problem nicely.

I'm not sure what kind of use case for unicode_char() you are thinking
about. Anyway, if you want a "code point" from a character, we could
easily add such functions for all the backend encodings we currently
support. Probably it would look like:

to_code_point(str TEXT) returns TEXT

Example outputs:

ASCII - 41
ISO 10646 - U+0041
ISO 10646 - U+29E3D
ISO 8859-1 - a5
JIS X 0208 - 4141

It's a little bit too late for 8.2 though.

> One thing that I find confusing in your text above is whether EUC_JP is
> an encoding or a charset?  I would think that the various JIS X are
> encodings, and EUC_JP is the charset; or is it the other way around?

No, EUC_JP is an encoding. JIS X are the charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Bug in UTF8-Validation Code?

From
Tatsuo Ishii
Date:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
> > Thinking about this made me realize that there's another, ahem, elephant
> > in the room here: convert().
> > By definition convert() returns text strings which are not valid in the
> > server encoding. How can this be addressed?
> 
> Remove convert().  Or at least redefine it as dealing in bytea not text.

That would break some important use cases. 

1) A user has a UTF-8 database which contains data in various
   languages. Each language has its own table. He wants to sort a
   SELECT result by using ORDER BY. Since the locale cannot handle
   multiple languages, he uses the C locale and does the SELECT
   something like this:

   SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
   SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale
   is broken. He decided to use the C locale and wants to sort the
   result from SELECT like this:

   SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

   Note that sorting by UTF-8 physical order would produce random
   results, so the following would not help him in this case:

   SELECT * FROM japanese_table ORDER BY t;

Also, I don't understand how this is different from the problem where we
have a message catalogue which does not match the encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Bug in UTF8-Validation Code?

From
Andrew - Supernews
Date:
On 2007-04-05, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> > Thinking about this made me realize that there's another, ahem, elephant
>> > in the room here: convert().
>> > By definition convert() returns text strings which are not valid in the
>> > server encoding. How can this be addressed?
>> 
>> Remove convert().  Or at least redefine it as dealing in bytea not text.
>
> That would break some important use cases. 
>
> 1) A user has a UTF-8 database which contains data in various
>    languages. Each language has its own table. He wants to sort a
>    SELECT result by using ORDER BY. Since the locale cannot handle
>    multiple languages, he uses the C locale and does the SELECT
>    something like this:
>
>    SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
>    SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That works without change if convert(text,text) returns bytea.
>
> 2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale
>    is broken. He decided to use the C locale and wants to sort the
>    result from SELECT like this:
>
>    SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That also works without change if convert(text,text) returns bytea.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
Tatsuo Ishii wrote:
> I think we need to continue the design discussion, probably
> targeting 8.4, not 8.3.

But isn't a simple fix for chr() and ascii(), which does not
require a redesign, a Good Thing for 8.3 if possible? Something
that maintains as much upward and/or Oracle compatibility as
possible while doing away with ascii('€') returning 226 in UTF-8?

And I think - correct me if I am wrong - that conversion between
character and integer representation of the character in the current
database encoding is exactly that.

I see Tom Lane's point in rejecting chr(0), though.

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
Martijn van Oosterhout
Date:
On Thu, Apr 05, 2007 at 09:34:25AM +0900, Tatsuo Ishii wrote:
> I'm not sure what kind of use case for unicode_char() you are thinking
> about. Anyway if you want a "code point" from a character, we could
> easily add such functions to all backend encodings currently we
> support. Probably it would look like:

I think the problem is that most encodings do not have the concept of a
code point anyway, so implementing it for them is fairly useless.

> Example outputs:
>
> ASCII - 41
> ISO 10646 - U+0041
> ISO 10646 - U+29E3D
> ISO 8859-1 - a5
> JIS X 0208 - 4141

In every case other than Unicode you're doing the same thing as
encode/decode. Since we already have those functions, there's no need
to get chr/ascii to duplicate it. In the case of UTF-8 however, it does
something that is not done by encode/decode, hence the proposal to
simply extend chr/ascii to do that.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From
Martijn van Oosterhout
Date:
On Thu, Apr 05, 2007 at 11:52:14AM +0200, Albe Laurenz wrote:
> But isn't a simple fix for chr() and ascii(), which does not
> require a redesign, a Good Thing for 8.3 if possible? Something
> that maintains as much upward and/or Oracle compatibility as
> possible while doing away with ascii('€') returning 226 in UTF-8?

I think the earlier expressed idea of getting chr/ascii to bail on
non-ASCII characters isn't a bad one.

I think your idea is bad in the sense that I actually need to know the
encoding of the character I want to be able to use it. If I knew the
encoding already, I could just use encode().

What I was thinking of was something that, irrespective of the
encoding, gave me a string properly encoded with the character I want.
AFAIK Unicode is the only character set that actually numbers the
characters in a way not related to the encoding, so it would seem
useful to be able to give a Unicode character number and get a string
with that character...

So your implementation is simply:
1. Take the number and make a UTF-8 string.
2. Convert it to the database encoding.
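Step 1 is purely mechanical. A minimal sketch of it (plain C,
illustrative only, covering the four RFC 3629 length classes with range
checks omitted; step 2 would go through the server's existing
conversion machinery):

#include <stdio.h>

/* Encode a Unicode code point as UTF-8 bytes; returns the length. */
static int
cp_to_utf8(unsigned cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int
main(void)
{
    unsigned char buf[4];
    int         i,
                len = cp_to_utf8(0x20AC, buf);

    for (i = 0; i < len; i++)
        printf("%02X", buf[i]);     /* prints: E282AC */
    printf("\n");
    return 0;
}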

> And I think - correct me if I am wrong - that conversion between
> character and integer representation of the character in the current
> database encoding is exactly that.

AFAIK there is no "integer representation" of a character in anything
other than Unicode. Unicode is the only case that cannot be handled by
a simple encode/decode.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bug in UTF8-Validation Code?

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> I think the problem is that most encodings do not have the concept of a
> code point anyway, so implementing it for them is fairly useless.

Yeah.  I'm beginning to think that the right thing to do is

(a) make chr/ascii do the same thing as Oracle (ie, as in the discussed
patch)

(b) make a different pair of functions that translate Unicode code
points to/from the current database encoding.
        regards, tom lane


Re: Bug in UTF8-Validation Code?

From
"Albe Laurenz"
Date:
Martijn van Oosterhout wrote:
> So your implementation is simply:
> 1. Take the number and make a UTF-8 string.
> 2. Convert it to the database encoding.

Aah, now I can spot where the misunderstanding is.
That's not what I mean.

I mean that chr() should simply 'typecast' to "char".

So when the database encoding is UTF8, I want chr(14844588) to return
a Euro sign. When the encoding is LATIN9, chr(14844588) should either
yield the 'not' sign (UNICODE 0xAC) or an error message, depending on
whether we want chr() to operate mod 256 for single-byte character
sets the way Oracle has it, while chr(164) should return the Euro
sign for a LATIN9 database encoding.
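For the mod-256 reading, the arithmetic is just this (a one-liner,
only to make the example concrete):

#include <stdio.h>

/* 14844588 is 0xE282AC; its low byte is 0xAC, the 'not' sign in
 * LATIN9 (and in Unicode), while 164 is 0xA4, the LATIN9 Euro. */
int
main(void)
{
    printf("%#x\n", 14844588 % 256);    /* prints: 0xac */
    return 0;
}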

Yours,
Laurenz Albe


Re: Bug in UTF8-Validation Code?

From
Andrew Dunstan
Date:
What is the state of play with this item? I think this is a must-fix bug 
for 8.3. There was a flurry of messages back in April but since then I 
don't recall seeing anything.

cheers

andrew



Mark Dilger wrote:
> Mark Dilger wrote:
>> Bruce Momjian wrote:
>>> Added to TODO:
>>>
>>>     * Fix cases where invalid byte encodings are accepted by the 
>>> database,
>>>       but throw an error on SELECT
>>>           
>>> http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>>>
>>> Is anyone working on fixing this bug?
>>
>> Hi, has anyone volunteered to fix this bug?  I did not see any reply 
>> on the mailing list to your question above.
>>
>> mark
>
> OK, I can take a stab at fixing this.  I'd like to state some 
> assumptions so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings 
> get into the database through functions shipped in the core distribution.
>
> I assume I do not need to worry about people getting bad data into the 
> system through their own database extensions.
>
> I assume that the COPY problem discussed up-thread goes away once you 
> eliminate all the paths by which bad data can get into the system.  
> However, existing database installations with bad data already loaded 
> will not be magically fixed with these code patches.
>
> Do any of the string functions (see 
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) 
> run the risk of generating invalid utf8 encoded strings?  Do I need to 
> add checks?  Are there known bugs with these functions in this regard?
>
> If not, I assume I can add mbverify calls to the various input 
> routines (textin, varcharin, etc) where invalid utf8 could otherwise 
> enter the system.
>
> I assume that this work can be limited to HEAD and that I don't need 
> to back-patch it.  (I suspect this assumption is a contentious one.)
>
> Advice and comments are welcome,
>
>