Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Bug in UTF8-Validation Code?
Msg-id 45FC0D39.8090802@dunslane.net
In response to Re: Bug in UTF8-Validation Code?  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Bug in UTF8-Validation Code?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Bug in UTF8-Validation Code?  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
>   
>> On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
>>     
>>> On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
>>>       
>>>> Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
>>>> we had to use iconv?
>>>>         
>>> What issues? I've upgraded several 8.0 databases to 8.1 without having to use
>>> iconv. Did I miss something?
>>>       
>> http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
>>
>> "Some users are having problems loading UTF-8 data into 8.1.X.  This
>> is because previous versions allowed invalid UTF-8 byte sequences
>> to be entered into the database, and this release properly accepts
>> only valid UTF-8 sequences. One way to correct a dumpfile is to run
>> the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
>>
>>     
>
> If the above quote were actually true, then Mario wouldn't be having a
> problem. Instead, it's half-true: Invalid byte sequences are rejected in
> some situations and accepted in others. If postgresql consistently
> rejected or consistently accepted invalid byte sequences, that would not
> cause problems with COPY (meaning problems with pg_dump, slony, etc.).
>
>
>   

How can we fix this? Frankly, the statement in the docs warning about 
making sure that escaped sequences are valid in the server encoding is a 
cop-out. We don't accept invalid data elsewhere, and this should be no 
different IMNSHO. I don't see why this should be any different from, 
say, date or numeric data. For years people have sneered at MySQL 
because it accepted dates like Feb 31st, and rightly so. But this seems 
to me to be like our own version of the same problem.

Last year Jeff suggested adding something like:
   pg_verifymbstr(string,strlen(string),0);

to each relevant input routine. Would that be an acceptable solution? If 
not, what would be?
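
To make the proposal concrete, here is a rough, standalone sketch of the kind of check such a call would perform on an input string. It is not the server's actual code: utf8_is_valid() and text_input_ok() are hypothetical names invented for this illustration, and the rules applied are simply the well-formedness rules of RFC 3629 (no stray or missing continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: returns true if the
 * len bytes starting at s form well-formed UTF-8 per RFC 3629.
 */
static bool
utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        n;    /* continuation bytes expected after c */
        unsigned int  min;  /* smallest code point legal at this length */
        unsigned int  cp;   /* code point decoded so far */

        if (c < 0x80)
        {
            i++;            /* plain ASCII byte */
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { n = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; min = 0x10000; cp = c & 0x07; }
        else
            return false;   /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < n)
            return false;   /* sequence truncated at end of input */

        for (size_t k = 1; k <= n; k++)
        {
            unsigned char cc = s[i + k];

            if ((cc & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;   /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate, not legal in UTF-8 */

        i += n + 1;
    }
    return true;
}

/*
 * Hypothetical stand-in for the check an input routine would make on
 * a NUL-terminated string before accepting it.
 */
static bool
text_input_ok(const char *string)
{
    return utf8_is_valid((const unsigned char *) string, strlen(string));
}
```

With a check like this, text_input_ok("M\xC3\xA4rz") passes, while an overlong sequence such as "\xC0\x80" or a lone lead byte such as "\xE4" is rejected. The point is that the same test would apply no matter which path the bytes arrive by.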

cheers

andrew

