Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Bug in UTF8-Validation Code?
Date
Msg-id 46116C9C.80300@markdilger.com
Whole thread Raw
In response to Re: Bug in UTF8-Validation Code?  (Mark Dilger <pgsql@markdilger.com>)
Responses Re: Bug in UTF8-Validation Code?
List pgsql-hackers
Mark Dilger wrote:
> Andrew - Supernews wrote:
>> On 2007-04-01, Mark Dilger <pgsql@markdilger.com> wrote:
>>> Do any of the string functions (see 
>>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) 
>>> run the risk of generating invalid utf8 encoded strings?  Do I need 
>>> to add checks?
>>> Are there known bugs with these functions in this regard?
>>
>> The chr() function returns an octet, rather than a character; this is 
>> clearly
>> wrong and needs fixing.
>>
> 
> Ok, I've altered the chr() function.  I am including a transcript from 
> psql below.  There are several design concerns:
> 
> 1) In the current implementation, chr(0) returns a 5-byte text object 
> (4-bytes of overhead plus one byte of data) containing a null.  In the 
> new implementation, this returns an error.  I don't know, but it is 
> possible that people currently use things like "SELECT chr(0) || chr(0) 
> || ..." to build up strings of nulls.
> 
> 2) Under utf8, chr(X) fails for X = 128..255.  This may also break 
> current users expectations.
> 
> 3) The implicit modulus operation that was being performed by chr() is 
> now gone, which might break some users.
> 
> 4) You can't represent the high end of the astral plain with type 
> INTEGER, unless you pass in a negative value, which is somewhat 
> unintuitive.  Since chr() expects an integer (and not a bigint) the user 
> needs handle the sign bit correctly.
> 
> mark
> 
> ---------------------
> 
> 
> 
> 
> Welcome to psql 8.3devel, the PostgreSQL interactive terminal.
> 
> Type:  \copyright for distribution terms
>        \h for help with SQL commands
>        \? for help with psql commands
>        \g or terminate with semicolon to execute query
>        \q to quit
> 
> pgsql=# select chr(0);
> ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(65);
>  chr
> -----
>  A
> (1 row)
> 
> pgsql=# select chr(128);
> ERROR:  character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(53398);
>  chr
> -----
>  Ж
> (1 row)
> 
> pgsql=# select chr(14989485);
>  chr
> -----
>  中
> (1 row)
> 
> pgsql=# select chr(4036005254);
> ERROR:  function chr(bigint) does not exist
> LINE 1: select chr(4036005254);
>                ^
> HINT:  No function matches the given name and argument types. You might 
> need to add explicit type casts.


Here's the code for the new chr() function:

Datum
chr(PG_FUNCTION_ARGS)
{    int32       cvalue = PG_GETARG_INT32(0);    text       *result;
    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())    {        int encoding,            len,
byteoff;       uint32 buf[2];        const char *bufptr;
 
        encoding = GetDatabaseEncoding();        buf[0] = htonl(cvalue);        buf[1] = 0;        bufptr = (const char
*)&buf;       for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr; ++byteoff, 
 
++bufptr);        len = pg_encoding_mblen(encoding,bufptr);        if (byteoff + len != sizeof(uint32) ||
!pg_verify_mbstr(encoding,
 
bufptr, len, true /* noError */))            report_untranslatable_char(PG_SQL_ASCII, encoding, bufptr, 
sizeof(int32));        result = (text *) palloc(VARHDRSZ + len);        SET_VARSIZE(result, VARHDRSZ + len);
memcpy(VARDATA(result),bufptr,len);   }    else    {        result = (text *) palloc(VARHDRSZ + 1);
SET_VARSIZE(result,VARHDRSZ + 1);        *VARDATA(result) = (char) cvalue;    }
 
    PG_RETURN_TEXT_P(result);
}


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Questions about pid file creation code
Next
From: Josh Berkus
Date:
Subject: Mentor for ASync I/O for SoC