Re: Bug in UTF8-Validation Code? - Mailing list pgsql-hackers
From | Mark Dilger |
---|---|
Subject | Re: Bug in UTF8-Validation Code? |
Date | |
Msg-id | 46116C9C.80300@markdilger.com Whole thread Raw |
In response to | Re: Bug in UTF8-Validation Code? (Mark Dilger <pgsql@markdilger.com>) |
Responses |
Re: Bug in UTF8-Validation Code?
|
List | pgsql-hackers |
Mark Dilger wrote: > Andrew - Supernews wrote: >> On 2007-04-01, Mark Dilger <pgsql@markdilger.com> wrote: >>> Do any of the string functions (see >>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) >>> run the risk of generating invalid utf8 encoded strings? Do I need >>> to add checks? >>> Are there known bugs with these functions in this regard? >> >> The chr() function returns an octet, rather than a character; this is >> clearly >> wrong and needs fixing. >> > > Ok, I've altered the chr() function. I am including a transcript from > psql below. There are several design concerns: > > 1) In the current implementation, chr(0) returns a 5-byte text object > (4-bytes of overhead plus one byte of data) containing a null. In the > new implementation, this returns an error. I don't know, but it is > possible that people currently use things like "SELECT chr(0) || chr(0) > || ..." to build up strings of nulls. > > 2) Under utf8, chr(X) fails for X = 128..255. This may also break > current users expectations. > > 3) The implicit modulus operation that was being performed by chr() is > now gone, which might break some users. > > 4) You can't represent the high end of the astral plain with type > INTEGER, unless you pass in a negative value, which is somewhat > unintuitive. Since chr() expects an integer (and not a bigint) the user > needs handle the sign bit correctly. > > mark > > --------------------- > > > > > Welcome to psql 8.3devel, the PostgreSQL interactive terminal. > > Type: \copyright for distribution terms > \h for help with SQL commands > \? for help with psql commands > \g or terminate with semicolon to execute query > \q to quit > > pgsql=# select chr(0); > ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8" > pgsql=# select chr(65); > chr > ----- > A > (1 row) > > pgsql=# select chr(128); > ERROR: character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8" > pgsql=# select chr(53398); > chr > ----- > Ж > (1 row) > > pgsql=# select chr(14989485); > chr > ----- > 中 > (1 row) > > pgsql=# select chr(4036005254); > ERROR: function chr(bigint) does not exist > LINE 1: select chr(4036005254); > ^ > HINT: No function matches the given name and argument types. You might > need to add explicit type casts. Here's the code for the new chr() function: Datum chr(PG_FUNCTION_ARGS) { int32 cvalue = PG_GETARG_INT32(0); text *result; if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) { int encoding, len, byteoff; uint32 buf[2]; const char *bufptr; encoding = GetDatabaseEncoding(); buf[0] = htonl(cvalue); buf[1] = 0; bufptr = (const char *)&buf; for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr; ++byteoff, ++bufptr); len = pg_encoding_mblen(encoding,bufptr); if (byteoff + len != sizeof(uint32) || !pg_verify_mbstr(encoding, bufptr, len, true /* noError */)) report_untranslatable_char(PG_SQL_ASCII, encoding, bufptr, sizeof(int32)); result = (text *) palloc(VARHDRSZ + len); SET_VARSIZE(result, VARHDRSZ + len); memcpy(VARDATA(result),bufptr,len); } else { result = (text *) palloc(VARHDRSZ + 1); SET_VARSIZE(result,VARHDRSZ + 1); *VARDATA(result) = (char) cvalue; } PG_RETURN_TEXT_P(result); }
pgsql-hackers by date: