Thread: Concerns about Unicode-aware string handling

Concerns about Unicode-aware string handling

From: Vincas Dargis

Hello,

We have problems (currently on 8.4, but also on the latest 9.1.3) in
our application with Unicode word characters in Lithuanian
('ąčęėįšųūž'), Russian, and potentially other languages.

For example, regexp_replace('acząčž', E'\\W', '', 'g') removes 'ąčž' as well, leaving only 'acz'.

lower() and the ~* comparison work only with the locale that is set (no
internationalization).

Could we expect Unicode support in the near future? Or should we resort
to quick hacks, reimplementing regexp_replace(), lower(), upper() and
other string SQL functions using, for example, the Qt libraries? Or are
there perhaps some simpler workarounds?
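
A minimal reproduction in psql (the database name and prompt are just
illustrative), assuming a UTF8 database and UTF8 client encoding; we
would expect the string to come back unchanged, since every character
in it is a letter:

test=> SELECT regexp_replace('acząčž', E'\\W', '', 'g');
 regexp_replace
----------------
 acz
(1 row)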

Thank you,
Vincas.

Re: Concerns about Unicode-aware string handling

From: John R Pierce

On 05/21/12 2:09 AM, Vincas Dargis wrote:
> We have problems (currently on 8.4, but also on the latest 9.1.3) in
> our application with Unicode word characters in Lithuanian
> ('ąčęėįšųūž'), Russian, and potentially other languages.
>
> For example, regexp_replace('acząčž', E'\\W', '', 'g') removes 'ąčž'
> as well, leaving only 'acz'.
>
> lower() and the ~* comparison work only with the locale that is set
> (no internationalization).
>
> Could we expect Unicode support in the near future? Or should we
> resort to quick hacks, reimplementing regexp_replace(), lower(),
> upper() and other string SQL functions using, for example, the Qt
> libraries? Or are there perhaps some simpler workarounds?

Is your database encoding UTF8? Is the language or environment you're
using to generate those strings, such as 'acząčž', also UTF8?

PostgreSQL supports UTF-8 Unicode just fine. It does not directly
support the bastardized UTF-16 'unicode' implemented by Windows NT and
derivatives (2000, XP, 2003, Vista, 2008, 7), but on those platforms it
generally behaves fairly sanely as long as you realize that UTF-8 is
its native tongue.

Of course, the database has to be created as a UTF8 database. It's
possible to initialize the server cluster with a "C"/"POSIX" locale or
the SQL_ASCII encoding, which says bytes-are-bytes and the encoding is
unknown, or in various 8-bit encodings like LATIN1.
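
For example, to get an explicitly UTF8 cluster with a Lithuanian
locale (the data directory, database name and locale string are
placeholders; locale names vary by platform):

initdb -D /path/to/data --encoding=UTF8 --locale=lt_LT.UTF-8

or, for a single database on an existing cluster:

CREATE DATABASE mydb
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'lt_LT.UTF-8'
  LC_CTYPE 'lt_LT.UTF-8';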




--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Concerns about Unicode-aware string handling

From: Andrew Sullivan

On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote:
> support the bastardized UTF-16 'unicode' implemented by Windows NT

To be fair to Microsoft, while the BOM might be an irritant, they do
use a perfectly legitimate encoding of Unicode.  There is no Unicode
requirement that code points be stored as UTF-8, and there is a strong
argument to be made that, for some languages, UTF-8 is extremely
inefficient and therefore the least preferred encoding.  (Microsoft's
dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of
course, and appears to be adjusted in funny ways in Win 7.)

Because all wire protocols from the IETF use UTF-8 for Unicode
encoding, your best bet is still UTF-8 for maximal portability, so
your point about needing to make the database encoding and client
locale UTF-8 is correct.
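
A quick sanity check of both ends from psql (session shown purely as
an illustration):

lt=> SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

lt=> SHOW client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

If the client side is off, SET client_encoding = 'UTF8' (or the
PGCLIENTENCODING environment variable) overrides it for the session.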

Best,

A


--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Concerns about Unicode-aware string handling

From: Vincas Dargis

Sorry, I have to write a "manual" reply since I've messed up my mailing
list settings (got "Partial Digest"...).

John R Pierce wrote:
> Is your database encoding UTF8? Is the language or environment you're
> using to generate those strings, such as 'acząčž', also UTF8?


Database created using:
initdb -D ../data -E utf-8 -U postgres

I have to take back my statements about lower() and ~*: they fail only
in (strange) clusters created with "--no-locale".

But the regexp_replace issue is still there. Does the regexp "\w" match
only ASCII word characters?

Re: Concerns about Unicode-aware string handling

From: Albe Laurenz

Vincas Dargis wrote:
> We have problems (currently on 8.4, but also on the latest 9.1.3) in
> our application with Unicode word characters in Lithuanian
> ('ąčęėįšųūž'), Russian, and potentially other languages.
>
> For example, regexp_replace('acząčž', E'\\W', '', 'g') removes 'ąčž'
> as well, leaving only 'acz'.
>
> lower() and the ~* comparison work only with the locale that is set
> (no internationalization).
>
> Could we expect Unicode support in the near future? Or should we
> resort to quick hacks, reimplementing regexp_replace(), lower(),
> upper() and other string SQL functions using, for example, the Qt
> libraries? Or are there perhaps some simpler workarounds?

I tried it with 9.1.3 on Linux:

upper() and lower() work fine, no matter what the
database encoding is:

test=> SELECT upper('acząčž');
 upper
--------
 ACZĄČŽ
(1 row)

And this seems OK with LATIN7:

lt2=> SHOW server_encoding;
 server_encoding
-----------------
 LATIN7
(1 row)

lt2=> SHOW lc_ctype;
 lc_ctype
----------
 lt_LT
(1 row)

lt2=> SHOW lc_collate;
 lc_collate
------------
 lt_LT
(1 row)

lt2=> SELECT 'ą' ~* '\w';
 ?column?
----------
 t
(1 row)

But it looks wrong with UTF8:

lt=> SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

lt=> SHOW lc_ctype;
  lc_ctype
------------
 lt_LT.utf8
(1 row)

lt=> SHOW lc_collate;
 lc_collate
------------
 lt_LT.utf8
(1 row)

lt=> SELECT 'ą' ~* '\w';
 ?column?
----------
 f
(1 row)


Is that what you are complaining about?

Yours,
Laurenz Albe

Re: Concerns about Unicode-aware string handling

From: Vincas Dargis

I forgot to mention that I'm working on Windows XP SP3.

Yes, we are using UTF8 encoding, and the regexp behaves incorrectly. It
looks like you have reproduced it.

2012/5/21 Albe Laurenz <laurenz.albe@wien.gv.at>:
>
> I tried it with 9.1.3 on Linux:
>
> upper() and lower() work fine, no matter what the
> database encoding is:
>
> test=> SELECT upper('acząčž');
>  upper
> --------
>  ACZĄČŽ
> (1 row)
>
> And this seems OK with LATIN7:
>
> lt2=> SHOW server_encoding;
>  server_encoding
> -----------------
>  LATIN7
> (1 row)
>
> lt2=> SHOW lc_ctype;
>  lc_ctype
> ----------
>  lt_LT
> (1 row)
>
> lt2=> SHOW lc_collate;
>  lc_collate
> ------------
>  lt_LT
> (1 row)
>
> lt2=> SELECT 'ą' ~* '\w';
>  ?column?
> ----------
>  t
> (1 row)
>
> But it looks wrong with UTF8:
>
> lt=> SHOW server_encoding;
>  server_encoding
> -----------------
>  UTF8
> (1 row)
>
> lt=> SHOW lc_ctype;
>  lc_ctype
> ------------
>  lt_LT.utf8
> (1 row)
>
> lt=> SHOW lc_collate;
>  lc_collate
> ------------
>  lt_LT.utf8
> (1 row)
>
> lt=> SELECT 'ą' ~* '\w';
>  ?column?
> ----------
>  f
> (1 row)
>
>
> Is that what you are complaining about?
>
> Yours,
> Laurenz Albe

Re: Concerns about Unicode-aware string handling

From: Tom Lane

Vincas Dargis <vindrg@gmail.com> writes:
> Database created using:
> initdb -D ../data -E utf-8 -U postgres

That looks fairly dangerous, as it will absorb the database's locale
settings (particularly LC_CTYPE, which is what you care about for these
operations) from your shell environment.  If the environment locale is
not for UTF8 encoding then it won't work at all.  Best to specify a
--locale switch as well.  See
http://www.postgresql.org/docs/9.1/static/charset.html
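
For instance, something like this (the exact locale name depends on
the platform; pick one that is valid on your system and that matches
the -E encoding):

initdb -D ../data -E UTF8 --locale=lt_LT.UTF-8 -U postgres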

> But the regexp_replace issue is still there. Does the regexp "\w"
> match only ASCII word characters?

Locale-specific character classes in regexps are not terribly bright
about UTF8, because historically that code has not considered any
character codes above 255 :-(.  So in UTF8 you only got correct behavior
for ASCII and LATIN1 characters.  9.2 will be better though not perfect:
http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=e00f68e49
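
A purely illustrative way to see which behavior a given server
exhibits: in a UTF8 database with a UTF-8 lc_ctype, this returns f on
the releases discussed above, and should return t once the 9.2 fix is
in:

lt=> SELECT 'ą' ~ E'\\w';
 ?column?
----------
 f
(1 row)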

            regards, tom lane

Re: Concerns about Unicode-aware string handling

From: Craig Ringer

On 05/21/2012 06:59 PM, Andrew Sullivan wrote:
> On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote:
>> support the bastardized UTF-16 'unicode' implemented by Windows NT
> To be fair to Microsoft, while the BOM might be an irritant, they do
> use a perfectly legitimate encoding of Unicode.  There is no Unicode
> requirement that code points be stored as UTF-8, and there is a strong
> argument to be made that, for some languages, UTF-8 is extremely
> inefficient and therefore the least preferred encoding.  (Microsoft's
> dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of
> course, and appears to be adjusted in funny ways in Win 7.)

In fact, until it became clear that UCS-2 (now UTF-16) wasn't enough and
we'd need 4 bytes to represent characters, Microsoft's choice of UCS-2
with BOM looked really good. They just didn't realise that UCS-2 would
turn into UTF-16 when UCS-4 came on the scene, so they'd be left holding
a bastardised half-way mess that's usually-but-not-always 2 bytes per
character.

MS's choice allowed programs to work with the safe (at the time)
assumption that each char was 2 bytes, which made a lot of things way
simpler than they are in UTF-8 and was well and truly worth the storage
bloat IMO. Pity Unicode had to grow again and break the assumption.

--
Craig Ringer