Regexp match with accented character problem - Mailing list pgsql-novice

From Laslo Forro
Subject Regexp match with accented character problem
Msg-id AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com
Responses Re: Regexp match with accented character problem  (Thom Brown <thombrown@gmail.com>)
Re: Regexp match with accented character problem  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-novice
Hi there, could someone give me a hint as to why the following happens?

The table: 

test=# select * from texts;
     title      |           a_text           
----------------+----------------------------
 A macskacicó   | A blah blah macskacicónak.
 The dark tower | Blah blah
(2 rows)

Now, I want to match 'macskacicó' as a whole word.

This works:
test=# select * from texts where title ~* E'macskacicó';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

But the plain pattern also matches the string 'macskacicónak':

test=# select * from texts where a_text ~* E'macskacicó';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(They match neither with a single \ nor with a double \\.)

Now, it seems that everything is fine if the string does not end with an accented character:
test=# select * from texts where title ~* E'\\mtower\\M';
     title      |  a_text   
----------------+-----------
 The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w. (This one does match, which suggests that 'ó' is being treated as a non-word character:  select * from texts where title ~* E'\\Wmacskacic\\W'; )
Does it mean that I have to convert each accented character to a hex form and feed it that way? Or is there a more elegant way to redefine the \w class?
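
The behaviour described above can probably be reproduced without the table, using only string literals copied from the examples. This is a minimal sketch for psql. The column aliases are made up for illustration, and no output is shown because the results depend on the server's locale and encoding.

```sql
-- Whole-word match on a word that ends in an accented character
-- (the case reported above as not matching).
SELECT 'A macskacicó' ~* E'\\mmacskacicó\\M' AS accented_word;

-- Same pattern shape on a plain ASCII word
-- (the case reported above as matching).
SELECT 'The dark tower' ~* E'\\mtower\\M' AS ascii_word;

-- Direct check: is 'ó' classified as a word character at all?
SELECT 'ó' ~ E'^\\w$' AS o_acute_is_word_char;
```

If the third query returns false while the second returns true, that would be consistent with the guess that \w and the word-boundary escapes do not treat 'ó' as a letter in this setup.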

Thanks a lot!

I use:
PostgreSQL 8.4.1 on Gentoo.
postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives: 
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
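
The postgresql.conf excerpt and the shell's locale output above do not show which character-classification locale and encoding the database itself uses. These can be inspected from psql with the standard SHOW command. This is a sketch only: that these settings are relevant to the regexp behaviour is an assumption, not something established in this message.

```sql
-- Locale used for character classification (letters, digits, ...) in this database.
SHOW lc_ctype;

-- Locale used for sorting, shown for completeness.
SHOW lc_collate;

-- Encoding in which the database stores text.
SHOW server_encoding;
```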
