
Regexp match with accented character problem - Mailing list pgsql-novice

From	Laslo Forro
Subject	Regexp match with accented character problem
Date	June 8, 2010 08:49:03
Msg-id	AANLkTinhs32woCPg8neaTb3jEqde7BRHx0P0_rxgo_0_@mail.gmail.com
Responses	Re: Regexp match with accented character problem (Thom Brown <thombrown@gmail.com>)
	Re: Regexp match with accented character problem (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-novice


Hi there, could someone give me a hint as to why the following happens?

The table:

test=# select * from texts;
     title      |           a_text
----------------+----------------------------
 A macskacicó   | A blah blah macskacicónak.
 The dark tower | Blah blah
(2 rows)
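
For anyone who wants to reproduce this, a setup along the following lines should do. The column types are an assumption, since the real table definition is not shown; only the table name, the column names, and the row values come from the output above.

```sql
-- Assumed schema: both columns as text (the actual definition is not shown).
CREATE TABLE texts (
    title  text,
    a_text text
);

-- Row values copied from the query output above.
INSERT INTO texts (title, a_text)
VALUES ('A macskacicó', 'A blah blah macskacicónak.');

INSERT INTO texts (title, a_text)
VALUES ('The dark tower', 'Blah blah');
```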

Now, I want to match 'macskacicó' as a whole word.

A plain match works:

test=# select * from texts where title ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

But it also matches the string 'macskacicónak':

test=# select * from texts where a_text ~* E'macskacicó';
    title     |           a_text
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';

test=# select * from texts where title ~* E'\\<macskacicó\\>';

test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(Neither with a single backslash nor with a double one.)

Now, it seems that all is ok if the string does not end with an accented character:

test=# select * from texts where title ~* E'\\mtower\\M';
     title      |  a_text
----------------+-----------
 The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w. For example, this query matches, which means the trailing 'ó' is being treated as \W:

select * from texts where title ~* E'\\Wmacskacic\\W';
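
One way to check this suspicion directly, without involving the table, is to test single characters against the classes. This is only a sketch; what it returns will depend on the server version, the database encoding, and the locale, so no expected output is given. The column aliases are made up for illustration.

```sql
-- Does the regex engine classify these characters as word characters?
-- Aliases are illustrative names only.
SELECT 'o' ~ E'\\w'        AS plain_o_is_word,
       'ó' ~ E'\\w'        AS accented_o_is_word,
       'ó' ~ '[[:alpha:]]' AS accented_o_is_alpha;
```

If the second and third columns come back false while the first is true, the engine is not treating 'ó' as a letter, which would be consistent with the \m, \M and \W behaviour described above.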

Does it mean that I have to convert each accented character to a hex form and feed it that way? Or is there a more elegant way to redefine the \w class?
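
Short of redefining \w, one possible workaround is to spell out the word boundaries by hand, using a bracket expression that lists the accented letters explicitly. The following is a sketch under the assumption that the Hungarian accented letters are the only non-ASCII letters that matter; the list would need extending for other languages.

```sql
-- Emulate word boundaries without relying on \w, \m or \M.
-- A "boundary" here is the start/end of the string, or any character that is
-- neither ASCII alphanumeric, underscore, nor one of the listed accented letters.
-- The accented-letter list is an assumption (Hungarian only).
SELECT *
FROM texts
WHERE a_text ~* E'(^|[^[:alnum:]_áéíóöőúüűÁÉÍÓÖŐÚÜŰ])macskacicó($|[^[:alnum:]_áéíóöőúüűÁÉÍÓÖŐÚÜŰ])';
```

The intent is that 'macskacicónak' is rejected, because the 'n' that follows the word is alphanumeric, while 'macskacicó' followed by a space, a full stop, or the end of the string is accepted.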

Thanks a lot!

I use PostgreSQL 8.4.1 on Gentoo.

postgresql.conf:

max_connections = 100
shared_buffers = 1000          # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8'    # locale for system error message strings
lc_monetary = 'en_US.UTF-8'    # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'     # locale for number formatting
lc_time = 'en_US.UTF-8'        # locale for time formatting

'locale' gives:

LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
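
The shell environment above is not necessarily what the server uses for character classification, so it may also be worth checking the settings of the database itself. A sketch of how that could be done; the catalog columns used here exist in 8.4 and later.

```sql
-- Encoding and character-classification locale as seen by the current session.
SHOW server_encoding;
SHOW lc_ctype;

-- The same information from the catalog for the current database
-- (datcollate and datctype are available from 8.4 on).
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();
```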
