Thread: Regexp match with accented character problem
Hi there, could someone drop me a hint on the whys at below?
The table:
test=# select * from texts;
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicónak.
The dark tower | Blah blah
(2 rows)
Now, I want to match 'macskacicó' WORD.
It works:
test=# select * from texts where title ~* E'macskacicó';
title | a_text
--------------+-------------------------
A macskacicó | A blah blah macskacicó.
(1 row)
But it would also macth 'macskacicónak' string:
test=# select * from texts where a_text ~* E'macskacicó';
title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)
Now, these do not work:
test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';
(neither with one \ , nor with double.)
Now, it seems that all is ok if the string does not end with an accented character:
test=# select * from texts where title ~* E'\\mtower\\M';
title | a_text
----------------+-----------
The dark tower | Blah blah
(1 row)
It seems that accented characters are not recognized as \w. (It matches: select * from texts where title ~* E'\\Wmacskacic\\W'; )
Does it mean that I have to convert each accented character to a hex form and feed it that way? Or is there a more elegant way to redefine the \w class?
Thanks a lot!
I use :
Postgresql 8.4.1 on Gentoo.
Postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
'locale' gives:
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote: > Hi there, could someone drop me a hint on the whys at below? > The table: > test=# select * from texts; > title | a_text > --------------+------------------------- > A macskacicó | A blah blah macskacicónak. > The dark tower | Blah blah > (2 rows) > Now, I want to match 'macskacicó' WORD. > It works: > test=# select * from texts where title ~* E'macskacicó'; > title | a_text > --------------+------------------------- > A macskacicó | A blah blah macskacicó. > (1 row) > But it would also macth 'macskacicónak' string: > test=# select * from texts where a_text ~* E'macskacicó'; > title | a_text > --------------+---------------------------- > A macskacicó | A blah blah macskacicónak. > (1 row) > Now, these do not work: > test=# select * from texts where title ~* E'\\mmacskacicó\\M'; That works for me. > test=# select * from texts where title ~* E'\\<macskacicó\\>'; What's that supposed to be doing? > test=# select * from texts where title ~* E'\\Wmacskacicó\\W'; That shouldn't work because nothing follows that word. You'd need to match like this in that case: select * from texts where title ~* E'\\Wmacskacicó$'; If you add something like a space or full-stop (period) after that word, it will match. Accented characters should match against \\w To prove it, try: select * from texts where title ~* E'\\Wmacskacic\\w'; Regards Thom
Thanx for your response!
You write:
>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>
>That works for me.
However, it doesn't work for me (if I understand you right):
test=# select * from texts where title ~* E'\\mmacskacicó\\M';
title | a_text
-------+--------
(0 rows)
It does:
test=# select * from texts where title ~* E'\\mmacskacicó$';
title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)
On Tue, Jun 8, 2010 at 11:45 AM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote:That works for me.
> Hi there, could someone drop me a hint on the whys at below?
> The table:
> test=# select * from texts;
> title | a_text
> --------------+-------------------------
> A macskacicó | A blah blah macskacicónak.
> The dark tower | Blah blah
> (2 rows)
> Now, I want to match 'macskacicó' WORD.
> It works:
> test=# select * from texts where title ~* E'macskacicó';
> title | a_text
> --------------+-------------------------
> A macskacicó | A blah blah macskacicó.
> (1 row)
> But it would also macth 'macskacicónak' string:
> test=# select * from texts where a_text ~* E'macskacicó';
> title | a_text
> --------------+----------------------------
> A macskacicó | A blah blah macskacicónak.
> (1 row)
> Now, these do not work:
> test=# select * from texts where title ~* E'\\mmacskacicó\\M';
> test=# select * from texts where title ~* E'\\<macskacicó\\>';
What's that supposed to be doing?That shouldn't work because nothing follows that word. You'd need to
> test=# select * from texts where title ~* E'\\Wmacskacicó\\W';
match like this in that case:
select * from texts where title ~* E'\\Wmacskacicó$';
If you add something like a space or full-stop (period) after that
word, it will match.
Accented characters should match against \\w
To prove it, try:
select * from texts where title ~* E'\\Wmacskacic\\w';
Regards
Thom
On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote: > Thanx for your response! > You write: >>test=# select * from texts where title ~* E'\\mmacskacicó\\M'; >> >>That works for me. > However, it doesn't work for me (if I understand you right): I suspect your database encoding is SQL_ASCII instead of UTF8. Can you tell us your database encoding? Thanks Thom
test=# \l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
: postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
: postgres=CTc/postgres
test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
(5 rows)
On Tue, Jun 8, 2010 at 12:24 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote:I suspect your database encoding is SQL_ASCII instead of UTF8. Can
> Thanx for your response!
> You write:
>>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>>
>>That works for me.
> However, it doesn't work for me (if I understand you right):
you tell us your database encoding?
Thanks
Thom
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote: > test=# \l > List of databases > Name | Owner | Encoding | Collation | Ctype | Access > privileges > -----------+----------+----------+-------------+-------------+----------------------- > postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | > template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres > : > postgres=CTc/postgres > template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres > : > postgres=CTc/postgres > test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 | > (5 rows) > Okay, I'm not sure what the problem is there then. :S Hopefully someone else can shed some light on it for you. Thom
Thanks a lot, anyway!
On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:Okay, I'm not sure what the problem is there then. :S Hopefully
> test=# \l
> List of databases
> Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>
someone else can shed some light on it for you.
Thom
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:
test=# select * from texts where title ~* E'\\mmacskacic\\M';
title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)
As if the end-of-word is at the last 'c' . ???
If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?
On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:Okay, I'm not sure what the problem is there then. :S Hopefully
> test=# \l
> List of databases
> Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>
someone else can shed some light on it for you.
Thom
Perhaps helps:
\M
\M\M\M
\.*
but not \M\M\M\M or \M\M\M\W
These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'
with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:
E'\\mmacskacicó\\Y' (!!!)
E'\\mmacskacicó$'
The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?
On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \wActually I do not know which class 'ó' is in. It matches:test=# select * from texts where title ~* E'\\mmacskacic\\M';title | a_text--------------+----------------------------A macskacicó | A blah blah macskacicónak.(1 row)As if the end-of-word is at the last 'c' . ???If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:Thanks a lot, anyway!On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:Okay, I'm not sure what the problem is there then. :S Hopefully
> test=# \l
> List of databases
> Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>
someone else can shed some light on it for you.
Thom
more:
having the string 'macskacicóca' it matches:
\\mmacskacic\\Wca'
so it matches:
\\mmacskacic\\W'
indicating that 'ó' is a non alphanumeric character, but strange enough
but it doesn't:
\\mmacskacic\\W\\M
unless \M is with * quantifier.
Any idea or hint is highly appreciated.
Thanx in advance,
Laslo
On Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:
Perhaps helps:'ó' matches\M\M\M\M\.*but not \M\M\M\M or \M\M\M\WThese match:E'\\mmacskacicó\M*'E'\\mmacskacicó\s*'E'\\mmacskacicó\W*'with * quantifier. But not with + quantifier, or w/o any quantifier.Also matches:E'\\mmacskacicó\\Y' (!!!)E'\\mmacskacicó$'The text is typed via psql using urxvt terminal.Perhaps some unicode - wide charater kind of mess?On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:That might be a problem that 'ó' is not recognized as \wActually I do not know which class 'ó' is in. It matches:test=# select * from texts where title ~* E'\\mmacskacic\\M';title | a_text--------------+----------------------------A macskacicó | A blah blah macskacicónak.(1 row)As if the end-of-word is at the last 'c' . ???If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:Thanks a lot, anyway!On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:Okay, I'm not sure what the problem is there then. :S Hopefully
> test=# \l
> List of databases
> Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>
someone else can shed some light on it for you.
Thom
And one more thing:
this all is strange.
test=# select * from text where a_text ~* E'\\mmacskacic\\W\\s';
title | a_text
--------------+--------------------------
A macskacicó | A bah macskacicóca
A macskacicó | A bah macskacicó és a ló
(2 rows)
Strange, because I expect the 'macskacic+NON_WORD+WSPACE' pattern.
The corresponding perl regexp does not match:
macskacic\W\s
I am really lost.
And stop spamming.
On Tue, Jun 8, 2010 at 2:28 PM, Laslo Forro <getforum@gmail.com> wrote:
more:having the string 'macskacicóca' it matches:\\mmacskacic\\Wca'so it matches:\\mmacskacic\\W'indicating that 'ó' is a non alphanumeric character, but strange enoughbut it doesn't:\\mmacskacic\\W\\Munless \M is with * quantifier.Any idea or hint is highly appreciated.Thanx in advance,LasloOn Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:Perhaps helps:'ó' matches\M\M\M\M\.*but not \M\M\M\M or \M\M\M\WThese match:E'\\mmacskacicó\M*'E'\\mmacskacicó\s*'E'\\mmacskacicó\W*'with * quantifier. But not with + quantifier, or w/o any quantifier.Also matches:E'\\mmacskacicó\\Y' (!!!)E'\\mmacskacicó$'The text is typed via psql using urxvt terminal.Perhaps some unicode - wide charater kind of mess?On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:That might be a problem that 'ó' is not recognized as \wActually I do not know which class 'ó' is in. It matches:test=# select * from texts where title ~* E'\\mmacskacic\\M';title | a_text--------------+----------------------------A macskacicó | A blah blah macskacicónak.(1 row)As if the end-of-word is at the last 'c' . ???If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:Thanks a lot, anyway!On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:Okay, I'm not sure what the problem is there then. :S Hopefully
> test=# \l
> List of databases
> Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
> :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>
someone else can shed some light on it for you.
Thom
On 8 June 2010 14:07, Laslo Forro <getforum@gmail.com> wrote: > OK. I have upgraded to 8.4.4.-r1 (Gentoo) and everything is fine. > Thanx! > Ah, must have been a bug then. Glad to hear you solved it. :) Thom
Laslo Forro <getforum@gmail.com> writes: > It seems that accented characters are not recognized as \w. Just FYI, that's a known problem with the regex operators if you're using UTF8 database encoding (or more generally, any multibyte encoding, but UTF8 is usually the one people complain about). I don't believe updating to 8.4 would have fixed it for you --- maybe the reason the problem went away is you switched to a different encoding, such as one of the LATINn family? There is a tentative fix in 9.0, FWIW. regards, tom lane
That is the case.
I have to move to use ISO encoding because I can not use 9.0 .
Thank you!
On Tue, Jun 8, 2010 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Laslo Forro <getforum@gmail.com> writes:Just FYI, that's a known problem with the regex operators if you're
> It seems that accented characters are not recognized as \w.
using UTF8 database encoding (or more generally, any multibyte encoding,
but UTF8 is usually the one people complain about). I don't believe
updating to 8.4 would have fixed it for you --- maybe the reason the
problem went away is you switched to a different encoding, such as one
of the LATINn family?
There is a tentative fix in 9.0, FWIW.
regards, tom lane