Thread: Regexp match with accented character problem

Regexp match with accented character problem

From
Laslo Forro
Date:
Hi there, could someone drop me a hint on the whys at below?

The table: 

test=# select * from texts;
    title     |         a_text          
--------------+-------------------------
 A macskacicó | A blah blah macskacicónak.
The dark tower | Blah blah
(2 rows)

Now, I want to match 'macskacicó' WORD.

It works: 
test=# select * from texts where title ~* E'macskacicó';
    title     |         a_text          
--------------+-------------------------
 A macskacicó | A blah blah macskacicó.
(1 row)

But it would also macth 'macskacicónak' string:

test=# select * from texts where a_text ~* E'macskacicó';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
test=# select * from texts where title ~* E'\\<macskacicó\\>';
test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(neither with one \ , nor with double.)

Now, it seems that all is ok if the string does not end with an accented character:
test=# select * from texts where title ~* E'\\mtower\\M';
     title      |  a_text   
----------------+-----------
 The dark tower | Blah blah
(1 row)

It seems that accented characters are not recognized as \w. (It matches:  select * from texts where title ~* E'\\Wmacskacic\\W'; )
Does it mean that I have to convert each accented character to a hex form and feed it that way? Or is there a more elegant way to redefine the \w class?

Thanks a lot!

I use :
Postgresql 8.4.1 on Gentoo.
Postgresql.conf:
max_connections = 100
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives: 
LANG=hu_HU.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

Re: Regexp match with accented character problem

From
Thom Brown
Date:
On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote:
> Hi there, could someone drop me a hint on the whys at below?
> The table:
> test=# select * from texts;
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicónak.
> The dark tower | Blah blah
> (2 rows)
> Now, I want to match 'macskacicó' WORD.
> It works:
> test=# select * from texts where title ~* E'macskacicó';
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicó.
> (1 row)
> But it would also macth 'macskacicónak' string:
> test=# select * from texts where a_text ~* E'macskacicó';
>     title     |           a_text
> --------------+----------------------------
>  A macskacicó | A blah blah macskacicónak.
> (1 row)
> Now, these do not work:
> test=# select * from texts where title ~* E'\\mmacskacicó\\M';

That works for me.

> test=# select * from texts where title ~* E'\\<macskacicó\\>';

What's that supposed to be doing?

> test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

That shouldn't work because nothing follows that word.  You'd need to
match like this in that case:

select * from texts where title ~* E'\\Wmacskacicó$';

If you add something like a space or full-stop (period) after that
word, it will match.

Accented characters should match against \\w

To prove it, try:

select * from texts where title ~* E'\\Wmacskacic\\w';

Regards

Thom

Re: Regexp match with accented character problem

From
Laslo Forro
Date:
Thanx for your response!

You write: 

>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>
>That works for me.

However, it doesn't work for me (if I understand you right):

test=# select * from texts where title ~* E'\\mmacskacicó\\M';
 title | a_text 
-------+--------
(0 rows)

It does:

test=# select * from texts where title ~* E'\\mmacskacicó$';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)



On Tue, Jun 8, 2010 at 11:45 AM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote:
> Hi there, could someone drop me a hint on the whys at below?
> The table:
> test=# select * from texts;
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicónak.
> The dark tower | Blah blah
> (2 rows)
> Now, I want to match 'macskacicó' WORD.
> It works:
> test=# select * from texts where title ~* E'macskacicó';
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicó.
> (1 row)
> But it would also macth 'macskacicónak' string:
> test=# select * from texts where a_text ~* E'macskacicó';
>     title     |           a_text
> --------------+----------------------------
>  A macskacicó | A blah blah macskacicónak.
> (1 row)
> Now, these do not work:
> test=# select * from texts where title ~* E'\\mmacskacicó\\M';

That works for me.

> test=# select * from texts where title ~* E'\\<macskacicó\\>';

What's that supposed to be doing?

> test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

That shouldn't work because nothing follows that word.  You'd need to
match like this in that case:

select * from texts where title ~* E'\\Wmacskacicó$';

If you add something like a space or full-stop (period) after that
word, it will match.

Accented characters should match against \\w

To prove it, try:

select * from texts where title ~* E'\\Wmacskacic\\w';

Regards

Thom

Re: Regexp match with accented character problem

From
Thom Brown
Date:
On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote:
> Thanx for your response!
> You write:
>>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>>
>>That works for me.
> However, it doesn't work for me (if I understand you right):

I suspect your database encoding is SQL_ASCII instead of UTF8.  Can
you tell us your database encoding?

Thanks

Thom

Re: Regexp match with accented character problem

From
Laslo Forro
Date:
test=# \l
                                  List of databases
   Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
                                                             : postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
                                                             : postgres=CTc/postgres
 test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
(5 rows)


On Tue, Jun 8, 2010 at 12:24 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote:
> Thanx for your response!
> You write:
>>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>>
>>That works for me.
> However, it doesn't work for me (if I understand you right):

I suspect your database encoding is SQL_ASCII instead of UTF8.  Can
you tell us your database encoding?

Thanks

Thom

Re: Regexp match with accented character problem

From
Thom Brown
Date:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From
Laslo Forro
Date:
Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From
Laslo Forro
Date:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!


On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom


Re: Regexp match with accented character problem

From
Laslo Forro
Date:
Perhaps helps:

'ó' matches 
\M
\M\M\M
\.*

but not \M\M\M\M or \M\M\M\W

These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:

E'\\mmacskacicó\\Y'     (!!!)
E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?


On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!


On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom



Re: Regexp match with accented character problem

From
Laslo Forro
Date:
more:
having the string 'macskacicóca' it matches:
\\mmacskacic\\Wca'
so it matches:
\\mmacskacic\\W'
indicating that 'ó' is a non alphanumeric character, but strange enough 

but it doesn't:
\\mmacskacic\\W\\M
unless \M is with * quantifier.

Any idea or hint is highly appreciated.

Thanx in advance, 
Laslo

On Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:
Perhaps helps:

'ó' matches 
\M
\M\M\M
\.*

but not \M\M\M\M or \M\M\M\W

These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:

E'\\mmacskacicó\\Y'     (!!!)
E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?


On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!


On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom




Re: Regexp match with accented character problem

From
Laslo Forro
Date:
And one more thing:

this all is strange.

test=# select * from text where a_text ~* E'\\mmacskacic\\W\\s';
    title     |          a_text          
--------------+--------------------------
 A macskacicó | A bah macskacicóca
 A macskacicó | A bah macskacicó és a ló
(2 rows)

Strange, because I expect the 'macskacic+NON_WORD+WSPACE' pattern.
The corresponding perl regexp does not match:
macskacic\W\s

I am really lost.

And stop spamming.

On Tue, Jun 8, 2010 at 2:28 PM, Laslo Forro <getforum@gmail.com> wrote:
more:
having the string 'macskacicóca' it matches:
\\mmacskacic\\Wca'
so it matches:
\\mmacskacic\\W'
indicating that 'ó' is a non alphanumeric character, but strange enough 

but it doesn't:
\\mmacskacic\\W\\M
unless \M is with * quantifier.

Any idea or hint is highly appreciated.

Thanx in advance, 
Laslo

On Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:
Perhaps helps:

'ó' matches 
\M
\M\M\M
\.*

but not \M\M\M\M or \M\M\M\W

These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:

E'\\mmacskacicó\\Y'     (!!!)
E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?


On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
    title     |           a_text           
--------------+----------------------------
 A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!


On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom





Re: Regexp match with accented character problem

From
Thom Brown
Date:
On 8 June 2010 14:07, Laslo Forro <getforum@gmail.com> wrote:
> OK. I have upgraded to 8.4.4.-r1 (Gentoo) and everything is fine.
> Thanx!
>

Ah, must have been a bug then.  Glad to hear you solved it. :)

Thom

Re: Regexp match with accented character problem

From
Tom Lane
Date:
Laslo Forro <getforum@gmail.com> writes:
> It seems that accented characters are not recognized as \w.

Just FYI, that's a known problem with the regex operators if you're
using UTF8 database encoding (or more generally, any multibyte encoding,
but UTF8 is usually the one people complain about).  I don't believe
updating to 8.4 would have fixed it for you --- maybe the reason the
problem went away is you switched to a different encoding, such as one
of the LATINn family?

There is a tentative fix in 9.0, FWIW.

            regards, tom lane

Re: Regexp match with accented character problem

From
Laslo Forro
Date:
That is the case.
I have to move to use ISO encoding because I can not use 9.0 .
Thank you!


On Tue, Jun 8, 2010 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Laslo Forro <getforum@gmail.com> writes:
> It seems that accented characters are not recognized as \w.

Just FYI, that's a known problem with the regex operators if you're
using UTF8 database encoding (or more generally, any multibyte encoding,
but UTF8 is usually the one people complain about).  I don't believe
updating to 8.4 would have fixed it for you --- maybe the reason the
problem went away is you switched to a different encoding, such as one
of the LATINn family?

There is a tentative fix in 9.0, FWIW.

                       regards, tom lane