Home > mailing lists

Thread: Regexp match with accented character problem

Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 05:49:03

Hi there, could someone drop me a hint on the whys at below?

The table:

test=# select * from texts;

title | a_text

--------------+-------------------------

A macskacicó | A blah blah macskacicónak.

The dark tower | Blah blah

(2 rows)

Now, I want to match 'macskacicó' WORD.

It works:

test=# select * from texts where title ~* E'macskacicó';

title | a_text

--------------+-------------------------

A macskacicó | A blah blah macskacicó.

(1 row)

But it would also macth 'macskacicónak' string:

test=# select * from texts where a_text ~* E'macskacicó';

title | a_text

--------------+----------------------------

A macskacicó | A blah blah macskacicónak.

(1 row)

Now, these do not work:

test=# select * from texts where title ~* E'\\mmacskacicó\\M';

test=# select * from texts where title ~* E'\\<macskacicó\\>';

test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

(neither with one \ , nor with double.)

Now, it seems that all is ok if the string does not end with an accented character:

test=# select * from texts where title ~* E'\\mtower\\M';

title | a_text

----------------+-----------

The dark tower | Blah blah

(1 row)

It seems that accented characters are not recognized as \w. (It matches: select * from texts where title ~* E'\\Wmacskacic\\W'; )

Does it mean that I have to convert each accented character to a hex form and feed it that way? Or is there a more elegant way to redefine the \w class?

Thanks a lot!

I use :

Postgresql 8.4.1 on Gentoo.

Postgresql.conf:

max_connections = 100

shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each

lc_messages = 'en_US.UTF-8' # locale for system error message strings

lc_monetary = 'en_US.UTF-8' # locale for monetary formatting

lc_numeric = 'en_US.UTF-8' # locale for number formatting

lc_time = 'en_US.UTF-8' # locale for time formatting

'locale' gives:

LANG=hu_HU.UTF-8

LC_CTYPE="en_US.UTF-8"

LC_NUMERIC="en_US.UTF-8"

LC_TIME="en_US.UTF-8"

LC_COLLATE="en_US.UTF-8"

LC_MONETARY="en_US.UTF-8"

LC_MESSAGES="en_US.UTF-8"

LC_PAPER="en_US.UTF-8"

LC_NAME="en_US.UTF-8"

LC_ADDRESS="en_US.UTF-8"

LC_TELEPHONE="en_US.UTF-8"

LC_MEASUREMENT="en_US.UTF-8"

LC_IDENTIFICATION="en_US.UTF-8"

LC_ALL=en_US.UTF-8

Re: Regexp match with accented character problem

From

Thom Brown

Date:

08 June 2010, 06:51:21

On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote:
> Hi there, could someone drop me a hint on the whys at below?
> The table:
> test=# select * from texts;
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicónak.
> The dark tower | Blah blah
> (2 rows)
> Now, I want to match 'macskacicó' WORD.
> It works:
> test=# select * from texts where title ~* E'macskacicó';
>     title     |         a_text
> --------------+-------------------------
>  A macskacicó | A blah blah macskacicó.
> (1 row)
> But it would also macth 'macskacicónak' string:
> test=# select * from texts where a_text ~* E'macskacicó';
>     title     |           a_text
> --------------+----------------------------
>  A macskacicó | A blah blah macskacicónak.
> (1 row)
> Now, these do not work:
> test=# select * from texts where title ~* E'\\mmacskacicó\\M';

That works for me.

> test=# select * from texts where title ~* E'\\<macskacicó\\>';

What's that supposed to be doing?

> test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

That shouldn't work because nothing follows that word.  You'd need to
match like this in that case:

select * from texts where title ~* E'\\Wmacskacicó$';

If you add something like a space or full-stop (period) after that
word, it will match.

Accented characters should match against \\w

To prove it, try:

select * from texts where title ~* E'\\Wmacskacic\\w';

Regards

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 06:57:46

Thanx for your response!

You write:

>test=# select * from texts where title ~* E'\\mmacskacicó\\M';

>That works for me.

However, it doesn't work for me (if I understand you right):

test=# select * from texts where title ~* E'\\mmacskacicó\\M';

title | a_text

-------+--------

(0 rows)

It does:

test=# select * from texts where title ~* E'\\mmacskacicó$';

title | a_text

--------------+----------------------------

A macskacicó | A blah blah macskacicónak.

(1 row)

On Tue, Jun 8, 2010 at 11:45 AM, Thom Brown <thombrown@gmail.com> wrote:

On 8 June 2010 09:48, Laslo Forro <getforum@gmail.com> wrote:
> Hi there, could someone drop me a hint on the whys at below?
> The table:
> test=# select * from texts;
>    title | a_text
> --------------+-------------------------
> A macskacicó | A blah blah macskacicónak.
> The dark tower | Blah blah
> (2 rows)
> Now, I want to match 'macskacicó' WORD.
> It works:
> test=# select * from texts where title ~* E'macskacicó';
>    title | a_text
> --------------+-------------------------
> A macskacicó | A blah blah macskacicó.
> (1 row)
> But it would also macth 'macskacicónak' string:
> test=# select * from texts where a_text ~* E'macskacicó';
>    title | a_text
> --------------+----------------------------
> A macskacicó | A blah blah macskacicónak.
> (1 row)
> Now, these do not work:
> test=# select * from texts where title ~* E'\\mmacskacicó\\M';

That works for me.

> test=# select * from texts where title ~* E'\\<macskacicó\\>';

What's that supposed to be doing?

> test=# select * from texts where title ~* E'\\Wmacskacicó\\W';

That shouldn't work because nothing follows that word. You'd need to
match like this in that case:

select * from texts where title ~* E'\\Wmacskacicó$';

If you add something like a space or full-stop (period) after that
word, it will match.

Accented characters should match against \\w

To prove it, try:

select * from texts where title ~* E'\\Wmacskacic\\w';

Regards

Thom

Re: Regexp match with accented character problem

From

Thom Brown

Date:

08 June 2010, 07:25:27

On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote:
> Thanx for your response!
> You write:
>>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>>
>>That works for me.
> However, it doesn't work for me (if I understand you right):

I suspect your database encoding is SQL_ASCII instead of UTF8.  Can
you tell us your database encoding?

Thanks

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 07:54:51

test=# \l

List of databases

-----------+----------+----------+-------------+-------------+-----------------------

: postgres=CTc/postgres

: postgres=CTc/postgres

(5 rows)

On Tue, Jun 8, 2010 at 12:24 PM, Thom Brown <thombrown@gmail.com> wrote:

On 8 June 2010 10:57, Laslo Forro <getforum@gmail.com> wrote:
> Thanx for your response!
> You write:
>>test=# select * from texts where title ~* E'\\mmacskacicó\\M';
>>
>>That works for me.
> However, it doesn't work for me (if I understand you right):

I suspect your database encoding is SQL_ASCII instead of UTF8. Can
you tell us your database encoding?

Thanks

Thom

Re: Regexp match with accented character problem

From

Thom Brown

Date:

08 June 2010, 07:56:41

On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>                                   List of databases
>    Name    |  Owner   | Encoding |  Collation  |    Ctype    |   Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>                                                              :
> postgres=CTc/postgres
>  test      | salmonix | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S  Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 08:17:42

Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:

On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>    List of databases
>    Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 08:26:34

That might be a problem that 'ó' is not recognized as \w

Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';

title | a_text

--------------+----------------------------

A macskacicó | A blah blah macskacicónak.

(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:

Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>    List of databases
>    Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 08:59:37

Perhaps helps:

'ó' matches

\M\M\M

\.*

but not \M\M\M\M or \M\M\M\W

These match:

E'\\mmacskacicó\M*'

E'\\mmacskacicó\s*'

E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.

Also matches:

E'\\mmacskacicó\\Y' (!!!)

E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.

Perhaps some unicode - wide charater kind of mess?

On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:

That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
   title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>    List of databases
>    Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 09:29:10

having the string 'macskacicóca' it matches:

\\mmacskacic\\Wca'

so it matches:

\\mmacskacic\\W'

indicating that 'ó' is a non alphanumeric character, but strange enough

but it doesn't:

\\mmacskacic\\W\\M

unless \M is with * quantifier.

Any idea or hint is highly appreciated.

Thanx in advance,

Laslo

On Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:

Perhaps helps:

'ó' matches
\M
\M\M\M
\.*

but not \M\M\M\M or \M\M\M\W

These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:

E'\\mmacskacicó\\Y' (!!!)
E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?

On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
   title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>    List of databases
>    Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

08 June 2010, 09:48:41

And one more thing:

this all is strange.

test=# select * from text where a_text ~* E'\\mmacskacic\\W\\s';

title | a_text

--------------+--------------------------

A macskacicó | A bah macskacicóca

A macskacicó | A bah macskacicó és a ló

(2 rows)

Strange, because I expect the 'macskacic+NON_WORD+WSPACE' pattern.

The corresponding perl regexp does not match:

macskacic\W\s

I am really lost.

And stop spamming.

On Tue, Jun 8, 2010 at 2:28 PM, Laslo Forro <getforum@gmail.com> wrote:

more:
having the string 'macskacicóca' it matches:
\\mmacskacic\\Wca'
so it matches:
\\mmacskacic\\W'
indicating that 'ó' is a non alphanumeric character, but strange enough

but it doesn't:
\\mmacskacic\\W\\M
unless \M is with * quantifier.

Any idea or hint is highly appreciated.

Thanx in advance,
Laslo

On Tue, Jun 8, 2010 at 1:59 PM, Laslo Forro <getforum@gmail.com> wrote:
Perhaps helps:

'ó' matches
\M
\M\M\M
\.*

but not \M\M\M\M or \M\M\M\W

These match:
E'\\mmacskacicó\M*'
E'\\mmacskacicó\s*'
E'\\mmacskacicó\W*'

with * quantifier. But not with + quantifier, or w/o any quantifier.
Also matches:

E'\\mmacskacicó\\Y' (!!!)
E'\\mmacskacicó$'

The text is typed via psql using urxvt terminal.
Perhaps some unicode - wide charater kind of mess?

On Tue, Jun 8, 2010 at 1:26 PM, Laslo Forro <getforum@gmail.com> wrote:
That might be a problem that 'ó' is not recognized as \w
Actually I do not know which class 'ó' is in. It matches:

test=# select * from texts where title ~* E'\\mmacskacic\\M';
   title | a_text
--------------+----------------------------
A macskacicó | A blah blah macskacicónak.
(1 row)

As if the end-of-word is at the last 'c' . ???

If the hex. code of 'ó' is 97 (dec.151) could someone hint me how to insert it into the expression?

On Tue, Jun 8, 2010 at 1:17 PM, Laslo Forro <getforum@gmail.com> wrote:
Thanks a lot, anyway!

On Tue, Jun 8, 2010 at 12:56 PM, Thom Brown <thombrown@gmail.com> wrote:
On 8 June 2010 11:54, Laslo Forro <getforum@gmail.com> wrote:
> test=# \l
>    List of databases
>    Name | Owner | Encoding | Collation | Ctype | Access
> privileges
> -----------+----------+----------+-------------+-------------+-----------------------
> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres
>    :
> postgres=CTc/postgres
> test | salmonix | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
> (5 rows)
>

Okay, I'm not sure what the problem is there then. :S Hopefully
someone else can shed some light on it for you.

Thom

Re: Regexp match with accented character problem

From

Thom Brown

Date:

08 June 2010, 10:13:29

On 8 June 2010 14:07, Laslo Forro <getforum@gmail.com> wrote:
> OK. I have upgraded to 8.4.4.-r1 (Gentoo) and everything is fine.
> Thanx!
>

Ah, must have been a bug then.  Glad to hear you solved it. :)

Thom

Re: Regexp match with accented character problem

From

Tom Lane

Date:

08 June 2010, 10:53:16

Laslo Forro <getforum@gmail.com> writes:
> It seems that accented characters are not recognized as \w.

Just FYI, that's a known problem with the regex operators if you're
using UTF8 database encoding (or more generally, any multibyte encoding,
but UTF8 is usually the one people complain about).  I don't believe
updating to 8.4 would have fixed it for you --- maybe the reason the
problem went away is you switched to a different encoding, such as one
of the LATINn family?

There is a tentative fix in 9.0, FWIW.

            regards, tom lane

Re: Regexp match with accented character problem

From

Laslo Forro

Date:

22 June 2010, 17:41:43

That is the case.

I have to move to use ISO encoding because I can not use 9.0 .

Thank you!

On Tue, Jun 8, 2010 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Laslo Forro <getforum@gmail.com> writes:
> It seems that accented characters are not recognized as \w.

Just FYI, that's a known problem with the regex operators if you're
using UTF8 database encoding (or more generally, any multibyte encoding,
but UTF8 is usually the one people complain about). I don't believe
updating to 8.4 would have fixed it for you --- maybe the reason the
problem went away is you switched to a different encoding, such as one
of the LATINn family?

There is a tentative fix in 9.0, FWIW.

regards, tom lane