Thread: Scadinavian characters in regular expressions

Scadinavian characters in regular expressions

From

Søren Vainio

Date:

09 April 2002, 04:52:33

Can someone please explain the following?

I am using a regular expression to find strings containing two words (begin with one or more characters not being spaces followed by a space followed by one or more characters not being spaces).

But when scandinavian characters are included it returns different results depending on where the character is positioned.

The first two-word example returns TRUE as expected.

The second three-word example returns FALSE as expected.

But when I let an å (&#229 &aring a-ring) traverse through the string it unexpectedly returns TRUE when the character is positioned as the second-last or last character in the two first words.

SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE

Thank you for any response.

Søren Vainio, Denmark

Re: Scadinavian characters in regular expressions

From

Andreas Joseph Krogh

Date:

09 April 2002, 05:59:28

On Tuesday 09 April 2002 10:51, Søren Vainio wrote:
> Can someone please explain the following?
> I am using a regular expression to find strings containing two words (begin
> with one or more characters not being spaces followed by a space followed
> by one or more characters not being spaces).
> But when scandinavian characters are included it returns different results
> depending on where the character is positioned.
> The first two-word example returns TRUE as expected.
> The second three-word example returns FALSE as expected.
> But when I let an å (å å a-ring) traverse through the string it
> unexpectedly returns TRUE when the character is positioned as the
> second-last or last character in the two first words.
>
> SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
>
> Thank you for any response.
>
> Søren Vainio, Denmark

I just tried the following which returned false as expected:
andreak=# SELECT 'oneå two three' ~ '^[^\s]+[\s][^\s]+$';?column?
----------f
(1 row)

andreak=# select version();                         version
-----------------------------------------------------------PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96
(1 row)

NOTE: I replaced your [^ ] with the properly formated pattarn for whitespace:
[^\s]

--
Andreas Joseph Krogh (Senior Software Developer) <andreak@officenet.no>
A hen is an egg's way of making another egg.

Re: Scadinavian characters in regular expressions

From

Søren Vainio

Date:

09 April 2002, 07:28:13

Using \s does produce FALSE for SELECT 'oneå two three' ~
'^[^\s]+[\s][^\s]+$';
But it also produces FALSE for any two-word string ex:
SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE???
(I am using PostgreSQL 7.1.3)

> -----Oprindelig meddelelse-----
> Fra: pgsql-sql-owner@postgresql.org
> [mailto:pgsql-sql-owner@postgresql.org]På vegne af Andreas
> Joseph Krogh
> Sendt: 9. april 2002 11:53
> Til: 'pgsql-sql@postgresql.org'
> Emne: Re: [SQL] Scadinavian characters in regular expressions
>
>
> On Tuesday 09 April 2002 10:51, Søren Vainio wrote:
> > Can someone please explain the following?
> > I am using a regular expression to find strings containing
> two words (begin
> > with one or more characters not being spaces followed by a
> space followed
> > by one or more characters not being spaces).
> > But when scandinavian characters are included it returns
> different results
> > depending on where the character is positioned.
> > The first two-word example returns TRUE as expected.
> > The second three-word example returns FALSE as expected.
> > But when I let an å (å å a-ring) traverse through
> the string it
> > unexpectedly returns TRUE when the character is positioned as the
> > second-last or last character in the two first words.
> >
> > SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> > SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> > SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> > SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> > SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
> > SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> > SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
> >
> > Thank you for any response.
> >
> > Søren Vainio, Denmark
>
> I just tried the following which returned false as expected:
> andreak=# SELECT 'oneå two three' ~ '^[^\s]+[\s][^\s]+$';
>  ?column?
> ----------
>  f
> (1 row)
>
> andreak=# select version();
>                           version
> -----------------------------------------------------------
>  PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96
> (1 row)
>
> NOTE: I replaced your [^ ] with the properly formated pattarn
> for whitespace:
> [^\s]
>
> --
> Andreas Joseph Krogh (Senior Software Developer)
> <andreak@officenet.no>
> A hen is an egg's way of making another egg.
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>

Re: Scadinavian characters in regular expressions

From

Andreas Joseph Krogh

Date:

09 April 2002, 07:51:11

On Tuesday 09 April 2002 13:28, Søren Vainio wrote:
> Using \s does produce FALSE for SELECT 'oneå two three' ~
> '^[^\s]+[\s][^\s]+$';
> But it also produces FALSE for any two-word string ex:
> SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE???
> (I am using PostgreSQL 7.1.3)

Funny, in Perl it works:

perl -e '$p = "hei hopp"; print eval ($p =~ /^[^\s]+\s[^\s]+$/) ? "true\n" :
"false\n";'

prints:
true


--
Andreas Joseph Krogh (Senior Software Developer) <andreak@officenet.no>
A hen is an egg's way of making another egg.

Re: Scadinavian characters in regular expressions

From

Tom Lane

Date:

09 April 2002, 09:36:23

Søren Vainio <sva@Netpointers.com> writes:
> Using \s does produce FALSE for SELECT 'one� two three' ~
> '^[^\s]+[\s][^\s]+$';
> But it also produces FALSE for any two-word string ex:
> SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE???
> (I am using PostgreSQL 7.1.3)

I do not believe that Postgres' regular expression engine recognizes \s
as meaning anything except "s".  See
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/functions-matching.html

In the above, it's even worse: the backslashes were eaten by the
string-literal parser, so what arrived at the RE engine was just
^[^s]+[s][^s]+$ ... not likely to produce what you wanted.

As for the original issue, I wonder whether you are storing the string
as UTF-8 or Latin1 encoding.  I have a suspicion that the � (å
å a-ring) is actually a multibyte sequence inside the database
and for some reason Postgres isn't configured to recognize it as a
single logical character.
        regards, tom lane

Re: Scadinavian characters in regular expressions

From

Søren Vainio

Date:

09 April 2002, 10:25:05

There is obviously a problem with the scecial characters.
The query SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$';
produced FALSE on a database with ENCODING = 'LATIN1' and TRUE on a database
with ENCODING = 'UNICODE'.

Do you have a suggestion to how I can find the count of two-word strings
with ENCODING = 'UNICODE'?

Thank you
Søren Vainio

> -----Oprindelig meddelelse-----
> Fra: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sendt: 9. april 2002 15:34
> Til: Søren Vainio
> Cc: 'Andreas Joseph Krogh'; 'pgsql-sql@postgresql.org'
> Emne: Re: [SQL] Scadinavian characters in regular expressions
>
>
> Søren Vainio <sva@Netpointers.com> writes:
> > Using \s does produce FALSE for SELECT 'oneå two three' ~
> > '^[^\s]+[\s][^\s]+$';
> > But it also produces FALSE for any two-word string ex:
> > SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would
> expect TRUE???
> > (I am using PostgreSQL 7.1.3)
>
> I do not believe that Postgres' regular expression engine
> recognizes \s
> as meaning anything except "s".  See
> http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/fu
nctions-matching.html

In the above, it's even worse: the backslashes were eaten by the
string-literal parser, so what arrived at the RE engine was just
^[^s]+[s][^s]+$ ... not likely to produce what you wanted.

As for the original issue, I wonder whether you are storing the string
as UTF-8 or Latin1 encoding.  I have a suspicion that the å (å
å a-ring) is actually a multibyte sequence inside the database
and for some reason Postgres isn't configured to recognize it as a
single logical character.
        regards, tom lane

Re: Scadinavian characters in regular expressions

From

Tatsuo Ishii

Date:

09 April 2002, 21:09:46

> There is obviously a problem with the scecial characters.
> The query SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$';
> produced FALSE on a database with ENCODING = 'LATIN1' and TRUE on a database
> with ENCODING = 'UNICODE'.
>
> Do you have a suggestion to how I can find the count of two-word strings
> with ENCODING = 'UNICODE'?

I see no problem here.

SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$';?column?
----------f
(1 row)

SELECT 'oneå three' ~ '^[^ ]+[ ][^ ]+$';?column?
----------t
(1 row)

(The database encoding is UNICODE)

This is PostgreSQL 7.1.3/7.2.1 running on a Linux box. I guess you
have some problems with your installation or platform.
--
Tatsuo Ishii