Thread: Scadinavian characters in regular expressions
Can someone please explain the following?
I am using a regular expression to find strings containing two words (begin with one or more characters not being spaces followed by a space followed by one or more characters not being spaces).
But when scandinavian characters are included it returns different results depending on where the character is positioned.
The first two-word example returns TRUE as expected.
The second three-word example returns FALSE as expected.
But when I let an å (å å a-ring) traverse through the string it unexpectedly returns TRUE when the character is positioned as the second-last or last character in the two first words.
SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE
SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE
Thank you for any response.
Søren Vainio, Denmark
On Tuesday 09 April 2002 10:51, Søren Vainio wrote: > Can someone please explain the following? > I am using a regular expression to find strings containing two words (begin > with one or more characters not being spaces followed by a space followed > by one or more characters not being spaces). > But when scandinavian characters are included it returns different results > depending on where the character is positioned. > The first two-word example returns TRUE as expected. > The second three-word example returns FALSE as expected. > But when I let an å (å å a-ring) traverse through the string it > unexpectedly returns TRUE when the character is positioned as the > second-last or last character in the two first words. > > SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > Thank you for any response. > > Søren Vainio, Denmark I just tried the following which returned false as expected: andreak=# SELECT 'oneå two three' ~ '^[^\s]+[\s][^\s]+$';?column? ----------f (1 row) andreak=# select version(); version -----------------------------------------------------------PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96 (1 row) NOTE: I replaced your [^ ] with the properly formated pattarn for whitespace: [^\s] -- Andreas Joseph Krogh (Senior Software Developer) <andreak@officenet.no> A hen is an egg's way of making another egg.
Using \s does produce FALSE for SELECT 'oneå two three' ~ '^[^\s]+[\s][^\s]+$'; But it also produces FALSE for any two-word string ex: SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE??? (I am using PostgreSQL 7.1.3) > -----Oprindelig meddelelse----- > Fra: pgsql-sql-owner@postgresql.org > [mailto:pgsql-sql-owner@postgresql.org]På vegne af Andreas > Joseph Krogh > Sendt: 9. april 2002 11:53 > Til: 'pgsql-sql@postgresql.org' > Emne: Re: [SQL] Scadinavian characters in regular expressions > > > On Tuesday 09 April 2002 10:51, Søren Vainio wrote: > > Can someone please explain the following? > > I am using a regular expression to find strings containing > two words (begin > > with one or more characters not being spaces followed by a > space followed > > by one or more characters not being spaces). > > But when scandinavian characters are included it returns > different results > > depending on where the character is positioned. > > The first two-word example returns TRUE as expected. > > The second three-word example returns FALSE as expected. > > But when I let an å (å å a-ring) traverse through > the string it > > unexpectedly returns TRUE when the character is positioned as the > > second-last or last character in the two first words. > > > > SELECT 'one two' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > > SELECT 'one two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'åone two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'oåne two three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'onåe two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > > SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > > SELECT 'one åtwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one tåwo three' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one twåo three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > > SELECT 'one twoå three' ~ '^[^ ]+[ ][^ ]+$'; returns TRUE > > SELECT 'one two åthree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one two tåhree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one two thåree' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one two thråee' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one two threåe' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > SELECT 'one two threeå' ~ '^[^ ]+[ ][^ ]+$'; returns FALSE > > > > Thank you for any response. > > > > Søren Vainio, Denmark > > I just tried the following which returned false as expected: > andreak=# SELECT 'oneå two three' ~ '^[^\s]+[\s][^\s]+$'; > ?column? > ---------- > f > (1 row) > > andreak=# select version(); > version > ----------------------------------------------------------- > PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96 > (1 row) > > NOTE: I replaced your [^ ] with the properly formated pattarn > for whitespace: > [^\s] > > -- > Andreas Joseph Krogh (Senior Software Developer) > <andreak@officenet.no> > A hen is an egg's way of making another egg. > > ---------------------------(end of > broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
On Tuesday 09 April 2002 13:28, Søren Vainio wrote: > Using \s does produce FALSE for SELECT 'oneå two three' ~ > '^[^\s]+[\s][^\s]+$'; > But it also produces FALSE for any two-word string ex: > SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE??? > (I am using PostgreSQL 7.1.3) Funny, in Perl it works: perl -e '$p = "hei hopp"; print eval ($p =~ /^[^\s]+\s[^\s]+$/) ? "true\n" : "false\n";' prints: true -- Andreas Joseph Krogh (Senior Software Developer) <andreak@officenet.no> A hen is an egg's way of making another egg.
Søren Vainio <sva@Netpointers.com> writes: > Using \s does produce FALSE for SELECT 'one� two three' ~ > '^[^\s]+[\s][^\s]+$'; > But it also produces FALSE for any two-word string ex: > SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE??? > (I am using PostgreSQL 7.1.3) I do not believe that Postgres' regular expression engine recognizes \s as meaning anything except "s". See http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/functions-matching.html In the above, it's even worse: the backslashes were eaten by the string-literal parser, so what arrived at the RE engine was just ^[^s]+[s][^s]+$ ... not likely to produce what you wanted. As for the original issue, I wonder whether you are storing the string as UTF-8 or Latin1 encoding. I have a suspicion that the � (å å a-ring) is actually a multibyte sequence inside the database and for some reason Postgres isn't configured to recognize it as a single logical character. regards, tom lane
There is obviously a problem with the scecial characters. The query SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; produced FALSE on a database with ENCODING = 'LATIN1' and TRUE on a database with ENCODING = 'UNICODE'. Do you have a suggestion to how I can find the count of two-word strings with ENCODING = 'UNICODE'? Thank you Søren Vainio > -----Oprindelig meddelelse----- > Fra: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sendt: 9. april 2002 15:34 > Til: Søren Vainio > Cc: 'Andreas Joseph Krogh'; 'pgsql-sql@postgresql.org' > Emne: Re: [SQL] Scadinavian characters in regular expressions > > > Søren Vainio <sva@Netpointers.com> writes: > > Using \s does produce FALSE for SELECT 'oneå two three' ~ > > '^[^\s]+[\s][^\s]+$'; > > But it also produces FALSE for any two-word string ex: > > SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would > expect TRUE??? > > (I am using PostgreSQL 7.1.3) > > I do not believe that Postgres' regular expression engine > recognizes \s > as meaning anything except "s". See > http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/fu nctions-matching.html In the above, it's even worse: the backslashes were eaten by the string-literal parser, so what arrived at the RE engine was just ^[^s]+[s][^s]+$ ... not likely to produce what you wanted. As for the original issue, I wonder whether you are storing the string as UTF-8 or Latin1 encoding. I have a suspicion that the å (å å a-ring) is actually a multibyte sequence inside the database and for some reason Postgres isn't configured to recognize it as a single logical character. regards, tom lane
> There is obviously a problem with the scecial characters. > The query SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$'; > produced FALSE on a database with ENCODING = 'LATIN1' and TRUE on a database > with ENCODING = 'UNICODE'. > > Do you have a suggestion to how I can find the count of two-word strings > with ENCODING = 'UNICODE'? I see no problem here. SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$';?column? ----------f (1 row) SELECT 'oneå three' ~ '^[^ ]+[ ][^ ]+$';?column? ----------t (1 row) (The database encoding is UNICODE) This is PostgreSQL 7.1.3/7.2.1 running on a Linux box. I guess you have some problems with your installation or platform. -- Tatsuo Ishii