Re: Scadinavian characters in regular expressions - Mailing list pgsql-sql

From Søren Vainio
Subject Re: Scadinavian characters in regular expressions
Date
Msg-id 910513A5A944D5118BE900C04F67CB5A1F82C7@MAIL
In response to Scadinavian characters in regular expressions  (Søren Vainio <sva@Netpointers.com>)
Responses Re: Scadinavian characters in regular expressions
List pgsql-sql
There is obviously a problem with the special characters.
The query SELECT 'oneå two three' ~ '^[^ ]+[ ][^ ]+$';
produced FALSE on a database with ENCODING = 'LATIN1' and TRUE on a database
with ENCODING = 'UNICODE'.

Do you have a suggestion as to how I can count the two-word strings
with ENCODING = 'UNICODE'?

Thank you
Søren Vainio

> -----Original message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 9 April 2002 15:34
> To: Søren Vainio
> Cc: 'Andreas Joseph Krogh'; 'pgsql-sql@postgresql.org'
> Subject: Re: [SQL] Scadinavian characters in regular expressions
>
>
> Søren Vainio <sva@Netpointers.com> writes:
> > Using \s does produce FALSE for SELECT 'oneå two three' ~
> > '^[^\s]+[\s][^\s]+$';
> > But it also produces FALSE for any two-word string, e.g.:
> > SELECT 'one two' ~ '^[^\s]+[\s][^\s]+$'; where I would expect TRUE???
> > (I am using PostgreSQL 7.1.3)
>
> I do not believe that Postgres' regular expression engine recognizes \s
> as meaning anything except "s".  See
> http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/functions-matching.html

In the above, it's even worse: the backslashes were eaten by the
string-literal parser, so what arrived at the RE engine was just
^[^s]+[s][^s]+$ ... not likely to produce what you wanted.
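
The effect described above can be sketched outside the database. The following Python snippet is only an illustration, not PostgreSQL's actual parser: `unescape_literal` is a hypothetical helper that models a string-literal parser which drops each backslash and keeps the character after it, and Python's `re` module stands in for the regular expression engine.

```python
import re


def unescape_literal(literal: str) -> str:
    """Hypothetical model of a string-literal parser.

    Each backslash is dropped and the character after it is kept as-is.
    Real parsers also translate a few special escapes; that is ignored here.
    """
    out = []
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch == "\\" and i + 1 < len(literal):
            out.append(literal[i + 1])  # keep the escaped character only
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


typed = r"^[^\s]+[\s][^\s]+$"           # what was written in the query
seen_by_engine = unescape_literal(typed)  # what reaches the RE engine

print(seen_by_engine)                                    # ^[^s]+[s][^s]+$
print(re.search(seen_by_engine, "one two") is not None)  # False
print(re.search(seen_by_engine, "onestwo") is not None)  # True
```

With the backslashes gone, the pattern asks for two runs of non-"s" characters separated by a literal "s", which is why 'one two' does not match.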

As for the original issue, I wonder whether you are storing the string
as UTF-8 or Latin1 encoding.  I have a suspicion that the å (a-ring)
is actually a multibyte sequence inside the database
and for some reason Postgres isn't configured to recognize it as a
single logical character.
        regards, tom lane
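
To make the multibyte suspicion concrete, here is a small Python sketch of how the same string is laid out in the two encodings. It does not reproduce the TRUE/FALSE difference reported in this thread; it only shows that å is one byte in Latin1 and two bytes in UTF-8, so code that counts bytes and code that counts characters can disagree about the string.

```python
# "\u00e5" is the a-ring character; the escape keeps this file pure ASCII.
text = "one\u00e5 two three"

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

# One logical character, but a different number of bytes per encoding.
print("\u00e5".encode("latin-1"))  # b'\xe5'
print("\u00e5".encode("utf-8"))    # b'\xc3\xa5'

# Character count versus byte counts.
print(len(text), len(latin1_bytes), len(utf8_bytes))  # 14 14 15

# UTF-8 bytes misread as Latin1 show up as two separate characters.
misread = utf8_bytes.decode("latin-1")
print(len(misread))  # 15
```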

