Thread: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Alexander Farber
Date:
Hello, I have prepared an SQL fiddle for my question: http://sqlfiddle.com/#!11/8a494/4 And also described it in more detail at http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in-conditionals-to-return-null Does anybody please know how to check for UTF8 range \x0410-\x042F in my code below? I've tried both new.word !~ '^[\x0410-\x042F]{2,}$' (fails with syntax error) and new.word !~ '^[\u0410-\u042F]{2,}$' (triggers even for correct words): create table good_words ( word varchar(64) primary key ); create or replace function keep_clean() returns trigger as $body$ begin new.word := upper(new.word); /* next line does not compile? */ IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN RAISE EXCEPTION 'Not an uppercased Russian word in UTF8'; END IF; IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN return NULL; END IF; /* does not return NULL for 'ошибббка'? */ IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%' AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN return NULL; END IF; return new; end; $body$ language plpgsql; Thank you Alex
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Alexander Farber
Date:
I'm trying at the psql prompt of an UTF8 database in 8.4.13: # select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$'; WARNING: nonstandard use of escape in a string literal LINE 1: select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$'; ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ?column? ---------- f (1 row) On Tue, Mar 19, 2013 at 4:10 PM, Alexander Farber <alexander.farber@gmail.com> wrote: > http://sqlfiddle.com/#!11/8a494/4
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Tom Lane
Date:
Alexander Farber <alexander.farber@gmail.com> writes: > I'm trying at the psql prompt of an UTF8 database in 8.4.13: > # select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$'; > WARNING: nonstandard use of escape in a string literal I think Unicode escapes were introduced in 9.0. In 8.4 you'd probably have to write out the UTF8 equivalent as octal escapes :-( regards, tom lane
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Albe Laurenz
Date:
Alexander Farber wrote: > I have prepared an SQL fiddle for my question: > http://sqlfiddle.com/#!11/8a494/4 > > And also described it in more detail at > http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in- > conditionals-to-return-null > > Does anybody please know how to check for > UTF8 range \x0410-\x042F in my code below? > > I've tried both > new.word !~ '^[\x0410-\x042F]{2,}$' > (fails with syntax error) and > new.word !~ '^[\u0410-\u042F]{2,}$' > (triggers even for correct words): Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2, encoding "UTF8", collation and ctype "de_DE.UTF8"): test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$'; ?column? ---------- t (1 row) test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$'; ?column? ---------- f (1 row) > create table good_words ( > word varchar(64) primary key > ); > > create or replace function keep_clean() returns trigger as $body$ > begin > new.word := upper(new.word); > > /* next line does not compile? */ > IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN > RAISE EXCEPTION 'Not an uppercased Russian word in UTF8'; > END IF; > > IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN > return NULL; > END IF; > > /* does not return NULL for 'ошибббка'? */ > IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%' > AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN > return NULL; This works for me as well: test=> SELECT 'ошибббка' ~ '(.)\1\1' AND 'ошибббка' NOT LIKE '%ШЕЕЕ%' AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%'; ?column? ---------- t (1 row) test=> SELECT 'ошиббка' ~ '(.)\1\1' AND 'ошиббка' NOT LIKE '%ШЕЕЕ%' AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%'; ?column? ---------- f (1 row) > END IF; > > return new; > end; > $body$ language plpgsql; What do you get for SELECT pg_encoding_to_char(encoding), datcollate, datctype FROM pg_database WHERE datname = current_database(); and for SHOW client_encoding; Yours, Laurenz Albe
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Alexander Farber
Date:
Thanks for trying! I am using CentOS 6.3 It seems to be better in 9.2.x? Unfortunately I'd like to stay with 8.4.x for now (because I use the PostgreSQL instance with other projects at the same host).... Regards Alex On Wed, Mar 20, 2013 at 10:35 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote: > Alexander Farber wrote: >> I have prepared an SQL fiddle for my question: >> http://sqlfiddle.com/#!11/8a494/4 >> > Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2, > encoding "UTF8", collation and ctype "de_DE.UTF8"): > > test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$'; > ?column? > ---------- > t > (1 row) > > test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$'; > ?column? > ---------- > f > (1 row) > >> create table good_words ( >> word varchar(64) primary key >> ); >> >> create or replace function keep_clean() returns trigger as $body$ >> begin >> new.word := upper(new.word); >> >> /* next line does not compile? */ >> IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN >> RAISE EXCEPTION 'Not an uppercased Russian word in UTF8'; >> END IF; >> >> IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN >> return NULL; >> END IF; >> >> /* does not return NULL for 'ошибббка'? */ >> IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%' >> AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN >> return NULL; > > This works for me as well: > > test=> SELECT 'ошибббка' ~ '(.)\1\1' > AND 'ошибббка' NOT LIKE '%ШЕЕЕ%' > AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%'; > ?column? > ---------- > t > (1 row) > > test=> SELECT 'ошиббка' ~ '(.)\1\1' > AND 'ошиббка' NOT LIKE '%ШЕЕЕ%' > AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%'; > ?column? > ---------- > f > (1 row) > >> END IF; >> >> return new; >> end; >> $body$ language plpgsql; > > What do you get for > > SELECT pg_encoding_to_char(encoding), > datcollate, > datctype > FROM pg_database WHERE datname = current_database(); > > and for > > SHOW client_encoding; > > Yours, > Laurenz Albe > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Albe Laurenz
Date:
Alexander Farber wrote: > It seems to be better in 9.2.x? Yes, as Tom has pointed out. I didn't see that you were on 8.4 when I wrote my answer. Yours, Laurenz Albe
Alexander Farber <alexander.farber@gmail.com> writes: > Thanks for trying! I am using CentOS 6.3 > It seems to be better in 9.2.x? As stated upthread, 8.4 doesn't understand \u escapes. You'd need to put in the characters another way --- either literally, or using octal escapes to spell out the UTF8 encoding. I think it will work in 8.4 if you do, but not 100% sure. regards, tom lane
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Alexander Farber
Date:
Hello, unfortunately octal doesn't seem to work either - On Tue, Mar 19, 2013 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alexander Farber <alexander.farber@gmail.com> writes: >> # select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$'; >> WARNING: nonstandard use of escape in a string literal > > I think Unicode escapes were introduced in 9.0. In 8.4 you'd probably > have to write out the UTF8 equivalent as octal escapes :-( # select 'АБВГД' ~ '^[\2020-\2057]{2,}$'; WARNING: nonstandard use of escape in a string literal LINE 1: select 'АБВГД' ~ '^[\2020-\2057]{2,}$'; ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ERROR: invalid byte sequence for encoding "UTF8": 0x82 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". But writing out UTF8 equivalents seems to work (trying to detect capitalized Russian letters as per http://www.unicode.org/charts/PDF/U0400.pdf ): # select 'АБВГД' ~ '^[А-Я]{2,}$'; ?column? ---------- t (1 row) And then I try to solve my 2nd problem (detecting 3 letters in a row, a rare case in Russian language): # select 'ОШИБББКА' ~ '(.)\1\1'; WARNING: nonstandard use of escape in a string literal LINE 1: select 'ОШИБББКА' ~ '(.)\1\1'; ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ?column? ---------- f (1 row) Does anybody please know why this fails in 8.4.13? According to the table 9-18 in http://www.postgresql.org/docs/8.4/static/functions-matching.html it should be ok to use \1 for referencing parts captured by round brackets? Regards Alex
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13
From
Alexander Farber
Date:
My insert trigger for 8.4.13 works now: http://sqlfiddle.com/#!11/c74a1/3 Thank you for you help