Thread: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

From
Alexander Farber
Date:
Hello,

I have prepared an SQL fiddle for my question:
http://sqlfiddle.com/#!11/8a494/4

And also described it in more detail at
http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in-conditionals-to-return-null

Does anybody please know how to check for
UTF8 range \x0410-\x042F in my code below?

I've tried both
new.word !~ '^[\x0410-\x042F]{2,}$'
(fails with syntax error) and
new.word !~ '^[\u0410-\u042F]{2,}$'
(triggers even for correct words):


create table good_words (
        word varchar(64) primary key
);

create or replace function keep_clean() returns trigger as $body$
        begin
                new.word := upper(new.word);

                /* next line does not compile? */
                IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
                    RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
                END IF;

                IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
                    return NULL;
                END IF;

                /* does not return NULL for 'ошибббка'? */
                IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
                    return NULL;
                END IF;

                return new;
        end;
$body$ language plpgsql;


Thank you
Alex


Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

From
Alexander Farber
Date:
I'm trying at the psql prompt of an UTF8 database in 8.4.13:

# select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
WARNING:  nonstandard use of escape in a string literal
LINE 1: select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
                         ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
 ?column?
----------
 f
(1 row)


On Tue, Mar 19, 2013 at 4:10 PM, Alexander Farber
<alexander.farber@gmail.com> wrote:
> http://sqlfiddle.com/#!11/8a494/4


Alexander Farber <alexander.farber@gmail.com> writes:
> I'm trying at the psql prompt of an UTF8 database in 8.4.13:
> # select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
> WARNING:  nonstandard use of escape in a string literal

I think Unicode escapes were introduced in 9.0.  In 8.4 you'd probably
have to write out the UTF8 equivalent as octal escapes :-(

            regards, tom lane


Alexander Farber wrote:
> I have prepared an SQL fiddle for my question:
> http://sqlfiddle.com/#!11/8a494/4
> 
> And also described it in more detail at
> http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in-
> conditionals-to-return-null
> 
> Does anybody please know how to check for
> UTF8 range \x0410-\x042F in my code below?
> 
> I've tried both
> new.word !~ '^[\x0410-\x042F]{2,}$'
> (fails with syntax error) and
> new.word !~ '^[\u0410-\u042F]{2,}$'
> (triggers even for correct words):

Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2,
encoding "UTF8", collation and ctype "de_DE.UTF8"):

test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$';
 ?column?
----------
 t
(1 row)

test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$';
 ?column?
----------
 f
(1 row)

> create table good_words (
>         word varchar(64) primary key
> );
> 
> create or replace function keep_clean() returns trigger as $body$
>         begin
>                 new.word := upper(new.word);
> 
>                 /* next line does not compile? */
>                 IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
>                     RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
>                 END IF;
> 
>                 IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
>                     return NULL;
>                 END IF;
> 
>                 /* does not return NULL for 'ошибббка'? */
>                 IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
> AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
>                     return NULL;

This works for me as well:

test=> SELECT 'ошибббка' ~ '(.)\1\1'
          AND 'ошибббка' NOT LIKE '%ШЕЕЕ%'
          AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%';
 ?column?
----------
 t
(1 row)

test=> SELECT 'ошиббка' ~ '(.)\1\1'
          AND 'ошиббка' NOT LIKE '%ШЕЕЕ%'
          AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%';
 ?column?
----------
 f
(1 row)

>                 END IF;
> 
>                 return new;
>         end;
> $body$ language plpgsql;

What do you get for

SELECT pg_encoding_to_char(encoding),
       datcollate,
       datctype
FROM pg_database WHERE datname = current_database();

and for 

SHOW client_encoding;

Yours,
Laurenz Albe

Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

From
Alexander Farber
Date:
Thanks for trying! I am using CentOS 6.3

It seems to be better in 9.2.x?

Unfortunately I'd like to stay with 8.4.x for now
(because I use the PostgreSQL instance
with other projects at the same host)....

Regards
Alex


On Wed, Mar 20, 2013 at 10:35 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> Alexander Farber wrote:
>> I have prepared an SQL fiddle for my question:
>> http://sqlfiddle.com/#!11/8a494/4
>>
> Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2,
> encoding "UTF8", collation and ctype "de_DE.UTF8"):
>
> test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$';
>  ?column?
> ----------
>  t
> (1 row)
>
> test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$';
>  ?column?
> ----------
>  f
> (1 row)
>
>> create table good_words (
>>         word varchar(64) primary key
>> );
>>
>> create or replace function keep_clean() returns trigger as $body$
>>         begin
>>                 new.word := upper(new.word);
>>
>>                 /* next line does not compile? */
>>                 IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
>>                     RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
>>                 END IF;
>>
>>                 IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
>>                     return NULL;
>>                 END IF;
>>
>>                 /* does not return NULL for 'ошибббка'? */
>>                 IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
>> AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
>>                     return NULL;
>
> This works for me as well:
>
> test=> SELECT 'ошибббка' ~ '(.)\1\1'
>           AND 'ошибббка' NOT LIKE '%ШЕЕЕ%'
>           AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%';
>  ?column?
> ----------
>  t
> (1 row)
>
> test=> SELECT 'ошиббка' ~ '(.)\1\1'
>           AND 'ошиббка' NOT LIKE '%ШЕЕЕ%'
>           AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%';
>  ?column?
> ----------
>  f
> (1 row)
>
>>                 END IF;
>>
>>                 return new;
>>         end;
>> $body$ language plpgsql;
>
> What do you get for
>
> SELECT pg_encoding_to_char(encoding),
>        datcollate,
>        datctype
> FROM pg_database WHERE datname = current_database();
>
> and for
>
> SHOW client_encoding;
>
> Yours,
> Laurenz Albe
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


Alexander Farber wrote:
> It seems to be better in 9.2.x?

Yes, as Tom has pointed out.
I didn't see that you were on 8.4 when I wrote my answer.

Yours,
Laurenz Albe

Alexander Farber <alexander.farber@gmail.com> writes:
> Thanks for trying! I am using CentOS 6.3
> It seems to be better in 9.2.x?

As stated upthread, 8.4 doesn't understand \u escapes.  You'd need to
put in the characters another way --- either literally, or using octal
escapes to spell out the UTF8 encoding.  I think it will work in 8.4
if you do, but not 100% sure.

            regards, tom lane


Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

From
Alexander Farber
Date:
Hello,

unfortunately octal doesn't seem to work either -

On Tue, Mar 19, 2013 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Farber <alexander.farber@gmail.com> writes:
>> # select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
>> WARNING:  nonstandard use of escape in a string literal
>
> I think Unicode escapes were introduced in 9.0.  In 8.4 you'd probably
> have to write out the UTF8 equivalent as octal escapes :-(

    # select 'АБВГД' ~ '^[\2020-\2057]{2,}$';
    WARNING:  nonstandard use of escape in a string literal
    LINE 1: select 'АБВГД' ~ '^[\2020-\2057]{2,}$';
                             ^
    HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
    ERROR:  invalid byte sequence for encoding "UTF8": 0x82
    HINT:  This error can also happen if the byte sequence does not
match the encoding expected by the server, which is controlled by
"client_encoding".

But writing out UTF8 equivalents seems to work
(trying to detect capitalized Russian letters as per
http://www.unicode.org/charts/PDF/U0400.pdf ):

    # select 'АБВГД' ~ '^[А-Я]{2,}$';
     ?column?
    ----------
       t
    (1 row)

And then I try to solve my 2nd problem (detecting 3
letters in a row, a rare case in Russian language):

# select 'ОШИБББКА' ~ '(.)\1\1';
WARNING:  nonstandard use of escape in a string literal
LINE 1: select 'ОШИБББКА' ~ '(.)\1\1';
                            ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
 ?column?
----------
 f
(1 row)


Does anybody please know why this fails in 8.4.13?

According to the table 9-18 in
http://www.postgresql.org/docs/8.4/static/functions-matching.html
it should be ok to use \1 for referencing
parts captured by round brackets?

Regards
Alex


Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

From
Alexander Farber
Date:
My insert trigger for 8.4.13 works now:
http://sqlfiddle.com/#!11/c74a1/3

Thank you for you help