Thread: full text search, utf8

From

alexander lunyov

Date:

03 June 2009, 13:29:27

Hello.

I have FreeBSD 6.2 and postgresql-8.3.1.

In the environment:

% env | grep UTF
LANG=ru_RU.UTF-8
MM_CHARSET=UTF-8

% psql ports -U pgsql
Welcome to psql 8.3.1, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
        \h for help with SQL commands
        \? for help with psql commands
        \g or terminate with semicolon to execute query
        \q to quit

ports=# \encoding
UTF8
ports=# \l
         List of databases
     Name   | Owner    | Encoding
-----------+----------+-----------
  ports     | pgsql    | UTF8
  postgres  | pgsql    | UTF8
  template0 | pgsql    | UTF8
  template1 | pgsql    | UTF8
(4 rows)

I try to search the table, and this is the result:

ports=# select name from abonents where to_tsvector(name) @@
to_tsquery('s');
ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
HINT:  This error can also happen if the byte sequence does not
match the encoding expected by the server, which is controlled by
"client_encoding".

However, with the english configuration it works fine:

# select count(name) from abonents where to_tsvector('english',name) @@
to_tsquery('some');
  count
-------
      6
(1 row)

Why?
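
One detail that may help when reading the error: in UTF-8, 0xd1 is a lead byte that opens a two-byte sequence, and many Cyrillic letters are encoded with it. On its own, or followed by a byte that is not a continuation byte, it is not valid UTF-8. The Python sketch below only illustrates this property of the encoding; it is not a diagnosis of this particular setup, and the helper name and sample bytes are made up for illustration.

```python
# Illustration only: why a lone 0xd1 byte is not valid UTF-8.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The Cyrillic small letter es (U+0441) takes two bytes; the first is 0xd1.
encoded = "\u0441".encode("utf-8")
print(encoded.hex())               # d181

# The complete two-byte sequence is valid ...
print(is_valid_utf8(encoded))      # True
# ... but the lead byte alone (a character cut in half) is not,
print(is_valid_utf8(encoded[:1]))  # False
# and neither is 0xd1 followed by a non-continuation byte such as ASCII "a".
print(is_valid_utf8(b"\xd1a"))     # False
```

So a 0xd1 in the message does not by itself mean the stored text contains garbage; it can also appear when valid Cyrillic text is split in the middle of a character somewhere along the way.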

--
alexander lunyov

Re: full text search, utf8

From

alexander lunyov

Date:

03 June 2009, 15:56:59

I can answer in English if you like.

This error also happens when I try to CREATE TEXT SEARCH DICTIONARY:

ports=# CREATE TEXT SEARCH DICTIONARY ruispell (
ports(#     TEMPLATE = ispell,
ports(#     DictFile = russian,
ports(#     AffFile = russian,
ports(#     StopWords = russian
ports(# );
ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
HINT:  This error can also happen if the byte sequence does not
match the encoding expected by the server, which is controlled by
"client_encoding".
ports=#

All the data in the table was loaded by a Perl script that reads a UTF8
text file and issues INSERTs, so I think that if there were an illegal
character, the error would have appeared at INSERT time.


Andrew Boag wrote:
> Sorry for the English response (I don't have a Russian keyboard here).
>
> 0xd1 may be an illegal UTF8 character that was mistakenly allowed into
> the table. Not all libraries (or all versions of postgres) prevent
> illegal UTF8 characters from getting into the DB.
>
> We saw similar issues with a 7.4 -> 8.1 postgres data migration.
>
> However, I don't fully understand your select query so there may be
> another cause.
>
> alexander lunyov wrote:
>> Hello.
>>
>> I have FreeBSD 6.2 and postgresql-8.3.1.
>>
>> In the environment:
>>
>> % env | grep UTF
>> LANG=ru_RU.UTF-8
>> MM_CHARSET=UTF-8
>>
>> % psql ports -U pgsql
>> Welcome to psql 8.3.1, the PostgreSQL interactive terminal.
>>
>> Type:  \copyright for distribution terms
>>        \h for help with SQL commands
>>        \? for help with psql commands
>>        \g or terminate with semicolon to execute query
>>        \q to quit
>>
>> ports=# \encoding
>> UTF8
>> ports=# \l
>>         List of databases
>>     Name   | Owner    | Encoding
>> -----------+----------+-----------
>>  ports     | pgsql    | UTF8
>>  postgres  | pgsql    | UTF8
>>  template0 | pgsql    | UTF8
>>  template1 | pgsql    | UTF8
>> (4 rows)
>>
>> I try to search the table, and this is the result:
>>
>> ports=# select name from abonents where to_tsvector(name) @@
>> to_tsquery('s');
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
>> HINT:  This error can also happen if the byte sequence does not
>> match the encoding expected by the server, which is controlled by
>> "client_encoding".
>>
>> However, with the english configuration it works fine:
>>
>> # select count(name) from abonents where to_tsvector('english',name)
>> @@ to_tsquery('some');
>>  count
>> -------
>>      6
>> (1 row)
>>
>> Why?
>>
>
>


--
Best regards,
Александр Лунев
ОАО РТК

Re: full text search, utf8

From

eshkinkot@gmail.com (Сергей Бурладян)

Date:

04 June 2009, 03:55:14

alexander lunyov <lan@zato.ru> writes:

> This error also happens when I try to CREATE TEXT SEARCH DICTIONARY:
> 
> ports=# CREATE TEXT SEARCH DICTIONARY ruispell (
> ports(#     TEMPLATE = ispell,
> ports(#     DictFile = russian,
> ports(#     AffFile = russian,
> ports(#     StopWords = russian
> ports(# );
> ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
> HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".
> ports=#

And what encoding are the dictionary files in? They are probably the cause.

-- 
Best regards, Сергей Бурладян

Re: full text search, utf8

From

alexander lunyov

Date:

04 June 2009, 09:19:34

Сергей Бурладян wrote:
> alexander lunyov <lan@zato.ru> writes:
>
>> This error also happens when I try to CREATE TEXT SEARCH DICTIONARY:
>>
>> ports=# CREATE TEXT SEARCH DICTIONARY ruispell (
>> ports(#     TEMPLATE = ispell,
>> ports(#     DictFile = russian,
>> ports(#     AffFile = russian,
>> ports(#     StopWords = russian
>> ports(# );
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
>> HINT:  This error can also happen if the byte sequence does not match the
>> encoding expected by the server, which is controlled by "client_encoding".
>> ports=#
>
> And what encoding are the dictionary files in? They are probably the cause.
>

# file russian.affix russian.dict russian.stop
russian.affix: UTF-8 Unicode text
russian.dict:  UTF-8 Unicode text
russian.stop:  UTF-8 Unicode text
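
As far as I know, `file` bases its verdict on a leading portion of each file rather than on every byte, so a stray invalid sequence deep inside a large dictionary could go unnoticed. A stricter check is to decode the whole file and report where decoding first fails. The Python sketch below does that; the function name and the sample files in the demo are hypothetical stand-ins, not the real russian.* dictionaries.

```python
import os
import tempfile
from typing import Optional


def first_invalid_utf8_offset(path: str) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")  # strict decoding: any bad byte raises
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def demo() -> None:
    # Hypothetical sample files, created only for this illustration.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.dict")
        bad = os.path.join(tmp, "bad.dict")
        with open(good, "wb") as handle:
            handle.write("слово\n".encode("utf-8"))
        with open(bad, "wb") as handle:
            # A valid ASCII line, then a Cyrillic character cut in half.
            handle.write(b"word\n\xd1")
        print(first_invalid_utf8_offset(good))  # None
        print(first_invalid_utf8_offset(bad))   # 5


if __name__ == "__main__":
    demo()
```

If every dictionary file comes back as `None`, the files themselves are valid UTF-8 and the cause has to be looked for elsewhere.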



--
alexander lunyov

Re: full text search, utf8

From

alexander lunyov

Date:

04 June 2009, 09:30:36

Andrew Boag wrote:
>> This error also happens when I try to CREATE TEXT SEARCH
>> DICTIONARY:
>>
>> ports=# CREATE TEXT SEARCH DICTIONARY ruispell (
>> ports(#     TEMPLATE = ispell,
>> ports(#     DictFile = russian,
>> ports(#     AffFile = russian,
>> ports(#     StopWords = russian
>> ports(# );
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
>> HINT:  This error can also happen if the byte sequence does not
>> match the encoding expected by the server, which is controlled by
>> "client_encoding".
>> ports=#
>>
>> All the data in the table was loaded by a Perl script that reads a UTF8
>> text file and issues INSERTs, so I think that if there were an illegal
>> character, the error would have appeared at INSERT time.
> I agree with you, but I would still try converting the source text before
> the INSERT:
>
> iconv -c -t utf8 blah.sql > blah.sql.recode
>
>  (-c means: omit illegal UTF8 characters)

I converted both the dictionaries (taken from OpenOffice) and the data file
(csv) with the -c switch; same story.
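
For readers who want to see what this kind of cleanup does to the data: `iconv -c` omits characters it cannot convert. Roughly the same effect can be sketched in Python by decoding with `errors="ignore"`. Dropping bytes silently can hide the real problem, so the sketch also counts how many bytes were lost. The function name and the sample bytes are hypothetical, and this is an approximation of the idea rather than a reimplementation of iconv.

```python
from typing import Tuple


def strip_invalid_utf8(data: bytes) -> Tuple[bytes, int]:
    """Drop bytes that are not valid UTF-8; return (cleaned, bytes_dropped)."""
    # Invalid sequences are skipped while decoding; re-encoding yields clean UTF-8.
    cleaned = data.decode("utf-8", errors="ignore").encode("utf-8")
    return cleaned, len(data) - len(cleaned)


# Two valid Cyrillic letters, a stray 0xd1 lead byte, then ASCII text.
sample = b"\xd0\xb4\xd0\xb0\xd1 ok"
cleaned, dropped = strip_invalid_utf8(sample)
print(cleaned == b"\xd0\xb4\xd0\xb0 ok")  # True
print(dropped)                            # 1
```

A count of zero means the input was already valid UTF-8, in which case this kind of cleanup cannot change the outcome.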

--
alexander lunyov

Re: full text search, utf8

From

Nikolay Samokhvalov

Date:

04 June 2009, 09:36:06

Are you familiar with http://blog.lexa.ru/2008/03/03/freebsd_utf8_russian_collate_vtoraja_popitka.html ?

Sincerely yours,
Nikolay

2009/6/3 alexander lunyov <lan@zato.ru>

Hello.

I have FreeBSD 6.2 and postgresql-8.3.1.

In the environment:

% env | grep UTF
LANG=ru_RU.UTF-8
MM_CHARSET=UTF-8

Re: full text search, utf8

From

alexander lunyov

Date:

04 June 2009, 11:37:34

Nikolay Samokhvalov wrote:
> Are you familiar with http://blog.lexa.ru/2008/03/03/freebsd_utf8_russian_collate_vtoraja_popitka.html ?

I don't think this is related in any way, but I tried it anyway; the result
is the same. By the way, the test.pl attached there showed no difference
between the original and the modified LC_COLLATE.

> Sincerely yours,
> Nikolay
>
>
> 2009/6/3 alexander lunyov <lan@zato.ru>
>
>     Hello.
>
>     I have FreeBSD 6.2 and postgresql-8.3.1.
>
>     In the environment:
>
>     % env | grep UTF
>     LANG=ru_RU.UTF-8
>     MM_CHARSET=UTF-8
>
>

--
Best regards,
Александр Лунев
ОАО РТК