Thread: Extending range of to_tsvector et al

Extending range of to_tsvector et al

From
johnkn63
Date:
When using to_tsvector, a number of newer Unicode characters and PUA
(Private Use Area) characters are not included. How do I add the characters
that I want to be found?

Regards
John






Re: Extending range of to_tsvector et al

From
Dan Scott
Date:
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:
> When using to_tsvector, a number of newer Unicode characters and PUA
> (Private Use Area) characters are not included. How do I add the characters
> that I want to be found?

I've just started digging into this code a bit, but from what I've
found, src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode it includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
:)

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?
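
For instance, a minimal test of the sort I mean (the strings here are just
placeholders for whichever characters aren't being found):

-- The 'simple' configuration takes dictionaries out of the picture, so any
-- difference comes from the parser itself.
SELECT to_tsvector('simple', 'sampleword');

-- If the parser drops the characters, the tsquery ends up empty and the
-- match fails even though the text contains the word.
SELECT to_tsvector('simple', 'sampleword')
       @@ to_tsquery('simple', 'sampleword');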

-- 
Dan Scott
Laurentian University



Re: Extending range of to_tsvector et al

From
john knightley
Date:
Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a UTF-8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz" Zhuang word written using English letters and show up
under ts_vector ok
line 2 "我们" uses everyday Chinese word and show up under ts_vector ok
line 3 "𦘭𥎵" Zhuang word written using rather old Chinese charcters
found in Unicode 3.1 which came in about the year 2000  and show up
under ts_vector ok
line 4 "𪽖𫖂" Zhuang word written using rather old Chinese charcters
found in Unicode 5.2 which came in about the year 2009 but do not show
up under ts_vector ok
line 5 "󶒘󴮬" Zhuang word written using rather old Chinese charcters
found in PUA area of the font Sawndip.ttf but do not show up under
ts_vector ok (Font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, are not
accepted by to_tsvector.
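
A minimal way to reproduce the check (sketched here with the built-in
'simple' configuration, just to keep dictionaries out of the picture):

-- lines 1-3 show up, as described above:
SELECT to_tsvector('simple', 'raeuz');
SELECT to_tsvector('simple', '我们');
SELECT to_tsvector('simple', '𦘭𥎵');
-- lines 4 and 5 come back as an empty tsvector:
SELECT to_tsvector('simple', '𪽖𫖂');
SELECT to_tsvector('simple', '󶒘󴮬');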

Regards
John

On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott <denials@gmail.com> wrote:
> On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:
>> When using to_tsvector, a number of newer Unicode characters and PUA
>> (Private Use Area) characters are not included. How do I add the characters
>> that I want to be found?
>
> I've just started digging into this code a bit, but from what I've
> found, src/backend/tsearch/wparser_def.c defines much of the parser
> functionality, and in the area of Unicode it includes a number of
> comments like:
>
> * with multibyte encoding and C-locale isw* function may fail or give
> wrong result.
> * multibyte encoding and C-locale often are used for Asian languages.
> * any non-ascii symbol with multibyte encoding with C-locale is an
> alpha character
>
> ... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
> WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
> :)
>
> Also note that src/test/regress/sql/tsearch.sql and
> regress/sql/tsdicts.sql currently focus on English, ASCII-only data.
>
> Perhaps this is a good opportunity for you to describe what your
> environment looks like (OS, PostgreSQL version, encoding and locale
> settings for the database) and show some sample to_tsquery() @@
> to_tsvector() queries that don't behave the way you think they should
> behave - and we could start building some test cases as a first step?
>
> --
> Dan Scott
> Laurentian University



Re: Extending range of to_tsvector et al

From
Dan Scott
Date:
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
<john.knightley@gmail.com> wrote:
> Dear Dan,
>
> thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
> a UTF-8 locale.
>
> A short five-line dictionary file is sufficient to test:
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
> 󶒘󴮬
>
> line 1 "raeuz" is a Zhuang word written using English letters and shows
> up under to_tsvector OK
> line 2 "我们" is an everyday Chinese word and shows up under to_tsvector OK
> line 3 "𦘭𥎵" is a Zhuang word written using rather old Chinese characters
> found in Unicode 3.1, which came in about the year 2000, and shows up
> under to_tsvector OK
> line 4 "𪽖𫖂" is a Zhuang word written using rather old Chinese characters
> found in Unicode 5.2, which came in about the year 2009, but does not
> show up under to_tsvector
> line 5 "󶒘󴮬" is a Zhuang word written using rather old Chinese characters
> found in the PUA of the font Sawndip.ttf, but does not show up under
> to_tsvector (the font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words, even though included in a dictionary, are not
> accepted by to_tsvector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
work using the default text search configuration (albeit with one
crucial note: I created the database with the "lc_ctype=C
lc_collate=C" options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
                            ts_debug
----------------------------------------------------------------
 (word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
            ts_debug
---------------------------------
 (blank,"Space symbols",󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
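
To double-check which encoding and locale are actually in effect in each
database while comparing, a catalog query like this helps:

-- Encoding, collation and ctype are per-database settings:
SELECT datname, pg_encoding_to_char(encoding) AS encoding,
       datcollate, datctype
  FROM pg_database
 WHERE datname IN ('foobar', 'foobaz');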



Re: Extending range of to_tsvector et al

From
Tom Lane
Date:
john knightley <john.knightley@gmail.com> writes:
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
> a UTF-8 locale.

> A short five-line dictionary file is sufficient to test:

> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
> 󶒘󴮬

> line 1 "raeuz" is a Zhuang word written using English letters and shows
> up under to_tsvector OK
> line 2 "我们" is an everyday Chinese word and shows up under to_tsvector OK
> line 3 "𦘭𥎵" is a Zhuang word written using rather old Chinese characters
> found in Unicode 3.1, which came in about the year 2000, and shows up
> under to_tsvector OK
> line 4 "𪽖𫖂" is a Zhuang word written using rather old Chinese characters
> found in Unicode 5.2, which came in about the year 2009, but does not
> show up under to_tsvector
> line 5 "󶒘󴮬" is a Zhuang word written using rather old Chinese characters
> found in the PUA of the font Sawndip.ttf, but does not show up under
> to_tsvector (the font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say,
𦘭 from 𪽖.  I think this must be down to
your platform's locale definition: it probably thinks that the former is
a letter and the latter is not.  You'd have to gripe to the locale
maintainers to get that fixed.
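
One way to see that classification directly is to ask the parser which token
type it assigns to each character, for example:

-- With a multibyte encoding and a non-C ctype, the parser's notion of a
-- "letter" comes from the locale's isw*() classification.
SELECT p.token, t.alias, t.description
  FROM ts_parse('default', '𦘭 𪽖') AS p
  JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
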
        regards, tom lane



Re: Extending range of to_tsvector et al

From
john knightley
Date:
On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> john knightley <john.knightley@gmail.com> writes:
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
>> a UTF-8 locale.
>
>> A short five-line dictionary file is sufficient to test:
>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>> 󶒘󴮬
>
>> line 1 "raeuz" is a Zhuang word written using English letters and shows
>> up under to_tsvector OK
>> line 2 "我们" is an everyday Chinese word and shows up under to_tsvector OK
>> line 3 "𦘭𥎵" is a Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1, which came in about the year 2000, and shows up
>> under to_tsvector OK
>> line 4 "𪽖𫖂" is a Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2, which came in about the year 2009, but does not
>> show up under to_tsvector
>> line 5 "󶒘󴮬" is a Zhuang word written using rather old Chinese characters
>> found in the PUA of the font Sawndip.ttf, but does not show up under
>> to_tsvector (the font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> AFAIK there is nothing in Postgres itself that would distinguish, say,
> 𦘭 from 𪽖.  I think this must be down to
> your platform's locale definition: it probably thinks that the former is
> a letter and the latter is not.  You'd have to gripe to the locale
> maintainers to get that fixed.
>
>                         regards, tom lane

PostgreSQL in general does not usually distinguish between them, but full text search does:
select ts_debug('𦘭 from 𪽖');

gives the result:

                            ts_debug
-------------------------------------------------------------------
 (word,"Word, all letters",𦘭,{english_stem},english_stem,{𦘭})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",from,{english_stem},english_stem,{})
 (blank,"Space symbols"," 𪽖",{},,)
(4 rows)

Somewhere there is a dictionary or library that is based on roughly
Unicode 4.0, which includes "𦘭" (U+2662D) but not "𫖂" (U+2B582), which
is from Unicode 5.2.

PUA characters are also dropped in the same way by full text search,
which is what Google does but which I do not wish to do.
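
For anyone wanting to reproduce this without the Sawndip font installed, the
same code points can be generated directly (in a UTF-8 database chr() takes a
Unicode code point; 157229 and 177538 are the decimal forms of U+2662D and
U+2B582):

SELECT * FROM ts_debug(chr(157229));  -- U+2662D, recognised as a word
SELECT * FROM ts_debug(chr(177538));  -- U+2B582, dropped as a "space symbol"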

Regards
John



Re: Extending range of to_tsvector et al

From
john knightley
Date:
On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:
> Hi John:
>
> On Sun, Sep 30, 2012 at 11:45 PM, john knightley
> <john.knightley@gmail.com> wrote:
>> Dear Dan,
>>
>> thank you for your reply.
>>
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
>> a UTF-8 locale.
>>
>> A short five-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>> 󶒘󴮬
>>
>> line 1 "raeuz" is a Zhuang word written using English letters and shows
>> up under to_tsvector OK
>> line 2 "我们" is an everyday Chinese word and shows up under to_tsvector OK
>> line 3 "𦘭𥎵" is a Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1, which came in about the year 2000, and shows up
>> under to_tsvector OK
>> line 4 "𪽖𫖂" is a Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2, which came in about the year 2009, but does not
>> show up under to_tsvector
>> line 5 "󶒘󴮬" is a Zhuang word written using rather old Chinese characters
>> found in the PUA of the font Sawndip.ttf, but does not show up under
>> to_tsvector (the font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>>
>> The last two words, even though included in a dictionary, are not
>> accepted by to_tsvector.
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
> work using the default text search configuration (albeit with one
> crucial note: I created the database with the "lc_ctype=C
> lc_collate=C" options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
> foobar=# select ts_debug('󶒘󴮬');
>                             ts_debug
> ----------------------------------------------------------------
>  (word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>


>
> foobaz=# SHOW LC_CTYPE;
>   lc_ctype
> -------------
>  en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('󶒘󴮬');
>             ts_debug
> ---------------------------------
>  (blank,"Space symbols",󶒘󴮬,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround - this database needs to be in
UTF-8; the full text search is to be used for a MediaWiki. Is this a
bug that is being worked on?

Regards
John



Re: Extending range of to_tsvector et al

From
Tom Lane
Date:
john knightley <john.knightley@gmail.com> writes:
> On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:
>> So... perhaps LC_CTYPE=C is a possible workaround for you?

> LC_CTYPE=C would not be a workaround - this database needs to be in
> UTF-8; the full text search is to be used for a MediaWiki.

You're confusing locale and encoding.  They are different things.

> Is this a bug that is being worked on?

No.  As I already tried to explain to you, this behavior is not
determined by Postgres, it's determined by the platform's locale
support.  You need to complain to your OS vendor.
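
For what it's worth, encoding and ctype are chosen independently when a
database is created, so a UTF-8-encoded database with a C ctype (as in Dan's
test) is a perfectly valid combination; the database name below is only an
example:

CREATE DATABASE wikidb
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';
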
        regards, tom lane