Thread: pg_trgm: unicode string not working

pg_trgm: unicode string not working

From
Sushant Sinha
Date:
I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for Unicode strings. The
database was initialized with UTF8 encoding and the C locale.

Here is the table:
\d words
    Table "public.words"
Column |  Type   | Modifiers
--------+---------+-----------
word   | text    |
ndoc   | integer |
nentry | integer |
Indexes:
   "words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR:  GIN indexes do not support whole-index scans


Any idea what is wrong?

-Sushant.



Re: pg_trgm: unicode string not working

From
Florian Pflug
Date:
Hi

Next time, please post questions regarding the usage of postgres
to the -general list, not to -hackers. The purpose of -hackers is
to discuss the development of postgres proper, not the development
of applications using postgres.

On Jun12, 2011, at 13:33 , Sushant Sinha wrote:
> I am using pg_trgm for spelling correction as prescribed in the
> documentation. But I see that it does not work for Unicode strings. The
> database was initialized with UTF8 encoding and the C locale.

I think you need to use a locale (more precisely, a CTYPE) in which
'क', 'त', 'द' are considered to be alphanumeric.

You can specify the CTYPE when creating the database with CREATE DATABASE ... LC_CTYPE = ...

> Here is the table:
> \d words
>     Table "public.words"
> Column |  Type   | Modifiers
> --------+---------+-----------
> word   | text    |
> ndoc   | integer |
> nentry | integer |
> Indexes:
>    "words_idx" gin (word gin_trgm_ops)
>
> Query: select word from words where word % 'कतद';
>
> I get an error:
>
> ERROR:  GIN indexes do not support whole-index scans


pg_trgm probably ignores non-alphanumeric characters during
comparison, so you end up with an empty search string, which
translates to a whole-index scan. Postgres up to 9.0 does
not support such scans for GIN indices.
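
To make the suspected mechanism concrete, here is a minimal Python sketch of
trigram extraction. It is a simplified model, not pg_trgm's actual C code. It
assumes that non-alphanumeric characters act as word separators, that each
word is padded with two leading spaces and one trailing space, and that the
C locale classifies only ASCII letters and digits as alphanumeric. The names
trigrams, c_locale_alnum and unicode_alnum are hypothetical.

```python
def trigrams(text, is_word_char):
    """Return the set of pg_trgm-style trigrams for text (simplified model).

    Characters rejected by is_word_char are treated as word separators.
    Each word is padded with two leading spaces and one trailing space.
    """
    words, current = [], []
    for ch in text.lower():
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    result = set()
    for word in words:
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def c_locale_alnum(ch):
    # Assumption: under the C locale only ASCII letters and digits qualify.
    return ch.isascii() and ch.isalnum()


def unicode_alnum(ch):
    # Stand-in for a CTYPE that treats Devanagari letters as alphanumeric.
    return ch.isalnum()


print(trigrams("कतद", c_locale_alnum))        # set()
print(len(trigrams("कतद", unicode_alnum)))    # 4
```

Under the C-locale stand-in every character of the search string is dropped,
so no trigrams remain and the index has nothing to look up. That would match
the whole-index scan described above.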

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

best regards.
Florian Pflug



Re: pg_trgm: unicode string not working

From
Robert Haas
Date:
On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:
> Note that this restriction was removed in postgres 9.1 which
> is currently in beta. However, GIN indices must be re-created
> with REINDEX after upgrading from 9.0 to leverage that
> improvement.

Does pg_upgrade know about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_trgm: unicode string not working

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:
> > Note that this restriction was removed in postgres 9.1 which
> > is currently in beta. However, GIN indices must be re-created
> > with REINDEX after upgrading from 9.0 to leverage that
> > improvement.
> 
> Does pg_upgrade know about this?

No, it does not.  Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: pg_trgm: unicode string not working

From
Robert Haas
Date:
On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:
>> > Note that this restriction was removed in postgres 9.1 which
>> > is currently in beta. However, GIN indices must be re-created
>> > with REINDEX after upgrading from 9.0 to leverage that
>> > improvement.
>>
>> Does pg_upgrade know about this?
>
> No, it does not.  Under what circumstances should I issue a suggestion
> to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_trgm: unicode string not working

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Robert Haas wrote:
> >> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:
> >> > Note that this restriction was removed in postgres 9.1 which
> >> > is currently in beta. However, GIN indices must be re-created
> >> > with REINDEX after upgrading from 9.0 to leverage that
> >> > improvement.
> >>
> >> Does pg_upgrade know about this?
> >
> > No, it does not.  Under what circumstances should I issue a suggestion
> > to reindex, and what should the text be?
> 
> It sounds like GIN indexes need to be reindexed after upgrading from <
> 9.1 to >= 9.1.

I already have some GIN tests I used for 8.3 to 8.4 so that is easy, but
is the reindex required or just suggested for features?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: pg_trgm: unicode string not working

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> No, it does not.  Under what circumstances should I issue a suggestion
>> to reindex, and what should the text be?

> It sounds like GIN indexes need to be reindexed after upgrading from <
> 9.1 to >= 9.1.

Only if you care whether they work for corner cases such as empty
arrays ... corner cases which didn't work before 9.1, so very likely
you don't care.

I'm not sure that pg_upgrade is a good vehicle for dispensing such
advice, anyway.  At least in the Red Hat packaging, end users will never
read what it prints, unless maybe it fails outright and they're trying
to debug why.
        regards, tom lane


Re: pg_trgm: unicode string not working

From
Florian Pflug
Date:
On Jun14, 2011, at 07:15 , Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> No, it does not.  Under what circumstances should I issue a suggestion
>>> to reindex, and what should the text be?
> 
>> It sounds like GIN indexes need to be reindexed after upgrading from <
>> 9.1 to >= 9.1.
> 
> Only if you care whether they work for corner cases such as empty
> arrays ... corner cases which didn't work before 9.1, so very likely
> you don't care.

We also already say "To fix this, do REINDEX INDEX ... " in the errhint
of "old GIN indexes do not support whole-index scans nor searches for nulls".

best regards,
Florian Pflug



Re: pg_trgm: unicode string not working

From
Robert Haas
Date:
On Tue, Jun 14, 2011 at 1:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure that pg_upgrade is a good vehicle for dispensing such
> advice, anyway.  At least in the Red Hat packaging, end users will never
> read what it prints, unless maybe it fails outright and they're trying
> to debug why.

In my experience to date, that happens 100% of the time.  :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company