Thread: full-text indexing, locales, triggers, SPI & more fun

full-text indexing, locales, triggers, SPI & more fun

From
Charlie Hornberger
Date:
I've been doing some poking at the full-text indexing code in
/contrib/fulltextindex to try to get it to work with non-ASCII locales 
(among other things), but I'm having a bit of trouble trying to figure 
out how to properly parse non-ASCII strings from inside the fti() 
trigger function (which is written in C).

My problem is this:

I want to aggregate text in multiple languages in a single full-text index
much like the current structure used by the current fti() function. In order
to correctly parse the strings, however, I've got to know what locale
they're written in/for (otherwise, isalpha() thinks that characters such as
the Hungarian letter u" -- that's a 'u' with a double acute accent -- aren't
very alphabetic.)

My initial thinking (which could certainly be very wrong) is that the
easiest way to get around this would be to allow client apps to set their
LC_ALL environment variables, and then to have the new fti() function use
that locale while doing string manipulation.

But the way I'm doing things, it doesn't appear that the LC_ALL environment
variable is available.  (Maybe it was never meant to be ... but I'm not a 
very skilled C programmer, and I don't know the first thing about the SPI 
interface, so please forgive me if I'm asking why the sun doesn't rise in 
the west more often ;-)).

Here's what's happening:
bash# LC_ALL=hu_HU
bash# export LC_ALL
bash# psql test
Welcome to psql, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

test=# INSERT INTO ttxt (t1) values ('FELELŐSSÉGŰ');
INSERT 513377 1
test=# select * from ttxt_fti;
 string |   id
--------+--------
 felel  | 513377
 ss     | 513377
(2 rows)

Which isn't quite what I'm looking for ;-).
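
A minimal sketch of why the word gets split this way, assuming the text arrives as single-byte ISO-8859-2 and the backend is running with LC_CTYPE set to "C". This is not the contrib fti() code; count_alpha_runs() is a hypothetical helper that only counts the runs of bytes isalpha() accepts under the current locale:

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of contrib/fulltextindex): count maximal
 * runs of bytes for which isalpha() is true under the current LC_CTYPE.
 */
size_t count_alpha_runs(const char *s)
{
    size_t runs = 0;
    int in_run = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: a negative char passed to isalpha() is undefined. */
        if (isalpha((unsigned char) *s)) {
            if (!in_run)
                runs++;
            in_run = 1;
        } else {
            in_run = 0;
        }
    }
    return runs;
}
```

With LC_CTYPE set to "C", count_alpha_runs("FELEL\xD5SS\xC9G\xDB") sees three runs (FELEL, SS, G), because the accented bytes are not alphabetic in that locale. That matches the "felel" and "ss" rows above; the one-letter run is presumably discarded as too short.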

Inside the C source of fti(), I added a call to getenv("LC_ALL") to make
sure that LC_ALL really isn't set:
       locale = getenv("LC_ALL");
       elog(NOTICE,"Locale is '%s'\n",locale);

And sure enough, it outputs:
NOTICE:  Locale is '(null)'

If, on the other hand, I do:
setlocale(LC_ALL, "hu_HU")

inside fti(), everything works out perfectly:
test=# INSERT INTO ttxt (t1) values ('FELELŐSSÉGŰ');
INSERT 513410 1
test=# select * from ttxt_fti;
   string    |   id
-------------+--------
 felelősségű | 513410
(1 row)


Any ideas?

Cheers,
Charlie

P.S. I only subscribe to the hackers digest, so please CC me with your 
replies... Thanks!



Re: full-text indexing, locales, triggers, SPI & more fun

From
Karel Zak
Date:
On Wed, 31 May 2000, Charlie Hornberger wrote:

> I want to aggregate text in multiple languages in a single full-text index
> much like the current structure used by the current fti() function. In order
> to correctly parse the strings, however, I've got to know what locale
> they're written in/for (otherwise, isalpha() thinks that characters such as
> the Hungarian letter u" -- that's a 'u' with a double acute accent -- aren't
> very alphabetic.)
> 
> My initial thinking (which could certainly be very wrong) is that the
> easiest way to get around this would be to allow client apps to set their
> LC_ALL environment variables, and then to have the new fti() function use
> that locale while doing string manipulation.
> 
> But the way I'm doing things, it doesn't appear that the LC_ALL environment
> variable is available.  (Maybe it was never meant to be ... but I'm not a 
> very skilled C programmer, and I don't know the first thing about the SPI 
> interface, so please forgive me if I'm asking why the sun doesn't rise in 
> the west more often ;-)).

If you compile PostgreSQL with locale support, main() sets the following
locale categories:

#ifdef USE_LOCALE
       setlocale(LC_CTYPE, "");
       setlocale(LC_COLLATE, "");
       setlocale(LC_MONETARY, "");
#endif

If you need the ctype.h functions in your routines, one solution is to set
the LANG environment variable before starting the postmaster:

# LANG=Czech        (for me)
# start_postmaster

It works very well, and you do not need any other setting.

IMHO calling setlocale(LC_ALL, ...) is very hard on the backend because,
for example, all float data will break.
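
One plausible reading of the float warning: LC_ALL includes LC_NUMERIC, which controls the decimal separator that printf() and strtod() use, and in a locale such as hu_HU that separator is a comma. The sketch below is only an illustration under that assumption. The helper names are hypothetical, and it switches LC_CTYPE alone so that number formatting stays as it was:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format 1.5 under the current locale and report
 * whether the result is still the "C"-style text "1.5".
 */
int formats_with_dot(void)
{
    char buf[32];

    snprintf(buf, sizeof buf, "%.1f", 1.5);
    return strcmp(buf, "1.5") == 0;
}

/*
 * Hypothetical helper: switch only the character-classification category.
 * Returns 1 on success, 0 if the named locale is not available.
 * LC_NUMERIC is not touched either way.
 */
int set_ctype_only(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}
```

If the process starts in the "C" locale, formats_with_dot() should keep returning 1 after set_ctype_only("hu_HU"), whether or not that locale is installed, since only LC_CTYPE is changed.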

If you (still :-) need all the locale categories, see pg_locale.c in pg's
sources in utils/adt, and the usage of its routines in formatting.c
(to_char()), which uses the full locale for numbers.

But don't forget: you must always restore everything to the state it was in
before you set LC_ALL.

                  Karel
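
The restore advice could be sketched as a save/switch/restore pair. This is an assumption about how one might structure it, not the code in pg_locale.c; ctype_locale_push() and ctype_locale_pop() are hypothetical names, and the sketch touches LC_CTYPE only:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: remember the current LC_CTYPE setting and switch to
 * `name`. Returns a malloc'd copy of the previous locale name, or NULL if
 * the switch failed (the locale is then left unchanged).
 */
char *ctype_locale_push(const char *name)
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved;
    size_t len;

    if (current == NULL)
        return NULL;

    /* Copy the name: later setlocale() calls may overwrite the returned string. */
    len = strlen(current) + 1;
    saved = malloc(len);
    if (saved == NULL)
        return NULL;
    memcpy(saved, current, len);

    if (setlocale(LC_CTYPE, name) == NULL) {
        free(saved);
        return NULL;
    }
    return saved;
}

/* Hypothetical helper: restore the setting returned by ctype_locale_push(). */
void ctype_locale_pop(char *saved)
{
    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
}
```

A trigger function would call ctype_locale_push("hu_HU") before parsing and ctype_locale_pop() on every exit path afterwards, so the rest of the backend never sees the temporary setting.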