Thread: TSearch2: Auto identify document language?

TSearch2: Auto identify document language?

From

Hannes Dorbath

Date:

11 December 2005, 11:17:46

Is there a practical way to guess what language a document is
written in and automatically use the appropriate TSearch2 config? I
thought of looking up the document's words in various dictionaries and
using the one with the most matches. It doesn't matter if performance
is bad.

Any ideas? :)

Thanks

--
Regards,
Hannes Dorbath
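
The dictionary-matching idea described in the message above can be sketched roughly as follows. This is only an illustration, not part of TSearch2: the function name `guess_language` and the tiny `WORD_LISTS` table are assumptions made up for the example, and a real setup would load full per-language word lists instead.

```python
import re

# Tiny illustrative word lists (an assumption for this sketch); a real setup
# would load full per-language dictionaries or stop-word files instead.
WORD_LISTS = {
    "en": {"the", "and", "over", "is", "of", "a"},
    "de": {"der", "die", "das", "und", "über", "den", "ist"},
    "fr": {"le", "la", "les", "et", "par", "est", "a"},
}


def guess_language(text, word_lists=WORD_LISTS):
    """Return the language whose word list matches the most words, or None."""
    # Split the lowercased text into words; \w+ is Unicode-aware for str input.
    words = re.findall(r"\w+", text.lower())
    # Count, per language, how many words of the text appear in its word list.
    scores = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in word_lists.items()
    }
    best = max(scores, key=scores.get)
    # If nothing matched at all, report that no guess is possible.
    return best if scores[best] > 0 else None


print(guess_language("The quick brown fox jumped over the lazy dog."))
print(guess_language("Der schnelle braune Fuchs sprang über den faulen Hund."))
```

Counting every occurrence (rather than distinct words) favours languages whose common function words appear often, which tends to help on longer documents; ties are resolved in favour of whichever language is listed first.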

Re: TSearch2: Auto identify document language?

From

Michael Fuhr

Date:

11 December 2005, 16:04:50

On Sun, Dec 11, 2005 at 01:17:42PM +0100, Hannes Dorbath wrote:
> Is there a practical way to guess what language a document is
> written in and automatically use the appropriate TSearch2 config? I
> thought of looking up the document's words in various dictionaries and
> using the one with the most matches. It doesn't matter if performance
> is bad.

I don't know how easily you could incorporate this into tsearch2,
but for the general problem of language identification you could
try something like Perl's Lingua::Identify module.

http://search.cpan.org/dist/Lingua-Identify/lib/Lingua/Identify.pm

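-- Needs the untrusted PL/Perl language (plperlu) and the Lingua::Identify CPAN module.
CREATE FUNCTION langof(text) RETURNS text AS $$
# In scalar context, langof() returns the code of the most probable language.
use Lingua::Identify qw(:language_identification);
return scalar(langof($_[0]));
$$ LANGUAGE plperlu IMMUTABLE STRICT;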

SELECT langof('The quick brown fox jumped over the lazy dog.');
 langof
--------
 en
(1 row)

SELECT langof('Der schnelle braune Fuchs sprang über den faulen Hund.');
 langof
--------
 de
(1 row)

SELECT langof('El zorro marrón rápido saltó sobre el perro perezoso.');
 langof
--------
 es
(1 row)

SELECT langof('La volpe marrone rapida ha saltato sopra il cane pigro.');
 langof
--------
 it
(1 row)

SELECT langof('Le renard brun rapide a sauté par-dessus le chien paresseux.');
 langof
--------
 fi
(1 row)

Language identification isn't always accurate -- in this example
the function thinks the last text is Finnish instead of French --
but it might get better with more text to examine, and you can tell
Lingua::Identify which languages to consider or ignore.
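
Because a guess can be wrong, as with the French sentence flagged as Finnish, one defensive option is to accept only languages for which a search configuration exists and fall back to a default otherwise. The sketch below assumes hypothetical configuration names (`config_english` and so on) and a hypothetical helper `pick_config`; none of these come from tsearch2 or Lingua::Identify.

```python
# Hypothetical mapping from language codes to search configuration names;
# the names on the right are placeholders, not configurations that ship anywhere.
CONFIG_BY_LANGUAGE = {
    "en": "config_english",
    "de": "config_german",
    "fr": "config_french",
}
FALLBACK_CONFIG = "config_simple"


def pick_config(language_code, allowed=CONFIG_BY_LANGUAGE, fallback=FALLBACK_CONFIG):
    """Return the configuration for a detected language, or the fallback."""
    # Any code outside the supported set (including None) gets the fallback.
    return allowed.get(language_code, fallback)


print(pick_config("de"))  # a supported language
print(pick_config("fi"))  # an unsupported guess falls back to the default
```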

--
Michael Fuhr

Re: TSearch2: Auto identify document language?

From

Mark Mitchenall

Date:

12 December 2005, 14:04:04

On 11/12/05, Hannes Dorbath <light@theendofthetunnel.de> wrote:
  > Is there a practical way to guess what language a document is
  > written in and automatically use the appropriate TSearch2 config? I
  > thought of looking up the document's words in various dictionaries and
  > using the one with the most matches. It doesn't matter if performance
  > is bad.

Is it possible to use something like TextCat from a PL/Perl function?

http://odur.let.rug.nl/~vannoord/TextCat/

Best,
Mark
--
Mark Mitchenall, Standingwave Ltd
