Thread: search on accents over all possible matches

search on accents over all possible matches

From
Jaume Teixi
Date:
Hello,

I'm developing a search tool in PHP against a PostgreSQL database.
As the data is in Catalan and Spanish, it is obvious that a simple
search like:
(SELECT * FROM painters WHERE artist_name ~* 'Dali');

only matches over Dd Aa Ll Ii (and will not find Dalí),
but for a language with accents it should also match ÍíÌìÏï.

The question is:

Is this C function from Patrice Hédé the most appropriate tool for
searching in a language with accents?
http://www.postgresql.org/mhonarc/pgsql-sql/1998-06/msg00119.html

Or should I use a function that is already built into Postgres?

Best regards from Barcelona,
jaume teixi.

Re: search on accents over all possible matches

From
David Lizano
Date:
At 18.24 27/3/01 +0200, you wrote:
>Hello,
>
>I'm developing a search tool in PHP against a PostgreSQL database.
>As the data is in Catalan and Spanish, it is obvious that a simple
>search like:
>(SELECT * FROM painters WHERE artist_name ~* 'Dali');
>
>only matches over Dd Aa Ll Ii (and will not find Dalí),
>but for a language with accents it should also match ÍíÌìÏï.
>
>The question is:
>
>Is this C function from Patrice Hédé the most appropriate tool for
>searching in a language with accents?
>http://www.postgresql.org/mhonarc/pgsql-sql/1998-06/msg00119.html
>
>Or should I use a function that is already built into Postgres?
>
>Best regards from Barcelona,
>jaume teixi.

Using regular expressions in PHP, you can convert each letter into a
character class (for example, "a" into "[Aaáä]"), so that the original
SQL query:
         (SELECT * FROM painters WHERE artist_name ~* 'Dali');

becomes:
         (SELECT * FROM painters WHERE artist_name ~* 'D[Aaáä]l[Iiíï]');

that is, a complete new regular expression to send in the SQL query.

It should match Dali, Dáli, Dalí, Dálí, and other variants.
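
A minimal sketch of this substitution, written in Python rather than PHP.
The table ACCENT_VARIANTS and the function name accent_pattern are
assumptions made for illustration; the set of variants is only an example
and should be adapted to the data being searched.

```python
import re

# Hypothetical equivalence table: each base letter maps to the accented
# variants it should also match. Extend it for your own data.
ACCENT_VARIANTS = {
    "a": "áàä",
    "e": "éèë",
    "i": "íìï",
    "o": "óòö",
    "u": "úùü",
}


def accent_pattern(term: str) -> str:
    """Expand a plain search term into an accent-tolerant regex."""
    parts = []
    for ch in term:
        variants = ACCENT_VARIANTS.get(ch.lower())
        if variants:
            # List both cases explicitly so the pattern does not depend on
            # how the regex engine folds the case of non-ASCII letters.
            parts.append(
                "[" + ch.lower() + ch.upper() + variants + variants.upper() + "]"
            )
        else:
            # Escape anything that could be a regex metacharacter.
            parts.append(re.escape(ch))
    return "".join(parts)


pattern = accent_pattern("Dali")
print(pattern)
```

The resulting string would then be passed to the ~* operator, preferably
as a bound query parameter rather than pasted into the SQL text.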






Re: search on accents over all possible matches

From
Peter Eisentraut
Date:
Jaume Teixi writes:

> Is this C function from Patrice Hédé the most appropriate tool for
> searching in a language with accents?
> http://www.postgresql.org/mhonarc/pgsql-sql/1998-06/msg00119.html

Looks good to me.

> Or should I use a function that is already built into Postgres?

The reason there is no such implementation, and probably won't be any time
soon, is that this tool would either have to hard-code or ignore natural
language semantics, neither of which would make it practical.  Not all
languages have the same accent ignoring or accent folding rules or
conventions.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/


Re: search on accents -> Why not include this function

From
Jaume Teixi
Date:
On Tue, 27 Mar 2001 19:24:34 +0200 (CEST), Peter Eisentraut
<peter_e@gmx.net> wrote:

> Jaume Teixi writes:
>
> > Is this C function from Patrice Hédé the most appropriate tool for
> > searching in a language with accents?
> > http://www.postgresql.org/mhonarc/pgsql-sql/1998-06/msg00119.html
>
> Looks good to me.
>
> > Or should I use a function that is already built into Postgres?
>
> The reason there is no such implementation, and probably won't be any
> time soon, is that this tool would either have to hard-code or ignore
> natural language semantics, neither of which would make it practical.
> Not all languages have the same accent ignoring or accent folding rules
> or conventions.

This function is really fast.
The accent method is a REAL need for almost all non-English languages.
You have to call this function explicitly, like:
select accents ('dali');
             accents
----------------------------------
 [dðÐ][aáÁàÀâÂäÄåÅãÃ]l[iíÍìÌîÎïÏ]

So why not include it in the next release?

Best regards from Barcelona,

jaume teixi.


Re: search on accents -> Why not include this function

From
Peter Eisentraut
Date:
Jaume Teixi writes:

> > The reason there is no such implementation, and probably won't be any time
> > soon, is that this tool would either have to hard-code or ignore natural
> > language semantics, neither of which would make it practical.  Not all
> > languages have the same accent ignoring or accent folding rules or
> > conventions.
>
> This function is really fast.
> The accent method is a REAL need for almost all non-English languages.
> You have to call this function explicitly, like:
> select accents ('dali');
>              accents
> ----------------------------------
>  [dðÐ][aáÁàÀâÂäÄåÅãÃ]l[iíÍìÌîÎïÏ]
>
> So why not include it in the next release?

For the reason I cited above: it is too abstract an approach for many
languages and/or applications.  For example, in Swedish, a search for 'e'
should probably include 'é', since most users will not type that in
explicitly (it's not on the keyboard), but a search for 'a' should
normally not include 'å', since that is a completely separate letter (and
it is on the keyboard).  Additionally, this particular implementation
seems to be specific to the ISO-8859-1 charset.  I know a number of
accented letters that are much closer "siblings" to 'd' than 'ð' is.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/


Re: search on accents -> Why not include this function

From
Jaume Teixi
Date:
But the point is that you must call this function explicitly in order to
use it.
Also, for the sake of aesthetics, maybe it should be called
accents_iso-8859-1.
The point is that this should be considered a major need for non-English
languages.

As a more general approach, it could also be modified to accept
parameters such as ('å','à') or ('ca_ES','fr_FR')....
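
A rough Python sketch of what such a parameterised version could look
like; it is not a port of the original C function. The names
LOCALE_VARIANTS and accents_for_locale, and the contents of the tables,
are assumptions for illustration only. The Swedish entry follows the rule
raised in this thread: 'e' also matches 'é', while 'a' does not match 'å'
unless the caller asks for it.

```python
import re

# Hypothetical per-locale tables; the entries are illustrative only and
# are not taken from the original C function.
LOCALE_VARIANTS = {
    # Catalan/Spanish: accented vowels are variants of the base vowel.
    "ca_ES": {"a": "à", "e": "éè", "i": "íï", "o": "óò", "u": "úü"},
    # Swedish: 'é' folds to 'e', but 'å', 'ä', 'ö' are letters of their own.
    "sv_SE": {"e": "é"},
}


def accents_for_locale(term: str, locale: str, extra: tuple = ()) -> str:
    """Build an accent-tolerant pattern using the rules of one locale.

    `extra` holds additional (base, variant) pairs, e.g. (("a", "å"),).
    The pattern is lower-case and meant for a case-insensitive match.
    """
    # Copy the locale table so that `extra` never modifies the defaults.
    table = {base: set(chars) for base, chars in LOCALE_VARIANTS[locale].items()}
    for base, variant in extra:
        table.setdefault(base, set()).add(variant)

    parts = []
    for ch in term.lower():
        variants = table.get(ch)
        if variants:
            parts.append("[" + ch + "".join(sorted(variants)) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


print(accents_for_locale("dali", "ca_ES"))
print(accents_for_locale("dala", "sv_SE"))
print(accents_for_locale("dala", "sv_SE", extra=(("a", "å"),)))
```

Keeping the rules in data rather than in code means each site can adjust
the table for its own language without touching the function itself.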

Best regards,
jaume.


> For the reason I cited above: it is too abstract an approach for many
> languages and/or applications.  For example, in Swedish, a search for 'e'
> should probably include 'é', since most users will not type that in
> explicitly (it's not on the keyboard), but a search for 'a' should
> normally not include 'å', since that is a completely separate letter
> (and it is on the keyboard).  Additionally, this particular
> implementation seems to be specific to the ISO-8859-1 charset.  I know a
> number of accented letters that are much closer "siblings" to 'd' than
> 'ð' is.
>
> --
> Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

How to place a table on a separate partition?

From
"Rodin A. Porrata"
Date:

Hi,

I am putting together a large database and want the
table to reside on a partition separate from the
default under 'base'. How can I do this?

I tried the following: First I created the place I
wanted to put the table:

mkdir /taos/01/postgres
chown taos /taos/01/postgres

Then I modified the .profile of postgres
so that:

PGDATA2=/taos/01/postgres
export PGDATA2

then on the command line of the postgres
account I typed

initlocation $PGDATA2

which was successful.

Then I typed

initdb -D $PGDATA2 -i 1095

Now I had to kill the old postmaster and restart it. However,
I could only give it one location for the database, i.e.,

postmaster -D /taos/01/postgres
createdb -U taos -D $PGDATA2 large_table

So does this mean I have to start a separate postmaster for
the new location? How would I do that?

What we would like is a single postmaster that handles
all database queries, etc., with only the especially
large table getting its own user and its own location.

Rodin


Re: How to place a table on a separate partition?

From
Stefan Huber
Date:
>I am putting together a large database and want the
>table to reside on a partition separate from the
>default under 'base'. How can I do this?

You could create symlinks for the larger tables pointing to another location:

ln -s /path/to/table/bigtable /usr/local/pgsql/data/base/whatever/bigtable

(assuming /usr/local/pgsql is your Postgres directory)
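
The same move-and-link step can be sketched in Python. This is only an
illustration run on throwaway directories: the function name
relocate_with_symlink and all paths are hypothetical, the server is
assumed to be stopped while files are moved, and a large table may consist
of several segment files that would each need the same treatment. Whether
the link survives operations that recreate the table's file is something
to verify for your version.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(table_file: str, new_dir: str) -> str:
    """Move table_file into new_dir and leave a symlink at the old path."""
    os.makedirs(new_dir, exist_ok=True)
    target = os.path.join(new_dir, os.path.basename(table_file))
    shutil.move(table_file, target)  # copies across filesystems if needed
    os.symlink(target, table_file)   # same argument order as `ln -s`
    return target


# Demonstration on temporary directories, not on a real data directory.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data", "base", "whatever")
    big_disk = os.path.join(root, "bigdisk")
    os.makedirs(data_dir)
    table_path = os.path.join(data_dir, "bigtable")
    with open(table_path, "wb") as f:
        f.write(b"table contents")

    moved_to = relocate_with_symlink(table_path, big_disk)
    is_link = os.path.islink(table_path)
    with open(table_path, "rb") as f:
        readable_via_link = f.read() == b"table contents"

print(is_link, readable_via_link)
```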

If serious trouble is to be expected, I'd like to know, because we have used this method once (on a not-so-important
DB, without any problems so far).

Stefan
--
Atheism is a non-prophet organization.


Re: search on accents -> Why not include this function

From
Patrice Hédé
Date:
Hi,

First, thank you for including me in this thread: I haven't
been involved with PostgreSQL for 3 years now, and it's nice to see
that this hack is still useful to some people! (I should, however,
soon get involved with databases again :) ).

About this programme, I agree with Peter that it is too biased to be
included as a standard function. It is biased towards ISO-8859-1, and
towards some European languages I know ("d" or "dh" => "ð" is for
Icelandic, for example)... although "a" => "å" makes sense: not all
people working with Swedish/Norwegian/Danish have a Scandinavian
keyboard, and they may not be sure whether the programme will do the
"aa" => "å" translation correctly (which this function does ;) ).

Back to the subject, though. This function also has another
limitation, namely, it has a fixed length buffer of 4096 bytes, and
that's not so nice (but it takes care of buffer overflows...).

Maybe, if it's not already the case, the source code could be put in a
contribution directory, available for anyone to adapt to his/her
needs without having to go through 3 years of archives, since it seems
to be a fairly common problem. The code should be simple enough for
anyone with a basic knowledge of C to customise :)

I know that localisation, collation, and "acceptable alternatives"
follow quite different rules from country to country, making it
difficult to come up with a general solution. This is why I didn't even
try to make one ;)

Patrice

* Jaume Teixi <teixi@6tems.com> [010329 22:04]:
> But the point is that you must call this function explicitly in order
> to use it.
> Also, for the sake of aesthetics, maybe it should be called
> accents_iso-8859-1. The point is that this should be considered a
> major need for non-English languages.
>
> As a more general approach, it could also be modified to accept
> parameters such as ('å','à') or ('ca_ES','fr_FR')....
>
> Best regards,
> jaume.
>
>
> > For the reason I cited above: it is too abstract an approach for
> > many languages and/or applications.  For example, in Swedish, a
> > search for 'e' should probably include 'é', since most users will
> > not type that in explicitly (it's not on the keyboard), but a
> > search for 'a' should normally not include 'å', since that is a
> > completely separate letter (and it is on the keyboard).
> > Additionally, this particular implementation seems to be
> > specific to the ISO-8859-1 charset.  I know a number of accented
> > letters that are much closer "siblings" to 'd' than 'ð' is.
> >
> > --
> > Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
>

--
Patrice HÉDÉ --------------------------------- patrice@islande.org -----
  --  Isn't it weird  how scientists  can imagine  all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
  -- What would _you_ call the creation of the universe ?
  -- "The HORRENDOUS SPACE KABLOOIE !"               - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----