Re: BUG #13440: unaccent does not remove all diacritics - Mailing list pgsql-bugs

From Alvaro Herrera
Subject Re: BUG #13440: unaccent does not remove all diacritics
Date
Msg-id 20150618211722.GJ133018@postgresql.org
In response to Re: BUG #13440: unaccent does not remove all diacritics  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #13440: unaccent does not remove all diacritics  (Emre Hasegeli <emre@hasegeli.com>)
Re: BUG #13440: unaccent does not remove all diacritics  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-bugs
Tom Lane wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'm really dubious that we should be translating those ligatures at
> >> all (since the standard file is only advertised to do "unaccenting"),
> >> and if we do translate them, shouldn't they convert to AE, ae, etc?
>
> > Perhaps these conversions are intended only for comparisons, full text
> > indexing etc but not showing the converted text to a user, in which
> > case it doesn't matter too much if the conversions are a bit weird
> > (œuf and oeuf are interchangeable in French, but euf is nonsense).
> > But can we actually change them?  That could cause difficulty for
> > users with existing unaccented data stored/indexed...  But I suppose
> > even adding new mappings could cause problems.
>
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched.  Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.

To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ASCII); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches.  If we say "okay, but that funny
char is not an accent so let's leave it alone" then the charter doesn't
sound so useful to me.
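
As a rough illustration of that point, here is a minimal Python sketch (not
the unaccent module's actual implementation). It assumes a naive approach
that decomposes characters with Unicode NFKD and drops combining marks, and
it shows that ligatures and ß pass through such an approach untouched, so
they need explicit rules. EXTRA_RULES, strip_marks and to_basic are
hypothetical names, and the mappings are illustrative choices, not the
contents of unaccent.rules.

```python
import unicodedata

# Hypothetical explicit rules for characters that have no Unicode
# decomposition. These mappings are illustrative assumptions only.
EXTRA_RULES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE", "ß": "ss"}


def strip_marks(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_basic(text: str) -> str:
    """Apply the explicit rules first, then strip combining marks."""
    replaced = "".join(EXTRA_RULES.get(ch, ch) for ch in text)
    return strip_marks(replaced)


if __name__ == "__main__":
    for word in ["año", "œuf", "Straße"]:
        print(word, strip_marks(word), to_basic(word))
```

Stripping marks alone handles the Spanish case, but "œuf" and "Straße" only
reach a plain alphabet once the explicit rules are applied.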

The cases I care about are okay anyway, because all the funny chars in
Spanish are already covered; and maybe German people always enter their
queries using the funny ss thing I can't even write, and then this is
not a problem for them.


Regarding back-patching unaccent.rules changes as discussed downthread,
I think it's okay to simply document that any indexes using the module
should be reindexed immediately after upgrading to that minor version.
The consequence of not doing so is not *that* serious anyway.  But then,
since I'm not actually affected in any way, I'm not strongly holding
this position either.
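
To make the reindexing concern concrete, here is a small Python sketch of
what could happen when a rule changes between minor versions. It is a toy
model under stated assumptions, not PostgreSQL behaviour: the "index" is a
plain dict keyed by the normalized word, and OLD_RULES and NEW_RULES are
hypothetical rule sets chosen to match the œuf/euf/oeuf example upthread.

```python
# Hypothetical rule sets: the old one collapses the ligature to one letter,
# the new one expands it to two. Neither is copied from unaccent.rules.
OLD_RULES = {"œ": "e"}
NEW_RULES = {"œ": "oe"}


def normalize(text, rules):
    """Replace each character that has a rule; leave the rest alone."""
    return "".join(rules.get(ch, ch) for ch in text)


def build_index(words, rules):
    """Toy index: normalized key -> original word."""
    return {normalize(word, rules): word for word in words}


def search(index, query, rules):
    """Normalize the query with the given rules and look it up."""
    return index.get(normalize(query, rules))


words = ["œuf"]
stale_index = build_index(words, OLD_RULES)  # built before the upgrade
fresh_index = build_index(words, NEW_RULES)  # rebuilt after the upgrade

if __name__ == "__main__":
    # Queries are normalized with the new rules after the upgrade.
    print(search(stale_index, "œuf", NEW_RULES))  # stale keys no longer match
    print(search(fresh_index, "œuf", NEW_RULES))  # match again after rebuild
```

In this model the stale index simply fails to find the row until it is
rebuilt, which is the kind of consequence a documentation note about
reindexing would be warning about.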

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
