Re: Remaining dependency on setlocale() - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: Remaining dependency on setlocale()
Msg-id dd0cdd1f-e786-426e-b336-1ffa9b2f1fc6@eisentraut.org
In response to Re: Remaining dependency on setlocale()  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 12.12.25 21:11, Jeff Davis wrote:
>> case '\xc7':        /* C with cedilla */
>>
>> so the premise that "fuzzystrmatch is designed for ASCII" does not
>> appear to be correct.  Needs more analysis.
>>
>> (But apparently it's not multibyte aware at all, so I don't know what
>> to
>> do about that.)
> I didn't notice that, thank you. Agreed, we need a bit more discussion
> around this case as well as soundex().

Soundex is an ASCII-only algorithm.  There is no expectation that the
algorithm does anything useful with non-ASCII characters, and it doesn't
do so now.  So I think using pg_ascii_toupper() is ok.  (Users could, for
example, use unaccent to preprocess text.)

One might wonder if the presence of non-ASCII characters should be an
error, but that doesn't have to be the subject of this thread.  I
noticed that the Wikipedia page for Soundex even calls out PostgreSQL
for doing things slightly differently than everyone else, but I haven't
studied the details.

For Metaphone, I found the reference implementation linked from its 
Wikipedia page, and it looks like our implementation is pretty closely 
aligned to that.  That reference implementation also contains the 
C-with-cedilla case explicitly.  The correct fix here would probably be 
to change the implementation to work on wide characters.  But I think
for the moment you could try a shortcut: use pg_ascii_toupper(), but
if the encoding is LATIN1 (or LATIN9 or whichever other encodings also
contain C-with-cedilla at that code point), then explicitly uppercase
that one character as well.  This would preserve the existing behavior.

Note that the documentation calls out: "At present, the soundex, 
metaphone, dmetaphone, and dmetaphone_alt functions do not work well 
with multibyte encodings (such as UTF-8)."



