Re: Character Conversions Handling - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Re: Character Conversions Handling
Date
Msg-id 20051018213924.GC13902@svana.org
In response to Character Conversions Handling  (Volkan YAZICI <volkan.yazici@gmail.com>)
List pgsql-hackers
On Tue, Oct 18, 2005 at 10:29:30PM +0300, Volkan YAZICI wrote:
> Hi,
>
> I'm trying to understand the scheme behind
> backend/utils/adt/like.c for downcasing letters [1]. When I look at the
> other tolower() implementations, there are lots of them spread around
> (in interfaces/libpq, backend/regex, backend/utils/adt/like, etc.).
> For example, despite there being a pg_wc_tolower() function in
> regc_locale.c, iwchareq() in like.c achieves the same thing manually.
>
> I'd appreciate it if somebody could point me to the places where I
> should start looking to understand the character handling with
> different encodings. Also, I wonder why we didn't use any
> btowc/mbsrtowcs/wctomb-like functions. Is this for portability with
> other compilers?

PostgreSQL has to be compatible across many platforms, including those
that don't have any multibyte support, and there are a few of those.
Just as PostgreSQL includes a complete copy of the timezone library,
various bits usually handled by system libraries have been
incorporated into the backend. This includes encoding support.

> [1] iwchareq() uses pg_mb2wchar_with_len(), which picks the right
> mb2wchar function from pg_wchar_table. When I look at
> backend/mb/wchar.c, there are other encoding-specific mblen and
> mb2wchar routines. For example, EUC_KR is handled with the
> pg_euc2wchar_with_len() function, but LATIN5 is handled with the
> pg_latin12wchar_with_len() function. Will we write a new function for
> latin5, like pg_latin52wchar_with_len(), if we encounter a new
> problem with latin5?

In this particular case it's not an issue: the Latin-N encodings are
all single-byte encodings, so they don't have to be handled
separately. But yes, this means that PostgreSQL's behaviour may vary
from that of the surrounding system.
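
To illustrate why one routine can serve every Latin-N encoding, here is
a minimal sketch of a table-driven dispatch. It is not the PostgreSQL
source: every name in it (sketch_wchar, single_byte_to_wchar,
two_byte_to_wchar, sketch_table, sketch_mb2wchar and the SK_* constants)
is made up for illustration, and the two-byte routine is a deliberately
simplified stand-in for a real multibyte encoding, not actual EUC_KR.

```c
/* Hypothetical names throughout; this is NOT the PostgreSQL source. */
typedef unsigned int sketch_wchar;

typedef int (*to_wchar_fn)(const unsigned char *from, sketch_wchar *to, int len);

/*
 * Single-byte encodings: every input byte is exactly one character, so
 * the conversion is a plain copy. Nothing here depends on which Latin-N
 * variant the bytes are in.
 */
static int
single_byte_to_wchar(const unsigned char *from, sketch_wchar *to, int len)
{
    int cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from++;
        len--;
        cnt++;
    }
    *to = 0;
    return cnt;
}

/*
 * Simplified two-byte scheme (an assumption for illustration): a byte
 * with the high bit set starts a two-byte character.
 */
static int
two_byte_to_wchar(const unsigned char *from, sketch_wchar *to, int len)
{
    int cnt = 0;

    while (len > 0 && *from)
    {
        if ((*from & 0x80) && len >= 2 && from[1])
        {
            *to = ((sketch_wchar) from[0] << 8) | from[1];
            from += 2;
            len -= 2;
        }
        else
        {
            *to = *from++;
            len--;
        }
        to++;
        cnt++;
    }
    *to = 0;
    return cnt;
}

enum sketch_encoding { SK_LATIN1, SK_LATIN5, SK_TWO_BYTE, SK_NUM_ENCODINGS };

/* Several table entries can point at the same routine. */
static const to_wchar_fn sketch_table[SK_NUM_ENCODINGS] = {
    [SK_LATIN1] = single_byte_to_wchar,
    [SK_LATIN5] = single_byte_to_wchar,
    [SK_TWO_BYTE] = two_byte_to_wchar,
};

/* Look up the conversion routine for the encoding and call it. */
static int
sketch_mb2wchar(enum sketch_encoding enc, const unsigned char *from,
                sketch_wchar *to, int len)
{
    return sketch_table[enc](from, to, len);
}
```

Calling sketch_mb2wchar() with SK_LATIN1 or SK_LATIN5 ends up in the
same copy loop; only an encoding whose characters span more than one
byte needs a routine of its own. The output buffer must have room for
len characters plus the terminating zero.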

The current plan is to use a cross-platform library (ICU) to handle
all the locale- and encoding-related issues. This is a large task and I
wouldn't be surprised if it takes a release or two. Hopefully it will
clean all these issues up...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
