Thread: Character Conversions Handling

From: Volkan YAZICI
Hi,

I'm trying to understand the scheme behind backend/utils/adt/like.c
for downcasing letters [1]. When I look at the other tolower()
implementations, there are lots of them spread around (in
interfaces/libpq, backend/regex, backend/utils/adt/like, etc.). For
example, despite having the pg_wc_tolower() function in regc_locale.c,
iwchareq() in like.c achieves the same thing manually.

I'd appreciate it if somebody could point me to the places I should
start looking at to understand character handling with different
encodings. Also, I wonder why we don't use any btowc/mbrtowc/wctomb-like
functions. Is this for portability with other compilers?

[1] iwchareq() uses pg_mb2wchar_with_len(), which picks the right
mb2wchar function from pg_wchar_table. When I look at
backend/mb/wchar.c, there are some other locale-specific mblen and
mb2wchar routines. For example, EUC_KR is handled with the
pg_euc2wchar_with_len() function, but LATIN5 is handled with the
pg_latin12wchar_with_len() function. Will we write a new function for
LATIN5, like pg_latin52wchar_with_len(), if we encounter a new
problem with LATIN5?

Regards.


Re: Character Conversions Handling

From: Martijn van Oosterhout
On Tue, Oct 18, 2005 at 10:29:30PM +0300, Volkan YAZICI wrote:
> Hi,
>
> I'm trying to understand the scheme behind backend/utils/adt/like.c
> for downcasing letters [1]. When I look at the other tolower()
> implementations, there are lots of them spread around (in
> interfaces/libpq, backend/regex, backend/utils/adt/like, etc.). For
> example, despite having the pg_wc_tolower() function in regc_locale.c,
> iwchareq() in like.c achieves the same thing manually.
>
> I'd appreciate it if somebody could point me to the places I should
> start looking at to understand character handling with different
> encodings. Also, I wonder why we don't use any btowc/mbrtowc/wctomb-like
> functions. Is this for portability with other compilers?

PostgreSQL has to be compatible across many platforms, including those
that don't have any multibyte support, and there are a few of those.
Just as PostgreSQL includes a complete copy of the timezone library,
various bits usually handled by system libraries have been
incorporated into the backend. This includes encoding support.

> [1] iwchareq() uses pg_mb2wchar_with_len(), which picks the right
> mb2wchar function from pg_wchar_table. When I look at
> backend/mb/wchar.c, there are some other locale-specific mblen and
> mb2wchar routines. For example, EUC_KR is handled with the
> pg_euc2wchar_with_len() function, but LATIN5 is handled with the
> pg_latin12wchar_with_len() function. Will we write a new function for
> LATIN5, like pg_latin52wchar_with_len(), if we encounter a new
> problem with LATIN5?

In this particular case it's not an issue, since the Latin-N encodings
are all single-byte encodings; they don't have to be handled
separately. But yes, this means that PostgreSQL's behaviour may vary
from that of the surrounding system.

The current plan is to use a cross-platform library (ICU) to handle
all the locale- and encoding-related issues. This is a large task and I
wouldn't be surprised if it takes a release or two. Hopefully it will
clean all these issues up...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.