Thread: UPPER()/LOWER() and UTF-8

UPPER()/LOWER() and UTF-8

From
Alexey Mahotkin
Date:
Hello,

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore locale.

I searched the sources a couple of times, but I do not understand where
the implementation of UPPER()/LOWER() is.  Could you please point me in
the right direction?

I'll try to understand and fix that.  (But maybe patches for that
exist?  Or maybe FreeBSD 4.8-RELEASE utf-8 locales are broken in that
respect?)


Thanks a lot,

--alexm


Re: UPPER()/LOWER() and UTF-8

From
Tom Lane
Date:
Alexey Mahotkin <alexm@w-m.ru> writes:
> I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
> database encoding), and all is almost well, except that UPPER() and
> LOWER() seem to ignore locale.

upper/lower aren't going to work desirably in any multi-byte character
set encoding.  I think Peter E. is looking into what it would take to
fix this for 7.5, but at present you are going to need to use a
single-byte encoding within the server.  (Nothing to stop you from using
UTF-8 on the client side though.)
        regards, tom lane


Re: UPPER()/LOWER() and UTF-8

From
Karel Zak
Date:
On Tue, Nov 04, 2003 at 04:52:33PM -0500, Tom Lane wrote:
> Alexey Mahotkin <alexm@w-m.ru> writes:
> > I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
> > database encoding), and all is almost well, except that UPPER() and
> > LOWER() seem to ignore locale.
> 
> upper/lower aren't going to work desirably in any multi-byte character
> set encoding.  I think Peter E. is looking into what it would take to
It's a PostgreSQL problem and not a UTF one, because the standard
PostgreSQL text functions don't know anything about the encoding of
their arguments, and so they cannot use another method (for example
UTF-specific lower/upper) for working with strings.
Maybe extend the internal "text" datatype a little and, like VARSIZE(),
use VARENCODING(). Maybe Peter already has some better idea.
 

> fix this for 7.5, but at present you are going to need to use a
> single-byte encoding within the server.  (Nothing to stop you from using
> UTF-8 on the client side though.)
You can use multibyte on the server side too, but then you must use,
for example, the convert() function on the upper/lower arguments.
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/


Re: UPPER()/LOWER() and UTF-8

From
Tom Lane
Date:
Alexey Mahotkin <alexm@w-m.ru> writes:
>     TL> upper/lower aren't going to work desirably in any multi-byte
>     TL> character set encoding.  

> Can you please point me at their implementation?  I do not understand
> why that's impossible.

Because they use <ctype.h>'s toupper() and tolower() functions, which
only work on single-byte characters.
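
As an illustration of the limitation described above, here is a minimal
sketch of upper-casing a string one byte at a time with toupper(). It is
not the actual PostgreSQL source, and bytewise_upper is a hypothetical
name. Each byte of a multi-byte UTF-8 character is handed to toupper()
on its own, so the character as a whole is never seen.

```c
#include <ctype.h>

/*
 * Hypothetical helper, not PostgreSQL code: upper-case a string in place,
 * one byte at a time, the way a <ctype.h>-based implementation would.
 */
static void bytewise_upper(char *s)
{
    for (; *s != '\0'; s++)
    {
        /* toupper() sees a single byte, never a whole UTF-8 sequence */
        *s = (char) toupper((unsigned char) *s);
    }
}
```

Given the UTF-8 bytes for "abc" followed by Cyrillic small "a"
(0xD0 0xB0) in the default "C" locale, this turns the ASCII letters into
"ABC" and leaves the two Cyrillic bytes alone; the correct upper-case
form would be 0xD0 0x90.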

There has been some discussion of using <wctype.h> where available, but
this has a number of issues, notably figuring out the correct mapping
from the server string encoding (eg UTF-8) to unpacked wide characters.
At minimum we'd need to know which charset the locale setting is
expecting, and there doesn't seem to be a portable way to find that out.
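
A rough sketch of the <wctype.h> route, assuming the string is encoded
in whatever charset the current LC_CTYPE locale expects (which is
exactly the assumption that cannot be checked portably). The name
wide_upper is hypothetical; the sketch unpacks the string with
mbstowcs(), maps each wide character with towupper(), and packs it back
with wcstombs().

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-case a multibyte string via wide characters.
 * Decoding follows the current LC_CTYPE locale, NOT the database encoding,
 * so the result is only right when the two agree.
 * Returns a malloc'd string, or NULL on invalid input or allocation failure.
 */
static char *wide_upper(const char *src)
{
    size_t   nwide;
    size_t   nbytes;
    size_t   i;
    wchar_t *wbuf;
    char    *dst = NULL;

    /* With a NULL destination, mbstowcs() only counts wide characters. */
    nwide = mbstowcs(NULL, src, 0);
    if (nwide == (size_t) -1)
        return NULL;            /* bytes are invalid in this locale */

    wbuf = malloc((nwide + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return NULL;
    mbstowcs(wbuf, src, nwide + 1);

    for (i = 0; i < nwide; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);

    /* Same counting trick in the other direction. */
    nbytes = wcstombs(NULL, wbuf, 0);
    if (nbytes != (size_t) -1)
    {
        dst = malloc(nbytes + 1);
        if (dst != NULL)
            wcstombs(dst, wbuf, nbytes + 1);
    }
    free(wbuf);
    return dst;
}
```

The caller would have to call setlocale(LC_CTYPE, ...) first, and the
sketch silently gives wrong answers if that locale's charset differs
from the server encoding.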

IIRC, Peter thinks we must abandon use of libc's locale functionality
altogether and write our own locale layer before we can really have all
the locale-specific functionality we want.
        regards, tom lane


Re: UPPER()/LOWER() and UTF-8

From
Alexey Mahotkin
Date:
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:
    TL> upper/lower aren't going to work desirably in any multi-byte
    TL> character set encoding.

    >> Can you please point me at their implementation?  I do not
    >> understand why that's impossible.

    TL> Because they use <ctype.h>'s toupper() and tolower()
    TL> functions, which only work on single-byte characters.

Aha, that's in src/backend/utils/adt/formatting.c, right?

Yes, I see, it goes byte by byte and uses toupper().  I believe we
could look at the locale, and if it is UTF-8, then use (or copy)
e.g. g_utf8_strup/strdown, right?
    http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
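
The "look at the locale" step might be sketched as below, assuming the
platform provides POSIX nl_langinfo(CODESET); that call is not part of
ISO C, so it is not available everywhere. The helper names
codeset_is_utf8 and locale_is_utf8 are hypothetical.

```c
#include <langinfo.h>
#include <strings.h>

/* Hypothetical helper: does a codeset name denote UTF-8? */
static int codeset_is_utf8(const char *codeset)
{
    /* Platforms spell the name differently, so compare case-insensitively. */
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/*
 * Hypothetical helper: ask libc which charset the current LC_CTYPE locale
 * expects.  Relies on POSIX nl_langinfo(CODESET), which is an assumption
 * about the platform.
 */
static int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A UTF-8-aware case mapping would then be used only when
locale_is_utf8() returns nonzero, with the byte-wise path kept for
everything else.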

I believe that patch could be written in a matter of hours.

    TL> There has been some discussion of using <wctype.h> where
    TL> available, but this has a number of issues, notably figuring
    TL> out the correct mapping from the server string encoding (eg
    TL> UTF-8) to unpacked wide characters.  At minimum we'd need to
    TL> know which charset the locale setting is expecting, and there
    TL> doesn't seem to be a portable way to find that out.

    TL> IIRC, Peter thinks we must abandon use of libc's locale
    TL> functionality altogether and write our own locale layer before
    TL> we can really have all the locale-specific functionality we
    TL> want.

I believe that native Unicode strings (together with human language
handling) should be introduced as an (almost) separate data type (which
has nothing to do with locale), but maybe that's blue-sky.

--alexm


Re: UPPER()/LOWER() and UTF-8

From
Alexey Mahotkin
Date:
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:
    TL> Alexey Mahotkin <alexm@w-m.ru> writes:
    >> I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with
    >> UNICODE database encoding), and all is almost well, except that
    >> UPPER() and LOWER() seem to ignore locale.

    TL> upper/lower aren't going to work desirably in any multi-byte
    TL> character set encoding.

Can you please point me at their implementation?  I do not understand
why that's impossible.

    TL> I think Peter E. is looking into what it would take to fix
    TL> this for 7.5, but at present you are going to need to use a
    TL> single-byte encoding within the server.  (Nothing to stop you
    TL> from using UTF-8 on the client side though.)


Thanks,

--alexm


Re: UPPER()/LOWER() and UTF-8

From
Hannu Krosing
Date:
Alexey Mahotkin wrote on Wed, 05.11.2003 at 17:11:
> Aha, that's in src/backend/utils/adt/formatting.c, right?
> 
> Yes, I see, it goes byte by byte and uses toupper().  I believe we
> could look at the locale, and if it is UTF-8, then use (or copy)
> e.g. g_utf8_strup/strdown, right?
> 
>      http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
> 
> I believe that patch could be written in a matter of hours.
> 
> 
>     TL> There has been some discussion of using <wctype.h> where
>     TL> available, but this has a number of issues, notably figuring
>     TL> out the correct mapping from the server string encoding (eg
>     TL> UTF-8) to unpacked wide characters.  At minimum we'd need to
>     TL> know which charset the locale setting is expecting, and there
>     TL> doesn't seem to be a portable way to find that out.
> 
>     TL> IIRC, Peter thinks we must abandon use of libc's locale
>     TL> functionality altogether and write our own locale layer before
>     TL> we can really have all the locale-specific functionality we
>     TL> want.
> 
> I believe that native Unicode strings (together with human language
> handling) should be introduced as (almost) separate data type (which
> have nothing to do with locale), but that's bluesky maybe.

They should have nothing to do with the _system_ locale, but you can do
neither UPPER()/LOWER() nor ORDER BY unless you know the locale. It is
just that the locale should either be a property of the column or be
given in the SQL statement.

I guess one could write UCHAR, UVARCHAR, UTEXT types based on ICU.

-------------
Hannu



database is shutting down

From
Michael Brusser
Date:
We have a customer who reports a weird problem.
Too often the App. Server fails to connect to the database.
Sometimes the scheduled vacuum fails as well.
The error message is always the same: FATAL:  The database system is shutting down
But from what I see no one is trying to shut down the database at this time.

I am still waiting for the database-log to see if I can find a clue there,
but I wonder if someone knows what can make the database respond this way.

This is Pg 7.3.2, on HP 11.0, using the Unix Domain Socket.
Thank you,
Mike






Re: database is shutting down

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> The error message is always the same:
>   FATAL:  The database system is shutting down
> But from what I see no one is trying to shut down the database at this time.

*Something* has sent the postmaster a shutdown signal --- either
SIGINT or SIGTERM.  Look around and find out what.
        regards, tom lane