Thread: UPPER()/LOWER() and UTF-8
Hello, I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE database encoding), and all is almost well, except that UPPER() and LOWER() seem to ignore locale. I searched the sources a couple of times, but I do not understand where UPPER()/LOWER() are implemented. Could you please point me in the right direction? I'll try to understand and fix that. (But maybe patches for that exist? Or maybe the FreeBSD 4.8-RELEASE UTF-8 locales are broken in that respect?) Thanks a lot, --alexm
Alexey Mahotkin <alexm@w-m.ru> writes: > I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE > database encoding), and all is almost well, except that UPPER() and > LOWER() seem to ignore locale. upper/lower aren't going to work desirably in any multi-byte character set encoding. I think Peter E. is looking into what it would take to fix this for 7.5, but at present you are going to need to use a single-byte encoding within the server. (Nothing to stop you from using UTF-8 on the client side though.) regards, tom lane
On Tue, Nov 04, 2003 at 04:52:33PM -0500, Tom Lane wrote: > Alexey Mahotkin <alexm@w-m.ru> writes: > > I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE > > database encoding), and all is almost well, except that UPPER() and > > LOWER() seem to ignore locale. > > upper/lower aren't going to work desirably in any multi-byte character > set encoding. I think Peter E. is looking into what it would take to It's a PostgreSQL problem, not a UTF problem: the standard PostgreSQL text functions know nothing about the encoding of their arguments, so they cannot use a different method (for example UTF-specific lower/upper) to work with strings. Maybe the internal "text" datatype could be extended a little, with a VARENCODING() alongside VARSIZE(). Maybe Peter already has a better idea. > fix this for 7.5, but at present you are going to need to use a > single-byte encoding within the server. (Nothing to stop you from using > UTF-8 on the client side though.) You can use multibyte on the server side too, but then you must use, for example, the convert() function on the upper/lower arguments. Karel -- Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
Alexey Mahotkin <alexm@w-m.ru> writes: > TL> upper/lower aren't going to work desirably in any multi-byte > TL> character set encoding. > Can you please point me at their implementation? I do not understand > why that's impossible. Because they use <ctype.h>'s toupper() and tolower() functions, which only work on single-byte characters. There has been some discussion of using <wctype.h> where available, but this has a number of issues, notably figuring out the correct mapping from the server string encoding (eg UTF-8) to unpacked wide characters. At minimum we'd need to know which charset the locale setting is expecting, and there doesn't seem to be a portable way to find that out. IIRC, Peter thinks we must abandon use of libc's locale functionality altogether and write our own locale layer before we can really have all the locale-specific functionality we want. regards, tom lane
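[Editorial sketch] The difference Tom describes between the <ctype.h> and <wctype.h> approaches can be illustrated with a small standalone C example. This is not PostgreSQL source code: the helper names bytewise_upper and wide_upper are hypothetical, and wide_upper assumes the caller has already called setlocale(LC_CTYPE, ...) with a locale whose charset matches the encoding of the input, which is exactly the assumption the message above says is hard to verify portably.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not PostgreSQL source): uppercase in place, one byte
 * at a time, as a <ctype.h>-based implementation would. Each byte of a
 * multi-byte sequence is handed to toupper() on its own, so a non-ASCII
 * letter is never seen as a whole character.
 */
void bytewise_upper(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char) toupper((unsigned char) *s);
}

/*
 * Hypothetical helper: uppercase by way of wide characters.
 * Assumption: the current LC_CTYPE locale uses the same encoding as `in`.
 * Returns 0 on success, -1 if the input is invalid in the locale's
 * encoding, memory runs out, or `out` is too small.
 */
int wide_upper(const char *in, char *out, size_t outsize)
{
    size_t   len = strlen(in);
    /* A string never has more characters than bytes, so len + 1 suffices. */
    wchar_t *w = malloc((len + 1) * sizeof(wchar_t));
    size_t   nwide, nbytes, i;
    int      rc = -1;

    if (w == NULL)
        return -1;

    nwide = mbstowcs(w, in, len + 1);       /* multi-byte -> wide */
    if (nwide != (size_t) -1)
    {
        for (i = 0; i < nwide; i++)
            w[i] = (wchar_t) towupper((wint_t) w[i]);

        nbytes = wcstombs(out, w, outsize); /* wide -> multi-byte */
        /* nbytes < outsize guarantees the terminating NUL was written. */
        if (nbytes != (size_t) -1 && nbytes < outsize)
            rc = 0;
    }
    free(w);
    return rc;
}
```

In the default "C" locale, bytewise_upper turns "abc" into "ABC" but leaves the bytes of a UTF-8 Cyrillic string unchanged. Whether wide_upper handles such a string depends entirely on the locale in effect, so it only helps when the locale's charset and the server encoding are known to agree.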
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes: TL> Alexey Mahotkin <alexm@w-m.ru> writes: TL> upper/lower aren't going to work desirably in any multi-byte TL> character set encoding. >> Can you please point me at their implementation? I do not >> understand why that's impossible. TL> Because they use <ctype.h>'s toupper() and tolower() TL> functions, which only work on single-byte characters. Aha, that's in src/backend/utils/adt/formatting.c, right? Yes, I see, it goes byte by byte and uses toupper(). I believe we could look at the locale, and if it is UTF-8, then use (or copy) e.g. g_utf8_strup/strdown, right? http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup I believe that patch could be written in a matter of hours. TL> There has been some discussion of using <wctype.h> where TL> available, but this has a number of issues, notably figuring TL> out the correct mapping from the server string encoding (eg TL> UTF-8) to unpacked wide characters. At minimum we'd need to TL> know which charset the locale setting is expecting, and there TL> doesn't seem to be a portable way to find that out. TL> IIRC, Peter thinks we must abandon use of libc's locale TL> functionality altogether and write our own locale layer before TL> we can really have all the locale-specific functionality we TL> want. I believe that native Unicode strings (together with human language handling) should be introduced as (almost) separate data type (which have nothing to do with locale), but that's bluesky maybe. --alexm
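[Editorial sketch] The "look at the locale, and if it is UTF-8" step proposed above could be sketched in C as follows. This assumes a platform that provides POSIX nl_langinfo(CODESET) in <langinfo.h>, which is not available everywhere and so does not remove the portability concern raised in the quoted text. The helper names codeset_is_utf8 and current_locale_is_utf8 are hypothetical, and the accepted spellings of the codeset name are an assumption, since platforms report it differently.

```c
#include <langinfo.h>
#include <stddef.h>
#include <strings.h>

/*
 * Hypothetical helper: does a codeset name denote UTF-8?
 * Assumption: platforms spell it "UTF-8" or "utf8" in some letter case.
 */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcasecmp(codeset, "UTF-8") == 0 ||
            strcasecmp(codeset, "UTF8") == 0);
}

/*
 * Hypothetical helper: ask libc which charset the current LC_CTYPE locale
 * expects. Only meaningful after setlocale(LC_CTYPE, ...) has been called;
 * before that the program runs in the "C" locale.
 */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A caller would branch on current_locale_is_utf8() to choose between the byte-wise path and a UTF-8-aware one. Comparing the result against the database encoding is a separate check that this sketch does not attempt.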
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes: TL> Alexey Mahotkin <alexm@w-m.ru> writes: >> I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with >> UNICODE database encoding), and all is almost well, except that >> UPPER() and LOWER() seem to ignore locale. TL> upper/lower aren't going to work desirably in any multi-byte TL> character set encoding. Can you please point me at their implementation? I do not understand why that's impossible. TL> I think Peter E. is looking into what TL> it would take to fix this for 7.5, but at present you are TL> going to need to use a single-byte encoding within the server. TL> (Nothing to stop you from using UTF-8 on the client side TL> though.) Thanks, --alexm
Alexey Mahotkin wrote on Wed, 05.11.2003 at 17:11: > Aha, that's in src/backend/utils/adt/formatting.c, right? > > Yes, I see, it goes byte by byte and uses toupper(). I believe we > could look at the locale, and if it is UTF-8, then use (or copy) > e.g. g_utf8_strup/strdown, right? > > http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup > > I believe that patch could be written in a matter of hours. > > > TL> There has been some discussion of using <wctype.h> where > TL> available, but this has a number of issues, notably figuring > TL> out the correct mapping from the server string encoding (eg > TL> UTF-8) to unpacked wide characters. At minimum we'd need to > TL> know which charset the locale setting is expecting, and there > TL> doesn't seem to be a portable way to find that out. > > TL> IIRC, Peter thinks we must abandon use of libc's locale > TL> functionality altogether and write our own locale layer before > TL> we can really have all the locale-specific functionality we > TL> want. > > I believe that native Unicode strings (together with human language > handling) should be introduced as (almost) separate data type (which > have nothing to do with locale), but that's bluesky maybe. They should have nothing to do with the _system_ locale, but you can do neither UPPER()/LOWER() nor ORDER BY unless you know the locale. It is just that the locale should either be a property of the column or be given in the SQL statement. I guess one could write UCHAR, UVARCHAR, UTEXT types based on ICU. ------------- Hannu
We have a customer who reports a weird problem. Too often the App. Server fails to connect to the database. Sometimes the scheduled vacuum fails as well. The error message is always the same: FATAL: The database system is shutting down But from what I see no one is trying to shut down the database at this time. I am still waiting for the database-log to see if I can find a clue there, but I wonder if someone knows what can make the database respond this way. This is Pg 7.3.2, on HP 11.0, using the Unix Domain Socket. Thank you, Mike
Michael Brusser <michael@synchronicity.com> writes: > The error message is always the same: > FATAL: The database system is shutting down > But from what I see no one is trying to shut down the database at this time. *Something* has sent the postmaster a shutdown signal --- either SIGINT or SIGTERM. Look around and find out what. regards, tom lane