Thread: PostgreSQL, UTF-8 and Mac OS X
Hi.

I have a problem with PostgreSQL and UTF-8 on my Mac OS X Powerbook.

- System is Mac OS X Client 10.4.3, PostgreSQL 8.1beta3
- initdb was called with -E UTF-8 --locale=de_DE.UTF-8

I have successfully built an LC_COLLATE file for ISO8859-15, and ordering works there if I do the initdb with ISO8859-1, but I want to use UTF-8 for some reasons. I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE file that works fine with ISO8859-1.

"show all;" shows that the encoding is UTF-8 now, and the LC_... settings are "de_DE.UTF-8". Okay, this is fine. But it doesn't work. The LC_COLLATE file works if I set encoding and locale to ISO..., but not if I set the values to UTF-8 (I don't know how often I have called initdb in the last few days ...).

It seems to me that the locale "de_DE.UTF-8" just isn't working at all (at least for ordering results) in the combination of PG and Mac OS X.

Any hints on what I can try to find out more?

cug
On Mon, Nov 07, 2005 at 12:50:18PM +0100, Guido Neitzer wrote:
> Hi.
>
> I have a problem with PostgreSQL and UTF-8 on my Mac OS X Powerbook.
>
> - System is Mac OS X Client 10.4.3, PostgreSQL 8.1beta3
> - initdb was called with -E UTF-8 --locale=de_DE.UTF-8

We had this question earlier this week. Mac OS X uses the locales from FreeBSD, and neither supports UTF-8 collation at all. You'll see exactly the same results from other UNIX utilities.

Sometime in the near future (hopefully) PostgreSQL will provide locale support independent of the underlying operating system, but for now you're stuck.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On 07.11.2005, at 14:07, Martijn van Oosterhout wrote:
> We had this question earlier this week. Mac OS X uses the locales from
> FreeBSD, and neither support UTF-8 collation at all. You'll see exactly
> the same results from other UNIX utilities.

I think I was the one who asked.

I worked on my locale problem over the weekend and was able to build an LC_COLLATE file that actually works with ISO locales, but not with UTF-8 (50% progress ... ;-)).

When you test the UNIX utility "sort" on Mac OS X, you should be aware that the pre-installed version ignores locales entirely ... :-( I had to install the GNU coreutils to get a sort that works with locales, and this also fails on UTF-8 but works with ISO encoding/collation, the same as PG does.

Now I'm not sure whether my own LC_COLLATE file is not appropriate for UTF-8 (why not?) or whether the Mac OS X locale support does not handle UTF-8 at all, as you state.

> Sometime in the near future (hopefully) PostgreSQL will provide locale
> support independent of the underlying operating system, but for now
> you're stuck.

It will be cool to have locale support directly in PostgreSQL.

So, just a quick question regarding a switch: is there a problem with using ISO8859-15 for now and switching later by dumping the data and importing it into a newer version, which should then use UTF-8? Do I need to do some conversion, or how does this work?

Thanks for your help!

cug
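One way to check the C library's collation directly, without going through either PostgreSQL or "sort", is a small script that asks the library to collate under a named locale. The following is only a sketch: the helper name `sort_with_locale` is made up for illustration, and it relies on Python's `locale` module, which calls the platform's own collation routines. What it returns for "de_DE.UTF-8" or "de_DE.ISO8859-15" depends entirely on the system it runs on.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words using the C library's collation for locale_name.

    Returns None if the platform does not provide that locale.
    (Hypothetical helper, not part of PostgreSQL or any library.)
    """
    # Remember the current LC_COLLATE setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm produces a sort key that follows the locale's collation.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    sample = ["Zebra", "Äpfel", "Apfel", "Öl", "Ofen"]
    for name in ("C", "de_DE.ISO8859-15", "de_DE.UTF-8"):
        print(name, sort_with_locale(sample, name))
```

If the UTF-8 locale gives the same order as "C" while the ISO locale puts the umlauts next to their base letters, the limitation is in the operating system's locale data rather than in PostgreSQL.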
On Mon, Nov 07, 2005 at 02:28:05PM +0100, Guido Neitzer wrote:
> I think I was the one who asked.
>
> I worked on my locale problem on the weekend and was able to build a
> LC_COLLATE file, that actually works with ISO locales, but not with
> UTF-8 (50% progress ... ;-)).

I guess the problem is that you have to import the entire Unicode database to make it work. I think the code is multibyte aware, though; it's just that no one has done the work.

Disclaimer: I'm working with Linux/Glibc, which has had proper collation for quite a while now, so I have no real understanding of systems that don't have it.

> When you test the UNIX utility "sort" on Mac OS X, you should be
> aware, that the pre-installed version on Mac OS X ignores locales at
> all ... :-( I had to install the gnu coreutils to get a sort that
> works with locales, and this also fails on UTF-8 but works with ISO
> encoding/collate - same as PG does.

Nasty.

> Now I'm not sure, whether my own LC_COLLATE file is not appropriate
> for UTF-8 (why not?) or whether Mac OS X locale does not support
> UTF-8 at all as you state.

Hmm, I just went back to the source code (adv_cmds-79.1) and it looks like collations don't support UTF-8 at all. Or any multibyte encoding.

> Will be cool to have locale support directly in PostgreSQL.

Yeah, but it seems a bit lame for an operating system to claim to support multibyte locales if it can't do collation on them. :( It supports everything but collation, so it's obviously not a priority.

> So, just a quick question regarding a switch: is there a problem with
> using ISO8859-15 for now, and do a switch later with dumping the data
> and import it to a newer version which should then use UTF-8? Do I
> need to do some conversion or how does this work?

If you import as ISO8859-15 now, then when you do the upgrade, simply set the client encoding to ISO8859-15 and PostgreSQL will convert it all to UTF-8 during the load.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
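The conversion described in the reply above is a plain transcoding step: bytes that the client declares as ISO8859-15 are decoded and re-encoded as UTF-8 before being stored. The sketch below only imitates that step in Python to show what happens to the data. It is not PostgreSQL's code, and the function name `latin9_to_utf8` is invented for illustration.

```python
def latin9_to_utf8(raw: bytes) -> bytes:
    """Transcode ISO8859-15 (Latin-9) bytes to UTF-8.

    Illustrative stand-in for the conversion a UTF-8 database performs
    when the client encoding is declared as ISO8859-15.
    """
    return raw.decode("iso8859-15").encode("utf-8")


# A value as it would appear in a dump taken from an ISO8859-15 database.
sample = "Größe 5 €".encode("iso8859-15")
converted = latin9_to_utf8(sample)

print(len(sample), "bytes in ISO8859-15")
print(len(converted), "bytes in UTF-8")
print(converted.decode("utf-8"))
```

The text is unchanged by the round trip. Only the byte representation grows, because each non-ASCII character takes more than one byte in UTF-8.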
Guido Neitzer <guido.neitzer@pharmaline.de> writes:
> I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE
> file that works fine with ISO8859-1.

Um ... why would you expect that to work at all? Aren't the collation files very dependent on the encoding?

regards, tom lane
On Mon, Nov 07, 2005 at 09:47:21AM -0500, Tom Lane wrote:
> Guido Neitzer <guido.neitzer@pharmaline.de> writes:
> > I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE
> > file that works fine with ISO8859-1.
>
> Um ... why would you expect that to work at all? Aren't the collation
> files very dependent on the encoding?

You'd think so, but standard Mac OS X/FreeBSD just links the UTF-8 locales to the US-ASCII locales. So by default:

de_DE.UTF-8 links to ln_LN.US_ASCII

All he's done is change it so the UTF-8 locale uses latin9 rather than ascii ordering. It obviously breaks for actual UTF-8 strings, but the C library doesn't support that anyway... Multibyte collation simply isn't supported, so linking files at random won't crash anything.

All the more reason to go for something like ICU...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
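To make the last point concrete, here is a rough illustration of why a single-byte collation table cannot order UTF-8 strings correctly. It is only a sketch: it models "ASCII ordering" as plain byte comparison and shows what a Latin-9 table would see when handed UTF-8 bytes. The helper name `seen_as_latin9` is made up for this example.

```python
words = ["Zebra", "Äpfel", "Apfel", "Öl", "Ofen"]

# Byte-wise comparison of the UTF-8 encoding: every umlaut starts with
# the byte 0xC3, so it sorts after all ASCII letters, including "Z".
bytewise = sorted(words, key=lambda w: w.encode("utf-8"))
print(bytewise)


def seen_as_latin9(word: str) -> str:
    """Show how a single-byte Latin-9 table interprets UTF-8 bytes."""
    return word.encode("utf-8").decode("iso8859-15")


# "ä" is two bytes in UTF-8 (0xC3 0xA4); a Latin-9 table treats them as
# two unrelated characters and collates each one separately.
print(seen_as_latin9("Bär"))
```

A table built for one byte per character therefore ends up ranking the fragments of each multibyte character rather than the character itself, which matches the behaviour described in the thread.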