Thread: PostgreSQL, UTF-8 and Mac OS X

From: Guido Neitzer

Hi.

I have a problem with PostgreSQL and UTF-8 on my Mac OS X Powerbook.


- System is Mac OS X Client 10.4.3, PostgreSQL 8.1beta3

- initdb was called with -E UTF-8 --locale=de_DE.UTF-8

I have successfully built an LC_COLLATE file for ISO8859-15, and
ordering works there if I do the initdb with ISO8859-1, but I want to
use UTF-8 for several reasons.

I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE
file that works fine with ISO8859-1.

"show all;" shows that the encoding is UTF-8 now, and the LC_...
settings are "de_DE.UTF-8". Okay, this is fine.

But it doesn't work. The LC_COLLATE file works if I set encoding and
locale to ISO..., but not if I set them to UTF-8 (I don't know how
often I have called initdb in the last few days ...).


It seems to me that the locale "de_DE.UTF-8" just isn't working at
all (at least for ordering results) in the combination PG -- Mac OS X.

Any hints on what I can try to find out more?

cug


Re: PostgreSQL, UTF-8 and Mac OS X

From: Martijn van Oosterhout

On Mon, Nov 07, 2005 at 12:50:18PM +0100, Guido Neitzer wrote:
> Hi.
>
> I have a problem with PostgreSQL and UTF-8 on my Mac OS X Powerbook.
>
>
> - System is Mac OS X Client 10.4.3, PostgreSQL 8.1beta3
>
> - initdb was called with -E UTF-8 --locale=de_DE.UTF-8

We had this question earlier this week. Mac OS X uses the locales from
FreeBSD, and neither supports UTF-8 collation at all. You'll see exactly
the same results from other UNIX utilities.
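
A rough way to check this outside PostgreSQL is to ask the C library
to sort a few German words under the locale in question. The sketch
below is not from the thread: the function names `collation_probe` and
`codepoint_order` and the sample words are made up for illustration,
and it assumes Python's standard `locale` module is available on the
machine being tested.

```python
import locale

SAMPLE_WORDS = ("Zebra", "Äpfel", "Apfel")  # illustrative test data


def collation_probe(locale_name, words=SAMPLE_WORDS):
    """Sort words using the OS collation rules for locale_name.

    Returns the sorted list, or None if the OS rejects the locale name.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def codepoint_order(words=SAMPLE_WORDS):
    """Sort words by plain code point, ignoring any locale."""
    return sorted(words)


if __name__ == "__main__":
    for name in ("de_DE.ISO8859-15", "de_DE.UTF-8"):
        print(name, collation_probe(name))
    print("code point order", codepoint_order())
```

A working German collation would be expected to place "Äpfel" next to
"Apfel" and ahead of "Zebra". If the result for de_DE.UTF-8 is the
same as the plain code point order, the OS is probably not applying a
German collation for that locale.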

Sometime in the near future (hopefully) PostgreSQL will provide locale
support independent of the underlying operating system, but for now
you're stuck.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: PostgreSQL, UTF-8 and Mac OS X

From: Guido Neitzer

On 07.11.2005, at 14:07 Uhr, Martijn van Oosterhout wrote:

> We had this question earlier this week. Mac OS X uses the locales from
> FreeBSD, and neither supports UTF-8 collation at all. You'll see exactly
> the same results from other UNIX utilities.

I think I was the one who asked.

I worked on my locale problem on the weekend and was able to build an
LC_COLLATE file that actually works with ISO locales, but not with
UTF-8 (50% progress ... ;-)).

When you test the UNIX utility "sort" on Mac OS X, you should be
aware that the pre-installed version on Mac OS X ignores locales
entirely ... :-( I had to install the GNU coreutils to get a sort that
works with locales, and this also fails on UTF-8 but works with ISO
encoding/collation, the same as PG does.

Now I'm not sure whether my own LC_COLLATE file is not appropriate
for UTF-8 (why not?) or whether the Mac OS X locale support does not
handle UTF-8 at all, as you state.

> Sometime in the near future (hopefully) PostgreSQL will provide locale
> support independent of the underlying operating system, but for now
> you're stuck.

It will be cool to have locale support directly in PostgreSQL.

So, just a quick question regarding a switch: is there a problem with
using ISO8859-15 for now and switching later by dumping the data and
importing it into a newer version, which should then use UTF-8? Do I
need to do some conversion, or how does this work?

Thanks for your help!
cug


Re: PostgreSQL, UTF-8 and Mac OS X

From: Martijn van Oosterhout

On Mon, Nov 07, 2005 at 02:28:05PM +0100, Guido Neitzer wrote:
> I think I was the one who asked.
>
> I worked on my locale problem on the weekend and was able to build an
> LC_COLLATE file that actually works with ISO locales, but not with
> UTF-8 (50% progress ... ;-)).

I guess the problem is that you have to import the entire Unicode
database to make it work. I think the code is multibyte aware, though;
it's just that no one has done the work.

Disclaimer: I'm working with Linux/Glibc which has had proper collation
for quite a while now so I have no real understanding of systems that
don't have it.

> When you test the UNIX utility "sort" on Mac OS X, you should be
> aware that the pre-installed version on Mac OS X ignores locales
> entirely ... :-( I had to install the GNU coreutils to get a sort that
> works with locales, and this also fails on UTF-8 but works with ISO
> encoding/collation, the same as PG does.

Nasty.

> Now I'm not sure whether my own LC_COLLATE file is not appropriate
> for UTF-8 (why not?) or whether the Mac OS X locale support does not
> handle UTF-8 at all, as you state.

Hmm, I just went back to the source code (adv_cmds-79.1) and it looks
like collations don't support UTF-8 at all, or any other multibyte
encoding.

> It will be cool to have locale support directly in PostgreSQL.

Yeah, but it seems a bit lame for an operating system to claim to
support multibyte locales if it can't do collation on them. :( It
supports everything but collation, so collation is obviously not a
priority.

> So, just a quick question regarding a switch: is there a problem with
> using ISO8859-15 for now and switching later by dumping the data and
> importing it into a newer version, which should then use UTF-8? Do I
> need to do some conversion, or how does this work?

If you load the data as ISO8859-15 now, then when you do the upgrade,
simply set the client encoding to ISO8859-15 and PostgreSQL will
convert it all to UTF-8 during the load.
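
What that conversion amounts to at the byte level can be sketched as
follows. This is only an illustration of the re-encoding step, not
PostgreSQL's own conversion code, and the helper name `latin9_to_utf8`
is hypothetical. (PostgreSQL's name for ISO 8859-15 is LATIN9, so that
is the client encoding you would set for such a dump.)

```python
def latin9_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-15 (Latin-9) bytes as UTF-8.

    Each Latin-9 byte is one character; in UTF-8 the non-ASCII
    ones become multi-byte sequences.
    """
    return raw.decode("iso8859-15").encode("utf-8")


if __name__ == "__main__":
    dumped = "Größe".encode("iso8859-15")  # as stored in the old database
    print(dumped)                          # single byte per character
    print(latin9_to_utf8(dumped))          # multi-byte for ö and ß
```

The point is that the data itself does not need manual conversion, as
long as the server is told which encoding the incoming bytes are in.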

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: PostgreSQL, UTF-8 and Mac OS X

From: Tom Lane

Guido Neitzer <guido.neitzer@pharmaline.de> writes:
> I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE
> file that works fine with ISO8859-1.

Um ... why would you expect that to work at all?  Aren't the collation
files very dependent on the encoding?

            regards, tom lane

Re: PostgreSQL, UTF-8 and Mac OS X

From: Martijn van Oosterhout

On Mon, Nov 07, 2005 at 09:47:21AM -0500, Tom Lane wrote:
> Guido Neitzer <guido.neitzer@pharmaline.de> writes:
> > I have linked the LC_COLLATE for de_DE.UTF-8 to the same LC_COLLATE
> > file that works fine with ISO8859-1.
>
> Um ... why would you expect that to work at all?  Aren't the collation
> files very dependent on the encoding?

You'd think so, but standard Mac OS X/FreeBSD just link the UTF-8
locales to the US-ASCII locales. So by default:

de_DE.UTF-8  links to  la_LN.US-ASCII

All he's done is change it so the UTF-8 locale uses Latin-9 rather than
ASCII ordering. It obviously breaks for actual UTF-8 strings, but the C
library doesn't support that anyway... Multibyte collation simply
isn't supported, so linking files at random won't crash anything.
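
To make the breakage concrete, here is a small sketch of what a
single-byte collation table gets to see when it is handed UTF-8 data.
It is an illustration only, not the C library's code, and the helper
names `as_single_byte_table_sees_it` and `utf8_byte_sort` are made up
for this example.

```python
def as_single_byte_table_sees_it(text: str) -> str:
    """Show UTF-8 bytes the way a Latin-9 table would interpret them.

    Every byte is treated as its own character, so one non-ASCII
    character turns into two or more unrelated ones.
    """
    return text.encode("utf-8").decode("iso8859-15")


def utf8_byte_sort(words):
    """Sort words by their raw UTF-8 bytes, as a byte-wise compare would."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


if __name__ == "__main__":
    print(as_single_byte_table_sees_it("Äpfel"))
    print(utf8_byte_sort(["Zebra", "Äpfel", "Apfel"]))
```

The lead byte of every two-byte UTF-8 sequence is above 0xC0, so a
byte-oriented comparison puts "Äpfel" after "Zebra" whatever the
Latin-9 table says about the letter Ä.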

All the more reason to go for something like ICU...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
