Thread: UTF-8 collation on Windows?

UTF-8 collation on Windows?

From
Dev Kumkar
Date:
Am really going no where with this after so many searching over net or am missing some basic things, not sure!

What is the equivalent for "en_US.UTF-8" collation in case of windows?

In Linux am creating database with following options, as follows:
-E utf8 -l en_US.UTF-8 -T template0

This creates utf8 encoding and also the collation is set as en_US.UTF-8.

in case of windows utf8 encoding parameter works but not sure abou the equivalent for en_US.UTF-8 collation.
Default database gets created with 'English_United States.1252' collation.

As a result the sorting is not same in case of windows and linux database.

Any suggestions/ pointers here?

Regards...



Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 06:41 AM, Dev Kumkar wrote:
> Am really going no where with this after so many searching over net or
> am missing some basic things, not sure!
>
> What is the equivalent for "en_US.UTF-8" collation in case of windows?
>
> In Linux am creating database with following options, as follows:
> -E utf8 -l en_US.UTF-8 -T template0
>
> This creates utf8 encoding and also the collation is set as en_US.UTF-8.
>
> in case of windows utf8 encoding parameter works but not sure abou the
> equivalent for en_US.UTF-8 collation.
> Default database gets created with 'English_United States.1252' collation.
>
> As a result the sorting is not same in case of windows and linux database.
>
> Any suggestions/ pointers here?

I found the below that might help. I do not use Windows much any more so
I do not have a machine handy to confirm.

http://www.g-loaded.eu/2011/02/27/locale-windows/

>
> Regards...
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Wed, Feb 19, 2014 at 10:16 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
I found the below that might help. I do not use Windows much any more so I do not have a machine handy to confirm.

http://www.g-loaded.eu/2011/02/27/locale-windows/

Thanks for the pointer. "american_usa" works however it sets the LC_COLLATE to 'English_United States.1252' which is basically "ANSI Latin 1" and is not utf8. The third parameter as the link says is codeset - "language_territory.codeset".

Here is the list of all codesets http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx but still no success.

One of the link says codepage 65001 and utf-8 is same - http://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing.

Am not able to find windows codeset equivalent of utf-8 and set it as LC_COLLATE while creating database.

Has anyone set the LC_COLLATE as utf-8 on windows?

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 11:42 AM, Dev Kumkar wrote:
> On Wed, Feb 19, 2014 at 10:16 PM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     I found the below that might help. I do not use Windows much any
>     more so I do not have a machine handy to confirm.
>
>     http://www.g-loaded.eu/2011/02/27/locale-windows/
>
>
> Thanks for the pointer. "*american_usa*" works however it sets the
> LC_COLLATE to 'English_United States.1252' which is basically "ANSI
> Latin 1" and is not utf8. The third parameter as the link says is
> codeset - "language_territory.codeset".

So what is the exact command you are using?

>
> Here is the list of all codesets
> http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx but
> still no success.
>
> One of the link says codepage 65001 and utf-8 is same -
> http://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing.
>
> Am not able to find windows codeset equivalent of utf-8 and set it as
> LC_COLLATE while creating database.
>
> Has anyone set the LC_COLLATE as utf-8 on windows?
>
> Regards...



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 1:19 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
So what is the exact command you are using?

createdb -U postgres -E utf8 -l american_usa <DBNAME>
Above command fails to create utf-8 LC_COLLATE.

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 12:03 PM, Dev Kumkar wrote:
> On Thu, Feb 20, 2014 at 1:19 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     So what is the exact command you are using?
>
>
> createdb -U postgres -E utf8 -l american_usa <DBNAME>
> Above command fails to create utf-8 LC_COLLATE.

What does it set LC_CTYPE to?

So what happens if you do?:

createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>

or

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa <DBNAME>

>
> Regards...



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 1:41 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
What does it set LC_CTYPE to?

So what happens if you do?:

createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>

createdb: database creation failed: ERROR:  invalid locale name: "american_usa.65001"


or

createdb -U postgres -E utf8 --lc-ctype=american_usa --lc-collate=american_usa <DBNAME>
Succeeds but as replied earlier it creates database with LC_COLLATE = 'English_United States.1252' which corresponds to Latin1.

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 12:16 PM, Dev Kumkar wrote:
> On Thu, Feb 20, 2014 at 1:41 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     What does it set LC_CTYPE to?
>
>     So what happens if you do?:
>
>     createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>
>
>
> *createdb: database creation failed: ERROR:  invalid locale name:
> "american_usa.65001" *
>
>
>     or
>
>     createdb -U postgres -E utf8 --lc-ctype=american_usa
>     --lc-collate=american_usa <DBNAME>
>
> Succeeds but as replied earlier it creates database with LC_COLLATE =
> 'English_United States.1252' which corresponds to Latin1.

Just noticed you are not specifying the template database. Try using
template0:

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa  -T template0 <DBNAME>

>
> Regards...



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 2:01 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Just noticed you are not specifying the template database. Try using template0:

createdb -U postgres -E utf8 --lc-ctype=american_usa --lc-collate=american_usa  -T template0 <DBNAME>

Same result i.e. LC_COLLATE and LC_CTYPE gets set as 'English_United States.1252'
Had specified template option this earlier too.

Btw in case the specified collation is incompatible then "createdb" binary complains and instructs to use template0 as template, which isn't the case here.

The codeset parameter is missed here and hence it takes the collation as the default one which is 'English_United States.1252'.

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 12:43 PM, Dev Kumkar wrote:
> On Thu, Feb 20, 2014 at 2:01 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     Just noticed you are not specifying the template database. Try using
>     template0:
>
>     createdb -U postgres -E utf8 --lc-ctype=american_usa
>     --lc-collate=american_usa  -T template0 <DBNAME>
>
>
> Same result i.e. LC_COLLATE and LC_CTYPE gets set as 'English_United
> States.1252'
> Had specified template option this earlier too.
>
> Btw in case the specified collation is incompatible then "createdb"
> binary complains and instructs to use template0 as template, which isn't
> the case here.
>
> The codeset parameter is missed here and hence it takes the collation as
> the default one which is 'English_United States.1252'.

Alright last shot:)

Taking hint from here:

http://msdn.microsoft.com/en-us/library/x99tb11d.aspx

try:

createdb -U postgres -E utf8 -l en-US <DBNAME>

If that does not work, not sure where to go.

>
> Regards...



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 2:24 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Alright last shot:)

Taking hint from here:

http://msdn.microsoft.com/en-us/library/x99tb11d.aspx

try:

createdb -U postgres -E utf8 -l en-US <DBNAME>

If that does not work, not sure where to go.

This won't work on Windows. Note that en-US collation name is specific to linux and in case of Windows these names are different which is where am facing issues to find the exact code page which corresponds to utf8.

Your msdn link has this mentioned which states that for code pages that require more than two bytes per character which is basically UTF8 doesn't work with setlocale command. But again its specific to the setlocale API.
               "The locale argument can take a locale name, a language string, a language string and country/region code, a code page, or a    
                language string, country/region code, and code page. The set of available locale names, languages, country/region codes, and
               code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per
               character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
"

However am sure there would be some codepage which can be used in postgreSQL to set the collation to UTF8 equivalent of linux.

Please suggest? Am sure this not something new which am looking for.

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 01:09 PM, Dev Kumkar wrote:
> On Thu, Feb 20, 2014 at 2:24 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     Alright last shot:)
>
>     Taking hint from here:
>
>     http://msdn.microsoft.com/en-__us/library/x99tb11d.aspx
>     <http://msdn.microsoft.com/en-us/library/x99tb11d.aspx>
>
>     try:
>
>     createdb -U postgres -E utf8 -l en-US <DBNAME>
>
>     If that does not work, not sure where to go.
>
>
> This won't work on Windows. Note that en-US collation name is specific
> to linux and in case of Windows these names are different which is where
> am facing issues to find the exact code page which corresponds to utf8.

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.




Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.

Yes. Here is the output:
createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
createdb: database creation failed: ERROR:  invalid locale name: "en-US"

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 01:21 PM, Dev Kumkar wrote:
> On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>
>     Have you tried it?
>
>     Note that the locale name is different then the one Linux.
>
>     On Linux it is en_US.
>
>     What I suggested is en-US.
>
>
> Yes. Here is the output:
> createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
> Password:
> *createdb: database creation failed: ERROR:  invalid locale name: "en-US"*

Hmmm, well I am out of ideas:(

>
> Regards...


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: UTF-8 collation on Windows?

From
John R Pierce
Date:
On 2/19/2014 1:21 PM, Dev Kumkar wrote:
createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
createdb: database creation failed: ERROR:  invalid locale name: "en-US"

I believe its en_US ... _ not -






-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast

Re: UTF-8 collation on Windows?

From
Gavin Flower
Date:
On 20/02/14 10:28, Adrian Klaver wrote:
> On 02/19/2014 01:21 PM, Dev Kumkar wrote:
>> On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver
>> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>>
>>
>>     Have you tried it?
>>
>>     Note that the locale name is different then the one Linux.
>>
>>     On Linux it is en_US.
>>
>>     What I suggested is en-US.
>>
>>
>> Yes. Here is the output:
>> createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
>> Password:
>> *createdb: database creation failed: ERROR:  invalid locale name:
>> "en-US"*
>
> Hmmm, well I am out of ideas:(
>
>>
>> Regards...
>
>
Upgrade servers to Linux?  :-P


Cheers,
Gavin


Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/19/2014 01:30 PM, John R Pierce wrote:
> On 2/19/2014 1:21 PM, Dev Kumkar wrote:
>> createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
>> Password:
>> *createdb: database creation failed: ERROR:  invalid locale name: "en-US"*
>
> I believe its en_US ... _ not -

Unfortunately this is a Windows install and that does not work either.

>
>
>
>
>
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: UTF-8 collation on Windows?

From
John R Pierce
Date:
On 2/19/2014 1:35 PM, Adrian Klaver wrote:
>
> Unfortunately this is a Windows install and that does not work either.

windows encodings are a pain.   their Unicode is NOT utf8, its ucs2 aka
utf16.   I just checked my default install of potsgres 9.2, it appears
its using WIN1252 encoding, another bastard, this is a modified ISO-8859-1


--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 3:17 AM, John R Pierce <pierce@hogranch.com> wrote:
On 2/19/2014 1:35 PM, Adrian Klaver wrote:

Unfortunately this is a Windows install and that does not work either.

windows encodings are a pain.   their Unicode is NOT utf8, its ucs2 aka utf16.   I just checked my default install of potsgres 9.2, it appears its using WIN1252 encoding, another bastard, this is a modified ISO-8859-1

Yes I guess that will be a different topic altogether to discuss. As windows internally uses UTF-16 encoding. I have one use case wherein will have to change the client_encoding in ODBC so that Unicode characters on windows are stored correctly. Currently on linux machine same unicode data is stored correctly but in case of windows it is not by default, so there are tweaks there required as well. But separate topic later.

Coming back here, yes by default it is WIN1252 on windows. So is there no way I could achieve linux alike utf-8 collation?
Linux and windows sorting order behaves differently in this case.

Regards...

Re: UTF-8 collation on Windows?

From
"Daniel Verite"
Date:
    Dev Kumkar wrote:

> Succeeds but as replied earlier it creates database with LC_COLLATE =
> 'English_United States.1252' which corresponds to Latin1.

Despite windows-1252 being a monobyte encoding sharing most
of LATIN1 codes and character set, it does not mean that
English_United States.1252 is limited to this character set.
You may use UTF-8 databases with that locale.

Consider the 2nd paragraph of  "Character Set Support"
in the doc:
http://www.postgresql.org/docs/current/static/multibyte.html

    "For C or POSIX locale, any character set is allowed, but for other
     locales there is only one character set that will work
     correctly. (On Windows, however, UTF-8 encoding can be used with
     any locale.)"

This is a key difference with Unix when choosing a locale.

As for getting the exact same sort order than Linux, it's not possible but
that's not a Windows-vs-Unix issue. If you used FreeBSD or MacOS X, some
en_US.UTF-8 collation rules  would differ from Linux's libc too, resulting in
a different sort order for certain strings.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org


Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 4:34 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
Despite windows-1252 being a monobyte encoding sharing most
of LATIN1 codes and character set, it does not mean that
English_United States.1252 is limited to this character set.
You may use UTF-8 databases with that locale.

Consider the 2nd paragraph of  "Character Set Support"
in the doc:
http://www.postgresql.org/docs/current/static/multibyte.html

    "For C or POSIX locale, any character set is allowed, but for other
     locales there is only one character set that will work
     correctly. (On Windows, however, UTF-8 encoding can be used with
     any locale.)"

This is a key difference with Unix when choosing a locale.

As for getting the exact same sort order than Linux, it's not possible but
that's not a Windows-vs-Unix issue. If you used FreeBSD or MacOS X, some
en_US.UTF-8 collation rules  would differ from Linux's libc too, resulting in
a different sort order for certain strings.

There is no issue of using windows-1252 with utf8 database. The point of discussion here is sorting order and windows code page for utf8?
The links http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx which I provided earlier has those code pages but creating database with these code pages fail.

Well overall with the discussion so far and whatever search I could over net/community it looks like there is no code page on windows corresponding to what is utf8 of linux. If there is then please let me know?

Conclusion: I have basically decided to have the database encoding UTF8 on both windows and linux. And then set the collation to 'C'.
At least my customers on linux and windows sees the same behavior when sorting. Any gotchas here?

Regards...

Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Thu, Feb 20, 2014 at 3:04 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
Upgrade servers to Linux?  :-P

Actually that's not the solution but running away from it.
There is a heavy footprint of customers and huge market on windows too and so not that easy to migrate and convince in market. 

Regards...

Re: UTF-8 collation on Windows?

From
Gavin Flower
Date:
On 21/02/14 02:04, Dev Kumkar wrote:
On Thu, Feb 20, 2014 at 3:04 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
Upgrade servers to Linux?  :-P

Actually that's not the solution but running away from it.
There is a heavy footprint of customers and huge market on windows too and so not that easy to migrate and convince in market. 

Regards...

I am aware of the heavy presence of Microsoft in the market place and the huge inertia of Microsoft dominated companies (even where management would like to change), hence I was not trying to push upgrading to Linux too strongly for this particular situation - more light hearted exasperation!

None-the-less there are more and more companies making that move - as there are a whole raft of very good reasons to do so.

If the sole reason for going to Linux was the collation problem, then it would probably be considered by most people to be a silly reason.


Cheers,
Gavin


P.S.  Once a Senior Systems Analyst left the company I was working for to become the DP manager of an IT department.  I spoke to him shortly after and he said it was an 'IBM shop' - about ten years later he was at the same place, but he now said it was a 'Microsoft shop'.  The dominant technology that appears set in stone, does eventually change despite the market inertia – my first commercial languages were FORTRAN & COBOL on minicomputers & IBM style mainframes, now I use Java on Linux.

Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Fri, Feb 21, 2014 at 12:14 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
On 21/02/14 02:04, Dev Kumkar wrote:
On Thu, Feb 20, 2014 at 3:04 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
Upgrade servers to Linux?  :-P

Actually that's not the solution but running away from it.
There is a heavy footprint of customers and huge market on windows too and so not that easy to migrate and convince in market. 

Regards...

I am aware of the heavy presence of Microsoft in the market place and the huge inertia of Microsoft dominated companies (even where management would like to change), hence I was not trying to push upgrading to Linux too strongly for this particular situation - more light hearted exasperation!

None-the-less there are more and more companies making that move - as there are a whole raft of very good reasons to do so.

If the sole reason for going to Linux was the collation problem, then it would probably be considered by most people to be a silly reason.


Cheers,
Gavin


P.S.  Once a Senior Systems Analyst left the company I was working for to become the DP manager of an IT department.  I spoke to him shortly after and he said it was an 'IBM shop' - about ten years later he was at the same place, but he now said it was a 'Microsoft shop'.  The dominant technology that appears set in stone, does eventually change despite the market inertia – my first commercial languages were FORTRAN & COBOL on minicomputers & IBM style mainframes, now I use Java on Linux.


Hmm. Don't want to digress here and loose the topic context.
Here would really appreciate if there are any suggestions for UTF-8 collation on Windows?

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/20/2014 11:40 AM, Dev Kumkar wrote:

>
>
> Hmm. Don't want to digress here and loose the topic context.
> Here would really appreciate if there are any suggestions for UTF-8
> collation on Windows?

Well I dug out a Windows machine and tried to get what you wanted, to no
avail. As far as I know there is no UTF8 collation, it is an encoding.
What you want if I am following, is the en_US locale (or equivalent for
another language) on Windows. Anything I tried resolved back to a
Windows code page. So the answer from my tests, is no you cannot match
en_US on Windows.

>
> Regards...



Re: UTF-8 collation on Windows?

From
Dev Kumkar
Date:
On Fri, Feb 21, 2014 at 1:26 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Well I dug out a Windows machine and tried to get what you wanted, to no avail. As far as I know there is no UTF8 collation, it is an encoding. What you want if I am following, is the en_US locale (or equivalent for another language) on Windows. Anything I tried resolved back to a Windows code page. So the answer from my tests, is no you cannot match en_US on Windows.

Thanks for taking time out and looking into it !
Yes all the scenarios we tested didn't work for any of the utf8 code pages specified on MSDN or may be I don't know the correct representation of "language_territory.code".

Regards...

Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/20/2014 11:40 AM, Dev Kumkar wrote:

>
>
> Hmm. Don't want to digress here and loose the topic context.
> Here would really appreciate if there are any suggestions for UTF-8
> collation on Windows?

Just had idea, not sure how feasible it is in your situation though. Run
Postgres in a Linux VM on Windows machines.

>
> Regards...



Re: UTF-8 collation on Windows?

From
Adrian Klaver
Date:
On 02/20/2014 12:27 PM, Dev Kumkar wrote:
> On Fri, Feb 21, 2014 at 1:26 AM, Adrian Klaver
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     Well I dug out a Windows machine and tried to get what you wanted,
>     to no avail. As far as I know there is no UTF8 collation, it is an
>     encoding. What you want if I am following, is the en_US locale (or
>     equivalent for another language) on Windows. Anything I tried
>     resolved back to a Windows code page. So the answer from my tests,
>     is no you cannot match en_US on Windows.
>
>
> Thanks for taking time out and looking into it !
> Yes all the scenarios we tested didn't work for any of the utf8 code
> pages specified on MSDN or may be I don't know the correct
> representation of "language_territory.code".

It seems to be more basic then that. Microsoft has its own locale
mechanism and you will always be redirected back to it.

>
> Regards...