Thread: FW: Win32 unicode vs ICU

FW: Win32 unicode vs ICU

From
"Magnus Hagander"
Date:
I just realised this mail didn't go through. Probably because it was too
large for -hackers. So: repost to -patches. Sorry about that. If it's a
duplicate, even more sorry, but I couldn't find it in the archives.

(This may explain that nobody answered me :P)

//Magnus


> -----Original Message-----
> From: Magnus Hagander
> Sent: Sunday, July 31, 2005 2:09 PM
> To: PostgreSQL-development
> Cc: pgsql-hackers-win32@postgresql.org
> Subject: Win32 unicode vs ICU
>
> Hi!
>
> I've been working with Palles ICU patch to make it work on
> win32, and I believe I have it done. While doing it I noticed
> that ICU basically converts to UTF16 and back - I previously
> thought it worked on UTF8 strings. Based on this I also tried
> out an implementation for the win32-unicode problem that does
> *not* require ICU. It uses the win32 native functions to map
> to utf16 and back, and then to process the text there. And I
> got through with much less code than the ICU version, while
> doing the same thing.
>
> I am unsure of how to proceed. As I see it there are three paths:
> 1) Use native win32 functionality only on win32
> 2) Use ICU functionality only on win32
> 3) Allow both ICU and native functionality, compile time
> switch --with-icu (same as unix with the ICU patch)
>
>
> The main downsides of ICU vs the native ones are:
> * ICU does not accept win32 locale names. When doing
> setlocale("sv_se"), for example, win32 will return this in
> later calls as "Swedish_Sweden.1252". To get around this in
> the ICU patch, I had to implement a lookup map that converts
> it back to sv_se for ICU.
>
> * ICU is yet another build and runtime dependency, and a
> large one (comes in at 11Mb for the DLL files alone in the
> win32 download)
>
>
> I guess that the main upside of it is that we'd get
> constistent behaviour - in case there are issues with either
> ICU or win32 native they'd otherwise differ. And only one new
> codepath. But we already live with the platform-inconsistency today...
>
> Another upside is that it handles more encodings in ICU - my
> native implementation does *only* UTF8 and relies on existing
> functionality to deal with other encodings. It could of
> course be extended if necessary, but from what I can tell
> UTF8 is the big one.
>
>
>
> I have attached both patches. For the native version, only
> win32_utf8.patch is required.  For the ICU version,
> icu_win32.patch is needed and also the files
> localemap.c,localemap.pl, iso639 and iso3166 needs to go in
> src/backend/port/win32. (the localemap needs to be updated to
> do a better-than-linear search, but I wanted to include an example)
>
>
> Thoughts on the options?
>
>
> And anohter question - my native patch touches the same
> functions as the ICU patch. Can somebody who knows the
> internals confirm or deny that these are all the required
> locations, or do we need to modify more?
>
> (I have run simple tests in swedish locale and both behave
> the same and correct, but I'm unsure of exactly how much
> would be affected)
>
> Finally, the win32 patch also changes the normal path to use
> strncoll(). The comment above the function states that we'd
> like to use strncoll but it's not available. Well, on win32
> it is, so it should provide a speedup on win32. It is
> currently not included in the ICU patch, but should probably
> be included whichever path we'd chose.
>
>
> //Magnus
>

Attachment

Re: FW: Win32 unicode vs ICU

From
Bruce Momjian
Date:
This has been saved for the 8.2 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Magnus Hagander wrote:
> I just realised this mail didn't go through. Probably because it was too
> large for -hackers. So: repost to -patches. Sorry about that. If it's a
> duplicate, even more sorry, but I couldn't find it in the archives.
>
> (This may explain that nobody answered me :P)
>
> //Magnus
>
>
> > -----Original Message-----
> > From: Magnus Hagander
> > Sent: Sunday, July 31, 2005 2:09 PM
> > To: PostgreSQL-development
> > Cc: pgsql-hackers-win32@postgresql.org
> > Subject: Win32 unicode vs ICU
> >
> > Hi!
> >
> > I've been working with Palles ICU patch to make it work on
> > win32, and I believe I have it done. While doing it I noticed
> > that ICU basically converts to UTF16 and back - I previously
> > thought it worked on UTF8 strings. Based on this I also tried
> > out an implementation for the win32-unicode problem that does
> > *not* require ICU. It uses the win32 native functions to map
> > to utf16 and back, and then to process the text there. And I
> > got through with much less code than the ICU version, while
> > doing the same thing.
> >
> > I am unsure of how to proceed. As I see it there are three paths:
> > 1) Use native win32 functionality only on win32
> > 2) Use ICU functionality only on win32
> > 3) Allow both ICU and native functionality, compile time
> > switch --with-icu (same as unix with the ICU patch)
> >
> >
> > The main downsides of ICU vs the native ones are:
> > * ICU does not accept win32 locale names. When doing
> > setlocale("sv_se"), for example, win32 will return this in
> > later calls as "Swedish_Sweden.1252". To get around this in
> > the ICU patch, I had to implement a lookup map that converts
> > it back to sv_se for ICU.
> >
> > * ICU is yet another build and runtime dependency, and a
> > large one (comes in at 11Mb for the DLL files alone in the
> > win32 download)
> >
> >
> > I guess that the main upside of it is that we'd get
> > constistent behaviour - in case there are issues with either
> > ICU or win32 native they'd otherwise differ. And only one new
> > codepath. But we already live with the platform-inconsistency today...
> >
> > Another upside is that it handles more encodings in ICU - my
> > native implementation does *only* UTF8 and relies on existing
> > functionality to deal with other encodings. It could of
> > course be extended if necessary, but from what I can tell
> > UTF8 is the big one.
> >
> >
> >
> > I have attached both patches. For the native version, only
> > win32_utf8.patch is required.  For the ICU version,
> > icu_win32.patch is needed and also the files
> > localemap.c,localemap.pl, iso639 and iso3166 needs to go in
> > src/backend/port/win32. (the localemap needs to be updated to
> > do a better-than-linear search, but I wanted to include an example)
> >
> >
> > Thoughts on the options?
> >
> >
> > And anohter question - my native patch touches the same
> > functions as the ICU patch. Can somebody who knows the
> > internals confirm or deny that these are all the required
> > locations, or do we need to modify more?
> >
> > (I have run simple tests in swedish locale and both behave
> > the same and correct, but I'm unsure of exactly how much
> > would be affected)
> >
> > Finally, the win32 patch also changes the normal path to use
> > strncoll(). The comment above the function states that we'd
> > like to use strncoll but it's not available. Well, on win32
> > it is, so it should provide a speedup on win32. It is
> > currently not included in the ICU patch, but should probably
> > be included whichever path we'd chose.
> >
> >
> > //Magnus
> >

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: FW: Win32 unicode vs ICU

From
Bruce Momjian
Date:
Is this patch moving toward competion?

---------------------------------------------------------------------------

Magnus Hagander wrote:
> I just realised this mail didn't go through. Probably because it was too
> large for -hackers. So: repost to -patches. Sorry about that. If it's a
> duplicate, even more sorry, but I couldn't find it in the archives.
>
> (This may explain that nobody answered me :P)
>
> //Magnus
>
>
> > -----Original Message-----
> > From: Magnus Hagander
> > Sent: Sunday, July 31, 2005 2:09 PM
> > To: PostgreSQL-development
> > Cc: pgsql-hackers-win32@postgresql.org
> > Subject: Win32 unicode vs ICU
> >
> > Hi!
> >
> > I've been working with Palles ICU patch to make it work on
> > win32, and I believe I have it done. While doing it I noticed
> > that ICU basically converts to UTF16 and back - I previously
> > thought it worked on UTF8 strings. Based on this I also tried
> > out an implementation for the win32-unicode problem that does
> > *not* require ICU. It uses the win32 native functions to map
> > to utf16 and back, and then to process the text there. And I
> > got through with much less code than the ICU version, while
> > doing the same thing.
> >
> > I am unsure of how to proceed. As I see it there are three paths:
> > 1) Use native win32 functionality only on win32
> > 2) Use ICU functionality only on win32
> > 3) Allow both ICU and native functionality, compile time
> > switch --with-icu (same as unix with the ICU patch)
> >
> >
> > The main downsides of ICU vs the native ones are:
> > * ICU does not accept win32 locale names. When doing
> > setlocale("sv_se"), for example, win32 will return this in
> > later calls as "Swedish_Sweden.1252". To get around this in
> > the ICU patch, I had to implement a lookup map that converts
> > it back to sv_se for ICU.
> >
> > * ICU is yet another build and runtime dependency, and a
> > large one (comes in at 11Mb for the DLL files alone in the
> > win32 download)
> >
> >
> > I guess that the main upside of it is that we'd get
> > constistent behaviour - in case there are issues with either
> > ICU or win32 native they'd otherwise differ. And only one new
> > codepath. But we already live with the platform-inconsistency today...
> >
> > Another upside is that it handles more encodings in ICU - my
> > native implementation does *only* UTF8 and relies on existing
> > functionality to deal with other encodings. It could of
> > course be extended if necessary, but from what I can tell
> > UTF8 is the big one.
> >
> >
> >
> > I have attached both patches. For the native version, only
> > win32_utf8.patch is required.  For the ICU version,
> > icu_win32.patch is needed and also the files
> > localemap.c,localemap.pl, iso639 and iso3166 needs to go in
> > src/backend/port/win32. (the localemap needs to be updated to
> > do a better-than-linear search, but I wanted to include an example)
> >
> >
> > Thoughts on the options?
> >
> >
> > And anohter question - my native patch touches the same
> > functions as the ICU patch. Can somebody who knows the
> > internals confirm or deny that these are all the required
> > locations, or do we need to modify more?
> >
> > (I have run simple tests in swedish locale and both behave
> > the same and correct, but I'm unsure of exactly how much
> > would be affected)
> >
> > Finally, the win32 patch also changes the normal path to use
> > strncoll(). The comment above the function states that we'd
> > like to use strncoll but it's not available. Well, on win32
> > it is, so it should provide a speedup on win32. It is
> > currently not included in the ICU patch, but should probably
> > be included whichever path we'd chose.
> >
> >
> > //Magnus
> >

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

Re: FW: Win32 unicode vs ICU

From
"Magnus Hagander"
Date:
Not from me. I'm not working on it anymore. My work on it was directed
to making unicode work on win32, which was accomplished without ICU.

The original patch was to fix it for FreeBSD. But I think it was put
back pending an actual decision on how to deal with multi-locale
situations, since we'd want to pick a solution that will work for that.


//Magnus


>
> Is this patch moving toward competion?
>
> --------------------------------------------------------------
> -------------
>
> Magnus Hagander wrote:
> > I just realised this mail didn't go through. Probably
> because it was
> > too large for -hackers. So: repost to -patches. Sorry about
> that. If
> > it's a duplicate, even more sorry, but I couldn't find it
> in the archives.
> >
> > (This may explain that nobody answered me :P)
> >
> > //Magnus
> >
> >
> > > -----Original Message-----
> > > From: Magnus Hagander
> > > Sent: Sunday, July 31, 2005 2:09 PM
> > > To: PostgreSQL-development
> > > Cc: pgsql-hackers-win32@postgresql.org
> > > Subject: Win32 unicode vs ICU
> > >
> > > Hi!
> > >
> > > I've been working with Palles ICU patch to make it work on win32,
> > > and I believe I have it done. While doing it I noticed that ICU
> > > basically converts to UTF16 and back - I previously thought it
> > > worked on UTF8 strings. Based on this I also tried out an
> > > implementation for the win32-unicode problem that does
> > > *not* require ICU. It uses the win32 native functions to map to
> > > utf16 and back, and then to process the text there. And I got
> > > through with much less code than the ICU version, while doing the
> > > same thing.
> > >
> > > I am unsure of how to proceed. As I see it there are three paths:
> > > 1) Use native win32 functionality only on win32
> > > 2) Use ICU functionality only on win32
> > > 3) Allow both ICU and native functionality, compile time switch
> > > --with-icu (same as unix with the ICU patch)
> > >
> > >
> > > The main downsides of ICU vs the native ones are:
> > > * ICU does not accept win32 locale names. When doing
> > > setlocale("sv_se"), for example, win32 will return this in later
> > > calls as "Swedish_Sweden.1252". To get around this in the
> ICU patch,
> > > I had to implement a lookup map that converts it back to
> sv_se for
> > > ICU.
> > >
> > > * ICU is yet another build and runtime dependency, and a
> large one
> > > (comes in at 11Mb for the DLL files alone in the
> > > win32 download)
> > >
> > >
> > > I guess that the main upside of it is that we'd get constistent
> > > behaviour - in case there are issues with either ICU or
> win32 native
> > > they'd otherwise differ. And only one new codepath. But
> we already
> > > live with the platform-inconsistency today...
> > >
> > > Another upside is that it handles more encodings in ICU -
> my native
> > > implementation does *only* UTF8 and relies on existing
> functionality
> > > to deal with other encodings. It could of course be extended if
> > > necessary, but from what I can tell
> > > UTF8 is the big one.
> > >
> > >
> > >
> > > I have attached both patches. For the native version, only
> > > win32_utf8.patch is required.  For the ICU version,
> icu_win32.patch
> > > is needed and also the files localemap.c,localemap.pl, iso639 and
> > > iso3166 needs to go in src/backend/port/win32. (the
> localemap needs
> > > to be updated to do a better-than-linear search, but I wanted to
> > > include an example)
> > >
> > >
> > > Thoughts on the options?
> > >
> > >
> > > And anohter question - my native patch touches the same
> functions as
> > > the ICU patch. Can somebody who knows the internals
> confirm or deny
> > > that these are all the required locations, or do we need
> to modify
> > > more?
> > >
> > > (I have run simple tests in swedish locale and both
> behave the same
> > > and correct, but I'm unsure of exactly how much would be affected)
> > >
> > > Finally, the win32 patch also changes the normal path to use
> > > strncoll(). The comment above the function states that
> we'd like to
> > > use strncoll but it's not available. Well, on win32 it is, so it
> > > should provide a speedup on win32. It is currently not
> included in
> > > the ICU patch, but should probably be included whichever
> path we'd
> > > chose.
> > >
> > >
> > > //Magnus
> > >
>
> Content-Description: win32_utf8.patch
>
> [ Attachment, skipping... ]
>
> Content-Description: icu_win32.patch
>
> [ Attachment, skipping... ]
>
> Content-Description: localemap.pl
>
> [ Attachment, skipping... ]
>
> Content-Description: localemap.c
>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of
> > broadcast)---------------------------
> > TIP 4: Have you searched our list archives?
> >
> >                http://archives.postgresql.org
>
> --
>   Bruce Momjian   http://candle.pha.pa.us
>   SRA OSS, Inc.   http://www.sraoss.com
>
>   + If your life is a hard drive, Christ can be your backup. +
>

Re: FW: Win32 unicode vs ICU

From
Bruce Momjian
Date:
Added to TODO.detail.

---------------------------------------------------------------------------

Magnus Hagander wrote:
> I just realised this mail didn't go through. Probably because it was too
> large for -hackers. So: repost to -patches. Sorry about that. If it's a
> duplicate, even more sorry, but I couldn't find it in the archives.
>
> (This may explain that nobody answered me :P)
>
> //Magnus
>
>
> > -----Original Message-----
> > From: Magnus Hagander
> > Sent: Sunday, July 31, 2005 2:09 PM
> > To: PostgreSQL-development
> > Cc: pgsql-hackers-win32@postgresql.org
> > Subject: Win32 unicode vs ICU
> >
> > Hi!
> >
> > I've been working with Palles ICU patch to make it work on
> > win32, and I believe I have it done. While doing it I noticed
> > that ICU basically converts to UTF16 and back - I previously
> > thought it worked on UTF8 strings. Based on this I also tried
> > out an implementation for the win32-unicode problem that does
> > *not* require ICU. It uses the win32 native functions to map
> > to utf16 and back, and then to process the text there. And I
> > got through with much less code than the ICU version, while
> > doing the same thing.
> >
> > I am unsure of how to proceed. As I see it there are three paths:
> > 1) Use native win32 functionality only on win32
> > 2) Use ICU functionality only on win32
> > 3) Allow both ICU and native functionality, compile time
> > switch --with-icu (same as unix with the ICU patch)
> >
> >
> > The main downsides of ICU vs the native ones are:
> > * ICU does not accept win32 locale names. When doing
> > setlocale("sv_se"), for example, win32 will return this in
> > later calls as "Swedish_Sweden.1252". To get around this in
> > the ICU patch, I had to implement a lookup map that converts
> > it back to sv_se for ICU.
> >
> > * ICU is yet another build and runtime dependency, and a
> > large one (comes in at 11Mb for the DLL files alone in the
> > win32 download)
> >
> >
> > I guess that the main upside of it is that we'd get
> > constistent behaviour - in case there are issues with either
> > ICU or win32 native they'd otherwise differ. And only one new
> > codepath. But we already live with the platform-inconsistency today...
> >
> > Another upside is that it handles more encodings in ICU - my
> > native implementation does *only* UTF8 and relies on existing
> > functionality to deal with other encodings. It could of
> > course be extended if necessary, but from what I can tell
> > UTF8 is the big one.
> >
> >
> >
> > I have attached both patches. For the native version, only
> > win32_utf8.patch is required.  For the ICU version,
> > icu_win32.patch is needed and also the files
> > localemap.c,localemap.pl, iso639 and iso3166 needs to go in
> > src/backend/port/win32. (the localemap needs to be updated to
> > do a better-than-linear search, but I wanted to include an example)
> >
> >
> > Thoughts on the options?
> >
> >
> > And anohter question - my native patch touches the same
> > functions as the ICU patch. Can somebody who knows the
> > internals confirm or deny that these are all the required
> > locations, or do we need to modify more?
> >
> > (I have run simple tests in swedish locale and both behave
> > the same and correct, but I'm unsure of exactly how much
> > would be affected)
> >
> > Finally, the win32 patch also changes the normal path to use
> > strncoll(). The comment above the function states that we'd
> > like to use strncoll but it's not available. Well, on win32
> > it is, so it should provide a speedup on win32. It is
> > currently not included in the ICU patch, but should probably
> > be included whichever path we'd chose.
> >
> >
> > //Magnus
> >

Content-Description: win32_utf8.patch

[ Attachment, skipping... ]

Content-Description: icu_win32.patch

[ Attachment, skipping... ]

Content-Description: localemap.pl

[ Attachment, skipping... ]

Content-Description: localemap.c

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +