Thread: Re: Win32 unicode vs ICU
.. back home again after a couple of days .. <snip> > >> I am unsure of how to proceed. As I see it there are three paths: > >> 1) Use native win32 functionality only on win32 > >> 2) Use ICU functionality only on win32 > >> 3) Allow both ICU and native functionality, compile time > >> switch --with-icu (same as unix with the ICU patch) <snip> > I feel it makes sense to apply the smaller patch in any case, > so that there's a Win32 solution not requiring ICU (ie, I > can't see an argument for doing (2) rather than (3)). > > Comments? Sounds reasonable to me - couldn't really find a reasonable argument for (2), but it was an option :-) Though as it seems to be needed on FreeBSD as well, we should definitly look at (3) as a long-term option. Considering Palle has been running it in production for quite a while now IIRC, the ICU part should be fairly stable. but see below... > > And anohter question - my native patch touches the same > functions as > > the ICU patch. Can somebody who knows the internals confirm or deny > > that these are all the required locations, or do we need to modify > > more? > > There is a strxfrm() call in > src/backend/utils/adt/selfuncs.c, which probably needs to be > looked at too. Ok. Will look into that. Do you have a hint as to how to test that? Considering I've been unable to show incorrect function without donig it, and apparantly so has Palle (with the ICU patch) running it in production for a long time, I clearly don't know how to provoke a failure with the current code. Which brings up another point - there are clearly no regression tests for this (considering we missed the unicode stuff early in the 8.0 cycle). Is there a reasonable way to add something along this line that could be used, or will that be too complex? (I guess we can never test completely as it depends on the OS locale handling, but some?) I'm thinking it might be too complex as we'd need a new initdb with a different encoding, but perhaps it's worth it anyway? //Magnus
"Magnus Hagander" <mha@sollentuna.net> writes: >> There is a strxfrm() call in >> src/backend/utils/adt/selfuncs.c, which probably needs to be >> looked at too. > Ok. Will look into that. Do you have a hint as to how to test that? Any problems would manifest as a bogus interpolation between histogram elements for a scalar-inequality selectivity estimate in a text column. For instance, if you insert all 676 2-letter combinations AA, AB, AC, ..., ZY, ZZ into a text column, ANALYZE, and then try cases like "EXPLAIN SELECT * FROM tab WHERE col < 'QW'", ideally the row estimate should be pretty nearly dead on. Being pure-ASCII this test would probably still work in a broken Unicode context, but if you did a similar experiment with 26 non-ASCII characters it would be likely to come out with silly results. You could increase the obviousness of the bad result by reducing the statistics target, since the silliness will be bounded by the histogram bin size. (Just looking at it again, the code in convert_string_to_scalar is pretty bogus for multibyte encodings in any case. Possibly we need to rethink the whole approach.) > Which brings up another point - there are clearly no regression tests > for this (considering we missed the unicode stuff early in the 8.0 > cycle). src/test/locale? src/test/mb? I've never used either, but they're there ... regards, tom lane
I wrote: > (Just looking at it again, the code in convert_string_to_scalar is > pretty bogus for multibyte encodings in any case. Possibly we need to > rethink the whole approach.) After studying this some more, I think the code is really so bogus for any non-ASCII situation that it's probably not worth worrying about too much. It's effectively assuming that the output of strxfrm() is still in an ASCII-superset encoding ... but I don't see anything in strxfrm's API that guarantees any such thing. As long as strxfrm() doesn't fail completely for Windows Unicode, I'd recommend just leaving this alone. As previously noted, the worst that can happen is an estimation error that's bounded by the histogram bin size anyhow. regards, tom lane
I was reviewing this thread about its lack of collation support in freebsd. As some of you may or may not know the PHP project is also currently working heavily on unicode support. (For PHP6) I had the chance to ask Andrei Zmievski of the php project about their support for unicode. The key items are as follows. <StormTide> with the new unicode support, is there any support for unicode collation <andrei> StormTide, there will be <StormTide> is it imported by the platform or custom done for php <StormTide> (cuz freebsd seeems to have issues with its collation support) <andrei> StormTide, not OS-dependent <andrei> StormTide, uses CLDR Should the postgresql project also be looking at CLDR for cross-platform unicode support? http://www.unicode.org/cldr/ Kevin McArthur Digifonica Canada ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Magnus Hagander" <mha@sollentuna.net> Cc: <pgsql-hackers@postgresql.org>; "Palle Girgensohn" <girgen@pingpong.net> Sent: Tuesday, August 23, 2005 9:03 AM Subject: Re: [HACKERS] Win32 unicode vs ICU >I wrote: >> (Just looking at it again, the code in convert_string_to_scalar is >> pretty bogus for multibyte encodings in any case. Possibly we need to >> rethink the whole approach.) > > After studying this some more, I think the code is really so bogus for > any non-ASCII situation that it's probably not worth worrying about > too much. It's effectively assuming that the output of strxfrm() is > still in an ASCII-superset encoding ... but I don't see anything in > strxfrm's API that guarantees any such thing. > > As long as strxfrm() doesn't fail completely for Windows Unicode, > I'd recommend just leaving this alone. As previously noted, the > worst that can happen is an estimation error that's bounded by the > histogram bin size anyhow. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >