Re: PATCH: CITEXT 2.0 v3 - Mailing list pgsql-hackers
From | David E. Wheeler |
---|---|
Subject | Re: PATCH: CITEXT 2.0 v3 |
Date | |
Msg-id | EC8BD896-825A-4098-9A6E-6024DBF28078@kineticode.com |
In response to | Re: PATCH: CITEXT 2.0 v3 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: PATCH: CITEXT 2.0 v3 (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Jul 14, 2008, at 07:24, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
>> and one for everything else, as described in the last three paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the type works properly with multibyte characters under multibyte-aware locales. So I want to have tests to ensure that such is true by having multibyte characters run under a very specific locale and platform. I don't really care what platform or locale; the point is to make sure that the type is actually multibyte-aware.

> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests. That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different spellings of
> "en_US.UTF8".

<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is there no standardization of locales? I'm sure there are a lot of opinions out there (should all uppercase chars precede all lowercase chars or be mixed in with lowercase chars), but I should think that, in this day and age, there would be some sort of standard defining locales and how they work -- and to allow such opinions to be expressed by different locales, not in the same locale names on different platforms.
</rant>

> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific platform/locale, does CITEXT respect the locale's rules?" I don't care to test all platforms and locales (I'm not *that* stupid :-)).

> In the real world, to the extent that we are able to control the locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform. So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.

Then how do we know that it will continue to be locale-aware over time? Someone could replace the comparison function with one that just lowercases ASCII characters, like CITEXT 1 does, and no tests would fail. How do you prevent that from happening without being hyper-vigilant (and never leaving the project, I might add)?

> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind
> of test. There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to happen anytime soon? Now that I've written CITEXT, I dare say that such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I retain the right to be opinionated, however. ;-)

Best,

David
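The regression concern raised above (an ASCII-only comparison quietly replacing a multibyte-aware one while ASCII-only tests keep passing) can be illustrated with a rough sketch. This is not CITEXT's code: the function names are hypothetical, and Python's `str.lower()` is Unicode-aware rather than locale-driven, so it only stands in for a locale-aware lowercase here.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ci_equal_ascii(a: str, b: str) -> bool:
    """Case-insensitive equality that only understands ASCII letters."""
    return ascii_lower(a) == ascii_lower(b)


def ci_equal_unicode(a: str, b: str) -> bool:
    """Case-insensitive equality using Unicode-aware lowercasing
    (a stand-in for a locale-aware comparison)."""
    return a.lower() == b.lower()


# An ASCII-only test cannot tell the two implementations apart.
assert ci_equal_ascii("PostgreSQL", "postgresql")
assert ci_equal_unicode("PostgreSQL", "postgresql")

# A multibyte test can: U+00C0 (A with grave) vs. U+00E0 (a with grave).
upper, lower = "\u00c0", "\u00e0"
assert ci_equal_unicode(upper, lower)
assert not ci_equal_ascii(upper, lower)
```

Only the last pair of assertions would catch a swap to the ASCII-only version, which is the kind of check a test suite restricted to the "C" locale and ASCII data cannot express.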