Re: PATCH: CITEXT 2.0 v3 - Mailing list pgsql-hackers

From David E. Wheeler
Subject Re: PATCH: CITEXT 2.0 v3
Msg-id EC8BD896-825A-4098-9A6E-6024DBF28078@kineticode.com
In response to Re: PATCH: CITEXT 2.0 v3  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PATCH: CITEXT 2.0 v3  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Jul 14, 2008, at 07:24, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
>> and one for everything else, as described in the last three paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the  
type works properly with multibyte characters under multibyte-aware  
locales. So I want to have tests to ensure that such is true by having  
multibyte characters run under a very specific locale and platform. I  
don't really care what platform or locale; the point is to make sure  
that the type is actually multibyte-aware.

> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests.  That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different spellings of
> "en_US.UTF8".

<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is
there no standardization of locales? I'm sure there are a lot of
opinions out there (should all uppercase chars precede all lowercase
chars, or be mixed in with them?), but I should think that, in this day
and age, there would be some sort of standard defining locales and how
they work, one that lets such opinions be expressed as different
locales rather than as different behaviors under the same locale name
on different platforms.
</rant>

> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific  
platform/locale, does CITEXT respect the locale's rules?" I don't care  
to test all platforms and locales (I'm not *that* stupid :-)).
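
As a rough illustration of the "one specific platform/locale" idea (not
part of the patch, and not how the PostgreSQL regression harness works),
here is a Python sketch of a check that runs only when a named locale is
actually installed and is skipped otherwise. The helper name
sort_under_locale is hypothetical, and collation stands in here for
whatever locale rule is being checked.

```python
import locale


def sort_under_locale(words, locale_name):
    """Sort words by locale_name's collation; return None if it is unavailable."""
    # Calling setlocale with no value only queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The locale is not installed on this machine: skip instead of failing.
        return None
    try:
        # strxfrm maps each string to a key that sorts per the active locale.
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the process-wide setting so other code is unaffected.
        locale.setlocale(locale.LC_COLLATE, previous)
```

A caller would compare the result against an expected ordering only when
it is not None, so machines without that locale neither pass nor fail.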

> In the real world, to the extent that we are able to control the locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform.  So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.

Then how do we know that it will continue to be locale-aware over  
time? Someone could replace the comparison function with one that just  
lowercases ASCII characters, like CITEXT 1 does, and no tests would  
fail. How do you prevent that from happening without being hyper- 
vigilant (and never leaving the project, I might add)?
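
To make that failure mode concrete, here is a minimal Python sketch (an
illustration only, not CITEXT's C code). It assumes Python's
Unicode-aware str.lower() as a stand-in for a locale-aware lowercase,
and the function names are hypothetical. A test suite containing only
ASCII data cannot tell the two comparisons apart; a single non-ASCII
pair can.

```python
def ascii_lower(s: str) -> str:
    """Fold only A-Z to a-z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ci_equal_ascii(a: str, b: str) -> bool:
    """Case-insensitive equality that only understands ASCII letters."""
    return ascii_lower(a) == ascii_lower(b)


def ci_equal_unicode(a: str, b: str) -> bool:
    """Case-insensitive equality using Unicode-aware lowercasing."""
    return a.lower() == b.lower()


# ASCII-only input: both implementations agree, so such a test proves nothing.
print(ci_equal_ascii("ABC", "abc"), ci_equal_unicode("ABC", "abc"))

# Non-ASCII input (U+00C0 vs U+00E0): only the Unicode-aware version matches.
print(ci_equal_ascii("\u00c0", "\u00e0"), ci_equal_unicode("\u00c0", "\u00e0"))
```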

> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind
> of test.  There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to  
happen anytime soon? Now that I've written CITEXT, I dare say that  
such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I  
retain the right to be opinionated, however. ;-)

Best,

David


