Thread: Making the regression tests locale-proof


From: Peter Eisentraut

Since locale support is now enabled by default, it is desirable that the
regression tests can pass if the cluster's locale is not C.

As a first step I have included the following statements in pg_regress
right after the database is created:

alter database "$dbname" set lc_messages to 'C';
alter database "$dbname" set lc_monetary to 'C';
alter database "$dbname" set lc_numeric to 'C';
alter database "$dbname" set lc_time to 'C';
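The four statements above follow one pattern, so a driver could generate them rather than spell them out. The following is only a sketch under that assumption; the helper names `quote_ident` and `locale_reset_statements` are hypothetical and are not part of pg_regress.

```python
from typing import List

# The locale categories that the statements above pin to 'C'.
LC_SETTINGS = ("lc_messages", "lc_monetary", "lc_numeric", "lc_time")


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def locale_reset_statements(dbname: str) -> List[str]:
    """Build one ALTER DATABASE statement per locale category (hypothetical helper)."""
    return [
        f"alter database {quote_ident(dbname)} set {setting} to 'C';"
        for setting in LC_SETTINGS
    ]


if __name__ == "__main__":
    for statement in locale_reset_statements("regression"):
        print(statement)
```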

This gets rid of a boatload of failures related to number formatting.
For that purpose I have changed the permissions on these options to
USERSET.  (I'm still debating making lc_messages SUSET, because otherwise
users can screw with admins by changing the language of the log output all
the time.  Comments?)

The remaining issue is the sort order.  I think this can be solved for
practical purposes by creating two expected files for each affected test,
say char.out and char-locale.out.  The regression test driver would try
the first one, if that fails try the second one.

The assumption here is that all locales will choose the same sort order as
long as they're dealing only with the core 26 letters.  This does not have
to be true in theory, but I think it works for the vast majority of
practical cases.

We could also cut down the number of affected tests by making the
select_implicit and select_having not use mixed-case strings in the test
tables.  Then we have only char, varchar, and select_views left.

Comments?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Making the regression tests locale-proof

From: teg@redhat.com (Trond Eivind Glomsrød)

Peter Eisentraut <peter_e@gmx.net> writes:

> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters.  This does not have
> to be true in theory, but I think it works for the vast majority of
> practical cases.


Not for uppercase vs. lowercase versions of them.

With no locale used (straight ASCII), you get A C b; with a locale,
you'll get A b C.
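The two orderings can be imitated in a few lines of Python. This is only a sketch: the case-folding key is a stand-in for what a typical non-C locale does with these three letters, not the C library's actual collation (which is what `strcoll` would apply, and which depends on the locales installed on the machine).

```python
words = ["b", "A", "C"]

# Plain code-point comparison, as in the C locale: all uppercase ASCII
# letters sort before all lowercase ones.
codepoint_order = sorted(words)

# Stand-in for a locale-aware ordering (an assumption, not real strcoll):
# compare letters case-insensitively first, then break ties by code point.
locale_like_order = sorted(words, key=lambda w: (w.casefold(), w))

print(codepoint_order)
print(locale_like_order)
```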

-- 
Trond Eivind Glomsrød
Red Hat, Inc.


Re: Making the regression tests locale-proof

From: Hannu Krosing

On Sat, 2002-05-11 at 02:25, Peter Eisentraut wrote:
> The remaining issue is the sort order.  I think this can be solved for
> practical purposes by creating two expected files for each affected test,
> say char.out and char-locale.out.  The regression test driver would try
> the first one, if that fails try the second one.
> 
> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters.  This does not have
> to be true in theory, but I think it works for the vast majority of
> practical cases.

The et_EE locale has the following order for the "core 26 letters"
(_ marks other letters):

ABCDEFGHIJKLMNOPQRS_Z_TUVW____XY  (notice position of Z)

and I'm not sure if V and W are distinguished when sorting words that
have anything after them.

I've heard that in some other locales there are other weird behaviours
(like sorting one or two of the same letter as equivalent)
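The effect of the et_EE ordering quoted above can be sketched with a custom sort key. The alphabet string below is the quoted sequence with the placeholder positions dropped; the sample words are made up for illustration, and the sketch handles only the 26 unaccented letters.

```python
# Order of the 26 basic letters as given above for et_EE, placeholders removed.
# Note that Z sits between S and T.
ET_ORDER = "ABCDEFGHIJKLMNOPQRSZTUVWXY"
RANK = {letter: position for position, letter in enumerate(ET_ORDER)}


def et_like_key(word: str):
    """Map each letter to its rank in ET_ORDER (A-Z only; other input raises KeyError)."""
    return [RANK[letter] for letter in word.upper()]


words = ["TALU", "ZOO", "SAUN"]  # illustrative sample data

c_order = sorted(words)                     # code-point order: Z last
et_order = sorted(words, key=et_like_key)   # Z before T

print(c_order)
print(et_order)
```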

------------
Hannu




Re: Making the regression tests locale-proof

From: Tom Lane

Peter Eisentraut <peter_e@gmx.net> writes:
> For that purpose I have changed the permissions on these options to
> USERSET.  (I'm still debating making lc_messages SUSET, because otherwise
> users can screw with admins by changing the language of the log output all
> the time.  Comments?)

Hm.  Don't the regression tests already assume they are run by the
superuser?  They've got create/drop user commands in them.  So I'd
say SUSET is fine from the point of view of the tests, and I agree
with your concern about making the logs unreadable.

> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters.

Nope.  For instance, on HPUX I get this sort order in English:

$ LANG=en_US.iso88591 sort testll
eix
ela
ella
ellm
elm
eln
enx

and this in Spanish:

$ LANG=es_ES.iso88591 sort testll
eix
ela
elm
eln
ella
ellm
enx

because the Spanish treat LL as a single collating element.  (Actually,
my very-rusty recollection is that they sort LL the same as one L, which
would mean that HPUX's behavior is not quite right here: it's treating
LL as one symbol that sorts after L.  Linux seems to have no clue that
LL is special at all though...)
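The difference between the two listings can be reproduced with a sort key that treats "ll" as one collating element ranked just after "l". This is a sketch of that single rule only, under the assumption that nothing else differs; it is not a model of any real es_ES locale definition.

```python
def ll_as_one_element_key(word: str):
    """Split a word into collating elements, with 'll' as one element after 'l'."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ll", i):
            key.append(("l", 1))  # digraph: sorts after a single 'l', before 'm'
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


words = ["ella", "eln", "eix", "elm", "enx", "ellm", "ela"]

plain_order = sorted(words)                               # like the en_US listing
digraph_order = sorted(words, key=ll_as_one_element_key)  # like the es_ES listing

print(plain_order)
print(digraph_order)
```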

> We could also cut down the number of affected tests by making the
> select_implicit and select_having not use mixed-case strings in the test
> tables.  Then we have only char, varchar, and select_views left.

In practice we could perhaps use test data that doesn't hit any of the
special cases in the popular languages.  But I wonder whether this would
not be shirking our responsibility as testers.  Seems like if you avoid
exercising these kinds of cases, you avoid finding corner-case bugs.

        regards, tom lane


Re: Making the regression tests locale-proof

From: Alvaro Herrera

Tom Lane wrote:

> Peter Eisentraut <peter_e@gmx.net> writes:

> > The assumption here is that all locales will choose the same sort order as
> > long as they're dealing only with the core 26 letters.
> 
> Nope.  For instance, on HPUX I get this sort order in English:
[...]

> because the Spanish treat LL as a single collating element.  (Actually,
> my very-rusty recollection is that they sort LL the same as one L, which
> would mean that HPUX's behavior is not quite right here: it's treating
> LL as one symbol that sorts after L.  Linux seems to have no clue that
> LL is special at all though...)

HPUX's behaviour is broken, because in Spanish LL (as well as CH)
stopped being a special symbol some five years ago (it used to be
treated as one collating element sorted after "L", so HPUX's behaviour
was right then).


> > We could also cut down the number of affected tests by making the
> > select_implicit and select_having not use mixed-case strings in the test
> > tables.  Then we have only char, varchar, and select_views left.

Maybe it would be better to prepare various results, one for each of a
subset of the locales supported (C, en_EN, some other "western" and
maybe a couple multibyte?). That way at least you make sure the C
library is working as expected.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"It never ceases to be humiliating for a person of wit to know
that there is no fool who cannot teach him something." (Jean B. Say)



Re: Making the regression tests locale-proof

From: Tom Lane

Alvaro Herrera <alvherre@atentus.com> writes:
> HPUX's behaviour is broken, because in spanish LL (as well as CH)
> stopped being a special symbol some five years ago (it used to be
> treated as one collating element sorted after "L", so HPUX behaviour was
> right then).

Well, this is an old release ;-) ... the localedef files are dated
around 1996.  (And you don't want to know how long it's been since
I could speak passable Spanish.)

In any case, the fact that the official rules have changed does not
invalidate my point: there are systems on which the assumption Peter
wants to make will fail.

        regards, tom lane


Re: Making the regression tests locale-proof

From: Peter Eisentraut

Tom Lane writes:

> In practice we could perhaps use test data that doesn't hit any of the
> special cases in the popular languages.  But I wonder whether this would
> not be shirking our responsibility as testers.  Seems like if you avoid
> exercising these kinds of cases, you avoid finding corner-case bugs.

There is a locale test suite under src/test/locale, which isn't very well
known currently.  There we can test the collation order in the wildest
extremes for any particular locale.  For the main test suite, I think we
can boldly assume that if sorting works at all then it would also work
equally well if more complicated strings were substituted, since the
actual collating isn't done by us anyway.

What I'm thinking now is to simply collect a number of possible results
and store expected files char_0.out, char_1.out, etc. and have the driver
try all of these, basically meaning "any of these may be right".
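The "any of these may be right" rule might look roughly like this. It is a sketch only: the function name, the directory layout, and the exact-match comparison are assumptions for illustration, and the real driver would more likely compare with diff.

```python
from pathlib import Path
from typing import Optional


def find_matching_expected(actual: str, expected_dir: Path, test: str) -> Optional[Path]:
    """Return the first expected file (test_0.out, test_1.out, ...) that
    matches the actual output exactly, or None if none of them does.

    Hypothetical helper; the test passes if any candidate matches.
    """
    for candidate in sorted(expected_dir.glob(f"{test}_[0-9]*.out")):
        if candidate.read_text() == actual:
            return candidate
    return None


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        expected = Path(tmp)
        (expected / "char_0.out").write_text("A\nC\nb\n")  # C-locale ordering
        (expected / "char_1.out").write_text("A\nb\nC\n")  # locale-aware ordering
        match = find_matching_expected("A\nb\nC\n", expected, "char")
        print(match.name if match else "no expected file matched")
```

One consequence of this design is that a test can no longer tell which ordering the platform was supposed to produce; it only checks that the result is one of the known-plausible ones.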

The alternative I had in the back of my head was to query the locale and
prepare files char_en.out, char_de.out, etc. but as you showed, we can't
rely on these locales working in a particular way.

-- 
Peter Eisentraut   peter_e@gmx.net