Thread: Making the regression tests locale-proof
Since locale support is now enabled by default, it is desirable that the regression tests can pass if the cluster's locale is not C. As a first step I have included the following statements in pg_regress right after the database is created:

    alter database "$dbname" set lc_messages to 'C';
    alter database "$dbname" set lc_monetary to 'C';
    alter database "$dbname" set lc_numeric to 'C';
    alter database "$dbname" set lc_time to 'C';

This gets rid of a boatload of failures related to number formatting.

For that purpose I have changed the permissions on these options to USERSET. (I'm still debating making lc_messages SUSET, because otherwise users can screw with admins by changing the language of the log output all the time. Comments?)

The remaining issue is the sort order. I think this can be solved for practical purposes by creating two expected files for each affected test, say char.out and char-locale.out. The regression test driver would try the first one and, if that fails, try the second one.

The assumption here is that all locales will choose the same sort order as long as they're dealing only with the core 26 letters. This does not have to be true in theory, but I think it works for the vast majority of practical cases.

We could also cut down the number of affected tests by making the select_implicit and select_having tests not use mixed-case strings in the test tables. Then we have only char, varchar, and select_views left.

Comments?

-- 
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters. This does not have
> to be true in theory, but I think it works for the vast majority of
> practical cases.

Not for uppercase vs. lowercase versions of them. With no locale used (straight ASCII), you get A C b; with a locale you'll get A b C.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.
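The case-ordering difference Trond describes can be reproduced without any locale installed. The following is a minimal Python sketch; the `dictionary_like_sort` key is an assumption that only approximates what a typical non-C locale does, and it does not call the C library's strcoll().

```python
# Sketch: code-point (C locale) ordering versus a case-insensitive ordering.
# dictionary_like_sort is an illustrative stand-in for a real locale's
# collation rules, not a reimplementation of any particular locale.

def c_locale_sort(items):
    """Sort by code point, which matches plain byte comparison for ASCII."""
    return sorted(items)

def dictionary_like_sort(items):
    """Compare letters ignoring case first, then break ties by code point."""
    return sorted(items, key=lambda s: (s.lower(), s))

words = ["b", "A", "C"]
print(c_locale_sort(words))         # ['A', 'C', 'b']
print(dictionary_like_sort(words))  # ['A', 'b', 'C']
```

Any expected file that contains mixed-case strings in an ORDER BY result would therefore differ between the two orderings.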
On Sat, 2002-05-11 at 02:25, Peter Eisentraut wrote:
> The remaining issue is the sort order. I think this can be solved for
> practical purposes by creating two expected files for each affected test,
> say char.out and char-locale.out. The regression test driver would try
> the first one, if that fails try the second one.
>
> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters. This does not have
> to be true in theory, but I think it works for the vast majority of
> practical cases.

The et_EE locale has the following order for the "core 26 letters" (_ marks other letters):

    ABCDEFGHIJKLMNOPQRS_Z_TUVW____XY

(notice the position of Z), and I'm not sure if V and W are distinguished when sorting words that have anything after them.

I've heard that in some other locales there are other weird behaviours (like sorting one or two of the same letters as equivalent).

------------
Hannu
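As a rough illustration of the et_EE ordering quoted above, here is a small Python sketch. The rank table `ET_ORDER` is hypothetical: it keeps only the 26 basic letters in the order Hannu reports and leaves out the other letters (the `_` positions) as well as the possible V/W equivalence.

```python
# Sketch: sorting with the basic-letter order reported for et_EE,
# where Z comes right after S instead of at the end of the alphabet.
# ET_ORDER is an assumption built from the order quoted in the message.

ET_ORDER = "abcdefghijklmnopqrs" + "z" + "tuvw" + "xy"
ET_RANK = {letter: rank for rank, letter in enumerate(ET_ORDER)}

def et_key(word):
    """Map a word made of basic Latin letters to its list of ranks."""
    return [ET_RANK[ch] for ch in word.lower()]

words = ["ta", "za", "sa"]
print(sorted(words))              # ['sa', 'ta', 'za']
print(sorted(words, key=et_key))  # ['sa', 'za', 'ta']
```

Even lowercase-only test data made of the 26 basic letters can sort differently under such a locale.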
Peter Eisentraut <peter_e@gmx.net> writes:
> For that purpose I have changed the permissions on these options to
> USERSET. (I'm still debating making lc_messages SUSET, because otherwise
> users can screw with admins by changing the language of the log output all
> the time. Comments?)

Hm. Don't the regression tests already assume they are run by the superuser? They've got create/drop user commands in them. So I'd say SUSET is fine from the point of view of the tests, and I agree with your concern about making the logs unreadable.

> The assumption here is that all locales will choose the same sort order as
> long as they're dealing only with the core 26 letters.

Nope. For instance, on HPUX I get this sort order in English:

    $ LANG=en_US.iso88591 sort testll
    eix
    ela
    ella
    ellm
    elm
    eln
    enx

and this in Spanish:

    $ LANG=es_ES.iso88591 sort testll
    eix
    ela
    elm
    eln
    ella
    ellm
    enx

because the Spanish treat LL as a single collating element. (Actually, my very-rusty recollection is that they sort LL the same as one L, which would mean that HPUX's behavior is not quite right here: it's treating LL as one symbol that sorts after L. Linux seems to have no clue that LL is special at all though...)

> We could also cut down the number of affected tests by making the
> select_implicit and select_having not use mixed-case strings in the test
> tables. Then we have only char, varchar, and select_views left.

In practice we could perhaps use test data that doesn't hit any of the special cases in the popular languages. But I wonder whether this would not be shirking our responsibility as testers. Seems like if you avoid exercising these kinds of cases, you avoid finding corner-case bugs.

			regards, tom lane
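The two HPUX results above can be mimicked with a short Python sketch that treats "ch" and "ll" as single collating elements sorting after "c" and "l". The element table and helper names are assumptions made for illustration; they ignore ñ and accented letters and are not taken from any real localedef file.

```python
# Sketch: multi-character collating elements, modelled on the es_ES
# behaviour shown above ("ll" is one element that sorts after "l").
# ELEMENTS and the helpers below are hypothetical, for illustration only.

ELEMENTS = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
            + list("mnopqrstuvwxyz"))
RANK = {element: rank for rank, element in enumerate(ELEMENTS)}

def split_elements(word):
    """Split a word into collating elements, preferring 'ch' and 'll'."""
    elements = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in ("ch", "ll"):
            elements.append(word[i:i + 2])
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements

def digraph_key(word):
    """Sort key: the rank of each collating element in the word."""
    return [RANK[element] for element in split_elements(word.lower())]

words = ["eix", "ela", "ella", "ellm", "elm", "eln", "enx"]
print(sorted(words))
# ['eix', 'ela', 'ella', 'ellm', 'elm', 'eln', 'enx']
print(sorted(words, key=digraph_key))
# ['eix', 'ela', 'elm', 'eln', 'ella', 'ellm', 'enx']
```

The first result matches the en_US listing and the second matches the es_ES listing, using only the 26 basic letters.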
Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > The assumption here is that all locales will choose the same sort order as
> > long as they're dealing only with the core 26 letters.
>
> Nope. For instance, on HPUX I get this sort order in English:

[...]

> because the Spanish treat LL as a single collating element. (Actually,
> my very-rusty recollection is that they sort LL the same as one L, which
> would mean that HPUX's behavior is not quite right here: it's treating
> LL as one symbol that sorts after L. Linux seems to have no clue that
> LL is special at all though...)

HPUX's behaviour is broken, because in Spanish LL (as well as CH) stopped being a special symbol some five years ago (it used to be treated as one collating element sorted after "L", so HPUX's behaviour was right then).

> > We could also cut down the number of affected tests by making the
> > select_implicit and select_having not use mixed-case strings in the test
> > tables. Then we have only char, varchar, and select_views left.

Maybe it would be better to prepare various results, one for each of a subset of the locales supported (C, en_EN, some other "western" and maybe a couple of multibyte ones?). That way at least you make sure the C library is working as expected.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"It never ceases to be humiliating for a person of wit to know that there is no fool who cannot teach him something." (Jean B. Say)
Alvaro Herrera <alvherre@atentus.com> writes:
> HPUX's behaviour is broken, because in spanish LL (as well as CH)
> stopped being a special symbol some five years ago (it used to be
> treated as one collating element sorted after "L", so HPUX behaviour was
> right then).

Well, this is an old release ;-) ... the localedef files are dated around 1996. (And you don't want to know how long it's been since I could speak passable Spanish.)

In any case, the fact that the official rules have changed does not invalidate my point: there are systems on which the assumption Peter wants to make will fail.

			regards, tom lane
Tom Lane writes:
> In practice we could perhaps use test data that doesn't hit any of the
> special cases in the popular languages. But I wonder whether this would
> not be shirking our responsibility as testers. Seems like if you avoid
> exercising these kinds of cases, you avoid finding corner-case bugs.

There is a locale test suite under src/test/locale, which isn't very well known currently. There we can test the collation order in the wildest extremes for any particular locale. For the main test suite, I think we can boldly assume that if sorting works at all then it would also work equally well if more complicated strings were substituted, since the actual collating isn't done by us anyway.

What I'm thinking now is to simply collect a number of possible results and store expected files char_0.out, char_1.out, etc. and have the driver try all of these, basically meaning "any of these may be right". The alternative I had in the back of my head was to query the locale and prepare files char_en.out, char_de.out, etc. but as you showed, we can't rely on these locales working in a particular way.

-- 
Peter Eisentraut peter_e@gmx.net
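The "any of these may be right" idea can be sketched as follows. This is only an illustration in Python, not the pg_regress implementation: the function name `matching_expected`, the directory layout, and the use of exact text comparison instead of diff are all assumptions.

```python
# Sketch: accept a test result if it matches any of several expected files
# named <test>_0.out, <test>_1.out, ... (naming taken from the proposal).
# matching_expected and the exact-match comparison are hypothetical.

import tempfile
from pathlib import Path

def matching_expected(actual_text, expected_dir, test_name):
    """Return the name of the first expected file equal to the actual
    output, or None if no alternative matches."""
    for candidate in sorted(Path(expected_dir).glob(f"{test_name}_*.out")):
        if candidate.read_text() == actual_text:
            return candidate.name
    return None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as expected_dir:
        # Two acceptable orderings for the same query result.
        Path(expected_dir, "char_0.out").write_text("A\nC\nb\n")
        Path(expected_dir, "char_1.out").write_text("A\nb\nC\n")

        print(matching_expected("A\nb\nC\n", expected_dir, "char"))  # char_1.out
        print(matching_expected("b\nA\nC\n", expected_dir, "char"))  # None
```

A design consequence worth noting: because the driver never asks which locale is active, a result that is wrong for the current locale but right for another one would still be accepted.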