Thread: fixes for the Danish locale

fixes for the Danish locale

From
Jeff Janes
Date:
In Danish, the sequence 'aa' is sometimes treated as a single letter
which collates after 'z'.

Some regression tests got into 9.5, and are still in 9.6beta3, which
fail due to assuming they know how things will sort or compare.

I thought the easiest way to deal with it was just to change the test
data to use 'ab...' rather than 'aa...' to represent an
early-collating string.

With these applied, this now passes:

LANG=danish make check

Cheers,

Jeff

Attachment
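
To illustrate the failure mode described above, here is a minimal Python sketch of a collation in which 'aa' counts as one letter ranked with 'å', after 'z'. It is a toy model, not glibc's Danish rules; the alphabet string and the danish_like_key helper are assumptions made up for this example.

```python
# Toy model only: this is NOT glibc's da_DK collation, just enough to show
# why test data starting with 'aa' stops being "early-collating".
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_like_key(s):
    """Sort key that treats the digraph 'aa' as one letter ranked with 'å'."""
    key = []
    i = 0
    s = s.lower()
    while i < len(s):
        if s.startswith("aa", i):
            key.append(RANK["å"])
            i += 2
        else:
            # Characters outside the toy alphabet sort after all letters.
            key.append(RANK.get(s[i], len(DANISH_ALPHABET)))
            i += 1
    return key


words = ["ab", "aa", "zz", "b"]
c_like = sorted(words)                            # plain code-point order
danish_like = sorted(words, key=danish_like_key)  # 'aa' moves to the end
```

Under this model 'ab' still sorts first while 'aa' moves past 'zz', which is why switching the test data from 'aa...' to 'ab...' sidesteps the problem.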

Re: fixes for the Danish locale

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> In Danish, the sequence 'aa' is sometimes treated as a single letter
> which collates after 'z'.
> Some regression tests got into 9.5, and are still in 9.6beta3, which
> fail due to assuming they know how things will sort or compare.

Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
members covering this ...
        regards, tom lane



Re: fixes for the Danish locale

From
Greg Stark
Date:
On Thu, Jul 21, 2016 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
> members covering this ...

We're not going to have a build farm member for every locale the local
systems support.

Perhaps the build farm script should pick a random locale for each
run. Either a random locale from the set on the OS or a random
language from a list of locales that the regression tests are intended
to be safe for.

-- 
greg



Re: fixes for the Danish locale

From
Peter Geoghegan
Date:
On Thu, Jul 21, 2016 at 11:26 AM, Greg Stark <stark@mit.edu> wrote:
> Perhaps the build farm script should pick a random locale for each
> run. Either a random locale from the set on the OS or a random
> language from a list of locales that the regression tests are intended
> to be safe for.

That's more or less what I did with the amcheck regression tests.

-- 
Peter Geoghegan



Re: fixes for the Danish locale

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Thu, Jul 21, 2016 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
>> members covering this ...

> We're not going to have a build farm member for every locale the local
> systems support.

Probably not, but Danish seems odd enough to be worth testing.  Aside
from this issue, I found one in the pltcl tests.

> Perhaps the build farm script should pick a random locale for each
> run. Either a random locale from the set on the OS or a random
> language from a list of locales that the regression tests are intended
> to be safe for.

Nah, we have a hard enough time with reproducibility of buildfarm results
without deliberately injecting transient failures.
        regards, tom lane



Re: fixes for the Danish locale

From
Peter Geoghegan
Date:
On Thu, Jul 21, 2016 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps the build farm script should pick a random locale for each
>> run. Either a random locale from the set on the OS or a random
>> language from a list of locales that the regression tests are intended
>> to be safe for.
>
> Nah, we have a hard enough time with reproducibility of buildfarm results
> without deliberately injecting transient failures.

It could be pseudo-random, and so deterministic per buildfarm animal.
That's what I did.

-- 
Peter Geoghegan
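
A pseudo-random but deterministic per-animal choice could look roughly like the sketch below. This is not the amcheck test code or the buildfarm client; pick_locale, the animal name, and the candidate list are all hypothetical, and hashing the name is just one assumed way to get a stable choice.

```python
import hashlib


def pick_locale(animal_name, candidate_locales):
    """Map an animal name to one locale, identically on every run."""
    # Hash the name so the choice looks arbitrary but never changes
    # for a given animal and a given candidate list.
    digest = hashlib.sha256(animal_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidate_locales)
    # Sort first so the result does not depend on the input order.
    return sorted(candidate_locales)[index]


# Hypothetical candidate list; a real one would be curated by hand.
CANDIDATES = ["C", "cy_GB.utf8", "da_DK.utf8", "de_DE.utf8", "tr_TR.utf8"]
chosen = pick_locale("nightjar", CANDIDATES)
```

The choice stays fixed only as long as the candidate list does; adding or removing a locale can reshuffle which animal tests what.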



Re: fixes for the Danish locale

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Jul 21, 2016 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Nah, we have a hard enough time with reproducibility of buildfarm results
>> without deliberately injecting transient failures.

> It could be pseudo-random, and so deterministic per buildfarm animal.
> That's what I did.

I'm not impressed with that proposal either --- then we don't even have
any control over what set of locales are getting tested.

Note that there are certain locales we've deliberately chosen not to
support in some regression tests (see e.g. plpython_unicode.sql), so
I'm not really willing to buy into the idea that "any random locale found
on a buildfarm animal should work" anyway.  I'm much more interested in
supporting locales that someone cares enough about to configure a
buildfarm animal for.
        regards, tom lane



Re: fixes for the Danish locale

From
Jeff Janes
Date:
On Thu, Jul 21, 2016 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> In Danish, the sequence 'aa' is sometimes treated as a single letter
>> which collates after 'z'.
>> Some regression tests got into 9.5, and are still in 9.6beta3, which
>> fail due to assuming they know how things will sort or compare.
>
> Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
> members covering this ...
>

My CentOS box came with 735 locales installed, so testing all of them
on a regular basis would be quite a task.  And it doesn't help that
many of them seem to be very slow compared to C locale.

I guess the good news is that nothing I tested which was working in
9.5 is broken in 9.6, but several things which were working in 9.4 did
get broken in 9.5 and still are in 9.6.

The Danish fix will probably also fix the (very large) Norwegian family.

Welsh (cy_GB) apparently puts 'dd' after 'f', which breaks the row
level security tests in much the same way as 'aa' does.

I think that that will cover all of the ones that were working in 9.4.

Does testing in other locales ever uncover bugs other than those in
the tests themselves?  Is it worth trying to maintain broad coverage?

Cheers,

Jeff



Re: fixes for the Danish locale

From
Peter Geoghegan
Date:
On Thu, Jul 21, 2016 at 11:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Note that there are certain locales we've deliberately chosen not to
> support in some regression tests (see e.g. plpython_unicode.sql), so
> I'm not really willing to buy into the idea that "any random locale found
> on a buildfarm animal should work" anyway.  I'm much more interested in
> supporting locales that someone cares enough about to configure a
> buildfarm animal for.

That seems like a high standard to me. Locale rules are known to
change, and are explicitly versioned by glibc, for example.

-- 
Peter Geoghegan



Re: fixes for the Danish locale

From
Peter Geoghegan
Date:
On Thu, Jul 21, 2016 at 11:49 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Does testing in other locales ever uncover bugs other than those in
> the tests themselves?  Is it worth trying to maintain broad coverage?

Potentially, yes. The strxfrm() inconsistency issue disproportionately
affected de_DE.utf8, for example. There were other locales that were
affected less severely, and I think the majority were not shown to be
affected at all.

That being said, it probably wouldn't have caught that particular
issue if we had broad coverage. It probably would catch a broken test,
though.

-- 
Peter Geoghegan



Re: fixes for the Danish locale

From
Andrew Dunstan
Date:

On 07/21/2016 02:26 PM, Greg Stark wrote:
> On Thu, Jul 21, 2016 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
>> members covering this ...
> We're not going to have a build farm member for every locale the local
> systems support.
>
> Perhaps the build farm script should pick a random locale for each
> run. Either a random locale from the set on the OS or a random
> language from a list of locales that the regression tests are intended
> to be safe for.
>


I don't see why we shouldn't have a buildfarm machine that tests a very 
large number of locales. It takes a very lightly resourced machine like 
nightjar just over two minutes per locale. The list of locales to test 
is a setting in the config file.

cheers

andrew




Re: fixes for the Danish locale

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> In Danish, the sequence 'aa' is sometimes treated as a single letter
> which collates after 'z'.
> Some regression tests got into 9.5, and are still in 9.6beta3, which
> fail due to assuming they know how things will sort or compare.

As of HEAD, "LANG=danish make check-world" passes for me, which it
did not before the round of fixes I just pushed.

I see that the core tests fall over in Turkish still :-(
        regards, tom lane



Re: fixes for the Danish locale

From
Jeff Janes
Date:
On Thu, Jul 21, 2016 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> In Danish, the sequence 'aa' is sometimes treated as a single letter
>> which collates after 'z'.
>> Some regression tests got into 9.5, and are still in 9.6beta3, which
>> fail due to assuming they know how things will sort or compare.
>
> As of HEAD, "LANG=danish make check-world" passes for me, which it
> did not before the round of fixes I just pushed.
>
> I see that the core tests fall over in Turkish still :-(

Turkish has never passed (at least back to 9.0).  It looks like it is
in the stemming functions.  I don't understand why, I would think
everything other than English would be failing those if the regression
tests hard-code English stemming expectations but fail to arrange for
English stemming rules.

Cheers,

Jeff



Re: fixes for the Danish locale

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Jul 21, 2016 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I see that the core tests fall over in Turkish still :-(

> Turkish has never passed (at least back to 9.0).  It looks like it is
> in the stemming functions.  I don't understand why, I would think
> everything other than English would be failing those if the regression
> tests hard-code English stemming expectations but fail to arrange for
> English stemming rules.

It looks to me like the 'simple' dictionary assumes it can apply the
lowercasing rules implied by LC_CTYPE regardless of which language
it's supposedly working on.  This is probably something we should
improve sometime, but I doubt it's an easy change.
        regards, tom lane



Re: fixes for the Danish locale

From
Jeff Janes
Date:
On Thu, Jul 21, 2016 at 11:49 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Jul 21, 2016 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> In Danish, the sequence 'aa' is sometimes treated as a single letter
>>> which collates after 'z'.
>>> Some regression tests got into 9.5, and are still in 9.6beta3, which
>>> fail due to assuming they know how things will sort or compare.
>>
>> Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
>> members covering this ...
>>
>
> My CentOS box came with 735 locales installed, so testing all of them
> on a regular basis would be quite a task.  And it doesn't help that
> many of them seem to be very slow compared to C locale.
>
> I guess the good news is that nothing I tested which was working in
> 9.5 is broken in 9.6, but several things which were working in 9.4 did
> get broken in 9.5 and still are in 9.6.
>
> The Danish fix will probably also fix the (very large) Norwegian family.
>
> Welsh (cy_GB) apparently puts 'dd' after 'f', which breaks the row
> level security tests in much the same way as 'aa' does.
>
> I think that that will cover all of the ones that were working in 9.4.

The attached patch fixes regression tests for Welsh (cy_GB), needed in
9.5 and 9.6.

Cheers,

Jeff

Attachment

Re: fixes for the Danish locale

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> The attached patch fixes regression tests for Welsh (cy_GB), needed in
> 9.5 and 9.6.

Pushed, thanks.
        regards, tom lane



Re: fixes for the Danish locale

From
Andreas Karlsson
Date:
On 07/22/2016 03:59 AM, Jeff Janes wrote:
> On Thu, Jul 21, 2016 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I see that the core tests fall over in Turkish still :-(
>
> Turkish has never passed (at least back to 9.0).  It looks like it is
> in the stemming functions.  I don't understand why, I would think
> everything other than English would be failing those if the regression
> tests hard-code English stemming expectations but fail to arrange for
> English stemming rules.

If something fails for Turkish but not other languages it is usually due 
to the upper/lower casing rules of the dotted and the dotless I (I -> ı 
and İ -> i rather than most languages which have I -> i).

Andreas
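
The casing difference can be shown with a small Python sketch. Python's own str.lower() is locale-independent, so the Turkish rule is modelled here by a hypothetical turkish_lower helper; this is an illustration of the mapping only, not how PostgreSQL or the 'simple' dictionary implements lowercasing.

```python
def turkish_lower(s):
    """Lowercase with the Turkish mappings I -> ı and İ -> i."""
    # Handle the two Turkish-specific capitals first, then let the
    # default rules lowercase everything else.
    return s.replace("I", "ı").replace("İ", "i").lower()


default = "TITLE".lower()          # default rule: I -> i
turkish = turkish_lower("TITLE")   # Turkish rule: I -> ı (dotless)
```

A test whose expected output was produced with the default rule contains 'title', so output lowercased under the Turkish rule no longer matches it.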



Re: fixes for the Danish locale

From
Noah Misch
Date:
On Thu, Jul 21, 2016 at 03:53:45PM -0400, Andrew Dunstan wrote:
> 
> 
> On 07/21/2016 02:26 PM, Greg Stark wrote:
> >On Thu, Jul 21, 2016 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>Confirmed here.  Will deal with it, but I wonder why we have no buildfarm
> >>members covering this ...
> >We're not going to have a build farm member for every locale the local
> >systems support.
> >
> >Perhaps the build farm script should pick a random locale for each
> >run. Either a random locale from the set on the OS or a random
> >language from a list of locales that the regression tests are intended
> >to be safe for.
> >
> 
> 
> I don't see why we shouldn't have a buildfarm machine that tests a very
> large number of locales. It takes a very lightly resourced machine like
> nightjar just over two minutes per locale. The list of locales to test is a
> setting in the config file.

+1.  Ten animals of ~75 locales apiece would give fair per-animal runtime.
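
As a rough check of that arithmetic, the sketch below combines two figures from different messages in this thread: the 735 locales on Jeff's CentOS box and Andrew's "just over two minutes per locale" on nightjar. Putting them together is an assumption, split_locales is a hypothetical helper, and because two minutes is a floor the resulting times are lower bounds.

```python
def split_locales(locales, animals):
    """Deal locales round-robin so group sizes differ by at most one."""
    return [locales[i::animals] for i in range(animals)]


TOTAL_LOCALES = 735      # count reported for one CentOS box
MINUTES_PER_LOCALE = 2   # "just over two minutes", so treat as a floor
ANIMALS = 10

# Placeholder names; only the group sizes matter for the estimate.
groups = split_locales([f"locale_{n}" for n in range(TOTAL_LOCALES)], ANIMALS)
largest = max(len(g) for g in groups)

single_machine_hours = TOTAL_LOCALES * MINUTES_PER_LOCALE / 60
per_animal_hours = largest * MINUTES_PER_LOCALE / 60
```

With these inputs a single machine would need at least about 24.5 hours per pass, while each of ten animals would need at least about 2.5 hours for its 73 or 74 locales.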



Re: fixes for the Danish locale

From
Bjorn Munch
Date:
On 21/07 08.42, Jeff Janes wrote:
> In Danish, the sequence 'aa' is sometimes treated as a single letter
> which collates after 'z'.

For the record: this is also true for Norwegian, in both locales it
collates equal to 'å' which is the 29th letter of the alphabet. But
'aa' is no longer used in ordinary words, only names (in Norwegian
only personal names, in Danish also place names).

- Bjorn Munch