Thread: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5);

From: "Reko Turja"
Tom Lane wrote:

> Indeed.  To try to put some scope on the problem, I made an idiot little
> program that just generates some random UTF8 strings and sees whether
> strcoll and strxfrm sort them alike.  Attached are that program, a even
> more idiot little shell script that runs it over all available UTF8
> locales, and the results on my RHEL6 box.  While de_DE seems to be the
> worst-broken locale, it's far from the only one.
>
> Please try this on as many platforms as you can get hold of ...

Platform - FreeBSD 10.2, everything built from source using clang:

./tryalllocales.sh
Using LC_COLLATE = "af_ZA.UTF-8"
Using LC_CTYPE = "af_ZA.UTF-8"
af_ZA.UTF-8 good
Using LC_COLLATE = "am_ET.UTF-8"
Using LC_CTYPE = "am_ET.UTF-8"
am_ET.UTF-8 good
Using LC_COLLATE = "be_BY.UTF-8"
Using LC_CTYPE = "be_BY.UTF-8"
be_BY.UTF-8 good
Using LC_COLLATE = "bg_BG.UTF-8"
Using LC_CTYPE = "bg_BG.UTF-8"
bg_BG.UTF-8 good
Using LC_COLLATE = "ca_AD.UTF-8"
Using LC_CTYPE = "ca_AD.UTF-8"
ca_AD.UTF-8 good
Using LC_COLLATE = "ca_ES.UTF-8"
Using LC_CTYPE = "ca_ES.UTF-8"
ca_ES.UTF-8 good
Using LC_COLLATE = "ca_FR.UTF-8"
Using LC_CTYPE = "ca_FR.UTF-8"
ca_FR.UTF-8 good
Using LC_COLLATE = "ca_IT.UTF-8"
Using LC_CTYPE = "ca_IT.UTF-8"
ca_IT.UTF-8 good
Using LC_COLLATE = "cs_CZ.UTF-8"
Using LC_CTYPE = "cs_CZ.UTF-8"
cs_CZ.UTF-8 good
Using LC_COLLATE = "da_DK.UTF-8"
Using LC_CTYPE = "da_DK.UTF-8"
da_DK.UTF-8 good
Using LC_COLLATE = "de_AT.UTF-8"
Using LC_CTYPE = "de_AT.UTF-8"
de_AT.UTF-8 good
Using LC_COLLATE = "de_CH.UTF-8"
Using LC_CTYPE = "de_CH.UTF-8"
de_CH.UTF-8 good
Using LC_COLLATE = "de_DE.UTF-8"
Using LC_CTYPE = "de_DE.UTF-8"
de_DE.UTF-8 good
Using LC_COLLATE = "el_GR.UTF-8"
Using LC_CTYPE = "el_GR.UTF-8"
el_GR.UTF-8 good
Using LC_COLLATE = "en_AU.UTF-8"
Using LC_CTYPE = "en_AU.UTF-8"
en_AU.UTF-8 good
Using LC_COLLATE = "en_CA.UTF-8"
Using LC_CTYPE = "en_CA.UTF-8"
en_CA.UTF-8 good
Using LC_COLLATE = "en_GB.UTF-8"
Using LC_CTYPE = "en_GB.UTF-8"
en_GB.UTF-8 good
Using LC_COLLATE = "en_IE.UTF-8"
Using LC_CTYPE = "en_IE.UTF-8"
en_IE.UTF-8 good
Using LC_COLLATE = "en_NZ.UTF-8"
Using LC_CTYPE = "en_NZ.UTF-8"
en_NZ.UTF-8 good
Using LC_COLLATE = "en_US.UTF-8"
Using LC_CTYPE = "en_US.UTF-8"
en_US.UTF-8 good
Using LC_COLLATE = "es_ES.UTF-8"
Using LC_CTYPE = "es_ES.UTF-8"
es_ES.UTF-8 good
Using LC_COLLATE = "et_EE.UTF-8"
Using LC_CTYPE = "et_EE.UTF-8"
et_EE.UTF-8 good
Using LC_COLLATE = "eu_ES.UTF-8"
Using LC_CTYPE = "eu_ES.UTF-8"
eu_ES.UTF-8 good
Using LC_COLLATE = "fi_FI.UTF-8"
Using LC_CTYPE = "fi_FI.UTF-8"
fi_FI.UTF-8 good
Using LC_COLLATE = "fr_BE.UTF-8"
Using LC_CTYPE = "fr_BE.UTF-8"
fr_BE.UTF-8 good
Using LC_COLLATE = "fr_CA.UTF-8"
Using LC_CTYPE = "fr_CA.UTF-8"
fr_CA.UTF-8 good
Using LC_COLLATE = "fr_CH.UTF-8"
Using LC_CTYPE = "fr_CH.UTF-8"
fr_CH.UTF-8 good
Using LC_COLLATE = "fr_FR.UTF-8"
Using LC_CTYPE = "fr_FR.UTF-8"
fr_FR.UTF-8 good
Using LC_COLLATE = "he_IL.UTF-8"
Using LC_CTYPE = "he_IL.UTF-8"
he_IL.UTF-8 good
Using LC_COLLATE = "hr_HR.UTF-8"
Using LC_CTYPE = "hr_HR.UTF-8"
hr_HR.UTF-8 good
Using LC_COLLATE = "hu_HU.UTF-8"
Using LC_CTYPE = "hu_HU.UTF-8"
hu_HU.UTF-8 good
Using LC_COLLATE = "hy_AM.UTF-8"
Using LC_CTYPE = "hy_AM.UTF-8"
hy_AM.UTF-8 good
Using LC_COLLATE = "is_IS.UTF-8"
Using LC_CTYPE = "is_IS.UTF-8"
is_IS.UTF-8 good
Using LC_COLLATE = "it_CH.UTF-8"
Using LC_CTYPE = "it_CH.UTF-8"
it_CH.UTF-8 good
Using LC_COLLATE = "it_IT.UTF-8"
Using LC_CTYPE = "it_IT.UTF-8"
it_IT.UTF-8 good
Using LC_COLLATE = "ja_JP.UTF-8"
Using LC_CTYPE = "ja_JP.UTF-8"
ja_JP.UTF-8 good
Using LC_COLLATE = "kk_KZ.UTF-8"
Using LC_CTYPE = "kk_KZ.UTF-8"
kk_KZ.UTF-8 good
Using LC_COLLATE = "ko_KR.UTF-8"
Using LC_CTYPE = "ko_KR.UTF-8"
ko_KR.UTF-8 good
Using LC_COLLATE = "lt_LT.UTF-8"
Using LC_CTYPE = "lt_LT.UTF-8"
lt_LT.UTF-8 good
Using LC_COLLATE = "lv_LV.UTF-8"
Using LC_CTYPE = "lv_LV.UTF-8"
lv_LV.UTF-8 good
Using LC_COLLATE = "mn_MN.UTF-8"
Using LC_CTYPE = "mn_MN.UTF-8"
mn_MN.UTF-8 good
Using LC_COLLATE = "nb_NO.UTF-8"
Using LC_CTYPE = "nb_NO.UTF-8"
nb_NO.UTF-8 good
Using LC_COLLATE = "nl_BE.UTF-8"
Using LC_CTYPE = "nl_BE.UTF-8"
nl_BE.UTF-8 good
Using LC_COLLATE = "nl_NL.UTF-8"
Using LC_CTYPE = "nl_NL.UTF-8"
nl_NL.UTF-8 good
Using LC_COLLATE = "nn_NO.UTF-8"
Using LC_CTYPE = "nn_NO.UTF-8"
nn_NO.UTF-8 good
Using LC_COLLATE = "no_NO.UTF-8"
Using LC_CTYPE = "no_NO.UTF-8"
no_NO.UTF-8 good
Using LC_COLLATE = "pl_PL.UTF-8"
Using LC_CTYPE = "pl_PL.UTF-8"
pl_PL.UTF-8 good
Using LC_COLLATE = "pt_BR.UTF-8"
Using LC_CTYPE = "pt_BR.UTF-8"
pt_BR.UTF-8 good
Using LC_COLLATE = "pt_PT.UTF-8"
Using LC_CTYPE = "pt_PT.UTF-8"
pt_PT.UTF-8 good
Using LC_COLLATE = "ro_RO.UTF-8"
Using LC_CTYPE = "ro_RO.UTF-8"
ro_RO.UTF-8 good
Using LC_COLLATE = "ru_RU.UTF-8"
Using LC_CTYPE = "ru_RU.UTF-8"
ru_RU.UTF-8 good
Using LC_COLLATE = "sk_SK.UTF-8"
Using LC_CTYPE = "sk_SK.UTF-8"
sk_SK.UTF-8 good
Using LC_COLLATE = "sl_SI.UTF-8"
Using LC_CTYPE = "sl_SI.UTF-8"
sl_SI.UTF-8 good
Using LC_COLLATE = "sr_YU.UTF-8"
Using LC_CTYPE = "sr_YU.UTF-8"
sr_YU.UTF-8 good
Using LC_COLLATE = "sv_SE.UTF-8"
Using LC_CTYPE = "sv_SE.UTF-8"
sv_SE.UTF-8 good
Using LC_COLLATE = "tr_TR.UTF-8"
Using LC_CTYPE = "tr_TR.UTF-8"
tr_TR.UTF-8 good
Using LC_COLLATE = "uk_UA.UTF-8"
Using LC_CTYPE = "uk_UA.UTF-8"
uk_UA.UTF-8 good
Using LC_COLLATE = "zh_CN.UTF-8"
Using LC_CTYPE = "zh_CN.UTF-8"
zh_CN.UTF-8 good
Using LC_COLLATE = "zh_HK.UTF-8"
Using LC_CTYPE = "zh_HK.UTF-8"
zh_HK.UTF-8 good
Using LC_COLLATE = "zh_TW.UTF-8"
Using LC_CTYPE = "zh_TW.UTF-8"
zh_TW.UTF-8 good

-Reko
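
For readers without the attachment: the check Tom Lane describes (generate random UTF8 strings, then see whether strcoll and strxfrm order them alike) can be roughly sketched as below. This is not his program, which is not reproduced in this thread. It is a Python approximation, and the names random_string, sign and count_mismatches, the alphabet, and the trial count are all made up for illustration. Python's locale.strcoll and locale.strxfrm go through the platform C library, though possibly through its wide-character routines rather than strcoll/strxfrm themselves, so the counts need not match what the C program reports on the same machine.

```python
import locale
import random


def random_string(rng, max_len=8):
    """Return a short random string of ASCII and a few accented letters.

    The alphabet is an arbitrary choice for this sketch.
    """
    alphabet = "abcdefghijklmnopqrstuvwxyzäöüßéèñ"
    length = rng.randint(1, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def sign(x):
    """Collapse a comparison result to -1, 0 or 1."""
    return (x > 0) - (x < 0)


def count_mismatches(locale_name, trials=1000, seed=0):
    """Count random pairs that strcoll and strxfrm order differently.

    Raises locale.Error if locale_name is not installed.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query only; remembers current setting
    locale.setlocale(locale.LC_ALL, locale_name)
    try:
        rng = random.Random(seed)
        mismatches = 0
        for _ in range(trials):
            a, b = random_string(rng), random_string(rng)
            # Direct comparison under the locale's collation rules.
            by_strcoll = sign(locale.strcoll(a, b))
            # Comparison of the transformed sort keys.
            xa, xb = locale.strxfrm(a), locale.strxfrm(b)
            by_strxfrm = (xa > xb) - (xa < xb)
            if by_strcoll != by_strxfrm:
                mismatches += 1
        return mismatches
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous locale
```

Calling count_mismatches("de_DE.UTF-8") on a box with that locale installed returns the number of disagreeing pairs; zero would correspond to a "good" line in the output above, under the assumptions already stated.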

Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5);

From: Thomas Munro
On Fri, Mar 25, 2016 at 3:02 AM, Reko Turja <reko.turja@liukuma.net> wrote:
> Tom Lane wrote:
>
>> Indeed.  To try to put some scope on the problem, I made an idiot little
>> program that just generates some random UTF8 strings and sees whether
>> strcoll and strxfrm sort them alike.  Attached are that program, a even
>> more idiot little shell script that runs it over all available UTF8
>> locales, and the results on my RHEL6 box.  While de_DE seems to be the
>> worst-broken locale, it's far from the only one.
>>
>> Please try this on as many platforms as you can get hold of ...
>
>
> Platform - FreeBSD 10.2, everything built from source using clang:
>
> [all good]

FWIW I tried this on FreeBSD 11.0-CURRENT (the version currently in
development, which contains a new collation implementation that deals
with Unicode) and it didn't look good, so I reported it here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208266

--
Thomas Munro
http://www.enterprisedb.com