Thread: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
From
"Reko Turja"
Date:
Tom Lane wrote:

> Indeed. To try to put some scope on the problem, I made an idiot little
> program that just generates some random UTF8 strings and sees whether
> strcoll and strxfrm sort them alike. Attached are that program, a even
> more idiot little shell script that runs it over all available UTF8
> locales, and the results on my RHEL6 box. While de_DE seems to be the
> worst-broken locale, it's far from the only one.
>
> Please try this on as many platforms as you can get hold of ...

Platform - FreeBSD 10.2, everything built from source using clang:

./tryalllocales.sh
Using LC_COLLATE = "af_ZA.UTF-8"
Using LC_CTYPE = "af_ZA.UTF-8"
af_ZA.UTF-8 good
Using LC_COLLATE = "am_ET.UTF-8"
Using LC_CTYPE = "am_ET.UTF-8"
am_ET.UTF-8 good
Using LC_COLLATE = "be_BY.UTF-8"
Using LC_CTYPE = "be_BY.UTF-8"
be_BY.UTF-8 good
Using LC_COLLATE = "bg_BG.UTF-8"
Using LC_CTYPE = "bg_BG.UTF-8"
bg_BG.UTF-8 good
Using LC_COLLATE = "ca_AD.UTF-8"
Using LC_CTYPE = "ca_AD.UTF-8"
ca_AD.UTF-8 good
Using LC_COLLATE = "ca_ES.UTF-8"
Using LC_CTYPE = "ca_ES.UTF-8"
ca_ES.UTF-8 good
Using LC_COLLATE = "ca_FR.UTF-8"
Using LC_CTYPE = "ca_FR.UTF-8"
ca_FR.UTF-8 good
Using LC_COLLATE = "ca_IT.UTF-8"
Using LC_CTYPE = "ca_IT.UTF-8"
ca_IT.UTF-8 good
Using LC_COLLATE = "cs_CZ.UTF-8"
Using LC_CTYPE = "cs_CZ.UTF-8"
cs_CZ.UTF-8 good
Using LC_COLLATE = "da_DK.UTF-8"
Using LC_CTYPE = "da_DK.UTF-8"
da_DK.UTF-8 good
Using LC_COLLATE = "de_AT.UTF-8"
Using LC_CTYPE = "de_AT.UTF-8"
de_AT.UTF-8 good
Using LC_COLLATE = "de_CH.UTF-8"
Using LC_CTYPE = "de_CH.UTF-8"
de_CH.UTF-8 good
Using LC_COLLATE = "de_DE.UTF-8"
Using LC_CTYPE = "de_DE.UTF-8"
de_DE.UTF-8 good
Using LC_COLLATE = "el_GR.UTF-8"
Using LC_CTYPE = "el_GR.UTF-8"
el_GR.UTF-8 good
Using LC_COLLATE = "en_AU.UTF-8"
Using LC_CTYPE = "en_AU.UTF-8"
en_AU.UTF-8 good
Using LC_COLLATE = "en_CA.UTF-8"
Using LC_CTYPE = "en_CA.UTF-8"
en_CA.UTF-8 good
Using LC_COLLATE = "en_GB.UTF-8"
Using LC_CTYPE = "en_GB.UTF-8"
en_GB.UTF-8 good
Using LC_COLLATE = "en_IE.UTF-8"
Using LC_CTYPE = "en_IE.UTF-8"
en_IE.UTF-8 good
Using LC_COLLATE = "en_NZ.UTF-8"
Using LC_CTYPE = "en_NZ.UTF-8"
en_NZ.UTF-8 good
Using LC_COLLATE = "en_US.UTF-8"
Using LC_CTYPE = "en_US.UTF-8"
en_US.UTF-8 good
Using LC_COLLATE = "es_ES.UTF-8"
Using LC_CTYPE = "es_ES.UTF-8"
es_ES.UTF-8 good
Using LC_COLLATE = "et_EE.UTF-8"
Using LC_CTYPE = "et_EE.UTF-8"
et_EE.UTF-8 good
Using LC_COLLATE = "eu_ES.UTF-8"
Using LC_CTYPE = "eu_ES.UTF-8"
eu_ES.UTF-8 good
Using LC_COLLATE = "fi_FI.UTF-8"
Using LC_CTYPE = "fi_FI.UTF-8"
fi_FI.UTF-8 good
Using LC_COLLATE = "fr_BE.UTF-8"
Using LC_CTYPE = "fr_BE.UTF-8"
fr_BE.UTF-8 good
Using LC_COLLATE = "fr_CA.UTF-8"
Using LC_CTYPE = "fr_CA.UTF-8"
fr_CA.UTF-8 good
Using LC_COLLATE = "fr_CH.UTF-8"
Using LC_CTYPE = "fr_CH.UTF-8"
fr_CH.UTF-8 good
Using LC_COLLATE = "fr_FR.UTF-8"
Using LC_CTYPE = "fr_FR.UTF-8"
fr_FR.UTF-8 good
Using LC_COLLATE = "he_IL.UTF-8"
Using LC_CTYPE = "he_IL.UTF-8"
he_IL.UTF-8 good
Using LC_COLLATE = "hr_HR.UTF-8"
Using LC_CTYPE = "hr_HR.UTF-8"
hr_HR.UTF-8 good
Using LC_COLLATE = "hu_HU.UTF-8"
Using LC_CTYPE = "hu_HU.UTF-8"
hu_HU.UTF-8 good
Using LC_COLLATE = "hy_AM.UTF-8"
Using LC_CTYPE = "hy_AM.UTF-8"
hy_AM.UTF-8 good
Using LC_COLLATE = "is_IS.UTF-8"
Using LC_CTYPE = "is_IS.UTF-8"
is_IS.UTF-8 good
Using LC_COLLATE = "it_CH.UTF-8"
Using LC_CTYPE = "it_CH.UTF-8"
it_CH.UTF-8 good
Using LC_COLLATE = "it_IT.UTF-8"
Using LC_CTYPE = "it_IT.UTF-8"
it_IT.UTF-8 good
Using LC_COLLATE = "ja_JP.UTF-8"
Using LC_CTYPE = "ja_JP.UTF-8"
ja_JP.UTF-8 good
Using LC_COLLATE = "kk_KZ.UTF-8"
Using LC_CTYPE = "kk_KZ.UTF-8"
kk_KZ.UTF-8 good
Using LC_COLLATE = "ko_KR.UTF-8"
Using LC_CTYPE = "ko_KR.UTF-8"
ko_KR.UTF-8 good
Using LC_COLLATE = "lt_LT.UTF-8"
Using LC_CTYPE = "lt_LT.UTF-8"
lt_LT.UTF-8 good
Using LC_COLLATE = "lv_LV.UTF-8"
Using LC_CTYPE = "lv_LV.UTF-8"
lv_LV.UTF-8 good
Using LC_COLLATE = "mn_MN.UTF-8"
Using LC_CTYPE = "mn_MN.UTF-8"
mn_MN.UTF-8 good
Using LC_COLLATE = "nb_NO.UTF-8"
Using LC_CTYPE = "nb_NO.UTF-8"
nb_NO.UTF-8 good
Using LC_COLLATE = "nl_BE.UTF-8"
Using LC_CTYPE = "nl_BE.UTF-8"
nl_BE.UTF-8 good
Using LC_COLLATE = "nl_NL.UTF-8"
Using LC_CTYPE = "nl_NL.UTF-8"
nl_NL.UTF-8 good
Using LC_COLLATE = "nn_NO.UTF-8"
Using LC_CTYPE = "nn_NO.UTF-8"
nn_NO.UTF-8 good
Using LC_COLLATE = "no_NO.UTF-8"
Using LC_CTYPE = "no_NO.UTF-8"
no_NO.UTF-8 good
Using LC_COLLATE = "pl_PL.UTF-8"
Using LC_CTYPE = "pl_PL.UTF-8"
pl_PL.UTF-8 good
Using LC_COLLATE = "pt_BR.UTF-8"
Using LC_CTYPE = "pt_BR.UTF-8"
pt_BR.UTF-8 good
Using LC_COLLATE = "pt_PT.UTF-8"
Using LC_CTYPE = "pt_PT.UTF-8"
pt_PT.UTF-8 good
Using LC_COLLATE = "ro_RO.UTF-8"
Using LC_CTYPE = "ro_RO.UTF-8"
ro_RO.UTF-8 good
Using LC_COLLATE = "ru_RU.UTF-8"
Using LC_CTYPE = "ru_RU.UTF-8"
ru_RU.UTF-8 good
Using LC_COLLATE = "sk_SK.UTF-8"
Using LC_CTYPE = "sk_SK.UTF-8"
sk_SK.UTF-8 good
Using LC_COLLATE = "sl_SI.UTF-8"
Using LC_CTYPE = "sl_SI.UTF-8"
sl_SI.UTF-8 good
Using LC_COLLATE = "sr_YU.UTF-8"
Using LC_CTYPE = "sr_YU.UTF-8"
sr_YU.UTF-8 good
Using LC_COLLATE = "sv_SE.UTF-8"
Using LC_CTYPE = "sv_SE.UTF-8"
sv_SE.UTF-8 good
Using LC_COLLATE = "tr_TR.UTF-8"
Using LC_CTYPE = "tr_TR.UTF-8"
tr_TR.UTF-8 good
Using LC_COLLATE = "uk_UA.UTF-8"
Using LC_CTYPE = "uk_UA.UTF-8"
uk_UA.UTF-8 good
Using LC_COLLATE = "zh_CN.UTF-8"
Using LC_CTYPE = "zh_CN.UTF-8"
zh_CN.UTF-8 good
Using LC_COLLATE = "zh_HK.UTF-8"
Using LC_CTYPE = "zh_HK.UTF-8"
zh_HK.UTF-8 good
Using LC_COLLATE = "zh_TW.UTF-8"
Using LC_CTYPE = "zh_TW.UTF-8"
zh_TW.UTF-8 good

-Reko
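For readers who want to try the same kind of check without the attached C program, here is a rough sketch of the idea in Python. It is not Tom Lane's program: the alphabet, the function names and the pairwise comparison are all assumptions made for illustration. It relies only on the standard library's locale.strcoll and locale.strxfrm, which wrap the C library's collation functions, and it counts pairs of strings for which the two functions imply a different ordering.

```python
import locale
import random

# Hypothetical test alphabet (an assumption, not taken from the original
# program): ASCII letters, a space, and a few accented characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ äöüßéèñ"


def random_strings(count, max_len=8, seed=0):
    """Return `count` pseudo-random strings of length 1..max_len."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(ALPHABET) for _ in range(rng.randint(1, max_len)))
        for _ in range(count)
    ]


def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def count_disagreements(strings):
    """Count pairs where strcoll and strxfrm imply a different ordering.

    Uses whatever locale is currently active; the caller is expected to
    have called locale.setlocale() beforehand.
    """
    keys = [locale.strxfrm(s) for s in strings]
    bad = 0
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            by_coll = sign(locale.strcoll(strings[i], strings[j]))
            by_xfrm = (keys[i] > keys[j]) - (keys[i] < keys[j])
            if by_coll != by_xfrm:
                bad += 1
    return bad
```

To use it, select a locale first, for example locale.setlocale(locale.LC_ALL, "de_DE.UTF-8") if that locale is installed, then call count_disagreements(random_strings(1000)). A result of zero corresponds to "good" in the output above; anything else means the two functions disagree for at least one pair under that locale.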
Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
From
Thomas Munro
Date:
On Fri, Mar 25, 2016 at 3:02 AM, Reko Turja <reko.turja@liukuma.net> wrote:
> Tom Lane wrote:
>
>> Indeed. To try to put some scope on the problem, I made an idiot little
>> program that just generates some random UTF8 strings and sees whether
>> strcoll and strxfrm sort them alike. Attached are that program, a even
>> more idiot little shell script that runs it over all available UTF8
>> locales, and the results on my RHEL6 box. While de_DE seems to be the
>> worst-broken locale, it's far from the only one.
>>
>> Please try this on as many platforms as you can get hold of ...
>
>
> Platform - FreeBSD 10.2, everything built from source using clang:
>
> [all good]

FWIW I tried this on FreeBSD 11.0-CURRENT (the version currently in
development which contains a new collation implementation that deals
with Unicode) and it didn't look good so I reported that over here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208266

--
Thomas Munro
http://www.enterprisedb.com