Thread: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

[BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
Hi,

With 10beta2 built on Debian 8 with ./configure  --enable-debug --with-icu
and the ICU package currently in the "jessie" Debian repo:

$ dpkg -l 'libicu*'
...
ii  libicu-dev:amd 52.1-8+deb8u amd64         Development files for
Internation
ii  libicu52:amd64 52.1-8+deb8u amd64         International Components for
Unic
ii  libicu52-dbg   52.1-8+deb8u amd64         International Components for
Unic

I've got a table with 6,6 million unique small bits of text from different
Unicode alphabets:
    Table "public.words_test" Column  | Type | Collation | Nullable | Default
----------+------+-----------+----------+---------wordtext | text |         |        |

and found that running the following query on it consistently
provokes a SIGSEGV with certain collations:
SELECT count(distinct wordtext COLLATE :"collname") FROM words_test;

Some of the collations that crash: az-Latn-AZ-u-co-search-x-icu bs-Latn-BA-u-co-search-x-icu bs-x-icu
cs-CZ-u-co-search-x-icude-BE-u-co-phonebk-x-icu sr-Latn-XK-x-icu zh-Hans-CN-u-co-big5han-x-icu 

Trying all of them I had 146 crashes out of the 1741 ICU
entries in pg_collation created by initdb.

The size of the table is 291MB, and work_mem to 128MB.

Reducing the dataset tends to make the problem disappear: if I split
the table in halves based on row_number() to bisect on the data,
the queries on both parts pass without crashing.

Below is a backtrace got with collate "az-Latn-AZ-u-co-search-x-icu",
and work_mem to 128MB. Cranking up work_mem to 512MB makes
the crash not happen, but 300MB is not enough.
(by comparison, the same query with collate "en_US.utf8" or "fr-x-icu"
runs fine with work_mem to 4MB)

#0  0x00007fc3c5017030 in ucol_getLatinOneContractionUTF8 (   coll=coll@entry=0x2824ef0, strength=strength@entry=0,
CE=<optimizedout>,    s=s@entry=0x383a3ec 

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
index=index@entry=0x7fff46afe290, len=len@entry=7) at ucol.cpp:8044
#1  0x00007fc3c502917c in ucol_strcollUseLatin1UTF8 (status=0x7fff46afe338,    tLen=7,    target=0x383a3ec

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
  sLen=6,    source=0x3839bec 

"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
coll=0x2824ef0) at ucol.cpp:8153
#2  ucol_strcollUTF8_52 (coll=<optimized out>,    source=0x3839bec

"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
  source@entry=0x3839be9 

"reqwuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
sourceLength=<optimized out>, sourceLength@entry=9,    target=0x383a3ec

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
  target@entry=0x383a3e9 

"requ\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
targetLength=7, targetLength@entry=10,    status=status@entry=0x7fff46afe338) at ucol.cpp:8770
#3  0x00000000007cc7b4 in varstrfastcmp_locale (x=58956776, y=58958824,    ssup=<optimized out>) at varlena.c:2139
#4  0x00000000008170b6 in ApplySortComparator (ssup=0x28171b8,    isNull2=<optimized out>, datum2=<optimized out>,
isNull1=<optimizedout>,    datum1=<optimized out>) at 
../../../../src/include/utils/sortsupport.h:225
#5  comparetup_datum (a=0x2818ef8, b=0x2818f10, state=0x2816fa8)   at tuplesort.c:4341
#6  0x0000000000815623 in tuplesort_heap_replace_top (   state=state@entry=0x2816fa8, tuple=tuple@entry=0x7fff46afe410,
  checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3510 
#7  0x0000000000816d8c in tuplesort_gettuple_common (   state=state@entry=0x2816fa8, forward=forward@entry=1 '\001',
stup=stup@entry=0x7fff46afe460)at tuplesort.c:2082 
#8  0x000000000081b176 in tuplesort_getdatum (state=0x2816fa8,    forward=forward@entry=1 '\001',
val=val@entry=0x28130f0,   isNull=isNull@entry=0x2813409 "", abbrev=abbrev@entry=0x7fff46afe518)   at tuplesort.c:2205 
#9  0x00000000005ea107 in process_ordered_aggregate_single (   pergroupstate=0x2812f38, pertrans=0x2812f98,
aggstate=0x2811198)  at nodeAgg.c:1330 
#10 finalize_aggregates (aggstate=aggstate@entry=0x2811198,    peraggs=peraggs@entry=0x2812588, pergroup=<optimized
out>)  at nodeAgg.c:1736 
#11 0x00000000005eaabd in agg_retrieve_direct (aggstate=0x2811198)   at nodeAgg.c:2464
#12 ExecAgg (node=node@entry=0x2811198) at nodeAgg.c:2117
#13 0x00000000005e2378 in ExecProcNode (node=node@entry=0x2811198)   at execProcnode.c:539
#14 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>,    dest=0x27f7eb0, direction=<optimized out>,
numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT,    use_parallel_mode=<optimized out>,
planstate=0x2811198,estate=0x2810f88)   at execMain.c:1693 
#15 standard_ExecutorRun (queryDesc=0x280d6d8, direction=<optimized out>,    count=0, execute_once=<optimized out>) at
execMain.c:362
#16 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x280ef78,    forward=forward@entry=1 '\001', count=0,
count@entry=9223372036854775807,   dest=dest@entry=0x27f7eb0) at pquery.c:932 
#17 0x00000000006fa5f0 in PortalRun (portal=0x280ef78,    count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimizedout>, dest=0x27f7eb0, altdest=0x27f7eb0,    completionTag=0x7fff46afe830 "") at pquery.c:773 
#18 0x00000000006f67c3 in exec_simple_query (   query_string=0x383a3ec

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200")
at postgres.c:1099
#19 0x00000000006f842a in PostgresMain (argc=1, argv=0x27d5f68,    dbname=0x2752288 "mlists", username=0x276a298
"postgres")  at postgres.c:4090 
#20 0x000000000047803f in BackendRun (port=0x274bf30) at postmaster.c:4357
#21 BackendStartup (port=0x274bf30) at postmaster.c:4029
#22 ServerLoop () at postmaster.c:1753
#23 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3,    argv=argv@entry=0x2724330) at postmaster.c:1361
#24 0x0000000000478f8e in main (argc=3, argv=0x2724330) at main.c:228


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
Daniel Verite <daniel@manitou-mail.org> wrote:
> SELECT count(distinct wordtext COLLATE :"collname") FROM words_test;
>
>Some of the collations that crash:
>  az-Latn-AZ-u-co-search-x-icu
>  bs-Latn-BA-u-co-search-x-icu
>  bs-x-icu
>  cs-CZ-u-co-search-x-icu
>  de-BE-u-co-phonebk-x-icu
>  sr-Latn-XK-x-icu
>  zh-Hans-CN-u-co-big5han-x-icu
>
>Trying all of them I had 146 crashes out of the 1741 ICU
>entries in pg_collation created by initdb.
>
>The size of the table is 291MB, and work_mem to 128MB.
>
>Reducing the dataset tends to make the problem disappear: if I split
>the table in halves based on row_number() to bisect on the data,
>the queries on both parts pass without crashing.

I think that this sensitivity to work_mem exists because abbreviated
keys are used for quicksort operations that sort individual runs.
As work_mem is increased, and less merging is required, affected
codepaths are reached less frequently. You would probably find that the
problem appears more consistently if varstr_sortsupport() is modified so
that even ICU collations never use abbreviated keys; that would be a
matter of "abbreviate" always being set to false within that function.

I suggest using the new amcheck contrib module as part of this testing
(you'll need to use CREATE INDEX to have an index to perform
verification against). This will zero in on inconsistencies that may be
far more subtle than a hard crash. I wouldn't assume that abbreviated
key comparisons are correct here just because there is no hard crash.

Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace?

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
Peter Geoghegan wrote:

> Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace?

AFAICS, yes.

The test that iterates over collations produces two kinds of core files,
some of them are 289MB large, some others are 17GB large.
shared_buffers is only 128MB and work_mem 128MB,
so 289MB is not surprising but 17GB seems excessive.
The box has 16GB of physical mem and 8GB of swap.

I haven't checked all core files because they exhaust the disk
space before completion of the test, but a typical backtrace for
the biggest ones looks like the following, with the segfaults
happening in memcpy:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned ()   at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
(gdb) #0  __memcpy_sse2_unaligned ()   at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1  0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010,   __dest=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/string3.h:51
#2  ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128,   b=0x7ffd907751e0) at ucol.cpp:7009
#3  UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509,   b=0x7ffd907751e0) at ucol.cpp:7022
#4  ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460,   tColl=tColl@entry=0x7ffd90776610,
status=status@entry=0x7ffd90777128)  at ucol.cpp:7163 
#5  0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0,   source=source@entry=0x273d379 "콗喩zx㎍",
sourceLength=sourceLength@entry=11,target=<optimized out>,   targetLength=targetLength@entry=8,
status=status@entry=0x7ffd90777128)  at ucol.cpp:8023 
#6  0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,   tLen=<optimized out>, target=<optimized
out>,sLen=<optimized out>,   source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108 
#7  ucol_strcollUTF8_52 (coll=<optimized out>,   source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>,
sourceLength@entry=11,target=<optimized out>,   target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8,
status=status@entry=0x7ffd90777128)at ucol.cpp:8770 
#8  0x00000000007cc7b4 in varstrfastcmp_locale (x=41145208, y=41145352,   ssup=<optimized out>) at varlena.c:2139
#9  0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0x136bd98,   isNull2=<optimized out>, datum2=<optimized
out>,isNull1=<optimized out>,    datum1=<optimized out>) at 
../../../../src/include/utils/sortsupport.h:263
#10 comparetup_datum (a=0x7fc1c642d210, b=0x7fc1c642d228,    state=<optimized out>) at tuplesort.c:4350
#11 0x0000000000815aef in qsort_tuple (a=0x7fc1c642d210, n=<optimized out>,    n@entry=2505,
cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>,    state=state@entry=0x136bb88) at qsort_tuple.c:104 
#12 0x0000000000815c1f in qsort_tuple (a=0x7fc1c6413020, n=<optimized out>,    n@entry=5279,
cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>,    state=state@entry=0x136bb88) at qsort_tuple.c:191 
#13 0x0000000000815c1f in qsort_tuple (a=0x7fc1c63f27e8, n=<optimized out>,    n@entry=103353,
cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>,    state=state@entry=0x136bb88) at qsort_tuple.c:191 
#14 0x0000000000815c1f in qsort_tuple (a=0x7fc1c5f423c8, n=<optimized out>,    n@entry=503456,
cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>,    state=state@entry=0x136bb88) at qsort_tuple.c:191 
#15 0x0000000000815c1f in qsort_tuple (a=0x7fc1c34ae048, n=<optimized out>,    cmp_tuple=0x817030 <comparetup_datum>,
state=state@entry=0x136bb88)  at qsort_tuple.c:191 
#16 0x000000000081873b in tuplesort_sort_memtuples (   state=state@entry=0x136bb88) at tuplesort.c:3410
#17 0x0000000000818907 in dumpbatch (alltuples=0 '\000', state=0x136bb88)   at tuplesort.c:3087
#18 dumptuples (state=state@entry=0x136bb88,    alltuples=alltuples@entry=0 '\000') at tuplesort.c:2970
#19 0x0000000000818b81 in puttuple_common (state=state@entry=0x136bb88,    tuple=tuple@entry=0x7ffd90777470) at
tuplesort.c:1719
#20 0x000000000081a6f1 in tuplesort_putdatum (state=0x136bb88,    val=<optimized out>, isNull=<optimized out>) at
tuplesort.c:1558
#21 0x00000000005e91ec in advance_aggregates (   aggstate=aggstate@entry=0x13731f8, pergroup=pergroup@entry=0x1376008,
 pergroups=0x0) at nodeAgg.c:1023 
#22 0x00000000005eac03 in agg_retrieve_direct (aggstate=0x13731f8)   at nodeAgg.c:2402
#23 ExecAgg (node=node@entry=0x13731f8) at nodeAgg.c:2117
#24 0x00000000005e2378 in ExecProcNode (node=node@entry=0x13731f8)   at execProcnode.c:539
#25 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>,    dest=0x132f368, direction=<optimized out>,
numberTuples=0,   sendTuples=<optimized out>, operation=CMD_SELECT,    use_parallel_mode=<optimized out>,
planstate=0x13731f8,estate=0x1372fe8)   at execMain.c:1693 
#26 standard_ExecutorRun (queryDesc=0x1350938, direction=<optimized out>,    count=0, execute_once=<optimized out>) at
execMain.c:362
#27 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x1362d98,    forward=forward@entry=1 '\001', count=0,
count@entry=9223372036854775807,   dest=dest@entry=0x132f368) at pquery.c:932 
#28 0x00000000006fa5f0 in PortalRun (portal=0x1362d98,    count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimizedout>, dest=0x132f368, altdest=0x132f368,    completionTag=0x7ffd90777990 "") at pquery.c:773 
#29 0x00000000006f7da1 in exec_execute_message (max_rows=9223372036854775807,    portal_name=<optimized out>) at
postgres.c:1984
#30 PostgresMain (argc=3, argv=0x7fffffffffffffff, dbname=0x12cb288 "mlists",    username=0x132f368 "") at
postgres.c:4153
#31 0x000000000047803f in BackendRun (port=0x12c4f30) at postmaster.c:4357
#32 BackendStartup (port=0x12c4f30) at postmaster.c:4029
#33 ServerLoop () at postmaster.c:1753
#34 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3,    argv=argv@entry=0x129d330) at postmaster.c:1361
#35 0x0000000000478f8e in main (argc=3, argv=0x129d330) at main.c:228


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Tue, Aug 1, 2017 at 12:45 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> The test that iterates over collations produces two kinds of core files,
> some of them are 289MB large, some others are 17GB large.
> shared_buffers is only 128MB and work_mem 128MB,
> so 289MB is not surprising but 17GB seems excessive.
> The box has 16GB of physical mem and 8GB of swap.
>
> I haven't checked all core files because they exhaust the disk
> space before completion of the test, but a typical backtrace for
> the biggest ones looks like the following, with the segfaults
> happening in memcpy:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  __memcpy_sse2_unaligned ()
>     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
> (gdb) #0  __memcpy_sse2_unaligned ()
>     at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
> #1  0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010,
>     __dest=<optimized out>) at
> /usr/include/x86_64-linux-gnu/bits/string3.h:51
> #2  ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128,
>     b=0x7ffd907751e0) at ucol.cpp:7009
> #3  UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509,
>     b=0x7ffd907751e0) at ucol.cpp:7022
> #4  ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460,
>     tColl=tColl@entry=0x7ffd90776610, status=status@entry=0x7ffd90777128)
>     at ucol.cpp:7163
> #5  0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0,
>     source=source@entry=0x273d379 "콗喩zx㎍",
>     sourceLength=sourceLength@entry=11, target=<optimized out>,
>     targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128)
>     at ucol.cpp:8023
> #6  0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,
>     tLen=<optimized out>, target=<optimized out>, sLen=<optimized out>,
>     source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108
> #7  ucol_strcollUTF8_52 (coll=<optimized out>,
>     source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>,
>     sourceLength@entry=11, target=<optimized out>,
>     target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8,
>     status=status@entry=0x7ffd90777128) at ucol.cpp:8770

Interesting. The "__len" argument to memcpy() is 8589934592 -- that's
2 ^ 33. (I'm not sure why it's the first memcpy() argument in the
stack trace, since it's supposed to be the last -- anyone seen that
before?)

Can you figure out what the optimized-out lengths are, by either
looking at registers within GDB, or building at a lower optimization
level?

Maybe this is a bug in ICU-52. For reference, here is ICU-52's
ucol_CEBuf_Expand() function:

static
void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) {   uint32_t  oldSize;   uint32_t  newSize;
uint32_t  *newBuf; 
   ci->flags |= UCOL_ITER_ALLOCATED;   oldSize = (uint32_t)(b->pos - b->buf);   newSize = oldSize * 2;   newBuf =
(uint32_t*)uprv_malloc(newSize * sizeof(uint32_t));   if(newBuf == NULL) {       *status = U_MEMORY_ALLOCATION_ERROR;
}  else {       uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t));       if (b->buf != b->localArray) {
uprv_free(b->buf);      }       b->buf = newBuf;       b->endp = b->buf + newSize;       b->pos  = b->buf + oldSize;
}
}

If "oldSize * sizeof(uint32_t)" becomes what we see as "__len", as I
believe it does, then that must mean that oldSize is 2 ^ 31. *Not* 2 ^
31 - 1 (INT_MAX). I think that this could be an off-by-one bug, since
ucol_strcollUTF8()/ucol_strcollUTF8_52() accepts an int32 argument for
sourceLength and targetLength. I'm not very confident of this, but it
does make a certain amount of sense. It could be that everyone else is
passing -1 as sourceLength and targetLength arguments, anyway, to
indicate that the buffer is NUL-terminated, as required by regular
strcoll().

Note also that the docs say this of ucol_strcollUTF8(): "When input
string contains malformed a UTF-8 byte sequence, this function treats
these bytes as REPLACEMENT CHARACTER (U+FFFD)". I'm not sure that
that's a very sensible way for it to fail.

I'd be interested to see if anything changed when -1 was passed as
both sourceLength and targetLength to ucol_strcollUTF8(). You'd have
to build Postgres yourself to test this, but it would just work, since
we don't actually avoid NUL termination, even though in principled we
could with ICU.

--
Peter Geoghegan


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
Peter Geoghegan wrote:

> I'd be interested to see if anything changed when -1 was passed as
> both sourceLength and targetLength to ucol_strcollUTF8(). You'd have
> to build Postgres yourself to test this, but it would just work, since
> we don't actually avoid NUL termination, even though in principled we
> could with ICU.

It doesn't seem to change the outcome:

*** varlena.c~    2017-07-10 22:26:20.000000000 +0200
--- varlena.c    2017-08-02 12:57:25.997265001 +0200
***************
*** 2137,2144 ****
            status = U_ZERO_ERROR;            result =
ucol_strcollUTF8(sss->locale->info.icu.ucol,
!                                            a1p, len1,
!                                            a2p, len2,                                        &status);            if
(U_FAILURE(status))               ereport(ERROR, 
--- 2137,2144 ----
            status = U_ZERO_ERROR;            result =
ucol_strcollUTF8(sss->locale->info.icu.ucol,
!                                            a1p, -1, /* len1 */
!                                            a2p, -1, /*len2,*/                                        &status);
   if (U_FAILURE(status))                ereport(ERROR, 


For the case when its ends up crashing in memcpy, the backtrace now looks
like this:

#0  __memcpy_sse2_unaligned ()   at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1  0x00007f0156812e6b in memcpy (__len=8589934592, __src=0x7efbeac89010,   __dest=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/string3.h:51
#2  ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd49c4ca98,   b=0x7ffd49c4ab60) at ucol.cpp:7009
#3  UCOL_CEBUF_PUT (status=0x7ffd49c4ca98, ci=0x7ffd49c4bde0, ce=1493173509,   b=0x7ffd49c4ab60) at ucol.cpp:7022
#4  ucol_strcollRegular (sColl=sColl@entry=0x7ffd49c4bde0,   tColl=tColl@entry=0x7ffd49c4bf90,
status=status@entry=0x7ffd49c4ca98)  at ucol.cpp:7163 
#5  0x00007f0156818177 in ucol_strcollRegularUTF8 (coll=0x2286100,   source=source@entry=0x3651829 "쳭喩zz",
sourceLength=sourceLength@entry=-1,target=<optimized out>,   targetLength=targetLength@entry=-1,
status=status@entry=0x7ffd49c4ca98)  at ucol.cpp:8023 
#6  0x00007f0156819d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,   tLen=<optimized out>, target=<optimized
out>,sLen=<optimized out>,   source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108 
#7  ucol_strcollUTF8_52 (coll=<optimized out>,   source=source@entry=0x3651829 "쳭喩zz", sourceLength=<optimized out>,
sourceLength@entry=-1,target=<optimized out>,   target@entry=0x3651799 "콗喩zx㎍", targetLength=targetLength@entry=-1,
status=status@entry=0x7ffd49c4ca98)at ucol.cpp:8770 
#8  0x00000000007cc7df in varstrfastcmp_locale (x=56956968, y=56956824,    ssup=<optimized out>) at varlena.c:2139
#9  0x0000000000817168 in ApplySortAbbrevFullComparator (ssup=0x2280338,    isNull2=<optimized out>, datum2=<optimized
out>,isNull1=<optimized out>,    datum1=<optimized out>) at 
../../../../src/include/utils/sortsupport.h:263
#10 comparetup_datum (a=0x7f0141c14228, b=0x7f0141c14240,    state=<optimized out>) at tuplesort.c:4350
#11 0x0000000000815b1f in qsort_tuple (a=0x7f0141c14210, n=<optimized out>,    n@entry=106,
cmp_tuple=cmp_tuple@entry=0x817060<comparetup_datum>,    state=state@entry=0x2280128) at qsort_tuple.c:104 
.....

Anyway I'll try to build ICU-52 without optimization to avoid these
<optimized out> parameters.

Also running the same tests with a self-compiled ICU-59 versus debian's
packaged
ICU-52 results in *no crash at all*. So maybe the root cause of these crashes
is
that a critical fix in ICU has not been backported?

I note that the list of ICU locales available in pg_collation is different
though.
When compiled with ICU-59, initdb produces "only" 581 collname
matching '%icu%', versus 1741 with ICU-52. The ICU version
is the only thing that differs between both cases.
But anyway in the list of collations that crash for me with ICU-52, a fair
number
(~50) are still there and work in ICU-59, so I think the comparison is still
meaningful.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 2, 2017 at 5:00 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> Anyway I'll try to build ICU-52 without optimization to avoid these
> <optimized out> parameters.
>
> Also running the same tests with a self-compiled ICU-59 versus debian's
> packaged
> ICU-52 results in *no crash at all*. So maybe the root cause of these crashes
> is
> that a critical fix in ICU has not been backported?

That seems possible. Certainly, there have been a fair amount of
bugfixes against this package, some as recently as April of this year:

http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog

I'm sure that you used a version that has the relevant fixes, but it's
not that hard to imagine that new bugs were introduced in the process
of backpatching upstream fixes from later major ICU releases. There
has only been one official ICU point release for 52, which was way
back in 2013. Unofficial backpatches are hazardous in their own way.

> I note that the list of ICU locales available in pg_collation is different
> though.
> When compiled with ICU-59, initdb produces "only" 581 collname
> matching '%icu%', versus 1741 with ICU-52. The ICU version
> is the only thing that differs between both cases.
> But anyway in the list of collations that crash for me with ICU-52, a fair
> number
> (~50) are still there and work in ICU-59, so I think the comparison is still
> meaningful.

I agree that it's a meaningful comparison. It now seems very likely
that this is a bug in ICU, and possibly this particular package. That
seemed likely before now anyway; there is not that much that could go
wrong with PostgreSQL's use of ICU that is peculiar to only a small
minority of available collations. Perhaps you can whittle down your
testcase into something minimal, for the Debian package maintainer.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/2/17 08:00, Daniel Verite wrote:
> I note that the list of ICU locales available in pg_collation is different
> though.
> When compiled with ICU-59, initdb produces "only" 581 collname
> matching '%icu%', versus 1741 with ICU-52.

This is because the older version produces redundant locale names such
as de-BE-*, de-CH-*, de-DE-*, whereas the newer version omits those
because they are (presumably?) all equivalent to de-*.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
Peter Geoghegan wrote:

> it's not that hard to imagine that new bugs were introduced in the process
> of backpatching upstream fixes from later major ICU releases

I've now tried with ICU-52.1 from upstream, built with
gcc -O0 -g to get cleaner stack traces.
The behavior is similar to debian's precompiled lib in how and where
it crashes.

work_mem is at 128MB.

query #1:
  select count(distinct wordtext collate "az-x-icu") from words_test;

The backend terminates with SIGSEGV and a 289MB core file :
at ucol.cpp:8033
8033        while(schar > (tchar = *(UCharOffset+offset))) { /* since the
contraction codepoints should be ordered, we skip all that are smaller */

#0  0x00007f7926e12341 in ucol_getLatinOneContractionUTF8 (coll=0x1b422c0,
    strength=0, CE=4061217513,
    s=0x2b56b5c

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",

    index=0x7ffde80dc978, len=7) at ucol.cpp:8033
#1  0x00007f7926e12845 in ucol_strcollUseLatin1UTF8 (coll=0x1b422c0,
    source=0x2b56b5c

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",

    sLen=7,
    target=0x2b5735c

"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230",
tLen=6, status=0x7ffde80dcab8) at ucol.cpp:8104
#2  0x00007f7926e1487a in ucol_strcollUTF8_52 (coll=0x1b422c0,
    source=0x2b56b5c

"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",

    sourceLength=7,
    target=0x2b5735c

"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230",
targetLength=6, status=0x7ffde80dcab8) at ucol.cpp:8759
#3  0x00000000007cc7b4 in varstrfastcmp_locale (x=45443928, y=45445976,
    ssup=<optimized out>) at varlena.c:2139
#4  0x00000000008170b6 in ApplySortComparator (ssup=0x1b344a8,
    isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
    datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:225
#5  comparetup_datum (a=0x7ffde80dcb90, b=0x1b36218, state=0x1b34298)
    at tuplesort.c:4341
#6  0x00000000008154d9 in tuplesort_heap_replace_top (
    state=state@entry=0x1b34298, tuple=tuple@entry=0x7ffde80dcb90,
    checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3512
...cut...


query #2: select count(distinct wordtext collate
"bs-Cyrl-BA-u-co-search-x-icu")
  from words_test;

The backend terminates with SIGSEGV leaving a 17GB core file
(there's definitely a uint32 offset overflow from 2^31->0
in ucol_CEBuf_Expand just before the crash in memcpy)

#0  __memcpy_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1  0x00007fbad6951ec3 in ucol_CEBuf_Expand (b=0x7ffdef2d2390,
    ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:6998
#2  0x00007fbad6951f68 in UCOL_CEBUF_PUT (b=0x7ffdef2d2390, ce=1493173509,
    ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:7011
#3  0x00007fbad69525c7 in ucol_strcollRegular (sColl=0x7ffdef2d3560,
    tColl=0x7ffdef2d3710, status=0x7ffdef2d41e8) at ucol.cpp:7152
#4  0x00007fbad6954080 in ucol_strcollRegularUTF8 (coll=0xf75610,
    source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215",
    sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz",
    targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8012
#5  0x00007fbad6956845 in ucol_strcollUTF8_52 (coll=0xf75610,
    source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215",
    sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz",
    targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8757
#6  0x00000000007cc7b4 in varstrfastcmp_locale (x=36903080, y=36903224,
    ssup=<optimized out>) at varlena.c:2139
#7  0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0xf67888,
    isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
    datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:263
#8  comparetup_datum (a=0x7fbac1cd6210, b=0x7fbac1cd6228,
    state=<optimized out>) at tuplesort.c:4350
#9  0x0000000000815aef in qsort_tuple (a=0x7fbac1cd6210, n=<optimized out>,
    n@entry=2505, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
    state=state@entry=0xf67678) at qsort_tuple.c:104
#10 0x0000000000815c1f in qsort_tuple (a=0x7fbac1cbc020, n=<optimized out>,
    n@entry=5279, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
    state=state@entry=0xf67678) at qsort_tuple.c:191
...cut...


Next I've tried with valgrind, invoked with:

valgrind \
   --suppressions=$HOME/src/postgresql-10beta2/src/tools/valgrind.supp \
   --trace-children=yes --track-origins=yes --error-limit=no \
   --read-var-info=yes    --log-file=/tmp/log-valgrind \
  bin/postgres -D $PWD/data

(and turning off autovacuum because it causes some unrelated reports).

With query #1 the run goes to completion without crashing (hi heisenbug!),
but it reports many uses of uninitialised values under ucol_strcollUTF8().
The full log is attached in log-valgrind-1.txt.gz


With query #2 it ends up crashing after ~5hours  and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading  uninitialised values
in space allocated by datumCopy().


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> With query #2 it ends up crashing after ~5hours  and produces
> the log in log-valgrind-2.txt.gz with some other entries than
> case #1, but AFAICS still all about reading  uninitialised values
> in space allocated by datumCopy().

Right. This part is really interesting to me:

==48827==  Uninitialised value was created by a heap allocation
==48827==    at 0x4C28C20: malloc (vg_replace_malloc.c:296)
==48827==    by 0x80B597: AllocSetAlloc (aset.c:771)
==48827==    by 0x810ADC: palloc (mcxt.c:862)
==48827==    by 0x72BFEF: datumCopy (datum.c:171)
==48827==    by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
==48827==    by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)

If you actually go to datum.c:171, you see that that's a codepath for
pass-by-reference datatypes that lack a varlena header. Text is a
datatype that has a varlena header, though, so that's clearly wrong. I
don't know how this actually happened, but working back through the
relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
determined for the entire datum tuplesort.)

I am once again only guessing, but I have to wonder if this is a
problem in commit b8d7f053. It seems likely that the problem begins
before tuplesort_begin_datum() is even called, which is the basis of
this suspicion. If the problem is within tuplesort, then that could
only be because get_typlenbyval() gives wrong answers, which seems
very unlikely.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> With query #1 the run goes to completion without crashing (hi heisenbug!),
> but it reports many uses of uninitialised values under ucol_strcollUTF8().
> The full log is attached in log-valgrind-1.txt.gz
>
>
> With query #2 it ends up crashing after ~5hours  and produces
> the log in log-valgrind-2.txt.gz with some other entries than
> case #1, but AFAICS still all about reading  uninitialised values
> in space allocated by datumCopy().

It would be nice if you could confirm whether or not Valgrind
complains when non-ICU collations are in use. It may just have been
that we get (un)lucky with ICU, where the undefined behavior happens
to result in a hard crash, more or less by accident.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
    Peter Geoghegan wrote:

> It would be nice if you could confirm whether or not Valgrind
> complains when non-ICU collations are in use.

It doesn't (with en_US.utf8), and also doesn't complain
for certain ICU collations such as "fr-x-icu", one of
the 1595 that don't segfault with this data and work_mem
at 128MB (vs 146 that happen to segfault)

If someone wants to reproduce the problem, I've made
a custom dump (40MB) available at
http://www.manitou-mail.org/vrac/words_test.dump

query: select count(distinct wordtext collate "bs-x-icu") from words_test;

work_mem: 128MB or less, shared_buffers: 128MB

OS: debian 8 x86_64

ICU: 52.1, upstream or deb8

locale environment from which initdb was called:
$ locale -a
C
C.UTF-8
de_DE.utf8
en_US.utf8
français
french
fr_FR
fr_FR.iso88591
fr_FR.utf8
pl_PL
pl_PL.iso88592
polish
POSIX

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote:
> On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> > With query #2 it ends up crashing after ~5hours  and produces
> > the log in log-valgrind-2.txt.gz with some other entries than
> > case #1, but AFAICS still all about reading  uninitialised values
> > in space allocated by datumCopy().
> 
> Right. This part is really interesting to me:
> 
> ==48827==  Uninitialised value was created by a heap allocation
> ==48827==    at 0x4C28C20: malloc (vg_replace_malloc.c:296)
> ==48827==    by 0x80B597: AllocSetAlloc (aset.c:771)
> ==48827==    by 0x810ADC: palloc (mcxt.c:862)
> ==48827==    by 0x72BFEF: datumCopy (datum.c:171)
> ==48827==    by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
> ==48827==    by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
> 
> If you actually go to datum.c:171, you see that that's a codepath for
> pass-by-reference datatypes that lack a varlena header. Text is a
> datatype that has a varlena header, though, so that's clearly wrong. I
> don't know how this actually happened, but working back through the
> relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
> suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
> determined for the entire datum tuplesort.)
> 
> I am once again only guessing, but I have to wonder if this is a
> problem in commit b8d7f053. It seems likely that the problem begins
> before tuplesort_begin_datum() is even called, which is the basis of
> this suspicion. If the problem is within tuplesort, then that could
> only be because get_typlenbyval() gives wrong answers, which seems
> very unlikely.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter
(Eisentraut), since you committed the patch believed to have created it, you
own this open item.  If some other commit is more relevant or if this does not
belong as a v10 open item, please let us know.  Otherwise, please observe the
policy on open item ownership[1] and send a status update within three
calendar days of this message.  Include a date for your subsequent status
update.  Testers may discover new open items at any time, and I want to plan
to get them all fixed well in advance of shipping v10.  Consequently, I will
appreciate your efforts toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Adding -hackers.

On Sat, Aug 05, 2017 at 03:55:13PM -0700, Noah Misch wrote:
> On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote:
> > On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> > > With query #2 it ends up crashing after ~5hours  and produces
> > > the log in log-valgrind-2.txt.gz with some other entries than
> > > case #1, but AFAICS still all about reading  uninitialised values
> > > in space allocated by datumCopy().
> > 
> > Right. This part is really interesting to me:
> > 
> > ==48827==  Uninitialised value was created by a heap allocation
> > ==48827==    at 0x4C28C20: malloc (vg_replace_malloc.c:296)
> > ==48827==    by 0x80B597: AllocSetAlloc (aset.c:771)
> > ==48827==    by 0x810ADC: palloc (mcxt.c:862)
> > ==48827==    by 0x72BFEF: datumCopy (datum.c:171)
> > ==48827==    by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
> > ==48827==    by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
> > 
> > If you actually go to datum.c:171, you see that that's a codepath for
> > pass-by-reference datatypes that lack a varlena header. Text is a
> > datatype that has a varlena header, though, so that's clearly wrong. I
> > don't know how this actually happened, but working back through the
> > relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
> > suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
> > determined for the entire datum tuplesort.)
> > 
> > I am once again only guessing, but I have to wonder if this is a
> > problem in commit b8d7f053. It seems likely that the problem begins
> > before tuplesort_begin_datum() is even called, which is the basis of
> > this suspicion. If the problem is within tuplesort, then that could
> > only be because get_typlenbyval() gives wrong answers, which seems
> > very unlikely.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter
> (Eisentraut), since you committed the patch believed to have created it, you
> own this open item.  If some other commit is more relevant or if this does not
> belong as a v10 open item, please let us know.  Otherwise, please observe the
> policy on open item ownership[1] and send a status update within three
> calendar days of this message.  Include a date for your subsequent status
> update.  Testers may discover new open items at any time, and I want to plan
> to get them all fixed well in advance of shipping v10.  Consequently, I will
> appreciate your efforts toward speedy resolution.  Thanks.
> 
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Andres Freund
Date:
On 2017-08-03 11:42:25 -0700, Peter Geoghegan wrote:
> On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
> > With query #2 it ends up crashing after ~5hours  and produces
> > the log in log-valgrind-2.txt.gz with some other entries than
> > case #1, but AFAICS still all about reading  uninitialised values
> > in space allocated by datumCopy().
> 
> Right. This part is really interesting to me:
> 
> ==48827==  Uninitialised value was created by a heap allocation
> ==48827==    at 0x4C28C20: malloc (vg_replace_malloc.c:296)
> ==48827==    by 0x80B597: AllocSetAlloc (aset.c:771)
> ==48827==    by 0x810ADC: palloc (mcxt.c:862)
> ==48827==    by 0x72BFEF: datumCopy (datum.c:171)
> ==48827==    by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
> ==48827==    by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
> 
> If you actually go to datum.c:171, you see that that's a codepath for
> pass-by-reference datatypes that lack a varlena header. Text is a
> datatype that has a varlena header, though, so that's clearly wrong. I
> don't know how this actually happened, but working back through the
> relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
> suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
> determined for the entire datum tuplesort.)
> 
> I am once again only guessing, but I have to wonder if this is a
> problem in commit b8d7f053. It seems likely that the problem begins
> before tuplesort_begin_datum() is even called, which is the basis of
> this suspicion. If the problem is within tuplesort, then that could
> only be because get_typlenbyval() gives wrong answers, which seems
> very unlikely.

Not saying it's not the fault of b8d7f053 et al, but I don't quite see
how - whether something is a varlena tuple or not isn't really something
expression evaluation has an influence over if it doesn't happen from
within its code. That's the responsibility of the calling code, not from
within the datum. So I don't quite understand how you got to b8d7f053?

- Andres


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Sat, Aug 5, 2017 at 4:03 PM, Andres Freund <andres@anarazel.de> wrote:
> Not saying it's not the fault of b8d7f053 et al, but I don't quite see
> how - whether something is a varlena tuple or not isn't really something
> expression evaluation has an influence over if it doesn't happen from
> within its code. That's the responsibility of the calling code, not from
> within the datum. So I don't quite understand how you got to b8d7f053?

It was a guess based on nothing more than what codepaths changed since
the last release.

While I think it's fair to assume for the time being that this is an
ICU bug, based on the fact that Valgrind won't complain when other
collations are used with the same data, I have a really hard time
figuring out why the datum tuplesort thinks that text is a
pass-by-reference type without a varlena header. That doesn't smell
like a bug in the ICU feature to me. Could an ICU bug really have
somehow made unrelated syscache lookups give wrong answers about
built-in types?

I'll try to find time to reproduce the problem next week. It should be
possible to work backwards from the weird issue within datumCopy(),
and figure out what the problem is.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

"Daniel Verite" <daniel@manitou-mail.org> writes:
> If someone wants to reproduce the problem, I've made
> a custom dump (40MB) available at 
> http://www.manitou-mail.org/vrac/words_test.dump

Thanks for providing this test data.  I've attempted, so far
unsuccessfully, to reproduce the crash.  I'm using today's HEAD PG code,
in --enable-debug --enable-cassert configuration, all parameters default
except work_mem = 128MB.  I see no crash with any ICU collation provided
by these configurations:

* RHEL 6's stock version of ICU 4.2, on x86_64

* Direct-from-upstream-source build of icu4c-57_1-src.tgz on
current macOS, also x86_64.

So while that isn't helping all that much, it seems to constrain the range
of affected ICU versions further than we knew before.

I'm quite disturbed though that the set of installed collations on these
two test cases seem to be entirely different both from each other and from
what you reported.  The base collations look generally similar, but the
"keyword variant" versions are not comparable at all.  Considering that
the entire reason we are interested in ICU in the first place is its
alleged cross-version collation behavior stability, this gives me the
exact opposite of a warm fuzzy feeling.  We need to understand why it's
like that and what we can do to reduce the variation, or else we're just
buying our users enormous future pain.  At least with the libc collations,
you can expect that if you have en_US.utf8 available today you will
probably still have en_US.utf8 available tomorrow.  I am not seeing any
reason to believe that the same holds for ICU collations.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

I wrote:
> "Daniel Verite" <daniel@manitou-mail.org> writes:
>> If someone wants to reproduce the problem, I've made
>> a custom dump (40MB) available at
>> http://www.manitou-mail.org/vrac/words_test.dump

> Thanks for providing this test data.  I've attempted, so far
> unsuccessfully, to reproduce the crash.

OK, so after installing ICU 52.1 from upstream sources, I can reproduce
these problems, and I think there's nothing to see here except ICU bugs.

In the case where Daniel reports huge core dumps, the problem seems to be
that ucol_strcollRegular() gets into an infinite loop here:
           // We fetch CEs until we hit a non ignorable primary or end.           uint32_t sPrimary;           do {
         // We get the next CE               sOrder = ucol_IGetNextCE(coll, sColl, status);               // Stuff it
inthe buffer               UCOL_CEBUF_PUT(&sCEs, sOrder, sColl, status);               // And keep just the primary
part.              sPrimary = sOrder & UCOL_PRIMARYMASK;           } while(sPrimary == 0); 

It keeps on stuffing tokens into "sCEs", which is an expansible buffer in
the same spirit as our StringInfos, and once that gets past 2^31 tokens
all hell breaks loose, because the expansion code is entirely unconcerned
about the possibility of integer overflow:

void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) {   uint32_t  oldSize;   uint32_t  newSize;
uint32_t  *newBuf; 
   ci->flags |= UCOL_ITER_ALLOCATED;   oldSize = (uint32_t)(b->pos - b->buf);   newSize = oldSize * 2;   newBuf =
(uint32_t*)uprv_malloc(newSize * sizeof(uint32_t));   if(newBuf == NULL) {       *status = U_MEMORY_ALLOCATION_ERROR;
}  else {       uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t));   ... 

The buffer size seems to always be a power of 2.  After enough iterations
oldSize is 2^31, so newSize overflows to zero, and it allocates a
zero-length newBuf and proceeds to copy ~8GB of data there.  No surprise
that we get a SIGSEGV, but before that happens it'll have trashed a ton of
memory.  I suspect whatever problems valgrind is reporting are just false
positives induced by that massive memory stomp.

So there are two separate bugs there --- one is that that loop fails to
terminate, for reasons that are doubtless specific to particular collation
data, and the other is that the expansible-buffer code is not robust.

The other class of failures amounts to this loop iterating till it falls
off the end of memory:
       while(schar > (tchar = *(UCharOffset+offset))) { /* since the contraction codepoints should be ordered, we skip
allthat are smaller */           offset++;       } 

which is unsurprising, because (in my core dump) schar is 1113834 which is
larger than any possible UChar value, so the loop cannot terminate except
by crashing.  The crash occurred while trying to process this string:
 buf2 = 0x1614d20 "requ\364\217\273\252te",

and I do not think it's coincidence that that multibyte character
there corresponds to U+10FEEA or decimal 1113834.  Apparently
they've got some bugs with dealing with characters beyond U+FFFF,
at least in certain locales.

In short then, ICU 52.1 is just too buggy to contemplate using.
I haven't compared versions to see when these issues were introduced
or fixed.  But I'm now thinking that trying to support old ICU
versions is a mistake, and we'd better serve our users by telling
them not to use ICU before some-version-after-52.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm quite disturbed though that the set of installed collations on these
> two test cases seem to be entirely different both from each other and from
> what you reported.  The base collations look generally similar, but the
> "keyword variant" versions are not comparable at all.  Considering that
> the entire reason we are interested in ICU in the first place is its
> alleged cross-version collation behavior stability, this gives me the
> exact opposite of a warm fuzzy feeling.  We need to understand why it's
> like that and what we can do to reduce the variation, or else we're just
> buying our users enormous future pain.  At least with the libc collations,
> you can expect that if you have en_US.utf8 available today you will
> probably still have en_US.utf8 available tomorrow.  I am not seeing any
> reason to believe that the same holds for ICU collations.

+1. That seems like something that is important to get right up-front.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So there are two separate bugs there --- one is that that loop fails to
> terminate, for reasons that are doubtless specific to particular collation
> data, and the other is that the expansible-buffer code is not robust.

Interesting.

> In short then, ICU 52.1 is just too buggy to contemplate using.
> I haven't compared versions to see when these issues were introduced
> or fixed.  But I'm now thinking that trying to support old ICU
> versions is a mistake, and we'd better serve our users by telling
> them not to use ICU before some-version-after-52.

Distributions generally back-patch fixes to older ICU versions. What
upstream source did you use, exactly?

It's possible that this bug could be fixed by the Debian maintainer.
This is supposed to be a very stable version of the library. It has
received security updates fairly recently [1]. Maybe this is actually
a regression caused by a badly handled backpatch, like the infamous
random number bug that only appeared in Debian's OpenSSL. Many
important software packages will have a dependency on the Debian 52.1
ICU package, and it really shouldn't be a liability to support it.
(Granted, it does appear to be one right now.)

[1] http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog
-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In short then, ICU 52.1 is just too buggy to contemplate using.

> Distributions generally back-patch fixes to older ICU versions. What
> upstream source did you use, exactly?

I went to http://www.icu-project.org/ and downloaded icu4c-52_1-src.tgz.
All the file dates therein seem to be 2013-10-04.

Debian, for one, is evidently not trying very hard in that direction,
since not only are the bugs still there but the line numbers I saw in
my backtraces agreed with Daniel's, indicating they've not changed
much of anything at all in ucol.cpp.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Sun, Aug 6, 2017 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Debian, for one, is evidently not trying very hard in that direction,
> since not only are the bugs still there but the line numbers I saw in
> my backtraces agreed with Daniel's, indicating they've not changed
> much of anything at all in ucol.cpp.

I  wonder if it's worth considering distrusting ucol_strcollUTF8() on
some ICU versions that prove to not be up to snuff. The function has
only been around since ICU-50, released in late 2012. We could have
something like a TRUST_UCOL_STRCOLLUTF8 override for
HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between
not supporting somewhat older ICU versions and doing nothing at all.

I'm pretty sure that ucol_strcollUTF8() is unused by all the major
projects that use ICU, such as LibreOffice, Chromium, and Qt. Their
requirements play to the strengths of UTF-16, so they naturally always
used UTF-16.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
"Daniel Verite"
Date:
    Tom Lane wrote:

> I went to http://www.icu-project.org/ and downloaded icu4c-52_1-src.tgz.
> All the file dates therein seem to be 2013-10-04.
>
> Debian, for one, is evidently not trying very hard in that direction,
> since not only are the bugs still there but the line numbers I saw in
> my backtraces agreed with Daniel's, indicating they've not changed
> much of anything at all in ucol.cpp.

They have 2 small patches in ucol.cpp (diff attached),
but the last backtraces I've sent were against upstream, not
Debian, got from the same source as you, so they wouldn't differ
in the line numbers.
Anyway the behavior with segfaulting was identical to Debian's.

Speaking of upstream vs Debian, for the library as a whole there are
quite a few security patches that are not in upstream:

  $ apt-get source libicu-dev
  [...]
  dpkg-source: info: extracting icu in icu-52.1
  dpkg-source: info: unpacking icu_52.1.orig.tar.gz
  dpkg-source: info: unpacking icu_52.1-8+deb8u5.debian.tar.xz
  dpkg-source: info: applying icudata-stdlibs.patch
  dpkg-source: info: applying gennorm2-man.patch
  dpkg-source: info: applying icuinfo-man.patch
  dpkg-source: info: applying malayalam-rendering.patch
  dpkg-source: info: applying indic-ccmp.patch
  dpkg-source: info: applying mlym-crash.patch
  dpkg-source: info: applying two-digit-year-test.patch
  dpkg-source: info: applying icu-config.patch
  dpkg-source: info: applying CVE-2014-6585.patch
  dpkg-source: info: applying CVE-2014-6591.patch
  dpkg-source: info: applying CVE-2014-7923+7926.patch
  dpkg-source: info: applying CVE-2014-7940.patch
  dpkg-source: info: applying CVE-2014-9654.patch
  dpkg-source: info: applying CVE-2014-8146.patch
  dpkg-source: info: applying CVE-2014-8147.patch
  dpkg-source: info: applying CVE-2015-4760.patch
  dpkg-source: info: applying CVE-2014-6585+.patch
  dpkg-source: info: applying CVE-2015-1270.patch
  dpkg-source: info: applying CVE-2014-9911.patch
  dpkg-source: info: applying CVE-2015-2632.patch
  dpkg-source: info: applying CVE-2015-4844.patch
  dpkg-source: info: applying CVE-2016-0494.patch
  dpkg-source: info: applying CVE-2016-6293.patch
  dpkg-source: info: applying CVE-2016-7415.patch
  dpkg-source: info: applying CVE-2017-7867_CVE-2017-7868.patch

Independantly of the bug discussed in this thread, what is puzzling
to me is why upstream does not integrate any of these fixes.
Here's their policy about maintenance releases:

http://site.icu-project.org/processes/maintenance-releases

   "When a critical problem is found in ICU libraries, we try to fix the
   problem in the latest development stream first. If there is a demand
   for the fix in a past release, an ICU project developer may escalate
   the fix to be integrated in the release to the ICU project management
   committee. Once the committee approved to merge the fix into back
   level stream, the developer can merge the bug fix back to the past
   release suggested by the committee. This merge activity must be
   tracked by maintenance release place holder tickets and the developer
   should provide original ticket number and description as the response
   in each maintenance ticket. These fixes are automatically included in
   a future ICU maintenance release."


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/6/17 17:17, Peter Geoghegan wrote:
> I  wonder if it's worth considering distrusting ucol_strcollUTF8() on
> some ICU versions that prove to not be up to snuff. The function has
> only been around since ICU-50, released in late 2012. We could have
> something like a TRUST_UCOL_STRCOLLUTF8 override for
> HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between
> not supporting somewhat older ICU versions and doing nothing at all.

I can confirm that that avoids the crash in ICU 52, so it's something to
consider.  We would need more testing about which other ICU versions are
affected, however.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 11:17, Peter Eisentraut wrote:
> On 8/6/17 17:17, Peter Geoghegan wrote:
>> I  wonder if it's worth considering distrusting ucol_strcollUTF8() on
>> some ICU versions that prove to not be up to snuff. The function has
>> only been around since ICU-50, released in late 2012. We could have
>> something like a TRUST_UCOL_STRCOLLUTF8 override for
>> HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between
>> not supporting somewhat older ICU versions and doing nothing at all.
> 
> I can confirm that that avoids the crash in ICU 52, so it's something to
> consider.  We would need more testing about which other ICU versions are
> affected, however.

In my testing, there are no crashes in ICU 51 and ICU 53, so we could
restrict this workaround to version 52 specifically.  Perhaps someone
else can confirm my test results.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On 8/5/17 18:56, Noah Misch wrote:
>> [Action required within three days.  This is a generic notification.]

I'm awaiting further testing and discussion.  Probably nothing happening
for beta3.  Will report on Thursday.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/5/17 23:26, Tom Lane wrote:
> I'm quite disturbed though that the set of installed collations on these
> two test cases seem to be entirely different both from each other and from
> what you reported.  The base collations look generally similar, but the
> "keyword variant" versions are not comparable at all.

The differences you are seeing are probably because we intentionally
removed the keyword variants for ICU 4.2:
<https://www.postgresql.org/message-id/E1ddzFn-0005x5-N0%40gemulon.postgresql.org>.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 8:55 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> In my testing, there are no crashes in ICU 51 and ICU 53, so we could
> restrict this workaround to version 52 specifically.  Perhaps someone
> else can confirm my test results.

Probably worth running tests with Valgrind on those other versions, if
you haven't already.

We can worry about which versions it makes sense for, and we should,
but it's more important to make sure that the TRUST_UCOL_STRCOLLUTF8
idea is useful as a way of defending against a class of possible ICU
bugs. If every package maintainer applies their own patches for ICU,
which appears to be the case, then we shouldn't be too surprised if a
certain maintainer needs to go their own way on
TRUST_UCOL_STRCOLLUTF8.

Has anyone reported this to the Debian maintainer yet?

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/7/17 11:17, Peter Eisentraut wrote:
>> I can confirm that that avoids the crash in ICU 52, so it's something to
>> consider.  We would need more testing about which other ICU versions are
>> affected, however.

> In my testing, there are no crashes in ICU 51 and ICU 53, so we could
> restrict this workaround to version 52 specifically.  Perhaps someone
> else can confirm my test results.

I did some desultory comparison of this source code in ICU 51.2, 52.1,
and 53.1.  The crash-related code fragments are not significantly
different in 51.2 and 52.1.  It may be that 52's problem is triggered by
broken collation data that wasn't broken in previous versions ... but
I have to say that I don't find 51 worthy of trust either.  It seems
quite likely that pre-52 can still be made to crash, we just don't yet
have a test case that triggers it.

In 53, they have thrown away ucol.cpp lock stock and barrel, and rewritten
the whole thing from scratch in C++.  So it's not surprising that these
bugs are gone.  (Whether it's got other bugs is hard to tell.  I do not
get the impression, just from eyeballing it, that the code quality went
up significantly :-(.)  So that explains why upstream is not back-patching
fixes for these problems from newer versions: there are no newer versions
that they could possibly back-patch from.

I'm on board with Peter G's proposal: let's distrust ucol_strcollUTF8
in ICU < 53.  If those folk felt a need to rewrite it completely after
only three release cycles, that should tell us something.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/5/17 23:26, Tom Lane wrote:
>> I'm quite disturbed though that the set of installed collations on these
>> two test cases seem to be entirely different both from each other and from
>> what you reported.  The base collations look generally similar, but the
>> "keyword variant" versions are not comparable at all.

> The differences you are seeing are probably because we intentionally
> removed the keyword variants for ICU 4.2:
> <https://www.postgresql.org/message-id/E1ddzFn-0005x5-N0%40gemulon.postgresql.org>.

After looking more closely, I think I'd mistaken keyword variants for
something else.  As an example, I get these "af-*" collations installed
by initdb when using ICU 52.1:

af-x-icuaf-u-co-standard-x-icuaf-u-co-eor-x-icuaf-u-co-search-x-icuaf-NA-x-icuaf-NA-u-co-standard-x-icuaf-NA-u-co-eor-x-icuaf-NA-u-co-search-x-icuaf-ZA-x-icuaf-ZA-u-co-standard-x-icuaf-ZA-u-co-eor-x-icuaf-ZA-u-co-search-x-icu

while with ICU 57.1, the set is
af-x-icuaf-u-co-standard-x-icuaf-u-co-emoji-x-icuaf-u-co-eor-x-icuaf-u-co-search-x-icu

So they added emojis (I'm with Peter G that we could do without installing
that by default) ... but what became of the af-NA and af-ZA collations?
If I were a user who'd adopted one of those as a database collation,
I'd be seriously unhappy to have them go away in a later PG release.

I think we'd be well advised to filter the set of installed-by-default
collations rather strongly, in hope of avoiding such problems.  For
starters, do we really need the keyword variants at all?  People who
know what those are for can create their own collations, and take
their own risks of the feature disappearing in later ICU releases.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 12:25, Peter Geoghegan wrote:
> Has anyone reported this to the Debian maintainer yet?

This is an upstream bug, so there is nothing to report.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So they added emojis (I'm with Peter G that we could do without installing
> that by default) ... but what became of the af-NA and af-ZA collations?
> If I were a user who'd adopted one of those as a database collation,
> I'd be seriously unhappy to have them go away in a later PG release.
>
> I think we'd be well advised to filter the set of installed-by-default
> collations rather strongly, in hope of avoiding such problems.  For
> starters, do we really need the keyword variants at all?  People who
> know what those are for can create their own collations, and take
> their own risks of the feature disappearing in later ICU releases.

Actually, I think I was wrong about it being possible to create the
collations after the fact. CREATE COLLATION simply doesn't support
that. This surprised me.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 7, 2017 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think we'd be well advised to filter the set of installed-by-default
>> collations rather strongly, in hope of avoiding such problems.  For
>> starters, do we really need the keyword variants at all?  People who
>> know what those are for can create their own collations, and take
>> their own risks of the feature disappearing in later ICU releases.

> Actually, I think I was wrong about it being possible to create the
> collations after the fact. CREATE COLLATION simply doesn't support
> that. This surprised me.

Surely that's a bug?
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> In my testing, there are no crashes in ICU 51 and ICU 53, so we could
> restrict this workaround to version 52 specifically.  Perhaps someone
> else can confirm my test results.

I'm testing all collations with ICU-51.2 with the same data as previously,
and I just see it crashing at the 103th and a few other az-* after that:

2017-08-07 19:09:26.538 CEST [59353] LOG:  server process (PID 60458) was
terminated by signal 11: Segmentation fault
2017-08-07 19:09:26.538 CEST [59353] DETAIL:  Failed process was running:
select count(distinct wordtext collate "az-Latn-AZ-u-co-search-x-icu") from
words_test
2017-08-07 19:09:26.538 CEST [59353] LOG:  terminating any other active
server processes
2017-08-07 19:09:26.538 CEST [60316] WARNING:  terminating connection because
of crash of another server process

I don't have core files for this run but that's probably enough to say
that ICU-51 is no good for us.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

I wrote:
> I'm on board with Peter G's proposal: let's distrust ucol_strcollUTF8
> in ICU < 53.  If those folk felt a need to rewrite it completely after
> only three release cycles, that should tell us something.

A bit of googling turned up this document:
http://site.icu-project.org/design/collation/v2
which indicates that the ucol_strcollUTF8 rewrite was really just one
facet of a rather major reimplementation.  But still, that document says
specifically that they knew of multiple bugs and suspected many more.

At this point I'm thinking that really what we ought to do is deprecate
using any pre-53 ICU release for Postgres.  It is very clear that those
versions are an entirely different beast from 53-and-up, and that they
are now abandonware so far as ICU upstream is concerned.

I have not checked, but I wonder whether 53 is also when the large
change in the set of available collations happened.  Maybe rejecting
pre-53 would also be enough to assuage my concerns about disappearing
collations.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually, I think I was wrong about it being possible to create the
>> collations after the fact. CREATE COLLATION simply doesn't support
>> that. This surprised me.
>
> Surely that's a bug?

I had it right the first time, actually -- it is possible.

There is a lack of any example of it in the docs. And, references to
CREATE COLLATION right under "23.2.2.2.2. ICU collations" in the docs
are under the title "Copying Collations". That threw me off.

I think that we are entitled to assume plenty about what collations
ICU makes available, at least per version.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> At this point I'm thinking that really what we ought to do is deprecate
> using any pre-53 ICU release for Postgres.  It is very clear that those
> versions are an entirely different beast from 53-and-up, and that they
> are now abandonware so far as ICU upstream is concerned.

+1. I think that ICU is very important, and I would like to see it
become the defacto standard collation provider, but I don't think it's
something that needs to happen on an aggressive schedule.

> I have not checked, but I wonder whether 53 is also when the large
> change in the set of available collations happened.  Maybe rejecting
> pre-53 would also be enough to assuage my concerns about disappearing
> collations.

I bet it would. Although, I should point out that ICU has an annoying
habit of being very tolerant of misspellings, or alternative
spellings, so it might not be as bad as it appeared. We could perhaps
fix this by taking a greater interest in the collations that are
initially available, documenting useful variations, and so on. We
should do that anyway.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 13:51, Tom Lane wrote:
> At this point I'm thinking that really what we ought to do is deprecate
> using any pre-53 ICU release for Postgres.  It is very clear that those
> versions are an entirely different beast from 53-and-up, and that they
> are now abandonware so far as ICU upstream is concerned.

Maybe a note in the documentation to that effect would be useful.

I don't think we should prevent the use of older ICU versions in the
code.  That would probably annoy people who are forced to develop or
test their applications on older platforms.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 13:30, Tom Lane wrote:
> ... but what became of the af-NA and af-ZA collations?

This was already mentioned here:
https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/7/17 13:30, Tom Lane wrote:
>> ... but what became of the af-NA and af-ZA collations?

> This was already mentioned here:
> https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com

Well, the fact that they're "redundant" doesn't really help you if
you can't pg_upgrade because the collation name you chose in v10 is
not present in initdb's results in v11.  So this is still a serious
issue to my mind.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the fact that they're "redundant" doesn't really help you if
> you can't pg_upgrade because the collation name you chose in v10 is
> not present in initdb's results in v11.  So this is still a serious
> issue to my mind.

I agree.

Even MongoDB has ICU support these days. They specifically document
which collations are supported. It's just the same for DB2, and other
systems that build their collations on ICU. Users do not "use the ICU
collations" on these other systems. They simply use the collations
that are available, choosing from a list in the documentation, or
possibly create their own collations with their own customization.

The ICU collations are based on the CLDR data and an IETF standard's
idea of a locale identifier [1], so in an important sense they're
supposed to be universal; they're not tied to ICU in particular. This
is probably why ICU is ridiculously forgiving of alternate collation
names, and will not throw an error if you specify an ICU collation
name that is total garbage within CREATE COLLATION (there is a
Postgres regression test that proves this for ICU, actually): As far
as ICU is concerned, this may be coming from input from an end user
over the web, where it makes sense to be so forgiving.

Even stuff like the names for emoji collations, or phonebook
collations, are covered by a standard, though it's not quite an IETF
standard. RFC 6067 says that the CLDR data is the authoritative source
of which variant subtags are allowed, and ICU uses CLDR, from the
Unicode consortium.

We need to move further away from the idea that there are ICU
collations just like there are libc collations.

[1] https://www.rfc-editor.org/rfc/bcp/bcp47.txt
-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 15:29, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 8/7/17 13:30, Tom Lane wrote:
>>> ... but what became of the af-NA and af-ZA collations?
> 
>> This was already mentioned here:
>> https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com
> 
> Well, the fact that they're "redundant" doesn't really help you if
> you can't pg_upgrade because the collation name you chose in v10 is
> not present in initdb's results in v11.  So this is still a serious
> issue to my mind.

It has never been the case that there is a guarantee that a new
operating system environment will have the same or more collations
available as an earlier version.  Even glibc removes or renames locales.

The way pg_upgrade works, it will presumably detect that during the
schema dump/restore, so you should get a proper error message.

In this particular case, you can create user-defined collations to fill
in the missing names if you want.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 16:21, Peter Geoghegan wrote:
> Even MongoDB has ICU support these days. They specifically document
> which collations are supported. It's just the same for DB2, and other
> systems that build their collations on ICU. Users do not "use the ICU
> collations" on these other systems. They simply use the collations
> that are available, choosing from a list in the documentation, or
> possibly create their own collations with their own customization.

We cannot know what versions of ICU or a C library users would build
PostgreSQL with, so I don't see how we can provide a definite list of
available locales, other than at run time like we do now.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> It has never been the case that there is a guarantee that a new
> operating system environment will have the same or more collations
> available as an earlier version.  Even glibc removes or renames locales.

Indeed, and one of the alleged selling points of ICU was greater stability
of collation behaviors (including naming).  I'd like to try to actually
achieve that.

> In this particular case, you can create user-defined collations to fill
> in the missing names if you want.

NO. YOU. CAN'T.  At least not with convenient upgrade methods like
pg_upgrade.  If the collation name isn't produced by the newer version's
initdb, pg_upgrade will fail.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> It has never been the case that there is a guarantee that a new
>> operating system environment will have the same or more collations
>> available as an earlier version.  Even glibc removes or renames locales.

A major goal of the BCP 47 language tag format (and related standards)
seems to be to decouple technical implementation details (e.g.
encoding) from natural language/cultural concerns. That's the only
approach that really scales for applications in the modern
internet-centric world, I imagine. The glibc collations do very badly
there.

> Indeed, and one of the alleged selling points of ICU was greater stability
> of collation behaviors (including naming).  I'd like to try to actually
> achieve that.

IETF/CLDR describe the naming conventions for locales/collations in
excruciating detail, across multiple RFCs/specs. There is no excuse
for not making use of that work, IMV.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
"Daniel Verite"
Date:
Tom Lane wrote:

> The other class of failures amounts to this loop iterating till it falls
> off the end of memory:
>
>        while(schar > (tchar = *(UCharOffset+offset))) { /* since the
> contraction codepoints should be ordered, we skip all that are smaller */
>            offset++;
>        }
>
> which is unsurprising, because (in my core dump) schar is 1113834 which is
> larger than any possible UChar value, so the loop cannot terminate except
> by crashing.  The crash occurred while trying to process this string:
>
>  buf2 = 0x1614d20 "requ\364\217\273\252te",
>
> and I do not think it's coincidence that that multibyte character
> there corresponds to U+10FEEA or decimal 1113834.  Apparently
> they've got some bugs with dealing with characters beyond U+FFFF,
> at least in certain locales.

For that failure, this single comparison seems to hit it consistently
with ICU-52:

# SELECT 'ab' < 'abc'||chr(128519) COLLATE "bs-x-icu";

server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing
therequest. 

128519 is 'SMILING FACE WITH HALO' (U+1F607), and nearby codepoints
cause the same bug.

At least from this simple test we can produce a bug report
to the maintainers who are still interested in pre-53 versions.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Robert Haas
Date:
On Mon, Aug 7, 2017 at 1:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So they added emojis (I'm with Peter G that we could do without installing
> that by default) ... but what became of the af-NA and af-ZA collations?
> If I were a user who'd adopted one of those as a database collation,
> I'd be seriously unhappy to have them go away in a later PG release.

I disagree with this argument.  I think you're looking at this problem
from the wrong angle.  If the customer doesn't really care what
collation they end up with, then filtering down the set of collations
is exactly the right fix -- they'll pick something that is included
rather than something that is excluded.  However, I think we should
assume that people choose a collation because they want and need the
sorting behavior it gives.  In that case, if we remove the collation
from the default catalog contents, then people who need it will (1)
see that it's not there, be unhappy, and give up or (2) figure out
that they can create it manually, do so, and then have the same
upgrade problem.

In other words, excluding, say, emoji collations from what gets
imported is just making a value judgement that those collations aren't
important and people shouldn't want to use them.  It's saying that we
know better than the ICU maintainers which collations ought to exist.
To use one of your phrases, color me skeptical.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 10:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> In other words, excluding, say, emoji collations from what gets
> imported is just making a value judgement that those collations aren't
> important and people shouldn't want to use them.

Yes, it is. I think that's fine, though. Other database systems that
use ICU for collations do this. Without exception, I think.

> It's saying that we
> know better than the ICU maintainers which collations ought to exist.

As I've pointed out already, we populate pg_collation by asking
ucol_getKeywordValuesForLocale() to get only "commonly used [variant]
values with the given locale" within pg_import_system_collations().
That's a value judgement. This doesn't have documented stability
guarantees. I'm not sure how many technically distinct collations we
could generate by actually including every possible variant, but it
might be a great deal more than we get already.

Why would users have the same upgrade problem when they created these
collations manually? The BCP 47 language tag format seems to be very
much centered around "doing its best", even though that could be
pretty far from the desired behavior [1]. This makes a certain amount
of sense, considering that it's well documented, and can be considered
a stable API. It really shouldn't break, but if it does then I suppose
it's probably because the behavior doesn't have an analog in the ICU
version in use.

[1] postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Aug 9, 2017 at 10:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> In other words, excluding, say, emoji collations from what gets
>> imported is just making a value judgement that those collations aren't
>> important and people shouldn't want to use them.

> Yes, it is. I think that's fine, though. Other database systems that
> use ICU for collations do this. Without exception, I think.

Actually, I don't think that's the issue at all.  People are free to
make other ICU collations if they want to.  My point is that we should
encourage them to do that, rather than depend on initdb-provided
collations, because manually-created collations are much more certain
to move across version upgrades safely.  If we were sure that
pg_import_system_collations would produce pretty much the same set of
collation names with future ICU releases as it does with current ones,
then there would be no issue --- but the evidence at hand suggests the
opposite.  I want to do something to address that stability issue before
it comes back to bite us.

I suppose a different way to address this would be to make pg_upgrade
smart enough to deal with the situation, by creating ICU collations
that are used in the source installation but are missing from the
initdb-provided set in the target.  But even if we had that, I'm
dubious that having hundreds of collations present by default is really
all that user-friendly.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yes, it is. I think that's fine, though. Other database systems that
>> use ICU for collations do this. Without exception, I think.
>
> Actually, I don't think that's the issue at all.  People are free to
> make other ICU collations if they want to.  My point is that we should
> encourage them to do that, rather than depend on initdb-provided
> collations, because manually-created collations are much more certain
> to move across version upgrades safely.  If we were sure that
> pg_import_system_collations would produce pretty much the same set of
> collation names with future ICU releases as it does with current ones,
> then there would be no issue --- but the evidence at hand suggests the
> opposite.  I want to do something to address that stability issue before
> it comes back to bite us.

I must have been unclear, then. I am fully in agreement with what you say here.


-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suppose a different way to address this would be to make pg_upgrade
> smart enough to deal with the situation, by creating ICU collations
> that are used in the source installation but are missing from the
> initdb-provided set in the target.  But even if we had that, I'm
> dubious that having hundreds of collations present by default is really
> all that user-friendly.

Let's try to get to a proposal that we'll all be happy with.

The base locale names, which include regional variants like Austrian
German, are as stable as possible. pg_import_system_collations()
should just add those. If anything changes there, it's because the
locale literally ceases to exist for political reasons (it is subject
to a 5 year CLDR deprecation policy when this happens). Not much point
in worrying about that. You said yourself that these seemed stable.

Separately, there could be a new SQL-callable function that advertises
what ICU says it makes available, melded into a valid collation name,
in the style of today's pg_import_system_collations(). This could ask
ucol_getKeywordValuesForLocale() to get only "commonly used [variant]
values with the given locale" for a specified base locale (system
pg_collation OID?), or it could also show every possible variant.

We'd encourage users to investigate the variants within the
documentation, including giving specific examples of what is generally
possible. There could be practical examples of why the user might
actually want to create, say, a phonebook collation, or a traditional
Spanish collation.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Robert Haas
Date:
On Wed, Aug 9, 2017 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suppose a different way to address this would be to make pg_upgrade
> smart enough to deal with the situation, by creating ICU collations
> that are used in the source installation but are missing from the
> initdb-provided set in the target.

Yeah, that idea has some appeal.

> But even if we had that, I'm
> dubious that having hundreds of collations present by default is really
> all that user-friendly.

The flip side is that it's not clear to me how the user is supposed to
discover possibly-useful collations that don't show up in
pg_collation. The documentation for CREATE COLLATION says only that
"The available choices depend on the operating system and build
options." Section 23.2.2.2 says "With ICU, it is not sensible to
enumerate all possible locale names. ICU uses a particular naming
system for locales, but there are many more ways to name a locale than
there are actually distinct locales. (In fact, any string will be
accepted as a locale name.)"  So, there's no guidance on how to pick a
string that will work, and no error if you pick one that doesn't.
Wahoo!

I assume that this is in fact the whole point of putting any
ICU-related entries in pg_collation at all -- or for that matter any
libc-related entries.  If it were easy to guess what parameters you
might want to provide to CREATE COLLATION, we could just let users
create the collations they happen to need and it would be only a minor
inconvenience.  As it is, removing things from pg_collation seems to
make them all but undiscoverable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:46 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> The base locale names, which include regional variants like Austrian
> German, are as stable as possible. pg_import_system_collations()
> should just add those.

I just noticed that ICU collations within pg_collation do not include
entries for each of the many regional variants of English, such as
English from Ireland, Britain, the Cook Islands, etc. OTOH, glibc does
have all these variants listed within pg_collation, at least on my
system. This is because we don't actually add one pg_collation entry
per ICU locale. Rather, we add one entry per distinct ICU collation,
via ucol_countAvailable() + ucol_getAvailable(). Maybe Austrian German
actually is sorted in a slightly different way to German German, and
so gets its own pg_collation entry?

I guess this is fine, because English is in practice sorted in exactly
the same way throughout all English locales (except for a few true
variants, like "English (United States, Computer)"). These unlisted
locales do have different currency symbols and so on, but that's not
something we'll use ICU for, so it's fine. This difference will need
to be documented, though. It would help if CREATE COLLATION left new
ICU collations with the same useful "Description" as initdb created
collations will have; maybe that should be added.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> It would help if CREATE COLLATION left new
> ICU collations with the same useful "Description" as initdb created
> collations will have; maybe that should be added.

I do not think it is CREATE COLLATION's job to provide a comment;
no other CREATE command does so.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> It would help if CREATE COLLATION left new
>> ICU collations with the same useful "Description" as initdb created
>> collations will have; maybe that should be added.
>
> I do not think it is CREATE COLLATION's job to provide a comment;
> no other CREATE command does so.

It's not a comment. It's a description that comes from ICU. Just like
when collations are added by initdb, so that there is some kind of
parity. The user doesn't get a say in what it is.

It helps the user to verify that they're getting what they thought
they were, which seems very valuable, given how loosely the 'locale'
string they provide may be interpreted.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/7/17 14:42, Peter Eisentraut wrote:
> On 8/7/17 13:51, Tom Lane wrote:
>> At this point I'm thinking that really what we ought to do is deprecate
>> using any pre-53 ICU release for Postgres.  It is very clear that those
>> versions are an entirely different beast from 53-and-up, and that they
>> are now abandonware so far as ICU upstream is concerned.
> 
> Maybe a note in the documentation to that effect would be useful.
> 
> I don't think we should prevent the use of older ICU versions in the
> code.  That would probably annoy people who are forced to develop or
> test their applications on older platforms.

I propose the attached patches for fixing the crash.  This just disables
the use of the problematic function for versions before ICU 53.

(The 0001 patch is just something I found that was needed while
experimenting with different ICU versions.  The 0002 patch is the real fix.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/9/17 20:45, Peter Eisentraut wrote:
> On 8/7/17 14:42, Peter Eisentraut wrote:
>> On 8/7/17 13:51, Tom Lane wrote:
>>> At this point I'm thinking that really what we ought to do is deprecate
>>> using any pre-53 ICU release for Postgres.  It is very clear that those
>>> versions are an entirely different beast from 53-and-up, and that they
>>> are now abandonware so far as ICU upstream is concerned.
>>
>> Maybe a note in the documentation to that effect would be useful.
>>
>> I don't think we should prevent the use of older ICU versions in the
>> code.  That would probably annoy people who are forced to develop or
>> test their applications on older platforms.
> 
> I propose the attached patches for fixing the crash.  This just disables
> the use of the problematic function for versions before ICU 53.

This has been committed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/9/17 15:29, Robert Haas wrote:
> On Wed, Aug 9, 2017 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suppose a different way to address this would be to make pg_upgrade
>> smart enough to deal with the situation, by creating ICU collations
>> that are used in the source installation but are missing from the
>> initdb-provided set in the target.
> 
> Yeah, that idea has some appeal.

I think the manual workflow would be that you initdb the target
instance, then log into the target instance to create the missing
collations, then run pg_upgrade.

While pg_upgrade could probably detect which collations are missing on
the target side, I don't think it follows that it can just create the
missing ones automatically.  Manual intervention would be necessary (and
desirable IMO) to analyze the situation.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/9/17 14:07, Tom Lane wrote:
> Actually, I don't think that's the issue at all.  People are free to
> make other ICU collations if they want to.  My point is that we should
> encourage them to do that, rather than depend on initdb-provided
> collations, because manually-created collations are much more certain
> to move across version upgrades safely.  If we were sure that
> pg_import_system_collations would produce pretty much the same set of
> collation names with future ICU releases as it does with current ones,
> then there would be no issue --- but the evidence at hand suggests the
> opposite.  I want to do something to address that stability issue before
> it comes back to bite us.

I understand this concern, but I don't know how to solve it.  Any string
(more or less), is valid as an ICU locale name.  In order to create some
initial collations, we ask ICU, give me a list of distinct locales with
their canonical names.  That's the list we use.  The next version of ICU
gives us a different list.  There is no way to tell the API, no, I meant
the list you gave me last time.

The only way to solve that problem directly is to preload no locales.
Even the idea of maintaining a list of locales by hand cannot guarantee
that things won't go away in the future.

I think it is better in the long run that we acknowledge that the list
of preloaded collations will change (it already happens with glibc), and
we develop some guidance on how to deal with that.  As mentioned in
another message, I don't foresee a fully automatic solution, however.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/9/17 14:46, Peter Geoghegan wrote:
> The base locale names, which include regional variants like Austrian
> German, are as stable as possible.

Well, no.  The problem (well, one of several) is exactly that an older
version of ICU includes things like de-BE, and a newer version does not.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 8:17 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/9/17 14:46, Peter Geoghegan wrote:
>> The base locale names, which include regional variants like Austrian
>> German, are as stable as possible.
>
> Well, no.  The problem (well, one of several) is exactly that an older
> version of ICU includes things like de-BE, and a newer version does not.

I can't understand why that should be. ICU 58 uses CLDR 30, which
quite clearer does have the locale de_BE:

http://www.unicode.org/cldr/charts/30/summary/de.html

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/14/17 12:09, Peter Geoghegan wrote:
> On Mon, Aug 14, 2017 at 8:17 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 8/9/17 14:46, Peter Geoghegan wrote:
>>> The base locale names, which include regional variants like Austrian
>>> German, are as stable as possible.
>>
>> Well, no.  The problem (well, one of several) is exactly that an older
>> version of ICU includes things like de-BE, and a newer version does not.
> 
> I can't understand why that should be. ICU 58 uses CLDR 30, which
> quite clearer does have the locale de_BE:

I'm not familiar with the nuances of CLDR, but quite plausibly there is
a *locale* named de_BE, but the collation of "de" as used in "BE" is not
different than "de" used elsewhere, so it is omitted from the list as
redundant.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:27 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>> I can't understand why that should be. ICU 58 uses CLDR 30, which
>> quite clearer does have the locale de_BE:
>
> I'm not familiar with the nuances of CLDR, but quite plausibly there is
> a *locale* named de_BE, but the collation of "de" as used in "BE" is not
> different than "de" used elsewhere, so it is omitted from the list as
> redundant.

That's also what I think is the most likely explanation. Maybe the
lesson here is that we ought to be adding all locales at initdb time,
not all distinct collations. That would also get us a dozen or so
identical English variants, and so on, just like with glibc. I don't
especially want that, but it seems like it might be better from a
stability point of view. After all, CLDR gives guarantees about the
stability of locales, not collations.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/9/17 17:29, Peter Geoghegan wrote:
> On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>>> It would help if CREATE COLLATION left new
>>> ICU collations with the same useful "Description" as initdb created
>>> collations will have; maybe that should be added.

I had been thinking about that, too, while I was writing it, but there
is the semantic objection:

>> I do not think it is CREATE COLLATION's job to provide a comment;
>> no other CREATE command does so.

How about a special option in the CREATE COLLATION to create the comment
automatically on demand?  Sample patch attached.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 10:22 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/9/17 17:29, Peter Geoghegan wrote:
>> On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Peter Geoghegan <pg@bowt.ie> writes:
>>>> It would help if CREATE COLLATION left new
>>>> ICU collations with the same useful "Description" as initdb created
>>>> collations will have; maybe that should be added.
>
> I had been thinking about that, too, while I was writing it, but there
> is the semantic objection:
>
>>> I do not think it is CREATE COLLATION's job to provide a comment;
>>> no other CREATE command does so.

I must admit that I missed that the description was cataloged as a
comment. Is it out of the question to add another column to
pg_collation, that just always has the ICU-provided description? These
seem like separate concerns to me.


-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
>>> On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I do not think it is CREATE COLLATION's job to provide a comment;
>>>> no other CREATE command does so.

> I must admit that I missed that the description was cataloged as a
> comment. Is it out of the question to add another column to
> pg_collation, that just always has the ICU-provided description? These
> seem like separate concerns to me.

That seems like a reasonable solution to me, if it's not too late
for another catversion bump.  I like it better than the auto-comment
thing Peter E. suggests nearby.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I must admit that I missed that the description was cataloged as a
>> comment. Is it out of the question to add another column to
>> pg_collation, that just always has the ICU-provided description? These
>> seem like separate concerns to me.
>
> That seems like a reasonable solution to me, if it's not too late
> for another catversion bump.  I like it better than the auto-comment
> thing Peter E. suggests nearby.

Another advantage of doing a catversion bump for v10 is that it lets
us add a new SQL-callable function (or two). I would like to add a
function to solve some of our discoverability problems around variant
collations. We now agree that we should not add variants at initdb
time, but Peter E's concern about where that leaves the
discoverability of the variants (phonebook, emoji, pinyin, and so on)
is a concern that I share.

Similarly, it would also be helpful if users could inquire about both
ICU version, and the corresponding CLDR version, using a new view.
That way, they could easily find the right CLDR version/standard,
which is where all the customization stuff is actually documented.

Any thoughts on that?

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That seems like a reasonable solution to me, if it's not too late
>> for another catversion bump.  I like it better than the auto-comment
>> thing Peter E. suggests nearby.

> Another advantage of doing a catversion bump for v10 is that it lets
> us add a new SQL-callable function (or two). I would like to add a
> function to solve some of our discoverability problems around variant
> collations. We now agree that we should not add variants at initdb
> time, but Peter E's concern about where that leaves the
> discoverability of the variants (phonebook, emoji, pinyin, and so on)
> is a concern that I share.

> Similarly, it would also be helpful if users could inquire about both
> ICU version, and the corresponding CLDR version, using a new view.
> That way, they could easily find the right CLDR version/standard,
> which is where all the customization stuff is actually documented.

Hmm ... this all seems like good stuff, but it also seems like new
development, and it's way too late in the v10 cycle for any significant
amount of that.

My view of the state of our ICU support is that in v10, only brave
early adopters are going to be messing with it; it's not going to be
production-grade for most people for another release or three.  That
being the case, I don't think that whether second-order options have
discoverability problems is really a critical concern right now.

I think at this point I'd vote to drop auto-installation of keyword
variants, and I would also vote against Peter's auto_comment proposal.
We shouldn't be adding features that we're going to supersede in v11.

Maybe, if people are okay with a catversion bump this late, we could
push the ICU descriptions into pg_collation proper, but I think it
would be fine to leave pg_import_system_collations's behavior in that
regard alone for v10, too.

As for new functions, views, etc, that's all excellent v11 material.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm ... this all seems like good stuff, but it also seems like new
> development, and it's way too late in the v10 cycle for any significant
> amount of that.

Fair enough.

> I think at this point I'd vote to drop auto-installation of keyword
> variants, and I would also vote against Peter's auto_comment proposal.
> We shouldn't be adding features that we're going to supersede in v11.

+1

> Maybe, if people are okay with a catversion bump this late, we could
> push the ICU descriptions into pg_collation proper, but I think it
> would be fine to leave pg_import_system_collations's behavior in that
> regard alone for v10, too.

I really think we should add a pg_collation column to store ICU's
description of the collation, because that's something that we'll have
to live with forever.

I don't think we have pg_import_system_collations's behavior all
worked out just yet. I'm not sure that we'd do the right thing at
initdb even if pg_import_system_collations did not add the keyword
variants. It would still add all distinct collations, instead of all
CLDR locales (some of which will have equivalent collation behavior).
I suspect that this explains the stability problems you saw [1] that
were not explainable by available variants differing.

I do think that we need to have a better answer to this "locales vs.
collations" question for v10.

[1] postgr.es/m/CAH2-Wzm3EOEqYB48m7aT7sjPXw=kUHU=FdzzdU90dfiatt62CQ@mail.gmail.com

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 14, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe, if people are okay with a catversion bump this late, we could
>> push the ICU descriptions into pg_collation proper, but I think it
>> would be fine to leave pg_import_system_collations's behavior in that
>> regard alone for v10, too.

> I really think we should add a pg_collation column to store ICU's
> description of the collation, because that's something that we'll have
> to live with forever.

Maybe I'm missing something, but I didn't think there would be anything
terribly disastrous about having pg_import_system_collations() install
these descriptions as comments in v10 and then changing it to put them
directly into pg_collation in later versions.  We'd have to adapt the
behavior of psql's \dO, but that's true no matter when we change that.

> I don't think we have pg_import_system_collations's behavior all
> worked out just yet.

Agreed, but we can probably tweak that without forcing a catversion
bump.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe I'm missing something, but I didn't think there would be anything
> terribly disastrous about having pg_import_system_collations() install
> these descriptions as comments in v10 and then changing it to put them
> directly into pg_collation in later versions.  We'd have to adapt the
> behavior of psql's \dO, but that's true no matter when we change that.

Well, I don't think you're going to get on board with the idea of
having CREATE COLLATION add a comment (and I agree that that's a bad
idea), so CREATE COLLATION is in a very bad spot if we put this off
for a release.

Unless we add an ICU-owned pg_collation column for the human-readable
ICU string, CREATE COLLATION will not let the user determine what ICU
thinks the collation/language is from within Postgres. This seems
arbitrarily different from the situation with initdb ICU collations,
where currently the ICU string comment is added. Importantly, not
having the ICU string deprives the user of a way of determining
whether or not ICU has even heard of the language/region they
specified. The language tags are supposed to be forgiving, but that
does mean that ICU won't actually complain when the user fat-fingers
their CREATE COLLATION language tag.

(Note that this doesn't mean that we couldn't provide *minimal*
sanitization of language tags by CREATE COLLATION; we can and should
reject multiple "co" subtags, and similar unambiguously bogus
subtags.)

>> I don't think we have pg_import_system_collations's behavior all
>> worked out just yet.
>
> Agreed, but we can probably tweak that without forcing a catversion
> bump.

True.

I might have used the wrong terminology on this "locales vs.
collations" thing. Perhaps what we actually need is pg_collation
entries at initdb time that are an enumeration of all locales *and
their regions*, as opposed to all locales. I'm researching this now.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 14, 2017 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe I'm missing something, but I didn't think there would be anything
>> terribly disastrous about having pg_import_system_collations() install
>> these descriptions as comments in v10 and then changing it to put them
>> directly into pg_collation in later versions.  We'd have to adapt the
>> behavior of psql's \dO, but that's true no matter when we change that.

> Well, I don't think you're going to get on board with the idea of
> having CREATE COLLATION add a comment (and I agree that that's a bad
> idea), so CREATE COLLATION is in a very bad spot if we put this off
> for a release.

How so?  When you create the collation in v11, the column will get
populated, no problem.  (I'm envisioning that we add code to CREATE
COLLATION to populate it, much as Peter E's auto_comment patch did
except for where the data goes.)

> Unless we add an ICU-owned pg_collation column for the human-readable
> ICU string, CREATE COLLATION will not let the user determine what ICU
> thinks the collation/language is from within Postgres.

Somehow, this does not strike me as a stop-ship problem for v10.
It's a nice-to-have feature, sure, but we're past the time for that.
For now, we have to take the attitude that it's the user's problem
whether the arguments for CREATE COLLATION are sane --- we can blame
that policy on ICU itself.

(BTW, I would argue vehemently against defining this new column as
"ICU owned".  I think it should just be "colldescription" and we
populate it with whatever is reasonable.  For example we can move the
comments for the hardwired collations there.  I dunno whether we can
come up with anything authoritative for libc collations, but if we
can, they're welcome to the party as well.)
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 1:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, I don't think you're going to get on board with the idea of
>> having CREATE COLLATION add a comment (and I agree that that's a bad
>> idea), so CREATE COLLATION is in a very bad spot if we put this off
>> for a release.
>
> How so?  When you create the collation in v11, the column will get
> populated, no problem.  (I'm envisioning that we add code to CREATE
> COLLATION to populate it, much as Peter E's auto_comment patch did
> except for where the data goes.)

I meant that we'd be in a bad spot for v10, at least. I suppose that
you're right that it can be corrected in v11. This doesn't seem like
the best plan, but I understand that we are short on time here.

>> Unless we add an ICU-owned pg_collation column for the human-readable
>> ICU string, CREATE COLLATION will not let the user determine what ICU
>> thinks the collation/language is from within Postgres.
>
> Somehow, this does not strike me as a stop-ship problem for v10.
> It's a nice-to-have feature, sure, but we're past the time for that.
> For now, we have to take the attitude that it's the user's problem
> whether the arguments for CREATE COLLATION are sane --- we can blame
> that policy on ICU itself.

I hope that we manage to slip in the minimal sanitization I described,
which doesn't need a catversion bump at all.

> (BTW, I would argue vehemently against defining this new column as
> "ICU owned".  I think it should just be "colldescription" and we
> populate it with whatever is reasonable.

Agreed. "ICU owned" was intended as an informal shorthand for
"collation provider owned".

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 12:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I might have used the wrong terminology on this "locales vs.
> collations" thing. Perhaps what we actually need is pg_collation
> entries at initdb time that are an enumeration of all locales *and
> their regions*, as opposed to all locales. I'm researching this now.

I've figured this out. It's not a matter of "locales vs. collations"
-- it's a matter of avoiding skipping some ICU collations during
initdb, which we seem to do right now.

CollationCreate() is being passed "if_not_exists = true" for ICU
collations during initdb for ICU collations, within
pg_import_system_collations(). We're actually not creating new entries
for some collations that get their own distinct
listing/ucol_countAvailable() iteration. This happens because the
ICU-wise name isn't fully spelled out (country wasn't included for
entries that had one), and so we incorrectly ignore some collation
entries as "duplicates".

(I don't know why some regional variants, like de_AT, are still
included in master's pg_collation, but most regional variants are
not.)

Attached self-contained program should give you some idea what I'm
talking about. Sample output:

$ ./icu-coll-versions | grep Spanish
Spanish                                           | es          | es
     |           | es        | es
Spanish (Latin America)                           | es_419      | es
     | 419       | es        | es
Spanish (Argentina)                               | es_AR       | es
     | AR        | es        | es
Spanish (Bolivia)                                 | es_BO       | es
     | BO        | es        | es
Spanish (Chile)                                   | es_CL       | es
     | CL        | es        | es
Spanish (Colombia)                                | es_CO       | es
     | CO        | es        | es
Spanish (Costa Rica)                              | es_CR       | es
     | CR        | es        | es
Spanish (Cuba)                                    | es_CU       | es
     | CU        | es        | es
Spanish (Dominican Republic)                      | es_DO       | es
     | DO        | es        | es
Spanish (Ceuta & Melilla)                         | es_EA       | es
     | EA        | es        | es
Spanish (Ecuador)                                 | es_EC       | es
     | EC        | es        | es
Spanish (Spain)                                   | es_ES       | es
     | ES        | es        | es
Spanish (Equatorial Guinea)                       | es_GQ       | es
     | GQ        | es        | es
Spanish (Guatemala)                               | es_GT       | es
     | GT        | es        | es
Spanish (Honduras)                                | es_HN       | es
     | HN        | es        | es
Spanish (Canary Islands)                          | es_IC       | es
     | IC        | es        | es
Spanish (Mexico)                                  | es_MX       | es
     | MX        | es        | es
Spanish (Nicaragua)                               | es_NI       | es
     | NI        | es        | es
Spanish (Panama)                                  | es_PA       | es
     | PA        | es        | es
Spanish (Peru)                                    | es_PE       | es
     | PE        | es        | es
Spanish (Philippines)                             | es_PH       | es
     | PH        | es        | es
Spanish (Puerto Rico)                             | es_PR       | es
     | PR        | es        | es
Spanish (Paraguay)                                | es_PY       | es
     | PY        | es        | es
Spanish (El Salvador)                             | es_SV       | es
     | SV        | es        | es
Spanish (United States)                           | es_US       | es
     | US        | es        | es
Spanish (Uruguay)                                 | es_UY       | es
     | UY        | es        | es
Spanish (Venezuela)                               | es_VE       | es
     | VE        | es        | es

As an example, "es_VE" is a canonical locale name here.

Note that there are many more collations listed here than Spanish ICU
pg_collation entries that you'll find following initdb (forgetting
about the keyword variant stuff, which is really another issue).

In order to get stable pg_collation names, we should probably use
country code within pg_import_system_collations(), for collations
where there is a country code. We also need to add script (using
uloc_getScript()), so that those collations with a non-default script
(e.g. "Serbian (Latin, Bosnia & Herzegovina)") similarly have their
own pg_collation entries (FYI, "script" isn't broken out into its own
column by my test program). In short, every row that this test program
outputs should have a pg_collation entry after initdb -- the number
should match exactly.

Do we really need to pass "if_not_exists = true", anyway? Why
shouldn't initdb fail if there are apparent duplicate ICU collations?

BTW, I suspect that this is why very old ICU versions appeared to not
accept language tags constructed with subtags produced by
ucol_getKeywordValuesForLocale(), per eccead9's commit message. The
collation's country/script was omitted, so concatenating variants
somehow didn't make sense (wrong script?). It certainly makes no sense
that earlier ICU versions just didn't get a valid language tag when
using the infrastructure that is supposed to be used to build language
tags. It may be necessary to revert eccead9, once this understanding
of the situation is confirmed.

-- 
Peter Geoghegan

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment
Peter Geoghegan <pg@bowt.ie> writes:
> Do we really need to pass "if_not_exists = true", anyway? Why
> shouldn't initdb fail if there are apparent duplicate ICU collations?

Because the villagers will come after you; see commit
0b13b2a7712b6f91590b7589a314240a14520c2f.

It's possible that there's an argument for not de-duplicating ICU
collation names even though we must do so for libc collation names.
But I'm not exactly sure what that would be.  Having initdb fail if
the ICU configuration is funny isn't going to endear us to anyone.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> Do we really need to pass "if_not_exists = true", anyway? Why
>> shouldn't initdb fail if there are apparent duplicate ICU collations?
>
> Because the villagers will come after you; see commit
> 0b13b2a7712b6f91590b7589a314240a14520c2f.

Fair enough. Changing it is actually unnecessary anyway.
pg_import_system_collations() should simply use uloc_countAvailable()
+ uloc_getAvailable, rather than ucol_countAvailable() +
ucol_getAvailable(). Just like my test program.

FYI, without the keyword variants, but with all base locales, ICU 55
adds 684 base locales to pg_collation. Should I write a real patch to
make the changes I deem necessary, or do you prefer to Peter E.?

The comments that pg_import_system_collations() adds to each locale
have an annoying tendency to exclude anything that isn't ascii-safe,
which is much more noticeable when locales populate pg_collation
instead of collations:

...foo    │ fr-BE-x-icu            │ fr-BE            │ fr-BE
│ icu      │ French (Belgium)foo    │ fr-BF-x-icu            │ fr-BF            │ fr-BF
│ icu      │ French (Burkina Faso)foo    │ fr-BI-x-icu            │ fr-BI            │ fr-BI
│ icu      │ French (Burundi)foo    │ fr-BJ-x-icu            │ fr-BJ            │ fr-BJ
│ icu      │ French (Benin)foo    │ fr-BL-x-icu            │ fr-BL            │ fr-BL
│ icu      │foo    │ fr-CA-x-icu            │ fr-CA            │ fr-CA
│ icu      │ French (Canada)foo    │ fr-CD-x-icu            │ fr-CD            │ fr-CD
│ icu      │ French (Congo - Kinshasa)foo    │ fr-CF-x-icu            │ fr-CF            │ fr-CF
│ icu      │ French (Central African Republic)foo    │ fr-CG-x-icu            │ fr-CG            │ fr-CG
│ icu      │ French (Congo - Brazzaville)foo    │ fr-CH-x-icu            │ fr-CH            │ fr-CH
│ icu      │ French (Switzerland)foo    │ fr-CI-x-icu            │ fr-CI            │ fr-CI
│ icu      │foo    │ fr-CM-x-icu            │ fr-CM            │ fr-CM
│ icu      │ French (Cameroon)foo    │ fr-DJ-x-icu            │ fr-DJ            │ fr-DJ
│ icu      │ French (Djibouti)
...

Notice the lack of descriptions here, just because they weren't
ascii-safe. I now wonder if we should even keep the comment thing at
all. We can eventually add an SQL function, that is called by \dOS+,
that interrogates ICU for a description on the fly. No need to store
anything.

Another thing that I dislike about get_icu_locale_comment() is that
the descriptions we show are always English display names, regardless
of pg_database collation or anything else. We shouldn't hard-code the
use of English for display name, on general principle.

> It's possible that there's an argument for not de-duplicating ICU
> collation names even though we must do so for libc collation names.
> But I'm not exactly sure what that would be.  Having initdb fail if
> the ICU configuration is funny isn't going to endear us to anyone.

BTW, get_icu_language_tag() has a buglet. ULOC_FULLNAME_CAPACITY does
not bound the size of a language tag; those can be almost arbitrarily
long, unlike locale identifiers. That should probably be fixed, if
only to avoid setting a bad example.

--
Peter Geoghegan


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> The comments that pg_import_system_collations() adds to each locale
> have an annoying tendency to exclude anything that isn't ascii-safe,

Right, but that's *necessary* for anything that goes into template0.

> Notice the lack of descriptions here, just because they weren't
> ascii-safe. I now wonder if we should even keep the comment thing at
> all. We can eventually add an SQL function, that is called by \dOS+,
> that interrogates ICU for a description on the fly.

Yup, the same idea came to me as I was reading your message.
So we don't really want a pg_collation column either, but a function
that can produce a result that varies with the current database's
encoding.

> Another thing that I dislike about get_icu_locale_comment() is that
> the descriptions we show are always English display names, regardless
> of pg_database collation or anything else. We shouldn't hard-code the
> use of English for display name, on general principle.

Again, that was mostly driven by the desire to get ASCII results
(or at least so I imagine; Peter E. might have a different take).
I think there's value in having a function that produces such a
description, but we could have a different one that produces a
description in the language indicated by the database's locale.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> The comments that pg_import_system_collations() adds to each locale
>> have an annoying tendency to exclude anything that isn't ascii-safe,
>
> Right, but that's *necessary* for anything that goes into template0.

Of course. I phrased this badly.

>> Another thing that I dislike about get_icu_locale_comment() is that
>> the descriptions we show are always English display names, regardless
>> of pg_database collation or anything else. We shouldn't hard-code the
>> use of English for display name, on general principle.
>
> Again, that was mostly driven by the desire to get ASCII results
> (or at least so I imagine; Peter E. might have a different take).

Probably.

> I think there's value in having a function that produces such a
> description, but we could have a different one that produces a
> description in the language indicated by the database's locale.

I suppose it's useful to have it match the database locale/language,
too. The point IMV is that display name is strictly for, uh, display.
I'm not attached to any particular scheme for determining which
language to display in which context. That's for v11.

I take it that you're in agreement with me on
pg_import_system_collations() adding a display name as a comment -- we
should prevent that, even for v10?

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Geoghegan <pg@bowt.ie> writes:
> I take it that you're in agreement with me on
> pg_import_system_collations() adding a display name as a comment -- we
> should prevent that, even for v10?

I'm kind of agnostic on that.  I think we've established that that's
not what we want for v11 and up, but it's not clear to me that the
current behavior isn't an acceptable short-term answer.  If there
are comments in v10, and then in v11 there aren't because the data
went somewhere else, what will that really break?

BTW, it strikes me that a description function that produces strings
in the database's language/encoding is not without gotchas.  Suppose,
for example, that we have a database using UTF8 while psql is using
client_encoding LATIN1.  If \dO calls a function that produces random
non-ASCII strings, then there's a significant hazard of \dO failing
altogether due to encoding conversion failure on the way to the client.
That doesn't seem like it will pass muster.  I'm not sure what to do
instead, but this seems like a reason why there's value in plain-ASCII
descriptions.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> I take it that you're in agreement with me on
>> pg_import_system_collations() adding a display name as a comment -- we
>> should prevent that, even for v10?
>
> I'm kind of agnostic on that.  I think we've established that that's
> not what we want for v11 and up, but it's not clear to me that the
> current behavior isn't an acceptable short-term answer.  If there
> are comments in v10, and then in v11 there aren't because the data
> went somewhere else, what will that really break?

Either way, we might want to point out in the docs that you can pretty
easily find the details of every locale on the web:

https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK


-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/14/17 23:55, Peter Geoghegan wrote:
> pg_import_system_collations() should simply use uloc_countAvailable()
> + uloc_getAvailable, rather than ucol_countAvailable() +
> ucol_getAvailable().

It's not clear to me that this is better.  Why do we need to use a
function that is clearly not the preferred API for this ("col" vs "loc")
just to get more entries?  If we go down this route, then we'll be on
the hook forever to keep adding more and more predefined entries by
whatever means necessary.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Thu, Aug 17, 2017 at 6:13 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/14/17 23:55, Peter Geoghegan wrote:
>> pg_import_system_collations() should simply use uloc_countAvailable()
>> + uloc_getAvailable, rather than ucol_countAvailable() +
>> ucol_getAvailable().
>
> It's not clear to me that this is better.  Why do we need to use a
> function that is clearly not the preferred API for this ("col" vs "loc")
> just to get more entries?

My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
stability guarantees for locales, not collations [1]. For example, as
we discussed, de_BE didn't actually go away -- it just stopped being a
distinct collation within ICU, for reasons that are implementation
defined.

I admit that there are arguments against this, but by far the most
important consideration should be the stability of the names used for
pg_collation entries created during initdb.

> If we go down this route, then we'll be on
> the hook forever to keep adding more and more predefined entries by
> whatever means necessary.

Why? Locales have a mapping to various ISO codes (Country, language,
etc). It's not like new ones come along very often.

[1] https://tools.ietf.org/html/bcp47#section-3.4
-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Thu, Aug 17, 2017 at 6:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
> stability guarantees for locales, not collations [1]. For example, as
> we discussed, de_BE didn't actually go away -- it just stopped being a
> distinct collation within ICU, for reasons that are implementation
> defined.

I have data to back this up. I attach 2 files: one is a listing of
locale XML files from within CLDR 1.9's ./common/main/, dating from
December 2010, and the other is a similar listing for CLDR 3.1, dating
from April 2017. This roughly covers every ICU version we'll support
on day 1. The listing is sorted alphabetically, to ease comparison.

Summary:

$ cat locale_list_cldr-19.txt | wc -l
605
$ cat locale_list_cldr-31.txt | wc -l
722
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep
"^-[a-zA-Z]" | wc -l
144
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep
"^+[a-zA-Z]" | wc -l
261

So, there have been 144 locales removed in that time, and 261 added.
My proposal to standardize on using all locales ICU makes available,
rather than all behaviorally distinct collations, clearly does not
ensure perfect stability. It does actually work pretty well in
practice, though. The number 144 is misleadingly high. If you actually
look at what went away in detail, it looks like there is a lot of
script variants of the same language/country code. Plus, the changes
themselves are non-technical in nature.

The churn seems to be in part due to geopolitical changes, such as 5
years [1] passing after the dissolution of Serbia and Montenegro.
However, it is mostly due to switching from ISO 639-1 to ISO 639-3
codes in cases where a finer distinction about cultural preferences
needed to be made (note that they still only list *macro*
language/region/script combinations as distinct collations). For
example, Kurdish went from being "ku-" to 3 different macro languages:
"ckb-" (Central Kurdish), "kmr-" (Northern Kurdish), and "sdh-"
(Southern Kurdish). Wikipedia says of ISO 639-3: "Because it provides
comprehensive language coverage, giving equal opportunity for all
languages, and because of its wide adoption in information
technologies, ISO 639-3 provides an important technology component
addressing the digital divide problem". We can hope that it will be
the last such revision ever needed, because this digital divide
problem is solved once and for all, at least as far as these standards
go.

CLDR prefers to use ISO 639-1 language codes for compatibility [2],
which is why the language codes are mostly still 2 letters (ISO
639-1). "en" did not change to "eng", because there was no cultural
reason to do so, and thus there was a 1:1 mapping between "en" and
"eng" anyway. Regions/countries will only change due to rare
geopolitical events.

In summary, I think that these changes are fairly low impact in
practice, and are entirely explainable by political changes and
cultural controversies. They really are minimal, because CLDR/ICU
really does take the stability of collation names seriously. We can
and should ensure that locales like "de_BE" are available in every ICU
version, because that is an inexcusable technical oversight, and is
not due to a cultural or political issue.

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
[2] http://www.unicode.org/reports/tr35/#unicode_language_subtag_validity
-- 
Peter Geoghegan

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Eisentraut
Date:
On 8/17/17 21:22, Peter Geoghegan wrote:
>> It's not clear to me that this is better.  Why do we need to use a
>> function that is clearly not the preferred API for this ("col" vs "loc")
>> just to get more entries?
> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
> stability guarantees for locales, not collations [1]. For example, as
> we discussed, de_BE didn't actually go away -- it just stopped being a
> distinct collation within ICU, for reasons that are implementation
> defined.
> 
> I admit that there are arguments against this, but by far the most
> important consideration should be the stability of the names used for
> pg_collation entries created during initdb.

One argument I can think of is that it is confusing right now.  Why is
there de-AT but no de-DE?  I know why, but users in "DE" might not and
would be really confused.  So I concede that adding the full set would
be useful.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Fri, Aug 18, 2017 at 7:47 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/17/17 21:22, Peter Geoghegan wrote:
>>> It's not clear to me that this is better.  Why do we need to use a
>>> function that is clearly not the preferred API for this ("col" vs "loc")
>>> just to get more entries?
>> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
>> stability guarantees for locales, not collations [1]. For example, as
>> we discussed, de_BE didn't actually go away -- it just stopped being a
>> distinct collation within ICU, for reasons that are implementation
>> defined.
>>
>> I admit that there are arguments against this, but by far the most
>> important consideration should be the stability of the names used for
>> pg_collation entries created during initdb.
>
> One argument I can think of is that it is confusing right now.  Why is
> there de-AT but no de-DE?  I know why, but users in "DE" might not and
> would be really confused.  So I concede that adding the full set would
> be useful.

Actually, there would be a de-DE. Just like with de-BE, that's ensured
by using locales rather than collations. See my self-contained test
program, or even the output I posted earlier.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values

From
Peter Geoghegan
Date:
On Fri, Aug 18, 2017 at 7:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> One argument I can think of is that it is confusing right now.  Why is
>> there de-AT but no de-DE?  I know why, but users in "DE" might not and
>> would be really confused.  So I concede that adding the full set would
>> be useful.
>
> Actually, there would be a de-DE. Just like with de-BE, that's ensured
> by using locales rather than collations. See my self-contained test
> program, or even the output I posted earlier.

I just realized that I misunderstood you here.

Anyway, I'm glad that we're in agreement that adding every locale at
initdb time would be the most useful behavior.

-- 
Peter Geoghegan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs