Thread: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
[BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Hi, With 10beta2 built on Debian 8 with ./configure --enable-debug --with-icu and the ICU package currently in the "jessie" Debian repo: $ dpkg -l 'libicu*' ... ii libicu-dev:amd 52.1-8+deb8u amd64 Development files for Internation ii libicu52:amd64 52.1-8+deb8u amd64 International Components for Unic ii libicu52-dbg 52.1-8+deb8u amd64 International Components for Unic I've got a table with 6,6 million unique small bits of text from different Unicode alphabets: Table "public.words_test" Column | Type | Collation | Nullable | Default ----------+------+-----------+----------+---------wordtext | text | | | and found that running the following query on it consistently provokes a SIGSEGV with certain collations: SELECT count(distinct wordtext COLLATE :"collname") FROM words_test; Some of the collations that crash: az-Latn-AZ-u-co-search-x-icu bs-Latn-BA-u-co-search-x-icu bs-x-icu cs-CZ-u-co-search-x-icude-BE-u-co-phonebk-x-icu sr-Latn-XK-x-icu zh-Hans-CN-u-co-big5han-x-icu Trying all of them I had 146 crashes out of the 1741 ICU entries in pg_collation created by initdb. The size of the table is 291MB, and work_mem to 128MB. Reducing the dataset tends to make the problem disappear: if I split the table in halves based on row_number() to bisect on the data, the queries on both parts pass without crashing. Below is a backtrace got with collate "az-Latn-AZ-u-co-search-x-icu", and work_mem to 128MB. Cranking up work_mem to 512MB makes the crash not happen, but 300MB is not enough. (by comparison, the same query with collate "en_US.utf8" or "fr-x-icu" runs fine with work_mem to 4MB) #0 0x00007fc3c5017030 in ucol_getLatinOneContractionUTF8 ( coll=coll@entry=0x2824ef0, strength=strength@entry=0, CE=<optimizedout>, s=s@entry=0x383a3ec "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200", index=index@entry=0x7fff46afe290, len=len@entry=7) at ucol.cpp:8044 #1 0x00007fc3c502917c in ucol_strcollUseLatin1UTF8 (status=0x7fff46afe338, tLen=7, target=0x383a3ec "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200", sLen=6, source=0x3839bec "wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230", coll=0x2824ef0) at ucol.cpp:8153 #2 ucol_strcollUTF8_52 (coll=<optimized out>, source=0x3839bec "wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230", source@entry=0x3839be9 "reqwuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230", sourceLength=<optimized out>, sourceLength@entry=9, target=0x383a3ec "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200", target@entry=0x383a3e9 "requ\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200", targetLength=7, targetLength@entry=10, status=status@entry=0x7fff46afe338) at ucol.cpp:8770 #3 0x00000000007cc7b4 in varstrfastcmp_locale (x=58956776, y=58958824, ssup=<optimized out>) at varlena.c:2139 #4 0x00000000008170b6 in ApplySortComparator (ssup=0x28171b8, isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimizedout>, datum1=<optimized out>) at ../../../../src/include/utils/sortsupport.h:225 #5 comparetup_datum (a=0x2818ef8, b=0x2818f10, state=0x2816fa8) at tuplesort.c:4341 #6 0x0000000000815623 in tuplesort_heap_replace_top ( state=state@entry=0x2816fa8, tuple=tuple@entry=0x7fff46afe410, checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3510 #7 0x0000000000816d8c in tuplesort_gettuple_common ( state=state@entry=0x2816fa8, forward=forward@entry=1 '\001', stup=stup@entry=0x7fff46afe460)at tuplesort.c:2082 #8 0x000000000081b176 in tuplesort_getdatum (state=0x2816fa8, forward=forward@entry=1 '\001', val=val@entry=0x28130f0, isNull=isNull@entry=0x2813409 "", abbrev=abbrev@entry=0x7fff46afe518) at tuplesort.c:2205 #9 0x00000000005ea107 in process_ordered_aggregate_single ( pergroupstate=0x2812f38, pertrans=0x2812f98, aggstate=0x2811198) at nodeAgg.c:1330 #10 finalize_aggregates (aggstate=aggstate@entry=0x2811198, peraggs=peraggs@entry=0x2812588, pergroup=<optimized out>) at nodeAgg.c:1736 #11 0x00000000005eaabd in agg_retrieve_direct (aggstate=0x2811198) at nodeAgg.c:2464 #12 ExecAgg (node=node@entry=0x2811198) at nodeAgg.c:2117 #13 0x00000000005e2378 in ExecProcNode (node=node@entry=0x2811198) at execProcnode.c:539 #14 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>, dest=0x27f7eb0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2811198,estate=0x2810f88) at execMain.c:1693 #15 standard_ExecutorRun (queryDesc=0x280d6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:362 #16 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x280ef78, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x27f7eb0) at pquery.c:932 #17 0x00000000006fa5f0 in PortalRun (portal=0x280ef78, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimizedout>, dest=0x27f7eb0, altdest=0x27f7eb0, completionTag=0x7fff46afe830 "") at pquery.c:773 #18 0x00000000006f67c3 in exec_simple_query ( query_string=0x383a3ec "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200") at postgres.c:1099 #19 0x00000000006f842a in PostgresMain (argc=1, argv=0x27d5f68, dbname=0x2752288 "mlists", username=0x276a298 "postgres") at postgres.c:4090 #20 0x000000000047803f in BackendRun (port=0x274bf30) at postmaster.c:4357 #21 BackendStartup (port=0x274bf30) at postmaster.c:4029 #22 ServerLoop () at postmaster.c:1753 #23 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2724330) at postmaster.c:1361 #24 0x0000000000478f8e in main (argc=3, argv=0x2724330) at main.c:228 Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
Daniel Verite <daniel@manitou-mail.org> wrote: > SELECT count(distinct wordtext COLLATE :"collname") FROM words_test; > >Some of the collations that crash: > az-Latn-AZ-u-co-search-x-icu > bs-Latn-BA-u-co-search-x-icu > bs-x-icu > cs-CZ-u-co-search-x-icu > de-BE-u-co-phonebk-x-icu > sr-Latn-XK-x-icu > zh-Hans-CN-u-co-big5han-x-icu > >Trying all of them I had 146 crashes out of the 1741 ICU >entries in pg_collation created by initdb. > >The size of the table is 291MB, and work_mem to 128MB. > >Reducing the dataset tends to make the problem disappear: if I split >the table in halves based on row_number() to bisect on the data, >the queries on both parts pass without crashing. I think that this sensitivity to work_mem exists because abbreviated keys are used for quicksort operations that sort individual runs. As work_mem is increased, and less merging is required, affected codepaths are reached less frequently. You would probably find that the problem appears more consistently if varstr_sortsupport() is modified so that even ICU collations never use abbreviated keys; that would be a matter of "abbreviate" always being set to false within that function. I suggest using the new amcheck contrib module as part of this testing (you'll need to use CREATE INDEX to have an index to perform verification against). This will zero in on inconsistencies that may be far more subtle than a hard crash. I wouldn't assume that abbreviated key comparisons are correct here just because there is no hard crash. Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace? -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Peter Geoghegan wrote: > Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace? AFAICS, yes. The test that iterates over collations produces two kinds of core files, some of them are 289MB large, some others are 17GB large. shared_buffers is only 128MB and work_mem 128MB, so 289MB is not surprising but 17GB seems excessive. The box has 16GB of physical mem and 8GB of swap. I haven't checked all core files because they exhaust the disk space before completion of the test, but a typical backtrace for the biggest ones looks like the following, with the segfaults happening in memcpy: Program terminated with signal SIGSEGV, Segmentation fault. #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 (gdb) #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 #1 0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010, __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51 #2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128, b=0x7ffd907751e0) at ucol.cpp:7009 #3 UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509, b=0x7ffd907751e0) at ucol.cpp:7022 #4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460, tColl=tColl@entry=0x7ffd90776610, status=status@entry=0x7ffd90777128) at ucol.cpp:7163 #5 0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0, source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=sourceLength@entry=11,target=<optimized out>, targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128) at ucol.cpp:8023 #6 0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>, tLen=<optimized out>, target=<optimized out>,sLen=<optimized out>, source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108 #7 ucol_strcollUTF8_52 (coll=<optimized out>, source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>, sourceLength@entry=11,target=<optimized out>, target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128)at ucol.cpp:8770 #8 0x00000000007cc7b4 in varstrfastcmp_locale (x=41145208, y=41145352, ssup=<optimized out>) at varlena.c:2139 #9 0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0x136bd98, isNull2=<optimized out>, datum2=<optimized out>,isNull1=<optimized out>, datum1=<optimized out>) at ../../../../src/include/utils/sortsupport.h:263 #10 comparetup_datum (a=0x7fc1c642d210, b=0x7fc1c642d228, state=<optimized out>) at tuplesort.c:4350 #11 0x0000000000815aef in qsort_tuple (a=0x7fc1c642d210, n=<optimized out>, n@entry=2505, cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>, state=state@entry=0x136bb88) at qsort_tuple.c:104 #12 0x0000000000815c1f in qsort_tuple (a=0x7fc1c6413020, n=<optimized out>, n@entry=5279, cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>, state=state@entry=0x136bb88) at qsort_tuple.c:191 #13 0x0000000000815c1f in qsort_tuple (a=0x7fc1c63f27e8, n=<optimized out>, n@entry=103353, cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>, state=state@entry=0x136bb88) at qsort_tuple.c:191 #14 0x0000000000815c1f in qsort_tuple (a=0x7fc1c5f423c8, n=<optimized out>, n@entry=503456, cmp_tuple=cmp_tuple@entry=0x817030<comparetup_datum>, state=state@entry=0x136bb88) at qsort_tuple.c:191 #15 0x0000000000815c1f in qsort_tuple (a=0x7fc1c34ae048, n=<optimized out>, cmp_tuple=0x817030 <comparetup_datum>, state=state@entry=0x136bb88) at qsort_tuple.c:191 #16 0x000000000081873b in tuplesort_sort_memtuples ( state=state@entry=0x136bb88) at tuplesort.c:3410 #17 0x0000000000818907 in dumpbatch (alltuples=0 '\000', state=0x136bb88) at tuplesort.c:3087 #18 dumptuples (state=state@entry=0x136bb88, alltuples=alltuples@entry=0 '\000') at tuplesort.c:2970 #19 0x0000000000818b81 in puttuple_common (state=state@entry=0x136bb88, tuple=tuple@entry=0x7ffd90777470) at tuplesort.c:1719 #20 0x000000000081a6f1 in tuplesort_putdatum (state=0x136bb88, val=<optimized out>, isNull=<optimized out>) at tuplesort.c:1558 #21 0x00000000005e91ec in advance_aggregates ( aggstate=aggstate@entry=0x13731f8, pergroup=pergroup@entry=0x1376008, pergroups=0x0) at nodeAgg.c:1023 #22 0x00000000005eac03 in agg_retrieve_direct (aggstate=0x13731f8) at nodeAgg.c:2402 #23 ExecAgg (node=node@entry=0x13731f8) at nodeAgg.c:2117 #24 0x00000000005e2378 in ExecProcNode (node=node@entry=0x13731f8) at execProcnode.c:539 #25 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>, dest=0x132f368, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x13731f8,estate=0x1372fe8) at execMain.c:1693 #26 standard_ExecutorRun (queryDesc=0x1350938, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:362 #27 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x1362d98, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x132f368) at pquery.c:932 #28 0x00000000006fa5f0 in PortalRun (portal=0x1362d98, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimizedout>, dest=0x132f368, altdest=0x132f368, completionTag=0x7ffd90777990 "") at pquery.c:773 #29 0x00000000006f7da1 in exec_execute_message (max_rows=9223372036854775807, portal_name=<optimized out>) at postgres.c:1984 #30 PostgresMain (argc=3, argv=0x7fffffffffffffff, dbname=0x12cb288 "mlists", username=0x132f368 "") at postgres.c:4153 #31 0x000000000047803f in BackendRun (port=0x12c4f30) at postmaster.c:4357 #32 BackendStartup (port=0x12c4f30) at postmaster.c:4029 #33 ServerLoop () at postmaster.c:1753 #34 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x129d330) at postmaster.c:1361 #35 0x0000000000478f8e in main (argc=3, argv=0x129d330) at main.c:228 Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Tue, Aug 1, 2017 at 12:45 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > The test that iterates over collations produces two kinds of core files, > some of them are 289MB large, some others are 17GB large. > shared_buffers is only 128MB and work_mem 128MB, > so 289MB is not surprising but 17GB seems excessive. > The box has 16GB of physical mem and 8GB of swap. > > I haven't checked all core files because they exhaust the disk > space before completion of the test, but a typical backtrace for > the biggest ones looks like the following, with the segfaults > happening in memcpy: > > Program terminated with signal SIGSEGV, Segmentation fault. > #0 __memcpy_sse2_unaligned () > at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 > (gdb) #0 __memcpy_sse2_unaligned () > at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 > #1 0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010, > __dest=<optimized out>) at > /usr/include/x86_64-linux-gnu/bits/string3.h:51 > #2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128, > b=0x7ffd907751e0) at ucol.cpp:7009 > #3 UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509, > b=0x7ffd907751e0) at ucol.cpp:7022 > #4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460, > tColl=tColl@entry=0x7ffd90776610, status=status@entry=0x7ffd90777128) > at ucol.cpp:7163 > #5 0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0, > source=source@entry=0x273d379 "콗喩zx㎍", > sourceLength=sourceLength@entry=11, target=<optimized out>, > targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128) > at ucol.cpp:8023 > #6 0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>, > tLen=<optimized out>, target=<optimized out>, sLen=<optimized out>, > source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108 > #7 ucol_strcollUTF8_52 (coll=<optimized out>, > source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>, > sourceLength@entry=11, target=<optimized out>, > target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8, > status=status@entry=0x7ffd90777128) at ucol.cpp:8770 Interesting. The "__len" argument to memcpy() is 8589934592 -- that's 2 ^ 33. (I'm not sure why it's the first memcpy() argument in the stack trace, since it's supposed to be the last -- anyone seen that before?) Can you figure out what the optimized-out lengths are, by either looking at registers within GDB, or building at a lower optimization level? Maybe this is a bug in ICU-52. For reference, here is ICU-52's ucol_CEBuf_Expand() function: static void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) { uint32_t oldSize; uint32_t newSize; uint32_t *newBuf; ci->flags |= UCOL_ITER_ALLOCATED; oldSize = (uint32_t)(b->pos - b->buf); newSize = oldSize * 2; newBuf = (uint32_t*)uprv_malloc(newSize * sizeof(uint32_t)); if(newBuf == NULL) { *status = U_MEMORY_ALLOCATION_ERROR; } else { uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t)); if (b->buf != b->localArray) { uprv_free(b->buf); } b->buf = newBuf; b->endp = b->buf + newSize; b->pos = b->buf + oldSize; } } If "oldSize * sizeof(uint32_t)" becomes what we see as "__len", as I believe it does, then that must mean that oldSize is 2 ^ 31. *Not* 2 ^ 31 - 1 (INT_MAX). I think that this could be an off-by-one bug, since ucol_strcollUTF8()/ucol_strcollUTF8_52() accepts an int32 argument for sourceLength and targetLength. I'm not very confident of this, but it does make a certain amount of sense. It could be that everyone else is passing -1 as sourceLength and targetLength arguments, anyway, to indicate that the buffer is NUL-terminated, as required by regular strcoll(). Note also that the docs say this of ucol_strcollUTF8(): "When input string contains malformed a UTF-8 byte sequence, this function treats these bytes as REPLACEMENT CHARACTER (U+FFFD)". I'm not sure that that's a very sensible way for it to fail. I'd be interested to see if anything changed when -1 was passed as both sourceLength and targetLength to ucol_strcollUTF8(). You'd have to build Postgres yourself to test this, but it would just work, since we don't actually avoid NUL termination, even though in principled we could with ICU. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Peter Geoghegan wrote: > I'd be interested to see if anything changed when -1 was passed as > both sourceLength and targetLength to ucol_strcollUTF8(). You'd have > to build Postgres yourself to test this, but it would just work, since > we don't actually avoid NUL termination, even though in principled we > could with ICU. It doesn't seem to change the outcome: *** varlena.c~ 2017-07-10 22:26:20.000000000 +0200 --- varlena.c 2017-08-02 12:57:25.997265001 +0200 *************** *** 2137,2144 **** status = U_ZERO_ERROR; result = ucol_strcollUTF8(sss->locale->info.icu.ucol, ! a1p, len1, ! a2p, len2, &status); if (U_FAILURE(status)) ereport(ERROR, --- 2137,2144 ---- status = U_ZERO_ERROR; result = ucol_strcollUTF8(sss->locale->info.icu.ucol, ! a1p, -1, /* len1 */ ! a2p, -1, /*len2,*/ &status); if (U_FAILURE(status)) ereport(ERROR, For the case when its ends up crashing in memcpy, the backtrace now looks like this: #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 #1 0x00007f0156812e6b in memcpy (__len=8589934592, __src=0x7efbeac89010, __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51 #2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd49c4ca98, b=0x7ffd49c4ab60) at ucol.cpp:7009 #3 UCOL_CEBUF_PUT (status=0x7ffd49c4ca98, ci=0x7ffd49c4bde0, ce=1493173509, b=0x7ffd49c4ab60) at ucol.cpp:7022 #4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd49c4bde0, tColl=tColl@entry=0x7ffd49c4bf90, status=status@entry=0x7ffd49c4ca98) at ucol.cpp:7163 #5 0x00007f0156818177 in ucol_strcollRegularUTF8 (coll=0x2286100, source=source@entry=0x3651829 "쳭喩zz", sourceLength=sourceLength@entry=-1,target=<optimized out>, targetLength=targetLength@entry=-1, status=status@entry=0x7ffd49c4ca98) at ucol.cpp:8023 #6 0x00007f0156819d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>, tLen=<optimized out>, target=<optimized out>,sLen=<optimized out>, source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108 #7 ucol_strcollUTF8_52 (coll=<optimized out>, source=source@entry=0x3651829 "쳭喩zz", sourceLength=<optimized out>, sourceLength@entry=-1,target=<optimized out>, target@entry=0x3651799 "콗喩zx㎍", targetLength=targetLength@entry=-1, status=status@entry=0x7ffd49c4ca98)at ucol.cpp:8770 #8 0x00000000007cc7df in varstrfastcmp_locale (x=56956968, y=56956824, ssup=<optimized out>) at varlena.c:2139 #9 0x0000000000817168 in ApplySortAbbrevFullComparator (ssup=0x2280338, isNull2=<optimized out>, datum2=<optimized out>,isNull1=<optimized out>, datum1=<optimized out>) at ../../../../src/include/utils/sortsupport.h:263 #10 comparetup_datum (a=0x7f0141c14228, b=0x7f0141c14240, state=<optimized out>) at tuplesort.c:4350 #11 0x0000000000815b1f in qsort_tuple (a=0x7f0141c14210, n=<optimized out>, n@entry=106, cmp_tuple=cmp_tuple@entry=0x817060<comparetup_datum>, state=state@entry=0x2280128) at qsort_tuple.c:104 ..... Anyway I'll try to build ICU-52 without optimization to avoid these <optimized out> parameters. Also running the same tests with a self-compiled ICU-59 versus debian's packaged ICU-52 results in *no crash at all*. So maybe the root cause of these crashes is that a critical fix in ICU has not been backported? I note that the list of ICU locales available in pg_collation is different though. When compiled with ICU-59, initdb produces "only" 581 collname matching '%icu%', versus 1741 with ICU-52. The ICU version is the only thing that differs between both cases. But anyway in the list of collations that crash for me with ICU-52, a fair number (~50) are still there and work in ICU-59, so I think the comparison is still meaningful. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 2, 2017 at 5:00 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > Anyway I'll try to build ICU-52 without optimization to avoid these > <optimized out> parameters. > > Also running the same tests with a self-compiled ICU-59 versus debian's > packaged > ICU-52 results in *no crash at all*. So maybe the root cause of these crashes > is > that a critical fix in ICU has not been backported? That seems possible. Certainly, there have been a fair amount of bugfixes against this package, some as recently as April of this year: http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog I'm sure that you used a version that has the relevant fixes, but it's not that hard to imagine that new bugs were introduced in the process of backpatching upstream fixes from later major ICU releases. There has only been one official ICU point release for 52, which was way back in 2013. Unofficial backpatches are hazardous in their own way. > I note that the list of ICU locales available in pg_collation is different > though. > When compiled with ICU-59, initdb produces "only" 581 collname > matching '%icu%', versus 1741 with ICU-52. The ICU version > is the only thing that differs between both cases. > But anyway in the list of collations that crash for me with ICU-52, a fair > number > (~50) are still there and work in ICU-59, so I think the comparison is still > meaningful. I agree that it's a meaningful comparison. It now seems very likely that this is a bug in ICU, and possibly this particular package. That seemed likely before now anyway; there is not that much that could go wrong with PostgreSQL's use of ICU that is peculiar to only a small minority of available collations. Perhaps you can whittle down your testcase into something minimal, for the Debian package maintainer. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/2/17 08:00, Daniel Verite wrote: > I note that the list of ICU locales available in pg_collation is different > though. > When compiled with ICU-59, initdb produces "only" 581 collname > matching '%icu%', versus 1741 with ICU-52. This is because the older version produces redundant locale names such as de-BE-*, de-CH-*, de-DE-*, whereas the newer version omits those because they are (presumably?) all equivalent to de-*. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Peter Geoghegan wrote: > it's not that hard to imagine that new bugs were introduced in the process > of backpatching upstream fixes from later major ICU releases I've now tried with ICU-52.1 from upstream, built with gcc -O0 -g to get cleaner stack traces. The behavior is similar to debian's precompiled lib in how and where it crashes. work_mem is at 128MB. query #1: select count(distinct wordtext collate "az-x-icu") from words_test; The backend terminates with SIGSEGV and a 289MB core file : at ucol.cpp:8033 8033 while(schar > (tchar = *(UCharOffset+offset))) { /* since the contraction codepoints should be ordered, we skip all that are smaller */ #0 0x00007f7926e12341 in ucol_getLatinOneContractionUTF8 (coll=0x1b422c0, strength=0, CE=4061217513, s=0x2b56b5c "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220", index=0x7ffde80dc978, len=7) at ucol.cpp:8033 #1 0x00007f7926e12845 in ucol_strcollUseLatin1UTF8 (coll=0x1b422c0, source=0x2b56b5c "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220", sLen=7, target=0x2b5735c "wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230", tLen=6, status=0x7ffde80dcab8) at ucol.cpp:8104 #2 0x00007f7926e1487a in ucol_strcollUTF8_52 (coll=0x1b422c0, source=0x2b56b5c "u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220", sourceLength=7, target=0x2b5735c "wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230", targetLength=6, status=0x7ffde80dcab8) at ucol.cpp:8759 #3 0x00000000007cc7b4 in varstrfastcmp_locale (x=45443928, y=45445976, ssup=<optimized out>) at varlena.c:2139 #4 0x00000000008170b6 in ApplySortComparator (ssup=0x1b344a8, isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>, datum1=<optimized out>) at ../../../../src/include/utils/sortsupport.h:225 #5 comparetup_datum (a=0x7ffde80dcb90, b=0x1b36218, state=0x1b34298) at tuplesort.c:4341 #6 0x00000000008154d9 in tuplesort_heap_replace_top ( state=state@entry=0x1b34298, tuple=tuple@entry=0x7ffde80dcb90, checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3512 ...cut... query #2: select count(distinct wordtext collate "bs-Cyrl-BA-u-co-search-x-icu") from words_test; The backend terminates with SIGSEGV leaving a 17GB core file (there's definitely a uint32 offset overflow from 2^31->0 in ucol_CEBuf_Expand just before the crash in memcpy) #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35 #1 0x00007fbad6951ec3 in ucol_CEBuf_Expand (b=0x7ffdef2d2390, ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:6998 #2 0x00007fbad6951f68 in UCOL_CEBUF_PUT (b=0x7ffdef2d2390, ce=1493173509, ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:7011 #3 0x00007fbad69525c7 in ucol_strcollRegular (sColl=0x7ffdef2d3560, tColl=0x7ffdef2d3710, status=0x7ffdef2d41e8) at ucol.cpp:7152 #4 0x00007fbad6954080 in ucol_strcollRegularUTF8 (coll=0xf75610, source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215", sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz", targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8012 #5 0x00007fbad6956845 in ucol_strcollUTF8_52 (coll=0xf75610, source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215", sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz", targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8757 #6 0x00000000007cc7b4 in varstrfastcmp_locale (x=36903080, y=36903224, ssup=<optimized out>) at varlena.c:2139 #7 0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0xf67888, isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>, datum1=<optimized out>) at ../../../../src/include/utils/sortsupport.h:263 #8 comparetup_datum (a=0x7fbac1cd6210, b=0x7fbac1cd6228, state=<optimized out>) at tuplesort.c:4350 #9 0x0000000000815aef in qsort_tuple (a=0x7fbac1cd6210, n=<optimized out>, n@entry=2505, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>, state=state@entry=0xf67678) at qsort_tuple.c:104 #10 0x0000000000815c1f in qsort_tuple (a=0x7fbac1cbc020, n=<optimized out>, n@entry=5279, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>, state=state@entry=0xf67678) at qsort_tuple.c:191 ...cut... Next I've tried with valgrind, invoked with: valgrind \ --suppressions=$HOME/src/postgresql-10beta2/src/tools/valgrind.supp \ --trace-children=yes --track-origins=yes --error-limit=no \ --read-var-info=yes --log-file=/tmp/log-valgrind \ bin/postgres -D $PWD/data (and turning off autovacuum because it causes some unrelated reports). With query #1 the run goes to completion without crashing (hi heisenbug!), but it reports many uses of uninitialised values under ucol_strcollUTF8(). The full log is attached in log-valgrind-1.txt.gz With query #2 it ends up crashing after ~5hours and produces the log in log-valgrind-2.txt.gz with some other entries than case #1, but AFAICS still all about reading uninitialised values in space allocated by datumCopy(). Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > With query #2 it ends up crashing after ~5hours and produces > the log in log-valgrind-2.txt.gz with some other entries than > case #1, but AFAICS still all about reading uninitialised values > in space allocated by datumCopy(). Right. This part is really interesting to me: ==48827== Uninitialised value was created by a heap allocation ==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296) ==48827== by 0x80B597: AllocSetAlloc (aset.c:771) ==48827== by 0x810ADC: palloc (mcxt.c:862) ==48827== by 0x72BFEF: datumCopy (datum.c:171) ==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515) ==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023) If you actually go to datum.c:171, you see that that's a codepath for pass-by-reference datatypes that lack a varlena header. Text is a datatype that has a varlena header, though, so that's clearly wrong. I don't know how this actually happened, but working back through the relevant tuplesort_begin_datum() caller, initialize_aggregate(), does suggest some things. (tuplesort_begin_datum() is where datumTypeLen is determined for the entire datum tuplesort.) I am once again only guessing, but I have to wonder if this is a problem in commit b8d7f053. It seems likely that the problem begins before tuplesort_begin_datum() is even called, which is the basis of this suspicion. If the problem is within tuplesort, then that could only be because get_typlenbyval() gives wrong answers, which seems very unlikely. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > With query #1 the run goes to completion without crashing (hi heisenbug!), > but it reports many uses of uninitialised values under ucol_strcollUTF8(). > The full log is attached in log-valgrind-1.txt.gz > > > With query #2 it ends up crashing after ~5hours and produces > the log in log-valgrind-2.txt.gz with some other entries than > case #1, but AFAICS still all about reading uninitialised values > in space allocated by datumCopy(). It would be nice if you could confirm whether or not Valgrind complains when non-ICU collations are in use. It may just have been that we get (un)lucky with ICU, where the undefined behavior happens to result in a hard crash, more or less by accident. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Peter Geoghegan wrote: > It would be nice if you could confirm whether or not Valgrind > complains when non-ICU collations are in use. It doesn't (with en_US.utf8), and also doesn't complain for certain ICU collations such as "fr-x-icu", one of the 1595 that don't segfault with this data and work_mem at 128MB (vs 146 that happen to segfault) If someone wants to reproduce the problem, I've made a custom dump (40MB) available at http://www.manitou-mail.org/vrac/words_test.dump query: select count(distinct wordtext collate "bs-x-icu") from words_test; work_mem: 128MB or less, shared_buffers: 128MB OS: debian 8 x86_64 ICU: 52.1, upstream or deb8 locale environment from which initdb was called: $ locale -a C C.UTF-8 de_DE.utf8 en_US.utf8 français french fr_FR fr_FR.iso88591 fr_FR.utf8 pl_PL pl_PL.iso88592 polish POSIX Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote: > On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > > With query #2 it ends up crashing after ~5hours and produces > > the log in log-valgrind-2.txt.gz with some other entries than > > case #1, but AFAICS still all about reading uninitialised values > > in space allocated by datumCopy(). > > Right. This part is really interesting to me: > > ==48827== Uninitialised value was created by a heap allocation > ==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296) > ==48827== by 0x80B597: AllocSetAlloc (aset.c:771) > ==48827== by 0x810ADC: palloc (mcxt.c:862) > ==48827== by 0x72BFEF: datumCopy (datum.c:171) > ==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515) > ==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023) > > If you actually go to datum.c:171, you see that that's a codepath for > pass-by-reference datatypes that lack a varlena header. Text is a > datatype that has a varlena header, though, so that's clearly wrong. I > don't know how this actually happened, but working back through the > relevant tuplesort_begin_datum() caller, initialize_aggregate(), does > suggest some things. (tuplesort_begin_datum() is where datumTypeLen is > determined for the entire datum tuplesort.) > > I am once again only guessing, but I have to wonder if this is a > problem in commit b8d7f053. It seems likely that the problem begins > before tuplesort_begin_datum() is even called, which is the basis of > this suspicion. If the problem is within tuplesort, then that could > only be because get_typlenbyval() gives wrong answers, which seems > very unlikely. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter (Eisentraut), since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Adding -hackers. On Sat, Aug 05, 2017 at 03:55:13PM -0700, Noah Misch wrote: > On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote: > > On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > > > With query #2 it ends up crashing after ~5hours and produces > > > the log in log-valgrind-2.txt.gz with some other entries than > > > case #1, but AFAICS still all about reading uninitialised values > > > in space allocated by datumCopy(). > > > > Right. This part is really interesting to me: > > > > ==48827== Uninitialised value was created by a heap allocation > > ==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296) > > ==48827== by 0x80B597: AllocSetAlloc (aset.c:771) > > ==48827== by 0x810ADC: palloc (mcxt.c:862) > > ==48827== by 0x72BFEF: datumCopy (datum.c:171) > > ==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515) > > ==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023) > > > > If you actually go to datum.c:171, you see that that's a codepath for > > pass-by-reference datatypes that lack a varlena header. Text is a > > datatype that has a varlena header, though, so that's clearly wrong. I > > don't know how this actually happened, but working back through the > > relevant tuplesort_begin_datum() caller, initialize_aggregate(), does > > suggest some things. (tuplesort_begin_datum() is where datumTypeLen is > > determined for the entire datum tuplesort.) > > > > I am once again only guessing, but I have to wonder if this is a > > problem in commit b8d7f053. It seems likely that the problem begins > > before tuplesort_begin_datum() is even called, which is the basis of > > this suspicion. If the problem is within tuplesort, then that could > > only be because get_typlenbyval() gives wrong answers, which seems > > very unlikely. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Peter > (Eisentraut), since you committed the patch believed to have created it, you > own this open item. If some other commit is more relevant or if this does not > belong as a v10 open item, please let us know. Otherwise, please observe the > policy on open item ownership[1] and send a status update within three > calendar days of this message. Include a date for your subsequent status > update. Testers may discover new open items at any time, and I want to plan > to get them all fixed well in advance of shipping v10. Consequently, I will > appreciate your efforts toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Andres Freund
Date:
On 2017-08-03 11:42:25 -0700, Peter Geoghegan wrote: > On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote: > > With query #2 it ends up crashing after ~5hours and produces > > the log in log-valgrind-2.txt.gz with some other entries than > > case #1, but AFAICS still all about reading uninitialised values > > in space allocated by datumCopy(). > > Right. This part is really interesting to me: > > ==48827== Uninitialised value was created by a heap allocation > ==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296) > ==48827== by 0x80B597: AllocSetAlloc (aset.c:771) > ==48827== by 0x810ADC: palloc (mcxt.c:862) > ==48827== by 0x72BFEF: datumCopy (datum.c:171) > ==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515) > ==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023) > > If you actually go to datum.c:171, you see that that's a codepath for > pass-by-reference datatypes that lack a varlena header. Text is a > datatype that has a varlena header, though, so that's clearly wrong. I > don't know how this actually happened, but working back through the > relevant tuplesort_begin_datum() caller, initialize_aggregate(), does > suggest some things. (tuplesort_begin_datum() is where datumTypeLen is > determined for the entire datum tuplesort.) > > I am once again only guessing, but I have to wonder if this is a > problem in commit b8d7f053. It seems likely that the problem begins > before tuplesort_begin_datum() is even called, which is the basis of > this suspicion. If the problem is within tuplesort, then that could > only be because get_typlenbyval() gives wrong answers, which seems > very unlikely. Not saying it's not the fault of b8d7f053 et al, but I don't quite see how - whether something is a varlena tuple or not isn't really something expression evaluation has an influence over if it doesn't happen from within its code. That's the responsibility of the calling code, not from within the datum. So I don't quite understand how you got to b8d7f053? - Andres -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Sat, Aug 5, 2017 at 4:03 PM, Andres Freund <andres@anarazel.de> wrote: > Not saying it's not the fault of b8d7f053 et al, but I don't quite see > how - whether something is a varlena tuple or not isn't really something > expression evaluation has an influence over if it doesn't happen from > within its code. That's the responsibility of the calling code, not from > within the datum. So I don't quite understand how you got to b8d7f053? It was a guess based on nothing more than what codepaths changed since the last release. While I think it's fair to assume for the time being that this is an ICU bug, based on the fact that Valgrind won't complain when other collations are used with the same data, I have a really hard time figuring out why the datum tuplesort thinks that text is a pass-by-reference type without a varlena header. That doesn't smell like a bug in the ICU feature to me. Could an ICU bug really have somehow made unrelated syscache lookups give wrong answers about built-in types? I'll try to find time to reproduce the problem next week. It should be possible to work backwards from the weird issue within datumCopy(), and figure out what the problem is. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
"Daniel Verite" <daniel@manitou-mail.org> writes: > If someone wants to reproduce the problem, I've made > a custom dump (40MB) available at > http://www.manitou-mail.org/vrac/words_test.dump Thanks for providing this test data. I've attempted, so far unsuccessfully, to reproduce the crash. I'm using today's HEAD PG code, in --enable-debug --enable-cassert configuration, all parameters default except work_mem = 128MB. I see no crash with any ICU collation provided by these configurations: * RHEL 6's stock version of ICU 4.2, on x86_64 * Direct-from-upstream-source build of icu4c-57_1-src.tgz on current macOS, also x86_64. So while that isn't helping all that much, it seems to constrain the range of affected ICU versions further than we knew before. I'm quite disturbed though that the set of installed collations on these two test cases seem to be entirely different both from each other and from what you reported. The base collations look generally similar, but the "keyword variant" versions are not comparable at all. Considering that the entire reason we are interested in ICU in the first place is its alleged cross-version collation behavior stability, this gives me the exact opposite of a warm fuzzy feeling. We need to understand why it's like that and what we can do to reduce the variation, or else we're just buying our users enormous future pain. At least with the libc collations, you can expect that if you have en_US.utf8 available today you will probably still have en_US.utf8 available tomorrow. I am not seeing any reason to believe that the same holds for ICU collations. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
I wrote: > "Daniel Verite" <daniel@manitou-mail.org> writes: >> If someone wants to reproduce the problem, I've made >> a custom dump (40MB) available at >> http://www.manitou-mail.org/vrac/words_test.dump > Thanks for providing this test data. I've attempted, so far > unsuccessfully, to reproduce the crash. OK, so after installing ICU 52.1 from upstream sources, I can reproduce these problems, and I think there's nothing to see here except ICU bugs. In the case where Daniel reports huge core dumps, the problem seems to be that ucol_strcollRegular() gets into an infinite loop here: // We fetch CEs until we hit a non ignorable primary or end. uint32_t sPrimary; do { // We get the next CE sOrder = ucol_IGetNextCE(coll, sColl, status); // Stuff it inthe buffer UCOL_CEBUF_PUT(&sCEs, sOrder, sColl, status); // And keep just the primary part. sPrimary = sOrder & UCOL_PRIMARYMASK; } while(sPrimary == 0); It keeps on stuffing tokens into "sCEs", which is an expansible buffer in the same spirit as our StringInfos, and once that gets past 2^31 tokens all hell breaks loose, because the expansion code is entirely unconcerned about the possibility of integer overflow: void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) { uint32_t oldSize; uint32_t newSize; uint32_t *newBuf; ci->flags |= UCOL_ITER_ALLOCATED; oldSize = (uint32_t)(b->pos - b->buf); newSize = oldSize * 2; newBuf = (uint32_t*)uprv_malloc(newSize * sizeof(uint32_t)); if(newBuf == NULL) { *status = U_MEMORY_ALLOCATION_ERROR; } else { uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t)); ... The buffer size seems to always be a power of 2. After enough iterations oldSize is 2^31, so newSize overflows to zero, and it allocates a zero-length newBuf and proceeds to copy ~8GB of data there. No surprise that we get a SIGSEGV, but before that happens it'll have trashed a ton of memory. I suspect whatever problems valgrind is reporting are just false positives induced by that massive memory stomp. So there are two separate bugs there --- one is that that loop fails to terminate, for reasons that are doubtless specific to particular collation data, and the other is that the expansible-buffer code is not robust. The other class of failures amounts to this loop iterating till it falls off the end of memory: while(schar > (tchar = *(UCharOffset+offset))) { /* since the contraction codepoints should be ordered, we skip allthat are smaller */ offset++; } which is unsurprising, because (in my core dump) schar is 1113834 which is larger than any possible UChar value, so the loop cannot terminate except by crashing. The crash occurred while trying to process this string: buf2 = 0x1614d20 "requ\364\217\273\252te", and I do not think it's coincidence that that multibyte character there corresponds to U+10FEEA or decimal 1113834. Apparently they've got some bugs with dealing with characters beyond U+FFFF, at least in certain locales. In short then, ICU 52.1 is just too buggy to contemplate using. I haven't compared versions to see when these issues were introduced or fixed. But I'm now thinking that trying to support old ICU versions is a mistake, and we'd better serve our users by telling them not to use ICU before some-version-after-52. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm quite disturbed though that the set of installed collations on these > two test cases seem to be entirely different both from each other and from > what you reported. The base collations look generally similar, but the > "keyword variant" versions are not comparable at all. Considering that > the entire reason we are interested in ICU in the first place is its > alleged cross-version collation behavior stability, this gives me the > exact opposite of a warm fuzzy feeling. We need to understand why it's > like that and what we can do to reduce the variation, or else we're just > buying our users enormous future pain. At least with the libc collations, > you can expect that if you have en_US.utf8 available today you will > probably still have en_US.utf8 available tomorrow. I am not seeing any > reason to believe that the same holds for ICU collations. +1. That seems like something that is important to get right up-front. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So there are two separate bugs there --- one is that that loop fails to > terminate, for reasons that are doubtless specific to particular collation > data, and the other is that the expansible-buffer code is not robust. Interesting. > In short then, ICU 52.1 is just too buggy to contemplate using. > I haven't compared versions to see when these issues were introduced > or fixed. But I'm now thinking that trying to support old ICU > versions is a mistake, and we'd better serve our users by telling > them not to use ICU before some-version-after-52. Distributions generally back-patch fixes to older ICU versions. What upstream source did you use, exactly? It's possible that this bug could be fixed by the Debian maintainer. This is supposed to be a very stable version of the library. It has received security updates fairly recently [1]. Maybe this is actually a regression caused by a badly handled backpatch, like the infamous random number bug that only appeared in Debian's OpenSSL. Many important software packages will have a dependency on the Debian 52.1 ICU package, and it really shouldn't be a liability to support it. (Granted, it does appear to be one right now.) [1] http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In short then, ICU 52.1 is just too buggy to contemplate using. > Distributions generally back-patch fixes to older ICU versions. What > upstream source did you use, exactly? I went to http://www.icu-project.org/ and downloaded icu4c-52_1-src.tgz. All the file dates therein seem to be 2013-10-04. Debian, for one, is evidently not trying very hard in that direction, since not only are the bugs still there but the line numbers I saw in my backtraces agreed with Daniel's, indicating they've not changed much of anything at all in ucol.cpp. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Sun, Aug 6, 2017 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Debian, for one, is evidently not trying very hard in that direction, > since not only are the bugs still there but the line numbers I saw in > my backtraces agreed with Daniel's, indicating they've not changed > much of anything at all in ucol.cpp. I wonder if it's worth considering distrusting ucol_strcollUTF8() on some ICU versions that prove to not be up to snuff. The function has only been around since ICU-50, released in late 2012. We could have something like a TRUST_UCOL_STRCOLLUTF8 override for HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between not supporting somewhat older ICU versions and doing nothing at all. I'm pretty sure that ucol_strcollUTF8() is unused by all the major projects that use ICU, such as LibreOffice, Chromium, and Qt. Their requirements play to the strengths of UTF-16, so they naturally always used UTF-16. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
"Daniel Verite"
Date:
Tom Lane wrote: > I went to http://www.icu-project.org/ and downloaded icu4c-52_1-src.tgz. > All the file dates therein seem to be 2013-10-04. > > Debian, for one, is evidently not trying very hard in that direction, > since not only are the bugs still there but the line numbers I saw in > my backtraces agreed with Daniel's, indicating they've not changed > much of anything at all in ucol.cpp. They have 2 small patches in ucol.cpp (diff attached), but the last backtraces I've sent were against upstream, not Debian, got from the same source as you, so they wouldn't differ in the line numbers. Anyway the behavior with segfaulting was identical to Debian's. Speaking of upstream vs Debian, for the library as a whole there are quite a few security patches that are not in upstream: $ apt-get source libicu-dev [...] dpkg-source: info: extracting icu in icu-52.1 dpkg-source: info: unpacking icu_52.1.orig.tar.gz dpkg-source: info: unpacking icu_52.1-8+deb8u5.debian.tar.xz dpkg-source: info: applying icudata-stdlibs.patch dpkg-source: info: applying gennorm2-man.patch dpkg-source: info: applying icuinfo-man.patch dpkg-source: info: applying malayalam-rendering.patch dpkg-source: info: applying indic-ccmp.patch dpkg-source: info: applying mlym-crash.patch dpkg-source: info: applying two-digit-year-test.patch dpkg-source: info: applying icu-config.patch dpkg-source: info: applying CVE-2014-6585.patch dpkg-source: info: applying CVE-2014-6591.patch dpkg-source: info: applying CVE-2014-7923+7926.patch dpkg-source: info: applying CVE-2014-7940.patch dpkg-source: info: applying CVE-2014-9654.patch dpkg-source: info: applying CVE-2014-8146.patch dpkg-source: info: applying CVE-2014-8147.patch dpkg-source: info: applying CVE-2015-4760.patch dpkg-source: info: applying CVE-2014-6585+.patch dpkg-source: info: applying CVE-2015-1270.patch dpkg-source: info: applying CVE-2014-9911.patch dpkg-source: info: applying CVE-2015-2632.patch dpkg-source: info: applying CVE-2015-4844.patch dpkg-source: info: applying CVE-2016-0494.patch dpkg-source: info: applying CVE-2016-6293.patch dpkg-source: info: applying CVE-2016-7415.patch dpkg-source: info: applying CVE-2017-7867_CVE-2017-7868.patch Independantly of the bug discussed in this thread, what is puzzling to me is why upstream does not integrate any of these fixes. Here's their policy about maintenance releases: http://site.icu-project.org/processes/maintenance-releases "When a critical problem is found in ICU libraries, we try to fix the problem in the latest development stream first. If there is a demand for the fix in a past release, an ICU project developer may escalate the fix to be integrated in the release to the ICU project management committee. Once the committee approved to merge the fix into back level stream, the developer can merge the bug fix back to the past release suggested by the committee. This merge activity must be tracked by maintenance release place holder tickets and the developer should provide original ticket number and description as the response in each maintenance ticket. These fixes are automatically included in a future ICU maintenance release." Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/6/17 17:17, Peter Geoghegan wrote: > I wonder if it's worth considering distrusting ucol_strcollUTF8() on > some ICU versions that prove to not be up to snuff. The function has > only been around since ICU-50, released in late 2012. We could have > something like a TRUST_UCOL_STRCOLLUTF8 override for > HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between > not supporting somewhat older ICU versions and doing nothing at all. I can confirm that that avoids the crash in ICU 52, so it's something to consider. We would need more testing about which other ICU versions are affected, however. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 11:17, Peter Eisentraut wrote: > On 8/6/17 17:17, Peter Geoghegan wrote: >> I wonder if it's worth considering distrusting ucol_strcollUTF8() on >> some ICU versions that prove to not be up to snuff. The function has >> only been around since ICU-50, released in late 2012. We could have >> something like a TRUST_UCOL_STRCOLLUTF8 override for >> HAVE_UCOL_STRCOLLUTF8. This might make sense as a compromise between >> not supporting somewhat older ICU versions and doing nothing at all. > > I can confirm that that avoids the crash in ICU 52, so it's something to > consider. We would need more testing about which other ICU versions are > affected, however. In my testing, there are no crashes in ICU 51 and ICU 53, so we could restrict this workaround to version 52 specifically. Perhaps someone else can confirm my test results. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: [HACKERS] Re: Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/5/17 18:56, Noah Misch wrote: >> [Action required within three days. This is a generic notification.] I'm awaiting further testing and discussion. Probably nothing happening for beta3. Will report on Thursday. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/5/17 23:26, Tom Lane wrote: > I'm quite disturbed though that the set of installed collations on these > two test cases seem to be entirely different both from each other and from > what you reported. The base collations look generally similar, but the > "keyword variant" versions are not comparable at all. The differences you are seeing are probably because we intentionally removed the keyword variants for ICU 4.2: <https://www.postgresql.org/message-id/E1ddzFn-0005x5-N0%40gemulon.postgresql.org>. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 8:55 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > In my testing, there are no crashes in ICU 51 and ICU 53, so we could > restrict this workaround to version 52 specifically. Perhaps someone > else can confirm my test results. Probably worth running tests with Valgrind on those other versions, if you haven't already. We can worry about which versions it makes sense for, and we should, but it's more important to make sure that the TRUST_UCOL_STRCOLLUTF8 idea is useful as a way of defending against a class of possible ICU bugs. If every package maintainer applies their own patches for ICU, which appears to be the case, then we shouldn't be too surprised if a certain maintainer needs to go their own way on TRUST_UCOL_STRCOLLUTF8. Has anyone reported this to the Debian maintainer yet? -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 8/7/17 11:17, Peter Eisentraut wrote: >> I can confirm that that avoids the crash in ICU 52, so it's something to >> consider. We would need more testing about which other ICU versions are >> affected, however. > In my testing, there are no crashes in ICU 51 and ICU 53, so we could > restrict this workaround to version 52 specifically. Perhaps someone > else can confirm my test results. I did some desultory comparison of this source code in ICU 51.2, 52.1, and 53.1. The crash-related code fragments are not significantly different in 51.2 and 52.1. It may be that 52's problem is triggered by broken collation data that wasn't broken in previous versions ... but I have to say that I don't find 51 worthy of trust either. It seems quite likely that pre-52 can still be made to crash, we just don't yet have a test case that triggers it. In 53, they have thrown away ucol.cpp lock stock and barrel, and rewritten the whole thing from scratch in C++. So it's not surprising that these bugs are gone. (Whether it's got other bugs is hard to tell. I do not get the impression, just from eyeballing it, that the code quality went up significantly :-(.) So that explains why upstream is not back-patching fixes for these problems from newer versions: there are no newer versions that they could possibly back-patch from. I'm on board with Peter G's proposal: let's distrust ucol_strcollUTF8 in ICU < 53. If those folk felt a need to rewrite it completely after only three release cycles, that should tell us something. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 8/5/17 23:26, Tom Lane wrote: >> I'm quite disturbed though that the set of installed collations on these >> two test cases seem to be entirely different both from each other and from >> what you reported. The base collations look generally similar, but the >> "keyword variant" versions are not comparable at all. > The differences you are seeing are probably because we intentionally > removed the keyword variants for ICU 4.2: > <https://www.postgresql.org/message-id/E1ddzFn-0005x5-N0%40gemulon.postgresql.org>. After looking more closely, I think I'd mistaken keyword variants for something else. As an example, I get these "af-*" collations installed by initdb when using ICU 52.1: af-x-icuaf-u-co-standard-x-icuaf-u-co-eor-x-icuaf-u-co-search-x-icuaf-NA-x-icuaf-NA-u-co-standard-x-icuaf-NA-u-co-eor-x-icuaf-NA-u-co-search-x-icuaf-ZA-x-icuaf-ZA-u-co-standard-x-icuaf-ZA-u-co-eor-x-icuaf-ZA-u-co-search-x-icu while with ICU 57.1, the set is af-x-icuaf-u-co-standard-x-icuaf-u-co-emoji-x-icuaf-u-co-eor-x-icuaf-u-co-search-x-icu So they added emojis (I'm with Peter G that we could do without installing that by default) ... but what became of the af-NA and af-ZA collations? If I were a user who'd adopted one of those as a database collation, I'd be seriously unhappy to have them go away in a later PG release. I think we'd be well advised to filter the set of installed-by-default collations rather strongly, in hope of avoiding such problems. For starters, do we really need the keyword variants at all? People who know what those are for can create their own collations, and take their own risks of the feature disappearing in later ICU releases. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 12:25, Peter Geoghegan wrote: > Has anyone reported this to the Debian maintainer yet? This is an upstream bug, so there is nothing to report. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So they added emojis (I'm with Peter G that we could do without installing > that by default) ... but what became of the af-NA and af-ZA collations? > If I were a user who'd adopted one of those as a database collation, > I'd be seriously unhappy to have them go away in a later PG release. > > I think we'd be well advised to filter the set of installed-by-default > collations rather strongly, in hope of avoiding such problems. For > starters, do we really need the keyword variants at all? People who > know what those are for can create their own collations, and take > their own risks of the feature disappearing in later ICU releases. Actually, I think I was wrong about it being possible to create the collations after the fact. CREATE COLLATION simply doesn't support that. This surprised me. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Aug 7, 2017 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think we'd be well advised to filter the set of installed-by-default >> collations rather strongly, in hope of avoiding such problems. For >> starters, do we really need the keyword variants at all? People who >> know what those are for can create their own collations, and take >> their own risks of the feature disappearing in later ICU releases. > Actually, I think I was wrong about it being possible to create the > collations after the fact. CREATE COLLATION simply doesn't support > that. This surprised me. Surely that's a bug? regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
From
"Daniel Verite"
Date:
Peter Eisentraut wrote: > In my testing, there are no crashes in ICU 51 and ICU 53, so we could > restrict this workaround to version 52 specifically. Perhaps someone > else can confirm my test results. I'm testing all collations with ICU-51.2 with the same data as previously, and I just see it crashing at the 103th and a few other az-* after that: 2017-08-07 19:09:26.538 CEST [59353] LOG: server process (PID 60458) was terminated by signal 11: Segmentation fault 2017-08-07 19:09:26.538 CEST [59353] DETAIL: Failed process was running: select count(distinct wordtext collate "az-Latn-AZ-u-co-search-x-icu") from words_test 2017-08-07 19:09:26.538 CEST [59353] LOG: terminating any other active server processes 2017-08-07 19:09:26.538 CEST [60316] WARNING: terminating connection because of crash of another server process I don't have core files for this run but that's probably enough to say that ICU-51 is no good for us. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
I wrote: > I'm on board with Peter G's proposal: let's distrust ucol_strcollUTF8 > in ICU < 53. If those folk felt a need to rewrite it completely after > only three release cycles, that should tell us something. A bit of googling turned up this document: http://site.icu-project.org/design/collation/v2 which indicates that the ucol_strcollUTF8 rewrite was really just one facet of a rather major reimplementation. But still, that document says specifically that they knew of multiple bugs and suspected many more. At this point I'm thinking that really what we ought to do is deprecate using any pre-53 ICU release for Postgres. It is very clear that those versions are an entirely different beast from 53-and-up, and that they are now abandonware so far as ICU upstream is concerned. I have not checked, but I wonder whether 53 is also when the large change in the set of available collations happened. Maybe rejecting pre-53 would also be enough to assuage my concerns about disappearing collations. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Actually, I think I was wrong about it being possible to create the >> collations after the fact. CREATE COLLATION simply doesn't support >> that. This surprised me. > > Surely that's a bug? I had it right the first time, actually -- it is possible. There is a lack of any example of it in the docs. And, references to CREATE COLLATION right under "23.2.2.2.2. ICU collations" in the docs are under the title "Copying Collations". That threw me off. I think that we are entitled to assume plenty about what collations ICU makes available, at least per version. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > At this point I'm thinking that really what we ought to do is deprecate > using any pre-53 ICU release for Postgres. It is very clear that those > versions are an entirely different beast from 53-and-up, and that they > are now abandonware so far as ICU upstream is concerned. +1. I think that ICU is very important, and I would like to see it become the defacto standard collation provider, but I don't think it's something that needs to happen on an aggressive schedule. > I have not checked, but I wonder whether 53 is also when the large > change in the set of available collations happened. Maybe rejecting > pre-53 would also be enough to assuage my concerns about disappearing > collations. I bet it would. Although, I should point out that ICU has an annoying habit of being very tolerant of misspellings, or alternative spellings, so it might not be as bad as it appeared. We could perhaps fix this by taking a greater interest in the collations that are initially available, documenting useful variations, and so on. We should do that anyway. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 13:51, Tom Lane wrote: > At this point I'm thinking that really what we ought to do is deprecate > using any pre-53 ICU release for Postgres. It is very clear that those > versions are an entirely different beast from 53-and-up, and that they > are now abandonware so far as ICU upstream is concerned. Maybe a note in the documentation to that effect would be useful. I don't think we should prevent the use of older ICU versions in the code. That would probably annoy people who are forced to develop or test their applications on older platforms. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 13:30, Tom Lane wrote: > ... but what became of the af-NA and af-ZA collations? This was already mentioned here: https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 8/7/17 13:30, Tom Lane wrote: >> ... but what became of the af-NA and af-ZA collations? > This was already mentioned here: > https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com Well, the fact that they're "redundant" doesn't really help you if you can't pg_upgrade because the collation name you chose in v10 is not present in initdb's results in v11. So this is still a serious issue to my mind. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, the fact that they're "redundant" doesn't really help you if > you can't pg_upgrade because the collation name you chose in v10 is > not present in initdb's results in v11. So this is still a serious > issue to my mind. I agree. Even MongoDB has ICU support these days. They specifically document which collations are supported. It's just the same for DB2, and other systems that build their collations on ICU. Users do not "use the ICU collations" on these other systems. They simply use the collations that are available, choosing from a list in the documentation, or possibly create their own collations with their own customization. The ICU collations are based on the CLDR data and an IETF standard's idea of a locale identifier [1], so in an important sense they're supposed to be universal; they're not tied to ICU in particular. This is probably why ICU is ridiculously forgiving of alternate collation names, and will not throw an error if you specify an ICU collation name that is total garbage within CREATE COLLATION (there is a Postgres regression test that proves this for ICU, actually): As far as ICU is concerned, this may be coming from input from an end user over the web, where it makes sense to be so forgiving. Even stuff like the names for emoji collations, or phonebook collations, are covered by a standard, though it's not quite an IETF standard. RFC 6067 says that the CLDR data is the authoritative source of which variant subtags are allowed, and ICU uses CLDR, from the Unicode consortium. We need to move further away from the idea that there are ICU collations just like there are libc collations. [1] https://www.rfc-editor.org/rfc/bcp/bcp47.txt -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 15:29, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> On 8/7/17 13:30, Tom Lane wrote: >>> ... but what became of the af-NA and af-ZA collations? > >> This was already mentioned here: >> https://www.postgresql.org/message-id/93ef952f-40d4-24ce-bc62-97cc49a565ab@2ndquadrant.com > > Well, the fact that they're "redundant" doesn't really help you if > you can't pg_upgrade because the collation name you chose in v10 is > not present in initdb's results in v11. So this is still a serious > issue to my mind. It has never been the case that there is a guarantee that a new operating system environment will have the same or more collations available as an earlier version. Even glibc removes or renames locales. The way pg_upgrade works, it will presumably detect that during the schema dump/restore, so you should get a proper error message. In this particular case, you can create user-defined collations to fill in the missing names if you want. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 16:21, Peter Geoghegan wrote: > Even MongoDB has ICU support these days. They specifically document > which collations are supported. It's just the same for DB2, and other > systems that build their collations on ICU. Users do not "use the ICU > collations" on these other systems. They simply use the collations > that are available, choosing from a list in the documentation, or > possibly create their own collations with their own customization. We cannot know what versions of ICU or a C library users would build PostgreSQL with, so I don't see how we can provide a definite list of available locales, other than at run time like we do now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > It has never been the case that there is a guarantee that a new > operating system environment will have the same or more collations > available as an earlier version. Even glibc removes or renames locales. Indeed, and one of the alleged selling points of ICU was greater stability of collation behaviors (including naming). I'd like to try to actually achieve that. > In this particular case, you can create user-defined collations to fill > in the missing names if you want. NO. YOU. CAN'T. At least not with convenient upgrade methods like pg_upgrade. If the collation name isn't produced by the newer version's initdb, pg_upgrade will fail. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 7, 2017 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> It has never been the case that there is a guarantee that a new >> operating system environment will have the same or more collations >> available as an earlier version. Even glibc removes or renames locales. A major goal of the BCP 47 language tag format (and related standards) seems to be to decouple technical implementation details (e.g. encoding) from natural language/cultural concerns. That's the only approach that really scales for applications in the modern internet-centric world, I imagine. The glibc collations do very badly there. > Indeed, and one of the alleged selling points of ICU was greater stability > of collation behaviors (including naming). I'd like to try to actually > achieve that. IETF/CLDR describe the naming conventions for locales/collations in excruciating detail, across multiple RFCs/specs. There is no excuse for not making use of that work, IMV. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
"Daniel Verite"
Date:
Tom Lane wrote: > The other class of failures amounts to this loop iterating till it falls > off the end of memory: > > while(schar > (tchar = *(UCharOffset+offset))) { /* since the > contraction codepoints should be ordered, we skip all that are smaller */ > offset++; > } > > which is unsurprising, because (in my core dump) schar is 1113834 which is > larger than any possible UChar value, so the loop cannot terminate except > by crashing. The crash occurred while trying to process this string: > > buf2 = 0x1614d20 "requ\364\217\273\252te", > > and I do not think it's coincidence that that multibyte character > there corresponds to U+10FEEA or decimal 1113834. Apparently > they've got some bugs with dealing with characters beyond U+FFFF, > at least in certain locales. For that failure, this single comparison seems to hit it consistently with ICU-52: # SELECT 'ab' < 'abc'||chr(128519) COLLATE "bs-x-icu"; server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing therequest. 128519 is 'SMILING FACE WITH HALO' (U+1F607), and nearby codepoints cause the same bug. At least from this simple test we can produce a bug report to the maintainers who are still interested in pre-53 versions. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Mon, Aug 7, 2017 at 1:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So they added emojis (I'm with Peter G that we could do without installing > that by default) ... but what became of the af-NA and af-ZA collations? > If I were a user who'd adopted one of those as a database collation, > I'd be seriously unhappy to have them go away in a later PG release. I disagree with this argument. I think you're looking at this problem from the wrong angle. If the customer doesn't really care what collation they end up with, then filtering down the set of collations is exactly the right fix -- they'll pick something that is included rather than something that is excluded. However, I think we should assume that people choose a collation because they want and need the sorting behavior it gives. In that case, if we remove the collation from the default catalog contents, then people who need it will (1) see that it's not there, be unhappy, and give up or (2) figure out that they can create it manually, do so, and then have the same upgrade problem. In other words, excluding, say, emoji collations from what gets imported is just making a value judgement that those collations aren't important and people shouldn't want to use them. It's saying that we know better than the ICU maintainers which collations ought to exist. To use one of your phrases, color me skeptical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 10:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > In other words, excluding, say, emoji collations from what gets > imported is just making a value judgement that those collations aren't > important and people shouldn't want to use them. Yes, it is. I think that's fine, though. Other database systems that use ICU for collations do this. Without exception, I think. > It's saying that we > know better than the ICU maintainers which collations ought to exist. As I've pointed out already, we populate pg_collation by asking ucol_getKeywordValuesForLocale() to get only "commonly used [variant] values with the given locale" within pg_import_system_collations(). That's a value judgement. This doesn't have documented stability guarantees. I'm not sure how many technically distinct collations we could generate by actually including every possible variant, but it might be a great deal more than we get already. Why would users have the same upgrade problem when they created these collations manually? The BCP 47 language tag format seems to be very much centered around "doing its best", even though that could be pretty far from the desired behavior [1]. This makes a certain amount of sense, considering that it's well documented, and can be considered a stable API. It really shouldn't break, but if it does then I suppose it's probably because the behavior doesn't have an analog in the ICU version in use. [1] postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Wed, Aug 9, 2017 at 10:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> In other words, excluding, say, emoji collations from what gets >> imported is just making a value judgement that those collations aren't >> important and people shouldn't want to use them. > Yes, it is. I think that's fine, though. Other database systems that > use ICU for collations do this. Without exception, I think. Actually, I don't think that's the issue at all. People are free to make other ICU collations if they want to. My point is that we should encourage them to do that, rather than depend on initdb-provided collations, because manually-created collations are much more certain to move across version upgrades safely. If we were sure that pg_import_system_collations would produce pretty much the same set of collation names with future ICU releases as it does with current ones, then there would be no issue --- but the evidence at hand suggests the opposite. I want to do something to address that stability issue before it comes back to bite us. I suppose a different way to address this would be to make pg_upgrade smart enough to deal with the situation, by creating ICU collations that are used in the source installation but are missing from the initdb-provided set in the target. But even if we had that, I'm dubious that having hundreds of collations present by default is really all that user-friendly. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yes, it is. I think that's fine, though. Other database systems that >> use ICU for collations do this. Without exception, I think. > > Actually, I don't think that's the issue at all. People are free to > make other ICU collations if they want to. My point is that we should > encourage them to do that, rather than depend on initdb-provided > collations, because manually-created collations are much more certain > to move across version upgrades safely. If we were sure that > pg_import_system_collations would produce pretty much the same set of > collation names with future ICU releases as it does with current ones, > then there would be no issue --- but the evidence at hand suggests the > opposite. I want to do something to address that stability issue before > it comes back to bite us. I must have been unclear, then. I am fully in agreement with what you say here. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I suppose a different way to address this would be to make pg_upgrade > smart enough to deal with the situation, by creating ICU collations > that are used in the source installation but are missing from the > initdb-provided set in the target. But even if we had that, I'm > dubious that having hundreds of collations present by default is really > all that user-friendly. Let's try to get to a proposal that we'll all be happy with. The base locale names, which include regional variants like Austrian German, are as stable as possible. pg_import_system_collations() should just add those. If anything changes there, it's because the locale literally ceases to exist for political reasons (it is subject to a 5 year CLDR deprecation policy when this happens). Not much point in worrying about that. You said yourself that these seemed stable. Separately, there could be a new SQL-callable function that advertises what ICU says it makes available, melded into a valid collation name, in the style of today's pg_import_system_collations(). This could ask ucol_getKeywordValuesForLocale() to get only "commonly used [variant] values with the given locale" for a specified base locale (system pg_collation OID?), or it could also show every possible variant. We'd encourage users to investigate the variants within the documentation, including giving specific examples of what is generally possible. There could be practical examples of why the user might actually want to create, say, a phonebook collation, or a traditional Spanish collation. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Wed, Aug 9, 2017 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I suppose a different way to address this would be to make pg_upgrade > smart enough to deal with the situation, by creating ICU collations > that are used in the source installation but are missing from the > initdb-provided set in the target. Yeah, that idea has some appeal. > But even if we had that, I'm > dubious that having hundreds of collations present by default is really > all that user-friendly. The flip side is that it's not clear to me how the user is supposed to discover possibly-useful collations that don't show up in pg_collation. The documentation for CREATE COLLATION says only that "The available choices depend on the operating system and build options." Section 23.2.2.2 says "With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. (In fact, any string will be accepted as a locale name.)" So, there's no guidance on how to pick a string that will work, and no error if you pick one that doesn't. Wahoo! I assume that this is in fact the whole point of putting any ICU-related entries in pg_collation at all -- or for that matter any libc-related entries. If it were easy to guess what parameters you might want to provide to CREATE COLLATION, we could just let users create the collations they happen to need and it would be only a minor inconvenience. As it is, removing things from pg_collation seems to make them all but undiscoverable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 11:46 AM, Peter Geoghegan <pg@bowt.ie> wrote: > The base locale names, which include regional variants like Austrian > German, are as stable as possible. pg_import_system_collations() > should just add those. I just noticed that ICU collations within pg_collation do not include entries for each of the many regional variants of English, such as English from Ireland, Britain, the Cook Islands, etc. OTOH, glibc does have all these variants listed within pg_collation, at least on my system. This is because we don't actually add one pg_collation entry per ICU locale. Rather, we add one entry per distinct ICU collation, via ucol_countAvailable() + ucol_getAvailable(). Maybe Austrian German actually is sorted in a slightly different way to German German, and so gets its own pg_collation entry? I guess this is fine, because English is in practice sorted in exactly the same way throughout all English locales (except for a few true variants, like "English (United States, Computer)"). These unlisted locales do have different currency symbols and so on, but that's not something we'll use ICU for, so it's fine. This difference will need to be documented, though. It would help if CREATE COLLATION left new ICU collations with the same useful "Description" as initdb created collations will have; maybe that should be added. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > It would help if CREATE COLLATION left new > ICU collations with the same useful "Description" as initdb created > collations will have; maybe that should be added. I do not think it is CREATE COLLATION's job to provide a comment; no other CREATE command does so. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: >> It would help if CREATE COLLATION left new >> ICU collations with the same useful "Description" as initdb created >> collations will have; maybe that should be added. > > I do not think it is CREATE COLLATION's job to provide a comment; > no other CREATE command does so. It's not a comment. It's a description that comes from ICU. Just like when collations are added by initdb, so that there is some kind of parity. The user doesn't get a say in what it is. It helps the user to verify that they're getting what they thought they were, which seems very valuable, given how loosely the 'locale' string they provide may be interpreted. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/7/17 14:42, Peter Eisentraut wrote: > On 8/7/17 13:51, Tom Lane wrote: >> At this point I'm thinking that really what we ought to do is deprecate >> using any pre-53 ICU release for Postgres. It is very clear that those >> versions are an entirely different beast from 53-and-up, and that they >> are now abandonware so far as ICU upstream is concerned. > > Maybe a note in the documentation to that effect would be useful. > > I don't think we should prevent the use of older ICU versions in the > code. That would probably annoy people who are forced to develop or > test their applications on older platforms. I propose the attached patches for fixing the crash. This just disables the use of the problematic function for versions before ICU 53. (The 0001 patch is just something I found that was needed while experimenting with different ICU versions. The 0002 patch is the real fix.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/9/17 20:45, Peter Eisentraut wrote: > On 8/7/17 14:42, Peter Eisentraut wrote: >> On 8/7/17 13:51, Tom Lane wrote: >>> At this point I'm thinking that really what we ought to do is deprecate >>> using any pre-53 ICU release for Postgres. It is very clear that those >>> versions are an entirely different beast from 53-and-up, and that they >>> are now abandonware so far as ICU upstream is concerned. >> >> Maybe a note in the documentation to that effect would be useful. >> >> I don't think we should prevent the use of older ICU versions in the >> code. That would probably annoy people who are forced to develop or >> test their applications on older platforms. > > I propose the attached patches for fixing the crash. This just disables > the use of the problematic function for versions before ICU 53. This has been committed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/9/17 15:29, Robert Haas wrote: > On Wed, Aug 9, 2017 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I suppose a different way to address this would be to make pg_upgrade >> smart enough to deal with the situation, by creating ICU collations >> that are used in the source installation but are missing from the >> initdb-provided set in the target. > > Yeah, that idea has some appeal. I think the manual workflow would be that you initdb the target instance, then log into the target instance to create the missing collations, then run pg_upgrade. While pg_upgrade could probably detect which collations are missing on the target side, I don't think it follows that it can just create the missing ones automatically. Manual intervention would be necessary (and desirable IMO) to analyze the situation. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/9/17 14:07, Tom Lane wrote: > Actually, I don't think that's the issue at all. People are free to > make other ICU collations if they want to. My point is that we should > encourage them to do that, rather than depend on initdb-provided > collations, because manually-created collations are much more certain > to move across version upgrades safely. If we were sure that > pg_import_system_collations would produce pretty much the same set of > collation names with future ICU releases as it does with current ones, > then there would be no issue --- but the evidence at hand suggests the > opposite. I want to do something to address that stability issue before > it comes back to bite us. I understand this concern, but I don't know how to solve it. Any string (more or less), is valid as an ICU locale name. In order to create some initial collations, we ask ICU, give me a list of distinct locales with their canonical names. That's the list we use. The next version of ICU gives us a different list. There is no way to tell the API, no, I meant the list you gave me last time. The only way to solve that problem directly is to preload no locales. Even the idea of maintaining a list of locales by hand cannot guarantee that things won't go away in the future. I think it is better in the long run that we acknowledge that the list of preloaded collations will change (it already happens with glibc), and we develop some guidance on how to deal with that. As mentioned in another message, I don't foresee a fully automatic solution, however. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/9/17 14:46, Peter Geoghegan wrote: > The base locale names, which include regional variants like Austrian > German, are as stable as possible. Well, no. The problem (well, one of several) is exactly that an older version of ICU includes things like de-BE, and a newer version does not. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 8:17 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/9/17 14:46, Peter Geoghegan wrote: >> The base locale names, which include regional variants like Austrian >> German, are as stable as possible. > > Well, no. The problem (well, one of several) is exactly that an older > version of ICU includes things like de-BE, and a newer version does not. I can't understand why that should be. ICU 58 uses CLDR 30, which quite clearer does have the locale de_BE: http://www.unicode.org/cldr/charts/30/summary/de.html -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/14/17 12:09, Peter Geoghegan wrote: > On Mon, Aug 14, 2017 at 8:17 AM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> On 8/9/17 14:46, Peter Geoghegan wrote: >>> The base locale names, which include regional variants like Austrian >>> German, are as stable as possible. >> >> Well, no. The problem (well, one of several) is exactly that an older >> version of ICU includes things like de-BE, and a newer version does not. > > I can't understand why that should be. ICU 58 uses CLDR 30, which > quite clearer does have the locale de_BE: I'm not familiar with the nuances of CLDR, but quite plausibly there is a *locale* named de_BE, but the collation of "de" as used in "BE" is not different than "de" used elsewhere, so it is omitted from the list as redundant. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:27 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: >> I can't understand why that should be. ICU 58 uses CLDR 30, which >> quite clearer does have the locale de_BE: > > I'm not familiar with the nuances of CLDR, but quite plausibly there is > a *locale* named de_BE, but the collation of "de" as used in "BE" is not > different than "de" used elsewhere, so it is omitted from the list as > redundant. That's also what I think is the most likely explanation. Maybe the lesson here is that we ought to be adding all locales at initdb time, not all distinct collations. That would also get us a dozen or so identical English variants, and so on, just like with glibc. I don't especially want that, but it seems like it might be better from a stability point of view. After all, CLDR gives guarantees about the stability of locales, not collations. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/9/17 17:29, Peter Geoghegan wrote: > On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Peter Geoghegan <pg@bowt.ie> writes: >>> It would help if CREATE COLLATION left new >>> ICU collations with the same useful "Description" as initdb created >>> collations will have; maybe that should be added. I had been thinking about that, too, while I was writing it, but there is the semantic objection: >> I do not think it is CREATE COLLATION's job to provide a comment; >> no other CREATE command does so. How about a special option in the CREATE COLLATION to create the comment automatically on demand? Sample patch attached. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 10:22 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/9/17 17:29, Peter Geoghegan wrote: >> On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Peter Geoghegan <pg@bowt.ie> writes: >>>> It would help if CREATE COLLATION left new >>>> ICU collations with the same useful "Description" as initdb created >>>> collations will have; maybe that should be added. > > I had been thinking about that, too, while I was writing it, but there > is the semantic objection: > >>> I do not think it is CREATE COLLATION's job to provide a comment; >>> no other CREATE command does so. I must admit that I missed that the description was cataloged as a comment. Is it out of the question to add another column to pg_collation, that just always has the ICU-provided description? These seem like separate concerns to me. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: >>> On Wed, Aug 9, 2017 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I do not think it is CREATE COLLATION's job to provide a comment; >>>> no other CREATE command does so. > I must admit that I missed that the description was cataloged as a > comment. Is it out of the question to add another column to > pg_collation, that just always has the ICU-provided description? These > seem like separate concerns to me. That seems like a reasonable solution to me, if it's not too late for another catversion bump. I like it better than the auto-comment thing Peter E. suggests nearby. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I must admit that I missed that the description was cataloged as a >> comment. Is it out of the question to add another column to >> pg_collation, that just always has the ICU-provided description? These >> seem like separate concerns to me. > > That seems like a reasonable solution to me, if it's not too late > for another catversion bump. I like it better than the auto-comment > thing Peter E. suggests nearby. Another advantage of doing a catversion bump for v10 is that it lets us add a new SQL-callable function (or two). I would like to add a function to solve some of our discoverability problems around variant collations. We now agree that we should not add variants at initdb time, but Peter E's concern about where that leaves the discoverability of the variants (phonebook, emoji, pinyin, and so on) is a concern that I share. Similarly, it would also be helpful if users could inquire about both ICU version, and the corresponding CLDR version, using a new view. That way, they could easily find the right CLDR version/standard, which is where all the customization stuff is actually documented. Any thoughts on that? -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Aug 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That seems like a reasonable solution to me, if it's not too late >> for another catversion bump. I like it better than the auto-comment >> thing Peter E. suggests nearby. > Another advantage of doing a catversion bump for v10 is that it lets > us add a new SQL-callable function (or two). I would like to add a > function to solve some of our discoverability problems around variant > collations. We now agree that we should not add variants at initdb > time, but Peter E's concern about where that leaves the > discoverability of the variants (phonebook, emoji, pinyin, and so on) > is a concern that I share. > Similarly, it would also be helpful if users could inquire about both > ICU version, and the corresponding CLDR version, using a new view. > That way, they could easily find the right CLDR version/standard, > which is where all the customization stuff is actually documented. Hmm ... this all seems like good stuff, but it also seems like new development, and it's way too late in the v10 cycle for any significant amount of that. My view of the state of our ICU support is that in v10, only brave early adopters are going to be messing with it; it's not going to be production-grade for most people for another release or three. That being the case, I don't think that whether second-order options have discoverability problems is really a critical concern right now. I think at this point I'd vote to drop auto-installation of keyword variants, and I would also vote against Peter's auto_comment proposal. We shouldn't be adding features that we're going to supersede in v11. Maybe, if people are okay with a catversion bump this late, we could push the ICU descriptions into pg_collation proper, but I think it would be fine to leave pg_import_system_collations's behavior in that regard alone for v10, too. As for new functions, views, etc, that's all excellent v11 material. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm ... this all seems like good stuff, but it also seems like new > development, and it's way too late in the v10 cycle for any significant > amount of that. Fair enough. > I think at this point I'd vote to drop auto-installation of keyword > variants, and I would also vote against Peter's auto_comment proposal. > We shouldn't be adding features that we're going to supersede in v11. +1 > Maybe, if people are okay with a catversion bump this late, we could > push the ICU descriptions into pg_collation proper, but I think it > would be fine to leave pg_import_system_collations's behavior in that > regard alone for v10, too. I really think we should add a pg_collation column to store ICU's description of the collation, because that's something that we'll have to live with forever. I don't think we have pg_import_system_collations's behavior all worked out just yet. I'm not sure that we'd do the right thing at initdb even if pg_import_system_collations did not add the keyword variants. It would still add all distinct collations, instead of all CLDR locales (some of which will have equivalent collation behavior). I suspect that this explains the stability problems you saw [1] that were not explainable by available variants differing. I do think that we need to have a better answer to this "locales vs. collations" question for v10. [1] postgr.es/m/CAH2-Wzm3EOEqYB48m7aT7sjPXw=kUHU=FdzzdU90dfiatt62CQ@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Aug 14, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe, if people are okay with a catversion bump this late, we could >> push the ICU descriptions into pg_collation proper, but I think it >> would be fine to leave pg_import_system_collations's behavior in that >> regard alone for v10, too. > I really think we should add a pg_collation column to store ICU's > description of the collation, because that's something that we'll have > to live with forever. Maybe I'm missing something, but I didn't think there would be anything terribly disastrous about having pg_import_system_collations() install these descriptions as comments in v10 and then changing it to put them directly into pg_collation in later versions. We'd have to adapt the behavior of psql's \dO, but that's true no matter when we change that. > I don't think we have pg_import_system_collations's behavior all > worked out just yet. Agreed, but we can probably tweak that without forcing a catversion bump. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe I'm missing something, but I didn't think there would be anything > terribly disastrous about having pg_import_system_collations() install > these descriptions as comments in v10 and then changing it to put them > directly into pg_collation in later versions. We'd have to adapt the > behavior of psql's \dO, but that's true no matter when we change that. Well, I don't think you're going to get on board with the idea of having CREATE COLLATION add a comment (and I agree that that's a bad idea), so CREATE COLLATION is in a very bad spot if we put this off for a release. Unless we add an ICU-owned pg_collation column for the human-readable ICU string, CREATE COLLATION will not let the user determine what ICU thinks the collation/language is from within Postgres. This seems arbitrarily different from the situation with initdb ICU collations, where currently the ICU string comment is added. Importantly, not having the ICU string deprives the user of a way of determining whether or not ICU has even heard of the language/region they specified. The language tags are supposed to be forgiving, but that does mean that ICU won't actually complain when the user fat-fingers their CREATE COLLATION language tag. (Note that this doesn't mean that we couldn't provide *minimal* sanitization of language tags by CREATE COLLATION; we can and should reject multiple "co" subtags, and similar unambiguously bogus subtags.) >> I don't think we have pg_import_system_collations's behavior all >> worked out just yet. > > Agreed, but we can probably tweak that without forcing a catversion > bump. True. I might have used the wrong terminology on this "locales vs. collations" thing. Perhaps what we actually need is pg_collation entries at initdb time that are an enumeration of all locales *and their regions*, as opposed to all locales. I'm researching this now. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Aug 14, 2017 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe I'm missing something, but I didn't think there would be anything >> terribly disastrous about having pg_import_system_collations() install >> these descriptions as comments in v10 and then changing it to put them >> directly into pg_collation in later versions. We'd have to adapt the >> behavior of psql's \dO, but that's true no matter when we change that. > Well, I don't think you're going to get on board with the idea of > having CREATE COLLATION add a comment (and I agree that that's a bad > idea), so CREATE COLLATION is in a very bad spot if we put this off > for a release. How so? When you create the collation in v11, the column will get populated, no problem. (I'm envisioning that we add code to CREATE COLLATION to populate it, much as Peter E's auto_comment patch did except for where the data goes.) > Unless we add an ICU-owned pg_collation column for the human-readable > ICU string, CREATE COLLATION will not let the user determine what ICU > thinks the collation/language is from within Postgres. Somehow, this does not strike me as a stop-ship problem for v10. It's a nice-to-have feature, sure, but we're past the time for that. For now, we have to take the attitude that it's the user's problem whether the arguments for CREATE COLLATION are sane --- we can blame that policy on ICU itself. (BTW, I would argue vehemently against defining this new column as "ICU owned". I think it should just be "colldescription" and we populate it with whatever is reasonable. For example we can move the comments for the hardwired collations there. I dunno whether we can come up with anything authoritative for libc collations, but if we can, they're welcome to the party as well.) regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 1:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, I don't think you're going to get on board with the idea of >> having CREATE COLLATION add a comment (and I agree that that's a bad >> idea), so CREATE COLLATION is in a very bad spot if we put this off >> for a release. > > How so? When you create the collation in v11, the column will get > populated, no problem. (I'm envisioning that we add code to CREATE > COLLATION to populate it, much as Peter E's auto_comment patch did > except for where the data goes.) I meant that we'd be in a bad spot for v10, at least. I suppose that you're right that it can be corrected in v11. This doesn't seem like the best plan, but I understand that we are short on time here. >> Unless we add an ICU-owned pg_collation column for the human-readable >> ICU string, CREATE COLLATION will not let the user determine what ICU >> thinks the collation/language is from within Postgres. > > Somehow, this does not strike me as a stop-ship problem for v10. > It's a nice-to-have feature, sure, but we're past the time for that. > For now, we have to take the attitude that it's the user's problem > whether the arguments for CREATE COLLATION are sane --- we can blame > that policy on ICU itself. I hope that we manage to slip in the minimal sanitization I described, which doesn't need a catversion bump at all. > (BTW, I would argue vehemently against defining this new column as > "ICU owned". I think it should just be "colldescription" and we > populate it with whatever is reasonable. Agreed. "ICU owned" was intended as an informal shorthand for "collation provider owned". -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 12:54 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I might have used the wrong terminology on this "locales vs. > collations" thing. Perhaps what we actually need is pg_collation > entries at initdb time that are an enumeration of all locales *and > their regions*, as opposed to all locales. I'm researching this now. I've figured this out. It's not a matter of "locales vs. collations" -- it's a matter of avoiding skipping some ICU collations during initdb, which we seem to do right now. CollationCreate() is being passed "if_not_exists = true" for ICU collations during initdb for ICU collations, within pg_import_system_collations(). We're actually not creating new entries for some collations that get their own distinct listing/ucol_countAvailable() iteration. This happens because the ICU-wise name isn't fully spelled out (country wasn't included for entries that had one), and so we incorrectly ignore some collation entries as "duplicates". (I don't know why some regional variants, like de_AT, are still included in master's pg_collation, but most regional variants are not.) Attached self-contained program should give you some idea what I'm talking about. Sample output: $ ./icu-coll-versions | grep Spanish Spanish | es | es | | es | es Spanish (Latin America) | es_419 | es | 419 | es | es Spanish (Argentina) | es_AR | es | AR | es | es Spanish (Bolivia) | es_BO | es | BO | es | es Spanish (Chile) | es_CL | es | CL | es | es Spanish (Colombia) | es_CO | es | CO | es | es Spanish (Costa Rica) | es_CR | es | CR | es | es Spanish (Cuba) | es_CU | es | CU | es | es Spanish (Dominican Republic) | es_DO | es | DO | es | es Spanish (Ceuta & Melilla) | es_EA | es | EA | es | es Spanish (Ecuador) | es_EC | es | EC | es | es Spanish (Spain) | es_ES | es | ES | es | es Spanish (Equatorial Guinea) | es_GQ | es | GQ | es | es Spanish (Guatemala) | es_GT | es | GT | es | es Spanish (Honduras) | es_HN | es | HN | es | es Spanish (Canary Islands) | es_IC | es | IC | es | es Spanish (Mexico) | es_MX | es | MX | es | es Spanish (Nicaragua) | es_NI | es | NI | es | es Spanish (Panama) | es_PA | es | PA | es | es Spanish (Peru) | es_PE | es | PE | es | es Spanish (Philippines) | es_PH | es | PH | es | es Spanish (Puerto Rico) | es_PR | es | PR | es | es Spanish (Paraguay) | es_PY | es | PY | es | es Spanish (El Salvador) | es_SV | es | SV | es | es Spanish (United States) | es_US | es | US | es | es Spanish (Uruguay) | es_UY | es | UY | es | es Spanish (Venezuela) | es_VE | es | VE | es | es As an example, "es_VE" is a canonical locale name here. Note that there are many more collations listed here than Spanish ICU pg_collation entries that you'll find following initdb (forgetting about the keyword variant stuff, which is really another issue). In order to get stable pg_collation names, we should probably use country code within pg_import_system_collations(), for collations where there is a country code. We also need to add script (using uloc_getScript()), so that those collations with a non-default script (e.g. "Serbian (Latin, Bosnia & Herzegovina)") similarly have their own pg_collation entries (FYI, "script" isn't broken out into its own column by my test program). In short, every row that this test program outputs should have a pg_collation entry after initdb -- the number should match exactly. Do we really need to pass "if_not_exists = true", anyway? Why shouldn't initdb fail if there are apparent duplicate ICU collations? BTW, I suspect that this is why very old ICU versions appeared to not accept language tags constructed with subtags produced by ucol_getKeywordValuesForLocale(), per eccead9's commit message. The collation's country/script was omitted, so concatenating variants somehow didn't make sense (wrong script?). It certainly makes no sense that earlier ICU versions just didn't get a valid language tag when using the infrastructure that is supposed to be used to build language tags. It may be necessary to revert eccead9, once this understanding of the situation is confirmed. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Peter Geoghegan <pg@bowt.ie> writes: > Do we really need to pass "if_not_exists = true", anyway? Why > shouldn't initdb fail if there are apparent duplicate ICU collations? Because the villagers will come after you; see commit 0b13b2a7712b6f91590b7589a314240a14520c2f. It's possible that there's an argument for not de-duplicating ICU collation names even though we must do so for libc collation names. But I'm not exactly sure what that would be. Having initdb fail if the ICU configuration is funny isn't going to endear us to anyone. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: >> Do we really need to pass "if_not_exists = true", anyway? Why >> shouldn't initdb fail if there are apparent duplicate ICU collations? > > Because the villagers will come after you; see commit > 0b13b2a7712b6f91590b7589a314240a14520c2f. Fair enough. Changing it is actually unnecessary anyway. pg_import_system_collations() should simply use uloc_countAvailable() + uloc_getAvailable, rather than ucol_countAvailable() + ucol_getAvailable(). Just like my test program. FYI, without the keyword variants, but with all base locales, ICU 55 adds 684 base locales to pg_collation. Should I write a real patch to make the changes I deem necessary, or do you prefer to Peter E.? The comments that pg_import_system_collations() adds to each locale have an annoying tendency to exclude anything that isn't ascii-safe, which is much more noticeable when locales populate pg_collation instead of collations: ...foo │ fr-BE-x-icu │ fr-BE │ fr-BE │ icu │ French (Belgium)foo │ fr-BF-x-icu │ fr-BF │ fr-BF │ icu │ French (Burkina Faso)foo │ fr-BI-x-icu │ fr-BI │ fr-BI │ icu │ French (Burundi)foo │ fr-BJ-x-icu │ fr-BJ │ fr-BJ │ icu │ French (Benin)foo │ fr-BL-x-icu │ fr-BL │ fr-BL │ icu │foo │ fr-CA-x-icu │ fr-CA │ fr-CA │ icu │ French (Canada)foo │ fr-CD-x-icu │ fr-CD │ fr-CD │ icu │ French (Congo - Kinshasa)foo │ fr-CF-x-icu │ fr-CF │ fr-CF │ icu │ French (Central African Republic)foo │ fr-CG-x-icu │ fr-CG │ fr-CG │ icu │ French (Congo - Brazzaville)foo │ fr-CH-x-icu │ fr-CH │ fr-CH │ icu │ French (Switzerland)foo │ fr-CI-x-icu │ fr-CI │ fr-CI │ icu │foo │ fr-CM-x-icu │ fr-CM │ fr-CM │ icu │ French (Cameroon)foo │ fr-DJ-x-icu │ fr-DJ │ fr-DJ │ icu │ French (Djibouti) ... Notice the lack of descriptions here, just because they weren't ascii-safe. I now wonder if we should even keep the comment thing at all. We can eventually add an SQL function, that is called by \dOS+, that interrogates ICU for a description on the fly. No need to store anything. Another thing that I dislike about get_icu_locale_comment() is that the descriptions we show are always English display names, regardless of pg_database collation or anything else. We shouldn't hard-code the use of English for display name, on general principle. > It's possible that there's an argument for not de-duplicating ICU > collation names even though we must do so for libc collation names. > But I'm not exactly sure what that would be. Having initdb fail if > the ICU configuration is funny isn't going to endear us to anyone. BTW, get_icu_language_tag() has a buglet. ULOC_FULLNAME_CAPACITY does not bound the size of a language tag; those can be almost arbitrarily long, unlike locale identifiers. That should probably be fixed, if only to avoid setting a bad example. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > The comments that pg_import_system_collations() adds to each locale > have an annoying tendency to exclude anything that isn't ascii-safe, Right, but that's *necessary* for anything that goes into template0. > Notice the lack of descriptions here, just because they weren't > ascii-safe. I now wonder if we should even keep the comment thing at > all. We can eventually add an SQL function, that is called by \dOS+, > that interrogates ICU for a description on the fly. Yup, the same idea came to me as I was reading your message. So we don't really want a pg_collation column either, but a function that can produce a result that varies with the current database's encoding. > Another thing that I dislike about get_icu_locale_comment() is that > the descriptions we show are always English display names, regardless > of pg_database collation or anything else. We shouldn't hard-code the > use of English for display name, on general principle. Again, that was mostly driven by the desire to get ASCII results (or at least so I imagine; Peter E. might have a different take). I think there's value in having a function that produces such a description, but we could have a different one that produces a description in the language indicated by the database's locale. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: >> The comments that pg_import_system_collations() adds to each locale >> have an annoying tendency to exclude anything that isn't ascii-safe, > > Right, but that's *necessary* for anything that goes into template0. Of course. I phrased this badly. >> Another thing that I dislike about get_icu_locale_comment() is that >> the descriptions we show are always English display names, regardless >> of pg_database collation or anything else. We shouldn't hard-code the >> use of English for display name, on general principle. > > Again, that was mostly driven by the desire to get ASCII results > (or at least so I imagine; Peter E. might have a different take). Probably. > I think there's value in having a function that produces such a > description, but we could have a different one that produces a > description in the language indicated by the database's locale. I suppose it's useful to have it match the database locale/language, too. The point IMV is that display name is strictly for, uh, display. I'm not attached to any particular scheme for determining which language to display in which context. That's for v11. I take it that you're in agreement with me on pg_import_system_collations() adding a display name as a comment -- we should prevent that, even for v10? -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Peter Geoghegan <pg@bowt.ie> writes: > I take it that you're in agreement with me on > pg_import_system_collations() adding a display name as a comment -- we > should prevent that, even for v10? I'm kind of agnostic on that. I think we've established that that's not what we want for v11 and up, but it's not clear to me that the current behavior isn't an acceptable short-term answer. If there are comments in v10, and then in v11 there aren't because the data went somewhere else, what will that really break? BTW, it strikes me that a description function that produces strings in the database's language/encoding is not without gotchas. Suppose, for example, that we have a database using UTF8 while psql is using client_encoding LATIN1. If \dO calls a function that produces random non-ASCII strings, then there's a significant hazard of \dO failing altogether due to encoding conversion failure on the way to the client. That doesn't seem like it will pass muster. I'm not sure what to do instead, but this seems like a reason why there's value in plain-ASCII descriptions. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Mon, Aug 14, 2017 at 9:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: >> I take it that you're in agreement with me on >> pg_import_system_collations() adding a display name as a comment -- we >> should prevent that, even for v10? > > I'm kind of agnostic on that. I think we've established that that's > not what we want for v11 and up, but it's not clear to me that the > current behavior isn't an acceptable short-term answer. If there > are comments in v10, and then in v11 there aren't because the data > went somewhere else, what will that really break? Either way, we might want to point out in the docs that you can pretty easily find the details of every locale on the web: https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/14/17 23:55, Peter Geoghegan wrote: > pg_import_system_collations() should simply use uloc_countAvailable() > + uloc_getAvailable, rather than ucol_countAvailable() + > ucol_getAvailable(). It's not clear to me that this is better. Why do we need to use a function that is clearly not the preferred API for this ("col" vs "loc") just to get more entries? If we go down this route, then we'll be on the hook forever to keep adding more and more predefined entries by whatever means necessary. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Thu, Aug 17, 2017 at 6:13 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/14/17 23:55, Peter Geoghegan wrote: >> pg_import_system_collations() should simply use uloc_countAvailable() >> + uloc_getAvailable, rather than ucol_countAvailable() + >> ucol_getAvailable(). > > It's not clear to me that this is better. Why do we need to use a > function that is clearly not the preferred API for this ("col" vs "loc") > just to get more entries? My argument for doing this is very simple: ICU/CLDR/BCP 47 provides stability guarantees for locales, not collations [1]. For example, as we discussed, de_BE didn't actually go away -- it just stopped being a distinct collation within ICU, for reasons that are implementation defined. I admit that there are arguments against this, but by far the most important consideration should be the stability of the names used for pg_collation entries created during initdb. > If we go down this route, then we'll be on > the hook forever to keep adding more and more predefined entries by > whatever means necessary. Why? Locales have a mapping to various ISO codes (Country, language, etc). It's not like new ones come along very often. [1] https://tools.ietf.org/html/bcp47#section-3.4 -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Thu, Aug 17, 2017 at 6:22 PM, Peter Geoghegan <pg@bowt.ie> wrote: > My argument for doing this is very simple: ICU/CLDR/BCP 47 provides > stability guarantees for locales, not collations [1]. For example, as > we discussed, de_BE didn't actually go away -- it just stopped being a > distinct collation within ICU, for reasons that are implementation > defined. I have data to back this up. I attach 2 files: one is a listing of locale XML files from within CLDR 1.9's ./common/main/, dating from December 2010, and the other is a similar listing for CLDR 3.1, dating from April 2017. This roughly covers every ICU version we'll support on day 1. The listing is sorted alphabetically, to ease comparison. Summary: $ cat locale_list_cldr-19.txt | wc -l 605 $ cat locale_list_cldr-31.txt | wc -l 722 $ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^-[a-zA-Z]" | wc -l 144 $ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^+[a-zA-Z]" | wc -l 261 So, there have been 144 locales removed in that time, and 261 added. My proposal to standardize on using all locales ICU makes available, rather than all behaviorally distinct collations, clearly does not ensure perfect stability. It does actually work pretty well in practice, though. The number 144 is misleadingly high. If you actually look at what went away in detail, it looks like there is a lot of script variants of the same language/country code. Plus, the changes themselves are non-technical in nature. The churn seems to be in part due to geopolitical changes, such as 5 years [1] passing after the dissolution of Serbia and Montenegro. However, it is mostly due to switching from ISO 639-1 to ISO 639-3 codes in cases where a finer distinction about cultural preferences needed to be made (note that they still only list *macro* language/region/script combinations as distinct collations). For example, Kurdish went from being "ku-" to 3 different macro languages: "ckb-" (Central Kurdish), "kmr-" (Northern Kurdish), and "sdh-" (Southern Kurdish). Wikipedia says of ISO 639-3: "Because it provides comprehensive language coverage, giving equal opportunity for all languages, and because of its wide adoption in information technologies, ISO 639-3 provides an important technology component addressing the digital divide problem". We can hope that it will be the last such revision ever needed, because this digital divide problem is solved once and for all, at least as far as these standards go. CLDR prefers to use ISO 639-1 language codes for compatibility [2], which is why the language codes are mostly still 2 letters (ISO 639-1). "en" did not change to "eng", because there was no cultural reason to do so, and thus there was a 1:1 mapping between "en" and "eng" anyway. Regions/countries will only change due to rare geopolitical events. In summary, I think that these changes are fairly low impact in practice, and are entirely explainable by political changes and cultural controversies. They really are minimal, because CLDR/ICU really does take the stability of collation names seriously. We can and should ensure that locales like "de_BE" are available in every ICU version, because that is an inexcusable technical oversight, and is not due to a cultural or political issue. [1] http://cldr.unicode.org/index/process/cldr-data-retention-policy [2] http://www.unicode.org/reports/tr35/#unicode_language_subtag_validity -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Eisentraut
Date:
On 8/17/17 21:22, Peter Geoghegan wrote: >> It's not clear to me that this is better. Why do we need to use a >> function that is clearly not the preferred API for this ("col" vs "loc") >> just to get more entries? > My argument for doing this is very simple: ICU/CLDR/BCP 47 provides > stability guarantees for locales, not collations [1]. For example, as > we discussed, de_BE didn't actually go away -- it just stopped being a > distinct collation within ICU, for reasons that are implementation > defined. > > I admit that there are arguments against this, but by far the most > important consideration should be the stability of the names used for > pg_collation entries created during initdb. One argument I can think of is that it is confusing right now. Why is there de-AT but no de-DE? I know why, but users in "DE" might not and would be really confused. So I concede that adding the full set would be useful. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Fri, Aug 18, 2017 at 7:47 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 8/17/17 21:22, Peter Geoghegan wrote: >>> It's not clear to me that this is better. Why do we need to use a >>> function that is clearly not the preferred API for this ("col" vs "loc") >>> just to get more entries? >> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides >> stability guarantees for locales, not collations [1]. For example, as >> we discussed, de_BE didn't actually go away -- it just stopped being a >> distinct collation within ICU, for reasons that are implementation >> defined. >> >> I admit that there are arguments against this, but by far the most >> important consideration should be the stability of the names used for >> pg_collation entries created during initdb. > > One argument I can think of is that it is confusing right now. Why is > there de-AT but no de-DE? I know why, but users in "DE" might not and > would be really confused. So I concede that adding the full set would > be useful. Actually, there would be a de-DE. Just like with de-BE, that's ensured by using locales rather than collations. See my self-contained test program, or even the output I posted earlier. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE andwork_mem values
From
Peter Geoghegan
Date:
On Fri, Aug 18, 2017 at 7:55 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> One argument I can think of is that it is confusing right now. Why is >> there de-AT but no de-DE? I know why, but users in "DE" might not and >> would be really confused. So I concede that adding the full set would >> be useful. > > Actually, there would be a de-DE. Just like with de-BE, that's ensured > by using locales rather than collations. See my self-contained test > program, or even the output I posted earlier. I just realized that I misunderstood you here. Anyway, I'm glad that we're in agreement that adding every locale at initdb time would be the most useful behavior. -- Peter Geoghegan -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs