Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date
Msg-id 789A2F56-0E42-409D-A840-6AF5110D6085@pingpong.net
Whole thread Raw
In response to Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)  (Palle Girgensohn <girgen@pingpong.net>)
Responses Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
<br class="" /><blockquote class="" type="cite">11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <<a class=""
href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>:<brclass="" /><br class="" /><blockquote class=""
style="font-family:Menlo-Regular; font-size: 13px; font-style: normal; font-variant-caps: normal; font-weight: normal;
letter-spacing:normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal;
widows:auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" type="cite"><br class="" />11 aug. 2016 kl. 03:05
skrevPeter Geoghegan <<a class="" href="mailto:pg@heroku.com">pg@heroku.com</a>>:<br class="" /><br class="" />On
Wed,Aug 10, 2016 at 1:42 PM, Palle Girgensohn <<a class=""
href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>wrote:<br class="" /><blockquote class=""
type="cite">They'vebeen used for the FreeBSD ports since 2005, and have served us well. I have of course updated them
regularly.In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how
totest them, but also, I see little point in supporting anything else using ICU.<br class="" /></blockquote><br
class=""/>Looks like you're not using the ICU equivalent of strxfrm(). While 9.5<br class="" />is not the release that
introducedits use, it did expand it<br class="" />significantly. I think you need to fix this, even though it isn't<br
class=""/>actually used to sort text at present, since presumably FreeBSD builds<br class="" />of 9.5 don't
TRUST_STRXFRM.Since you're using ICU, though, you could<br class="" />reasonably trust the ICU equivalent of strxfrm(),
sothat's a missed<br class="" />opportunity. (See the wiki page on the abbreviated keys issue [1] if<br class="" />you
don'tknow what I'm talking about.)<br class="" /></blockquote><br class="" />My plan was to get it working without
TRUST_STRXFRMfirst, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I
willhave to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the
planis to use that code base.<br class="" /></blockquote><div class=""><br class="" /></div><div class="">Here are some
preliminaryresults running 10000 times comparing the same two strings in a tight loop.</div><br class="" /><div
class="">             ucol_strcollUTF8: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.002448<br
class=""/>                       strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.060711<br
class=""/>              ucol_strcollIter: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.009221<br
class=""/>                 direct memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.000457<br
class=""/>                 memcpy memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.001706<br
class=""/>                memcpy strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.068425<br
class=""/>               nextSortKeyPart: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.041011<br
class=""/>    ucnv_toUChars + getSortKey: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.050379<br
class=""/><br class="" /></div><div class=""><br class="" /></div><div class="">correct answer is -1, but since we
compareåasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times
fasterthan ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with
ICU, nextSortKeyPart,is way slower.</div><div class=""><br class="" /></div><div class=""><br class="" /></div><div
class=""><brclass="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>startTime =
getRealTime();<brclass="" /><span class="Apple-tab-span" style="white-space:pre"> </span>for ( int i = 0; i < loop;
i++){<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = ucol_strcollUTF8(collator,
arg1,len1, arg2, len2, &status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br
class=""/><span class="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span
class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime -
startTime);<brclass="" /><br class="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><span
class=""style="white-space: normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span"
style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div
class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space:
normal;">vs</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class=""
style="white-space:normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span"
style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div
class=""></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="Apple-tab-span"
style="white-space:pre;"> </span>int sortkeysize=8;<br class="" /><br class="" /><span class="" style="white-space:
normal;"><spanclass="Apple-tab-span" style="white-space:pre"> </span>startTime = getRealTime();</span></span></div><div
class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space: normal;"><span
class="Apple-tab-span"style="white-space:pre"> </span>uint8_t key1[sortkeysize],
key2[sortkeysize];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class=""
style="white-space:normal;"><span class="Apple-tab-span" style="white-space:pre"> </span>uint32_t sState[2],
tState[2];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space: pre;"> </span>UCharIterator
sIter,tIter; </div><div class=""><span class="Apple-tab-span" style="white-space:pre"><br class="" /></span>for ( int i
=0; i < loop; i++) {</div><div class=""><span class="Apple-tab-span" style="white-space:pre">
</span>uiter_setUTF8(&sIter,arg1, len1);<br class="" /><span class="Apple-tab-span" style="white-space:pre">
</span>uiter_setUTF8(&tIter,arg2, len2);<br class="" /><span class="Apple-tab-span" style="white-space:pre">
</span>sState[0]= 0; sState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre">
</span>tState[0]= 0; tState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre">
</span>ucol_nextSortKeyPart(collator,&sIter, sState, key1, sortkeysize, &status);<br class="" /><span
class="Apple-tab-span"style="white-space:pre"> </span>ucol_nextSortKeyPart(collator, &tIter, tState, key2,
sortkeysize,&status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = memcmp
(key1,key2, sortkeysize);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br class=""
/><spanclass="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span
class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime -
startTime);<brclass="" /><br class="" /></div><div class=""><br class="" /></div><div class=""><br class=""
/></div><divclass="">But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I
remembercorrectly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I
guess...</div><divclass=""><br class="" /></div><div class="">Palle</div><div class=""><br class="" /></div><div
class=""><brclass="" /></div><div class=""><br class="" /></div> 

pgsql-hackers by date:

Previous
From: Shay Rojansky
Date:
Subject: Re: Slowness of extended protocol
Next
From: Stas Kelvich
Date:
Subject: Re: Logical Replication WIP