Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers

From Palle Girgensohn
Subject Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date
Msg-id 789A2F56-0E42-409D-A840-6AF5110D6085@pingpong.net
Whole thread Raw
In response to Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)  (Palle Girgensohn <girgen@pingpong.net>)
Responses Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
List pgsql-hackers

11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <girgen@pingpong.net>:


11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg@heroku.com>:

On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code base.

Here are some preliminary results running 10000 times comparing the same two strings in a tight loop.

              ucol_strcollUTF8: -1 0.002448
                       strcoll: 1 0.060711
              ucol_strcollIter: -1 0.009221
                 direct memcmp: 1 0.000457
                 memcpy memcmp: 1 0.001706
                memcpy strcoll: 1 0.068425
               nextSortKeyPart: -1 0.041011
    ucnv_toUChars + getSortKey: -1 0.050379


correct answer is -1, but since we compare åasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times faster than ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart, is way slower.



startTime = getRealTime();
for ( int i = 0; i < loop; i++) {
result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);



vs


int sortkeysize=8;

startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter; 

for ( int i = 0; i < loop; i++) {
uiter_setUTF8(&sIter, arg1, len1);
uiter_setUTF8(&tIter, arg2, len2);
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
result = memcmp (key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);



But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remember correctly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess...

Palle



pgsql-hackers by date:

Previous
From: Shay Rojansky
Date:
Subject: Re: Slowness of extended protocol
Next
From: Stas Kelvich
Date:
Subject: Re: Logical Replication WIP