Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: B-Tree support function number 3 (strxfrm() optimization)
Date
Msg-id CAM3SWZSMA5moFmb0xgr3x-1dOoFtyCXiXDBiH2vBp=eFz_Mtow@mail.gmail.com
Whole thread Raw
In response to Re: B-Tree support function number 3 (strxfrm() optimization)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Robert pointed out a case where varying character case of an English
> word did not alter the primary level bytes (and thus the poor man's
> normalized key was the same). He acknowledged that most of the entropy
> of the first 8 bytes of the string went into the first 8 bytes of the
> blob/key. This can actually be very useful to the optimization in some
> cases. In particular, with most Latin alphabets you'll see the same
> pattern when diacritics are used. This means that even though the
> original string has (say) an accented character that would take 2
> bytes to store in UTF-8, the weight in the primary level is the same
> as an equivalent unaccented character (and so only takes one byte to
> store at that level, with differences only in subsequent levels).
> Whole strxfrm() blobs are the same length regardless of how many
> accents appear in otherwise equivalent original Latin string, and you
> get exactly the same high concentrations of entropy in the first 8
> bytes in pretty much all Latin alphabets (the *absence* of diacritics
> is stored in later weight levels too, even with the "en_US.UTF-8"
> collation).

There are many other interesting cases where en_US.UTF-8, and
presumably all other collations concentrate much more entropy into
leading bytes than might be expected. Consider:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc"
"abc" -> 0c0d0e0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc"
"# abc" -> 0c0d0e01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** abc"
"***** abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes)

and, to show you what this looks like when the primary
weights/original codepoints appear backwards:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba"
"cba" -> 0e0d0c0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba"
"# cba" -> 0e0d0c01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** cba"
"***** cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes)

Evidently, the implementation always places primary weights first,
corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and
later "\0e\0d\0c") - no matter how many "extraneous" characters are
placed in front. They're handled later. Space don't appear in the
primary weight level at all:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c"
"a b c" -> 0c0d0e01090909010909090102350235 (16 bytes)

Lots of punctuation-type characters will not affect the primary weight level:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%@!()\/#-+,:^~? a b c"
"%@!()\/#-+,:^~? a b c" ->
0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235
(48 bytes)

Some non-alphabetic ASCII characters will affect the primary level,
though. For example:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c"
"1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes)

There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e",
a primary weight "\03" corresponding to '1'.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Larry White
Date:
Subject: Re: jsonb format is pessimal for toast compression
Next
From: Tom Lane
Date:
Subject: Re: Defining a foreign key with a duplicate column is broken