
From: Heikki Linnakangas
Subject: Re: Implementation of SASLprep for SCRAM-SHA-256
Msg-id: 4e9eb400-0827-dbd6-8e48-661b31f2be5e@iki.fi
In response to: Re: Implementation of SASLprep for SCRAM-SHA-256 (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Implementation of SASLprep for SCRAM-SHA-256 (Michael Paquier <michael.paquier@gmail.com>)
List: pgsql-hackers
On 04/04/2017 01:52 PM, Heikki Linnakangas wrote:
> On 03/31/2017 10:10 AM, Michael Paquier wrote:
>> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>> I kinda hope Heikki is going to step up to the plate here, because I
>>> think he understands most of it already.
>>
>> The second person who knows a good deal about NFKC is Ishii-san.
>>
>> Attached is a rebased patch.
>
> Thanks, I'm looking at this now.

I will continue tomorrow, but I wanted to report on what I've done so 
far. Attached is a new patch version, quite heavily modified. Notable 
changes so far:

* Use Unicode codepoints, rather than UTF-8 bytes packed into 32-bit 
ints. IMHO this makes the tables easier for a human to read, and they 
also pack slightly more tightly (see next two points), as more 
codepoints fit in a 16-bit integer.

* Reorganize the decomposition table. Instead of a separate 
binary-searched table for each "size", have one table that's ordered by 
codepoint, which contains a length and offset into another array, where 
all the decomposed sequences are. This makes the recomposition step 
somewhat slower, as it now has to grovel through an array that contains 
all decompositions, rather than just the array that contains 
decompositions of two characters. But this isn't performance-critical, 
readability and tight packing of the tables matter more.
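Roughly, the layout is now something like this (a sketch; the actual 
field names and widths in the patch may differ):

    /* One entry per decomposable codepoint, ordered by codepoint,
     * so the table can be binary-searched.  Sketch only. */
    typedef struct
    {
        uint32  codepoint;   /* Unicode codepoint */
        uint8   dec_size;    /* number of codepoints in the decomposition */
        uint16  dec_index;   /* offset into UnicodeDecomp_codepoints[] */
    } pg_unicode_decomposition;

    /* All decomposed sequences, concatenated into one array. */
    static const uint32 UnicodeDecomp_codepoints[] = { /* ... */ };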

* In the lookup table, store singleton decompositions, i.e. those that 
decompose to a single character that fits in a uint16, directly in the 
entry, instead of an index into the other table. This makes the tables 
somewhat smaller, as there are a lot of such decompositions.
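Concretely, that can be done with a flag bit, continuing the sketch 
above: when the flag is set, dec_index holds the replacement character 
itself rather than an offset (again, names are illustrative):

    #define DECOMP_INLINE  0x80    /* dec_index is the codepoint itself */

    static uint32
    first_decomposed_char(const pg_unicode_decomposition *entry)
    {
        if (entry->dec_size & DECOMP_INLINE)
            return entry->dec_index;    /* singleton stored inline */
        return UnicodeDecomp_codepoints[entry->dec_index];
    }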

* Use uint32 instead of pg_wchar to represent unicode codepoints. 
pg_wchar suggests something that holds multi-byte characters in the OS 
locale, used by functions like wcscmp, so avoid that. I added a "typedef 
uint32 Codepoint" alias, though, to make it more clear which variables 
hold Unicode codepoints rather than e.g. indexes into arrays.
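In other words:

    typedef uint32 Codepoint;   /* a Unicode codepoint, as opposed to an
                                 * array index or a pg_wchar */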

* The Unicode consortium publishes a comprehensive normalization test 
suite, as a text file, NormalizationTest.txt. I wrote a small Perl and 
C program to run all the tests from that suite against the 
normalization function. That uncovered a number of bugs in the 
recomposition algorithm, which I then fixed:

  * Decompositions that map to ASCII characters cannot be excluded.
  * Non-canonical and non-starter decompositions need to be excluded. 
They are now flagged in the lookup table, so that we only use them 
during the decomposition phase, not when recomposing.
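For reference, each line of NormalizationTest.txt has five 
semicolon-separated columns of hex codepoint sequences: source, NFC, 
NFD, NFKC and NFKD. For example, a line like:

    0044 0307;1E0A;0044 0307;1E0A;0044 0307;

i.e. <D, COMBINING DOT ABOVE> composes to U+1E0A. For our purposes, the 
harness checks that normalizing column 1 yields column 4.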


TODOs / Discussion points:

* I'm not sure I like the way the code is organized; it feels like 
we're littering src/common with these files. Perhaps we should add a 
src/common/unicode_normalization directory for all of this.

* The list of characters excluded from recomposition is currently 
hard-coded in utf_norm_generate.pl. However, that list is available in 
machine-readable form, in the file CompositionExclusions.txt. Since 
we're reading most of the data from UnicodeData.txt, it would be good 
to read the exclusion table from a file, too.
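The format there is simple: each non-comment line is a single hex 
codepoint, e.g.:

    0958    #  DEVANAGARI LETTER QA

so the generator could pick these up with the same kind of line-by-line 
parsing it already does for UnicodeData.txt.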

* SASLprep specifies normalization form KC, but it also specifies that 
some characters are mapped to space or to nothing. We should do those 
mappings, too.
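Per RFC 4013 section 2.1 (with RFC 3454's tables), non-ASCII space 
characters map to U+0020, and "commonly mapped to nothing" characters 
are dropped. A minimal sketch of that mapping step, applied before the 
NFKC normalization, with just two sample codepoints (the real tables 
are longer):

    /* Sketch only: map one input codepoint per the SASLprep mapping
     * step.  Returns the number of output codepoints (0 or 1). */
    static int
    saslprep_map(uint32 cp, uint32 *out)
    {
        if (cp == 0x00A0)       /* NO-BREAK SPACE: a non-ASCII space (C.1.2) */
        {
            *out = 0x0020;      /* map to SPACE */
            return 1;
        }
        if (cp == 0x00AD)       /* SOFT HYPHEN: mapped to nothing (B.1) */
            return 0;
        *out = cp;              /* everything else passes through */
        return 1;
    }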

- Heikki
