Re: Implementation of SASLprep for SCRAM-SHA-256 - Mailing list pgsql-hackers

From: Michael Paquier
Subject: Re: Implementation of SASLprep for SCRAM-SHA-256
Date:
Msg-id: CAB7nPqQqKih8rY6jZFgwtTh6dmbj7B9d0b48=60xPtu+5WqbVw@mail.gmail.com
In response to: Re: Implementation of SASLprep for SCRAM-SHA-256 (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Implementation of SASLprep for SCRAM-SHA-256 (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers

On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so far.
> Attached is a new patch version, quite heavily modified. Notable changes so
> far:

Great, thanks!

> * Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit ints.
> IMHO this makes the tables easier to read (to a human), and they are also
> packed slightly more tightly (see next two points), as you can fit more
> codepoints in a 16-bit integer.

Using codepoints directly is not very consistent with the existing
backend, but for the sake of tighter packing, OK.
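
To make the packing argument concrete (a hypothetical illustration,
not code from the patch): a codepoint like U+1E00 fits in a uint16,
while its three UTF-8 bytes would force a uint32 when packed together:

/* U+1E00 LATIN CAPITAL LETTER A WITH RING BELOW */
uint16_t as_codepoint  = 0x1E00;    /* the codepoint fits in 16 bits */
uint32_t as_utf8_bytes = 0xE1B880;  /* packed bytes 0xE1 0xB8 0x80 need 24 bits */

(uint16_t/uint32_t from <stdint.h>; PostgreSQL itself would spell
these uint16/uint32.)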

> * Reorganize the decomposition table. Instead of a separate binary-searched
> table for each "size", have one table that's ordered by codepoint, which
> contains a length and offset into another array, where all the decomposed
> sequences are. This makes the recomposition step somewhat slower, as it now
> has to grovel through an array that contains all decompositions, rather than
> just the array that contains decompositions of two characters. But this
> isn't performance-critical, readability and tight packing of the tables
> matter more.

Okay, no objection to that.
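
For readers following along, here is a minimal sketch of such a layout
(all names are mine, not necessarily the patch's): one binary-searched
table keyed by codepoint, each entry holding a length and an offset
into a single flat array of decomposed codepoints.

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    codepoint;      /* codepoint to decompose */
    uint8_t     dec_len;        /* number of codepoints in the decomposition */
    uint16_t    dec_offset;     /* index into decomp_values[] */
} DecompEntry;

/* All decomposed sequences, concatenated into one flat array. */
static const uint32_t decomp_values[] =
{
    0x0041, 0x0300,             /* U+00C0 -> A + COMBINING GRAVE ACCENT */
    0x0041, 0x0301,             /* U+00C1 -> A + COMBINING ACUTE ACCENT */
};

/* Ordered by codepoint, so it can be binary-searched. */
static const DecompEntry decomp_table[] =
{
    {0x00C0, 2, 0},
    {0x00C1, 2, 2},
};

static int
decomp_cmp(const void *key, const void *entry)
{
    uint32_t    cp = *(const uint32_t *) key;
    const DecompEntry *e = (const DecompEntry *) entry;

    return (cp > e->codepoint) - (cp < e->codepoint);
}

/* Returns NULL if the codepoint has no decomposition. */
static const DecompEntry *
get_decomposition(uint32_t cp)
{
    return bsearch(&cp, decomp_table,
                   sizeof(decomp_table) / sizeof(decomp_table[0]),
                   sizeof(DecompEntry), decomp_cmp);
}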

> * In the lookup table, store singleton decompositions that decompose to a
> single character, and the character fits in a uint16, directly in the main
> lookup table, instead of storing an index into the other table. This makes
> the tables somewhat smaller, as there are a lot of such decompositions.

Indeed.
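
A rough sketch of how that can be encoded, building on the DecompEntry
above (the flag name is hypothetical): steal a bit from the length
byte so the 16-bit payload holds either an inline codepoint or an
offset.

#define DECOMP_INLINE   0x80    /* flag bit stolen from dec_len */

/* Return the decomposed sequence of entry e, and its length in *len. */
static const uint32_t *
decomp_sequence(const DecompEntry *e, uint32_t *scratch, int *len)
{
    if (e->dec_len & DECOMP_INLINE)
    {
        /* singleton codepoint that fits in 16 bits, stored inline */
        *scratch = e->dec_offset;   /* payload reused as the codepoint */
        *len = 1;
        return scratch;
    }
    *len = e->dec_len;
    return decomp_values + e->dec_offset;   /* decomp_values[] from above */
}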

> * Use uint32 instead of pg_wchar to represent unicode codepoints. pg_wchar
> suggests something that holds multi-byte characters in the OS locale, used
> by functions like wcscmp, so avoid that. I added a "typedef uint32
> Codepoint" alias, though, to make it more clear which variables hold Unicode
> codepoints rather than e.g. indexes into arrays.
>
> * Unicode consortium publishes a comprehensive normalization test suite, as
> a text file, NormalizationTest.txt. I wrote a small perl and C program to
> run all the tests from that test suite, with the normalization function.
> That uncovered a number of bugs in the recomposition algorithm, which I then
> fixed:

I had been looking for something like that for some time, thanks! It's
actually here:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt
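
For anyone who wants to reproduce this: each test line in that file
holds five semicolon-separated fields of space-separated hex
codepoints (source; NFC; NFD; NFKC; NFKD), and the NFKC invariant is
c4 == toNFKC(c1..c5). A rough C sketch of parsing one field (the
actual harness is a perl + C program, per above):

#include <stdint.h>
#include <stdlib.h>

/* Parse one semicolon-terminated field of space-separated hex codepoints. */
static int
parse_field(const char *str, uint32_t *cp, int max)
{
    int         n = 0;
    char       *end;

    while (*str && *str != ';' && *str != '#' && n < max)
    {
        cp[n++] = (uint32_t) strtoul(str, &end, 16);
        str = end;
        while (*str == ' ')
            str++;
    }
    return n;
}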

>  * decompositions that map to ascii characters cannot be excluded.
>  * non-canonical and non-starter decompositions need to be excluded. They
> are now flagged in the lookup table, so that we only use them during the
> decomposition phase, not when recomposing.

Okay on that.
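
In other words, recomposition just has to test that flag before
reusing an entry; something like this (flag name hypothetical,
DecompEntry as sketched above):

#include <stdbool.h>

#define DECOMP_NO_COMPOSE   0x40    /* non-canonical or non-starter */

/*
 * Decomposition uses the entry unconditionally; recomposition skips
 * any entry carrying the flag.
 */
static bool
can_recompose(const DecompEntry *e)
{
    return (e->dec_len & DECOMP_NO_COMPOSE) == 0;
}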

> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized, I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.

pg_utf8_islegal() and pg_utf_mblen() should be moved into their own
file as well, I think, and wchar.c can then use that.

> * The list of characters excluded from recomposition is currently hard-coded
> in utf_norm_generate.pl. However, that list is available in machine-readable
> format, in file CompositionExclusions.txt. Since we're reading most of the
> data from UnicodeData.txt, would be good to read the exclusion table from a
> file, too.

Ouch. Those are present here...
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
It definitely makes more sense to read them from a file.
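
The file format is simple enough; entries look like
"0958 # DEVANAGARI LETTER QA", with '#' starting a comment. A quick
sketch of the parse (in C for illustration; the real generator script
is Perl):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Collect the excluded codepoints; returns how many were read. */
static int
read_exclusions(FILE *fp, uint32_t *excl, int max)
{
    char        line[256];
    int         n = 0;

    while (fgets(line, sizeof(line), fp) && n < max)
    {
        if (line[0] == '#' || line[0] == '\n')
            continue;           /* comment or blank line */
        excl[n++] = (uint32_t) strtoul(line, NULL, 16);
    }
    return n;
}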

> * SASLPrep specifies normalization form KC, but it also specifies that some
> characters are mapped to space or nothing. Should do those mappings, too.

Ah, right. Those are here:
https://tools.ietf.org/html/rfc3454#appendix-B.1
The conversion tables need some extra tweaking...
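
Concretely, table B.1 maps characters such as U+00AD SOFT HYPHEN to
nothing, and table C.1.2 maps non-ASCII spaces such as U+00A0 to
U+0020. A hedged sketch of applying those mappings before
normalization (table contents abbreviated, helper name mine):

#include <stdbool.h>
#include <stdint.h>

/* A few entries from RFC 3454 tables B.1 and C.1.2, for illustration only. */
static const uint32_t map_to_nothing[] = {0x00AD, 0x200B, 0xFEFF};
static const uint32_t map_to_space[] = {0x00A0, 0x2000, 0x3000};

/* Apply the SASLprep mappings in place; returns the new length. */
static int
saslprep_map(uint32_t *input, int len)
{
    int         i, j, out = 0;

    for (i = 0; i < len; i++)
    {
        bool        drop = false;
        uint32_t    cp = input[i];

        for (j = 0; j < (int) (sizeof(map_to_nothing) / sizeof(uint32_t)); j++)
            if (cp == map_to_nothing[j])
                drop = true;
        for (j = 0; j < (int) (sizeof(map_to_space) / sizeof(uint32_t)); j++)
            if (cp == map_to_space[j])
                cp = 0x0020;
        if (!drop)
            input[out++] = cp;
    }
    return out;
}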

+       else if ((*utf & 0xf0) == 0xe0)
+       {
+           if (utf[1] == 0 || utf[2] == 0)
+               goto invalid;
+           cp = (*utf++) & 0x0F;
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+       }
+       else if ((*utf & 0xf8) == 0xf0)
+       {
+           if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
+               goto invalid;
+           cp = (*utf++) & 0x07;
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+       }
I would find something like pg_utf2wchar_with_len() more readable, but well...
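
For comparison, assuming the usual wchar.c signature (const unsigned
char *from, pg_wchar *to, int len) and that the function is reachable
from this code, the call would look roughly like this (not tested
against the patch, just to show the shape):

pg_wchar    buf[128];           /* 128 is an arbitrary size for this sketch */
int         n = pg_utf2wchar_with_len(utf, buf, strlen((const char *) utf));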

Some typos:
s/fore/for/
s/reprsented/represented/

You seem to be fully on top of it... I can help out with any of those items as needed.
-- 
Michael


