Thread: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

[HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
Hi all,

Here is another thread for an issue related to SCRAM
(https://www.postgresql.org/message-id/243d8c11-6149-a4bb-0909-136992f74b23@iki.fi),
that can be discussed independently: SASLprep.

RFC5802 (https://tools.ietf.org/html/rfc5802), which covers the
implementation of SCRAM, requires passwords to be normalized as per
RFC4013 (https://tools.ietf.org/html/rfc4013) using SASLprep, which is
essentially NFKC. I have previously described what happens in this case
here:
https://www.postgresql.org/message-id/CAB7nPqScgwh6Eu4=c-0L7-cufZrU5rULTSdpMOOWiz1YFvGE6w@mail.gmail.com
This way, we can be sure that two UTF-8 strings are considered
equivalent in a SASL exchange; in our case we care about the password
string (we should care about the username as well). Without SASLprep,
our implementation of SCRAM may fail with other third-party tools if
passwords are not made only of ASCII characters. And that sucks.

In short, NFKC consists of two steps: a decomposition of the UTF-8
string (the string is decomposed and then reordered using each
character's combining class), followed by its recomposition. For
example, "Å" can be represented either as the single codepoint U+00C5
or as "A" followed by U+030A (combining ring above); after
normalization both forms become the same sequence, so they yield the
same password.

Postgres does not have any encoding-related logic on the frontend
side, but some of the APIs in wchar.c make it possible to check whether
a given string is valid UTF-8. If the string is not valid UTF-8, we
should just use the string as-is in the exchange, which leads to
undefined behavior anyway, as SASL expects UTF-8.
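
For illustration, here is a rough sketch of that check using the
existing wchar.c primitives (this is not part of the patch, just to
show the idea; it assumes the declarations from mb/pg_wchar.h and
<string.h>):

static bool
string_is_valid_utf8(const char *str)
{
    const unsigned char *p = (const unsigned char *) str;
    int         remaining = strlen(str);

    while (remaining > 0)
    {
        int         mblen = pg_utf_mblen(p);

        /* reject truncated or illegal multibyte sequences */
        if (mblen > remaining || !pg_utf8_islegal(p, mblen))
            return false;
        p += mblen;
        remaining -= mblen;
    }
    return true;
}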

I have also implemented a Postgres extension able to do this
operation, which has been useful for testing:
https://github.com/michaelpq/pg_plugins/tree/master/pg_sasl_prepare

Attached is a patch implementing that; the conversion tables are built
from UnicodeData.txt so as to be minimal in size. I have added an Open
Item on the wiki for this problem as well.

Thanks,
-- 
Michael


Attachment

Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> RFC5802 (https://tools.ietf.org/html/rfc5802), which covers the
> implementation of SCRAM, requires passwords to be normalized as per
> RFC4013 (https://tools.ietf.org/html/rfc4013) using SASLprep, which is
> essentially NFKC. I have previously described what happens in this case
> here:
> https://www.postgresql.org/message-id/CAB7nPqScgwh6Eu4=c-0L7-cufZrU5rULTSdpMOOWiz1YFvGE6w@mail.gmail.com
> This way, we can be sure that two UTF-8 strings are considered
> equivalent in a SASL exchange; in our case we care about the password
> string (we should care about the username as well). Without SASLprep,
> our implementation of SCRAM may fail with other third-party tools if
> passwords are not made only of ASCII characters. And that sucks.

Agreed.  I am not sure this quite rises to the level of a stop-ship
issue; we could document the restriction.  However, that's pretty
ugly; it would be a whole lot better to get this fixed.

I kinda hope Heikki is going to step up to the plate here, because I
think he understands most of it already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> This way, we can be sure that two UTF-8 strings are considered
>> equivalent in a SASL exchange; in our case we care about the password
>> string (we should care about the username as well). Without SASLprep,
>> our implementation of SCRAM may fail with other third-party tools if
>> passwords are not made only of ASCII characters. And that sucks.
>
> Agreed.  I am not sure this quite rises to the level of a stop-ship
> issue; we could document the restriction.

I am not sure about that; what we have now on HEAD is still useful and
better than MD5.

> However, that's pretty ugly; it would be a whole lot better to get this fixed.

Agreed.

> I kinda hope Heikki is going to step up to the plate here, because I
> think he understands most of it already.

The second person who knows a good deal about NFKC is Ishii-san.

Attached is a rebased patch.
-- 
Michael

Attachment

Re: Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 03/31/2017 10:10 AM, Michael Paquier wrote:
> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> I kinda hope Heikki is going to step up to the plate here, because I
>> think he understands most of it already.
>
> The second person who knows a good deal about NFKC is Ishii-san.
>
> Attached is a rebased patch.

Thanks, I'm looking at this now.

- Heikki




Re: Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 04/04/2017 01:52 PM, Heikki Linnakangas wrote:
> On 03/31/2017 10:10 AM, Michael Paquier wrote:
>> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>> I kinda hope Heikki is going to step up to the plate here, because I
>>> think he understands most of it already.
>>
>> The second person who knows a good deal about NFKC is Ishii-san.
>>
>> Attached is a rebased patch.
>
> Thanks, I'm looking at this now.

I will continue tomorrow, but I wanted to report on what I've done so 
far. Attached is a new patch version, quite heavily modified. Notable 
changes so far:

* Use Unicode codepoints, rather than UTF-8 bytes packed into 32-bit 
ints. IMHO this makes the tables easier to read (to a human), and they 
are also packed slightly more tightly (see next two points), as you can 
fit more codepoints in a 16-bit integer.

* Reorganize the decomposition table. Instead of a separate 
binary-searched table for each "size", have one table that's ordered by 
codepoint, which contains a length and offset into another array, where 
all the decomposed sequences are. This makes the recomposition step 
somewhat slower, as it now has to grovel through an array that contains 
all decompositions, rather than just the array that contains 
decompositions of two characters. But this isn't performance-critical, 
readability and tight packing of the tables matter more.

* In the lookup table, store singleton decompositions, i.e. those that 
decompose to a single character that fits in a uint16, directly in the 
main lookup table, instead of storing an index into the other table. 
This makes the tables somewhat smaller, as there are a lot of such 
decompositions.
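
To illustrate the resulting layout (the names and exact packing below 
are made up for illustration, not what the patch actually uses; 
uint8/uint16/uint32 are the usual Postgres typedefs):

#define DECOMP_INLINE   0x80    /* dec_index holds the codepoint itself */

typedef struct
{
    uint32      codepoint;      /* codepoint to be decomposed */
    uint8       dec_size_flags; /* length of the decomposition, plus flags */
    uint16      dec_index;      /* index into decomp_sequences[], or, for
                                 * inline singletons, the decomposed
                                 * codepoint itself */
} DecompositionEntry;

/* one entry per decomposable codepoint, ordered for binary search */
static const DecompositionEntry decomp_table[] =
{
    {0x00C0, 2, 0},                         /* U+00C0 -> U+0041 U+0300 */
    {0x2126, 1 | DECOMP_INLINE, 0x03A9},    /* OHM SIGN -> GREEK OMEGA */
};

/* the decomposed sequences themselves, all concatenated */
static const uint32 decomp_sequences[] =
{
    0x0041, 0x0300,             /* decomposition of U+00C0 starts at offset 0 */
};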

* Use uint32 instead of pg_wchar to represent unicode codepoints. 
pg_wchar suggests something that holds multi-byte characters in the OS 
locale, used by functions like wcscmp, so avoid that. I added a "typedef 
uint32 Codepoint" alias, though, to make it more clear which variables 
hold Unicode codepoints rather than e.g. indexes into arrays.

* The Unicode consortium publishes a comprehensive normalization test 
suite as a text file, NormalizationTest.txt. I wrote a small Perl script 
and a C program to run all the tests from that test suite against the 
normalization function (a sketch of the per-line check follows the list 
below). That uncovered a number of bugs in the recomposition algorithm, 
which I then fixed:

  * decompositions that map to ASCII characters cannot be excluded.
  * non-canonical and non-starter decompositions need to be excluded. 
They are now flagged in the lookup table, so that we only use them 
during the decomposition phase, not when recomposing.
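
For reference, the per-line check for NormalizationTest.txt boils down 
to the following (c1..c5 are the five parsed columns as zero-terminated 
codepoint strings; normalize() stands in for the normalization entry 
point, so this is a sketch rather than the actual test program):

static int
cpstrcmp(const uint32 *a, const uint32 *b)
{
    while (*a && *a == *b)
    {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

static bool
nfkc_line_ok(const uint32 *cols[5])
{
    int         i;

    /* invariant from the file header: c4 == NFKC(c1) == ... == NFKC(c5) */
    for (i = 0; i < 5; i++)
    {
        if (cpstrcmp(normalize(cols[i]), cols[3]) != 0)
            return false;
    }
    return true;
}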


TODOs / Discussion points:

* I'm not sure I like the way the code is organized; I feel that we're 
littering src/common with these files. Perhaps we should add a 
src/common/unicode_normalization directory for all of this.

* The list of characters excluded from recomposition is currently 
hard-coded in utf_norm_generate.pl. However, that list is available in 
machine-readable format, in the file CompositionExclusions.txt. Since 
we're reading most of the data from UnicodeData.txt, it would be good to 
read the exclusion table from a file, too.

* SASLPrep specifies normalization form KC, but it also specifies that 
some characters are mapped to space or nothing. We should do those 
mappings, too.
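
That mapping step (RFC 3454 tables C.1.2 and B.1) would look roughly 
like this, assuming codepoint range tables and a lookup helper in the 
style of the rest of the patch; the names here are only illustrative:

static int
saslprep_map(const uint32 *input, int inlen, uint32 *output)
{
    int         outlen = 0;
    int         i;

    for (i = 0; i < inlen; i++)
    {
        if (is_code_in_range(input[i], non_ascii_space_ranges))
            output[outlen++] = 0x0020;  /* C.1.2: map to SPACE */
        else if (is_code_in_range(input[i], map_to_nothing_ranges))
            continue;                   /* B.1: simply dropped */
        else
            output[outlen++] = input[i];
    }
    return outlen;      /* NFKC and the prohibited-output checks follow */
}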

- Heikki

Attachment

Re: Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:

On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so far.
> Attached is a new patch version, quite heavily modified. Notable changes so
> far:

Great, thanks!

> * Use Unicode codepoints, rather than UTF-8 bytes packed in a 32-bit ints.
> IMHO this makes the tables easier to read (to a human), and they are also
> packed slightly more tightly (see next two points), as you can fit more
> codepoints in a 16-bit integer.

Using codepoints directly is not very consistent with the existing
backend, but for the sake of packing things more tightly, OK.

> * Reorganize the decomposition table. Instead of a separate binary-searched
> table for each "size", have one table that's ordered by codepoint, which
> contains a length and offset into another array, where all the decomposed
> sequences are. This makes the recomposition step somewhat slower, as it now
> has to grovel through an array that contains all decompositions, rather than
> just the array that contains decompositions of two characters. But this
> isn't performance-critical, readability and tight packing of the tables
> matter more.

Okay, no objection to that.

> * In the lookup table, store singleton decompositions that decompose to a
> single character, and the character fits in a uint16, directly in the main
> lookup table, instead of storing an index into the other table. This makes
> the tables somewhat smaller, as there are a lot of such decompositions.

Indeed.

> * Use uint32 instead of pg_wchar to represent unicode codepoints. pg_wchar
> suggests something that holds multi-byte characters in the OS locale, used
> by functions like wcscmp, so avoid that. I added a "typedef uint32
> Codepoint" alias, though, to make it more clear which variables hold Unicode
> codepoints rather than e.g. indexes into arrays.
>
> * Unicode consortium publishes a comprehensive normalization test suite, as
> a text file, NormalizationTest.txt. I wrote a small perl and C program to
> run all the tests from that test suite, with the normalization function.
> That uncovered a number of bugs in the recomposition algorithm, which I then
> fixed:

I had been looking for something like that for some time, thanks!
That's actually here:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt

>  * decompositions that map to ascii characters cannot be excluded.
>  * non-canonical and non-starter decompositions need to be excluded. They
> are now flagged in the lookup table, so that we only use them during the
> decomposition phase, not when recomposing.

Okay on that.

> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized, I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.

pg_utf8_islegal() and pg_utf_mblen() should be moved into their own
file as well, I think, and wchar.c can use that.

> * The list of characters excluded from recomposition is currently hard-coded
> in utf_norm_generate.pl. However, that list is available in machine-readable
> format, in file CompositionExclusions.txt. Since we're reading most of the
> data from UnicodeData.txt, would be good to read the exclusion table from a
> file, too.

Ouch. Those are present here...
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
It definitely makes more sense to read them from a file.

> * SASLPrep specifies normalization form KC, but it also specifies that some
> characters are mapped to space or nothing. Should do those mappings, too.

Ah, right. Those ones are here:
https://tools.ietf.org/html/rfc3454#appendix-B.1
The conversion tables need some extra tweaking...

+       else if ((*utf & 0xf0) == 0xe0)
+       {
+           if (utf[1] == 0 || utf[2] == 0)
+               goto invalid;
+           cp = (*utf++) & 0x0F;
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+       }
+       else if ((*utf & 0xf8) == 0xf0)
+       {
+           if (utf[1] == 0 || utf[2] == 0|| utf[3] == 0)
+               goto invalid;
+           cp = (*utf++) & 0x07;
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+           cp = (cp << 6) | ((*utf++) & 0x3F);
+       }
I find something like pg_utf2wchar_with_len() more readable, but well...

Some typos:
s/fore/for/
s/reprsented/represented/

You seem to be fully on it... I can help out with any of those items as needed.
-- 
Michael



Re: Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 04/05/2017 07:23 AM, Michael Paquier wrote:
> On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I will continue tomorrow, but I wanted to report on what I've done so far.
>> Attached is a new patch version, quite heavily modified. Notable changes so
>> far:
>
> Great, thanks!
>
>> * Use Unicode codepoints, rather than UTF-8 bytes packed in a 32-bit ints.
>> IMHO this makes the tables easier to read (to a human), and they are also
>> packed slightly more tightly (see next two points), as you can fit more
>> codepoints in a 16-bit integer.
>
> Using directly codepoints is not much consistent with the existing
> backend, but for the sake of packing things more, OK.

Oh, I see, we already have similar functions in wchar.c: 
unicode_to_utf8() and utf8_to_unicode(). We should probably move those 
to src/common, rather than reinvent the wheel.
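
For reference, this is roughly how the two fit together (using the 
signatures as they are in wchar.c today; just an illustration, not code 
from the patch):

unsigned char buf[4];
pg_wchar    cp;

/* round-trip a single codepoint through its UTF-8 encoding */
unicode_to_utf8(0x00E9, buf);       /* U+00E9 -> 0xC3 0xA9 */
cp = utf8_to_unicode(buf);          /* back to 0x00E9 */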

> pg_utf8_islegal() and pg_utf_mblen() should as well be moved in their
> own file I think, and wchar.c can use that.

Yeah..

>> * The list of characters excluded from recomposition is currently hard-coded
>> in utf_norm_generate.pl. However, that list is available in machine-readable
>> format, in file CompositionExclusions.txt. Since we're reading most of the
>> data from UnicodeData.txt, would be good to read the exclusion table from a
>> file, too.
>
> Ouch. Those are present here...
> http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
> Definitely it makes more sense to read them from a file.

Did that.

>> * SASLPrep specifies normalization form KC, but it also specifies that some
>> characters are mapped to space or nothing. Should do those mappings, too.
>
> Ah, right. Those ones are here:
> https://tools.ietf.org/html/rfc3454#appendix-B.1

Yep.


Attached is a new version. Notable changes since yesterday:

* Implemented the rest of SASLprep: mapping some characters to 
spaces, leaving out others, and checking for prohibited characters and 
bidirectional strings (a sketch of the bidi rule follows this list).

* Moved things around. There's now a separate directory, 
src/common/unicode, which contains the perl scripts and the test code. 
Those are not needed to build from source, as the pre-generated tables 
are put in src/include/common. Similar to the scripts in 
src/backend/utils/mb/Unicode, really.

* Renamed many things from utf_* to unicode_*, since they don't deal 
with UTF-8 input anymore.
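
The bidirectional check is the RFC 3454 section 6 rule: if the string 
contains any RandALCat character, it must not contain any LCat 
character, and both the first and last characters must be RandALCat. A 
rough sketch, with illustrative table and helper names (not the patch's 
actual code):

static bool
bidi_ok(const uint32 *str, int len)
{
    bool        has_randal = false;
    bool        has_l = false;
    int         i;

    for (i = 0; i < len; i++)
    {
        if (is_code_in_range(str[i], randal_cat_ranges))
            has_randal = true;
        else if (is_code_in_range(str[i], l_cat_ranges))
            has_l = true;
    }

    if (!has_randal)
        return true;
    return !has_l &&
        is_code_in_range(str[0], randal_cat_ranges) &&
        is_code_in_range(str[len - 1], randal_cat_ranges);
}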


This is starting to shape up, but there is still some cleanup work to 
do. I will continue tomorrow.

- Heikki


Attachment

Re: Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
On Thu, Apr 6, 2017 at 1:33 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Attached is a new version. Notable changes since yesterday:
>
> * Implemented the rest of the SASLPrep, mapping some characters to spaces,
> leaving out others, and checking for prohibited characters and bidirectional
> strings.
>
> * Moved things around. There's now a separate directory, src/common/unicode,
> which contains the perl scripts and the test code. Those are not needed to
> build from source, as the pre-generated tables are put in
> src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode,
> really.
>
> * Renamed many things from utf_* to unicode_*, since they don't deal with
> utf-8 input anymore.
>
> This is starting to shape up, but still some cleanup work to do. I will
> continue tomorrow..

Thanks for the new patch, that's looking nice. However, I was not able
to compile it, as saslprep.h is missing from what you have sent...

There is for example this portion in the new tables:
+static const Codepoint prohibited_output_chars[] =
+{
+   0xD800, 0xF8FF,             /* C.3, C.5 */
   ----- Start Table C.5 -----
   D800-DFFF; [SURROGATE CODES]
   ----- End Table C.5 -----
This indicates a range of values. Wouldn't it be better to split this
table in two, one for ranges of codepoints and another for single
entries?

+   0x1D173, 0x1D17A,           /* C.2.2 */
This is for musical symbols. It seems to me that checking for a range
is what is intended.
-- 
Michael



Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
Another version attached.

I think this is now in committable state, but there's been a lot of 
small changes here and there, so please have one more look over it if 
you have a chance. I'm planning to push this tomorrow, after sleeping on it.

The code-organization issue with the UTF-8 functions, pg_utf_mblen, 
pg_utf8_islegal, unicode_to_utf8, and utf8_to_unicode, is still kind of 
unresolved. The way I have it in this version is that the functions are 
in wchar.c, like they've always been, and saslprep.c uses them from 
there. This is ugly from a code-organization point of view, as we now 
have functions in src/common depending on functions in 
src/backend/utils/mb/wchar.c. I think wchar.c deserves some 
reorganization, as most of the functions there are also used in 
frontend code. Perhaps it should be moved to src/common as a whole.

Libpq already compiles with wchar.c, so all the functions that 
src/common/saslprep.c requires are already available in libpq. So I 
don't think this is urgent, but it is something we nevertheless ought to 
clean up at some point.


Another thing I'd like some more eyes on is how this will work with 
encodings other than UTF-8. We will now try to normalize the password as 
if it were UTF-8, even if it isn't. That's OK as long as we're 
consistent about it, but there is one worrisome scenario: what if the 
user's password consists mostly of characters that, when interpreted as 
UTF-8, are in the list of ignored characters? IOW, is it realistic that 
a user might have a password in a non-UTF-8 encoding that gets silently 
mangled into something much shorter? I think that's highly unlikely, but 
can anyone come up with a plausible example of that?

- Heikki



Attachment

Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
(sorry, I didn't notice your email until after I just sent version 4!)

On 04/06/2017 10:32 AM, Michael Paquier wrote:
> On Thu, Apr 6, 2017 at 1:33 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Attached is a new version. Notable changes since yesterday:
>>
>> * Implemented the rest of the SASLPrep, mapping some characters to spaces,
>> leaving out others, and checking for prohibited characters and bidirectional
>> strings.
>>
>> * Moved things around. There's now a separate directory, src/common/unicode,
>> which contains the perl scripts and the test code. Those are not needed to
>> build from source, as the pre-generated tables are put in
>> src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode,
>> really.
>>
>> * Renamed many things from utf_* to unicode_*, since they don't deal with
>> utf-8 input anymore.
>>
>> This is starting to shape up, but still some cleanup work to do. I will
>> continue tomorrow..
>
> Thanks for the new patch, that's looking nice. Now I was not able to
> compile it as saslprep.h is missing from what you have sent...

D'oh. Here's a new version, with saslprep.h included.

> There is for example this portion in the new tables:
> +static const Codepoint prohibited_output_chars[] =
> +{
> +   0xD800, 0xF8FF,             /* C.3, C.5 */
>
>    ----- Start Table C.5 -----
>    D800-DFFF; [SURROGATE CODES]
>    ----- End Table C.5 -----
> This indicates a range of values. Wouldn't it be better to split this
> table in two, one for the range of codepoints and another one with the
> single entries?

I considered that, but there are relatively few singular codepoints in 
the tables, so it wouldn't save much space. In this patch, singular 
codepoints are represented by a range like "0x3000, 0x3000".

> +   0x1D173, 0x1D17A,           /* C.2.2 */
> This is for musical symbols. It seems to me that checking for a range
> is what is intended.

Can you elaborate?

- Heikki




Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
> D'oh. Here's a new version, with saslprep.h included.

And here it is for real. Sigh.

>> There is for example this portion in the new tables:
>> +static const Codepoint prohibited_output_chars[] =
>> +{
>> +   0xD800, 0xF8FF,             /* C.3, C.5 */
>>
>>    ----- Start Table C.5 -----
>>    D800-DFFF; [SURROGATE CODES]
>>    ----- End Table C.5 -----
>> This indicates a range of values. Wouldn't it be better to split this
>> table in two, one for the range of codepoints and another one with the
>> single entries?
>
> I considered that, but there are relatively few singular codepoints in
> the tables, so it wouldn't save much space. In this patch, singular
> codepoints are represented by a range like "0x3000, 0x3000".
>
>> +   0x1D173, 0x1D17A,           /* C.2.2 */
>> This is for musical symbols. It seems to me that checking for a range
>> is what is intended.
>
> Can you elaborate?

Oh, I think I understand the confusion now. All the arrays represent 
codepoint ranges, not singular codepoints. I renamed them to "*_ranges", 
to make that more clear.

- Heikki



Attachment

Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
On Fri, Apr 7, 2017 at 2:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
>>> There is for example this portion in the new tables:
>>> +static const Codepoint prohibited_output_chars[] =
>>> +{
>>> +   0xD800, 0xF8FF,             /* C.3, C.5 */
>>>
>>>    ----- Start Table C.5 -----
>>>    D800-DFFF; [SURROGATE CODES]
>>>    ----- End Table C.5 -----
>>> This indicates a range of values. Wouldn't it be better to split this
>>> table in two, one for the range of codepoints and another one with the
>>> single entries?
>>
>> I considered that, but there are relatively few singular codepoints in
>> the tables, so it wouldn't save much space. In this patch, singular
>> codepoints are represented by a range like "0x3000, 0x3000".

I am really wondering if this should not reflect the real ranges
reported by the RFC. I understand that you have grouped things to save
a couple of bytes, but keeping the RFC's ranges would protect against
any updates of the codepoints within those ranges (unlikely to happen,
I agree).

>>> +   0x1D173, 0x1D17A,           /* C.2.2 */
>>> This is for musical symbols. It seems to me that checking for a range
>>> is what is intended.
>>
>> Can you elaborate?
>
> Oh, I think I understand the confusion now. All the arrays represent
> codepoint ranges, not singular codepoints. I renamed them to "*_ranges", to
> make that more clear.

Thanks! Yes I got confused by the name.

+/* Is the given Unicode codepoint in the given table? */
+#define IS_CODE_IN_TABLE(code, map) is_code_in_table(code, map, lengthof(map))
[...]
+static bool
+is_code_in_table(pg_wchar code, const pg_wchar *map, int mapsize)
+{
+   Assert(mapsize % 2 == 0);
+
+   if (code < map[0] || code > map[mapsize - 1])
+       return false;
+
+   if (bsearch(&code, map, mapsize / 2, sizeof(pg_wchar) * 2,
+               codepoint_range_cmp))
+       return true;
+   else
+       return false;
+}
Those could be renamed to range_table as well, to avoid further confusion.
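
For what it's worth, the comparator such a bsearch needs would look 
roughly like this, with each table entry being a [first, last] codepoint 
pair (a sketch, not necessarily what the patch does):

static int
codepoint_range_cmp(const void *a, const void *b)
{
    const pg_wchar *key = (const pg_wchar *) a;
    const pg_wchar *range = (const pg_wchar *) b;   /* range[0]..range[1] */

    if (*key < range[0])
        return -1;
    if (*key > range[1])
        return 1;
    return 0;                   /* the key falls inside this range */
}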

> The SASLprep implementation depends on the UTF-8 functions from
> src/backend/utils/mb/wchar.c. So to use it, you must also compile and link
> that. That doesn't change anything for the current users of these
> functions, the backend and libpq, as they both already link with wchar.o.
> It would be good to move those functions into a separate file in
> src/commmon, but I'll leave that for another day.

Fine for me.

You may want to add a .gitignore in src/common/unicode for norm_test
and norm_test_table.h.
-- 
Michael



Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 04/07/2017 05:30 AM, Michael Paquier wrote:
> On Fri, Apr 7, 2017 at 2:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
>>>> There is for example this portion in the new tables:
>>>> +static const Codepoint prohibited_output_chars[] =
>>>> +{
>>>> +   0xD800, 0xF8FF,             /* C.3, C.5 */
>>>>
>>>>    ----- Start Table C.5 -----
>>>>    D800-DFFF; [SURROGATE CODES]
>>>>    ----- End Table C.5 -----
>>>> This indicates a range of values. Wouldn't it be better to split this
>>>> table in two, one for the range of codepoints and another one with the
>>>> single entries?
>>>
>>> I considered that, but there are relatively few singular codepoints in
>>> the tables, so it wouldn't save much space. In this patch, singular
>>> codepoints are represented by a range like "0x3000, 0x3000".
>
> I am really wondering if this should not reflect the real range
> reported by the RFC. I understand that you have grouped things to save
> a couple of bytes, but that would protect from any updates of the
> codepoints within those ranges (unlikely to happen I agree).

It just means that there will be some more work required to apply the 
changes to the current lists. I constructed the lists manually to begin 
with, copy-pasting the lists from the RFC, and moving and merging 
entries by hand. I wouldn't mind doing that by hand again, if the lists 
change. But as you said, it seems unlikely that they would change any 
time soon.

> You may want to add a .gitignore in src/common/unicode for norm_test
> and norm_test_table.h.

Added, and pushed, with some more comment fixes.

Many thanks, Michael!

- Heikki




Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
On Fri, Apr 7, 2017 at 8:58 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/07/2017 05:30 AM, Michael Paquier wrote:
>> I am really wondering if this should not reflect the real range
>> reported by the RFC. I understand that you have grouped things to save
>> a couple of bytes, but that would protect from any updates of the
>> codepoints within those ranges (unlikely to happen I agree).
>
> It just means that there will be some more work required to apply the
> changes to the current lists. I constructed the lists manually to begin
> with, copy-pasting the lists from the RFC, and moving and merging entries by
> hand. I wouldn't mind doing that by hand again, if the lists change. But as
> you said, it seems unlikely that they would change any time soon.

Yeah, I don't mind either. That's simple enough to change should that happen.

>> You may want to add a .gitignore in src/common/unicode for norm_test
>> and norm_test_table.h.
>
> Added, and pushed, with some more comment fixes.

Nice. There are still a couple of important items pending for SCRAM,
so I would think that it is better not to do the refactoring now (but
rework it in PG11), and instead polish the documentation a bit more.
Your thoughts on that?
-- 
Michael



Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Heikki Linnakangas
Date:
On 04/06/2017 07:59 PM, Heikki Linnakangas wrote:
> Another thing I'd like some more eyes on, is how this will work with
> encodings other than UTF-8. We will now try to normalize the password as
> if it was in UTF-8, even if it isn't. That's OK as long as we're
> consistent about it, but there is one worrisome scenario: what if the
> user's password consists mostly of characters, that when interpreted as
> UTF-8, are in the list of ignored characters. IOW, is it realistic that
> a user might have a password in a non-UTF-8 encoding, that gets silently
> mangled into something much shorter? I think that's highly unlikely, but
> can anyone come up with a plausible example of that?

I did some testing on what the byte sequences for the Unicode characters 
that SASLprep ignores mean in other encodings. I created a text file 
containing every ignored character, in UTF-8, and ran "iconv -f <other 
encoding> -t UTF-8//TRANSLIT" on the file, using all supported server 
encodings. The idea is to take each of the ignored byte sequences, and 
pretend that they are in some other encoding. If converting them to 
UTF-8 results in a legit character, then that character means something 
in that encoding, and could be misinterpreted if it's used in a password.

Here are some characters that seem plausible to be misinterpreted and 
ignored by SASLprep:

-------
EUC-JP and EUC-JISX0213:

U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)

EUC-CN:

U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)

EUC-KR:

U+FE00-FE0F (EF B8 8X): 截 (meanings "cut off, stop, obstruct, 
intersect", per Unihan database)
U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)

EUC-TW:
U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
U+FEFF: 踢  (meaning "kick", per Unihan database)

CP866:
U+1806: саЖ
U+180B: саЛ
U+180C: саМ
U+180D: саН
U+200B: тАЛ
U+200C: тАМ
U+200D: тАН
-------

The CP866 cases seem most likely to cause confusion. Those are all 
common words in Russian. I don't know how common those Chinese/Japanese 
characters are.

Overall, I think this is OK. Even though there are those characters that 
can be misinterpreted, for it to be a problem all of the following have 
to be true:

1. The client is using one of those encodings.
2. The password string as a whole has to look like valid UTF-8.
3. Ignoring those characters/words from the password would lead to a 
significantly weaker password, i.e. it was not very long to begin with, 
or it consisted almost entirely of those characters/words.

Thoughts? Attached are the full results of running iconv with each 
encoding, from which I picked the above cases.

- Heikki



Attachment

Re: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256

From
Michael Paquier
Date:
On Mon, Apr 10, 2017 at 5:11 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here are some characters that seem plausible to be misinterpreted and
> ignored by SASLprep:
> EUC-JP and EUC-JISX0213:
>
> U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
> U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)

Those are common words, but I have yet to see those characters used
in passwords. For user names, I have seen games or apps using kanji,
so they could be used in such cases. But any application can be
careful in controlling the encoding used by the client, so I would not
worry much about them.

> EUC-CN:
>
> U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
> U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
> U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)
>
> EUC-KR:
>
> U+FE00-FE0F (EF BB BF): 截 (meanings "cut off, stop, obstruct, intersect",
> per Unihan database
> U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)
>
> EUC-TW:
> U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
> U+FEFF: 踢  (meaning "kick", per Unihan database)

Not completely sure about those ones. At least I think that the two
sets of characters for Chinese Taipei are pretty common there.

> CP866:
> U+1806: саЖ
> U+180B: саЛ
> U+180C: саМ
> U+180D: саН
> U+200B: тАЛ
> U+200C: тАМ
> U+200D: тАН
> -------
>
> The CP866 cases seem most likely to cause confusion.

No idea about those.

> Those are all common
> words in Russian. I don't know how common those Chinese/Japanese characters are.
>
> Overall, I think this is OK. Even though there are those characters that can
> be misinterpreted, for it to be a problem all of the following have to be
> true:
>
> 1. The client is using one of those encodings.
> 2. The password string as a whole has to look like valid UTF-8.
> 3. Ignoring those characters/words from the password would lead to a
> significantly weaker password, i.e. it was not very long to begin with, or
> it consisted almost entirely of those characters/words.
>
> Thoughts? Attached is the full results of running iconv with each encoding,
> from which I picked the above cases.

I am not worrying much about such things either. Still, I am wondering
whether it would be useful to give users the possibility to block
connection attempts from clients that do not use UTF-8 in this case.
Some people run open instances.
--
Michael