Thread: [HACKERS] Implementation of SASLprep for SCRAM-SHA-256
Hi all,

Here is another thread for an issue related to SCRAM (https://www.postgresql.org/message-id/243d8c11-6149-a4bb-0909-136992f74b23@iki.fi), one that can be discussed independently: SASLprep.

RFC 5802 (https://tools.ietf.org/html/rfc5802), which defines SCRAM, requires passwords to be normalized as per RFC 4013 (https://tools.ietf.org/html/rfc4013) using SASLprep, which is essentially NFKC. I have previously described what happens in this case here:
https://www.postgresql.org/message-id/CAB7nPqScgwh6Eu4=c-0L7-cufZrU5rULTSdpMOOWiz1YFvGE6w@mail.gmail.com

This way, we can be sure that two UTF-8 strings are considered equivalent in a SASL exchange. In our case we care about the password string (we should care about the username as well). Without SASLprep, our implementation of SCRAM may fail against other third-party tools if passwords are not made only of ASCII characters. And that sucks.

In short, NFKC consists of two steps: a compatibility decomposition of the UTF-8 string (decomposing each character and then reordering the result by canonical combining class), followed by a canonical recomposition.

Postgres does not have any encoding-related logic on the frontend side, but some of the APIs in wchar.c make it possible to check whether a given string is valid UTF-8 or not. If the string is not valid UTF-8, we should just use it as-is in the exchange, which leads to undefined behavior anyway, as SASL expects UTF-8.

I have also implemented a Postgres extension able to do this operation, which has been useful for testing:
https://github.com/michaelpq/pg_plugins/tree/master/pg_sasl_prepare

Attached is a patch implementing that; the conversion tables are built from UnicodeData.txt so as to be minimal in size. I have added an Open Item on the wiki for this problem as well.

Thanks,
--
Michael
Attachment
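[Editor's note: to make the "check validity, else use the string as-is" fallback above concrete, here is a minimal standalone sketch of such a check. It is an illustration only and not the wchar.c code; the actual patch reuses pg_utf8_islegal() and friends, as discussed downthread. A caller would run SASLprep only when this returns true, and otherwise pass the raw string through unchanged.]

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the per-character checks in wchar.c. */
static bool
string_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char b = s[i];

        if (b < 0x80)
            i++;
        else if ((b & 0xE0) == 0xC0)
        {
            if (i + 1 >= len || (s[i + 1] & 0xC0) != 0x80 || b < 0xC2)
                return false;   /* truncated, bad continuation, or overlong */
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0)
        {
            if (i + 2 >= len ||
                (s[i + 1] & 0xC0) != 0x80 || (s[i + 2] & 0xC0) != 0x80)
                return false;
            if (b == 0xE0 && s[i + 1] < 0xA0)
                return false;   /* overlong */
            if (b == 0xED && s[i + 1] >= 0xA0)
                return false;   /* UTF-16 surrogate range */
            i += 3;
        }
        else if ((b & 0xF8) == 0xF0)
        {
            if (i + 3 >= len ||
                (s[i + 1] & 0xC0) != 0x80 ||
                (s[i + 2] & 0xC0) != 0x80 || (s[i + 3] & 0xC0) != 0x80)
                return false;
            if (b == 0xF0 && s[i + 1] < 0x90)
                return false;   /* overlong */
            if (b > 0xF4 || (b == 0xF4 && s[i + 1] >= 0x90))
                return false;   /* beyond U+10FFFF */
            i += 4;
        }
        else
            return false;       /* stray continuation byte or invalid lead byte */
    }
    return true;
}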
On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> RFC 5802 (https://tools.ietf.org/html/rfc5802), which defines SCRAM, requires
> passwords to be normalized as per RFC 4013 (https://tools.ietf.org/html/rfc4013)
> using SASLprep, which is essentially NFKC. I have previously described what
> happens in this case here:
> https://www.postgresql.org/message-id/CAB7nPqScgwh6Eu4=c-0L7-cufZrU5rULTSdpMOOWiz1YFvGE6w@mail.gmail.com
> This way, we can be sure that two UTF-8 strings are considered equivalent in
> a SASL exchange. In our case we care about the password string (we should
> care about the username as well). Without SASLprep, our implementation of
> SCRAM may fail against other third-party tools if passwords are not made
> only of ASCII characters. And that sucks.

Agreed. I am not sure this quite rises to the level of a stop-ship issue; we could document the restriction. However, that's pretty ugly; it would be a whole lot better to get this fixed. I kinda hope Heikki is going to step up to the plate here, because I think he understands most of it already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> This way, we can be sure that two UTF-8 strings are considered equivalent
>> in a SASL exchange. In our case we care about the password string (we
>> should care about the username as well). Without SASLprep, our
>> implementation of SCRAM may fail against other third-party tools if
>> passwords are not made only of ASCII characters. And that sucks.
>
> Agreed. I am not sure this quite rises to the level of a stop-ship
> issue; we could document the restriction.

I am not sure about that; what we have now on HEAD is still useful and better than MD5.

> However, that's pretty ugly; it would be a whole lot better to get this fixed.

Agreed.

> I kinda hope Heikki is going to step up to the plate here, because I
> think he understands most of it already.

The second person who knows a good deal about NFKC is Ishii-san.

Attached is a rebased patch.
--
Michael
Attachment
On 03/31/2017 10:10 AM, Michael Paquier wrote:
> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> I kinda hope Heikki is going to step up to the plate here, because I
>> think he understands most of it already.
>
> The second person who knows a good deal about NFKC is Ishii-san.
>
> Attached is a rebased patch.

Thanks, I'm looking at this now.

- Heikki
On 04/04/2017 01:52 PM, Heikki Linnakangas wrote:
> On 03/31/2017 10:10 AM, Michael Paquier wrote:
>> On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>> I kinda hope Heikki is going to step up to the plate here, because I
>>> think he understands most of it already.
>>
>> The second person who knows a good deal about NFKC is Ishii-san.
>>
>> Attached is a rebased patch.
>
> Thanks, I'm looking at this now.

I will continue tomorrow, but I wanted to report on what I've done so far. Attached is a new patch version, quite heavily modified. Notable changes so far:

* Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit ints. IMHO this makes the tables easier to read (to a human), and they are also packed slightly more tightly (see next two points), as you can fit more codepoints in a 16-bit integer.

* Reorganize the decomposition table. Instead of a separate binary-searched table for each "size", have one table that's ordered by codepoint, which contains a length and offset into another array, where all the decomposed sequences are. This makes the recomposition step somewhat slower, as it now has to grovel through an array that contains all decompositions, rather than just the array that contains decompositions of two characters. But this isn't performance-critical; readability and tight packing of the tables matter more.

* In the lookup table, store singleton decompositions (those that decompose to a single character), when that character fits in a uint16, directly in the main lookup table, instead of storing an index into the other table. This makes the tables somewhat smaller, as there are a lot of such decompositions.

* Use uint32 instead of pg_wchar to represent Unicode codepoints. pg_wchar suggests something that holds multi-byte characters in the OS locale, used by functions like wcscmp, so avoid that. I added a "typedef uint32 Codepoint" alias, though, to make it more clear which variables hold Unicode codepoints rather than e.g. indexes into arrays.

* The Unicode consortium publishes a comprehensive normalization test suite, as a text file, NormalizationTest.txt. I wrote a small perl and C program to run all the tests from that test suite against the normalization function. That uncovered a number of bugs in the recomposition algorithm, which I then fixed:
  * decompositions that map to ASCII characters cannot be excluded.
  * non-canonical and non-starter decompositions need to be excluded. They are now flagged in the lookup table, so that we only use them during the decomposition phase, not when recomposing.

TODOs / Discussion points:

* I'm not sure I like the way the code is organized; I feel that we're littering src/common with these. Perhaps we should add a src/common/unicode_normalization directory for all this.

* The list of characters excluded from recomposition is currently hard-coded in utf_norm_generate.pl. However, that list is available in machine-readable format, in the file CompositionExclusions.txt. Since we're reading most of the data from UnicodeData.txt, it would be good to read the exclusion table from a file, too.

* SASLprep specifies normalization form KC, but it also specifies that some characters are mapped to space or nothing. We should do those mappings, too.

- Heikki
Attachment
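[Editor's note: to visualize the table layout Heikki describes above (one codepoint-ordered lookup table carrying a length and an offset into a shared array of decomposed sequences, with single-character decompositions that fit in a uint16 stored inline), here is a small illustrative sketch. The struct and identifier names are hypothetical, not the ones used in the committed code; the real table also carries flags for decompositions excluded from recomposition.]

#include <stdint.h>

typedef uint32_t Codepoint;

#define DECOMP_INLINE 0x80      /* hypothetical flag: dec_index holds the
                                 * decomposed character itself, not an offset */

typedef struct
{
    Codepoint   codepoint;      /* the character being decomposed */
    uint8_t     dec_size_flags; /* length of the decomposition, plus flags */
    uint16_t    dec_index;      /* offset into decomp_expansions[], or an
                                 * inlined single-character decomposition */
} DecompEntry;

/* All multi-codepoint (or >uint16) decompositions, concatenated. */
static const Codepoint decomp_expansions[] = {
    0x0041, 0x0300,             /* e.g. U+00C0 -> U+0041 U+0300 */
    /* ... generated from UnicodeData.txt ... */
};

/* Ordered by codepoint, so it can be binary-searched. */
static const DecompEntry decomp_table[] = {
    {0x00A0, 1 | DECOMP_INLINE, 0x0020},    /* NO-BREAK SPACE -> SPACE, inlined */
    {0x00C0, 2, 0},                         /* A WITH GRAVE -> A + COMBINING GRAVE */
    /* ... */
};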
On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so far.
> Attached is a new patch version, quite heavily modified. Notable changes so
> far:

Great, thanks!

> * Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit ints.
> IMHO this makes the tables easier to read (to a human), and they are also
> packed slightly more tightly (see next two points), as you can fit more
> codepoints in a 16-bit integer.

Using codepoints directly is not really consistent with the existing backend, but for the sake of packing things more tightly, OK.

> * Reorganize the decomposition table. Instead of a separate binary-searched
> table for each "size", have one table that's ordered by codepoint, which
> contains a length and offset into another array, where all the decomposed
> sequences are. This makes the recomposition step somewhat slower, as it now
> has to grovel through an array that contains all decompositions, rather than
> just the array that contains decompositions of two characters. But this
> isn't performance-critical; readability and tight packing of the tables
> matter more.

Okay, no objection to that.

> * In the lookup table, store singleton decompositions (those that decompose
> to a single character), when that character fits in a uint16, directly in
> the main lookup table, instead of storing an index into the other table.
> This makes the tables somewhat smaller, as there are a lot of such
> decompositions.

Indeed.

> * Use uint32 instead of pg_wchar to represent Unicode codepoints. pg_wchar
> suggests something that holds multi-byte characters in the OS locale, used
> by functions like wcscmp, so avoid that. I added a "typedef uint32
> Codepoint" alias, though, to make it more clear which variables hold Unicode
> codepoints rather than e.g. indexes into arrays.
>
> * The Unicode consortium publishes a comprehensive normalization test suite,
> as a text file, NormalizationTest.txt. I wrote a small perl and C program to
> run all the tests from that test suite against the normalization function.
> That uncovered a number of bugs in the recomposition algorithm, which I then
> fixed:

I was looking for something like that for some time, thanks! That's here actually:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt

> * decompositions that map to ASCII characters cannot be excluded.
> * non-canonical and non-starter decompositions need to be excluded. They
> are now flagged in the lookup table, so that we only use them during the
> decomposition phase, not when recomposing.

Okay on that.

> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized; I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.

pg_utf8_islegal() and pg_utf_mblen() should also be moved into their own file, I think, and wchar.c can use that.

> * The list of characters excluded from recomposition is currently hard-coded
> in utf_norm_generate.pl. However, that list is available in machine-readable
> format, in the file CompositionExclusions.txt. Since we're reading most of
> the data from UnicodeData.txt, it would be good to read the exclusion table
> from a file, too.

Ouch. Those are present here:
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
Definitely it makes more sense to read them from a file.
> * SASLprep specifies normalization form KC, but it also specifies that some
> characters are mapped to space or nothing. We should do those mappings, too.

Ah, right. Those ones are here:
https://tools.ietf.org/html/rfc3454#appendix-B.1

The conversion tables need some extra tweaking...

+   else if ((*utf & 0xf0) == 0xe0)
+   {
+       if (utf[1] == 0 || utf[2] == 0)
+           goto invalid;
+       cp = (*utf++) & 0x0F;
+       cp = (cp << 6) | ((*utf++) & 0x3F);
+       cp = (cp << 6) | ((*utf++) & 0x3F);
+   }
+   else if ((*utf & 0xf8) == 0xf0)
+   {
+       if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
+           goto invalid;
+       cp = (*utf++) & 0x07;
+       cp = (cp << 6) | ((*utf++) & 0x3F);
+       cp = (cp << 6) | ((*utf++) & 0x3F);
+       cp = (cp << 6) | ((*utf++) & 0x3F);
+   }

I find something like pg_utf2wchar_with_len() more readable, but well...

Some typos:
s/fore/for/
s/reprsented/represented/

You seem to be fully on it... I can help out with any of those items as needed.
--
Michael
On 04/05/2017 07:23 AM, Michael Paquier wrote:
> On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I will continue tomorrow, but I wanted to report on what I've done so far.
>> Attached is a new patch version, quite heavily modified. Notable changes so
>> far:
>
> Great, thanks!
>
>> * Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit ints.
>> IMHO this makes the tables easier to read (to a human), and they are also
>> packed slightly more tightly (see next two points), as you can fit more
>> codepoints in a 16-bit integer.
>
> Using codepoints directly is not really consistent with the existing
> backend, but for the sake of packing things more tightly, OK.

Oh, I see, we already have similar functions in wchar.c: unicode_to_utf8() and utf8_to_unicode(). We should probably move those to src/common, rather than re-invent the wheel.

> pg_utf8_islegal() and pg_utf_mblen() should also be moved into their own
> file, I think, and wchar.c can use that.

Yeah..

>> * The list of characters excluded from recomposition is currently hard-coded
>> in utf_norm_generate.pl. However, that list is available in machine-readable
>> format, in the file CompositionExclusions.txt. Since we're reading most of
>> the data from UnicodeData.txt, it would be good to read the exclusion table
>> from a file, too.
>
> Ouch. Those are present here:
> http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
> Definitely it makes more sense to read them from a file.

Did that.

>> * SASLprep specifies normalization form KC, but it also specifies that some
>> characters are mapped to space or nothing. We should do those mappings, too.
>
> Ah, right. Those ones are here:
> https://tools.ietf.org/html/rfc3454#appendix-B.1

Yep.

Attached is a new version. Notable changes since yesterday:

* Implemented the rest of SASLprep: mapping some characters to spaces, leaving out others, and checking for prohibited characters and bidirectional strings.

* Moved things around. There's now a separate directory, src/common/unicode, which contains the perl scripts and the test code. Those are not needed to build from source, as the pre-generated tables are put in src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode, really.

* Renamed many things from utf_* to unicode_*, since they don't deal with UTF-8 input anymore.

This is starting to shape up, but still some cleanup work to do. I will continue tomorrow..

- Heikki
Attachment
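[Editor's note: as a rough outline of the SASLprep steps Heikki lists above (map some characters to space or nothing, NFKC-normalize, then reject prohibited characters and invalid bidirectional strings), here is a sketch of the control flow. The helper names are hypothetical stand-ins for the table-driven lookups in the actual patch (src/common/saslprep.c), and the table subsets shown are tiny excerpts of the RFC 3454 tables, not the full lists.]

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Codepoint;

/* Stand-ins for the normalization and RFC 3454 section 6 bidi checks;
 * normalize_nfkc() is assumed to have room to expand the buffer. */
extern int  normalize_nfkc(Codepoint *s, int len);  /* returns new length */
extern bool satisfies_bidi_rules(const Codepoint *s, int len);

static bool
is_mapped_to_nothing(Codepoint cp)      /* table B.1, e.g. SOFT HYPHEN */
{
    return cp == 0x00AD || cp == 0x200B || (cp >= 0xFE00 && cp <= 0xFE0F);
}

static bool
is_mapped_to_space(Codepoint cp)        /* table C.1.2, non-ASCII spaces */
{
    return cp == 0x00A0 || cp == 0x3000 || (cp >= 0x2000 && cp <= 0x200A);
}

static bool
is_prohibited(Codepoint cp)             /* e.g. table C.5, surrogate codes */
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Returns false if the input contains prohibited or bidi-invalid content. */
static bool
saslprep_codepoints(Codepoint *s, int *len)
{
    int         out = 0;

    /* Step 1: mapping -- drop "map to nothing" chars, map others to space. */
    for (int i = 0; i < *len; i++)
    {
        if (is_mapped_to_nothing(s[i]))
            continue;
        s[out++] = is_mapped_to_space(s[i]) ? 0x0020 : s[i];
    }
    *len = out;

    /* Step 2: NFKC normalization. */
    *len = normalize_nfkc(s, *len);

    /* Step 3: reject prohibited output characters. */
    for (int i = 0; i < *len; i++)
    {
        if (is_prohibited(s[i]))
            return false;
    }

    /* Step 4: bidirectional string checks. */
    return satisfies_bidi_rules(s, *len);
}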
On Thu, Apr 6, 2017 at 1:33 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Attached is a new version. Notable changes since yesterday:
>
> * Implemented the rest of SASLprep: mapping some characters to spaces,
> leaving out others, and checking for prohibited characters and bidirectional
> strings.
>
> * Moved things around. There's now a separate directory, src/common/unicode,
> which contains the perl scripts and the test code. Those are not needed to
> build from source, as the pre-generated tables are put in
> src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode,
> really.
>
> * Renamed many things from utf_* to unicode_*, since they don't deal with
> UTF-8 input anymore.
>
> This is starting to shape up, but still some cleanup work to do. I will
> continue tomorrow..

Thanks for the new patch, that's looking nice. Now I was not able to compile it, as saslprep.h is missing from what you have sent...

There is for example this portion in the new tables:

+static const Codepoint prohibited_output_chars[] =
+{
+   0xD800, 0xF8FF,    /* C.3, C.5 */

----- Start Table C.5 -----
D800-DFFF; [SURROGATE CODES]
----- End Table C.5 -----

This indicates a range of values. Wouldn't it be better to split this table in two, one for the range of codepoints and another one with the single entries?

+   0x1D173, 0x1D17A,    /* C.2.2 */

This is for musical symbols. It seems to me that checking for a range is what is intended.
--
Michael
Another version attached. I think this is now in committable state, but there have been a lot of small changes here and there, so please have one more look over it if you have a chance. I'm planning to push this tomorrow, after sleeping on it.

The code-organization issue with the UTF-8 functions, pg_utf_mblen, pg_utf8_islegal, unicode_to_utf8, and utf8_to_unicode, is still kind of unresolved. The way I have it in this version is that the functions are in wchar.c, like they've always been, and saslprep.c uses them from there. This is ugly from a code-organization point of view, as we now have functions in src/common depending on functions in src/backend/utils/mb/wchar.c. I think wchar.c deserves some reorganization, as most of the functions there are also used in frontend code. Perhaps it should be moved to src/common as a whole. Libpq already compiles with wchar.c, so all the functions that src/common/saslprep.c requires are already available in libpq. So I don't think that's urgent, but it's something we nevertheless ought to clean up at some point.

Another thing I'd like some more eyes on is how this will work with encodings other than UTF-8. We will now try to normalize the password as if it was in UTF-8, even if it isn't. That's OK as long as we're consistent about it, but there is one worrisome scenario: what if the user's password consists mostly of characters that, when interpreted as UTF-8, are in the list of ignored characters? IOW, is it realistic that a user might have a password in a non-UTF-8 encoding that gets silently mangled into something much shorter? I think that's highly unlikely, but can anyone come up with a plausible example of that?

- Heikki
Attachment
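[Editor's note: to make the worrisome scenario above concrete with one character from the ignore list: U+00AD (SOFT HYPHEN) is encoded in UTF-8 as the bytes 0xC2 0xAD, and SASLprep maps it to nothing; per the analysis later in this thread, that same byte pair is a meaningful character in EUC-JP. The snippet below is a purely hypothetical illustration, not code from the patch.]

/*
 * These bytes are valid UTF-8: "foo", then 0xC2 0xAD (U+00AD SOFT HYPHEN,
 * which SASLprep maps to nothing), then "bar".  If a client actually meant
 * those middle bytes as a character in some non-UTF-8 encoding, the
 * normalized password still ends up as just "foobar".
 */
static const unsigned char raw_password[] = {
    'f', 'o', 'o',
    0xC2, 0xAD,        /* read as UTF-8: U+00AD, dropped by SASLprep */
    'b', 'a', 'r',
    '\0'
};
/* After SASLprep (treated as UTF-8): "foobar" -- two bytes silently removed. */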
(sorry, I didn't notice your email until after I just sent version 4!)

On 04/06/2017 10:32 AM, Michael Paquier wrote:
> On Thu, Apr 6, 2017 at 1:33 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Attached is a new version. Notable changes since yesterday:
>>
>> * Implemented the rest of SASLprep: mapping some characters to spaces,
>> leaving out others, and checking for prohibited characters and bidirectional
>> strings.
>>
>> * Moved things around. There's now a separate directory, src/common/unicode,
>> which contains the perl scripts and the test code. Those are not needed to
>> build from source, as the pre-generated tables are put in
>> src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode,
>> really.
>>
>> * Renamed many things from utf_* to unicode_*, since they don't deal with
>> UTF-8 input anymore.
>>
>> This is starting to shape up, but still some cleanup work to do. I will
>> continue tomorrow..
>
> Thanks for the new patch, that's looking nice. Now I was not able to
> compile it, as saslprep.h is missing from what you have sent...

D'oh. Here's a new version, with saslprep.h included.

> There is for example this portion in the new tables:
>
> +static const Codepoint prohibited_output_chars[] =
> +{
> +   0xD800, 0xF8FF,    /* C.3, C.5 */
>
> ----- Start Table C.5 -----
> D800-DFFF; [SURROGATE CODES]
> ----- End Table C.5 -----
>
> This indicates a range of values. Wouldn't it be better to split this
> table in two, one for the range of codepoints and another one with the
> single entries?

I considered that, but there are relatively few singular codepoints in the tables, so it wouldn't save much space. In this patch, singular codepoints are represented by a range like "0x3000, 0x3000".

> +   0x1D173, 0x1D17A,    /* C.2.2 */
>
> This is for musical symbols. It seems to me that checking for a range
> is what is intended.

Can you elaborate?

- Heikki
On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
> D'oh. Here's a new version, with saslprep.h included.

And here it is for real. Sigh.

>> There is for example this portion in the new tables:
>>
>> +static const Codepoint prohibited_output_chars[] =
>> +{
>> +   0xD800, 0xF8FF,    /* C.3, C.5 */
>>
>> ----- Start Table C.5 -----
>> D800-DFFF; [SURROGATE CODES]
>> ----- End Table C.5 -----
>>
>> This indicates a range of values. Wouldn't it be better to split this
>> table in two, one for the range of codepoints and another one with the
>> single entries?
>
> I considered that, but there are relatively few singular codepoints in
> the tables, so it wouldn't save much space. In this patch, singular
> codepoints are represented by a range like "0x3000, 0x3000".
>
>> +   0x1D173, 0x1D17A,    /* C.2.2 */
>>
>> This is for musical symbols. It seems to me that checking for a range
>> is what is intended.
>
> Can you elaborate?

Oh, I think I understand the confusion now. All the arrays represent codepoint ranges, not singular codepoints. I renamed them to "*_ranges", to make that more clear.

- Heikki
Attachment
On Fri, Apr 7, 2017 at 2:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
>>> There is for example this portion in the new tables:
>>>
>>> +static const Codepoint prohibited_output_chars[] =
>>> +{
>>> +   0xD800, 0xF8FF,    /* C.3, C.5 */
>>>
>>> ----- Start Table C.5 -----
>>> D800-DFFF; [SURROGATE CODES]
>>> ----- End Table C.5 -----
>>>
>>> This indicates a range of values. Wouldn't it be better to split this
>>> table in two, one for the range of codepoints and another one with the
>>> single entries?
>>
>> I considered that, but there are relatively few singular codepoints in
>> the tables, so it wouldn't save much space. In this patch, singular
>> codepoints are represented by a range like "0x3000, 0x3000".

I am really wondering if this should not reflect the real ranges reported by the RFC. I understand that you have grouped things to save a couple of bytes, but keeping the original ranges would protect against any updates of the codepoints within those ranges (unlikely to happen, I agree).

>>> +   0x1D173, 0x1D17A,    /* C.2.2 */
>>> This is for musical symbols. It seems to me that checking for a range
>>> is what is intended.
>>
>> Can you elaborate?
>
> Oh, I think I understand the confusion now. All the arrays represent
> codepoint ranges, not singular codepoints. I renamed them to "*_ranges", to
> make that more clear.

Thanks! Yes, I got confused by the name.

+/* Is the given Unicode codepoint in the given table? */
+#define IS_CODE_IN_TABLE(code, map) is_code_in_table(code, map, lengthof(map))
[...]
+static bool
+is_code_in_table(pg_wchar code, const pg_wchar *map, int mapsize)
+{
+   Assert(mapsize % 2 == 0);
+
+   if (code < map[0] || code > map[mapsize - 1])
+       return false;
+
+   if (bsearch(&code, map, mapsize / 2, sizeof(pg_wchar) * 2,
+               codepoint_range_cmp))
+       return true;
+   else
+       return false;
+}

Those could be renamed to range_table as well, to avoid more confusion.

> The SASLprep implementation depends on the UTF-8 functions from
> src/backend/utils/mb/wchar.c. So to use it, you must also compile and link
> that. That doesn't change anything for the current users of these
> functions, the backend and libpq, as they both already link with wchar.o.
> It would be good to move those functions into a separate file in
> src/common, but I'll leave that for another day.

Fine for me.

You may want to add a .gitignore in src/common/unicode for norm_test and norm_test_table.h.
--
Michael
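[Editor's note: the comparator referenced above, codepoint_range_cmp, is not quoted in this message. A comparator along the following lines would make the bsearch() over an array of {start, end} codepoint pairs work as shown; this is an illustration only, not the committed code, and the standalone typedef is just for the sketch.]

#include <stdint.h>

typedef uint32_t pg_wchar;      /* already defined in PostgreSQL; repeated
                                 * here only to keep the sketch standalone */

static int
codepoint_range_cmp(const void *a, const void *b)
{
    const pg_wchar *key = (const pg_wchar *) a;     /* single codepoint */
    const pg_wchar *range = (const pg_wchar *) b;   /* {start, end} pair */

    if (*key < range[0])
        return -1;              /* keep searching the lower half */
    if (*key > range[1])
        return 1;               /* keep searching the upper half */
    return 0;                   /* within [start, end]: found */
}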
On 04/07/2017 05:30 AM, Michael Paquier wrote:
> On Fri, Apr 7, 2017 at 2:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 04/06/2017 08:42 PM, Heikki Linnakangas wrote:
>>>> There is for example this portion in the new tables:
>>>>
>>>> +static const Codepoint prohibited_output_chars[] =
>>>> +{
>>>> +   0xD800, 0xF8FF,    /* C.3, C.5 */
>>>>
>>>> ----- Start Table C.5 -----
>>>> D800-DFFF; [SURROGATE CODES]
>>>> ----- End Table C.5 -----
>>>>
>>>> This indicates a range of values. Wouldn't it be better to split this
>>>> table in two, one for the range of codepoints and another one with the
>>>> single entries?
>>>
>>> I considered that, but there are relatively few singular codepoints in
>>> the tables, so it wouldn't save much space. In this patch, singular
>>> codepoints are represented by a range like "0x3000, 0x3000".
>
> I am really wondering if this should not reflect the real ranges
> reported by the RFC. I understand that you have grouped things to save
> a couple of bytes, but keeping the original ranges would protect against
> any updates of the codepoints within those ranges (unlikely to happen,
> I agree).

It just means that there will be some more work required to apply the changes to the current lists. I constructed the lists manually to begin with, copy-pasting the lists from the RFC, and moving and merging entries by hand. I wouldn't mind doing that by hand again, if the lists change. But as you said, it seems unlikely that they would change any time soon.

> You may want to add a .gitignore in src/common/unicode for norm_test
> and norm_test_table.h.

Added, and pushed, with some more comment fixes. Many thanks, Michael!

- Heikki
On Fri, Apr 7, 2017 at 8:58 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 04/07/2017 05:30 AM, Michael Paquier wrote:
>> I am really wondering if this should not reflect the real ranges
>> reported by the RFC. I understand that you have grouped things to save
>> a couple of bytes, but keeping the original ranges would protect against
>> any updates of the codepoints within those ranges (unlikely to happen,
>> I agree).
>
> It just means that there will be some more work required to apply the
> changes to the current lists. I constructed the lists manually to begin
> with, copy-pasting the lists from the RFC, and moving and merging entries
> by hand. I wouldn't mind doing that by hand again, if the lists change. But
> as you said, it seems unlikely that they would change any time soon.

Yeah, I don't mind either. That's simple enough to change should that happen.

>> You may want to add a .gitignore in src/common/unicode for norm_test
>> and norm_test_table.h.
>
> Added, and pushed, with some more comment fixes.

Nice. There are still a couple of important items pending for SCRAM, so I would think that it is better not to do the refactoring now (rework it in PG11 instead), and to polish the documentation a bit more. Your thoughts on that?
--
Michael
On 04/06/2017 07:59 PM, Heikki Linnakangas wrote:
> Another thing I'd like some more eyes on is how this will work with
> encodings other than UTF-8. We will now try to normalize the password as
> if it was in UTF-8, even if it isn't. That's OK as long as we're
> consistent about it, but there is one worrisome scenario: what if the
> user's password consists mostly of characters that, when interpreted as
> UTF-8, are in the list of ignored characters? IOW, is it realistic that
> a user might have a password in a non-UTF-8 encoding that gets silently
> mangled into something much shorter? I think that's highly unlikely, but
> can anyone come up with a plausible example of that?

I did some testing on what the byte sequences for the Unicode characters that SASLprep ignores mean in other encodings. I created a text file containing every ignored character, in UTF-8, and ran "iconv -f <other encoding> -t UTF-8//TRANSLIT" on the file, using all supported server encodings. The idea is to take each of the ignored byte sequences and pretend that they are in some other encoding. If converting them to UTF-8 results in a legit character, then that character means something in that encoding, and could be misinterpreted if it's used in a password.

Here are some characters that seem plausible to be misinterpreted and ignored by SASLprep:

-------
EUC-JP and EUC-JISX0213:

U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)

EUC-CN:

U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)

EUC-KR:

U+FE00-FE0F (EF B8 8X): 截 (meanings "cut off, stop, obstruct, intersect", per Unihan database)
U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)

EUC-TW:

U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
U+FEFF: 踢 (meaning "kick", per Unihan database)

CP866:

U+1806: саЖ
U+180B: саЛ
U+180C: саМ
U+180D: саН
U+200B: тАЛ
U+200C: тАМ
U+200D: тАН
-------

The CP866 cases seem most likely to cause confusion. Those are all common words in Russian. I don't know how common those Chinese/Japanese characters are.

Overall, I think this is OK. Even though there are those characters that can be misinterpreted, for it to be a problem all of the following have to be true:

1. The client is using one of those encodings.
2. The password string as a whole has to look like valid UTF-8.
3. Ignoring those characters/words from the password would lead to a significantly weaker password, i.e. it was not very long to begin with, or it consisted almost entirely of those characters/words.

Thoughts? Attached is the full results of running iconv with each encoding, from which I picked the above cases.

- Heikki
Attachment
On Mon, Apr 10, 2017 at 5:11 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here are some characters that seem plausible to be misinterpreted and
> ignored by SASLprep:
>
> EUC-JP and EUC-JISX0213:
>
> U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
> U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)

Those are common words, but I have yet to see those characters used in passwords. For user names, I have seen games or apps using kanji, so they could be used in such cases. But any application can be careful in controlling the encoding used by the client, so I would not worry much about them.

> EUC-CN:
>
> U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
> U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
> U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)
>
> EUC-KR:
>
> U+FE00-FE0F (EF B8 8X): 截 (meanings "cut off, stop, obstruct, intersect", per Unihan database)
> U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)
>
> EUC-TW:
>
> U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
> U+FEFF: 踢 (meaning "kick", per Unihan database)

Not completely sure about those ones. At least I think that the two sets of characters for Taiwan are pretty common there.

> CP866:
>
> U+1806: саЖ
> U+180B: саЛ
> U+180C: саМ
> U+180D: саН
> U+200B: тАЛ
> U+200C: тАМ
> U+200D: тАН
>
> The CP866 cases seem most likely to cause confusion.

No idea about those.

> Those are all common words in Russian. I don't know how common those
> Chinese/Japanese characters are.
>
> Overall, I think this is OK. Even though there are those characters that can
> be misinterpreted, for it to be a problem all of the following have to be
> true:
>
> 1. The client is using one of those encodings.
> 2. The password string as a whole has to look like valid UTF-8.
> 3. Ignoring those characters/words from the password would lead to a
> significantly weaker password, i.e. it was not very long to begin with, or
> it consisted almost entirely of those characters/words.
>
> Thoughts? Attached is the full results of running iconv with each encoding,
> from which I picked the above cases.

I am not worrying much about such things either. Still, I am wondering whether it would be useful to give users the possibility to block connection attempts from clients that do not use UTF-8 in this case. Some people use open instances.
--
Michael