On 04/06/2017 07:59 PM, Heikki Linnakangas wrote:
> Another thing I'd like some more eyes on, is how this will work with
> encodings other than UTF-8. We will now try to normalize the password as
> if it was in UTF-8, even if it isn't. That's OK as long as we're
> consistent about it, but there is one worrisome scenario: what if the
> user's password consists mostly of characters, that when interpreted as
> UTF-8, are in the list of ignored characters. IOW, is it realistic that
> a user might have a password in a non-UTF-8 encoding, that gets silently
> mangled into something much shorter? I think that's highly unlikely, but
> can anyone come up with a plausible example of that?
I did some testing on what the byte sequences for the Unicode characters
that SASLprep ignores mean in other encodings. I created a text file
containing every ignored character, in UTF-8, and ran "iconv -f <other
encoding> -t UTF-8//TRANSLIT" on the file, using all supported server
encodings. The idea is to take each of the ignored byte sequences, and
pretend that they are in some other encoding. If converting them to
UTF-8 results in a legit character, then that character means something
in that encoding, and could be misinterpreted if it's used in a password.
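The same experiment can be approximated without iconv. The following is only a rough sketch, not the script I ran: it assumes Python's codec names (euc_jp, gb2312, euc_kr, cp866) are close enough stand-ins for the corresponding server encodings, and it uses the standard stringprep module's table B.1 as the list of ignored characters. The function names are made up for illustration. Unlike iconv, it silently drops bytes that are invalid in the target encoding instead of failing.

```python
import stringprep

# Assumption: Python codec names standing in for a few of the server encodings.
CANDIDATE_ENCODINGS = ["euc_jp", "gb2312", "euc_kr", "cp866"]


def ignored_characters():
    """Return the characters listed in RFC 3454 table B.1 (mapped to nothing)."""
    return [chr(cp) for cp in range(0x110000) if stringprep.in_table_b1(chr(cp))]


def misread(char, encoding):
    """Take the UTF-8 bytes of char and pretend they are text in encoding."""
    raw = char.encode("utf-8")
    # errors="ignore" drops bytes that are not valid in the target encoding;
    # iconv would report an error at that point instead.
    return raw.decode(encoding, errors="ignore")


if __name__ == "__main__":
    for ch in ignored_characters():
        for enc in CANDIDATE_ENCODINGS:
            text = misread(ch, enc)
            if text:
                hexbytes = ch.encode("utf-8").hex(" ").upper()
                print(f"{enc}: U+{ord(ch):04X} ({hexbytes}): {text}")
```

Any non-empty result is a string that a client using that encoding could type, and that would look like an ignored character if its bytes were treated as UTF-8.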
Here are some characters that seem plausible to be misinterpreted and
ignored by SASLprep:
-------
EUC-JP and EUC-JISX0213:
U+00AD (C2 AD): 足 (meaning "foot", per Unihan database)
U+FE00-FE0F (EF B8 8X): 鏝 (meaning "trowel", per Unihan database)
EUC-CN:
U+00AD (C2 AD): 颅 (meaning "skull", per Unihan database)
U+FE00-FE0F (EF B8 8X): 锔 (meaning "curium", per Unihan database)
U+FEFF (EF BB BF): 锘 (meaning "nobelium", per Wikipedia)
EUC-KR:
U+FE00-FE0F (EF B8 8X): 截 (meanings "cut off, stop, obstruct,
intersect", per Unihan database)
U+FEFF (EF BB BF): 癤 (meanings "pimple, sore, boil", per Unihan database)
EUC-TW:
U+FE00-FE0F: 踫 (meanings "collide, bump into", per Unihan database)
U+FEFF: 踢 (meaning "kick", per Unihan database)
CP866:
U+1806: саЖ
U+180B: саЛ
U+180C: саМ
U+180D: саН
U+200B: тАЛ
U+200C: тАМ
U+200D: тАН
-------
The CP866 cases seem most likely to cause confusion. Those are all
common words in Russian. I don't know how common those Chinese/Japanese
characters are.
Overall, I think this is OK. Even though some characters can be
misinterpreted, for that to be a problem, all of the following have to
be true:
1. The client is using one of those encodings.
2. The password string as a whole has to look like valid UTF-8.
3. Ignoring those characters/words from the password would lead to a
significantly weaker password, i.e. it was not very long to begin with,
or it consisted almost entirely of those characters/words.
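To make the three conditions concrete, here is a small sketch of the failure scenario. It is an illustration under assumptions, not the server code: simulate() and strip_ignored() are hypothetical helpers, table B.1 from Python's stringprep module stands in for the ignored-character list, and a password whose bytes are not valid UTF-8 is simply reported as unaffected (None), since condition 2 is then not met.

```python
import stringprep


def strip_ignored(text):
    """Remove the characters that RFC 3454 table B.1 maps to nothing."""
    return "".join(ch for ch in text if not stringprep.in_table_b1(ch))


def simulate(password, client_encoding):
    """Show what is left of a password if its bytes are treated as UTF-8.

    Returns None when the bytes are not valid UTF-8 (condition 2 not met),
    otherwise the string that remains after dropping ignored characters.
    """
    raw = password.encode(client_encoding)   # condition 1: non-UTF-8 client
    try:
        as_utf8 = raw.decode("utf-8")        # condition 2: looks like UTF-8
    except UnicodeDecodeError:
        return None
    return strip_ignored(as_utf8)            # condition 3: how much is left?


if __name__ == "__main__":
    for pw in ["саЖтАЛ", "саЖx", "пароль"]:
        print(repr(pw), "->", repr(simulate(pw, "cp866")))
```

With CP866, a password built only from the sequences listed above collapses to an empty string, while an ordinary word like "пароль" is not valid UTF-8 in the first place and so is not affected.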
Thoughts? Attached are the full results of running iconv with each
encoding, from which I picked the above cases.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)