Thread: Non-ASCII DSN name troubles
Hi,

If you try to create a data source with a name that contains non-ASCII characters, funny things will happen. I wouldn't expect the ANSI driver to support that, but a Unicode driver ought to handle it.

1. We always use the ANSI versions of the functions to read/write the config, SQLGetPrivateProfileString/SQLWritePrivateProfileString. In the Unicode driver, I think we should be using the Unicode *W variants of those functions, otherwise we cannot handle characters that don't have a representation in the current system codepage.

2. Even if all the characters can be represented in the system codepage, when built as a Unicode driver, we internally pass all strings as UTF-8 encoded char[] arrays, and convert between UTF-8 and UCS-2 in the wrapper functions in odbcapiw.c. We also do that for the DSN name in SQLDriverConnectW(), but we then pass the UTF-8 encoded DSN name to the SQLGetPrivateProfileString() function to get the config options. That doesn't work, because SQLGetPrivateProfileString() expects the string to be encoded in the system codepage, not UTF-8. Again, we should be using the Unicode version, SQLGetPrivateProfileStringW().

3. We don't use the Unicode versions of the GUI functions, like GetDlgItemText(), when dealing with the configuration dialog. That again means that the GUI cannot handle characters outside the system codepage, but we also don't convert the strings to UTF-8 like we do to strings coming through SQLDriverConnectW() and other API functions, so there's another mismatch.

Attached patch fixes those issues, allowing you to use any Unicode characters in the DSN name, or in any other configuration field, with the Unicode driver.

This changes the behavior of how username and password are handled in the Unicode driver. Without this patch, the username is read from the registry in the system codepage, and also sent as such to the server. After the patch, it's always sent to the server in UTF-8. I think that's more sane behavior, but there's a small chance of breaking existing installations that depend on the old behavior. So we probably should include this patch when we bump the major version number to 9.4.

- Heikki
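To make point 2 concrete, here is a minimal sketch of the kind of UTF-8 to UTF-16 conversion that has to happen before a DSN name held internally as UTF-8 can be handed to a *W function. The helper name utf8_to_utf16 is hypothetical and this is not the driver's actual conversion code; it also skips some validation (overlong forms, encoded surrogates) to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: decode a NUL-terminated
 * UTF-8 string into UTF-16 code units.
 * Returns the number of code units written (excluding the terminator),
 * or -1 on malformed input or if the output buffer is too small.
 */
int utf8_to_utf16(const char *in, uint16_t *out, size_t outlen)
{
    const unsigned char *p = (const unsigned char *) in;
    size_t n = 0;

    assert(in != NULL && out != NULL && outlen > 0);

    while (*p)
    {
        uint32_t cp;
        int extra;

        /* The lead byte tells us how many continuation bytes follow. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else
            return -1;      /* stray continuation byte or invalid lead byte */
        p++;

        while (extra-- > 0)
        {
            if ((*p & 0xC0) != 0x80)
                return -1;  /* truncated sequence (this also catches NUL) */
            cp = (cp << 6) | (uint32_t) (*p++ & 0x3F);
        }

        if (cp >= 0x10000)
        {
            /* Outside the BMP: emit a surrogate pair. */
            if (cp > 0x10FFFF || n + 2 >= outlen)
                return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t) (0xD800 | (cp >> 10));
            out[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        }
        else
        {
            if (n + 1 >= outlen)
                return -1;
            out[n++] = (uint16_t) cp;
        }
    }
    out[n] = 0;
    return (int) n;
}
```

For example, the UTF-8 bytes 0xC3 0xA9 come out as the single code unit 0x00E9. If those same two bytes are instead passed to an ANSI function, they are interpreted as two separate characters in the system codepage, which is the mismatch described above.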
(2014/06/21 20:37), Heikki Linnakangas wrote:
> Hi,
>
> If you try to create a data source with a name that contains non-ASCII
> characters, funny things will happen. I wouldn't expect the ANSI driver
> to support that, but a Unicode driver ought to handle it.

Currently non-ASCII characters are not recommended because they are mainly used at connection time. Though the Unicode version of SQLDriverConnect uses UTF-8 encoded user, password, database ... because I can't think of other ways, it has little meaning IMHO. Was there a decision that the encoding of user, password or database is UTF-8?

> 1. We always use the ANSI versions of the functions to read/write the
> config, SQLGetPrivateProfileString/SQLWritePrivateProfileString. In the
> Unicode driver, I think we should be using the Unicode *W variants of
> those functions, otherwise we cannot handle characters that don't have a
> representation in the current system codepage.
>
> 2. Even if all the characters can be represented in the system codepage,
> when built as a Unicode driver, we internally pass all strings as UTF-8
> encoded char[] arrays, and convert between UTF-8 and UCS-2 in the
> wrapper functions in odbcapiw.c. We also do that for the DSN name in
> SQLDriverConnectW(), but we then pass the UTF-8 encoded DSN name to the
> SQLGetPrivateProfileString() function to get the config options. That
> doesn't work, because SQLGetPrivateProfileString() expects the
> string to be encoded in the system codepage, not UTF-8. Again, we should
> be using the Unicode version, SQLGetPrivateProfileStringW().
>
> 3. We don't use the Unicode versions of the GUI functions, like
> GetDlgItemText(), when dealing with the configuration dialog. That again
> means that the GUI cannot handle characters outside the system codepage,
> but we also don't convert the strings to UTF-8 like we do to strings
> coming through SQLDriverConnectW() and other API functions, so there's
> another mismatch.
>
> Attached patch fixes those issues, allowing you to use any
> Unicode characters in the DSN name, or in any other configuration field,
> with the Unicode driver.
>
> This changes the behavior of how username and password are handled in
> the Unicode driver. Without this patch, the username is read from the
> registry in the system codepage, and also sent as such to the server.
> After the patch, it's always sent to the server in UTF-8. I think that's
> more sane behavior, but there's a small chance of breaking existing
> installations that depend on the old behavior. So we probably should
> include this patch when we bump the major version number to 9.4.
>
> - Heikki
On 06/23/2014 11:58 PM, Inoue, Hiroshi wrote:
> (2014/06/21 20:37), Heikki Linnakangas wrote:
>> If you try to create a data source with a name that contains non-ASCII
>> characters, funny things will happen. I wouldn't expect the ANSI driver
>> to support that, but a Unicode driver ought to handle it.
>
> Currently non-ASCII characters are not recommended because they are
> mainly used at connection time.

Note that the DSN name is never sent to the server. Even if we conclude that we want to keep the behavior of username, password and database as is, we should still allow the DSN name to contain any characters.

> Though the Unicode version of SQLDriverConnect
> uses UTF-8 encoded user, password, database ... because I can't think of
> other ways, it has little meaning IMHO. Was there a decision that
> the encoding of user, password or database is UTF-8?

Not sure what you mean. There have been no changes in the server around this. The server just treats the username, password and database as raw bytes. Which is unfortunate, but we'll just have to deal with it in the driver.

The question is, what encoding should we use to send the username, password and database to the server?

1. Current behavior: The username, password and database are encoded using the current Windows ANSI codepage. If there are characters that cannot be encoded using the ANSI codepage, Windows will replace them with ?.

2. Behavior with the patch: The username, password and database are always encoded using UTF-8, when using the Unicode driver.

Both behaviors have pros and cons. If you assume that the server uses UTF-8, and the client uses the Unicode driver and is fully Unicode-enabled, then the patched behavior is clearly better. With the current behavior, if e.g. the username contains any non-ASCII characters, you cannot connect.

But if you assume that the server is not using UTF-8, but LATIN1 for example, and the client uses the Unicode driver, then the current behavior is better. It will allow the client to connect, assuming that the Windows ANSI codepage is set to LATIN1, while with the patch it will not work. However, if the server and client both use LATIN1 rather than Unicode/UTF-8, then you probably should be using the ANSI driver instead.

Overall, I think the patched behavior is better. If we want to make it really flexible, we could add a new parameter to explicitly specify the encoding used for username, password and database. Then you could connect to any database with the Unicode driver, as long as you set the parameter correctly.

- Heikki
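To illustrate why the choice matters when the server compares raw bytes, here is a small sketch that encodes the same username both ways. The function names are hypothetical, and ISO-8859-1 stands in for the Windows ANSI codepage as a simplifying assumption; real Windows conversion has its own replacement and best-fit rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the "current behavior": encode code points
 * as ISO-8859-1, replacing anything unrepresentable with '?'.
 * Returns the number of bytes written, excluding the terminator.
 */
size_t encode_latin1(const uint32_t *cps, size_t ncps,
                     unsigned char *out, size_t outlen)
{
    size_t i;

    assert(cps != NULL && out != NULL && outlen > ncps);

    for (i = 0; i < ncps; i++)
        out[i] = (cps[i] <= 0xFF) ? (unsigned char) cps[i] : (unsigned char) '?';
    out[ncps] = 0;
    return ncps;
}

/*
 * Hypothetical stand-in for the "patched behavior": encode code points
 * as UTF-8. Assumes every input is a valid code point (<= 0x10FFFF).
 * Returns the number of bytes written, excluding the terminator.
 */
size_t encode_utf8(const uint32_t *cps, size_t ncps,
                   unsigned char *out, size_t outlen)
{
    size_t i, n = 0;

    /* Worst case is 4 bytes per code point plus the terminator. */
    assert(cps != NULL && out != NULL && outlen > 4 * ncps);

    for (i = 0; i < ncps; i++)
    {
        uint32_t cp = cps[i];

        if (cp < 0x80)
            out[n++] = (unsigned char) cp;
        else if (cp < 0x800)
        {
            out[n++] = (unsigned char) (0xC0 | (cp >> 6));
            out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out[n++] = (unsigned char) (0xE0 | (cp >> 12));
            out[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else
        {
            out[n++] = (unsigned char) (0xF0 | (cp >> 18));
            out[n++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
    }
    out[n] = 0;
    return n;
}
```

For a username like "josé" (code points 'j', 'o', 's', U+00E9), the first function produces four bytes ending in 0xE9, while the second produces five bytes ending in 0xC3 0xA9. To a server that treats the name as raw bytes, those are two different users, so whichever encoding the role was created with is the only one that will connect.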