Thread: Lexing with different charsets
I've spent some more time reading specs today. Together with Peter E's explanation (Thanks!) I think I've got a fairly good understanding of the parts talking about locales now.

My next question is about lexing. The spec says that one can use strings of different charsets in the queries, like:

    ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

I can see that the lexer either needs to be taught about all the different charsets or this is not going to work very well.

What if one wants to include a string in utf-16 in the query? The lexer cannot handle that without understanding utf-16. The query can also be in different charsets. If it's in utf-8, for example, then we cannot embed latin1 strings and still have a validating utf-8 query. With the above we cannot think of the query as being in a single charset anymore. That's strange but okay I guess.

The new wire protocol allows us to send data separately from the query, which is nice, but the standard talked about strings as above so it's not a solution to the problem.

Maybe I should have addressed this to Peter directly :-)

--
/Dennis Björklund
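The point about a UTF-8 query no longer validating once latin1 bytes are embedded can be illustrated with a small well-formedness check. This is only a sketch: `is_valid_utf8` is a hypothetical helper written for this note, not a function from PostgreSQL or any library.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* decoded code point, used for range checks */

        if (c < 0x80) {                     /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* 2-byte sequence */
            need = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {    /* 3-byte sequence */
            need = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {    /* 4-byte sequence */
            need = 3; cp = c & 0x07;
        } else {
            return false;                   /* stray continuation or bad lead byte */
        }

        if (len - i - 1 < need)
            return false;                   /* sequence is truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;               /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;

        i += need + 1;
    }
    return true;
}
```

With this check, a query whose `_utf8'Åäö'` literal is encoded as UTF-8 passes, while the same query carrying the latin1 bytes 0xC5 0xE4 0xF6 inside `_latin1'...'` fails, because 0xC5 looks like a two-byte lead and 0xE4 is not a continuation byte.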
Dennis Bjorklund <db@zigo.dhs.org> writes:
> My next question is about lexing. The spec says that one can use strings
> of different charsets in the queries, like:
> ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> I can see that the lexer either needs to be taught about all the
> different charsets or this is not going to work very well.

Yeah. I'm not sure that we're ever going to support that part of the spec; doing so would break too many useful things without adding very much useful functionality.

We could possibly do it if we restrict to ASCII-superset character sets (not UTF-16 for instance), so that the string quoting boundaries can be found without hardwired knowledge about every character set.

regards, tom lane
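A rough sketch of why the ASCII-superset restriction helps: if byte 0x27 can only ever mean an apostrophe (as in latin1 or UTF-8), the end of a literal can be found by scanning bytes, with no per-charset knowledge. The function name `find_closing_quote` is hypothetical, and the sketch assumes SQL-style doubled quotes only; backslash escapes are deliberately ignored.

```c
#include <stddef.h>

/*
 * Hypothetical helper. buf[start] is assumed to be the opening quote.
 * Returns the index of the closing quote, or (size_t)-1 if the literal
 * is unterminated. A doubled quote ('') inside the literal is treated
 * as an escaped quote, as in SQL.
 *
 * Assumption: byte 0x27 never occurs as part of another character.
 * That holds for latin1 and UTF-8 but not for UTF-16.
 */
size_t find_closing_quote(const unsigned char *buf, size_t len, size_t start)
{
    size_t i = start + 1;

    while (i < len) {
        if (buf[i] == 0x27) {
            if (i + 1 < len && buf[i + 1] == 0x27) {
                i += 2;             /* escaped quote, keep scanning */
                continue;
            }
            return i;               /* closing quote */
        }
        i++;
    }
    return (size_t)-1;              /* ran off the end of the buffer */
}
```

The same scan goes wrong on UTF-16 input. For instance, U+0127 (ħ) in UTF-16LE is the byte pair 0x27 0x01, so the scanner reports a closing quote in the middle of that character.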
Tom Lane wrote:
> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > My next question is about lexing. The spec says that one can use
> > strings of different charsets in the queries, like:
> > ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> > I can see that the lexer either needs to be taught about all the
> > different charsets or this is not going to work very well.
>
> Yeah. I'm not sure that we're ever going to support that part of the
> spec; doing so would break too many useful things without adding very
> much useful functionality.

Like what?

I think it could be fairly useful. We would have to restrict ourselves to character sets that are supersets of ASCII, but there are boatloads of reasons to do that besides this issue.
On Tue, 13 Apr 2004, Tom Lane wrote:
> We could possibly do it if we restrict to ASCII-superset character sets
> (not UTF-16 for instance), so that the string quoting boundaries can be
> found without hardwired knowledge about every character set.

It's a reasonable compromise I guess. One can still support utf-16 and others using the new wire protocol and maybe with some escaping extension like:

    _utf16 H'a42a1121311'

where H would be a way to form a string from hex-encoded bytes (or using the same as for bytea, or whatever).

It's a problem for the future.

--
/Dennis Björklund
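To make the H'...' idea concrete, here is a minimal sketch of how the hex digits between the quotes could be turned into raw bytes. The H'...' syntax is only a proposal in this thread, and `decode_hex` and `hex_value` are hypothetical names. The sketch assumes two hex digits per byte, so a digit string of odd length is rejected.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode hex_len hex digits into out.
 * Returns the number of bytes written, or -1 if the input has an odd
 * number of digits, contains a non-hex character, or does not fit in
 * out_cap bytes.
 */
long decode_hex(const char *hex, size_t hex_len,
                unsigned char *out, size_t out_cap)
{
    if (hex_len % 2 != 0 || hex_len / 2 > out_cap)
        return -1;

    for (size_t i = 0; i < hex_len; i += 2) {
        int hi = hex_value((unsigned char) hex[i]);
        int lo = hex_value((unsigned char) hex[i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (unsigned char) ((hi << 4) | lo);
    }
    return (long) (hex_len / 2);
}
```

Because the literal itself then contains only ASCII digits and letters, the query text stays in a single encoding and the lexer never sees the UTF-16 bytes directly.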
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane wrote:
>> Yeah. I'm not sure that we're ever going to support that part of the
>> spec; doing so would break too many useful things without adding very
>> much useful functionality.

> Like what?

The first things that came to mind were losing psql's ability to tell what's a literal, losing the existing capability for queries to be translated from client-side to server-side character set, and losing the capability to have character sets defined by plug-in extensions rather than being hard-wired into the lexer. (Before you claim that the last is easily solved, consider that the lexer is not allowed to do database accesses.)

> I think it could be fairly useful. We would have to
> restrict ourselves to character sets that are supersets of ASCII, but
> there are boatloads of reasons to do that besides this issue.

If we do that then some of the problems go away, but I'm not sure they all do. Are you willing to drop support for non-ASCII-superset character sets on the client side as well as the server?

regards, tom lane
> I've spent some more time reading specs today. Together with Peter E's
> explanation (Thanks!) I think I've got a fairly good understanding of the
> parts talking about locales now.
>
> My next question is about lexing. The spec says that one can use strings
> of different charsets in the queries, like:
>
> ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

In my understanding this was removed as of SQL:1999. I'm not sure about SQL:2003 though.

--
Tatsuo Ishii
On Wed, 14 Apr 2004, Tatsuo Ishii wrote:
> > I've spent some more time reading specs today. Together with Peter E's
> > explanation (Thanks!) I think I've got a fairly good understanding of the
> > parts talking about locales now.
> >
> > My next question is about lexing. The spec says that one can use strings
> > of different charsets in the queries, like:
> >
> > ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
>
> In my understanding this was removed as of SQL:1999. I'm not sure
> about SQL:2003 though.

AFAICS, it still basically has:

    <character string literal> ::=
        [ <introducer><character set specification> ]
        <quote> [ <character representation>... ] <quote>
        [ { <separator> <quote> [ <character representation>... ] <quote> }... ]
> My next question is about lexing. The spec says that one can use strings
> of different charsets in the queries, like:
>
> ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
>
> different charsets or this is not going to work very well.

Sorry for this maybe stupid question about a must-be-obvious hidden rationale behind this feature:

What "editor" or terminal is supposed to be able to generate text in different encodings depending on the part of the sentence? I don't think I have that in emacs. Or is it irrelevant??

I cannot see where I could use such a feature.

--
Fabien Coelho - coelho@cri.ensmp.fr
On Wed, 14 Apr 2004, Fabien COELHO wrote:
> > ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> >
> > different charsets or this is not going to work very well.
>
> What "editor" or terminal is supposed to be able to generate text in
> different encodings depending on the part of the sentence? I don't think I
> have that in emacs. Or is it irrelevant??
>
> I cannot see where I could use such a feature.

Applications usually generate queries. So you can do things like

    printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

For use on the terminal one would need to use some escaping/encoding much like is done with bytea. For example something like

    _latin1 H'0a660d'

(but that is not in the SQL standard).

--
/Dennis Björklund
> > I cannot see where I could use such a feature.
>
> Applications usually generate queries.

Sure.

> So you can do things like
>
> printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

Hmmm... I guess the following was too complicated. You need a library for conversion. You need to take care of conversions.

    printf("SELECT * FROM foo WHERE field1 = '%s'", latin1_to_database_encoding(...));

Well, so this is a great new useful feature indeed, that will help improve the lexer code a lot ;-)

Good luck,

--
Fabien Coelho - coelho@cri.ensmp.fr
On Wed, 14 Apr 2004, Fabien COELHO wrote:
> printf("SELECT * FROM foo WHERE field1 = '%s'",
> latin1_to_database_encoding(...));

And how do you do this if the database encoding is latin2? You cannot, in general, convert latin1 to latin2.

The specification was written like this to handle things like latin1 strings in latin2 databases, or latin1 in a database that otherwise only uses ascii. The intention is good, but the specification is not perfect by any means.

--
/Dennis Björklund