Thread: Lexing with different charsets

Lexing with different charsets

From

Dennis Bjorklund

Date:

13 April 2004, 13:57:55

I've spent some more time reading specs today. Together with Peter E's
explanation (Thanks!) I think I've got a fairly good understanding of the
parts talking about locales now.

My next question is about lexing. The spec says that one can use strings 
of different charsets in the queries, like:
 ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

I can see that the lexer either needs to be taught about all the
different charsets or this is not going to work very well.

What if one wants to include a string in utf-16 in the query, the lexer
can not handle that without understanding utf-16. The query can also be in
different charsets. If it's in utf-8 for example, then we can not embed
latin1 strings and still have a validating utf-8 query. With the above we
can not think of the query as being in a single charset anymore. That's 
strange but okay I guess.

The new wire protocol allows us to send data separately from the query
which is nice, but the standard talked about strings as above so it's not
a solution to the problem.

Maybe I should have addressed this to Peter directly :-)

-- 
/Dennis Björklund

Re: Lexing with different charsets

From

Tom Lane

Date:

13 April 2004, 15:27:33

Dennis Bjorklund <db@zigo.dhs.org> writes:
> My next question is about lexing. The spec says that one can use strings 
> of different charsets in the queries, like:
>   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> I can see that the lexer either needs to be taught about all the
> different charsets or this is not going to work very well.

Yeah.  I'm not sure that we're ever going to support that part of the
spec; doing so would break too many useful things without adding very
much useful functionality.

We could possibly do it if we restrict to ASCII-superset character sets
(not UTF-16 for instance), so that the string quoting boundaries can be
found without hardwired knowledge about every character set.
        regards, tom lane
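
[Editor's sketch] The restriction to ASCII-superset character sets can be
illustrated with a byte-wise scan for the end of a string literal. This is a
minimal sketch, not PostgreSQL's lexer: the function name find_literal_end is
hypothetical, and it assumes that in the literal's encoding the byte 0x27
never occurs except as the quote character (true of UTF-8 and the Latin-N
sets, not of UTF-16).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from any real lexer. buf[0] must be the
 * opening quote. Returns the offset of the closing quote, or -1 if the
 * literal is unterminated. A doubled quote ('') counts as an escaped quote.
 * All other bytes are skipped without being interpreted, which is only
 * safe when the encoding never uses byte 0x27 inside another character.
 */
long find_literal_end(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0 || buf[0] != '\'')
        return -1;
    for (size_t i = 1; i < len; i++) {
        if (buf[i] != '\'')
            continue;
        if (i + 1 < len && buf[i + 1] == '\'') {
            i++;            /* skip the second quote of an escaped pair */
            continue;
        }
        return (long) i;    /* unescaped quote: end of literal */
    }
    return -1;
}
```

With UTF-8 or Latin-1 content the scan finds the true closing quote without
knowing which of the two encodings it is looking at. With UTF-16LE content it
can stop early, because a character such as U+2741 is encoded as the bytes
0x41 0x27.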

Re: Lexing with different charsets

From

Peter Eisentraut

Date:

13 April 2004, 16:18:59

Tom Lane wrote:
> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > My next question is about lexing. The spec says that one can use
> > strings of different charsets in the queries, like:
> >   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> > I can see that the lexer either needs to be taught about all the
> > different charsets or this is not going to work very well.
>
> Yeah.  I'm not sure that we're ever going to support that part of the
> spec; doing so would break too many useful things without adding very
> much useful functionality.

Like what?  I think it could be fairly useful.  We would have to 
restrict ourselves to character sets that are supersets of ASCII, but 
there are boatloads of reasons to do that besides this issue.

Re: Lexing with different charsets

From

Dennis Bjorklund

Date:

13 April 2004, 16:22:04

On Tue, 13 Apr 2004, Tom Lane wrote:

> We could possibly do it if we restrict to ASCII-superset character sets
> (not UTF-16 for instance), so that the string quoting boundaries can be
> found without hardwired knowledge about every character set.

It's a reasonable compromise I guess. One can still support utf-16 and
others using the new wire protocol and maybe with some escaping extension
like:
_utf16 H'a42a1121311'

where H would be a way to form a string from hexencoded bytes (or 
using the same as for bytea, or whatever). It's a problem for the future.

-- 
/Dennis Björklund
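
[Editor's sketch] Decoding the body of a hex-encoded literal like the
proposed H'...' form could look roughly as follows. The H'...' syntax is only
a suggestion in this thread, and the names hex_value and hex_decode are
hypothetical; the sketch assumes two hex digits per byte and rejects
anything else.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical decoder for the text between the quotes of H'...'.
 * Writes the decoded bytes to out and returns how many were written.
 * Returns -1 if the digit count is odd, a non-hex character appears,
 * or the result does not fit in cap bytes.
 */
long hex_decode(const char *hex, size_t hex_len,
                unsigned char *out, size_t cap)
{
    assert(hex != NULL && out != NULL);
    if (hex_len % 2 != 0 || hex_len / 2 > cap)
        return -1;
    for (size_t i = 0; i < hex_len; i += 2) {
        int hi = hex_value(hex[i]);
        int lo = hex_value(hex[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (unsigned char) ((hi << 4) | lo);
    }
    return (long) (hex_len / 2);
}
```

Because the literal body is plain ASCII hex digits, the lexer never sees the
raw UTF-16 bytes; the decoded bytes would be interpreted in the named
character set afterwards. Under the two-digits-per-byte assumption an
odd-length body is rejected rather than guessed at.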

Re: Lexing with different charsets

From

Tom Lane

Date:

13 April 2004, 16:32:50

Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane wrote:
>> Yeah.  I'm not sure that we're ever going to support that part of the
>> spec; doing so would break too many useful things without adding very
>> much useful functionality.

> Like what?

The first things that came to mind were losing psql's ability to tell
what's a literal, losing the existing capability for queries to be
translated from client-side to server-side character set, and losing the
capability to have character sets defined by plug-in extensions rather
than being hard-wired into the lexer.  (Before you claim that the last
is easily solved, consider that the lexer is not allowed to do database
accesses.)

> I think it could be fairly useful.  We would have to 
> restrict ourselves to character sets that are supersets of ASCII, but 
> there are boatloads of reasons to do that besides this issue.

If we do that then some of the problems go away, but I'm not sure they
all do.  Are you willing to drop support for non-ASCII-superset
character sets on the client side as well as the server?
        regards, tom lane

Re: Lexing with different charsets

From

Tatsuo Ishii

Date:

13 April 2004, 22:18:27

> I've spent some more time reading specs today. Together with Peter E's
> explanation (Thanks!) I think I've got a fairly good understanding of the
> parts talking about locales now.
>
> My next question is about lexing. The spec says that one can use strings
> of different charsets in the queries, like:
>
>   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

In my understanding this was removed as of SQL:1999. I'm not sure
about SQL:2003 though.
--
Tatsuo Ishii

> I can see that the lexer either needs to be taught about all the
> different charsets or this is not going to work very well.
>
> What if one wants to include a string in utf-16 in the query, the lexer
> can not handle that without understanding utf-16. The query can also be in
> different charsets. If it's in utf-8 for example, then we can not embed
> latin1 strings and still have a validating utf-8 query. With the above we
> can not think of the query as being in a single charset anymore. That's
> strange but okay I guess.
>
> The new wire protocol allows us to send data separately from the query
> which is nice, but the standard talked about strings as above so it's not
> a solution to the problem.
>
> Maybe I should have addressed this to Peter directly :-)
>
> --
> /Dennis Björklund

Re: Lexing with different charsets

From

Stephan Szabo

Date:

13 April 2004, 23:30:23

On Wed, 14 Apr 2004, Tatsuo Ishii wrote:

> > I've spent some more time reading specs today. Together with Peter E's
> > explanation (Thanks!) I think I've got a fairly good understanding of the
> > parts talking about locales now.
> >
> > My next question is about lexing. The spec says that one can use strings
> > of different charsets in the queries, like:
> >
> >   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
>
> In my understanding this was removed as of SQL:1999. I'm not sure
> about SQL:2003 though.

AFAICS, it still basically has:
<character string literal> ::=
[ <introducer><character set specification> ]
<quote> [ <character representation>... ] <quote>
[ { <separator> <quote> [ <character representation>... ] <quote> }... ]

Re: Lexing with different charsets

From

Fabien COELHO

Date:

14 April 2004, 04:33:45

> My next question is about lexing. The spec says that one can use strings
> of different charsets in the queries, like:
>
>   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
>
> different charsets or this is not going to work very well.

Sorry for this maybe stupid question about a must-be-obvious hidden
rationale behind this feature:

What "editor" or terminal is supposed to be able to generate text in
different encodings depending on the part of the sentence? I don't think I
have that in emacs. Or is it irrelevant??

I cannot see where I could use such a feature.

--
Fabien Coelho - coelho@cri.ensmp.fr

Re: Lexing with different charsets

From

Dennis Bjorklund

Date:

14 April 2004, 04:58:33

On Wed, 14 Apr 2004, Fabien COELHO wrote:

> >   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
> >
> > different charsets or this is not going to work very well.
> 
> What "editor" or terminal is supposed to be able to generate text in
> different encodings depending on the part of the sentence? I don't think I
> have that in emacs. Or is it irrelevant??
> 
> I cannot see where I could use such a feature.

Applications usually generate queries. So you can do things like

printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

For use on the terminal one would need some escaping/encoding, much
like what is done with bytea. For example something like _latin1 H'0a660d' (but
that is not sql-standard).

-- 
/Dennis Björklund
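
[Editor's sketch] The printf above pastes the data into the query unchanged,
so a quote byte inside the data would end the literal early. A minimal sketch
of building the literal with quote doubling follows. The name format_literal
is hypothetical, and the code assumes the data is NUL-terminated and in an
ASCII-superset encoding where byte 0x27 always means a quote character (so it
would not work for UTF-16 data).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: writes introducer, an opening quote, the data with
 * every quote byte doubled, and a closing quote into out. Returns the
 * length written (not counting the terminating NUL), or -1 if out might
 * be too small for the worst case.
 */
long format_literal(const char *introducer, const char *data,
                    char *out, size_t cap)
{
    assert(introducer != NULL && data != NULL && out != NULL);
    size_t intro_len = strlen(introducer);
    size_t n;

    /* Worst case: every data byte doubled, plus two quotes and a NUL. */
    if (intro_len + 2 * strlen(data) + 3 > cap)
        return -1;

    memcpy(out, introducer, intro_len);
    n = intro_len;
    out[n++] = '\'';
    for (const char *p = data; *p != '\0'; p++) {
        if (*p == '\'')
            out[n++] = '\'';    /* escape by doubling */
        out[n++] = *p;
    }
    out[n++] = '\'';
    out[n] = '\0';
    return (long) n;
}
```

For example, format_literal("_latin1", "Foo'Bar", buf, sizeof buf) produces
_latin1'Foo''Bar', which can then be placed into the query text. Sending the
data separately over the wire protocol, as mentioned earlier in the thread,
avoids the quoting question altogether.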

Re: Lexing with different charsets

From

Fabien COELHO

Date:

14 April 2004, 05:28:54

> > I cannot see where I could use such a feature.
>
> Applications usually generate queries.

Sure.

> So you can do things like
>
> printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

Hmmm... I guess the following was too complicated. You need a library
for conversion. You need to take care of conversions.

printf("SELECT * FROM foo WHERE field1 = '%s'",
       latin1_to_database_encoding(...));


Well, so this is a great new useful feature indeed, that will help improve
the lexer code a lot;-)

Good luck,

-- 
Fabien Coelho - coelho@cri.ensmp.fr

Re: Lexing with different charsets

From

Dennis Bjorklund

Date:

14 April 2004, 05:36:25

On Wed, 14 Apr 2004, Fabien COELHO wrote:

> printf("SELECT * FROM foo WHERE field1 = '%s'",
>        latin1_to_database_encoding(...));

And how do you do this if the database encoding is latin2? You can not 
convert latin1 to latin2 without loss, since latin1 has characters (such 
as Å) that latin2 lacks.

The specification was written like this to handle things like latin1 
strings in latin2 databases, or latin1 in a database that otherwise 
only uses ascii.

The intention is good, but the specification is not perfect in any way.

-- 
/Dennis Björklund
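
[Editor's sketch] A conversion step like latin1_to_database_encoding(...)
always succeeds only when the target encoding covers every Latin-1 character.
UTF-8 is such a target, and the conversion is short enough to show in full.
The function name latin1_to_utf8 is hypothetical. No equivalent total mapping
exists for a Latin-2 target, which is the case raised above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert len Latin-1 bytes to UTF-8. Each Latin-1 byte equals its Unicode
 * code point, so bytes below 0x80 are copied and the rest become a
 * two-byte UTF-8 sequence. Returns the number of bytes written, or -1 if
 * out (capacity cap) is too small.
 */
long latin1_to_utf8(const unsigned char *in, size_t len,
                    unsigned char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            if (n + 1 > cap)
                return -1;
            out[n++] = b;
        } else {
            if (n + 2 > cap)
                return -1;
            out[n++] = (unsigned char) (0xC0 | (b >> 6));    /* lead byte */
            out[n++] = (unsigned char) (0x80 | (b & 0x3F));  /* trail byte */
        }
    }
    return (long) n;
}
```

The Latin-1 bytes 0xC5 0xE4 0xF6 ("Åäö") come out as the six UTF-8 bytes
C3 85 C3 A4 C3 B6. A Latin-2 target would need a lookup table with a failure
case for every character it cannot represent.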