Thread: This approach to non-ASCII names does not work

From
Tom Lane
Date:
openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
Tom Lane wrote:
> openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
> openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
> make: *** [html] Error 1

Wow, our documentation character set is "ISO-8859-1":

    CONTENT="text/html; charset=ISO-8859-1"

Should we change it to UTF8?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: This approach to non-ASCII names does not work

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set

> Wow, our documentation character set is "ISO-8859-1":
>     CONTENT="text/html; charset=ISO-8859-1"
> Should we change it to UTF8?

I'm betting you should change those numbers from octal to decimal,
actually.

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
>
> > Wow, our documentation character set is "ISO-8859-1":
> >     CONTENT="text/html; charset=ISO-8859-1"
> > Should we change it to UTF8?
>
> I'm betting you should change those numbers from octal to decimal,
> actually.

Those numbers are decimal, but certainly cannot be represented in
ISO-8859-1.  They are multi-byte, one is Turkish.
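For reference, a small Python sketch (not part of the original exchange) of what the two numbers in the error messages denote when read as decimal Unicode code points, and why ISO-8859-1 has no slot for them. The helper names `describe` and `fits_latin1` are made up for illustration.

```python
import unicodedata


def describe(number: int) -> str:
    """Return the character and Unicode name for a decimal code point."""
    ch = chr(number)
    return f"{number} = U+{number:04X} {ch} ({unicodedata.name(ch)})"


def fits_latin1(number: int) -> bool:
    """True if the code point can be encoded as a single ISO-8859-1 byte."""
    try:
        chr(number).encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


for n in (353, 305):
    print(describe(n), "| fits ISO-8859-1:", fits_latin1(n))

# Read as octal instead, the same digits would land inside Latin 1.
for digits in ("353", "305"):
    print(digits, "as octal is decimal", int(digits, 8))
```

Both values are above 255, which is the highest code point ISO-8859-1 can represent.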

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> >
> > > Wow, our documentation character set is "ISO-8859-1":
> > >     CONTENT="text/html; charset=ISO-8859-1"
> > > Should we change it to UTF8?
> >
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
>
> Those numbers are decimal, but certainly cannot be represented in
> ISO-8859-1.  They are multi-byte, one is Turkish.

Actually, I got the codes from here:

    http://www.pemberley.com/janeinfo/latin1.html#latexta

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: This approach to non-ASCII names does not work

From
Peter Eisentraut
Date:
Tom Lane wrote:
> I'm betting you should change those numbers from octal to decimal,
> actually.

I suggest using named entities like &uuml;.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Tom Lane wrote:
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
>
> I suggest using named entities like &uuml;.

Yes, I use them where possible.  I use:


    http://www.mountaindragon.com/html/iso.htm

for named cases, but for the ones that don't have names, I have to use
UTF8 numbers:

    http://www.pemberley.com/janeinfo/latin1.html#latexta

The case that I needed was "Latin Small Letter Dotless I", which has no
name on the first URL.

The unusual thing is that though our docs web pages use a stated
encoding as ISO-8859-1, the UTF8 number does generate the proper symbol
in my browser (Mozilla), so I wonder if >255 codes are assumed to be
UTF8.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: This approach to non-ASCII names does not work

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Yes, I use them where possible.  I use:
>     http://www.mountaindragon.com/html/iso.htm

... which says right on it that it considers only ISO 8859/1 and is not
a complete list even of that set.

I assume that somewhere there is a Web-related spec of the widely
recognized entity names, but I see no reason to suppose that this list
is it.  Something at w3c, say, would have a tad more credibility.

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Yes, I use them where possible.  I use:
> >     http://www.mountaindragon.com/html/iso.htm
>
> ... which says right on it that it considers only ISO 8859/1 and is not
> a complete list even of that set.
>
> I assume that somewhere there is a Web-related spec of the widely
> recognized entity names, but I see no reason to suppose that this list
> is it.  Something at w3c, say, would have a tad more credibility.

Maybe this:

http://www.w3.org/TR/html4/sgml/entities.html

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: This approach to non-ASCII names does not work

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I assume that somewhere there is a Web-related spec of the widely
>> recognized entity names, but I see no reason to suppose that this list
>> is it.  Something at w3c, say, would have a tad more credibility.

> Maybe this:
> http://www.w3.org/TR/html4/sgml/entities.html

Also, I just found this in the XHTML 1.0 spec:
http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I assume that somewhere there is a Web-related spec of the widely
> >> recognized entity names, but I see no reason to suppose that this list
> >> is it.  Something at w3c, say, would have a tad more credibility.
>
> > Maybe this:
> > http://www.w3.org/TR/html4/sgml/entities.html
>
> Also, I just found this in the XHTML 1.0 spec:
> http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

Neither seems to list a "dotless i" :-(

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: This approach to non-ASCII names does not work

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Interesting, I found this for that character:
>     http://www.fileformat.info/info/unicode/char/0131/index.htm
> Turns out that number is the right entity.  Seems they have numbers that
> match UTF16/UTF32 values.  So are we OK?

No, we are not, because the docs don't build for anyone who has pickier
SGML tools than the ancient laissez-faire toolchain you seem to be using.
HEAD currently gives me

openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:ddl.sgml:2581:51:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2646:39:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2706:52:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2848:8:E: end tag for "SECT2" omitted, but OMITTAG NO was specified
openjade:ddl.sgml:2317:3: start tag was here
openjade:release.sgml:572:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1091:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1091:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

I don't believe in ignoring compiler warnings, and I don't believe in
ignoring these problems either.
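One way to catch such references before running the build is a small scan of the SGML source. The following is only a sketch in Python, assuming that decimal references of the form `&#NNN;` are the only kind in use; the function name `find_out_of_range_refs` and the sample text are hypothetical, and the columns it reports are 1-based offsets of the ampersand, which need not match the positions openjade prints.

```python
import re

# Matches decimal numeric character references such as &#353;
NUMERIC_REF = re.compile(r"&#(\d+);")


def find_out_of_range_refs(text: str, limit: int = 255):
    """Yield (line_number, column, value) for references above `limit`."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in NUMERIC_REF.finditer(line):
            value = int(match.group(1))
            if value > limit:
                yield line_number, match.start() + 1, value


# Hypothetical sample input, not taken from release.sgml.
sample = "name with &#353; here\nand &#305; twice &#305;\nplain &#233; is fine\n"

for hit in find_out_of_range_refs(sample):
    print("line %d, column %d: &#%d; is outside Latin 1" % hit)
```

References at or below 255, such as `&#233;`, are left alone because they are within Latin 1.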

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Tom Lane
Date:
The HTML specs do include the other character at issue:

<!ENTITY scaron  "&#353;"> <!--  latin small letter s with caron,
                                    U+0161 ISOlat2 -->

I suggest we use that where needed and spell dotless i as plain i.
(Sorry, Volkan :-( ... but your beef is with the HTML standards
not us.)
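As an illustration only (this is not how the release notes are maintained), Python's standard `html.entities.name2codepoint` table, which covers the HTML 4 entity names, can be used to check whether a character has a named entity. The helper `html4_entity_for` is a made-up name.

```python
from html.entities import name2codepoint


def html4_entity_for(ch: str):
    """Return an HTML 4 named entity for `ch`, or None if there is none."""
    for name, codepoint in name2codepoint.items():
        if codepoint == ord(ch):
            return f"&{name};"
    return None


print(html4_entity_for("\u0161"))  # latin small letter s with caron
print(html4_entity_for("\u0131"))  # latin small letter dotless i
```

The first lookup finds scaron; the second finds nothing, consistent with the dotless i being absent from the HTML 4 entity list.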

            regards, tom lane

Re: This approach to non-ASCII names does not work

From
Peter Eisentraut
Date:
Bruce Momjian wrote:
> The unusual thing is that though our docs web pages use a stated
> encoding as ISO-8859-1, the UTF8 number does generate the proper
> symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> to be UTF8.

These are two different things.

A numeric character reference picks the numbered character from the
document character set.  The document character set is declared in the
document type declaration (and is therefore fixed by the standards
committee for all users).  The document character sets for commonly
used SGML applications are:

HTML 3.2        Latin 1 (ISO 646 + ECMA 94)
HTML 4+         UCS (ISO 10646)
XML             UCS (ISO 10646)
DocBook SGML    Latin 1 (ISO 646 + ECMA 94)

If a font is available, an HTML application (browser) should be able to
process (display) any character from the document character set,
whether it arrives in plain or as a character entity.

Conversely, a character not in the document character set, such as a
non-Latin-1 character in DocBook SGML, cannot be processed, strictly
speaking.

The other thing you are talking about is the character *encoding* which
specifies how the sequence of bytes that makes up the document is to be
interpreted.  Note that this happens before the document character set
is taken into consideration and is pretty much independent of it.  For
example, knowledge of the character encoding is necessary to find
the "&" that starts entities.  Not all character encodings are capable
of encoding all characters in the document character set, which is why
you need to use character entities to access characters outside the
encoding.
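The distinction can be sketched in Python, taking HTML 4 (document character set UCS) as the example. This is an illustration under that assumption, not a model of how openjade or any browser is implemented: the text is serialized in ISO-8859-1, a character the encoding cannot carry is written as a numeric character reference, and the reference is resolved only after the bytes have been decoded.

```python
import html

text = "dotless \u0131"

# Encoding step: ISO-8859-1 has no byte for U+0131, so fall back to a
# numeric character reference instead of raising an error.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# Parsing step: the bytes are decoded first, and only then is the
# reference resolved against the document character set (UCS here).
decoded = encoded.decode("iso-8859-1")
resolved = html.unescape(decoded)
print(resolved == text)
```

With a Latin 1 document character set, as in DocBook SGML, the second step has nothing to resolve `&#305;` against, which is the error reported at the top of the thread.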

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
Tom Lane wrote:
> The HTML specs do include the other character at issue:
>
> <!ENTITY scaron  "&#353;"> <!--  latin small letter s with caron,
>                                     U+0161 ISOlat2 -->

Release notes updated to use scaron.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: This approach to non-ASCII names does not work

From
Bruce Momjian
Date:
That makes a lot of sense.  The encoding mentioned in the HTML is how
high-bit characters are treated in the HTML, and doesn't control what
entities it supports.

However, I am confused how non-Latin users can use SGML if it does not
support UTF8 entities.  I see this flag in openjade:

      -b, --encoding=NAME         Use encoding NAME for output.

but I assume it is only for how to treat the high bits in the file, not
for entity recognition.

I IM'ed with Peter and he said DocBook SGML just doesn't support UTF8
easily, so I am reverting Volkan YAZICI's name to ASCII (he requested
an all-uppercase last name if we can't use the proper symbol).  I have
also documented that we can only use HTML4 entities, and updated the
URLs we should use for reference.  I have the official URL and URLs
that show the actual symbols too, which is helpful.

If people have names that contain HTML4 symbols, please let me know so I
can add the symbols:

    http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html

---------------------------------------------------------------------------

Peter Eisentraut wrote:
> Bruce Momjian wrote:
> > The unusual thing is that though our docs web pages use a stated
> > encoding as ISO-8859-1, the UTF8 number does generate the proper
> > symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> > to be UTF8.
>
> These are two different things.
>
> A numeric character reference picks the numbered character from the
> document character set.  The document character set is declared in the
> document type declaration (and is therefore fixed by the standards
> committee for all users).  The document character sets for commonly
> used SGML applications are:
>
> HTML 3.2        Latin 1 (ISO 646 + ECMA 94)
> HTML 4+         UCS (ISO 10646)
> XML             UCS (ISO 10646)
> DocBook SGML    Latin 1 (ISO 646 + ECMA 94)
>
> If a font is available, an HTML application (browser) should be able to
> process (display) any character from the document character set,
> whether it arrives in plain or as a character entity.
>
> Conversely, a character not in the document character set, such as a
> non-Latin-1 character in DocBook SGML, cannot be processed, strictly
> speaking.
>
> The other thing you are talking about is the character *encoding* which
> specifies how the sequence of bytes that makes up the document is to be
> interpreted.  Note that this happens before the document character set
> is taken into consideration and is pretty much independent of it.  For
> example, knowledge of the character encoding is necessary to find
> the "&" that starts entities.  Not all character encodings are capable
> of encoding all characters in the document character set, which is why
> you need to use character entities to access characters outside the
> encoding.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +