Thread: This approach to non-ASCII names does not work
openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

regards, tom lane
Tom Lane wrote:
> openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
> openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
> make: *** [html] Error 1

Wow, our documentation character set is "ISO-8859-1":

    CONTENT="text/html; charset=ISO-8859-1"

Should we change it to UTF8?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> Wow, our documentation character set is "ISO-8859-1":
> CONTENT="text/html; charset=ISO-8859-1"
> Should we change it to UTF8?

I'm betting you should change those numbers from octal to decimal, actually.

regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> >
> > Wow, our documentation character set is "ISO-8859-1":
> > CONTENT="text/html; charset=ISO-8859-1"
> > Should we change it to UTF8?
>
> I'm betting you should change those numbers from octal to decimal,
> actually.

Those numbers are decimal, but certainly cannot be represented in ISO-8859-1. They are multi-byte, one is Turkish.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
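Both readings of the rejected numbers are easy to check. The following is a minimal Python sketch, not part of the documentation build; the helper name `fits_latin1` is made up for illustration. It treats 353 and 305 first as decimal code points and then as octal digit strings.

```python
def fits_latin1(code_point):
    """Return True if the character with this code point exists in ISO-8859-1."""
    try:
        chr(code_point).encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Read as decimal, both numbers lie above 255, outside Latin-1.
for number in (353, 305):
    print(number, f"U+{number:04X}", chr(number), "in ISO-8859-1:", fits_latin1(number))

# Read as octal, the same digit strings would land inside Latin-1.
for digits in ("353", "305"):
    value = int(digits, 8)
    print(digits, "as octal is", value, chr(value), "in ISO-8859-1:", fits_latin1(value))
```

Read as decimal, the values are U+0161 and U+0131, which matches the point that they cannot be represented in ISO-8859-1.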
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> > >
> > > Wow, our documentation character set is "ISO-8859-1":
> > > CONTENT="text/html; charset=ISO-8859-1"
> > > Should we change it to UTF8?
> >
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
>
> Those numbers are decimal, but certainly cannot be represented in
> ISO-8859-1. They are multi-byte, one is Turkish.

Actually, I got the codes from here:

    http://www.pemberley.com/janeinfo/latin1.html#latexta

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> I'm betting you should change those numbers from octal to decimal,
> actually.

I suggest using named entities like &uuml;.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
> Tom Lane wrote:
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
>
> I suggest using named entities like &uuml;.

Yes, I use them where possible. I use:

    http://www.mountaindragon.com/html/iso.htm

for named cases, but for the ones that don't have names, I have to use UTF8 numbers:

    http://www.pemberley.com/janeinfo/latin1.html#latexta

The case that I needed was "Latin Small Letter Dotless I", which has no name on the first URL.

The unusual thing is that though our docs web pages state their encoding as ISO-8859-1, the UTF8 number does generate the proper symbol in my browser (Mozilla), so I wonder if >255 codes are assumed to be UTF8.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Yes, I use them where possible. I use:
> http://www.mountaindragon.com/html/iso.htm

... which says right on it that it considers only ISO 8859/1 and is not a complete list even of that set.

I assume that somewhere there is a Web-related spec of the widely recognized entity names, but I see no reason to suppose that this list is it. Something at w3c, say, would have a tad more credibility.

regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Yes, I use them where possible. I use:
> > http://www.mountaindragon.com/html/iso.htm
>
> ... which says right on it that it considers only ISO 8859/1 and is not
> a complete list even of that set.
>
> I assume that somewhere there is a Web-related spec of the widely
> recognized entity names, but I see no reason to suppose that this list
> is it. Something at w3c, say, would have a tad more credibility.

Maybe this:

    http://www.w3.org/TR/html4/sgml/entities.html

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I assume that somewhere there is a Web-related spec of the widely
>> recognized entity names, but I see no reason to suppose that this list
>> is it. Something at w3c, say, would have a tad more credibility.
> Maybe this:
> http://www.w3.org/TR/html4/sgml/entities.html

Also, I just found this in the XHTML 1.0 spec:

    http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I assume that somewhere there is a Web-related spec of the widely
> >> recognized entity names, but I see no reason to suppose that this list
> >> is it. Something at w3c, say, would have a tad more credibility.
> >
> > Maybe this:
> > http://www.w3.org/TR/html4/sgml/entities.html
>
> Also, I just found this in the XHTML 1.0 spec:
> http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

Neither seems to list a "dotless i" :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Bruce Momjian <bruce@momjian.us> writes:
> Interesting, I found this for that character:
> http://www.fileformat.info/info/unicode/char/0131/index.htm
> Turns out that number is the right entity. Seems they have numbers that
> match UTF16/UTF32 values. So are we OK?

No, we are not, because the docs don't build for anyone who has pickier SGML tools than the ancient laissez-faire toolchain you seem to be using. HEAD currently gives me

openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:ddl.sgml:2581:51:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2646:39:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2706:52:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2848:8:E: end tag for "SECT2" omitted, but OMITTAG NO was specified
openjade:ddl.sgml:2317:3: start tag was here
openjade:release.sgml:572:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1091:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1091:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

I don't believe in ignoring compiler warnings, and I don't believe in ignoring these problems either.

regards, tom lane
The HTML specs do include the other character at issue:

    <!ENTITY scaron "&#353;"> <!-- latin small letter s with caron, U+0161 ISOlat2 -->

I suggest we use that where needed and spell dotless i as plain i. (Sorry, Volkan :-( ... but your beef is with the HTML standards not us.)

regards, tom lane
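The same lookup can be done against Python's standard `html.entities` tables, which the Python documentation describes as the HTML 4 entity definitions. This is only an illustrative sketch; it checks that one table and says nothing about other entity sets.

```python
from html.entities import codepoint2name, name2codepoint

# &scaron; is a named HTML 4 entity for decimal 353 (U+0161).
print("scaron ->", name2codepoint["scaron"])
print("353    ->", codepoint2name.get(353))

# Decimal 305 (U+0131, dotless i) has no entry in the HTML 4 table.
print("305    ->", codepoint2name.get(305))
```

The last lookup returns `None`, consistent with neither W3C list offering a name for dotless i.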
Bruce Momjian wrote:
> The unusual thing is that though our docs web pages use a stated
> encoding as ISO-8859-1, the UTF8 number does generate the proper
> symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> to be UTF8.

These are two different things.

A numeric character reference picks the numbered character from the document character set. The document character set is declared in the document type declaration (and is therefore fixed by the standards committee for all users). The document character sets for commonly used SGML applications are:

    HTML 3.2        Latin 1 (ISO 646 + ECMA 94)
    HTML 4+         UCS (ISO 10646)
    XML             UCS (ISO 10646)
    DocBook SGML    Latin 1 (ISO 646 + ECMA 94)

If a font is available, an HTML application (browser) should be able to process (display) any character from the document character set, whether it arrives in plain or as a character entity.

Conversely, a character not in the document character set, such as a non-Latin-1 character in DocBook SGML, cannot be processed, strictly speaking.

The other thing you are talking about is the character *encoding* which specifies how the sequence of bytes that makes up the document is to be interpreted. Note that this happens before the document character set is taken into consideration and is pretty much independent of it. For example, knowledge of the character encoding is necessary to find the "&" that starts entities. Not all character encodings are capable of encoding all characters in the document character set, which is why you need to use character entities to access characters outside the encoding.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
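The two layers described above can be walked through in a few lines. This is a rough Python sketch under stated assumptions: the byte string is a made-up sample, and `html.unescape` follows browser-style HTML rules (UCS as the document character set), not DocBook SGML's.

```python
import html

# Made-up sample: pure ASCII bytes, so they are also valid ISO-8859-1.
source_bytes = b"Volkan Yaz&#305;c&#305;"

# Layer 1, encoding: turn the byte sequence into characters.
text = source_bytes.decode("iso-8859-1")

# Layer 2, document character set: resolve numeric references by number.
resolved = html.unescape(text)
print(resolved)

# The resolved character itself cannot be written in the byte encoding ...
try:
    resolved.encode("iso-8859-1")
    literal_ok = True
except UnicodeEncodeError:
    literal_ok = False
print("literal dotless i encodable in ISO-8859-1:", literal_ok)

# ... so it has to travel as a character reference again.
round_trip = resolved.encode("iso-8859-1", errors="xmlcharrefreplace")
print(round_trip)
```

This mirrors what a browser does with an HTML 4 page served as ISO-8859-1: the reference survives the byte encoding and is resolved afterwards. A processor whose document character set is Latin 1 has no character numbered 305 to resolve it to.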
Tom Lane wrote:
> The HTML specs do include the other character at issue:
>
> <!ENTITY scaron "&#353;"> <!-- latin small letter s with caron,
> U+0161 ISOlat2 -->

Release notes updated to use scaron.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
That makes a lot of sense. The encoding mentioned in the HTML is how high-bit characters are treated in the HTML, and doesn't control what entities it supports.

However, I am confused how non-Latin users can use SGML if it does not support UTF8 entities. I see this flag in openjade:

    -b, --encoding=NAME Use encoding NAME for output.

but I assume it is only for how to treat the high bits in the file, not for entity recognition.

I IM'ed with Peter and he said SGML Docbook just doesn't support UTF8 easily, so I am reverting Volkan YAZICI's name to be ASCII (he requested an all-uppercase last name if we can't use the proper symbol). I have also documented that we can only use HTML4 entities, and updated the URLs we should use for reference. I have the official URL and URLs that show the actual symbols too, which is helpful.

If people have names that contain HTML4 symbols, please let me know so I can add the symbols:

    http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
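A check along these lines could flag problem references before a strict toolchain does. This is a hypothetical Python sketch, not an existing project script: the function name `refs_outside_latin1` and the sample text are invented, only decimal references are handled, and the 255 limit assumes the Latin 1 document character set described for DocBook SGML in this thread.

```python
import re

# Decimal numeric character references such as &#305;
NUMERIC_REF = re.compile(r"&#(\d+);")


def refs_outside_latin1(sgml_text):
    """Return (line_number, code_point) for each decimal reference above 255."""
    hits = []
    for line_number, line in enumerate(sgml_text.splitlines(), start=1):
        for match in NUMERIC_REF.finditer(line):
            code_point = int(match.group(1))
            if code_point > 255:
                hits.append((line_number, code_point))
    return hits


# Invented sample: a named entity, an in-range reference, and two out-of-range ones.
sample = "caf&eacute; and &#233;\nYaz&#305;c&#305;"
for line_number, code_point in refs_outside_latin1(sample):
    print(f"line {line_number}: &#{code_point}; is outside Latin 1")
```

Named entities pass through untouched, so a file that uses only HTML4 entity names and in-range numbers produces no hits.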