Thread: Re: pgsql: We're going to have to spell dotless i as plain i, because
From: Martijn van Oosterhout
On Fri, Sep 22, 2006 at 12:29:05PM -0300, Tom Lane wrote:
> Log Message:
> -----------
> We're going to have to spell dotless i as plain i, because dotless i is
> not in the character set supported by DocBook nor standard HTML. (Sorry
> Volkan.) Also replace random character-set references by a pointer to
> the actual standard.
Well, you could always use the HTML4 ı, which most tools should understand. At least browsers have good support for this kind of entity.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> Well you could always use the HTML4 ı which most tools should
> understand. At least browsers have good support for this kind of
> entity.
Please review the recent thread on pgsql-docs before reiterating all the suggestions.
--
Peter Eisentraut http://developer.postgresql.org/~petere/
On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> Martijn van Oosterhout wrote:
> > Well you could always use the HTML4 ı which most tools should
> > understand. At least browsers have good support for this kind of
> > entity.
>
> Please review the recent thread on pgsql-docs before reiterating all the
> suggestions.
Oh sorry, it wasn't clear from the commit entry. It's not that DocBook doesn't support the character or that it can't be represented. It's just not supported in the document encoding we're using.
Sorry for the noise.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
> On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> > Martijn van Oosterhout wrote:
> > > Well you could always use the HTML4 ı which most tools should
> > > understand. At least browsers have good support for this kind of
> > > entity.
> >
> > Please review the recent thread on pgsql-docs before reiterating all the
> > suggestions.
>
> Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
> doesn't support the character or that it can't be represented. It's
> just not supported in the document encoding we're using.
That's not how I understand it. Peter tells me the document encoding only affects how high-bit characters are interpreted, but for some reason the toolchain just doesn't support UTF8. If you use ı in SGML it does come out right in HTML, but new toolchains throw an error for it.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Sat, Sep 23, 2006 at 08:49:02AM -0400, Bruce Momjian wrote:
> That's not how I understand it. The document encoding is only related
> to how high-bit characters are interpreted, I am told by Peter, but for
> some reason the toolchain just doesn't support UTF8, even though if you
> use ı in SGML it does come out right in HTML, but new toolchains
> throw an error for it.
Dunno about UTF-8, but openjade only supports one character repertoire, and that's Unicode (see "character handling" in the man page).
According to the docbook reference, a way to specify the dotless i is &inodot;
http://www.oasis-open.org/docbook/documentation/reference/html/iso-lat2.html
But it's part of Latin-2, and if your stylesheet declares latin1 as the only valid characters, then that character is invalid, no matter how you represent it. I was just surprised, because &inodot; has been part of docbook since version 3, which is quite some time ago now.
So to me (more of a docbook novice) it seems like it's the stylesheet that's limiting you to latin1, not the docbook parser.
Anyway, the problem has been solved, so we can all get back to testing the beta now.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
> So to me (a more docbook novice) it seems like it's the stylesheet
> that's limiting you to latin1, not the docbook parser.
But the "stylesheet" in question is part of the basic docbook infrastructure, so the above distinction is academic. (Or at least that's what Peter stated upthread.)
To my mind the real problem is that one of the principal output formats we are interested in is HTML, and there is no dotless-i entity in any version of the HTML standard. I trust I need not point out again the difference between "my browser recognizes this construct" and "it's in the standard".
regards, tom lane
Martijn van Oosterhout wrote:
> Oh sorry, it wasn't clear from the commit entry. It's not that
> DocBook doesn't support the character or that it can't be
> represented. It's just not supported in the document encoding we're
> using.
No, no, and no. The reason that it doesn't work is that the document character set for DocBook is Latin 1, so any attempt to refer to a character not in this set is going to fail.
--
Peter Eisentraut http://developer.postgresql.org/~petere/
On Sat, Sep 23, 2006 at 12:27:51PM -0400, Tom Lane wrote:
> To my mind the real problem is that one of the principal output formats
> we are interested in is HTML, and there is no dotless-i entity in any
> version of the HTML standard. I trust I need not point out again the
> difference between "my browser recognizes this construct" and "it's in
> the standard".
Sure there is: HTML4 includes all of Unicode, thus also the dotless-i. They gave up assigning names to them after latin1, but numerical references are in the standard also (decimal and hex).
I created a simple docbook document on my computer with &inodot; and ran openjade over it, and in the output file it is converted to &#305;. Openjade knows how to generate valid character references.
The input file is attached. I compiled it with the command:
    openjade -V draft-mode -wall -wno-unused-param -wno-empty -i output-html -t sgml /tmp/a.sgml
For the dsl file, just copy the stylesheet.dsl file in the postgresql source tree.
Why it doesn't work in the current docs I don't know, but I think we can rule out limitations of HTML or Docbook.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
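Editorial sketch, not part of the original message: the two claims in this exchange (HTML 4 has no named entity for dotless i, but numeric character references can still reach it) can be checked roughly with Python's standard library. This assumes that `html.entities.name2codepoint`, which Python documents as the HTML 4 entity table, is an adequate stand-in for the HTML 4.01 specification itself.

```python
import html
from html.entities import name2codepoint  # HTML 4 named entities only

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I, decimal 305

# Decimal and hexadecimal numeric character references both resolve to U+0131.
decimal_ref = html.unescape("&#305;")
hex_ref = html.unescape("&#x131;")

# Named entities in the HTML 4 table that map to U+0131.
html4_names = [name for name, cp in name2codepoint.items() if cp == 0x131]

print(decimal_ref == DOTLESS_I)  # True
print(hex_ref == DOTLESS_I)      # True
print(html4_names)               # []
```

Under that assumption, the character is reachable by number in HTML 4 even though no name is assigned to it.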
Martijn van Oosterhout <kleptog@svana.org> writes:
> I created a simple docbook document on my computer with &inodot; and
> ran openjade over and in the output file it is converted to &#305;.
I experimented with that, and openjade didn't complain about it, but it renders in my browser (Safari) as
    Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)
So that hardly looks like a portable solution either.
regards, tom lane
Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I created a simple docbook document on my computer with &inodot; and
> > ran openjade over and in the output file it is converted to &#305;.
>
> I experimented with that, and openjade didn't complain about it, but
> it renders in my browser (Safari) as
>
> Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)
Well, if I put a &inodot; into an HTML document and open it in my browser (Epiphany, which is Mozilla-based), it surely looks like a verbatim &inodot;. However, if I replace it with &#305; then it looks like a dotless i. So maybe your Openjade is not exactly the same one Martijn was using, because what I understood was that Openjade replaced the &inodot; with &#305;, which should work.
Does your browser display it correctly if you replace it manually with &#305;?
On the other hand, I don't understand why DocBook would be Latin-1 only. What would be the point of that limitation? Some googling seems to reveal that people indeed use other charsets, UTF-8 in particular (but also Big5, Latin-2, etc), so apparently this isn't set in stone. (I admit that they mainly talk about XML Docbook though.)
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> So maybe your Openjade is not exactly the same
> Martijn was using, because what I understood was that Openjade replaced
> the &inodot; with &#305;, which should work.
I think it's more likely that he was running with a non-DocBook stylesheet (his openjade command did not explicitly select a catalog and stylesheet the way that our Makefiles do). Or just a different version of the stylesheet. I'm testing with whatever ships in Fedora Core 5. I see definitions of &inodot; in some of the files under /usr/share/sgml, but evidently none of them are included by docbook...
> Does your browser display it correctly if you replace manually with &#305;?
Doesn't really matter whether it does or not, since my gripe about that is that DocBook rejects it.
> On the other hand, I don't understand why DocBook would be Latin-1 only.
I'm surprised too that it couldn't be easily overridden. Peter, any idea why not?
regards, tom lane
Alvaro Herrera wrote:
> On the other hand, I don't understand why DocBook would be Latin-1
> only. What would be the point of that limitation? Some googling
> seems to reveal that people indeed uses other charsets, UTF-8 in
> particular (but also Big5, Latin-2, etc), so apparently this isn't
> set in stone. (I admit that they mainly talk about XML Docbook
> though).
DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.
--
Peter Eisentraut http://developer.postgresql.org/~petere/
One fine day, Sun, 2006-09-24 at 10:20, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation? Some googling
> > seems to reveal that people indeed uses other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone. (I admit that they mainly talk about XML Docbook
> > though).
>
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.
Are you sure it's UCS-4? I've always thought that XML is what is given in <xml > tag, and utf-8 if no charset is given.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hi, Hannu,
Hannu Krosing wrote:
> Are you sure it's UCS-4 ? I've always thought that XML is what is given
> in <xml > tag, and utf-8 if no charset is given.
You have to distinguish between the supported charset, and the document encoding.
HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS
Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
On Sun, Sep 24, 2006 at 10:20:22AM +0200, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation? Some googling
> > seems to reveal that people indeed uses other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone. (I admit that they mainly talk about XML Docbook
> > though).
>
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.
This sheds new light on the XML vs. SGML thing you said before. While it's not necessarily compelling enough to force a switch, it is a substantive difference that we can actually see.
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter
Remember to vote!
One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
> Hi, Hannu,
>
> Hannu Krosing wrote:
> > Are you sure it's UCS-4 ? I've always thought that XML is what is given
> > in <xml > tag, and utf-8 if no charset is given.
>
> You have to distinguish between the supported charset, and the document
> encoding.
UCS-4 and UTF-8 are both encodings for Unicode.
See: http://en.wikipedia.org/wiki/UTF-32
> HTH,
> Markus
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
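Editorial sketch, not part of the original message: the point that UCS-4 (UTF-32) and UTF-8 are different byte encodings of the same character set can be illustrated with Python's built-in codecs. The choice of dotless i as the sample character is ours, and Python's `latin-1` codec is used here only as an example of an encoding that cannot represent it.

```python
DOTLESS_I = "\u0131"  # the same abstract character in every case below

utf8_bytes = DOTLESS_I.encode("utf-8")       # variable width, 2 bytes here
utf32_bytes = DOTLESS_I.encode("utf-32-be")  # fixed width, always 4 bytes

# Both byte sequences decode back to the same code point.
same_character = utf8_bytes.decode("utf-8") == utf32_bytes.decode("utf-32-be")

# Latin-1 is an encoding whose range (0..255) does not include U+0131.
try:
    DOTLESS_I.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(utf8_bytes.hex(), utf32_bytes.hex(), same_character, fits_in_latin1)
# c4b1 00000131 True False
```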
Hannu Krosing wrote:
> One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
>> Hi, Hannu,
>>
>> Hannu Krosing wrote:
>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>> You have to distinguish between the supported charset, and the document
>> encoding.
> UCS-4 and UTF-8 are both encodings for UNICODE
>
> see: http://en.wikipedia.org/wiki/UTF-32
If we want to quote references, we should quote the XML standard. For example, see here for the exact charset supported by XML: http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets. A little lower down it defines the encodings allowed too.
cheers
andrew
Andrew Dunstan wrote:
> If we want to quote references, we should quote the XML standard. For
> example, see here to see the exact charset supported by XML:
> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.
The actual cause of the processing problems we have been seeing is the character set definitions in the SGML declarations of the respective document types.
For DocBook SGML 4.2:
    CHARSET
    BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
    DESCSET
          0       9 UNUSED
          9       2 9
         11       2 UNUSED
         13       1 13
         14      18 UNUSED
         32      95 32
        127       1 UNUSED
    BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET
        128      32 UNUSED
        160      96 32
For XML:
    CHARSET
    BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET
          0       9 UNUSED
          9       2 9
         11       2 UNUSED
         13       1 13
         14      18 UNUSED
         32      95 32
        127       1 UNUSED
        128      32 UNUSED
        160   55136 160
      55296    2048 UNUSED  -- surrogates --
      57344    8190 57344
      65534       2 UNUSED  -- FFFE and FFFF --
      65536 1048576 65536   -- 16 planes outside BMP --
--
Peter Eisentraut http://developer.postgresql.org/~petere/
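Editorial sketch, not part of the original message: one way to read the two declarations above is as lists of (start, count, target) triples, where a target of UNUSED marks code positions outside the document character set. The helper names below are hypothetical, and the parser is a deliberate simplification: it only tests membership and ignores the mapping onto the base set that a real SGML parser performs.

```python
from typing import List, Tuple

# DESCSET triples copied from the declarations quoted above,
# with the inline "-- ... --" comments removed.
DOCBOOK_SGML_42 = """
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 96 32
"""

XML = """
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED
57344 8190 57344
65534 2 UNUSED
65536 1048576 65536
"""


def declared_ranges(descset: str) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges that are not marked UNUSED."""
    ranges = []
    for line in descset.strip().splitlines():
        start, count, target = line.split()
        if target != "UNUSED":
            first = int(start)
            ranges.append((first, first + int(count) - 1))
    return ranges


def in_charset(codepoint: int, descset: str) -> bool:
    """True if the code point falls inside any declared range."""
    return any(first <= codepoint <= last
               for first, last in declared_ranges(descset))


DOTLESS_I = 0x131  # decimal 305

print(in_charset(DOTLESS_I, DOCBOOK_SGML_42))  # False
print(in_charset(DOTLESS_I, XML))              # True
```

Read this way, the DocBook SGML 4.2 declaration tops out at position 255, which is consistent with references to position 305 being rejected regardless of how the source file is encoded.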
One fine day, Mon, 2006-09-25 at 00:23, Peter Eisentraut wrote:
> Andrew Dunstan wrote:
> > If we want to quote references, we should quote the XML standard. For
> > example, see here to see the exact charset supported by XML:
> > http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.
>
> The actual cause of the processing problems we have been seeing are the
> character set definitions in the SGML declarations of the respective
> document types.
I see charsets, but where are encodings defined?
I don't think that any of our SGML documentation is actually in UCS-4 encoding.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> writes:
> I don't think that any of our SGML documentation is actually in UCS-4
> encoding.
The source files use nothing beyond plain ASCII (and should remain that way, IMHO) so there isn't any need to inquire very far into exactly what the toolchain thinks the "document encoding" is. The issue at hand here is what the *output* character set is, which is to say the "document character set" if I have the jargon right. That is the space over which we are permitted to use &-entities.
regards, tom lane
Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
>
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is. The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right. That is the space over which
> we are permitted to use &-entities.
Just for reference, if we could support UTF8, I was hoping to add non-Latin names as alternates to the ASCII versions, so we could have Japanese and Russian-lettered names in the release notes. I thought it would be a nice touch.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Sun, Sep 24, 2006 at 07:38:20PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
>
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is. The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right. That is the space over which
> we are permitted to use &-entities.
What you're talking about is generally referred to as the "character repertoire", the abstract set of characters a document is considered to be composed of. For example: HTML4 (and XML IIRC) explicitly defines the "character repertoire" to be Unicode, even though the "character encoding" may only point to a subset of the total. Any others can be generated via the &xxx; escape syntax.
I'm surprised about the difference in installations. I didn't use your -c option because that directory does not exist on my computer, but maybe that's all the difference...
http://www.unicode.org/unicode/reports/tr17/
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Hi, Hannu,
Hannu Krosing wrote:
>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>> You have to distinguish between the supported charset, and the document
>> encoding.
> UCS-4 and UTF-8 are both encodings for UNICODE
> see: http://en.wikipedia.org/wiki/UTF-32
Yes, I know. The point I wanted to make was that the document encoding is independent of the allowed charset (except that what it can represent directly has to be a subset of it). That is what XML entities were defined for. So even in a document using LATIN-1 as its encoding, the charset is still Unicode, giving us the possibility of using &entities; for non-latin1 characters.
HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS
Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
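Editorial sketch, not part of the original message: the distinction drawn here, that an XML document declared as Latin-1 can still reach any Unicode character through a character reference, can be seen with Python's standard-library XML parser. The element name and the sample text are invented for the illustration; the byte `\xe9` is included only to show that the declared encoding really is applied.

```python
import xml.etree.ElementTree as ET

# A document whose *encoding* is Latin-1. The byte 0xE9 is a raw Latin-1
# "e with acute"; the dotless i (U+0131) is outside Latin-1 and is written
# as a numeric character reference instead.
document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<author>Volkan Yaz&#305;c&#305;, caf\xe9</author>'
)

text = ET.fromstring(document).text

print(text == "Volkan Yaz\u0131c\u0131, caf\u00e9")  # True
```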
Hi, Bruce,
Bruce Momjian wrote:
>>> I don't think that any of our SGML documentation is actually in UCS-4
>>> encoding.
>> The source files use nothing beyond plain ASCII (and should remain that
>> way, IMHO) so there isn't any need to inquire very far into exactly what
>> the toolchain thinks the "document encoding" is. The issue at hand here
>> is what the *output* character set is, which is to say the "document
>> character set" if I have the jargon right. That is the space over which
>> we are permitted to use &-entities.
>
> Just for reference, if we could support UTF8, I was hoping to add
> non-Latin names as alternates to the ASCII versions, so we could have
> Japanese and Russian-lettered names in the release notes. I thought it
> would be a nice touch.
We don't need UTF8 encoding for this. It's also possible using ASCII encoding + &#4711; entities. But we need the charset to be Unicode.
HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS
Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
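Editorial sketch, not part of the original message: the "ASCII encoding plus numeric entities" approach can be reproduced with Python's `xmlcharrefreplace` error handler, which rewrites every non-ASCII character as a decimal character reference. Whether a given SGML toolchain then accepts those references still depends on its declared document character set, which this snippet does not model.

```python
import html

name = "Volkan Yaz\u0131c\u0131"

# Keep the source pure ASCII: unencodable characters become &#NNN; references.
ascii_source = name.encode("ascii", errors="xmlcharrefreplace")

# A consumer whose character set is Unicode can resolve them again.
round_trip = html.unescape(ascii_source.decode("ascii"))

print(ascii_source)        # b'Volkan Yaz&#305;c&#305;'
print(round_trip == name)  # True
```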