Thread: Re: pgsql: We're going to have to spell dotless i as plain i, because

Re: pgsql: We're going to have to spell dotless i as plain i, because

From

Martijn van Oosterhout

Date:

23 September 2006, 06:46:18

On Fri, Sep 22, 2006 at 12:29:05PM -0300, Tom Lane wrote:
> Log Message:
> -----------
> We're going to have to spell dotless i as plain i, because dotless i is
> not in the character set supported by DocBook nor standard HTML.  (Sorry
> Volkan.)  Also replace random character-set references by a pointer to
> the actual standard.

Well you could always use the HTML4 &#305; which most tools should
understand. At least browsers have good support for this kind of
entity.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgsql: We're going to have to spell dotless i as plain i, because

From

Peter Eisentraut

Date:

23 September 2006, 06:55:00

Martijn van Oosterhout wrote:
> Well you could always use the HTML4 &#305; which most tools should
> understand. At least browsers have good support for this kind of
> entity.

Please review the recent thread on pgsql-docs before reiterating all the 
suggestions.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: pgsql: We're going to have to spell dotless i as plain i, because

From

Martijn van Oosterhout

Date:

23 September 2006, 09:19:15

On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> Martijn van Oosterhout wrote:
> > Well you could always use the HTML4 &#305; which most tools should
> > understand. At least browsers have good support for this kind of
> > entity.
>
> Please review the recent thread on pgsql-docs before reiterating all the
> suggestions.

Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
doesn't support the character or that it can't be represented. It's
just not supported in the document encoding we're using.

Sorry for the noise.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgsql: We're going to have to spell dotless i

From

Bruce Momjian

Date:

23 September 2006, 09:49:36

Martijn van Oosterhout wrote:
> On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> > Martijn van Oosterhout wrote:
> > > Well you could always use the HTML4 &#305; which most tools should
> > > understand. At least browsers have good support for this kind of
> > > entity.
> > 
> > Please review the recent thread on pgsql-docs before reiterating all the 
> > suggestions.
> 
> Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
> doesn't support the character or that it can't be represented. It's
> just not supported in the document encoding we're using.

That's not how I understand it.  The document encoding is only related
to how high-bit characters are interpreted, I am told by Peter, but for
some reason the toolchain just doesn't support UTF8, even though if you
use ı in SGML it does come out right in HTML, but new toolchains
throw an error for it.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: pgsql: We're going to have to spell dotless i

From

Martijn van Oosterhout

Date:

23 September 2006, 11:07:24

On Sat, Sep 23, 2006 at 08:49:02AM -0400, Bruce Momjian wrote:
> That's not how I understand it.  The document encoding is only related
> to how high-bit characters are interpreted, I am told by Peter, but for
> some reason the toolchain just doesn't support UTF8, even though if you
> use ı in SGML it does come out right in HTML, but new toolchains
> throw an error for it.

Dunno about UTF-8, but openjade only supports one character repertoire,
and that's Unicode (under character handling in the man page).

According to the docbook reference, a way to specify the dotless i
is &inodot;

http://www.oasis-open.org/docbook/documentation/reference/html/iso-lat2.html

But it's part of Latin-2, and if your stylesheet declares latin1 as
the only valid characters, then that character is invalid, no matter
how you represent it. I was just surprised, because &inodot; has been
part of docbook since version 3, which is quite some time ago now.

So to me (more of a DocBook novice) it seems like it's the stylesheet
that's limiting you to latin1, not the docbook parser.

Anyway, the problem has been solved, so we can all get back to testing
the beta now.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgsql: We're going to have to spell dotless i

From

Tom Lane

Date:

23 September 2006, 13:27:59

Martijn van Oosterhout <kleptog@svana.org> writes:
> So to me (more of a DocBook novice) it seems like it's the stylesheet
> that's limiting you to latin1, not the docbook parser.

But the "stylesheet" in question is part of the basic docbook
infrastructure, so the above distinction is academic.  (Or at least
that's what Peter stated upthread.)

To my mind the real problem is that one of the principal output formats
we are interested in is HTML, and there is no dotless-i entity in any
version of the HTML standard.  I trust I need not point out again the
difference between "my browser recognizes this construct" and "it's in
the standard".
        regards, tom lane

Re: pgsql: We're going to have to spell dotless i as plain i, because

From

Peter Eisentraut

Date:

23 September 2006, 13:53:56

Martijn van Oosterhout wrote:
> Oh sorry, it wasn't clear from the commit entry. It's not that
> DocBook doesn't support the character or that it can't be
> represented. It's just not supported in the document encoding we're
> using.

No, no, and no.

The reason that it doesn't work is that the document character set for
DocBook is Latin 1, so any attempt to refer to a character not in this 
set is going to fail.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: pgsql: We're going to have to spell dotless i

From

Martijn van Oosterhout

Date:

23 September 2006, 16:56:24

On Sat, Sep 23, 2006 at 12:27:51PM -0400, Tom Lane wrote:
> To my mind the real problem is that one of the principal output formats
> we are interested in is HTML, and there is no dotless-i entity in any
> version of the HTML standard.  I trust I need not point out again the
> difference between "my browser recognizes this construct" and "it's in
> the standard".

Sure there is, HTML4 includes all of Unicode, thus also the dotless-i.
They gave up assigning names to them after latin1, but numerical
references are in the standard also (decimal and hex).

I created a simple docbook document on my computer with &inodot; and
ran openjade over it, and in the output file it is converted to &#305;.
Openjade knows how to generate valid character references. The input
file is attached, I compiled it with the command:

openjade -V draft-mode -wall -wno-unused-param -wno-empty -i output-html -t sgml /tmp/a.sgml

For dsl file just copy the stylesheet.dsl file in the postgresql source
tree.

Why it doesn't work in the current docs I don't know, but I think we can
rule out limitations of HTML or Docbook.
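
The point about numeric references can be checked outside the DocBook toolchain. The following is only a minimal sketch using Python's standard `html` module (Python is not part of the thread's tooling, and the variable names are illustrative): it decodes the decimal and hexadecimal references for the dotless i and looks up whether the name `inodot` is among the HTML 4 named entities.

```python
import html
from html.entities import name2codepoint

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Decimal and hexadecimal numeric character references name the same code point.
decimal_ref = html.unescape("&#305;")
hex_ref = html.unescape("&#x131;")

# name2codepoint lists the HTML 4 named entities only; check for the ISO/DocBook name.
has_html4_name = "inodot" in name2codepoint

print(decimal_ref == DOTLESS_I, hex_ref == DOTLESS_I, has_html4_name)
```

This says nothing about what a particular openjade or stylesheet version emits; it only shows that the numeric form is resolvable without a named entity.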

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: pgsql: We're going to have to spell dotless i

From

Tom Lane

Date:

23 September 2006, 17:18:21

Martijn van Oosterhout <kleptog@svana.org> writes:
> I created a simple docbook document on my computer with &inodot; and
> ran openjade over it, and in the output file it is converted to &#305;.

I experimented with that, and openjade didn't complain about it, but
it renders in my browser (Safari) as

Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)

So that hardly looks like a portable solution either.
        regards, tom lane

Re: pgsql: We're going to have to spell dotless i

From

Alvaro Herrera

Date:

23 September 2006, 19:16:36

Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I created a simple docbook document on my computer with &inodot; and
> > ran openjade over it, and in the output file it is converted to &#305;.
> 
> I experimented with that, and openjade didn't complain about it, but
> it renders in my browser (Safari) as
> 
> Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)

Well, if I put a &inodot; into an HTML document and open it on my
browser (Epiphany, which is Mozilla-based), it surely looks like
verbatim &inodot;.  However, if I replace it with &#305; then it looks
like a dotless i.  So maybe your Openjade is not exactly the same
Martijn was using, because what I understood was that Openjade replaced
the &inodot; with &#305;, which should work.

Does your browser display it correctly if you replace manually with &#305;?

On the other hand, I don't understand why DocBook would be Latin-1 only.
What would be the point of that limitation?  Some googling seems to
reveal that people indeed use other charsets, UTF-8 in particular (but
also Big5, Latin-2, etc), so apparently this isn't set in stone.  (I
admit that they mainly talk about XML Docbook though).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: pgsql: We're going to have to spell dotless i

From

Tom Lane

Date:

23 September 2006, 19:43:48

Alvaro Herrera <alvherre@commandprompt.com> writes:
> So maybe your Openjade is not exactly the same
> Martijn was using, because what I understood was that Openjade replaced
> the &inodot; with &#305;, which should work.

I think it's more likely that he was running with a non-DocBook
stylesheet (his openjade command did not explicitly select a catalog and
stylesheet the way that our Makefiles do).  Or just a different version
of the stylesheet.  I'm testing with whatever ships in Fedora Core 5.
I see definitions of &inodot; in some of the files under
/usr/share/sgml, but evidently none of them are included by docbook...

> Does your browser display it correctly if you replace manually with &#305;?

Doesn't really matter whether it does or not, since my gripe about that
is that DocBook rejects it.

> On the other hand, I don't understand why DocBook would be Latin-1 only.

I'm surprised too that it couldn't be easily overridden.  Peter, any
idea why not?
        regards, tom lane

Re: pgsql: We're going to have to spell dotless i

From

Peter Eisentraut

Date:

24 September 2006, 05:20:34

Alvaro Herrera wrote:
> On the other hand, I don't understand why DocBook would be Latin-1
> only. What would be the point of that limitation?  Some googling
> seems to reveal that people indeed use other charsets, UTF-8 in
> particular (but also Big5, Latin-2, etc), so apparently this isn't
> set in stone.  (I admit that they mainly talk about XML Docbook
> though).

DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: pgsql: We're going to have to spell dotless i

From

Hannu Krosing

Date:

24 September 2006, 09:52:12

One fine day, Sun, 2006-09-24 at 10:20, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation?  Some googling
> > seems to reveal that people indeed use other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone.  (I admit that they mainly talk about XML Docbook
> > though).
> 
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

Are you sure it's UCS-4 ? I've always thought that XML is what is given
in <xml > tag, and utf-8 if no charset is given.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

Re: pgsql: We're going to have to spell dotless i

From

Markus Schaber

Date:

24 September 2006, 09:56:36

Hi, Hannu,

Hannu Krosing wrote:

> Are you sure it's UCS-4 ? I've always thought that XML is what is given
> in <xml > tag, and utf-8 if no charset is given.

You have to distinguish between the supported charset, and the document
encoding.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: pgsql: We're going to have to spell dotless i

From

David Fetter

Date:

24 September 2006, 13:22:40

On Sun, Sep 24, 2006 at 10:20:22AM +0200, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation?  Some googling
> > seems to reveal that people indeed use other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone.  (I admit that they mainly talk about XML Docbook
> > though).
> 
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

This sheds a new light on the XML vs. SGML thing you said before.
While it's not necessarily compelling enough to force a switch, it is
a substantive difference that we can actually see.

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!

Re: pgsql: We're going to have to spell dotless i

From

Hannu Krosing

Date:

24 September 2006, 17:47:39

One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
> Hi, Hannu,
> 
> Hannu Krosing wrote:
> 
> > Are you sure it's UCS-4 ? I've always thought that XML is what is given
> > in <xml > tag, and utf-8 if no charset is given.
> 
> You have to distinguish between the supported charset, and the document
> encoding.

UCS-4 and UTF-8 are both encodings for UNICODE 

see: http://en.wikipedia.org/wiki/UTF-32
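
To make the distinction concrete, here is a small sketch (Python, chosen only for illustration) that encodes the single code point U+0131 both ways. It assumes Python's `utf-32-be` codec as a stand-in for UCS-4; for a character in the Basic Multilingual Plane the byte sequences are the same.

```python
DOTLESS_I = "\u0131"  # one abstract character, code point 305

# Two different byte-level encodings of the same code point.
utf8_bytes = DOTLESS_I.encode("utf-8")       # variable width, 2 bytes here
utf32_bytes = DOTLESS_I.encode("utf-32-be")  # fixed width, always 4 bytes

print(ord(DOTLESS_I), utf8_bytes.hex(), utf32_bytes.hex())

# Both decode back to the identical character.
same_character = utf8_bytes.decode("utf-8") == utf32_bytes.decode("utf-32-be")
```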


> HTH,
> Markus
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

Re: pgsql: We're going to have to spell dotless i

From

Andrew Dunstan

Date:

24 September 2006, 18:56:11

Hannu Krosing wrote:
> One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
>   
>> Hi, Hannu,
>>
>> Hannu Krosing wrote:
>>
>>     
>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>>>       
>> You have to distinguish between the supported charset, and the document
>> encoding.
>>     
>
> UCS-4 and UTF-8 are both encodings for UNICODE 
>
> see: http://en.wikipedia.org/wiki/UTF-32
>   

If we want to quote references, we should quote the XML standard. For 
example, see here to see the exact charset supported by XML: 
http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.

A little lower down it defines the encodings allowed too.

cheers

andrew

Re: pgsql: We're going to have to spell dotless i

From

Peter Eisentraut

Date:

24 September 2006, 19:23:38

Andrew Dunstan wrote:
> If we want to quote references, we should quote the XML standard. For
> example, see here to see the exact charset supported by XML:
> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.

The actual cause of the processing problems we have been seeing is the
character set definitions in the SGML declarations of the respective
document types.

For DocBook SGML 4.2:

CHARSET
       BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
       DESCSET
                  0        9  UNUSED
                  9        2       9
                 11        2  UNUSED
                 13        1      13
                 14       18  UNUSED
                 32       95      32
                127        1  UNUSED

       BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
       DESCSET
                128       32  UNUSED
                160       96      32

For XML:

CHARSET
       BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
       DESCSET
                  0        9  UNUSED
                  9        2       9
                 11        2  UNUSED
                 13        1      13
                 14       18  UNUSED
                 32       95      32
                127        1  UNUSED
                128       32  UNUSED
                160    55136     160
              55296     2048  UNUSED -- surrogates --
              57344     8190   57344
              65534        2  UNUSED -- FFFE and FFFF --
              65536  1048576   65536 -- 16 planes outside BMP --
 

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/
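
The effect of the two declarations quoted above can be sketched as a range lookup. In this illustrative Python snippet the `(start, count, base)` tuples are transcribed from the DESCSET entries in the message, with `None` standing for UNUSED; the tuple layout and the helper name `in_document_charset` are assumptions made for the example, not anything an SGML parser exposes.

```python
# (first document code, number of codes, base-set code or None for UNUSED)
DOCBOOK_SGML_42 = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13), (14, 18, None),
    (32, 95, 32), (127, 1, None),
    (128, 32, None), (160, 96, 32),  # right part of Latin alphabet no. 1
]

XML_DECL = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13), (14, 18, None),
    (32, 95, 32), (127, 1, None), (128, 32, None), (160, 55136, 160),
    (55296, 2048, None), (57344, 8190, 57344), (65534, 2, None),
    (65536, 1048576, 65536),
]


def in_document_charset(code_point, descset):
    """Return True if code_point falls in a range that is mapped (not UNUSED)."""
    for start, count, base in descset:
        if start <= code_point < start + count:
            return base is not None
    return False  # not described at all, so not part of the document charset


DOTLESS_I = 305  # U+0131
print(in_document_charset(DOTLESS_I, DOCBOOK_SGML_42))  # beyond 255: not covered
print(in_document_charset(DOTLESS_I, XML_DECL))         # inside 160..55295
```

Under this reading, any reference to code 305 is outside the DocBook SGML 4.2 document character set however it is spelled, while the XML declaration covers it.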

Re: pgsql: We're going to have to spell dotless i

From

Hannu Krosing

Date:

24 September 2006, 20:22:09

One fine day, Mon, 2006-09-25 at 00:23, Peter Eisentraut wrote:
> Andrew Dunstan wrote:
> > If we want to quote references, we should quote the XML standard. For
> > example, see here to see the exact charset supported by XML:
> > http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.
> 
> The actual cause of the processing problems we have been seeing is the
> character set definitions in the SGML declarations of the respective
> document types.

I see charsets, but where are encodings defined ?

I don't think that any of our SGML documentation is actually in UCS-4
encoding.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

Re: pgsql: We're going to have to spell dotless i

From

Tom Lane

Date:

24 September 2006, 20:38:36

Hannu Krosing <hannu@skype.net> writes:
> I don't think that any of our SGML documentation is actually in UCS-4
> encoding.

The source files use nothing beyond plain ASCII (and should remain that
way, IMHO) so there isn't any need to inquire very far into exactly what
the toolchain thinks the "document encoding" is.  The issue at hand here
is what the *output* character set is, which is to say the "document
character set" if I have the jargon right.  That is the space over which
we are permitted to use &-entities.
        regards, tom lane

Re: pgsql: We're going to have to spell dotless i

From

Bruce Momjian

Date:

24 September 2006, 21:36:36

Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
> 
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is.  The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right.  That is the space over which
> we are permitted to use &-entities.

Just for reference, if we could support UTF8, I was hoping to add
non-Latin names as alternates to the ASCII versions, so we could have
Japanese and Russian-lettered names in the release notes.  I thought it
would be a nice touch.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: pgsql: We're going to have to spell dotless i

From

Martijn van Oosterhout

Date:

25 September 2006, 02:43:49

On Sun, Sep 24, 2006 at 07:38:20PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
>
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is.  The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right.  That is the space over which
> we are permitted to use &-entities.

What you're talking about is generally referred to as the "character
repertoire", the abstract set of characters a document is considered to
be composed of. For example: HTML4 (and XML IIRC) explicitly defines
the "character repertoire" to be Unicode, even though the "character
encoding" may only point to a subset of the total. Any others can be
generated via the &xxx; escape syntax.

I'm surprised about the difference in installations. I didn't use your
-c option because that directory does not exist on my computer, but
maybe that's all the difference...

http://www.unicode.org/unicode/reports/tr17/

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgsql: We're going to have to spell dotless i

From

Markus Schaber

Date:

25 September 2006, 04:48:27

Hi, Hannu,

Hannu Krosing wrote:

>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>> You have to distinguish between the supported charset, and the document
>> encoding.
> UCS-4 and UTF-8 are both encodings for UNICODE
> see: http://en.wikipedia.org/wiki/UTF-32

Yes, I know.

The point I wanted to make was that the document encoding is independent
from the allowed charset (except having to be a subset).

That is what XML entities were defined for.

So even in a document using LATIN-1 as its encoding, the charset is still
Unicode, giving us the possibility to use &entities; to use non-latin1
characters.
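
As a rough illustration of this point for XML (not for the SGML toolchain under discussion), the sketch below parses a byte string that declares ISO-8859-1 as its encoding yet refers to U+0131 through a numeric character reference. It uses Python's standard `xml.etree.ElementTree`; the element name and sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

# The bytes are Latin-1 encoded (0xE9 is e-acute), but the character
# reference &#305; names a code point that Latin-1 itself cannot encode.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9 &#305;</name>'

text = ET.fromstring(doc).text
print(repr(text))
```

The parser accepts the reference because the document character set of XML is Unicode regardless of the declared encoding.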

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: pgsql: We're going to have to spell dotless i

From

Markus Schaber

Date:

25 September 2006, 05:03:07

Hi, Bruce,

Bruce Momjian wrote:

>>> I don't think that any of our SGML documentation is actually in UCS-4
>>> encoding.
>> The source files use nothing beyond plain ASCII (and should remain that
>> way, IMHO) so there isn't any need to inquire very far into exactly what
>> the toolchain thinks the "document encoding" is.  The issue at hand here
>> is what the *output* character set is, which is to say the "document
>> character set" if I have the jargon right.  That is the space over which
>> we are permitted to use &-entities.
>
> Just for reference, if we could support UTF8, I was hoping to add
> non-Latin names as alternates to the ASCII versions, so we could have
> Japanese and Russian-lettered names in the release notes.  I thought it
> would be a nice touch.

We don't need UTF8 encoding for this. It's also possible using ASCII
encoding + &#4711; entities.

But we need the Charset to be Unicode.
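
A minimal sketch of what "ASCII encoding plus numeric entities" looks like in practice, using Python's built-in `xmlcharrefreplace` error handler. This is only an illustration of the idea, not a proposal for how the release notes would be generated; the variable names are hypothetical.

```python
import html

name = "Yaz\u0131c\u0131"  # contains U+0131, which ASCII cannot encode

# Characters outside ASCII are replaced by decimal character references.
ascii_source = name.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_source)

# A consumer whose character set is Unicode can recover the original text.
restored = html.unescape(ascii_source)

# The reference used in the message above corresponds to code point 4711.
sample_ref = chr(4711).encode("ascii", errors="xmlcharrefreplace")
```

The source file stays pure ASCII; whether the references are accepted still depends on the document character set declared for the toolchain.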

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org