Thread: non-ASCII characters in SGML documentation (and elsewhere)

non-ASCII characters in SGML documentation (and elsewhere)

From
Peter Eisentraut
Date:
There are a few literal non-ASCII characters in the SGML documentation,
namely in

isn.sgml
release-7.4.sgml
release-8.4.sgml

Also, there are some encoded (&foo;) non-ASCII characters in

release-8.0.sgml
release-8.1.sgml
release-8.2.sgml
unaccent.sgml

These all work fine, because they are all LATIN1, and DocBook SGML uses
LATIN1.

But I notice that the contributor names in the 9.1 release notes have
been carefully ASCII-fied, presumably from the Git UTF-8 commit
messages.

For additional amusement, when creating the HISTORY file, lynx recodes
the HTML into the encoding specified by your LC_CTYPE environment
setting.

Also, the following source files contain non-ASCII characters in
comments:

src/backend/port/dynloader/darwin.c (LATIN1)
src/backend/storage/lmgr/predicate.c (UTF8)
src/backend/storage/lmgr/README-SSI (UTF8)

The last two are new in 9.1.

So, some questions:

      * Should we consistently use entities for encoding non-ASCII
        characters in SGML?  Or use LATIN1 freely?
      * Should we allow/use non-ASCII characters in the release notes?
      * What encoding should the HISTORY file have?
      * Should we allow non-ASCII characters in general source files?
      * If so, what should the encoding be?


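A survey like the one above (which files contain non-ASCII bytes, and whether those bytes are valid UTF-8) could be reproduced with a short script. This is only a minimal sketch, not the method used in the thread; the function names `classify_bytes`, `non_ascii_lines`, and `report` are hypothetical. Note that LATIN1 cannot be positively detected, because every byte sequence is valid LATIN1, so the sketch reports only "not UTF-8" in that case.

```python
from pathlib import Path


def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'non-utf-8' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte string decodes as LATIN1, so LATIN1 can only be
        # guessed here, never confirmed.
        return "non-utf-8"


def non_ascii_lines(data: bytes):
    """Yield (line_number, raw_line) for lines containing bytes >= 0x80."""
    for number, line in enumerate(data.splitlines(), start=1):
        if any(byte >= 0x80 for byte in line):
            yield number, line


def report(paths):
    """Print every offending line of every file that is not pure ASCII."""
    for path in paths:
        data = Path(path).read_bytes()
        kind = classify_bytes(data)
        if kind != "ascii":
            for number, line in non_ascii_lines(data):
                print(f"{path}:{number}: {kind}: {line!r}")
```

Calling `report(["doc/src/sgml/isn.sgml"])` from the top of a source tree would list the offending lines of that file, assuming the file exists at that path.
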

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Susanne Ebrecht
Date:
Hello Peter,

On 19.05.2011 23:49, Peter Eisentraut wrote:
> So, some questions:
>
>        * Should we consistently use entities for encoding non-ASCII
>          characters in SGML?  Or use LATIN1 freely?
>        * Should we allow/use non-ASCII characters in the release notes?
>        * What encoding should the HISTORY file have?
>        * Should we allow non-ASCII characters in general source files?
>        * If so, what should the encoding be?

one more argument for switching to XML? :)

I guess we will get some more non-ASCII characters in the documentation.
How do you want to document the collation stuff?
Collations matter for everything that isn't ASCII.
Our docs usually have small examples.
I can imagine that you will want to put German or Russian letters, or
whatever else, into the docs as examples.

Do you have another idea than using utf8?
What do you expect would not fit into utf8?
I would expect words like déjà vu, meaning words that English simply copied
from French and that still use the French accents,
or personal names with e.g. umlauts, accents, and other special
characters from particular languages.

Also consider that editors (vi, emacs) usually use utf8 today.

Btw., for the German docs I use utf8.
The HTML output works well with both 'ö' and '&ouml;'.
I have not yet tested other outputs.

I just changed to utf8 in the stylesheets and use export SP_ENCODING=XML
before compiling.

Unfortunately index sorting works with neither 'ö' nor '&ouml;' yet.
We are still fighting with it and trying to figure out how we can force
it to sort correctly.
Just changing the makefile didn't help.

But in the English docs I doubt that you have to deal with index entries
for special words using non-ASCII characters.

This means that very small, low-effort changes might already help.

Susanne

--
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com

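The choice between writing a character literally and writing it as an entity can be automated in one direction. The following is a minimal sketch, assuming that the entity names in Python's `html.entities` table (the HTML 4 names) are also declared by the DTD in use; that holds for the usual Latin-1 names such as `ouml` and `Aacute`, but it is an assumption for anything else. The function name `entityfy` is hypothetical.

```python
from html.entities import codepoint2name


def entityfy(text: str) -> str:
    """Replace every non-ASCII character with a named or numeric entity."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                           # plain ASCII stays as-is
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # e.g. &ouml;
        else:
            out.append(f"&#{code};")                 # numeric fallback
    return "".join(out)


print(entityfy("Álvaro Herrera"))  # prints "&Aacute;lvaro Herrera"
```

Characters without a named entity fall back to a numeric character reference, which keeps the output pure ASCII in every case.
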

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
>       * Should we consistently use entities for encoding non-ASCII
>         characters in SGML?  Or use LATIN1 freely?

I think we previously discussed this and agreed that all non-ASCII in
the SGML docs should be written as entities.  The existence of
violations of that rule is just, well, a violation that ought to be
fixed.

>       * Should we allow/use non-ASCII characters in the release notes?
>       * What encoding should the HISTORY file have?

Ideally "sure, if entity-ified", but I don't know what to do about
HISTORY.

>       * Should we allow non-ASCII characters in general source files?

Prefer "no" here.

            regards, tom lane

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Fri May 20 07:56:58 -0400 2011:
> Peter Eisentraut <peter_e@gmx.net> writes:
> >       * Should we consistently use entities for encoding non-ASCII
> >         characters in SGML?  Or use LATIN1 freely?
>
> I think we previously discussed this and agreed that all non-ASCII in
> the SGML docs should be written as entities.  The existence of
> violations of that rule is just, well, a violation that ought to be
> fixed.

+1

> >       * Should we allow/use non-ASCII characters in the release notes?
> >       * What encoding should the HISTORY file have?
>
> Ideally "sure, if entity-ified", but I don't know what to do about
> HISTORY.

Can we recode that to plain ascii?  I think iconv has a //TRANSLIT flag
or something like that.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Susanne Ebrecht
Date:
On 20.05.2011 13:56, Tom Lane wrote:
>>        * Should we allow non-ASCII characters in general source files?
> Prefer "no" here.

I only see two reasons for non-ASCII signs in English.
Either it is a foreign name of e.g. a person
or it is a word that English took from French like in déjà vu.
For the second I am sure you will find synonyms that are ASCII only.

The only other reason that I can see for non-ASCII signs in our docs is
for demonstrating collations.

Susanne


Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Alvaro Herrera
Date:
Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:
> On 20.05.2011 13:56, Tom Lane wrote:
> >>        * Should we allow non-ASCII characters in general source files?
> > Prefer "no" here.
>
> I only see two reasons for non-ASCII signs in English.
> Either it is a foreign name of e.g. a person
> or it is a word that English took from French like in déjà vu.

I'd like my name accented in the release notes, thanks.

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Peter Eisentraut
Date:
On Fri, 2011-05-20 at 07:56 -0400, Tom Lane wrote:
> >       * Should we allow non-ASCII characters in general source files?
>
> Prefer "no" here.

Going through this I felt a little bad butchering up people's names that
hadn't bothered anyone before now.  So as a compromise, I made
contributor names UTF-8 consistently, but removed other uses of
non-ASCII characters.


Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Peter Eisentraut
Date:
On Fri, 2011-05-20 at 08:16 -0400, Alvaro Herrera wrote:
> > >       * Should we allow/use non-ASCII characters in the release notes?
> > >       * What encoding should the HISTORY file have?
> >
> > Ideally "sure, if entity-ified", but I don't know what to do about
> > HISTORY.
>
> Can we recode that to plain ascii?  I think iconv has a //TRANSLIT
> flag or something like that.

To make this work on FreeBSD, where we build the releases, we need to
use the following command:

 "/usr/bin/perl" -p -e 's/<H(1|2)$/<H\1 align=center/g' HISTORY.html | \
   LC_ALL=en_US.ISO8859-1 lynx -force_html -dump -nolist -stdin | \
   iconv -f latin1 -t us-ascii//TRANSLIT > HISTORY

This also works on Linux/glibc, but FreeBSD is a bit stricter/more
limited.  Not sure about other platforms, but I'd guess if they don't
have the required locales, they'd be no worse off than now anyway.

The results are reasonable.  What //TRANSLIT actually does depends on the
platform, e.g. on FreeBSD ö becomes "o (a double quote followed by o),
on Linux ö becomes o.

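Because the //TRANSLIT result differs between platforms, a build that needs identical output everywhere would have to do the transliteration itself. The following is a minimal sketch of one platform-independent approach, not what the release scripts do: it decomposes each character with Unicode NFKD normalization and drops the combining marks, so ö always becomes o. The function name `to_ascii` and the `?` placeholder are assumptions made for illustration.

```python
import unicodedata


def to_ascii(text: str, replacement: str = "?") -> str:
    """Strip accents; replace anything still non-ASCII with a placeholder."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)


print(to_ascii("Álvaro Herrera"))  # prints "Alvaro Herrera"
```

Letters that have no decomposition, such as ß, end up as the placeholder, so this is lossier than a hand-maintained transliteration table.
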


Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:
> > On 20.05.2011 13:56, Tom Lane wrote:
> > >>        * Should we allow non-ASCII characters in general source files?
> > > Prefer "no" here.
> >
> > I only see two reasons for non-ASCII signs in English.
> > Either it is a foreign name of e.g. a person
> > or it is a word that English took from French like in déjà vu.
>
> I'd like my name accented in the release notes, thanks.

Sure, you want the first "A" in Alvaro with an accent.  I would love to
backpatch that but it would be a royal pain.  I am afraid it can only
easily be done in future release notes.

I have added the proper markup to our release note checklist; patch
attached.  Does anyone else want special handling for their name?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
new file mode 100644
index 15f273c..c860b90
*** a/doc/src/sgml/release.sgml
--- b/doc/src/sgml/release.sgml
*************** non-ASCII characters            convert
*** 27,32 ****
--- 27,34 ----
          does not support it
            http://www.pemberley.com/janeinfo/latin1.html#latexta

+     Alvaro Herrera is &Aacute;lvaro Herrera
+
  wrap long lines

  For new features, add links to the documentation sections.  Use </link>

Re: non-ASCII characters in SGML documentation (and elsewhere)

From
Alvaro Herrera
Date:
Excerpts from Bruce Momjian's message of Wed Oct 12 18:21:19 -0300 2011:
> Alvaro Herrera wrote:
> > Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:
> > > On 20.05.2011 13:56, Tom Lane wrote:
> > > >>        * Should we allow non-ASCII characters in general source files?
> > > > Prefer "no" here.
> > >
> > > I only see two reasons for non-ASCII signs in English.
> > > Either it is a foreign name of e.g. a person
> > > or it is a word that English took from French like in déjà vu.
> >
> > I'd like my name accented in the release notes, thanks.
>
> Sure, you want the first "A" in Alvaro with an accent.  I would love to
> backpatch that but it would be a royal pain.  I am afraid it can only
> easily be done in future release notes.

Many thanks, Bruce.
