Thread: Doc: typo in config.sgml

Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..08173ecb5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9380,7 +9380,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         If <varname>transaction_timeout</varname> is shorter or equal to
         <varname>idle_in_transaction_session_timeout</varname> or <varname>statement_timeout</varname>
-        then the longer timeout is ignored.
+        then the longer timeout is ignored.
        </para>

        <para>

Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> I think there's an unnecessary underscore in config.sgml.
>> Attached patch fixes it.
> 
> I could not apply the patch with an error.
> 
>  error: patch failed: doc/src/sgml/config.sgml:9380
>  error: doc/src/sgml/config.sgml: patch does not apply

Strange. I have no problem applying the patch here.

> I found your patch contains an odd character (ASCII Code 240?)
> by performing `od -c` command on the file. See the attached file.

Yes, 240 in octal (== 0xa0) is in the patch, but that's because the current
config.sgml includes the character. You can check it by looking at
line 9383 of config.sgml.
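
For reference, a minimal Python sketch of the byte values under discussion, assuming the character in question is U+00A0 (no-break space) encoded as UTF-8. The variable names are illustrative only.

```python
# U+00A0 (no-break space) and its UTF-8 encoding.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

# Render each byte in octal (the notation od -c uses) and in hex.
octal_bytes = [format(b, "o") for b in encoded]
hex_bytes = [format(b, "02x") for b in encoded]

print(octal_bytes)  # ['302', '240']
print(hex_bytes)    # ['c2', 'a0']
```

So octal 240 corresponds to the second byte (0xa0), and octal 302 to the lead byte (0xc2).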

I think it was introduced by 28e858c0f95.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> >> I think there's an unnecessary underscore in config.sgml.
> >> Attached patch fixes it.
> > 
> > I could not apply the patch with an error.
> > 
> >  error: patch failed: doc/src/sgml/config.sgml:9380
> >  error: doc/src/sgml/config.sgml: patch does not apply
> 
> Strange. I have no problem applying the patch here.
> 
> > I found your patch contains an odd character (ASCII Code 240?)
> > by performing `od -c` command on the file. See the attached file.
> 
> Yes, 240 in octal (== 0xa0) is in the patch but it's because current
> config.sgml includes the character. You can check it by looking at
> line 9383 of config.sgml.

Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
although I still could not apply the patch. 

I think this is a non-breaking space (C2A0 in UTF-8). I guess my
terminal normally treats it as a regular space, so applying the patch fails.

I found it also in line 85 of ref/drop_extension.sgml.


> 
> I think it was introduced by 28e858c0f95.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>>> I think there's an unnecessary underscore in config.sgml.

I was wrong. The byte sequence just looked like an underscore in my
editor, but it is actually 0xc2a0, which must be a
"non-breaking space" encoded in UTF-8. I guess someone mistakenly
inserted a non-breaking space while editing config.sgml.

However the mistake does not affect the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Mon, 30 Sep 2024 18:03:44 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> >>> I think there's an unnecessary underscore in config.sgml.
> 
> I was wrong. The particular byte sequences just looked an underscore
> on my editor but the byte sequence is actually 0xc2a0, which must be a
> "non breaking space" encoded in UTF-8. I guess someone mistakenly
> insert a non breaking space while editing config.sgml.
> 
> However the mistake does not affect the patch.

It looks like we've crisscrossed our mail.
Anyway, I agree with removing non breaking spaces, as well as
one found in line 85 of ref/drop_extension.sgml.

Regards,
Yugo Nagata

> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Yugo Nagata
Date:
On Mon, 30 Sep 2024 11:59:48 +0200
Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > 
> >>>> I think there's an unnecessary underscore in config.sgml.
> > 
> > I was wrong. The particular byte sequences just looked an underscore
> > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > insert a non breaking space while editing config.sgml.
> 
> I wonder if it would be worth to add a check for this like we have to tabs?
> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> (doing so made me realize we don't have an equivalent meson target).

Your patch couldn't detect 0xA0 in config.sgml on my machine, but it works
when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.

However, it also detects the following line in charset.sgml.
(https://www.postgresql.org/docs/current/collation.html)

 For example, locale und-u-kb sorts 'àe' before 'aé'.

This is not a non-breaking space, so it should not be detected as an error.
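
A small Python sketch of why a bare 0xA0 byte search flags that line, assuming the file is UTF-8 and the accented letters are the precomposed U+00E0 and U+00E9 (written as escapes here so the code points are unambiguous):

```python
# The charset.sgml example line, with the accented letters as explicit escapes.
line = "For example, locale und-u-kb sorts '\u00e0e' before 'a\u00e9'."
raw = line.encode("utf-8")

# U+00E0 is encoded as the two bytes C3 A0, so a search for the single
# byte A0 hits even though the line contains no no-break space.
has_a0_byte = b"\xa0" in raw

# A real UTF-8 no-break space is the two-byte sequence C2 A0.
has_nbsp = b"\xc2\xa0" in raw

print(has_a0_byte, has_nbsp)  # True False
```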

Regards,
Yugo Nagata

> --
> Daniel Gustafsson
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> I wonder if it would be worth to add a check for this like we have to tabs?

+1.

>> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
>> (doing so made me realize we don't have an equivalent meson target).
>
> Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works
> when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
>
> However, it also detects the following line in charset.sgml.
> (https://www.postgresql.org/docs/current/collation.html)
>
>  For example, locale und-u-kb sorts 'àe' before 'aé'.
>
> This is not non-breaking space, so should not be detected as an error.

That's because a non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes); 0xa0 is nbsp's code
point in Unicode, i.e. U+00A0.
So grep -P "[\xC2\xA0]" should work to detect nbsp.
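
As an illustration of the difference between a bracketed class and the plain two-byte sequence, here is a byte-oriented Python sketch (comparable to matching under LC_ALL=C; grep -P in a UTF-8 locale may behave differently). The sample lines and the `matches` helper are hypothetical.

```python
import re

# Hypothetical sample lines, encoded as UTF-8 bytes.
nbsp_line = "statement_timeout\u00a0then".encode("utf-8")  # contains C2 A0
copyright_line = "\u00a9 2024".encode("utf-8")              # C2 A9, no nbsp
accent_line = "sorts '\u00e0e' first".encode("utf-8")       # C3 A0, no nbsp

either_byte = re.compile(rb"[\xc2\xa0]")  # class: byte C2 OR byte A0
sequence = re.compile(rb"\xc2\xa0")       # C2 immediately followed by A0


def matches(pattern, data):
    """Return True if the compiled bytes pattern occurs anywhere in data."""
    return pattern.search(data) is not None


for name, data in [("nbsp", nbsp_line),
                   ("copyright", copyright_line),
                   ("accent", accent_line)]:
    print(name, matches(either_byte, data), matches(sequence, data))
```

In byte mode the class form matches all three lines, while the sequence form matches only the line that really contains a no-break space.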

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> That's because non-breaking space (nbsp) is not encoded as 0xa0 in
>> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
>> point in Unicode. i.e. U+00A0).
>> So grep -P "[\xC2\xA0]" should work to detect nbsp.
> 
> `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. 
> ([ and ] were not necessary.)
> 
> When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
> but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
> nbsp.
> 
> One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
> 
> On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
> 
> Maybe, better way is use perl itself rather than grep as following.
> 
>  `perl -ne '/\xC2\xA0/ and print' `
> 
> I attached a patch fixed in this way.

GNU sed can also be used without setting LC_ALL:

sed -n /"\xC2\xA0"/p

However I am not sure if non-GNU sed can do this too...

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
> Tatsuo Ishii <ishii@postgresql.org> wrote:
> 
>> >> I think there's an unnecessary underscore in config.sgml.
>> >> Attached patch fixes it.
>> > 
>> > I could not apply the patch with an error.
>> > 
>> >  error: patch failed: doc/src/sgml/config.sgml:9380
>> >  error: doc/src/sgml/config.sgml: patch does not apply
>> 
>> Strange. I have no problem applying the patch here.
>> 
>> > I found your patch contains an odd character (ASCII Code 240?)
>> > by performing `od -c` command on the file. See the attached file.
>> 
>> Yes, 240 in octal (== 0xa0) is in the patch but it's because current
>> config.sgml includes the character. You can check it by looking at
>> line 9383 of config.sgml.
> 
> Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
> although I still could not apply the patch. 
> 
> I think this is non-breaking space of (C2A0) of utf-8. I guess my
> terminal normally regards this as a space, so applying patch fails.
> 
> I found it also in line 85 of ref/drop_extension.sgml.

Thanks. I have pushed the fix for ref/drop_extension.sgml along with
config.sgml.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > 
> >>>> I think there's an unnecessary underscore in config.sgml.
> > 
> > I was wrong. The particular byte sequences just looked an underscore
> > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > insert a non breaking space while editing config.sgml.
> 
> I wonder if it would be worth to add a check for this like we have to tabs?
> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> (doing so made me realize we don't have an equivalent meson target).

Can we check for any character outside the support range of SGML?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Tue, 1 Oct 2024 22:20:55 +0900
> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>> On Tue, 1 Oct 2024 15:16:52 +0900
>> Yugo NAGATA <nagata@sraoss.co.jp> wrote:
>> 
>> > On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
>> > Tatsuo Ishii <ishii@postgresql.org> wrote:
>> > 
>> > > >> That's because non-breaking space (nbsp) is not encoded as 0xa0 in
>> > > >> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
>> > > >> point in Unicode. i.e. U+00A0).
>> > > >> So grep -P "[\xC2\xA0]" should work to detect nbsp.
>> > > > 
>> > > > `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. 
>> > > > ([ and ] were not necessary.)
>> > > > 
>> > > > When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
>> > > > but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
>> > > > nbsp.
>> > > > 
>> > > > One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
>> > > > 
>> > > > On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
>> > > > 
>> > > > Maybe, better way is use perl itself rather than grep as following.
>> > > > 
>> > > >  `perl -ne '/\xC2\xA0/ and print' `
>> > > > 
>> > > > I attached a patch fixed in this way.
>> > > 
>> > > GNU sed can also be used without setting LC_ALL:
>> > > 
>> > > sed -n /"\xC2\xA0"/p
>> > > 
>> > > However I am not sure if non-GNU sed can do this too...
>> > 
>> > Although I've not check it myself, BSD sed doesn't support \x escape according to [1].
>> > 
>> > [1] https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference
>> > 
>> > By the way, I've attached a patch a bit modified to use the plural form statement
>> > as same as check-tabs.
>> > 
>> >  Non-breaking **spaces** appear in SGML/XML files
>> 
>> The previous patch was broken because the perl command failed to return the correct result.
>> I've attached an updated patch to fix the return value. In passing, I added line breaks
>> for long lines.
> 
> I've attached a updated patch. 
> I added the comment to explain why Perl is used instead of grep or sed.

Looks good to me. If there's no objection, I will commit this to
master branch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Daniel Gustafsson
Date:
> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> On Tue, 1 Oct 2024 22:20:55 +0900
>> Yugo Nagata <nagata@sraoss.co.jp> wrote:

>> I've attached a updated patch. 
>> I added the comment to explain why Perl is used instead of grep or sed.
> 
> Looks good to me. If there's no objection, I will commit this to
> master branch.

No objections, LGTM.

--
Daniel Gustafsson




Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
Hi Daniel, Yugo,

>> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> On Tue, 1 Oct 2024 22:20:55 +0900
>>> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>>> I've attached a updated patch. 
>>> I added the comment to explain why Perl is used instead of grep or sed.
>> 
>> Looks good to me. If there's no objection, I will commit this to
>> master branch.
> 
> No objections, LGTM.

Thank you for the patch and review! I have pushed the patch.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b7da5c261d1af1a5d6a275e1090b07de3654033

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo Nagata
Date:
On Mon, 7 Oct 2024 15:45:54 -0400
Bruce Momjian <bruce@momjian.us> wrote:

> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > > 
> > >>>> I think there's an unnecessary underscore in config.sgml.
> > > 
> > > I was wrong. The particular byte sequences just looked an underscore
> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > > insert a non breaking space while editing config.sgml.
> > 
> > I wonder if it would be worth to add a check for this like we have to tabs?
> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> > (doing so made me realize we don't have an equivalent meson target).
> 
> Can we check for any character outside the support range of SGML?

How can we define the range of allowed characters in SGML?

We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
but they are used in some places in charset.sgml and some names in release-*.sgml.
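
A minimal Python sketch of the second pattern, applied to decoded text rather than raw bytes. The helper name `non_ascii_chars` is hypothetical.

```python
import re

# Same character class as /[^\x00-\x7f]/: anything outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def non_ascii_chars(text):
    """Return every non-ASCII character found in text, in order."""
    return NON_ASCII.findall(text)


print(non_ascii_chars("plain ASCII only"))        # []
print(non_ascii_chars("\u00c1lvaro"))             # one hit: U+00C1
print(non_ascii_chars("timeout\u00a0then"))       # one hit: U+00A0
```

As noted above, a check like this also reports legitimate uses such as accented names, so its output still needs review.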

Regards,
Yugo Nagata

> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
> 
>   When a patient asks the doctor, "Am I going to die?", he means 
>   "Am I going to die soon?"
> 
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Mon, 7 Oct 2024 15:45:54 -0400
> Bruce Momjian <bruce@momjian.us> wrote:
> 
>> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
>> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> > > 
>> > >>>> I think there's an unnecessary underscore in config.sgml.
>> > > 
>> > > I was wrong. The particular byte sequences just looked an underscore
>> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
>> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
>> > > insert a non breaking space while editing config.sgml.
>> > 
>> > I wonder if it would be worth to add a check for this like we have to tabs?
>> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
>> > (doing so made me realize we don't have an equivalent meson target).
>> 
>> Can we check for any character outside the support range of SGML?
> 
> What we can define the range of allowed characters range in SGML?
> 
> We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
> but they are used in some places in charset.sgml and some names in release-*.sgml.

I failed to find any standard regarding what characters are allowed in
SGML/XML. Assuming that any valid Unicode characters are allowed in
our *sgml files, I am afraid the best we can do is grepping non-ASCII
characters against the files and checking the results by a visual
inspection. Besides nbsp, there are tons of confusing Unicode
characters out there. For example there are many "hyphen like
characters".

https://www.compart.com/en/unicode/category/Pd

If one of them is used in the sgml files, it may be possible that it
was accidentally inserted.
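
To illustrate, a short Python sketch that lists characters in the Unicode "Pd" (dash punctuation) category other than the ASCII hyphen-minus. The function name `dash_like` is hypothetical, and this covers only that one category, not every confusable character.

```python
import unicodedata


def dash_like(text):
    """Return (char, Unicode name) for each Pd character except ASCII '-'."""
    return [
        (ch, unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Pd" and ch != "-"
    ]


print(dash_like("pg_dump --schema-only"))   # []
print(dash_like("pg_dump \u2013schema"))    # reports EN DASH
```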

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct  9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote:
> >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> >> > > 
> >> > >>>> I think there's an unnecessary underscore in config.sgml.
> >> > > 
> >> > > I was wrong. The particular byte sequences just looked an underscore
> >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
> >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> >> > > insert a non breaking space while editing config.sgml.
> >> > 
> >> > I wonder if it would be worth to add a check for this like we have to tabs?
> >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> >> > (doing so made me realize we don't have an equivalent meson target).
> >> 
> >> Can we check for any character outside the support range of SGML?
> > 
> > What we can define the range of allowed characters range in SGML?
> > 
> > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
> > but they are used in some places in charset.sgml and some names in release-*.sgml.
> 
> I failed to find any standard regarding what characters are allowed in
> SGML/XML. Assuming that any valid Unicode characters are allowed in
> our *sgml files, I am afraid the best we can do is grepping non-ASCII
> characters against the files and checking the results by a visual
> inspection. Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".
> 
> https://www.compart.com/en/unicode/category/Pd
> 
> If one of them is used in the sgml files, it may be possible that it
> was accidentally inserted.

Can we use Unicode in the SGML files?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Can we use Unicode in the SGML files?

I believe we've been doing it for contributors' names that require
non-ASCII letters, but not in any other places.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> Bruce Momjian <bruce@momjian.us> writes:
>> Can we use Unicode in the SGML files?
> 
> I believe we've been doing it for contributors' names that require
> non-ASCII letters, but not in any other places.

We have non-ASCII letters in charset.sgml too, to show some examples
of collation.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Daniel Gustafsson
Date:
> On 9 Oct 2024, at 04:49, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".

Using characters which look alike is in the field of internet security known as
homograph attacks [0], where for example a url visually passes for postgresql.org
but in fact leads to an attacker.  That sort of attack clearly doesn't apply to
our docs though.  However, what might cause similar problems is if we use a
unicode character in example code which the reader could be expected to
copy/paste into psql and run, which then (at best) causes a syntax error.  We
could probably build tooling to catch this (most likely not too hard in XSLT)
but the ROI for that might be unfavourable.  Even with tooling, committer
caution is needed to ensure we don't publish examples that might cause
unintended side effects when executed by copy/paste.

What separates nbsp is that it may affect the rendering in an un-intuitive way
by forcing two words to not break even if the viewport is too narrow to fit.
Catching such characters seems worthwhile since it's also quite doable with a
trivial grep.

--
Daniel Gustafsson

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack


Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> that contains lines which are allowed to have non-ascii characters,
> although this list will need to be maintained when lines in it are modified.
> I've attached a patch to add a simple Perl script to do this.

I doubt it really works. For example, nbsp can be used for formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to use or not to use nbsp, the "allowlist" needs to be
maintained. It's too annoying.

I think it's better to add the non-ASCII character check to the
committing checklist and let committers check for non-ASCII characters in
the patch. Non-ASCII characters are rarely used, so it would not become a
burden.
https://wiki.postgresql.org/wiki/Committing_checklist

Maybe we can add to the wiki page something like this?

git diff origin/master | grep -P '[^\x00-\x7f]'
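
For platforms where grep -P is not available, a rough Python equivalent could look like the sketch below. It is an assumption-laden variant, not the same command: it takes the diff as a string and reports only added lines, whereas the grep above scans every line of the diff. The function name is hypothetical.

```python
def non_ascii_added_lines(diff_text):
    """Return added lines ('+' prefix, excluding '+++' headers) of a
    unified diff that contain at least one non-ASCII character."""
    hits = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            if any(ord(ch) > 0x7F for ch in line):
                hits.append(line)
    return hits


# Hypothetical diff fragment: one removed and one added line contain U+00A0.
sample = (
    "--- a/doc/src/sgml/example.sgml\n"
    "+++ b/doc/src/sgml/example.sgml\n"
    "-old\u00a0line\n"
    "+new line\n"
    "+bad\u00a0line\n"
)
print(non_ascii_added_lines(sample))
```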

> During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> characters. I don't know these characters are really necessary though, since
> I don't understand this file well.

They are U+201C (double turned comma quotation mark) and U+201D
(double comma quotation mark).

       <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>

I would like to know why they are necessary too.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Fri, 11 Oct 2024 12:16:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> > We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> > that contains lines which are allowed to have non-ascii characters,
> > although this list will need to be maintained when lines in it are modified.
> > I've attached a patch to add a simple Perl script to do this.
> 
> I doubt it really works. For example, nbsp can be used formatting
> (that's the purpose of the character in the first place). Whenever a
> developer decides to or not to use nbsp, "allowlist" needs to be
> maintained. It's too annoying.

I suppose non-ascii characters including nbsp are basically disallowed,
so the allowlist will not increase unless there is some special reason.

However, it is true that there might be a cost for maintaining the list
more or less, so if people don't think it is worth adding this check, 
I will withdraw this proposal.

> I think it's better to add the non-ASCII character checking to the
> comitting check list and let committers check non-ASCII character in
> the patch. Non-ASCII characters rarely used and it would not become a
> burden.
> https://wiki.postgresql.org/wiki/Committing_checklist
> 
> Maybe we can add to the wiki page something like this?
> 
> git diff origin/master | grep -P '[^\x00-\x7f]'
> 
> > During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> > characters. I don't know these characters are really necessary though, since
> > I don't understand this file well.
> 
> They are U+201C (double turned comma quotation mark) and U+201D
> (double comma quotation mark).
> 
>        <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
> 
> I would like to know why they are necessary too.

+1

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
> I did some more research and we able to clarify our behavior in
> release.sgml:

I have specified some more details in my patched version:

        We can only use Latin1 characters, not all UTF8 characters,
        because some rendering engines do not support non-Latin1 UTF8
        characters.  Specifically, the HTML rendering engine can display
        all UTF8 characters, but the PDF rendering engine can only display
        Latin1 characters.  In PDF files, non-Latin1 UTF8 characters are
        displayed as "###".
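
A minimal Python sketch of the ASCII / Latin1 / non-Latin1 distinction drawn above, assuming "Latin1" means the ISO-8859-1 code point range U+0000 to U+00FF. The function name `charset_class` is hypothetical.

```python
def charset_class(ch):
    """Classify a single character by code point range."""
    code = ord(ch)
    if code < 0x80:
        return "ascii"
    if code <= 0xFF:
        return "latin1"
    return "non-latin1"


for ch in ["a", "\u00c1", "\u00a0", "\u0431", "\u201c"]:
    print(f"U+{ord(ch):04X}", charset_class(ch))
```

Under this definition the no-break space (U+00A0) itself falls in the Latin1 range, while Cyrillic letters and curly quotes do not.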

        In the SGML files we encode non-ASCII Latin1 characters as HTML
        entities, e.g., &Aacute;lvaro.  Oddly, it is possible to safely
        represent Latin1 characters in SGML files as UTF8 for HTML and
        PDF output, but we currently disallow this via the Makefile
        "check-non-ascii" rule.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 18:54, Bruce Momjian wrote:
>> I agree with encoding non-Latin1 characters and disallowing non-ASCII
>> characters totally.
>>
>> I found your patch includes fixes in *.svg files, so how about checking
>> also them by check-non-ascii? Also, I think it is better to use perl instead
>> of grep because non-GNU grep doesn't support hex escape sequences. I've attached
>> a updated patch for Makefile. The changes in release.sgml above is not applied
>> yet, though.
> Yes, good idea on using Perl and checking svg files --- I have used your
> Makefile rule.
> 
> Attached is an updated patch.  I realized that the new rules apply to
> all SGML files, not just the release notes, so I have created
> README.non-ASCII and moved the description there.

I don't understand the point of this.  Maybe it's okay to try to detect 
certain "hidden" whitespace characters, like in the case that started 
this thread.  But I don't see the value in prohibiting all non-ASCII 
characters, as is being proposed here.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Oct 15, 2024 at 10:34:16PM +0200, Peter Eisentraut wrote:
> On 15.10.24 18:54, Bruce Momjian wrote:
> > > I agree with encoding non-Latin1 characters and disallowing non-ASCII
> > > characters totally.
> > > 
> > > I found your patch includes fixes in *.svg files, so how about checking
> > > also them by check-non-ascii? Also, I think it is better to use perl instead
> > > of grep because non-GNU grep doesn't support hex escape sequences. I've attached
> > > a updated patch for Makefile. The changes in release.sgml above is not applied
> > > yet, though.
> > Yes, good idea on using Perl and checking svg files --- I have used your
> > Makefile rule.
> > 
> > Attached is an updated patch.  I realized that the new rules apply to
> > all SGML files, not just the release notes, so I have created
> > README.non-ASCII and moved the description there.
> 
> I don't understand the point of this.  Maybe it's okay to try to detect
> certain "hidden" whitespace characters, like in the case that started this
> thread.  But I don't see the value in prohibiting all non-ASCII characters,
> as is being proposed here.

Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
if desired.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 22:37, Bruce Momjian wrote:
>> I don't understand the point of this.  Maybe it's okay to try to detect
>> certain "hidden" whitespace characters, like in the case that started this
>> thread.  But I don't see the value in prohibiting all non-ASCII characters,
>> as is being proposed here.
> Well, we can only use Latin-1, so the idea is that we will be explicit
> about specifying Latin-1 only as HTML entities, rather than letting
> non-Latin-1 creep in as UTF8.

But your patch prohibits even otherwise allowed Latin-1 characters.

I don't see why we need to enforce this at this level.  Whatever 
downstream toolchain has requirements about which characters are allowed 
will complain if it encounters a character it doesn't like.




Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Well, we can only use Latin-1, so the idea is that we will be explicit
> about specifying Latin-1 only as HTML entities, rather than letting
> non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> if desired.

That policy would cause substantial problems with contributor names
in the release notes.  I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Well, we can only use Latin-1, so the idea is that we will be explicit
> > about specifying Latin-1 only as HTML entities, rather than letting
> > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> > if desired.
> 
> That policy would cause substantial problems with contributor names
> in the release notes.  I agree with Peter that we don't need this.
> Catching otherwise-invisible characters seems sufficient.

Uh, why can't we use HTML entities going forward?  Is that harder?  Can
we just exclude the release notes from this check?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> That policy would cause substantial problems with contributor names
>> in the release notes.  I agree with Peter that we don't need this.
>> Catching otherwise-invisible characters seems sufficient.

> Uh, why can't we use HTML entities going forward?  Is that harder?

Yes: it requires looking up the entities.  The mail you are probably
consulting to make a release note or commit message is most likely
just going to contain the person's name as normally spelled.

Plus (as you pointed out earlier today) there aren't HTML entities for
all characters.
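
Both points can be seen in a small Python sketch that uses the standard library's HTML 4.01 named-entity table (html.entities.codepoint2name). The helper `to_entity` is hypothetical, and the fallback to a numeric character reference is an assumption for illustration, not something the thread proposes.

```python
from html.entities import codepoint2name


def to_entity(ch):
    """Return a named HTML 4.01 entity for ch if the table has one,
    otherwise a decimal numeric character reference."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return "&" + name + ";"
    return "&#" + str(ord(ch)) + ";"


print(to_entity("\u00c1"))  # &Aacute;
print(to_entity("\u0431"))  # &#1073;
```

The Latin-1 letter has a named entity to look up, while the Cyrillic letter has none in that table.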

> Can we just exclude the release notes from this check?

What is the point of a check we can only enforce against part of the
documentation?

            regards, tom lane



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 23:51, Bruce Momjian wrote:
>> I don't see why we need to enforce this at this level.  Whatever downstream
>> toolchain has requirements about which characters are allowed will complain
>> if it encounters a character it doesn't like.
> 
> Uh, the PDF build does not complain if you pass it non-Latin-1 UTF8
> characters.  To test this I added some Russian characters (non-Latin-1)
> to release.sgml:
> 
>     (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
>     ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
>     ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
>     letters or "signs" (⟨ъ⟩, ⟨ь⟩)
> 
> and I ran 'make postgres-US.pdf', and then removed the Russian
> characters and ran the same command again.  The output, including stderr
> was identical.  The PDFs, of course, were not, with the Russian
> characters showing as "####".  Makefile output attached.

Hmm, mine complains:

/opt/homebrew/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting 
with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. 
Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Glyph "⟨" (0x27e8) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "б" (0x431, afii10066) not available in font 
"Times-Roman".
[WARN] FOUserAgent - Glyph "⟩" (0x27e9) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "в" (0x432, afii10067) not available in font 
"Times-Roman".
[WARN] FOUserAgent - Glyph "г" (0x433, afii10068) not available in font 
"Times-Roman".
[WARN] FOUserAgent - Glyph "д" (0x434, afii10069) not available in font 
"Times-Roman".
[WARN] FOUserAgent - Glyph "ж" (0x436, afii10072) not available in font 
"Times-Roman".
[WARN] FOUserAgent - Glyph "з" (0x437, afii10073) not available in font 
"Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value 
found on the parent FO.




Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 23:51, Bruce Momjian wrote:
> On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Well, we can only use Latin-1, so the idea is that we will be explicit
>>> about specifying Latin-1 only as HTML entities, rather than letting
>>> non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
>>> if desired.
>>
>> That policy would cause substantial problems with contributor names
>> in the release notes.  I agree with Peter that we don't need this.
>> Catching otherwise-invisible characters seems sufficient.
> 
> Uh, why can't we use HTML entities going forward?  Is that harder?

I think the question should be the other way around.  The entities are a 
historical workaround for when encoding support and rendering support 
was poor.  Now you can just type in the characters you want as is, which 
seems nicer.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> On 15.10.24 23:51, Bruce Momjian wrote:
> > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > > > Well, we can only use Latin-1, so the idea is that we will be explicit
> > > > about specifying Latin-1 only as HTML entities, rather than letting
> > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> > > > if desired.
> > > 
> > > That policy would cause substantial problems with contributor names
> > > in the release notes.  I agree with Peter that we don't need this.
> > > Catching otherwise-invisible characters seems sufficient.
> > 
> > Uh, why can't we use HTML entities going forward?  Is that harder?
> 
> I think the question should be the other way around.  The entities are a
> historical workaround for when encoding support and rendering support was
> poor.  Now you can just type in the characters you want as is, which seems
> nicer.

Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct 16, 2024 at 09:58:23AM +0200, Peter Eisentraut wrote:
> On 15.10.24 23:51, Bruce Momjian wrote:
> > > I don't see why we need to enforce this at this level.  Whatever downstream
> > > toolchain has requirements about which characters are allowed will complain
> > > if it encounters a character it doesn't like.
> > 
> > Uh, the PDF build does not complain if you pass it non-Latin-1 UTF8
> > characters.  To test this I added some Russian characters (non-Latin-1)
> > to release.sgml:
> > 
> >     (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
> >     ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
> >     ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
> >     letters or "signs" (⟨ъ⟩, ⟨ь⟩)
> > 
> > and I ran 'make postgres-US.pdf', and then removed the Russian
> > characters and ran the same command again.  The output, including stderr
> > was identical.  The PDFs, of course, were not, with the Russian
> > characters showing as "####".  Makefile output attached.
> 
> Hmm, mine complains:

My Debian 12 toolchain must be older.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> > On 15.10.24 23:51, Bruce Momjian wrote:
> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> > > > Bruce Momjian <bruce@momjian.us> writes:
> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit
> > > > > about specifying Latin-1 only as HTML entities, rather than letting
> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> > > > > if desired.
> > > > 
> > > > That policy would cause substantial problems with contributor names
> > > > in the release notes.  I agree with Peter that we don't need this.
> > > > Catching otherwise-invisible characters seems sufficient.
> > > 
> > > Uh, why can't we use HTML entities going forward?  Is that harder?
> > 
> > I think the question should be the other way around.  The entities are a
> > historical workaround for when encoding support and rendering support was
> > poor.  Now you can just type in the characters you want as is, which seems
> > nicer.
> 
> Yes, that does make sense, and if we fully supported Unicode, we could
> ignore all of this.

Patch applied to master --- no new UTF8 restrictions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
Hi Bruce,

> On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
>> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
>> > On 15.10.24 23:51, Bruce Momjian wrote:
>> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> > > > Bruce Momjian <bruce@momjian.us> writes:
>> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit
>> > > > > about specifying Latin-1 only as HTML entities, rather than letting
>> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
>> > > > > if desired.
>> > > > 
>> > > > That policy would cause substantial problems with contributor names
>> > > > in the release notes.  I agree with Peter that we don't need this.
>> > > > Catching otherwise-invisible characters seems sufficient.
>> > > 
>> > > Uh, why can't we use HTML entities going forward?  Is that harder?
>> > 
>> > I think the question should be the other way around.  The entities are a
>> > historical workaround for when encoding support and rendering support was
>> > poor.  Now you can just type in the characters you want as is, which seems
>> > nicer.
>> 
>> Yes, that does make sense, and if we fully supported Unicode, we could
>> ignore all of this.
> 
> Patch applied to master --- no new UTF8 restrictions.

I thought the conclusion of the discussion was to allow LATIN1
(or UTF-8 encoded LATIN1) characters in SGML files without converting
them to HTML entities. Your patch seems to do the opposite.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Sat, Nov  2, 2024 at 07:27:00AM +0900, Tatsuo Ishii wrote:
> > On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
> >> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> >> > On 15.10.24 23:51, Bruce Momjian wrote:
> >> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> >> > > > Bruce Momjian <bruce@momjian.us> writes:
> >> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit
> >> > > > > about specifying Latin-1 only as HTML entities, rather than letting
> >> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> >> > > > > if desired.
> >> > > > 
> >> > > > That policy would cause substantial problems with contributor names
> >> > > > in the release notes.  I agree with Peter that we don't need this.
> >> > > > Catching otherwise-invisible characters seems sufficient.
> >> > > 
> >> > > Uh, why can't we use HTML entities going forward?  Is that harder?
> >> > 
> >> > I think the question should be the other way around.  The entities are a
> >> > historical workaround for when encoding support and rendering support was
> >> > poor.  Now you can just type in the characters you want as is, which seems
> >> > nicer.
> >> 
> >> Yes, that does make sense, and if we fully supported Unicode, we could
> >> ignore all of this.
> > 
> > Patch applied to master --- no new UTF8 restrictions.
> 
> I thought the conclusion of the discussion was to allow LATIN1
> (or UTF-8 encoded LATIN1) characters in SGML files without converting
> them to HTML entities. Your patch seems to do the opposite.
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5

Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.

I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> LATIN1 characters we had with HTML entities, so there are none
> currently.
> 
> I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> so I added a cron job on my server to alert me when non-ASCII characters
> appear.

So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:

iconv -t ISO-8859-1 -f UTF-8 release-17.sgml 

If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:

iconv: illegal input sequence at position 175

An advantage of this is that we don't need to convert each LATIN1
character to an HTML entity, which makes the SGML file authors' lives a
little bit easier.
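
A minimal sketch of how the iconv check above could be run over several
files while also reporting which file failed. The function name
check_latin1 and its message text are hypothetical and are not taken from
any patch in this thread. Note that iconv also fails on input that is not
valid UTF-8, so a failure means "not convertible", not strictly
"non-Latin-1".

```shell
#!/bin/sh
# Hypothetical helper: report every file whose UTF-8 content cannot be
# converted to ISO-8859-1. Returns 1 if any file fails, 0 otherwise.
check_latin1() {
    status=0
    for file in "$@"; do
        # iconv exits non-zero when a character has no Latin-1 mapping
        # (or when the input is not valid UTF-8 at all).
        if ! iconv -f UTF-8 -t ISO-8859-1 "$file" >/dev/null 2>&1; then
            echo "$file: contains non-Latin-1 characters"
            status=1
        fi
    done
    return "$status"
}
```

It could be invoked as, for example, check_latin1 doc/src/sgml/*.sgml.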

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Sat, Nov  2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
> > Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> > LATIN1 characters we had with HTML entities, so there are none
> > currently.
> > 
> > I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> > so I added a cron job on my server to alert me when non-ASCII characters
> > appear.
> 
> So you convert LATIN1 characters to HTML entities so that it's easier
> to detect non-LATIN1 characters in the SGML docs? If my
> understanding is correct, it can be also achieved by using some tools
> like:
> 
> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml 
> 
> If there are some non-LATIN1 characters in release-17.sgml,
> it will complain like:
> 
> iconv: illegal input sequence at position 175
> 
> An advantage of this is that we don't need to convert each LATIN1
> character to an HTML entity, which makes the SGML file authors' lives a
> little bit easier.

I might have misread the feedback.  I know people didn't want a Makefile
rule to prevent it, but I thought converting the few UTF8 characters we had was
acceptable.  Let me think some more and come up with a patch.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 02.11.24 14:18, Bruce Momjian wrote:
> On Sat, Nov  2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
>>> Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
>>> LATIN1 characters we had with HTML entities, so there are none
>>> currently.
>>>
>>> I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
>>> so I added a cron job on my server to alert me when non-ASCII characters
>>> appear.
>>
>> So you convert LATIN1 characters to HTML entities so that it's easier
>> to detect non-LATIN1 characters in the SGML docs? If my
>> understanding is correct, it can be also achieved by using some tools
>> like:
>>
>> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
>>
>> If there are some non-LATIN1 characters in release-17.sgml,
>> it will complain like:
>>
>> iconv: illegal input sequence at position 175
>>
>> An advantage of this is that we don't need to convert each LATIN1
>> character to an HTML entity, which makes the SGML file authors' lives a
>> little bit easier.
> 
> I might have misread the feedback.  I know people didn't want a Makefile
> rule to prevent it, but I thought converting the few UTF8 characters we had was
> acceptable.  Let me think some more and come up with a patch.

The question of encoding characters as entities is orthogonal to the 
issue of only allowing Unicode characters that have a mapping to Latin 
1.  This patch seems to confuse these two issues, and I don't think it 
actually fixed the second one, which is the one that was complained 
about.  I don't think anyone actually complained about the first one, 
which is the one that was actually patched.

I think the iconv approach is an idea worth checking out.

It's also not necessarily true that the set of characters provided by 
the built-in PDF fonts is exactly the set of characters in Latin 1.  It 
appears to be close enough, but I'm not sure, and I haven't found any 
authoritative information on that.  Another approach for a fix would be 
to get FOP to produce the required warnings or errors more reliably.  I 
know it has a bunch of logging settings (ultimately via log4j), so there 
might be some possibilities.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
> On Tue, 5 Nov 2024 10:08:17 +0100
> Peter Eisentraut <peter@eisentraut.org> wrote:
> 
> 
> > >> So you convert LATIN1 characters to HTML entities so that it's easier
> > >> to detect non-LATIN1 characters in the SGML docs? If my
> > >> understanding is correct, it can be also achieved by using some tools
> > >> like:
> > >>
> > >> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> > >>
> > >> If there are some non-LATIN1 characters in release-17.sgml,
> > >> it will complain like:
> > >>
> > >> iconv: illegal input sequence at position 175
> > >>
> > >> An advantage of this is that we don't need to convert each LATIN1
> > >> character to an HTML entity, which makes the SGML file authors' lives a
> > >> little bit easier.
> 
> > I think the iconv approach is an idea worth checking out.
> > 
> > It's also not necessarily true that the set of characters provided by 
> > the built-in PDF fonts is exactly the set of characters in Latin 1.  It 
> > appears to be close enough, but I'm not sure, and I haven't found any 
> > authoritative information on that.  
> 
> I found a description in FAQ on Apache FOP [1] that explains some glyphs for
> Latin1 character set are not contained in the standard text fonts.
> 
>  The standard text fonts supplied with Acrobat Reader have mostly glyphs for
>  characters from the ISO Latin 1 character set. For a variety of reasons, even
>  those are not completely guaranteed to work, for example you can't use the fi
>  ligature from the standard serif font.

So, the failure of ligatures is usually caused by not using the right
Adobe Font Metric (AFM) file, I think.  I have seen faulty ligature
rendering in PDFs but was always able to fix it by using the right AFM
file.  Odds are, the failure is caused by using a standard Latin1 AFM file
and not the AFM file that matches the font being used.

> [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
> 
> However, it seems that using iconv to detect non-Latin1 characters may be still
> useful because these are likely not displayed in PDF. For example, we can do this
> in make check as the attached patch 0002. It cannot show the filename where one
> is found, though.

I was thinking something like:

    grep -l --recursive  -P '[\x80-\xFF]' . |
    while read FILE
    do  iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
    done

This only checks files with non-ASCII characters.

> > Another approach for a fix would be 
> > to get FOP to produce the required warnings or errors more reliably.  I 
> > know it has a bunch of logging settings (ultimately via log4j), so there 
> > might be some possibilities.
> 
> When a character that cannot be displayed in PDF is found, a warning
> "Glyph ... not available in font ...." is output in fop's log. We can
> prevent such characters from being contained in PDF by checking
> the message as the attached patch 0001. However, this is checked after
> the pdf is generated since I could not find a way to terminate the
> generation immediately when such a character is detected.

So, are we sure this will be the message even for non-English users? I
thought checking for warning message text was too fragile.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Nov 19, 2024 at 11:29:07AM +0900, Yugo NAGATA wrote:
> On Mon, 18 Nov 2024 16:04:20 -0500
> > So, the failure of ligatures is usually caused by not using the right
> > Adobe Font Metric (AFM) file, I think.  I have seen faulty ligature
> > rendering in PDFs but was always able to fix it by using the right AFM
> > file.  Odds are, the failure is caused by using a standard Latin1 AFM file
> > and not the AFM file that matches the font being used.
> > 
> > > [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
> > > 
> > > However, it seems that using iconv to detect non-Latin1 characters may be still
> > > useful because these are likely not displayed in PDF. For example, we can do this
> > > in make check as the attached patch 0002. It cannot show the filename where one
> > > is found, though.
> > 
> > I was thinking something like:
> > 
> >     grep -l --recursive  -P '[\x80-\xFF]' . |
> >     while read FILE
> >     do  iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
> >     done
> > 
> > This only checks files with non-ASCII characters.
> 
> Checking for non-Latin1 only after finding non-ASCII characters seems like
> a good idea. I attached an updated patch (0002) that uses perl instead of
> grep because non-GNU grep may not support hex escape sequences.

Yes, good point.

> > So, are we sure this will be the message even for non-English users? I
> > thought checking for warning message text was too fragile.
> 
> I am not sure whether fop has non-English messages, although I've never
> seen it output Japanese messages.
> 
> I wonder if we can get uniform results by running it with LANG=C.
> The updated patch 0001 takes this approach.

Yes, good idea.

> +    @ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++;END {exit($$n>0)}' \

I am thinking we should have -f before -t because it is from/to.

I like this approach.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
I have looked into the patches.

> Subject: [PATCH v3 1/3] Disallow characters that cannot be displayed in PDF
> 
> ---
>  doc/src/sgml/Makefile | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
> index a04c532b53..18bf87d031 100644
> --- a/doc/src/sgml/Makefile
> +++ b/doc/src/sgml/Makefile
> @@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
>      $(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
>  
>  %.pdf: %.fo $(ALL_IMAGES)
> -    $(FOP) -fo $< -pdf $@
> +    CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \

Shouldn't "CLANG" be "LANG"?

> +    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
> +    (echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)

Currently "make postgres*.pdf" generates the pdf file even if there's
a "not available in font" error while generating it. With the patch
the pdf file is removed in this case. I'm not sure if this is an
improvement because there's no way to generate such a pdf file if
there's such a warning. Printing "Found characters that cannot be
displayed in PDF" is good, but I'd prefer to let users decide whether
they retain or remove the pdf file.

> Subject: [PATCH v3 3/3] Check whether iconv exists for detecting non-latin1
>  characters
> 
> ---
>  configure              | 65 ++++++++++++++++++++++++++++++++++++++----
>  configure.ac           |  1 +
>  doc/src/sgml/Makefile  |  6 +++-
>  src/Makefile.global.in |  1 +

You don't need to include the patch for configure. Committer will
generate configure when it gets committed. See the discussion:
https://www.postgresql.org/message-id/20241126.102906.1020285543012274306.ishii%40postgresql.org

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Nov 26, 2024 at 06:25:13PM +0900, Tatsuo Ishii wrote:
> I have looked into the patches.
> >  %.pdf: %.fo $(ALL_IMAGES)
> > -    $(FOP) -fo $< -pdf $@
> > +    CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
> 
> Shouldn't "CLANG" be "LANG"?

Yes, probably.

> > +    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
> > +    (echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
> 
> Currently "make postgres*.pdf" generates the pdf file even if there's
> a "not available in font" error while generating it. With the patch
> the pdf file is removed in this case. I'm not sure if this is an
> improvement because there's no way to generate such a pdf file if
> there's such a warning. Printing "Found characters that cannot be
> displayed in PDF" is good, but I'd prefer to let users decide whether
> they retain or remove the pdf file.

Looking at the patch:

     %.pdf: %.fo $(ALL_IMAGES)
    -       $(FOP) -fo $< -pdf $@
    +       CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
    +       awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
    +       (echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)

it returns an error if it sees a "not available in font" error, and
since src/Makefile.global has .DELETE_ON_ERROR, and this is included in
doc/src/sgml/Makefile, the file is deleted on the awk 'exit' error.

If there are invalid characters in the PDF, shouldn't the PDF be
considered invalid and removed from the build?  To allow such builds to
keep those PDF files, we would probably need to override
.DELETE_ON_ERROR, but it would have to be done in a way that an error
exit from FOP would still remove the PDF file.  I think we would have to
have FOP write to a temporary file, and then override
.DELETE_ON_ERROR just for the check for the "not available in
font" text in the temporary file.
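
A rough sketch of the "check a temporary file" step, assuming the
temporary file holds fop's log output (the paragraph above does not say
whether it would be the log or the PDF). The function name
warn_missing_glyphs is hypothetical. It only warns and always returns 0,
so it would never trigger .DELETE_ON_ERROR by itself. The exit status of
fop would have to be handled separately.

```shell
#!/bin/sh
# Hypothetical helper: given a saved fop log, echo it to stderr and add
# a warning if any glyph was reported missing. Always returns 0, so the
# check alone cannot fail the build.
warn_missing_glyphs() {
    log_file=$1
    cat "$log_file" >&2
    if grep -q 'not available in font' "$log_file"; then
        echo "Found characters that cannot be displayed in PDF" >&2
    fi
    return 0
}
```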

Do we want to add this complexity?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Do we want to add this complexity?

I don't think this patch is doing anything I want at all.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
>> I don't think this patch is doing anything I want at all.

> Gee, I kind of liked the patch, but maybe you didn't like the additional
> complexity to check the PDF output twice, once on input (complex) and
> once on output.  The attached patch only does the output check.

It's still not doing anything I want at all.  I'm with Tatsuo
on this: I do not want the makefiles deciding for me which
warnings are acceptable.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Nov 26, 2024 at 02:04:15PM -0500, Bruce Momjian wrote:
> On Tue, Nov 26, 2024 at 12:41:37PM -0500, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
> > >> I don't think this patch is doing anything I want at all.
> > 
> > > Gee, I kind of liked the patch, but maybe you didn't like the additional
> > > complexity to check the PDF output twice, once on input (complex) and
> > > once on output.  The attached patch only does the output check.
> > 
> > It's still not doing anything I want at all.  I'm with Tatsuo
> > on this: I do not want the makefiles deciding for me which
> > warnings are acceptable.
> 
> Okay, how about the attached patch that just prints the message at the
> bottom, with no error.  We could do this for all warnings, but I think
> there are some we expect.

Patch applied.  I added a mention of README.non-ASCII.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Nov  5, 2024 at 10:08:17AM +0100, Peter Eisentraut wrote:
> On 02.11.24 14:18, Bruce Momjian wrote:
> > On Sat, Nov  2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
> > > > Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> > > > LATIN1 characters we had with HTML entities, so there are none
> > > > currently.
> > > > 
> > > > I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> > > > so I added a cron job on my server to alert me when non-ASCII characters
> > > > appear.
> > > 
> > > So you convert LATIN1 characters to HTML entities so that it's easier
> > > to detect non-LATIN1 characters in the SGML docs? If my
> > > understanding is correct, it can be also achieved by using some tools
> > > like:
> > > 
> > > iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> > > 
> > > If there are some non-LATIN1 characters in release-17.sgml,
> > > it will complain like:
> > > 
> > > iconv: illegal input sequence at position 175
> > > 
> > > An advantage of this is that we don't need to convert each LATIN1
> > > character to an HTML entity, which makes the SGML file authors' lives a
> > > little bit easier.
> > 
> > I might have misread the feedback.  I know people didn't want a Makefile
> > rule to prevent it, but I thought converting the few UTF8 characters we had was
> > acceptable.  Let me think some more and come up with a patch.
> 
> The question of encoding characters as entities is orthogonal to the issue
> of only allowing Unicode characters that have a mapping to Latin 1.  This
> patch seems to confuse these two issues, and I don't think it actually fixed
> the second one, which is the one that was complained about.  I don't think
> anyone actually complained about the first one, which is the one that was
> actually patched.

Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Now that we have a warning about non-emittable characters in the PDF
> build, do you want me to put back the Latin1 characters in the SGML
> files or leave them as HTML entities?

I think going forward we're going to be putting in people's names
in UTF8 --- I was certainly planning to start doing that.  It doesn't
matter that much what we do with existing cases, though.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Dec  2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Now that we have a warning about non-emittable characters in the PDF
> > build, do you want me to put back the Latin1 characters in the SGML
> > files or leave them as HTML entities?
> 
> I think going forward we're going to be putting in people's names
> in UTF8 --- I was certainly planning to start doing that.  It doesn't

Yes, I expected that, and added an item to my release checklist to make
a PDF file and check for the warning.  I don't normally do that.

> matter that much what we do with existing cases, though.

Okay, I think Peter had an opinion but I wasn't sure what it was.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 03.12.24 04:13, Bruce Momjian wrote:
> On Mon, Dec  2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Now that we have a warning about non-emittable characters in the PDF
>>> build, do you want me to put back the Latin1 characters in the SGML
>>> files or leave them as HTML entities?
>>
>> I think going forward we're going to be putting in people's names
>> in UTF8 --- I was certainly planning to start doing that.  It doesn't
> 
> Yes, I expected that, and added an item to my release checklist to make
> a PDF file and check for the warning.  I don't normally do that.
> 
>> matter that much what we do with existing cases, though.
> 
> Okay, I think Peter had an opinion but I wasn't sure what it was.

I would prefer that the parts of commit 641a5b7a144 that replace 
non-ASCII characters with entities are reverted.




Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 26.11.24 20:04, Bruce Momjian wrote:
>   %.pdf: %.fo $(ALL_IMAGES)
> -    $(FOP) -fo $< -pdf $@
> +    LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
> +    awk 'BEGIN { warn = 0 }  { print }/not available in font/ { warn = 1 }  \
> +    END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2

Wouldn't that lose the exit code from the fop execution?
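
To illustrate the concern: in a plain POSIX shell without pipefail, the
exit status of a pipeline is that of its last command, so a failing fop
would be masked by an awk that succeeds. A minimal sketch, in which
fake_fop is a hypothetical stand-in for the real fop and the awk program
is reduced to a pass-through:

```shell
#!/bin/sh
# fake_fop is a hypothetical stand-in for fop: it prints a warning
# and fails with status 3.
fake_fop() {
    echo "glyph not available in font"
    return 3
}

# Same shape as the recipe above: the output is piped through awk.
# The status of the pipeline is awk's status, not fake_fop's.
fake_fop 2>&1 | awk '{ print }'
pipeline_status=$?

# Without the pipe, the failure is visible to the caller.
fake_fop >/dev/null 2>&1
direct_status=$?

echo "pipeline status: $pipeline_status"
echo "direct status: $direct_status"
```

Since make decides whether a recipe line failed from the status the shell
returns, a recipe of this shape would presumably let the build continue
even when fop itself had failed.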



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Dec  3, 2024 at 09:05:45PM +0100, Peter Eisentraut wrote:
> On 26.11.24 20:04, Bruce Momjian wrote:
> >   %.pdf: %.fo $(ALL_IMAGES)
> > -    $(FOP) -fo $< -pdf $@
> > +    LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
> > +    awk 'BEGIN { warn = 0 }  { print }/not available in font/ { warn = 1 }  \
> > +    END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2
> 
> Wouldn't that lose the exit code from the fop execution?

Yikes, I think it would.  Let me work on a fix now.
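
One portable way to keep the status, shown only as a sketch and not as
the fix that was actually applied, is to run the command on its own,
save $?, and filter the captured output afterwards. Here fake_fop and
the temporary log file are assumptions made for illustration:

```shell
#!/bin/sh
# fake_fop is a hypothetical stand-in for fop: it prints a warning
# and fails with status 3.
fake_fop() {
    echo "glyph not available in font"
    return 3
}

log=$(mktemp)

# Run the producer by itself so that $? is its status, not awk's.
fake_fop >"$log" 2>&1
fop_status=$?

# Post-process the captured output with the same kind of awk filter.
awk '{ print } /not available in font/ { warn = 1 }
     END { if (warn) print "\nFound characters that cannot be displayed in the PDF document" }' "$log" 1>&2
rm -f "$log"

echo "fop status: $fop_status"
# A real recipe would finish with: exit "$fop_status"
```

The trade-off is that the output is no longer streamed while the command
runs; it appears only after the command has finished.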

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.





Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Dec  3, 2024 at 09:03:37PM +0100, Peter Eisentraut wrote:
> On 03.12.24 04:13, Bruce Momjian wrote:
> > On Mon, Dec  2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > > > Now that we have a warning about non-emittable characters in the PDF
> > > > build, do you want me to put back the Latin1 characters in the SGML
> > > > files or leave them as HTML entities?
> > > 
> > > I think going forward we're going to be putting in people's names
> > > in UTF8 --- I was certainly planning to start doing that.  It doesn't
> > 
> > Yes, I expected that, and added an item to my release checklist to make
> > a PDF file and check for the warning.  I don't normally do that.
> > 
> > > matter that much what we do with existing cases, though.
> > 
> > Okay, I think Peter had an opinion but I wasn't sure what it was.
> 
> I would prefer that the parts of commit 641a5b7a144 that replace non-ASCII
> characters with entities are reverted.

Done.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.