Thread: Doc: typo in config.sgml

Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..08173ecb5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9380,7 +9380,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         If <varname>transaction_timeout</varname> is shorter or equal to
         <varname>idle_in_transaction_session_timeout</varname> or <varname>statement_timeout</varname>
-        then the longer timeout is ignored.
+        then the longer timeout is ignored.
        </para>

        <para>

Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> I think there's an unnecessary underscore in config.sgml.
>> Attached patch fixes it.
> 
> I could not apply the patch with an error.
> 
>  error: patch failed: doc/src/sgml/config.sgml:9380
>  error: doc/src/sgml/config.sgml: patch does not apply

Strange. I have no problem applying the patch here.

> I found your patch contains an odd character (ASCII Code 240?)
> by performing `od -c` command on the file. See the attached file.

Yes, 240 in octal (== 0xa0) is in the patch but it's because current
config.sgml includes the character. You can check it by looking at
line 9383 of config.sgml.

I think it was introduced by 28e858c0f95.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> >> I think there's an unnecessary underscore in config.sgml.
> >> Attached patch fixes it.
> > 
> > I could not apply the patch with an error.
> > 
> >  error: patch failed: doc/src/sgml/config.sgml:9380
> >  error: doc/src/sgml/config.sgml: patch does not apply
> 
> Strange. I have no problem applying the patch here.
> 
> > I found your patch contains an odd character (ASCII Code 240?)
> > by performing `od -c` command on the file. See the attached file.
> 
> Yes, 240 in octal (== 0xc2) is in the patch but it's because current
> config.sgml includes the character. You can check it by looking at
> line 9383 of config.sgml.

Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
although I still could not apply the patch. 

I think this is a non-breaking space (C2A0) in UTF-8. I guess my
terminal normally regards this as a space, so applying the patch fails.

I found it also in line 85 of ref/drop_extension.sgml.


> 
> I think it was introduced by 28e858c0f95.
> 
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>>> I think there's an unnecessary underscore in config.sgml.

I was wrong. The byte sequence just looked like an underscore
in my editor, but it is actually 0xc2a0, which must be a
"non-breaking space" encoded in UTF-8. I guess someone mistakenly
inserted a non-breaking space while editing config.sgml.

However the mistake does not affect the patch.
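
To illustrate the byte values discussed here, a minimal Python sketch (not part of the patch, variable names are only for illustration) that encodes U+00A0 as UTF-8 and prints the bytes in hex and in the octal notation that `od -c` uses:

```python
# Illustration only: how a non-breaking space (U+00A0) looks as UTF-8 bytes.
NBSP = "\u00a0"

encoded = NBSP.encode("utf-8")

# Hexadecimal form, matching the "0xc2a0" notation used in this thread.
hex_bytes = [format(b, "02x") for b in encoded]

# Octal form, matching what `od -c` prints for non-printable bytes.
octal_bytes = [format(b, "03o") for b in encoded]

print(hex_bytes)
print(octal_bytes)
```

The first byte is 0xc2 (octal 302) and the second is 0xa0 (octal 240), so an `od -c` dump shows the pair as `302 240`.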

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Mon, 30 Sep 2024 18:03:44 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> >>> I think there's an unnecessary underscore in config.sgml.
> 
> I was wrong. The particular byte sequences just looked an underscore
> on my editor but the byte sequence is actually 0xc2a0, which must be a
> "non breaking space" encoded in UTF-8. I guess someone mistakenly
> insert a non breaking space while editing config.sgml.
> 
> However the mistake does not affect the patch.

It looks like we've crisscrossed our mail.
Anyway, I agree with removing non breaking spaces, as well as
one found in line 85 of ref/drop_extension.sgml.

Regards,
Yugo Nagata

> 
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Yugo Nagata
Date:
On Mon, 30 Sep 2024 11:59:48 +0200
Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > 
> >>>> I think there's an unnecessary underscore in config.sgml.
> > 
> > I was wrong. The particular byte sequences just looked an underscore
> > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > insert a non breaking space while editing config.sgml.
> 
> I wonder if it would be worth to add a check for this like we have to tabs?
> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> (doing so made me realize we don't have an equivalent meson target).

Your patch couldn't detect 0xA0 in config.sgml on my machine, but it works
when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.

However, it also detects the following line in charset.sgml.
(https://www.postgresql.org/docs/current/collation.html)

 For example, locale und-u-kb sorts 'àe' before 'aé'.

This is not a non-breaking space, so it should not be detected as an error.

Regards,
Yugo Nagata

> --
> Daniel Gustafsson
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> I wonder if it would be worth to add a check for this like we have to tabs?

+1.

>> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
>> (doing so made me realize we don't have an equivalent meson target).
>
> Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works
> when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
>
> However, it also detects the following line in charset.sgml.
> (https://www.postgresql.org/docs/current/collation.html)
>
>  For example, locale und-u-kb sorts 'àe' before 'aé'.
>
> This is not non-breaking space, so should not be detected as an error.

That's because a non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes); 0xa0 is nbsp's code
point in Unicode, i.e. U+00A0.
So grep -P "[\xC2\xA0]" should work to detect nbsp.
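
The false positive on 'àe' can be reproduced with a small Python sketch. This uses Python's `re` module on raw bytes as a stand-in; grep's behavior depends on the locale and may differ, so treat it as an illustration of the byte-level issue only:

```python
import re

# The sample line from charset.sgml, encoded as UTF-8 bytes.
# U+00E0 (a with grave) encodes as C3 A0, U+00E9 (e with acute) as C3 A9.
line = "sorts '\u00e0e' before 'a\u00e9'".encode("utf-8")

# A line that really contains a non-breaking space (U+00A0 -> C2 A0).
nbsp_line = "longer\u00a0timeout".encode("utf-8")

lone_byte = re.compile(rb"\xa0")         # any single A0 byte
byte_class = re.compile(rb"[\xc2\xa0]")  # either C2 or A0, one byte
sequence = re.compile(rb"\xc2\xa0")      # C2 immediately followed by A0

# The lone byte and the bracketed class both hit the trailing byte of U+00E0.
lone_hit = lone_byte.search(line) is not None
class_hit = byte_class.search(line) is not None

# Only the two-byte sequence separates nbsp from other characters.
sequence_hit = sequence.search(line) is not None
sequence_hit_nbsp = sequence.search(nbsp_line) is not None

print(lone_hit, class_hit, sequence_hit, sequence_hit_nbsp)
```

In this byte-oriented setting only the unbracketed two-byte sequence distinguishes nbsp from characters such as 'à' that merely end in the byte 0xa0.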

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
>> That's because non-breaking space (nbsp) is not encoded as 0xa0 in
>> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
>> point in Unicode. i.e. U+00A0).
>> So grep -P "[\xC2\xA0]" should work to detect nbsp.
> 
> `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. 
> ([ and ] were not necessary.)
> 
> When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
> but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
> nbsp.
> 
> One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
> 
> On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
> 
> Maybe, better way is use perl itself rather than grep as following.
> 
>  `perl -ne '/\xC2\xA0/ and print' `
> 
> I attached a patch fixed in this way.

GNU sed can also be used without setting LC_ALL:

sed -n /"\xC2\xA0"/p

However I am not sure if non-GNU sed can do this too...
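
For comparison, the same check can be sketched in Python, which avoids depending on GNU-specific grep or sed features. This is only an illustration, not the approach taken in the patch; the function names `find_nbsp_lines` and `check_files` are hypothetical:

```python
from pathlib import Path

# UTF-8 encoding of U+00A0 (non-breaking space).
NBSP_UTF8 = b"\xc2\xa0"


def find_nbsp_lines(data):
    """Return 1-based line numbers of lines containing a UTF-8 nbsp."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        if NBSP_UTF8 in line
    ]


def check_files(paths):
    """Print each offending line location; return 1 if any was found, else 0."""
    status = 0
    for path in paths:
        for number in find_nbsp_lines(Path(path).read_bytes()):
            print(f"{path}:{number}: non-breaking space")
            status = 1
    return status
```

A wrapper script could call `sys.exit(check_files(sys.argv[1:]))` so that a make rule fails when any file contains the sequence. Working on bytes rather than decoded text keeps the result independent of the locale.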

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
> Tatsuo Ishii <ishii@postgresql.org> wrote:
> 
>> >> I think there's an unnecessary underscore in config.sgml.
>> >> Attached patch fixes it.
>> > 
>> > I could not apply the patch with an error.
>> > 
>> >  error: patch failed: doc/src/sgml/config.sgml:9380
>> >  error: doc/src/sgml/config.sgml: patch does not apply
>> 
>> Strange. I have no problem applying the patch here.
>> 
>> > I found your patch contains an odd character (ASCII Code 240?)
>> > by performing `od -c` command on the file. See the attached file.
>> 
>> Yes, 240 in octal (== 0xc2) is in the patch but it's because current
>> config.sgml includes the character. You can check it by looking at
>> line 9383 of config.sgml.
> 
> Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
> although I still could not apply the patch. 
> 
> I think this is non-breaking space of (C2A0) of utf-8. I guess my
> terminal normally regards this as a space, so applying patch fails.
> 
> I found it also in line 85 of ref/drop_extension.sgml.

Thanks. I have pushed the fix for ref/drop_extension.sgml along with
config.sgml.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > 
> >>>> I think there's an unnecessary underscore in config.sgml.
> > 
> > I was wrong. The particular byte sequences just looked an underscore
> > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > insert a non breaking space while editing config.sgml.
> 
> I wonder if it would be worth to add a check for this like we have to tabs?
> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> (doing so made me realize we don't have an equivalent meson target).

Can we check for any character outside the support range of SGML?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Tue, 1 Oct 2024 22:20:55 +0900
> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>> On Tue, 1 Oct 2024 15:16:52 +0900
>> Yugo NAGATA <nagata@sraoss.co.jp> wrote:
>> 
>> > On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
>> > Tatsuo Ishii <ishii@postgresql.org> wrote:
>> > 
>> > > >> That's because non-breaking space (nbsp) is not encoded as 0xa0 in
>> > > >> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
>> > > >> point in Unicode. i.e. U+00A0).
>> > > >> So grep -P "[\xC2\xA0]" should work to detect nbsp.
>> > > > 
>> > > > `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. 
>> > > > ([ and ] were not necessary.)
>> > > > 
>> > > > When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
>> > > > but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
>> > > > nbsp.
>> > > > 
>> > > > One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
>> > > > 
>> > > > On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
>> > > > 
>> > > > Maybe, better way is use perl itself rather than grep as following.
>> > > > 
>> > > >  `perl -ne '/\xC2\xA0/ and print' `
>> > > > 
>> > > > I attached a patch fixed in this way.
>> > > 
>> > > GNU sed can also be used without setting LC_ALL:
>> > > 
>> > > sed -n /"\xC2\xA0"/p
>> > > 
>> > > However I am not sure if non-GNU sed can do this too...
>> > 
>> > Although I've not check it myself, BSD sed doesn't support \x escape according to [1].
>> > 
>> > [1] https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference
>> > 
>> > By the way, I've attached a patch a bit modified to use the plural form statement
>> > as same as check-tabs.
>> > 
>> >  Non-breaking **spaces** appear in SGML/XML files
>> 
>> The previous patch was broken because the perl command failed to return the correct result.
>> I've attached an updated patch to fix the return value. In passing, I added line breaks
>> for long lines.
> 
> I've attached a updated patch. 
> I added the comment to explain why Perl is used instead of grep or sed.

Looks good to me. If there's no objection, I will commit this to
master branch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Daniel Gustafsson
Date:
> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> On Tue, 1 Oct 2024 22:20:55 +0900
>> Yugo Nagata <nagata@sraoss.co.jp> wrote:

>> I've attached a updated patch. 
>> I added the comment to explain why Perl is used instead of grep or sed.
> 
> Looks good to me. If there's no objection, I will commit this to
> master branch.

No objections, LGTM.

--
Daniel Gustafsson




Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
Hi Daniel, Yugo,

>> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> On Tue, 1 Oct 2024 22:20:55 +0900
>>> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>>> I've attached a updated patch. 
>>> I added the comment to explain why Perl is used instead of grep or sed.
>> 
>> Looks good to me. If there's no objection, I will commit this to
>> master branch.
> 
> No objections, LGTM.

Thank you for the patch and review! I have pushed the patch.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b7da5c261d1af1a5d6a275e1090b07de3654033

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo Nagata
Date:
On Mon, 7 Oct 2024 15:45:54 -0400
Bruce Momjian <bruce@momjian.us> wrote:

> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> > > 
> > >>>> I think there's an unnecessary underscore in config.sgml.
> > > 
> > > I was wrong. The particular byte sequences just looked an underscore
> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> > > insert a non breaking space while editing config.sgml.
> > 
> > I wonder if it would be worth to add a check for this like we have to tabs?
> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> > (doing so made me realize we don't have an equivalent meson target).
> 
> Can we check for any character outside the support range of SGML?

How can we define the range of characters allowed in SGML?

We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
but they are used in some places in charset.sgml and some names in release-*.sgml.

Regards,
Yugo Nagata

> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
> 
>   When a patient asks the doctor, "Am I going to die?", he means 
>   "Am I going to die soon?"
> 
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> On Mon, 7 Oct 2024 15:45:54 -0400
> Bruce Momjian <bruce@momjian.us> wrote:
> 
>> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
>> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> > > 
>> > >>>> I think there's an unnecessary underscore in config.sgml.
>> > > 
>> > > I was wrong. The particular byte sequences just looked an underscore
>> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
>> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
>> > > insert a non breaking space while editing config.sgml.
>> > 
>> > I wonder if it would be worth to add a check for this like we have to tabs?
>> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
>> > (doing so made me realize we don't have an equivalent meson target).
>> 
>> Can we check for any character outside the support range of SGML?
> 
> What we can define the range of allowed characters range in SGML?
> 
> We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
> but they are used in some places in charset.sgml and some names in release-*.sgml.

I failed to find any standard regarding what characters are allowed in
SGML/XML. Assuming that any valid Unicode characters are allowed in
our *sgml files, I am afraid the best we can do is grepping non-ASCII
characters against the files and checking the results by a visual
inspection. Besides nbsp, there are tons of confusing Unicode
characters out there. For example there are many "hyphen like
characters".

https://www.compart.com/en/unicode/category/Pd

If one of them is used in the sgml files, it may be possible that it
was accidentally inserted.
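
As a rough illustration of such an inspection aid, the following Python sketch lists characters in the Unicode "Pd" (dash punctuation) category other than the ASCII hyphen-minus. The function name `suspicious_dashes` is hypothetical, and the sketch covers only this one category, not every confusable character:

```python
import unicodedata


def suspicious_dashes(text):
    """Return (index, code point, name) for each non-ASCII dash in text.

    Only the Unicode general category "Pd" is checked; look-alikes in other
    categories are not reported.
    """
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Pd" and ch != "-"
    ]


# The ASCII hyphen at index 1 is accepted; the EN DASH at index 4 is reported.
print(suspicious_dashes("a-b \u2013 c"))
```

The output still needs visual inspection, since a reported character may have been used on purpose.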

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct  9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote:
> >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
> >> > > 
> >> > >>>> I think there's an unnecessary underscore in config.sgml.
> >> > > 
> >> > > I was wrong. The particular byte sequences just looked an underscore
> >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
> >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> >> > > insert a non breaking space while editing config.sgml.
> >> > 
> >> > I wonder if it would be worth to add a check for this like we have to tabs?
> >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
> >> > (doing so made me realize we don't have an equivalent meson target).
> >> 
> >> Can we check for any character outside the support range of SGML?
> > 
> > What we can define the range of allowed characters range in SGML?
> > 
> > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
> > but they are used in some places in charset.sgml and some names in release-*.sgml.
> 
> I failed to find any standard regarding what characters are allowed in
> SGML/XML. Assuming that any valid Unicode characters are allowed in
> our *sgml files, I am afraid the best we can do is grepping non-ASCII
> characters against the files and checking the results by a visual
> inspection. Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".
> 
> https://www.compart.com/en/unicode/category/Pd
> 
> If one of them is used in the sgml files, it may be possible that it
> was accidentally inserted.

Can we use Unicode in the SGML files?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Can we use Unicode in the SGML files?

I believe we've been doing it for contributors' names that require
non-ASCII letters, but not in any other places.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> Bruce Momjian <bruce@momjian.us> writes:
>> Can we use Unicode in the SGML files?
> 
> I believe we've been doing it for contributors' names that require
> non-ASCII letters, but not in any other places.

We have non-ASCII letters in charset.sgml too, to show some examples
of collation.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Daniel Gustafsson
Date:
> On 9 Oct 2024, at 04:49, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".

Using characters which look alike is known in the field of internet security as
homograph attacks [0], where for example a URL visually passes for postgresql.org
but in fact leads to an attacker.  That sort of attack clearly doesn't apply to
our docs though.  However, what might cause similar problems is if we use a
Unicode character in example code which the reader could be expected to
copy/paste into psql and run, which then (at best) causes a syntax error.  We
could probably build tooling to catch this (most likely not too hard in XSLT)
but the ROI for that might be unfavourable.  Even with tooling, committer
caution is needed to ensure we don't publish examples that might cause
unintended side effects when executed by copy/paste.

What separates nbsp is that it may affect the rendering in an un-intuitive way
by forcing two words to not break even if the viewport is too narrow to fit.
Catching such characters seems worthwhile since it's also quite doable with a
trivial grep.

--
Daniel Gustafsson

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack


Re: Doc: typo in config.sgml

From
Tatsuo Ishii
Date:
> We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> that contains lines which are allowed to have non-ascii characters,
> although this list will need to be maintained when lines in it are modified.
> I've attached a patch to add a simple Perl script to do this.

I doubt it really works. For example, nbsp can be used for formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to use or not to use nbsp, the "allowlist" needs to be
maintained. It's too annoying.

I think it's better to add the non-ASCII character check to the
committing checklist and let committers check for non-ASCII characters in
the patch. Non-ASCII characters are rarely used, so it would not become a
burden.
https://wiki.postgresql.org/wiki/Committing_checklist

Maybe we can add to the wiki page something like this?

git diff origin/master | grep -P '[^\x00-\x7f]'
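
A Python variant of the same idea, as a sketch only: unlike the grep pipeline above, which reports every diff line containing a non-ASCII character, this version narrows the check to added lines. The function name `non_ascii_added_lines` is hypothetical:

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def non_ascii_added_lines(diff_text):
    """Return added lines of a unified diff that contain non-ASCII characters."""
    hits = []
    for line in diff_text.split("\n"):
        # Lines starting with "+" are additions; "+++" is the file header.
        if line.startswith("+") and not line.startswith("+++"):
            if NON_ASCII.search(line):
                hits.append(line)
    return hits
```

The diff text could come from `git diff origin/master`, decoded as UTF-8 before being passed in.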

> During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> characters. I don't know these characters are really necessary though, since
> I don't understand this file well.

They are U+201C (double turned comma quotation mark) and U+201D
(double comma quotation mark).

       <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>

I would like to know why they are necessary too.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Doc: typo in config.sgml

From
Yugo NAGATA
Date:
On Fri, 11 Oct 2024 12:16:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> > We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> > that contains lines which are allowed to have non-ascii characters,
> > although this list will need to be maintained when lines in it are modified.
> > I've attached a patch to add a simple Perl script to do this.
> 
> I doubt it really works. For example, nbsp can be used formatting
> (that's the purpose of the character in the first place). Whenever a
> developer decides to or not to use nbsp, "allowlist" needs to be
> maintained. It's too annoying.

I suppose non-ascii characters including nbsp are basically disallowed,
so the allowlist will not increase unless there is some special reason.

However, it is true that there might be a cost for maintaining the list
more or less, so if people don't think it is worth adding this check, 
I will withdraw this proposal.

> I think it's better to add the non-ASCII character checking to the
> comitting check list and let committers check non-ASCII character in
> the patch. Non-ASCII characters rarely used and it would not become a
> burden.
> https://wiki.postgresql.org/wiki/Committing_checklist
> 
> Maybe we can add to the wiki page something like this?
> 
> git diff origin/master | grep -P '[^\x00-\x7f]'
> 
> > During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> > characters. I don't know these characters are really necessary though, since
> > I don't understand this file well.
> 
> They are U+201C (double turned comma quotation mark) and U+201D
> (double comma quotation mark).
> 
>        <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
> 
> I would like to know why they are necessary too.

+1

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
> I did some more research and we able to clarify our behavior in
> release.sgml:

I have specified some more details in my patched version:

        We can only use Latin1 characters, not all UTF8 characters,
        because some rendering engines do not support non-Latin1 UTF8
        characters.  Specifically, the HTML rendering engine can display
        all UTF8 characters, but the PDF rendering engine can only display
        Latin1 characters.  In PDF files, non-Latin1 UTF8 characters are
        displayed as "###".

        In the SGML files we encode non-ASCII Latin1 characters as HTML
        entities, e.g., &Aacute;lvaro.  Oddly, it is possible to safely
        represent Latin1 characters in SGML files as UTF8 for HTML and
        PDF output, but we currently disallow this via the Makefile
        "check-non-ascii" rule.
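
The distinction drawn above between ASCII, Latin1 and other characters can be sketched in Python. This is only an illustration of the rule as described, not code from the patch; the helper names `classify` and `to_entity` are hypothetical, and `to_entity` falls back to a numeric character reference when the standard library's table has no named entity:

```python
from html.entities import codepoint2name


def classify(ch):
    """Classify one character as 'ascii', 'latin1', or 'other'."""
    code_point = ord(ch)
    if code_point < 0x80:
        return "ascii"
    if code_point <= 0xFF:
        return "latin1"
    return "other"


def to_entity(ch):
    """Return a named HTML entity if one is known, else a numeric reference."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return f"&{name};"
    return f"&#x{ord(ch):X};"


# U+00C1 is Latin1 and has a named entity; U+0431 (Cyrillic) is outside Latin1.
print(classify("\u00c1"), to_entity("\u00c1"))
print(classify("\u0431"))
```

Characters classified as "other" are the ones described above as rendering as "###" in the PDF output.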

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 18:54, Bruce Momjian wrote:
>> I agree with encoding non-Latin1 characters and disallowing non-ASCII
>> characters totally.
>>
>> I found your patch includes fixes in *.svg files, so how about checking
>> also them by check-non-ascii? Also, I think it is better to use perl instead
>> of grep because non-GNU grep doesn't support hex escape sequences. I've attached
>> a updated patch for Makefile. The changes in release.sgml above is not applied
>> yet, though.
> Yes, good idea on using Perl and checking svg files --- I have used your
> Makefile rule.
> 
> Attached is an updated patch.  I realized that the new rules apply to
> all SGML files, not just the release notes, so I have created
> README.non-ASCII and moved the description there.

I don't understand the point of this.  Maybe it's okay to try to detect 
certain "hidden" whitespace characters, like in the case that started 
this thread.  But I don't see the value in prohibiting all non-ASCII 
characters, as is being proposed here.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Oct 15, 2024 at 10:34:16PM +0200, Peter Eisentraut wrote:
> On 15.10.24 18:54, Bruce Momjian wrote:
> > > I agree with encoding non-Latin1 characters and disallowing non-ASCII
> > > characters totally.
> > > 
> > > I found your patch includes fixes in *.svg files, so how about checking
> > > also them by check-non-ascii? Also, I think it is better to use perl instead
> > > of grep because non-GNU grep doesn't support hex escape sequences. I've attached
> > > a updated patch for Makefile. The changes in release.sgml above is not applied
> > > yet, though.
> > Yes, good idea on using Perl and checking svg files --- I have used your
> > Makefile rule.
> > 
> > Attached is an updated patch.  I realized that the new rules apply to
> > all SGML files, not just the release notes, so I have created
> > README.non-ASCII and moved the description there.
> 
> I don't understand the point of this.  Maybe it's okay to try to detect
> certain "hidden" whitespace characters, like in the case that started this
> thread.  But I don't see the value in prohibiting all non-ASCII characters,
> as is being proposed here.

Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
if desired.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 22:37, Bruce Momjian wrote:
>> I don't understand the point of this.  Maybe it's okay to try to detect
>> certain "hidden" whitespace characters, like in the case that started this
>> thread.  But I don't see the value in prohibiting all non-ASCII characters,
>> as is being proposed here.
> Well, we can only use Latin-1, so the idea is that we will be explicit
> about specifying Latin-1 only as HTML entities, rather than letting
> non-Latin-1 creep in as UTF8.

But your patch prohibits even otherwise allowed Latin-1 characters.

I don't see why we need to enforce this at this level.  Whatever 
downstream toolchain has requirements about which characters are allowed 
will complain if it encounters a character it doesn't like.




Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Well, we can only use Latin-1, so the idea is that we will be explicit
> about specifying Latin-1 only as HTML entities, rather than letting
> non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> if desired.

That policy would cause substantial problems with contributor names
in the release notes.  I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.

            regards, tom lane



Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Well, we can only use Latin-1, so the idea is that we will be explicit
> > about specifying Latin-1 only as HTML entities, rather than letting
> > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> > if desired.
> 
> That policy would cause substantial problems with contributor names
> in the release notes.  I agree with Peter that we don't need this.
> Catching otherwise-invisible characters seems sufficient.

Uh, why can't we use HTML entities going forward?  Is that harder?  Can
we just exclude the release notes from this check?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Doc: typo in config.sgml

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> That policy would cause substantial problems with contributor names
>> in the release notes.  I agree with Peter that we don't need this.
>> Catching otherwise-invisible characters seems sufficient.

> Uh, why can't we use HTML entities going forward?  Is that harder?

Yes: it requires looking up the entities.  The mail you are probably
consulting to make a release note or commit message is most likely
just going to contain the person's name as normally spelled.

Plus (as you pointed out earlier today) there aren't HTML entities for
all characters.

> Can we just exclude the release notes from this check?

What is the point of a check we can only enforce against part of the
documentation?

            regards, tom lane



Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 23:51, Bruce Momjian wrote:
>> I don't see why we need to enforce this at this level.  Whatever downstream
>> toolchain has requirements about which characters are allowed will complain
>> if it encounters a character it doesn't like.
> 
> Uh, the PDF build does not complain if you pass it non-Latin-1 UTF8
> characters.  To test this I added some Russian characters (non-Latin-1)
> to release.sgml:
> 
>     (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
>     ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
>     ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
>     letters or "signs" (⟨ъ⟩, ⟨ь⟩)
> 
> and I ran 'make postgres-US.pdf', and then removed the Russian
> characters and ran the same command again.  The output, including stderr
> was identical.  The PDFs, of course, were not, with the Russian
> characters showing as "####".  Makefile output attached.

Hmm, mine complains:

/opt/homebrew/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Glyph "⟨" (0x27e8) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "б" (0x431, afii10066) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "⟩" (0x27e9) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "в" (0x432, afii10067) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "г" (0x433, afii10068) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "д" (0x434, afii10069) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "ж" (0x436, afii10072) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "з" (0x437, afii10073) not available in font "Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO.
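
As a rough sketch of how characters like these could be found in the source before the PDF build runs, the snippet below lists every character outside Latin-1 (code points above U+00FF). The function name `non_latin1` is made up for this example, and the snippet is only an assumption about what such a check might look like, not something that exists in the documentation Makefile.

```python
def non_latin1(text):
    """Return (line, column, codepoint) for each character above U+00FF."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0xFF:  # Latin-1 covers exactly U+0000..U+00FF
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

# "é" (U+00E9) is inside Latin-1 and passes; the brackets and the
# Cyrillic letter are reported.
sample = "Jos\u00e9 \u27e8\u0431\u27e9"
print(non_latin1(sample))
# prints [(1, 6, 'U+27E8'), (1, 7, 'U+0431'), (1, 8, 'U+27E9')]
```

The reported code points correspond to the 0x27e8, 0x431 and 0x27e9 glyph warnings in the fop output above.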




Re: Doc: typo in config.sgml

From
Peter Eisentraut
Date:
On 15.10.24 23:51, Bruce Momjian wrote:
> On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Well, we can only use Latin-1, so the idea is that we will be explicit
>>> about specifying Latin-1 only as HTML entities, rather than letting
>>> non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
>>> if desired.
>>
>> That policy would cause substantial problems with contributor names
>> in the release notes.  I agree with Peter that we don't need this.
>> Catching otherwise-invisible characters seems sufficient.
> 
> Uh, why can't we use HTML entities going forward?  Is that harder?

I think the question should be the other way around.  The entities are a 
historical workaround for when encoding support and rendering support 
was poor.  Now you can just type in the characters you want as is, which 
seems nicer.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> On 15.10.24 23:51, Bruce Momjian wrote:
> > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > > > Well, we can only use Latin-1, so the idea is that we will be explicit
> > > > about specifying Latin-1 only as HTML entities, rather than letting
> > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML files
> > > > if desired.
> > > 
> > > That policy would cause substantial problems with contributor names
> > > in the release notes.  I agree with Peter that we don't need this.
> > > Catching otherwise-invisible characters seems sufficient.
> > 
> > Uh, why can't we use HTML entities going forward?  Is that harder?
> 
> I think the question should be the other way around.  The entities are a
> historical workaround for when encoding support and rendering support was
> poor.  Now you can just type in the characters you want as is, which seems
> nicer.

Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.




Re: Doc: typo in config.sgml

From
Bruce Momjian
Date:
On Wed, Oct 16, 2024 at 09:58:23AM +0200, Peter Eisentraut wrote:
> On 15.10.24 23:51, Bruce Momjian wrote:
> > > I don't see why we need to enforce this at this level.  Whatever downstream
> > > toolchain has requirements about which characters are allowed will complain
> > > if it encounters a character it doesn't like.
> > 
> > Uh, the PDF build does not complain if you pass it non-Latin-1 UTF8
> > characters.  To test this I added some Russian characters (non-Latin-1)
> > to release.sgml:
> > 
> >     (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
> >     ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
> >     ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
> >     letters or "signs" (⟨ъ⟩, ⟨ь⟩)
> > 
> > and I ran 'make postgres-US.pdf', and then removed the Russian
> > characters and ran the same command again.  The output, including stderr
> > was identical.  The PDFs, of course, were not, with the Russian
> > characters showing as "####".  Makefile output attached.
> 
> Hmm, mine complains:

My Debian 12 toolchain must be older.
