Thread: Doc: typo in config.sgml
I think there's an unnecessary underscore in config.sgml. Attached patch fixes it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..08173ecb5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9380,7 +9380,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 <para>
 If <varname>transaction_timeout</varname> is shorter or equal to
 <varname>idle_in_transaction_session_timeout</varname> or <varname>statement_timeout</varname>
- then the longer timeout is ignored.
+ then the longer timeout is ignored.
 </para>
 <para>
>> I think there's an unnecessary underscore in config.sgml. >> Attached patch fixes it. > > I could not apply the patch with an error. > > error: patch failed: doc/src/sgml/config.sgml:9380 > error: doc/src/sgml/config.sgml: patch does not apply Strange. I have no problem applying the patch here. > I found your patch contains an odd character (ASCII Code 240?) > by performing `od -c` command on the file. See the attached file. Yes, 240 in octal (== 0xa0) is in the patch but it's because current config.sgml includes the character. You can check it by looking at line 9383 of config.sgml. I think it was introduced by 28e858c0f95. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 17:23:24 +0900 (JST) Tatsuo Ishii <ishii@postgresql.org> wrote: > >> I think there's an unnecessary underscore in config.sgml. > >> Attached patch fixes it. > > > > I could not apply the patch with an error. > > > > error: patch failed: doc/src/sgml/config.sgml:9380 > > error: doc/src/sgml/config.sgml: patch does not apply > > Strange. I have no problem applying the patch here. > > > I found your patch contains an odd character (ASCII Code 240?) > > by performing `od -c` command on the file. See the attached file. > > Yes, 240 in octal (== 0xa0) is in the patch but it's because current > config.sgml includes the character. You can check it by looking at > line 9383 of config.sgml. Yes, you are right, I can find the 0xc2 char in config.sgml using od -c, although I still could not apply the patch. I think this is a non-breaking space (C2A0) in UTF-8. I guess my terminal normally regards this as a space, so applying the patch fails. I also found it in line 85 of ref/drop_extension.sgml. > > I think it was introduced by 28e858c0f95. > > Best reagards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp -- Yugo NAGATA <nagata@sraoss.co.jp>
>>> I think there's an unnecessary underscore in config.sgml. I was wrong. The particular byte sequence just looked like an underscore in my editor, but the byte sequence is actually 0xc2a0, which must be a "non breaking space" encoded in UTF-8. I guess someone mistakenly inserted a non breaking space while editing config.sgml. However the mistake does not affect the patch. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 18:03:44 +0900 (JST) Tatsuo Ishii <ishii@postgresql.org> wrote: > >>> I think there's an unnecessary underscore in config.sgml. > > I was wrong. The particular byte sequences just looked an underscore > on my editor but the byte sequence is actually 0xc2a0, which must be a > "non breaking space" encoded in UTF-8. I guess someone mistakenly > insert a non breaking space while editing config.sgml. > > However the mistake does not affect the patch. It looks like we've crisscrossed our mail. Anyway, I agree with removing non breaking spaces, as well as one found in line 85 of ref/drop_extension.sgml. Regards, Yugo Nagata > > Best reagards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp -- Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, 30 Sep 2024 11:59:48 +0200 Daniel Gustafsson <daniel@yesql.se> wrote: > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > I was wrong. The particular byte sequences just looked an underscore > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > insert a non breaking space while editing config.sgml. > > I wonder if it would be worth to add a check for this like we have to tabs? > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > (doing so made me realize we don't have an equivalent meson target). Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`. However, it also detects the following line in charset.sgml. (https://www.postgresql.org/docs/current/collation.html) For example, locale und-u-kb sorts 'àe' before 'aé'. This is not non-breaking space, so should not be detected as an error. Regards, Yugo Nagata > -- > Daniel Gustafsson > -- Yugo Nagata <nagata@sraoss.co.jp>
>> I wonder if it would be worth to add a check for this like we have to tabs? +1. >> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp >> (doing so made me realize we don't have an equivalent meson target). > > Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works > when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`. > > However, it also detects the following line in charset.sgml. > (https://www.postgresql.org/docs/current/collation.html) > > For example, locale und-u-kb sorts 'àe' before 'aé'. > > This is not non-breaking space, so should not be detected as an error. That's because a non-breaking space (nbsp) is not encoded as 0xa0 in UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes); 0xa0 is nbsp's code point in Unicode, i.e. U+00A0. So grep -P "[\xC2\xA0]" should work to detect nbsp. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
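As a small illustration of the byte-level point above (Python is used here only for illustration; the thread itself works with od, grep and perl), the sketch below encodes U+00A0 and the 'à' from the charset.sgml example and shows why a search for the lone byte 0xA0 also hits 'à'. The helper name octal_dump is made up for this example.

```python
# Illustrative only: how U+00A0 (no-break space) and U+00E0 ('à') look in UTF-8.
NBSP = "\u00a0"
A_GRAVE = "\u00e0"  # the letter used in the charset.sgml collation example

nbsp_bytes = NBSP.encode("utf-8")        # two bytes: 0xC2 0xA0
a_grave_bytes = A_GRAVE.encode("utf-8")  # two bytes: 0xC3 0xA0


def octal_dump(data: bytes) -> str:
    """Render each byte as three octal digits (hypothetical helper, od-like)."""
    return " ".join(f"{b:03o}" for b in data)


print(nbsp_bytes.hex(" "), "->", octal_dump(nbsp_bytes))
print(a_grave_bytes.hex(" "), "->", octal_dump(a_grave_bytes))

# The lone byte 0xA0 occurs in both characters; the pair C2 A0 only in nbsp.
line = "sorts '\u00e0e' before 'a\u00e9'".encode("utf-8")
print(b"\xa0" in line, b"\xc2\xa0" in line)
```

Both characters end in the byte 0xA0 (octal 240), which is why matching that byte alone flags the collation example as well; only the first byte (0xC2 versus 0xC3) tells them apart.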
>> That's because non-breaking space (nbsp) is not encoded as 0xa0 in >> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code >> point in Unicode. i.e. U+00A0). >> So grep -P "[\xC2\xA0]" should work to detect nbsp. > > `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. > ([ and ] were not necessary.) > > When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml, > but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting > nbsp. > > One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it. > > On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash. > > Maybe, better way is use perl itself rather than grep as following. > > `perl -ne '/\xC2\xA0/ and print' ` > > I attached a patch fixed in this way. GNU sed can also be used without setting LC_ALL: sed -n /"\xC2\xA0"/p However, I am not sure whether non-GNU sed can do this too... Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
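The check being discussed (match the two-byte sequence C2 A0 on raw bytes, independent of locale, and report failure when anything is found) can be sketched as follows. This is a Python analogue for illustration only, not the perl rule from the patch; the function names and the in-memory "files" are assumptions made up for the demo.

```python
NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 NO-BREAK SPACE


def find_nbsp(data: bytes) -> list[int]:
    """Return 1-based numbers of the lines whose raw bytes contain C2 A0."""
    return [
        lineno
        for lineno, raw in enumerate(data.split(b"\n"), start=1)
        if NBSP_UTF8 in raw
    ]


def exit_status(files: dict[str, bytes]) -> int:
    """Print name:line for every hit; return 1 if any file had a hit, else 0."""
    status = 0
    for name, data in files.items():
        for lineno in find_nbsp(data):
            print(f"{name}:{lineno}: non-breaking space")
            status = 1
    return status


# In-memory stand-ins for files; the names are made up for this demo.
demo = {
    "clean.sgml": "sorts '\u00e0e' before 'a\u00e9'\n".encode("utf-8"),
    "dirty.sgml": "<para>\n\u00a0then the longer timeout is ignored.\n".encode("utf-8"),
}
print(exit_status(demo))
```

Working on bytes rather than decoded text sidesteps the LC_ALL question raised above, and in valid UTF-8 the pair C2 A0 can only be an nbsp, so accented letters such as 'à' (C3 A0) are not reported.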
> On Mon, 30 Sep 2024 17:23:24 +0900 (JST) > Tatsuo Ishii <ishii@postgresql.org> wrote: > >> >> I think there's an unnecessary underscore in config.sgml. >> >> Attached patch fixes it. >> > >> > I could not apply the patch with an error. >> > >> > error: patch failed: doc/src/sgml/config.sgml:9380 >> > error: doc/src/sgml/config.sgml: patch does not apply >> >> Strange. I have no problem applying the patch here. >> >> > I found your patch contains an odd character (ASCII Code 240?) >> > by performing `od -c` command on the file. See the attached file. >> >> Yes, 240 in octal (== 0xa0) is in the patch but it's because current >> config.sgml includes the character. You can check it by looking at >> line 9383 of config.sgml. > > Yes, you are right, I can find the 0xc2 char in config.sgml using od -c, > although I still could not apply the patch. > > I think this is non-breaking space of (C2A0) of utf-8. I guess my > terminal normally regards this as a space, so applying patch fails. > > I found it also in line 85 of ref/drop_extension.sgml. Thanks. I have pushed the fix for ref/drop_extension.sgml along with config.sgml. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > I was wrong. The particular byte sequences just looked an underscore > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > insert a non breaking space while editing config.sgml. > > I wonder if it would be worth to add a check for this like we have to tabs? > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > (doing so made me realize we don't have an equivalent meson target). Can we check for any character outside the support range of SGML? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
> On Tue, 1 Oct 2024 22:20:55 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > >> On Tue, 1 Oct 2024 15:16:52 +0900 >> Yugo NAGATA <nagata@sraoss.co.jp> wrote: >> >> > On Tue, 01 Oct 2024 10:33:50 +0900 (JST) >> > Tatsuo Ishii <ishii@postgresql.org> wrote: >> > >> > > >> That's because non-breaking space (nbsp) is not encoded as 0xa0 in >> > > >> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code >> > > >> point in Unicode. i.e. U+00A0). >> > > >> So grep -P "[\xC2\xA0]" should work to detect nbsp. >> > > > >> > > > `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. >> > > > ([ and ] were not necessary.) >> > > > >> > > > When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml, >> > > > but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting >> > > > nbsp. >> > > > >> > > > One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it. >> > > > >> > > > On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash. >> > > > >> > > > Maybe, better way is use perl itself rather than grep as following. >> > > > >> > > > `perl -ne '/\xC2\xA0/ and print' ` >> > > > >> > > > I attached a patch fixed in this way. >> > > >> > > GNU sed can also be used without setting LC_ALL: >> > > >> > > sed -n /"\xC2\xA0"/p >> > > >> > > However I am not sure if non-GNU sed can do this too... >> > >> > Although I've not check it myself, BSD sed doesn't support \x escape according to [1]. >> > >> > [1] https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference >> > >> > By the way, I've attached a patch a bit modified to use the plural form statement >> > as same as check-tabs. >> > >> > Non-breaking **spaces** appear in SGML/XML files >> >> The previous patch was broken because the perl command failed to return the correct result. 
>> I've attached an updated patch to fix the return value. In passing, I added line breaks >> for long lines. > > I've attached a updated patch. > I added the comment to explain why Perl is used instead of grep or sed. Looks good to me. If there's no objection, I will commit this to the master branch. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote: >> On Tue, 1 Oct 2024 22:20:55 +0900 >> Yugo Nagata <nagata@sraoss.co.jp> wrote: >> I've attached a updated patch. >> I added the comment to explain why Perl is used instead of grep or sed. > > Looks good to me. If there's no objection, I will commit this to > master branch. No objections, LGTM. -- Daniel Gustafsson
Hi Daniel, Yugo, >> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote: >>> On Tue, 1 Oct 2024 22:20:55 +0900 >>> Yugo Nagata <nagata@sraoss.co.jp> wrote: > >>> I've attached a updated patch. >>> I added the comment to explain why Perl is used instead of grep or sed. >> >> Looks good to me. If there's no objection, I will commit this to >> master branch. > > No objections, LGTM. Thank you for the patch and review! I have pushed the patch. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b7da5c261d1af1a5d6a275e1090b07de3654033 Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, 7 Oct 2024 15:45:54 -0400 Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > > > I was wrong. The particular byte sequences just looked an underscore > > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > > insert a non breaking space while editing config.sgml. > > > > I wonder if it would be worth to add a check for this like we have to tabs? > > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > > (doing so made me realize we don't have an equivalent meson target). > > Can we check for any character outside the support range of SGML? How can we define the range of allowed characters in SGML? We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, but such characters are used in some places in charset.sgml and in some names in release-*.sgml. Regards, Yugo Nagata > > -- > Bruce Momjian <bruce@momjian.us> https://momjian.us > EDB https://enterprisedb.com > > When a patient asks the doctor, "Am I going to die?", he means > "Am I going to die soon?" > > -- Yugo Nagata <nagata@sraoss.co.jp>
> On Mon, 7 Oct 2024 15:45:54 -0400 > Bruce Momjian <bruce@momjian.us> wrote: > >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: >> > > >> > >>>> I think there's an unnecessary underscore in config.sgml. >> > > >> > > I was wrong. The particular byte sequences just looked an underscore >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly >> > > insert a non breaking space while editing config.sgml. >> > >> > I wonder if it would be worth to add a check for this like we have to tabs? >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp >> > (doing so made me realize we don't have an equivalent meson target). >> >> Can we check for any character outside the support range of SGML? > > What we can define the range of allowed characters range in SGML? > > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, > but they are used in some places in charset.sgml and some names in release-*.sgml. I failed to find any standard regarding what characters are allowed in SGML/XML. Assuming that any valid Unicode characters are allowed in our *sgml files, I am afraid the best we can do is to grep the files for non-ASCII characters and check the results by visual inspection. Besides nbsp, there are tons of confusing Unicode characters out there. For example there are many "hyphen like characters". https://www.compart.com/en/unicode/category/Pd If one of them is used in the sgml files, it may be possible that it was accidentally inserted. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
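To make the "grep, then inspect visually" step less error-prone, each non-ASCII character can be listed together with its Unicode name and general category, so that look-alikes such as the dash characters (category Pd) or space separators (category Zs) stand out. The following is only a sketch in Python using the standard unicodedata module; the function name and the sample text are made up for illustration.

```python
import unicodedata


def describe_non_ascii(text: str) -> list[tuple[int, str, str, str]]:
    """Return (line number, code point, general category, name) per non-ASCII char."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for ch in line:
            if ord(ch) > 0x7F:
                hits.append((
                    lineno,
                    f"U+{ord(ch):04X}",
                    unicodedata.category(ch),           # e.g. Zs, Pd, Ll
                    unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
                ))
    return hits


# Made-up sample: an nbsp, an en dash and an accented letter.
sample = "timeout\u00a0is ignored\nrange 1\u20135\nsorts '\u00e0e'"
for hit in describe_non_ascii(sample):
    print(*hit)
```

A reviewer can then skim the category column: Ll/Lu hits are usually names or collation examples, while Zs, Pd or Cf hits are the ones most likely to have been inserted by accident.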
On Wed, Oct 9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote: > >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > >> > > > >> > >>>> I think there's an unnecessary underscore in config.sgml. > >> > > > >> > > I was wrong. The particular byte sequences just looked an underscore > >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a > >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > >> > > insert a non breaking space while editing config.sgml. > >> > > >> > I wonder if it would be worth to add a check for this like we have to tabs? > >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > >> > (doing so made me realize we don't have an equivalent meson target). > >> > >> Can we check for any character outside the support range of SGML? > > > > What we can define the range of allowed characters range in SGML? > > > > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, > > but they are used in some places in charset.sgml and some names in release-*.sgml. > > I failed to find any standard regarding what characters are allowed in > SGML/XML. Assuming that any valid Unicode characters are allowed in > our *sgml files, I am afraid the best we can do is grepping non-ASCII > characters against the files and checking the results by a visual > inspection. Besides nbsp, there are tons of confusing Unicode > characters out there. For example there are many "hyphen like > characters". > > https://www.compart.com/en/unicode/category/Pd > > If one of them is used in the sgml files, it may be possible that it > was accidentally inserted. Can we use Unicode in the SGML files? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > Can we use Unicode in the SGML files? I believe we've been doing it for contributors' names that require non-ASCII letters, but not in any other places. regards, tom lane
> Bruce Momjian <bruce@momjian.us> writes: >> Can we use Unicode in the SGML files? > > I believe we've been doing it for contributors' names that require > non-ASCII letters, but not in any other places. We have non-ASCII letters in charset.sgml too, to show some examples of collation. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
> On 9 Oct 2024, at 04:49, Tatsuo Ishii <ishii@postgresql.org> wrote: > Besides nbsp, there are tons of confusing Unicode > characters out there. For example there are many "hyphen like > characters". Using characters which look alike is in the field of internet security known as homograph attacks, where for example a url visually passes for postgresql.org but in fact leads to an attacker. That sort of attack clearly doesn't apply to our docs though. However, what might cause similar problems is if we use a unicode character in example code which the reader could be expected to copy/paste into psql and run, which then (at best) causes a syntax error. We could probably build tooling to catch this (most likely not too hard in XSLT) but the ROI for that might be unfavourable. Even with tooling, committer caution is needed to ensure we don't publish examples that might cause unintended side effects when executed by copy/paste. What separates nbsp is that it may affect the rendering in an un-intuitive way by forcing two words to not break even if the viewport is too narrow to fit. Catching such characters seems worthwhile since it's also quite doable with a trivial grep. -- Daniel Gustafsson [0] https://en.wikipedia.org/wiki/IDN_homograph_attack
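The kind of tooling hinted at above (flagging non-ASCII characters specifically inside example code that readers may copy/paste) could look roughly like the sketch below. It is written in Python rather than XSLT, uses a regular expression instead of a real SGML/XML parser, and only handles <programlisting> and <screen> elements without attributes; all of that is a simplifying assumption for illustration, not an existing tool.

```python
import re

# Assumption: listings appear as <programlisting>...</programlisting> or
# <screen>...</screen> without attributes; a real tool would parse the markup.
LISTING = re.compile(r"<(programlisting|screen)>(.*?)</\1>", re.DOTALL)
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def non_ascii_in_listings(sgml: str) -> list[tuple[str, str]]:
    """Return (element name, offending character) pairs found inside listings."""
    hits = []
    for match in LISTING.finditer(sgml):
        for ch in NON_ASCII.findall(match.group(2)):
            hits.append((match.group(1), ch))
    return hits


# Made-up input: prose with an accented letter, and a listing hiding an nbsp.
sample = (
    "<para>caf\u00e9</para>\n"
    "<programlisting>SELECT 1\u00a0+ 1;</programlisting>\n"
)
for element, ch in non_ascii_in_listings(sample):
    print(element, f"U+{ord(ch):04X}")
```

Restricting the check to listings leaves contributor names and collation examples in running prose alone, which is where legitimate non-ASCII text was reported in this thread.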
> We can check non-ASCII letters SGML/XML files by preparing "allowlist" > that contains lines which are allowed to have non-ascii characters, > although this list will need to be maintained when lines in it are modified. > I've attached a patch to add a simple Perl script to do this. I doubt it really works. For example, nbsp can be used for formatting (that's the purpose of the character in the first place). Whenever a developer decides to use or not to use nbsp, the "allowlist" needs to be maintained. It's too annoying. I think it's better to add the non-ASCII character checking to the committing checklist and let committers check for non-ASCII characters in the patch. Non-ASCII characters are rarely used, so it would not become a burden. https://wiki.postgresql.org/wiki/Committing_checklist Maybe we can add to the wiki page something like this? git diff origin/master | grep -P '[^\x00-\x7f]' > During testing this script, I found "stylesheet-man.xsl" also has non-ascii > characters. I don't know these characters are really necessary though, since > I don't understand this file well. They are U+201C (double turned comma quotation mark) and U+201D (double comma quotation mark). <l:template name="sect3" text="Section %n, “%t”, in the documentation"/> I would like to know why they are necessary too. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
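For committers without GNU grep (the -P option was noted earlier in the thread as unavailable on macOS), the same idea as the one-liner above can be sketched in Python. One deliberate difference, which is an assumption of this sketch and not part of the proposal: it only looks at added lines of a unified diff, so context and removed lines do not produce noise. The function name and sample diff are made up.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7f]")


def added_lines_with_non_ascii(diff_text: str) -> list[str]:
    """Return added diff lines ('+' prefix, '+++' headers skipped) with non-ASCII.

    Caveat: an added line whose own content starts with '++' is skipped too.
    """
    return [
        line
        for line in diff_text.split("\n")
        if line.startswith("+")
        and not line.startswith("+++")
        and NON_ASCII.search(line)
    ]


# Made-up diff text; in practice this would be the output of `git diff`.
sample_diff = "\n".join([
    "--- a/doc/src/sgml/config.sgml",
    "+++ b/doc/src/sgml/config.sgml",
    "@@ -1,2 +1,2 @@",
    " <para>",
    "-then the longer timeout is ignored.",
    "+\u00a0then the longer timeout is ignored.",
    "+plain ASCII line",
])
for line in added_lines_with_non_ascii(sample_diff):
    print(repr(line))
```

Printing with repr() makes an otherwise invisible nbsp show up as '\xa0' in the output.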
On Fri, 11 Oct 2024 12:16:50 +0900 (JST) Tatsuo Ishii <ishii@postgresql.org> wrote: > > We can check non-ASCII letters SGML/XML files by preparing "allowlist" > > that contains lines which are allowed to have non-ascii characters, > > although this list will need to be maintained when lines in it are modified. > > I've attached a patch to add a simple Perl script to do this. > > I doubt it really works. For example, nbsp can be used formatting > (that's the purpose of the character in the first place). Whenever a > developer decides to or not to use nbsp, "allowlist" needs to be > maintained. It's too annoying. I suppose non-ASCII characters, including nbsp, are basically disallowed, so the allowlist will not grow unless there is some special reason. However, it is true that maintaining the list has some cost, so if people don't think it is worth adding this check, I will withdraw this proposal. > I think it's better to add the non-ASCII character checking to the > comitting check list and let committers check non-ASCII character in > the patch. Non-ASCII characters rarely used and it would not become a > burden. > https://wiki.postgresql.org/wiki/Committing_checklist > > Maybe we can add to the wiki page something like this? > > git diff origin/master | grep -P '[^\x00-\x7f]' > > > During testing this script, I found "stylesheet-man.xsl" also has non-ascii > > characters. I don't know these characters are really necessary though, since > > I don't understand this file well. > > They are U+201C (double turned comma quotation mark) and U+201D > (double comma quotation mark). > > <l:template name="sect3" text="Section %n, “%t”, in the documentation"/> > > I would like to know why they are necessary too. +1 Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote: > I did some more research and was able to clarify our behavior in > release.sgml: I have specified some more details in my patched version: We can only use Latin1 characters, not all UTF8 characters, because some rendering engines do not support non-Latin1 UTF8 characters. Specifically, the HTML rendering engine can display all UTF8 characters, but the PDF rendering engine can only display Latin1 characters. In PDF files, non-Latin1 UTF8 characters are displayed as "###". In the SGML files we encode non-ASCII Latin1 characters as HTML entities, e.g., &Aacute;lvaro. Oddly, it is possible to safely represent Latin1 characters in SGML files as UTF8 for HTML and PDF output, but we currently disallow this via the Makefile "check-non-ascii" rule. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
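The two rules described above (text must fit in Latin1, and non-ASCII Latin1 characters are written as entities) can be illustrated with a short Python sketch. It relies on the standard html.entities table of HTML 4 entity names; whether the documentation toolchain accepts every name in that table is an assumption not verified here, and the function names are made up.

```python
from html.entities import codepoint2name


def fits_latin1(text: str) -> bool:
    """True if every character of text exists in Latin1 (ISO 8859-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def to_entities(text: str) -> str:
    """Replace non-ASCII characters that have a named HTML 4 entity.

    Characters without a named entity are left unchanged.
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch)) if ord(ch) > 0x7F else None
        out.append(f"&{name};" if name else ch)
    return "".join(out)


print(fits_latin1("\u00c1lvaro"), to_entities("\u00c1lvaro"))
print(fits_latin1("\u0431"))  # a Cyrillic letter, outside Latin1
```

Note that the entity table also covers some characters outside Latin1 (Greek letters, for instance), so passing to_entities does not by itself imply that a character is displayable in the PDF output described above.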
On 15.10.24 18:54, Bruce Momjian wrote: >> I agree with encoding non-Latin1 characters and disallowing non-ASCII >> characters totally. >> >> I found your patch includes fixes in *.svg files, so how about checking >> also them by check-non-ascii? Also, I think it is better to use perl instead >> of grep because non-GNU grep doesn't support hex escape sequences. I've attached >> a updated patch for Makefile. The changes in release.sgml above is not applied >> yet, though. > Yes, good idea on using Perl and checking svg files --- I have used your > Makefile rule. > > Attached is an updated patch. I realized that the new rules apply to > all SGML files, not just the release notes, so I have created > README.non-ASCII and moved the description there. I don't understand the point of this. Maybe it's okay to try to detect certain "hidden" whitespace characters, like in the case that started this thread. But I don't see the value in prohibiting all non-ASCII characters, as is being proposed here.
On Tue, Oct 15, 2024 at 10:34:16PM +0200, Peter Eisentraut wrote: > On 15.10.24 18:54, Bruce Momjian wrote: > > > I agree with encoding non-Latin1 characters and disallowing non-ASCII > > > characters totally. > > > > > > I found your patch includes fixes in *.svg files, so how about checking > > > also them by check-non-ascii? Also, I think it is better to use perl instead > > > of grep because non-GNU grep doesn't support hex escape sequences. I've attached > > > a updated patch for Makefile. The changes in release.sgml above is not applied > > > yet, though. > > Yes, good idea on using Perl and checking svg files --- I have used your > > Makefile rule. > > > > Attached is an updated patch. I realized that the new rules apply to > > all SGML files, not just the release notes, so I have created > > README.non-ASCII and moved the description there. > > I don't understand the point of this. Maybe it's okay to try to detect > certain "hidden" whitespace characters, like in the case that started this > thread. But I don't see the value in prohibiting all non-ASCII characters, > as is being proposed here. Well, we can only use Latin-1, so the idea is that we will be explicit about specifying Latin-1 only as HTML entities, rather than letting non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files if desired. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On 15.10.24 22:37, Bruce Momjian wrote: >> I don't understand the point of this. Maybe it's okay to try to detect >> certain "hidden" whitespace characters, like in the case that started this >> thread. But I don't see the value in prohibiting all non-ASCII characters, >> as is being proposed here. > Well, we can only use Latin-1, so the idea is that we will be explicit > about specifying Latin-1 only as HTML entities, rather than letting > non-Latin-1 creep in as UTF8. But your patch prohibits even otherwise allowed Latin-1 characters. I don't see why we need to enforce this at this level. Whatever downstream toolchain has requirements about which characters are allowed will complain if it encounters a character it doesn't like.
Bruce Momjian <bruce@momjian.us> writes: > Well, we can only use Latin-1, so the idea is that we will be explicit > about specifying Latin-1 only as HTML entities, rather than letting > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > if desired. That policy would cause substantial problems with contributor names in the release notes. I agree with Peter that we don't need this. Catching otherwise-invisible characters seems sufficient. regards, tom lane
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Well, we can only use Latin-1, so the idea is that we will be explicit > > about specifying Latin-1 only as HTML entities, rather than letting > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > > if desired. > > That policy would cause substantial problems with contributor names > in the release notes. I agree with Peter that we don't need this. > Catching otherwise-invisible characters seems sufficient. Uh, why can't we use HTML entities going forward? Is that harder? Can we just exclude the release notes from this check? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: >> That policy would cause substantial problems with contributor names >> in the release notes. I agree with Peter that we don't need this. >> Catching otherwise-invisible characters seems sufficient. > Uh, why can't we use HTML entities going forward? Is that harder? Yes: it requires looking up the entities. The mail you are probably consulting to make a release note or commit message is most likely just going to contain the person's name as normally spelled. Plus (as you pointed out earlier today) there aren't HTML entities for all characters. > Can we just exclude the release notes from this check? What is the point of a check we can only enforce against part of the documentation? regards, tom lane
On 15.10.24 23:51, Bruce Momjian wrote: >> I don't see why we need to enforce this at this level. Whatever downstream >> toolchain has requirements about which characters are allowed will complain >> if it encounters a character it doesn't like. > > Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8 > characters. To test this I added some Russian characters (non-Latin-1) > to release.sgml: > > (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩, > ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩, > ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier > letters or "signs" (⟨ъ⟩, ⟨ь⟩) > > and I ran 'make postgres-US.pdf', and then removed the Russian > characters and ran the same command again. The output, including stderr > was identical. The PDFs, of course, were not, with the Russian > characters showing as "####". Makefile output attached. Hmm, mine complains:

/opt/homebrew/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Glyph "⟨" (0x27e8) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "б" (0x431, afii10066) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "⟩" (0x27e9) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "в" (0x432, afii10067) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "г" (0x433, afii10068) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "д" (0x434, afii10069) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "ж" (0x436, afii10072) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "з" (0x437, afii10073) not available in font "Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO.
On 15.10.24 23:51, Bruce Momjian wrote: > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> Well, we can only use Latin-1, so the idea is that we will be explicit >>> about specifying Latin-1 only as HTML entities, rather than letting >>> non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files >>> if desired. >> >> That policy would cause substantial problems with contributor names >> in the release notes. I agree with Peter that we don't need this. >> Catching otherwise-invisible characters seems sufficient. > > Uh, why can't we use HTML entities going forward? Is that harder? I think the question should be the other way around. The entities are a historical workaround for when encoding support and rendering support was poor. Now you can just type in the characters you want as is, which seems nicer.
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote: > On 15.10.24 23:51, Bruce Momjian wrote: > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > Well, we can only use Latin-1, so the idea is that we will be explicit > > > > about specifying Latin-1 only as HTML entities, rather than letting > > > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > > > > if desired. > > > > > > That policy would cause substantial problems with contributor names > > > in the release notes. I agree with Peter that we don't need this. > > > Catching otherwise-invisible characters seems sufficient. > > > > Uh, why can't we use HTML entities going forward? Is that harder? > > I think the question should be the other way around. The entities are a > historical workaround for when encoding support and rendering support was > poor. Now you can just type in the characters you want as is, which seems > nicer. Yes, that does make sense, and if we fully supported Unicode, we could ignore all of this. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Wed, Oct 16, 2024 at 09:58:23AM +0200, Peter Eisentraut wrote: > On 15.10.24 23:51, Bruce Momjian wrote: > > > I don't see why we need to enforce this at this level. Whatever downstream > > > toolchain has requirements about which characters are allowed will complain > > > if it encounters a character it doesn't like. > > > > Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8 > > characters. To test this I added some Russian characters (non-Latin-1) > > to release.sgml: > > > > (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩, > > ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩, > > ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier > > letters or "signs" (⟨ъ⟩, ⟨ь⟩) > > > > and I ran 'make postgres-US.pdf', and then removed the Russian > > characters and ran the same command again. The output, including stderr > > was identical. The PDFs, of course, were not, with the Russian > > characters showing as "####". Makefile output attached. > > Hmm, mine complains: My Debian 12 toolchain must be older. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"