Thread: Doc: typo in config.sgml
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..08173ecb5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9380,7 +9380,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 <para>
 If <varname>transaction_timeout</varname> is shorter or equal to
 <varname>idle_in_transaction_session_timeout</varname> or <varname>statement_timeout</varname>
- then the longer timeout is ignored.
+ then the longer timeout is ignored.
 </para>
 <para>
>> I think there's an unnecessary underscore in config.sgml. >> Attached patch fixes it. > > I could not apply the patch with an error. > > error: patch failed: doc/src/sgml/config.sgml:9380 > error: doc/src/sgml/config.sgml: patch does not apply Strange. I have no problem applying the patch here. > I found your patch contains an odd character (ASCII Code 240?) > by performing `od -c` command on the file. See the attached file. Yes, 240 in octal (== 0xc2) is in the patch but it's because current config.sgml includes the character. You can check it by looking at line 9383 of config.sgml. I think it was introduced by 28e858c0f95. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 17:23:24 +0900 (JST) Tatsuo Ishii <ishii@postgresql.org> wrote: > >> I think there's an unnecessary underscore in config.sgml. > >> Attached patch fixes it. > > > > I could not apply the patch with an error. > > > > error: patch failed: doc/src/sgml/config.sgml:9380 > > error: doc/src/sgml/config.sgml: patch does not apply > > Strange. I have no problem applying the patch here. > > > I found your patch contains an odd character (ASCII Code 240?) > > by performing `od -c` command on the file. See the attached file. > > Yes, 240 in octal (== 0xc2) is in the patch but it's because current > config.sgml includes the character. You can check it by looking at > line 9383 of config.sgml. Yes, you are right, I can find the 0xc2 char in config.sgml using od -c, although I still could not apply the patch. I think this is non-breaking space of (C2A0) of utf-8. I guess my terminal normally regards this as a space, so applying patch fails. I found it also in line 85 of ref/drop_extension.sgml. > > I think it was introduced by 28e858c0f95. > > Best reagards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp -- Yugo NAGATA <nagata@sraoss.co.jp>
>>> I think there's an unnecessary underscore in config.sgml.

I was wrong. The byte sequence just looked like an underscore in my
editor, but it is actually 0xc2a0, which must be a "non-breaking
space" encoded in UTF-8. I guess someone mistakenly inserted a
non-breaking space while editing config.sgml.

However, the mistake does not affect the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 18:03:44 +0900 (JST) Tatsuo Ishii <ishii@postgresql.org> wrote: > >>> I think there's an unnecessary underscore in config.sgml. > > I was wrong. The particular byte sequences just looked an underscore > on my editor but the byte sequence is actually 0xc2a0, which must be a > "non breaking space" encoded in UTF-8. I guess someone mistakenly > insert a non breaking space while editing config.sgml. > > However the mistake does not affect the patch. It looks like we've crisscrossed our mail. Anyway, I agree with removing non breaking spaces, as well as one found in line 85 of ref/drop_extension.sgml. Regards, Yugo Nagata > > Best reagards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp -- Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, 30 Sep 2024 11:59:48 +0200 Daniel Gustafsson <daniel@yesql.se> wrote: > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > I was wrong. The particular byte sequences just looked an underscore > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > insert a non breaking space while editing config.sgml. > > I wonder if it would be worth to add a check for this like we have to tabs? > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > (doing so made me realize we don't have an equivalent meson target). Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`. However, it also detects the following line in charset.sgml. (https://www.postgresql.org/docs/current/collation.html) For example, locale und-u-kb sorts 'àe' before 'aé'. This is not non-breaking space, so should not be detected as an error. Regards, Yugo Nagata > -- > Daniel Gustafsson > -- Yugo Nagata <nagata@sraoss.co.jp>
>> I wonder if it would be worth to add a check for this like we have to tabs?

+1.

>> The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
>> (doing so made me realize we don't have an equivalent meson target).
>
> Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works
> when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
>
> However, it also detects the following line in charset.sgml.
> (https://www.postgresql.org/docs/current/collation.html)
>
> For example, locale und-u-kb sorts 'àe' before 'aé'.
>
> This is not non-breaking space, so should not be detected as an error.

That's because a non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes); 0xa0 is nbsp's code
point in Unicode, i.e. U+00A0.
So grep -P "[\xC2\xA0]" should work to detect nbsp.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
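The byte-level point in this message can be illustrated with a small sketch. Python is used here purely for illustration and is not part of any patch in this thread; the variable and function names are made up for the example. It shows that U+00A0 encodes to the two bytes 0xC2 0xA0 in UTF-8, while 'à' (U+00E0) encodes to 0xC3 0xA0, which is presumably why a search for the lone byte 0xA0 also flags the collation example in charset.sgml.

```python
# Illustrative only: why searching for the single byte 0xA0 is not enough
# to identify a non-breaking space in a UTF-8 encoded file.

NBSP = "\u00a0"     # NO-BREAK SPACE, code point U+00A0
A_GRAVE = "\u00e0"  # 'à', code point U+00E0

nbsp_bytes = NBSP.encode("utf-8")
a_grave_bytes = A_GRAVE.encode("utf-8")


def contains_nbsp(data: bytes) -> bool:
    """Return True if data contains the two-byte UTF-8 encoding of U+00A0."""
    return b"\xc2\xa0" in data


# A line like the collation example, and a line with a stray nbsp.
line_with_a_grave = "sorts '\u00e0e' before 'a\u00e9'".encode("utf-8")
line_with_nbsp = "then the longer\u00a0timeout is ignored.".encode("utf-8")

print(nbsp_bytes.hex(), a_grave_bytes.hex())

# The lone byte 0xA0 is also the second byte of 'à', so it gives a false hit.
lone_byte_hit = b"\xa0" in line_with_a_grave
print(lone_byte_hit)

# Matching the full two-byte sequence avoids the false hit.
print(contains_nbsp(line_with_a_grave), contains_nbsp(line_with_nbsp))
```

Matching the complete sequence 0xC2 0xA0 as consecutive bytes, rather than either byte on its own, is what separates the nbsp from other two-byte UTF-8 characters that happen to end in 0xA0.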
>> That's because non-breaking space (nbsp) is not encoded as 0xa0 in
>> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
>> point in Unicode. i.e. U+00A0).
>> So grep -P "[\xC2\xA0]" should work to detect nbsp.
>
> `LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
> ([ and ] were not necessary.)
>
> When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
> but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
> nbsp.
>
> One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
>
> On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
>
> Maybe, better way is use perl itself rather than grep as following.
>
> `perl -ne '/\xC2\xA0/ and print' `
>
> I attached a patch fixed in this way.

GNU sed can also be used without setting LC_ALL:

sed -n /"\xC2\xA0"/p

However, I am not sure if non-GNU sed can do this too...

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
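The portability concerns above (GNU-only grep -P, bash-only quoting, locale settings) all stem from getting a tool to match raw bytes. As a rough sketch only, and not the rule that was proposed for the Makefile, the same check can be written by reading files in binary mode so that locale does not matter. The function names and the throwaway file names below are hypothetical.

```python
import os
import tempfile

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0


def find_nbsp(paths):
    """Return a list of (path, line_number) for lines containing a UTF-8 nbsp."""
    hits = []
    for path in paths:
        # Read bytes, not text, so the result does not depend on the locale.
        with open(path, "rb") as f:
            for lineno, line in enumerate(f, start=1):
                if NBSP_UTF8 in line:
                    hits.append((path, lineno))
    return hits


def check(paths):
    """Print offending lines and return a shell-style exit status (0 = clean)."""
    hits = find_nbsp(paths)
    for path, lineno in hits:
        print(f"{path}:{lineno}: non-breaking space found")
    return 1 if hits else 0


# Small demonstration on two throwaway files (hypothetical names).
tmpdir = tempfile.mkdtemp()
clean_path = os.path.join(tmpdir, "clean.sgml")
dirty_path = os.path.join(tmpdir, "dirty.sgml")
with open(clean_path, "wb") as f:
    f.write("<para>\nsorts '\u00e0e' before 'a\u00e9'\n</para>\n".encode("utf-8"))
with open(dirty_path, "wb") as f:
    f.write("<para>\nthen the longer\u00a0timeout is ignored.\n</para>\n".encode("utf-8"))

status_clean = check([clean_path])
status_dirty = check([clean_path, dirty_path])
```

Returning a non-zero status when something is found matters for a build rule, since the rule has to fail in that case rather than merely print the line.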
> On Mon, 30 Sep 2024 17:23:24 +0900 (JST) > Tatsuo Ishii <ishii@postgresql.org> wrote: > >> >> I think there's an unnecessary underscore in config.sgml. >> >> Attached patch fixes it. >> > >> > I could not apply the patch with an error. >> > >> > error: patch failed: doc/src/sgml/config.sgml:9380 >> > error: doc/src/sgml/config.sgml: patch does not apply >> >> Strange. I have no problem applying the patch here. >> >> > I found your patch contains an odd character (ASCII Code 240?) >> > by performing `od -c` command on the file. See the attached file. >> >> Yes, 240 in octal (== 0xc2) is in the patch but it's because current >> config.sgml includes the character. You can check it by looking at >> line 9383 of config.sgml. > > Yes, you are right, I can find the 0xc2 char in config.sgml using od -c, > although I still could not apply the patch. > > I think this is non-breaking space of (C2A0) of utf-8. I guess my > terminal normally regards this as a space, so applying patch fails. > > I found it also in line 85 of ref/drop_extension.sgml. Thanks. I have pushed the fix for ref/drop_extension.sgml along with config.sgml. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > I was wrong. The particular byte sequences just looked an underscore > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > insert a non breaking space while editing config.sgml. > > I wonder if it would be worth to add a check for this like we have to tabs? > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > (doing so made me realize we don't have an equivalent meson target). Can we check for any character outside the support range of SGML? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
> On Tue, 1 Oct 2024 22:20:55 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > >> On Tue, 1 Oct 2024 15:16:52 +0900 >> Yugo NAGATA <nagata@sraoss.co.jp> wrote: >> >> > On Tue, 01 Oct 2024 10:33:50 +0900 (JST) >> > Tatsuo Ishii <ishii@postgresql.org> wrote: >> > >> > > >> That's because non-breaking space (nbsp) is not encoded as 0xa0 in >> > > >> UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code >> > > >> point in Unicode. i.e. U+00A0). >> > > >> So grep -P "[\xC2\xA0]" should work to detect nbsp. >> > > > >> > > > `LC_ALL=C grep -P "\xC2\xA0"` works for my environment. >> > > > ([ and ] were not necessary.) >> > > > >> > > > When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml, >> > > > but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting >> > > > nbsp. >> > > > >> > > > One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it. >> > > > >> > > > On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash. >> > > > >> > > > Maybe, better way is use perl itself rather than grep as following. >> > > > >> > > > `perl -ne '/\xC2\xA0/ and print' ` >> > > > >> > > > I attached a patch fixed in this way. >> > > >> > > GNU sed can also be used without setting LC_ALL: >> > > >> > > sed -n /"\xC2\xA0"/p >> > > >> > > However I am not sure if non-GNU sed can do this too... >> > >> > Although I've not check it myself, BSD sed doesn't support \x escape according to [1]. >> > >> > [1] https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference >> > >> > By the way, I've attached a patch a bit modified to use the plural form statement >> > as same as check-tabs. >> > >> > Non-breaking **spaces** appear in SGML/XML files >> >> The previous patch was broken because the perl command failed to return the correct result. 
>> I've attached an updated patch to fix the return value. In passing, I added line breaks >> for long lines. > > I've attached a updated patch. > I added the comment to explain why Perl is used instead of grep or sed. Looks good to me. If there's no objection, I will commit this to master branch. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote: >> On Tue, 1 Oct 2024 22:20:55 +0900 >> Yugo Nagata <nagata@sraoss.co.jp> wrote: >> I've attached a updated patch. >> I added the comment to explain why Perl is used instead of grep or sed. > > Looks good to me. If there's no objection, I will commit this to > master branch. No objections, LGTM. -- Daniel Gustafsson
Hi Daniel, Yugo,

>> On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> On Tue, 1 Oct 2024 22:20:55 +0900
>>> Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
>>> I've attached a updated patch.
>>> I added the comment to explain why Perl is used instead of grep or sed.
>>
>> Looks good to me. If there's no objection, I will commit this to
>> master branch.
>
> No objections, LGTM.

Thank you for the patch and review! I have pushed the patch.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b7da5c261d1af1a5d6a275e1090b07de3654033

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 7 Oct 2024 15:45:54 -0400 Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > > > > > >>>> I think there's an unnecessary underscore in config.sgml. > > > > > > I was wrong. The particular byte sequences just looked an underscore > > > on my editor but the byte sequence is actually 0xc2a0, which must be a > > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > > > insert a non breaking space while editing config.sgml. > > > > I wonder if it would be worth to add a check for this like we have to tabs? > > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > > (doing so made me realize we don't have an equivalent meson target). > > Can we check for any character outside the support range of SGML? What we can define the range of allowed characters range in SGML? We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, but they are used in some places in charset.sgml and some names in release-*.sgml. Regards, Yugo Nagata > > -- > Bruce Momjian <bruce@momjian.us> https://momjian.us > EDB https://enterprisedb.com > > When a patient asks the doctor, "Am I going to die?", he means > "Am I going to die soon?" > > -- Yugo Nagata <nagata@sraoss.co.jp>
> On Mon, 7 Oct 2024 15:45:54 -0400 > Bruce Momjian <bruce@momjian.us> wrote: > >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: >> > > >> > >>>> I think there's an unnecessary underscore in config.sgml. >> > > >> > > I was wrong. The particular byte sequences just looked an underscore >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly >> > > insert a non breaking space while editing config.sgml. >> > >> > I wonder if it would be worth to add a check for this like we have to tabs? >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp >> > (doing so made me realize we don't have an equivalent meson target). >> >> Can we check for any character outside the support range of SGML? > > What we can define the range of allowed characters range in SGML? > > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, > but they are used in some places in charset.sgml and some names in release-*.sgml. I failed to find any standard regarding what characters are allowed in SGML/XML. Assuming that any valid Unicode characters are allowed in our *sgml files, I am afraid the best we can do is grepping non-ASCII characters against the files and checking the results by a visual inspection. Besides nbsp, there are tons of confusing Unicode characters out there. For example there are many "hyphen like characters". https://www.compart.com/en/unicode/category/Pd If one of them is used in the sgml files, it may be possible that it was accidentally inserted. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
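For the "grep for non-ASCII characters and inspect the results by eye" approach described above, a listing that names each character may make the inspection easier, since look-alike hyphens and spaces are hard to tell apart on screen. The following is only a sketch under that assumption; the function name and sample text are invented for illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (line, column, code point, Unicode name) for each non-ASCII char."""
    found = []
    # Split on "\n" only; str.splitlines() would also split on some
    # non-ASCII separators (e.g. U+0085) and hide them from the report.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                found.append((lineno, col, f"U+{ord(ch):04X}", name))
    return found


# Sample text with a look-alike hyphen (U+2010) and a non-breaking space.
sample = "a\u2010b\nplain ascii line\nx\u00a0y"
report = describe_non_ascii(sample)
for entry in report:
    print(entry)
```

A report of this kind does not decide whether a character is legitimate; it only makes an accidental insertion visible to whoever reviews the output.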
On Wed, Oct 9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote: > >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote: > >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote: > >> > > > >> > >>>> I think there's an unnecessary underscore in config.sgml. > >> > > > >> > > I was wrong. The particular byte sequences just looked an underscore > >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a > >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly > >> > > insert a non breaking space while editing config.sgml. > >> > > >> > I wonder if it would be worth to add a check for this like we have to tabs? > >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp > >> > (doing so made me realize we don't have an equivalent meson target). > >> > >> Can we check for any character outside the support range of SGML? > > > > What we can define the range of allowed characters range in SGML? > > > > We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/, > > but they are used in some places in charset.sgml and some names in release-*.sgml. > > I failed to find any standard regarding what characters are allowed in > SGML/XML. Assuming that any valid Unicode characters are allowed in > our *sgml files, I am afraid the best we can do is grepping non-ASCII > characters against the files and checking the results by a visual > inspection. Besides nbsp, there are tons of confusing Unicode > characters out there. For example there are many "hyphen like > characters". > > https://www.compart.com/en/unicode/category/Pd > > If one of them is used in the sgml files, it may be possible that it > was accidentally inserted. Can we use Unicode in the SGML files? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > Can we use Unicode in the SGML files? I believe we've been doing it for contributors' names that require non-ASCII letters, but not in any other places. regards, tom lane
> Bruce Momjian <bruce@momjian.us> writes:
>> Can we use Unicode in the SGML files?
>
> I believe we've been doing it for contributors' names that require
> non-ASCII letters, but not in any other places.

We have non-ASCII letters in charset.sgml too, to show some examples
of collation.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> On 9 Oct 2024, at 04:49, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".

Using characters which look alike is known in the field of internet security
as a homograph attack [0], where for example a URL visually passes for
postgresql.org but in fact leads to an attacker. That sort of attack clearly
doesn't apply to our docs though.

However, what might cause similar problems is if we use a Unicode character in
example code which the reader could be expected to copy/paste into psql and
run, which then (at best) causes a syntax error. We could probably build
tooling to catch this (most likely not too hard in XSLT) but the ROI for that
might be unfavourable. Even with tooling, committer caution is needed to
ensure we don't publish examples that might cause unintended side effects when
executed by copy/paste.

What separates nbsp is that it may affect the rendering in an unintuitive way
by forcing two words to not break even if the viewport is too narrow to fit.
Catching such characters seems worthwhile since it's also quite doable with a
trivial grep.

--
Daniel Gustafsson

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack
> We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> that contains lines which are allowed to have non-ascii characters,
> although this list will need to be maintained when lines in it are modified.
> I've attached a patch to add a simple Perl script to do this.

I doubt it really works. For example, nbsp can be used for formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to use or not to use nbsp, the "allowlist" needs to
be maintained. It's too annoying.

I think it's better to add non-ASCII character checking to the
committing checklist and let committers check for non-ASCII characters
in the patch. Non-ASCII characters are rarely used, so it would not
become a burden.

https://wiki.postgresql.org/wiki/Committing_checklist

Maybe we can add to the wiki page something like this?

git diff origin/master | grep -P '[^\x00-\x7f]'

> During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> characters. I don't know these characters are really necessary though, since
> I don't understand this file well.

They are U+201C (double turned comma quotation mark) and U+201D
(double comma quotation mark).

<l:template name="sect3" text="Section %n, “%t”, in the documentation"/>

I would like to know why they are necessary too.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
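As a sketch of the committer-side check suggested above, the same idea can be expressed as a function over diff text. This is an assumption-laden illustration rather than a proposed tool: the function name is made up, and unlike the grep pipeline it looks only at added lines ("+" lines) instead of every line of the diff.

```python
import re

# Same character class as in the grep command: anything outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def non_ascii_added_lines(diff_text):
    """Return the added lines of a unified diff that contain non-ASCII characters."""
    hits = []
    for line in diff_text.split("\n"):
        # "+++ b/file" is a file header, not an added line.
        if line.startswith("+") and not line.startswith("+++"):
            if NON_ASCII.search(line):
                hits.append(line)
    return hits


# Hypothetical diff text: one added line carries a non-breaking space.
sample_diff = "\n".join([
    "--- a/doc/src/sgml/config.sgml",
    "+++ b/doc/src/sgml/config.sgml",
    "@@ -1,2 +1,3 @@",
    " <para>",
    "+ then the longer\u00a0timeout is ignored.",
    "+ plain ASCII addition",
    " </para>",
])

flagged = non_ascii_added_lines(sample_diff)
print(flagged)
```

In practice the diff text would come from something like the `git diff origin/master` output mentioned in the message; how it is obtained is left out of the sketch.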
On Fri, 11 Oct 2024 12:16:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:

> > We can check non-ASCII letters SGML/XML files by preparing "allowlist"
> > that contains lines which are allowed to have non-ascii characters,
> > although this list will need to be maintained when lines in it are modified.
> > I've attached a patch to add a simple Perl script to do this.
>
> I doubt it really works. For example, nbsp can be used formatting
> (that's the purpose of the character in the first place). Whenever a
> developer decides to or not to use nbsp, "allowlist" needs to be
> maintained. It's too annoying.

I suppose non-ASCII characters, including nbsp, are basically disallowed, so the
allowlist will not grow unless there is some special reason.

However, it is true that maintaining the list has some cost, so if people
don't think this check is worth adding, I will withdraw this proposal.

> I think it's better to add the non-ASCII character checking to the
> comitting check list and let committers check non-ASCII character in
> the patch. Non-ASCII characters rarely used and it would not become a
> burden.
> https://wiki.postgresql.org/wiki/Committing_checklist
>
> Maybe we can add to the wiki page something like this?
>
> git diff origin/master | grep -P '[^\x00-\x7f]'
>
> > During testing this script, I found "stylesheet-man.xsl" also has non-ascii
> > characters. I don't know these characters are really necessary though, since
> > I don't understand this file well.
>
> They are U+201C (double turned comma quotation mark) and U+201D
> (double comma quotation mark).
>
> <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
>
> I would like to know why they are necessary too.

+1

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
> I did some more research and was able to clarify our behavior in
> release.sgml:

I have specified some more details in my patched version:

We can only use Latin1 characters, not all UTF8 characters,
because some rendering engines do not support non-Latin1 UTF8
characters. Specifically, the HTML rendering engine can display
all UTF8 characters, but the PDF rendering engine can only display
Latin1 characters. In PDF files, non-Latin1 UTF8 characters are
displayed as "###".

In the SGML files we encode non-ASCII Latin1 characters as HTML
entities, e.g., &Aacute;lvaro. Oddly, it is possible to safely
represent Latin1 characters in SGML files as UTF8 for HTML and
PDF output, but we currently disallow this via the Makefile
"check-non-ascii" rule.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
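To make the three classes in this description concrete (ASCII, non-ASCII Latin1, and everything else), here is a hedged sketch that classifies characters and rewrites non-ASCII Latin1 characters as named HTML entities, e.g. &Aacute;lvaro for Álvaro. It is not the script or Makefile rule from the patch; the function names are hypothetical, and it relies on Python's standard `html.entities.codepoint2name` table for the entity names.

```python
from html.entities import codepoint2name


def classify(ch):
    """Classify one character as 'ascii', 'latin1' (U+0080..U+00FF) or 'other'."""
    cp = ord(ch)
    if cp <= 0x7F:
        return "ascii"
    if cp <= 0xFF:
        return "latin1"
    return "other"


def latin1_to_entities(text):
    """Replace non-ASCII Latin1 characters with named entities where one exists."""
    out = []
    for ch in text:
        if classify(ch) == "latin1" and ord(ch) in codepoint2name:
            out.append(f"&{codepoint2name[ord(ch)]};")
        else:
            # ASCII is kept as is; characters outside Latin1 are left
            # untouched so that a separate check can report them.
            out.append(ch)
    return "".join(out)


encoded = latin1_to_entities("\u00c1lvaro")
kinds = [classify(c) for c in "a\u00c1\u0431"]  # 'a', 'Á', Cyrillic 'б'
print(encoded, kinds)
```

Under the description above, characters in the "other" class are the ones that would render as "###" in PDF output, so a check would presumably want to report those rather than convert them.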
On 15.10.24 18:54, Bruce Momjian wrote: >> I agree with encoding non-Latin1 characters and disallowing non-ASCII >> characters totally. >> >> I found your patch includes fixes in *.svg files, so how about checking >> also them by check-non-ascii? Also, I think it is better to use perl instead >> of grep because non-GNU grep doesn't support hex escape sequences. I've attached >> a updated patch for Makefile. The changes in release.sgml above is not applied >> yet, though. > Yes, good idea on using Perl and checking svg files --- I have used your > Makefile rule. > > Attached is an updated patch. I realized that the new rules apply to > all SGML files, not just the release notes, so I have created > README.non-ASCII and moved the description there. I don't understand the point of this. Maybe it's okay to try to detect certain "hidden" whitespace characters, like in the case that started this thread. But I don't see the value in prohibiting all non-ASCII characters, as is being proposed here.
On Tue, Oct 15, 2024 at 10:34:16PM +0200, Peter Eisentraut wrote: > On 15.10.24 18:54, Bruce Momjian wrote: > > > I agree with encoding non-Latin1 characters and disallowing non-ASCII > > > characters totally. > > > > > > I found your patch includes fixes in *.svg files, so how about checking > > > also them by check-non-ascii? Also, I think it is better to use perl instead > > > of grep because non-GNU grep doesn't support hex escape sequences. I've attached > > > a updated patch for Makefile. The changes in release.sgml above is not applied > > > yet, though. > > Yes, good idea on using Perl and checking svg files --- I have used your > > Makefile rule. > > > > Attached is an updated patch. I realized that the new rules apply to > > all SGML files, not just the release notes, so I have created > > README.non-ASCII and moved the description there. > > I don't understand the point of this. Maybe it's okay to try to detect > certain "hidden" whitespace characters, like in the case that started this > thread. But I don't see the value in prohibiting all non-ASCII characters, > as is being proposed here. Well, we can only use Latin-1, so the idea is that we will be explicit about specifying Latin-1 only as HTML entities, rather than letting non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files if desired. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On 15.10.24 22:37, Bruce Momjian wrote: >> I don't understand the point of this. Maybe it's okay to try to detect >> certain "hidden" whitespace characters, like in the case that started this >> thread. But I don't see the value in prohibiting all non-ASCII characters, >> as is being proposed here. > Well, we can only use Latin-1, so the idea is that we will be explicit > about specifying Latin-1 only as HTML entities, rather than letting > non-Latin-1 creep in as UTF8. But your patch prohibits even otherwise allowed Latin-1 characters. I don't see why we need to enforce this at this level. Whatever downstream toolchain has requirements about which characters are allowed will complain if it encounters a character it doesn't like.
Bruce Momjian <bruce@momjian.us> writes: > Well, we can only use Latin-1, so the idea is that we will be explicit > about specifying Latin-1 only as HTML entities, rather than letting > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > if desired. That policy would cause substantial problems with contributor names in the release notes. I agree with Peter that we don't need this. Catching otherwise-invisible characters seems sufficient. regards, tom lane
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Well, we can only use Latin-1, so the idea is that we will be explicit > > about specifying Latin-1 only as HTML entities, rather than letting > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > > if desired. > > That policy would cause substantial problems with contributor names > in the release notes. I agree with Peter that we don't need this. > Catching otherwise-invisible characters seems sufficient. Uh, why can't we use HTML entities going forward? Is that harder? Can we just exclude the release notes from this check? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: >> That policy would cause substantial problems with contributor names >> in the release notes. I agree with Peter that we don't need this. >> Catching otherwise-invisible characters seems sufficient. > Uh, why can't we use HTML entities going forward? Is that harder? Yes: it requires looking up the entities. The mail you are probably consulting to make a release note or commit message is most likely just going to contain the person's name as normally spelled. Plus (as you pointed out earlier today) there aren't HTML entities for all characters. > Can we just exclude the release notes from this check? What is the point of a check we can only enforce against part of the documentation? regards, tom lane
On 15.10.24 23:51, Bruce Momjian wrote:
>> I don't see why we need to enforce this at this level. Whatever downstream
>> toolchain has requirements about which characters are allowed will complain
>> if it encounters a character it doesn't like.
>
> Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8
> characters. To test this I added some Russian characters (non-Latin-1)
> to release.sgml:
>
> (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
> ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
> ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
> letters or "signs" (⟨ъ⟩, ⟨ь⟩)
>
> and I ran 'make postgres-US.pdf', and then removed the Russian
> characters and ran the same command again. The output, including stderr
> was identical. The PDFs, of course, were not, with the Russian
> characters showing as "####". Makefile output attached.

Hmm, mine complains:

/opt/homebrew/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Glyph "⟨" (0x27e8) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "б" (0x431, afii10066) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "⟩" (0x27e9) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "в" (0x432, afii10067) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "г" (0x433, afii10068) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "д" (0x434, afii10069) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "ж" (0x436, afii10072) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "з" (0x437, afii10073) not available in font "Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO.
On 15.10.24 23:51, Bruce Momjian wrote: > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> Well, we can only use Latin-1, so the idea is that we will be explicit >>> about specifying Latin-1 only as HTML entities, rather than letting >>> non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files >>> if desired. >> >> That policy would cause substantial problems with contributor names >> in the release notes. I agree with Peter that we don't need this. >> Catching otherwise-invisible characters seems sufficient. > > Uh, why can't we use HTML entities going forward? Is that harder? I think the question should be the other way around. The entities are a historical workaround for when encoding support and rendering support was poor. Now you can just type in the characters you want as is, which seems nicer.
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote: > On 15.10.24 23:51, Bruce Momjian wrote: > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > Well, we can only use Latin-1, so the idea is that we will be explicit > > > > about specifying Latin-1 only as HTML entities, rather than letting > > > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > > > > if desired. > > > > > > That policy would cause substantial problems with contributor names > > > in the release notes. I agree with Peter that we don't need this. > > > Catching otherwise-invisible characters seems sufficient. > > > > Uh, why can't we use HTML entities going forward? Is that harder? > > I think the question should be the other way around. The entities are a > historical workaround for when encoding support and rendering support was > poor. Now you can just type in the characters you want as is, which seems > nicer. Yes, that does make sense, and if we fully supported Unicode, we could ignore all of this. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Wed, Oct 16, 2024 at 09:58:23AM +0200, Peter Eisentraut wrote: > On 15.10.24 23:51, Bruce Momjian wrote: > > > I don't see why we need to enforce this at this level. Whatever downstream > > > toolchain has requirements about which characters are allowed will complain > > > if it encounters a character it doesn't like. > > > > Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8 > > characters. To test this I added some Russian characters (non-Latin-1) > > to release.sgml: > > > > (⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩, > > ⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩, > > ⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier > > letters or "signs" (⟨ъ⟩, ⟨ь⟩) > > > > and I ran 'make postgres-US.pdf', and then removed the Russian > > characters and ran the same command again. The output, including stderr > > was identical. The PDFs, of course, were not, with the Russian > > characters showing as "####". Makefile output attached. > > Hmm, mine complains: My Debian 12 toolchain must be older. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote: > On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote: > > On 15.10.24 23:51, Bruce Momjian wrote: > > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > > > > Bruce Momjian <bruce@momjian.us> writes: > > > > > Well, we can only use Latin-1, so the idea is that we will be explicit > > > > > about specifying Latin-1 only as HTML entities, rather than letting > > > > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > > > > > if desired. > > > > > > > > That policy would cause substantial problems with contributor names > > > > in the release notes. I agree with Peter that we don't need this. > > > > Catching otherwise-invisible characters seems sufficient. > > > > > > Uh, why can't we use HTML entities going forward? Is that harder? > > > > I think the question should be the other way around. The entities are a > > historical workaround for when encoding support and rendering support was > > poor. Now you can just type in the characters you want as is, which seems > > nicer. > > Yes, that does make sense, and if we fully supported Unicode, we could > ignore all of this. Patch applied to master --- no new UTF8 restrictions. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Hi Bruce,

> On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
>> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
>> > On 15.10.24 23:51, Bruce Momjian wrote:
>> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> > > > Bruce Momjian <bruce@momjian.us> writes:
>> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit
>> > > > > about specifying Latin-1 only as HTML entities, rather than letting
>> > > > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
>> > > > > if desired.
>> > > >
>> > > > That policy would cause substantial problems with contributor names
>> > > > in the release notes. I agree with Peter that we don't need this.
>> > > > Catching otherwise-invisible characters seems sufficient.
>> > >
>> > > Uh, why can't we use HTML entities going forward? Is that harder?
>> >
>> > I think the question should be the other way around. The entities are a
>> > historical workaround for when encoding support and rendering support was
>> > poor. Now you can just type in the characters you want as is, which seems
>> > nicer.
>>
>> Yes, that does make sense, and if we fully supported Unicode, we could
>> ignore all of this.
>
> Patch applied to master --- no new UTF8 restrictions.

I thought the conclusion of the discussion was to allow LATIN1 (or UTF-8 encoded LATIN1) characters in SGML files without converting them to HTML entities. Your patch seems to do the opposite.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Sat, Nov 2, 2024 at 07:27:00AM +0900, Tatsuo Ishii wrote: > > On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote: > >> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote: > >> > On 15.10.24 23:51, Bruce Momjian wrote: > >> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote: > >> > > > Bruce Momjian <bruce@momjian.us> writes: > >> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit > >> > > > > about specifying Latin-1 only as HTML entities, rather than letting > >> > > > > non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files > >> > > > > if desired. > >> > > > > >> > > > That policy would cause substantial problems with contributor names > >> > > > in the release notes. I agree with Peter that we don't need this. > >> > > > Catching otherwise-invisible characters seems sufficient. > >> > > > >> > > Uh, why can't we use HTML entities going forward? Is that harder? > >> > > >> > I think the question should be the other way around. The entities are a > >> > historical workaround for when encoding support and rendering support was > >> > poor. Now you can just type in the characters you want as is, which seems > >> > nicer. > >> > >> Yes, that does make sense, and if we fully supported Unicode, we could > >> ignore all of this. > > > > Patch applied to master --- no new UTF8 restrictions. > > I thought the conclusion of the discussion was allowing to use LATIN1 > (or UTF-8 encoded LATIN1) characters in SGML files without converting > them to HTML entities. Your patch seems to do opposite. > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5 Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the LATIN1 characters we had with HTML entities, so there are none currently. 
I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs so I added a cron job on my server to alert me when non-ASCII characters appear. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
> Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> LATIN1 characters we had with HTML entities, so there are none
> currently.
>
> I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> so I added a cron job on my server to alert me when non-ASCII characters
> appear.

So you converted the LATIN1 characters to HTML entities so that it's easier to detect non-LATIN1 characters in the SGML docs? If my understanding is correct, the same can be achieved with a tool like:

    iconv -t ISO-8859-1 -f UTF-8 release-17.sgml

If there are any non-LATIN1 characters in release-17.sgml, it will complain like:

    iconv: illegal input sequence at position 175

An advantage of this is that we don't need to convert each LATIN1 character to an HTML entity, which makes the SGML file authors' lives a little easier.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
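A minimal sketch of the iconv check suggested above, wrapped so that only the exit status is used. The helper name `is_latin1_clean` and the two sample files are made up for illustration, and the sketch assumes an iconv that rejects unconvertible input with a non-zero exit status (as the GNU libc iconv does); it is not part of any patch in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if every character in the UTF-8 file
# "$1" can be represented in ISO-8859-1 (Latin-1).
is_latin1_clean() {
    iconv -f UTF-8 -t ISO-8859-1 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)
# "café": é is U+00E9 (UTF-8 bytes C3 A9), which exists in Latin-1.
printf 'caf\303\251\n' > "$tmpdir/latin1.sgml"
# "б": Cyrillic U+0431 (UTF-8 bytes D0 B1), which has no Latin-1 mapping.
printf '\320\261\n' > "$tmpdir/cyrillic.sgml"

for f in "$tmpdir/latin1.sgml" "$tmpdir/cyrillic.sgml"; do
    if is_latin1_clean "$f"; then
        echo "ok: ${f##*/}"
    else
        echo "non-Latin-1 characters: ${f##*/}"
    fi
done
```

With such an iconv, the first file should be reported as ok and the second as containing non-Latin-1 characters; the converted text itself is discarded because only the pass/fail result matters here.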
On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
> > Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> > LATIN1 characters we had with HTML entities, so there are none
> > currently.
> >
> > I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> > so I added a cron job on my server to alert me when non-ASCII characters
> > appear.
>
> So you convert LATIN1 characters to HTML entities so that it's easier
> to detect non-LATIN1 characters is in the SGML docs? If my
> understanding is correct, it can be also achieved by using some tools
> like:
>
>     iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
>
> If there are some non-LATIN1 characters in release-17.sgml,
> it will complain like:
>
>     iconv: illegal input sequence at position 175
>
> An advantage of this is, we don't need to covert each LATIN1
> characters to HTML entities and make the sgml file authors life a
> little bit easier.

I might have misread the feedback. I know people didn't want a Makefile rule to prevent it, but I thought converting the few UTF8 characters we had was acceptable. Let me think some more and come up with a patch.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On 02.11.24 14:18, Bruce Momjian wrote: > On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote: >>> Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the >>> LATIN1 characters we had with HTML entities, so there are none >>> currently. >>> >>> I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs >>> so I added a cron job on my server to alert me when non-ASCII characters >>> appear. >> >> So you convert LATIN1 characters to HTML entities so that it's easier >> to detect non-LATIN1 characters is in the SGML docs? If my >> understanding is correct, it can be also achieved by using some tools >> like: >> >> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml >> >> If there are some non-LATIN1 characters in release-17.sgml, >> it will complain like: >> >> iconv: illegal input sequence at position 175 >> >> An advantage of this is, we don't need to covert each LATIN1 >> characters to HTML entities and make the sgml file authors life a >> little bit easier. > > I might have misread the feedback. I know people didn't want a Makfile > rule to prevent it, but I though converting few UTF8's we had was > acceptable. Let me think some more and come up with a patch. The question of encoding characters as entities is orthogonal to the issue of only allowing Unicode characters that have a mapping to Latin 1. This patch seems to confuse these two issues, and I don't think it actually fixed the second one, which is the one that was complained about. I don't think anyone actually complained about the first one, which is the one that was actually patched. I think the iconv approach is an idea worth checking out. It's also not necessarily true that the set of characters provided by the built-in PDF fonts is exactly the set of characters in Latin 1. It appears to be close enough, but I'm not sure, and I haven't found any authoritative information on that. 
Another approach for a fix would be to get FOP produce the required warnings or errors more reliably. I know it has a bunch of logging settings (ultimately via log4j), so there might be some possibilities.
On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
> On Tue, 5 Nov 2024 10:08:17 +0100
> Peter Eisentraut <peter@eisentraut.org> wrote:
>
> > >> So you convert LATIN1 characters to HTML entities so that it's easier
> > >> to detect non-LATIN1 characters is in the SGML docs? If my
> > >> understanding is correct, it can be also achieved by using some tools
> > >> like:
> > >>
> > >>     iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> > >>
> > >> If there are some non-LATIN1 characters in release-17.sgml,
> > >> it will complain like:
> > >>
> > >>     iconv: illegal input sequence at position 175
> > >>
> > >> An advantage of this is, we don't need to covert each LATIN1
> > >> characters to HTML entities and make the sgml file authors life a
> > >> little bit easier.
>
> > I think the iconv approach is an idea worth checking out.
> >
> > It's also not necessarily true that the set of characters provided by
> > the built-in PDF fonts is exactly the set of characters in Latin 1. It
> > appears to be close enough, but I'm not sure, and I haven't found any
> > authoritative information on that.
>
> I found a description in FAQ on Apache FOP [1] that explains some glyphs for
> Latin1 character set are not contained in the standard text fonts.
>
>   The standard text fonts supplied with Acrobat Reader have mostly glyphs for
>   characters from the ISO Latin 1 character set. For a variety of reasons, even
>   those are not completely guaranteed to work, for example you can't use the fi
>   ligature from the standard serif font.

So, I think the failure of ligatures is usually caused by not using the right Adobe Font Metric (AFM) file. I have seen faulty ligature rendering in PDFs but was always able to fix it by using the right AFM file. Odds are, the failure is caused by using a standard Latin1 AFM file and not the AFM file that matches the font being used.
> [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
>
> However, it seems that using iconv to detect non-Latin1 characters may be still
> useful because these are likely not displayed in PDF. For example, we can do this
> in make check as the attached patch 0002. It cannot show the filname where one
> is found, though.

I was thinking something like:

    grep -l --recursive -P '[\x80-\xFF]' . |
    while read FILE
    do  iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
    done

This only checks files with non-ASCII characters.

> > Another approach for a fix would be
> > to get FOP produce the required warnings or errors more reliably. I
> > know it has a bunch of logging settings (ultimately via log4j), so there
> > might be some possibilities.
>
> When a character that cannot be displayed in PDF is found, a warning
> "Glyph ... not available in font ...." is output in fop's log. We can
> prevent such characters from being contained in PDF by checking
> the message as the attached patch 0001. However, this is checked after
> the pdf is generated since I could not have an idea how to terminate the
> generation immediately when such character is detected.

So, are we sure this will be the message even for non-English users? I thought checking for warning message text was too fragile.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Tue, Nov 19, 2024 at 11:29:07AM +0900, Yugo NAGATA wrote:
> On Mon, 18 Nov 2024 16:04:20 -0500
> > So, the failure of ligatures is caused usually by not using the right
> > Adobe Font Metric (AFM) file, I think. I have seen faulty ligature
> > rendering in PDFs but was alway able to fix it by using the right AFM
> > file. Odds are, failure is caused by using a standard Latin1 AFM file
> > and not the AFM file that matches the font being used.
> >
> > > [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
> > >
> > > However, it seems that using iconv to detect non-Latin1 characters may be still
> > > useful because these are likely not displayed in PDF. For example, we can do this
> > > in make check as the attached patch 0002. It cannot show the filname where one
> > > is found, though.
> >
> > I was thinking something like:
> >
> >     grep -l --recursive -P '[\x80-\xFF]' . |
> >     while read FILE
> >     do  iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
> >     done
> >
> > This only checks files with non-ASCII characters.
>
> Checking non-latin1 after non-ASCII characters seems good idea.
> I attached a updated patch (0002) that uses perl instead of grep
> because non-GNU grep could not have escape sequences for hex.

Yes, good point.

> > So, are we sure this will be the message even for non-English users? I
> > thought checking for warning message text was too fragile.
>
> I am not sure whether fop has messages in non-English, although I've never
> seen Japanese messages output.
>
> I wonder we can get unified results if executed with LANG=C.
> The updated patch 0001 is fixed in this direction.

Yes, good idea.

> +    @ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++;END {exit($$n>0)}' \

I am thinking we should have -f before -t because it is from/to. I like this approach.
-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
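As a sketch of the two-step check discussed above (look for non-ASCII bytes first, then ask iconv whether they map to Latin-1), the following uses only `tr`, `wc`, and `iconv`, so it does not depend on GNU `grep -P`, and it prints the name of each offending file. The function names and sample files are hypothetical, the behaviour of iconv on unconvertible input is assumed to be a non-zero exit status, and this is not the perl rule from the attached patch.

```shell
#!/bin/sh
# Count the bytes outside the ASCII range (0x80-0xFF) in file "$1".
# LC_ALL=C makes tr operate on single bytes.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print the names of the given files that contain non-ASCII bytes which
# iconv cannot convert from UTF-8 to ISO-8859-1. Returns 1 if any is found.
check_latin1() {
    bad=0
    for f in "$@"; do
        # Pure-ASCII files are skipped without running iconv.
        [ "$(count_non_ascii "$f")" -gt 0 ] || continue
        if ! iconv -f UTF-8 -t ISO-8859-1 "$f" >/dev/null 2>&1; then
            echo "$f"
            bad=1
        fi
    done
    return "$bad"
}

docs=$(mktemp -d)
printf 'plain ASCII\n' > "$docs/ascii.sgml"
printf 'caf\303\251\n' > "$docs/latin1.sgml"     # é, U+00E9
printf '\320\261\n'    > "$docs/cyrillic.sgml"   # б, U+0431
check_latin1 "$docs"/*.sgml || echo "non-Latin-1 characters found in the files listed above"
```

Putting the cheap byte count in front of iconv mirrors the intent of the grep pre-filter: most SGML files are pure ASCII and never need the conversion attempt.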
I have looked into the patches.

> Subject: [PATCH v3 1/3] Disallow characters that cannot be displayed in PDF
>
> ---
>  doc/src/sgml/Makefile | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
> index a04c532b53..18bf87d031 100644
> --- a/doc/src/sgml/Makefile
> +++ b/doc/src/sgml/Makefile
> @@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
>      $(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
>
>  %.pdf: %.fo $(ALL_IMAGES)
> -    $(FOP) -fo $< -pdf $@
> +    CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \

Shouldn't "CLANG" be "LANG"?

> +    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
> +    (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)

Currently "make postgres*.pdf" generates the PDF file even if there's a "not available in font" warning while generating it. With the patch, the PDF file is removed in this case. I'm not sure this is an improvement, because there would be no way to generate the PDF file when such a warning occurs. Printing "Found characters that cannot be displayed in PDF" is good, but I'd prefer to let users decide whether to keep or remove the PDF file.

> Subject: [PATCH v3 3/3] Check whether iconv exists for detecting non-latin1
> characters
>
> ---
>  configure | 65 ++++++++++++++++++++++++++++++++++++++----
>  configure.ac | 1 +
>  doc/src/sgml/Makefile | 6 +++-
>  src/Makefile.global.in | 1 +

You don't need to include the patch for configure. The committer will generate configure when the patch gets committed. See the discussion:

https://www.postgresql.org/message-id/20241126.102906.1020285543012274306.ishii%40postgresql.org

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Tue, Nov 26, 2024 at 06:25:13PM +0900, Tatsuo Ishii wrote:
> I have looked into the patches.
> >  %.pdf: %.fo $(ALL_IMAGES)
> > -    $(FOP) -fo $< -pdf $@
> > +    CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
>
> Shouldn't "CLANG" be "LANG"?

Yes, probably.

> > +    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
> > +    (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
>
> Currently "make postgres*.pdf" generates the pdf file even if there's
> a "not available in font" error while generating it. With the patch
> the pdf file is removed in this case. I'm not sure if this is an
> improvement because there's no way to generate such a pdf file if
> there's such a warning. Printing "Found characters that cannot be
> displayed in PDF" is good, but I'd prefer let users decide whether
> they retain or remove the pdf file.

Looking at the patch:

     %.pdf: %.fo $(ALL_IMAGES)
    -    $(FOP) -fo $< -pdf $@
    +    CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
    +    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
    +    (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)

it returns an error if it sees a "not available in font" message, and since src/Makefile.global has .DELETE_ON_ERROR, and this is included in doc/src/sgml/Makefile, the file is deleted on the awk 'exit' error. If there are invalid characters in the PDF, shouldn't the PDF be considered invalid and removed from the build?

To allow such builds to keep those PDF files, we would probably need to override .DELETE_ON_ERROR, but it would have to be done in a way that an error exit from FOP would still remove the PDF file. I think we would have to have FOP write to a temporary file, and then override .DELETE_ON_ERROR just for the check for the "not available in font" text in the temporary file. Do we want to add this complexity?
-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
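To make the temporary-file idea above concrete, here is a rough shell sketch of the intended behaviour, written outside of make: a failing generator removes the output, while a "not available in font" message only prints a notice and keeps the file. `fake_fop` is a hypothetical stand-in for the real FOP invocation and `build_pdf` is an invented name; this is an illustration under those assumptions, not the Makefile rule that was later committed.

```shell
#!/bin/sh
# Stand-in for "fop -fo in.fo -pdf out.pdf" (hypothetical): writes the output
# file named by $1, logs one warning on stderr, and exits with status $2.
fake_fop() {
    echo "pdf data" > "$1"
    echo '[WARN] FOUserAgent - Glyph "?" (0x431) not available in font "Times-Roman".' >&2
    return "$2"
}

# Build $1 through a temporary output file and a temporary log.
# $2 is the exit status the stand-in generator should simulate.
build_pdf() {
    out=$1
    log=$(mktemp)
    if fake_fop "$out.tmp" "$2" 2>"$log"; then
        mv "$out.tmp" "$out"          # success: move the result into place
        pdf_rc=0
    else
        pdf_rc=$?                     # failure: keep the generator's status
        rm -f "$out.tmp" "$out"       # and do not leave a broken PDF behind
    fi
    cat "$log" >&2                    # always show the generator's messages
    if grep -q 'not available in font' "$log"; then
        echo "Found characters that cannot be displayed in the PDF document" >&2
    fi
    rm -f "$log"
    return "$pdf_rc"
}

workdir=$(mktemp -d)
build_pdf "$workdir/postgres.pdf" 0 && echo "kept: postgres.pdf"
```

The point of the design is that the warning scan never influences the exit status, so the decision to keep or discard a PDF with missing glyphs stays with the person running the build.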
Bruce Momjian <bruce@momjian.us> writes: > Do we want to add this complexity? I don't think this patch is doing anything I want at all. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote: >> I don't think this patch is doing anything I want at all. > Gee, I kind of liked the patch, but maybe you didn't like the additional > complexity to check the PDF output twice, once on input (complex) and > once on output. The attached patch only does the output check. It's still not doing anything I want at all. I'm with Tatsuo on this: I do not want the makefiles deciding for me which warnings are acceptable. regards, tom lane
On Tue, Nov 26, 2024 at 02:04:15PM -0500, Bruce Momjian wrote: > On Tue, Nov 26, 2024 at 12:41:37PM -0500, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote: > > >> I don't think this patch is doing anything I want at all. > > > > > Gee, I kind of liked the patch, but maybe you didn't like the additional > > > complexity to check the PDF output twice, once on input (complex) and > > > once on output. The attached patch only does the output check. > > > > It's still not doing anything I want at all. I'm with Tatsuo > > on this: I do not want the makefiles deciding for me which > > warnings are acceptable. > > Okay, how about the attached patch that just prints the message at the > bottom, with no error. We could do this for all warnings, but I think > there are some we expect. Patch applied. I added a mention of README.non-ASCII. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Tue, Nov 5, 2024 at 10:08:17AM +0100, Peter Eisentraut wrote: > On 02.11.24 14:18, Bruce Momjian wrote: > > On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote: > > > > Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the > > > > LATIN1 characters we had with HTML entities, so there are none > > > > currently. > > > > > > > > I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs > > > > so I added a cron job on my server to alert me when non-ASCII characters > > > > appear. > > > > > > So you convert LATIN1 characters to HTML entities so that it's easier > > > to detect non-LATIN1 characters is in the SGML docs? If my > > > understanding is correct, it can be also achieved by using some tools > > > like: > > > > > > iconv -t ISO-8859-1 -f UTF-8 release-17.sgml > > > > > > If there are some non-LATIN1 characters in release-17.sgml, > > > it will complain like: > > > > > > iconv: illegal input sequence at position 175 > > > > > > An advantage of this is, we don't need to covert each LATIN1 > > > characters to HTML entities and make the sgml file authors life a > > > little bit easier. > > > > I might have misread the feedback. I know people didn't want a Makfile > > rule to prevent it, but I though converting few UTF8's we had was > > acceptable. Let me think some more and come up with a patch. > > The question of encoding characters as entities is orthogonal to the issue > of only allowing Unicode characters that have a mapping to Latin 1. This > patch seems to confuse these two issues, and I don't think it actually fixed > the second one, which is the one that was complained about. I don't think > anyone actually complained about the first one, which is the one that was > actually patched. Now that we have a warning about non-emittable characters in the PDF build, do you want me to put back the Latin1 characters in the SGML files or leave them as HTML entities? 
-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes: > Now that we have a warning about non-emittable characters in the PDF > build, do you want me to put back the Latin1 characters in the SGML > files or leave them as HTML entities? I think going forward we're going to be putting in people's names in UTF8 --- I was certainly planning to start doing that. It doesn't matter that much what we do with existing cases, though. regards, tom lane
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Now that we have a warning about non-emittable characters in the PDF > > build, do you want me to put back the Latin1 characters in the SGML > > files or leave them as HTML entities? > > I think going forward we're going to be putting in people's names > in UTF8 --- I was certainly planning to start doing that. It doesn't Yes, I expected that, and added an item to my release checklist to make a PDF file and check for the warning. I don't normally do that. > matter that much what we do with existing cases, though. Okay, I think Peter had an opinion but I wasn't sure what it was. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On 03.12.24 04:13, Bruce Momjian wrote: > On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> Now that we have a warning about non-emittable characters in the PDF >>> build, do you want me to put back the Latin1 characters in the SGML >>> files or leave them as HTML entities? >> >> I think going forward we're going to be putting in people's names >> in UTF8 --- I was certainly planning to start doing that. It doesn't > > Yes, I expected that, and added an item to my release checklist to make > a PDF file and check for the warning. I don't normally do that. > >> matter that much what we do with existing cases, though. > > Okay, I think Peter had an opinion but I wasn't sure what it was. I would prefer that the parts of commit 641a5b7a144 that replace non-ASCII characters with entities are reverted.
On 26.11.24 20:04, Bruce Momjian wrote:
>  %.pdf: %.fo $(ALL_IMAGES)
> -    $(FOP) -fo $< -pdf $@
> +    LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
> +    awk 'BEGIN { warn = 0 } { print }/not available in font/ { warn = 1 } \
> +    END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2

Wouldn't that lose the exit code from the fop execution?
On Tue, Dec 3, 2024 at 09:05:45PM +0100, Peter Eisentraut wrote: > On 26.11.24 20:04, Bruce Momjian wrote: > > %.pdf: %.fo $(ALL_IMAGES) > > - $(FOP) -fo $< -pdf $@ > > + LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \ > > + awk 'BEGIN { warn = 0 } { print }/not available in font/ { warn = 1 } \ > > + END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2 > > Wouldn't that lose the exit code from the fop execution? Yikes, I think it would. Let me work on a fix now. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
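The concern about the exit code can be reproduced in a few lines. In a plain shell pipeline the reported status is that of the last command (awk here), so a failure on the left side goes unnoticed. The sketch below shows that, plus one portable workaround that records the producer's status in a file; `failing_tool` is a hypothetical stand-in for a failing fop run, and the workaround is only one possible approach, not necessarily the fix that was applied.

```shell
#!/bin/sh
# Stand-in for a failing fop run (hypothetical): one log line, exit status 2.
failing_tool() {
    echo "some log output" >&2
    return 2
}

# Plain pipeline: $? is awk's status, so the failure on the left is hidden.
failing_tool 2>&1 | awk '{ print }' >/dev/null
plain_status=$?

# Workaround: write the producer's status to a file from inside the left side
# of the pipe, then read it back once the pipeline has finished.
status_file=$(mktemp)
{ rc=0; failing_tool 2>&1 || rc=$?; echo "$rc" > "$status_file"; } |
    awk '{ print }' >/dev/null
real_status=$(cat "$status_file")
rm -f "$status_file"

echo "plain pipeline: $plain_status, recovered: $real_status"
```

Shells that support `set -o pipefail` offer a shorter route, but whether the shell used by make provides it depends on the platform, which is why the sketch sticks to a status file.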
On Tue, Dec 3, 2024 at 09:03:37PM +0100, Peter Eisentraut wrote: > On 03.12.24 04:13, Bruce Momjian wrote: > > On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > Now that we have a warning about non-emittable characters in the PDF > > > > build, do you want me to put back the Latin1 characters in the SGML > > > > files or leave them as HTML entities? > > > > > > I think going forward we're going to be putting in people's names > > > in UTF8 --- I was certainly planning to start doing that. It doesn't > > > > Yes, I expected that, and added an item to my release checklist to make > > a PDF file and check for the warning. I don't normally do that. > > > > > matter that much what we do with existing cases, though. > > > > Okay, I think Peter had an opinion but I wasn't sure what it was. > > I would prefer that the parts of commit 641a5b7a144 that replace non-ASCII > characters with entities are reverted. Done. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.