Thread: sgml cleanup: unescaped '>' characters

From: Josh Kupershmidt
I found myself rewriting the ./src/tools/find_gt_lt script in Perl
this evening, since the existing script was quite broken (the main
problem is it's not capable of understanding CDATA or sgml comment
sections, and hence produces a bunch of noise).

The rewritten version picked up a few stylistic inconsistencies in the
SGML, such as:
 * breaking the trailing '>' of an SGML marker across lines. AFAIK
this is legal, but is a bit inconsistent and just confuses simplistic
tools like find_gt_lt
 * using single quotes instead of double quotes to surround a node
attribute, as in <orderedlist numeration='loweralpha'>

as well as seemingly-invalid SGML, such as using '>' unescaped inside
normal SGML entries.

I've attached a patch to fix these problems. I can send in the new
version of find_gt_lt if these changes prove useful.
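
For readers without the attachment, here is a minimal sketch of the idea described above: blank out comments and CDATA sections first, then blank out anything that looks like a tag, and report whatever '>' is left. It is not the rewritten Perl script, which is not included in this thread; the function names and regular expressions are assumptions made for illustration, and the tag pattern is deliberately simplistic.

```python
import re

# Spans to ignore entirely: SGML comments and CDATA marked sections.
SKIP = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)
# Anything that looks like a tag or declaration: '<', then a letter,
# '/', '!' or '?', up to the next '>'. The negated class also matches
# newlines, so a tag whose closing '>' sits on the next line is covered.
TAG = re.compile(r"<[A-Za-z/!?][^<>]*>")


def blank(match):
    """Replace a matched span with spaces, keeping newlines so that
    line and column numbers of the remaining text stay unchanged."""
    return re.sub(r"[^\n]", " ", match.group(0))


def find_unescaped_gt(text):
    """Return 1-based (line, column) pairs for each '>' outside markup."""
    stripped = TAG.sub(blank, SKIP.sub(blank, text))
    hits = []
    for lineno, line in enumerate(stripped.split("\n"), start=1):
        for m in re.finditer(">", line):
            hits.append((lineno, m.start() + 1))
    return hits
```

Calling find_unescaped_gt() on a file's contents returns the positions to inspect; a '>' inside a comment, a CDATA section, or a tag split across lines is not reported.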

Josh

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On ons, 2011-08-24 at 23:28 -0400, Josh Kupershmidt wrote:
> I found myself rewriting the ./src/tools/find_gt_lt script in Perl
> this evening, since the existing script was quite broken (the main
> problem is it's not capable of understanding CDATA or sgml comment
> sections, and hence produces a bunch of noise).
>
> The rewritten version picked up a few stylistic inconsistencies in the
> SGML, such as:
>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
> this is legal, but is a bit inconsistent and just confuses simplistic
> tools like find_gt_lt

The cases you show don't appear to be terribly useful, but I think on
occasion this can be necessary to work around some arcane whitespace
rules in SGML or XML.  (Just look at the generated HTML; it uses this
technique throughout.)

>  * using single quotes instead of double quotes to surround a node
> attribute, as in <orderedlist numeration='loweralpha'>

It would be better if the tool could handle that, because sometimes you
want to use single quotes if the value contains double quotes.

> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> normal SGML entries.

Unescaped > is valid, AFAIK.



Re: sgml cleanup: unescaped '>' characters

From: Josh Kupershmidt
On Sat, Aug 27, 2011 at 3:48 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On ons, 2011-08-24 at 23:28 -0400, Josh Kupershmidt wrote:
>> I found myself rewriting the ./src/tools/find_gt_lt script in Perl
>> this evening, since the existing script was quite broken (the main
>> problem is it's not capable of understanding CDATA or sgml comment
>> sections, and hence produces a bunch of noise).
>>
>> The rewritten version picked up a few stylistic inconsistencies in the
>> SGML, such as:
>>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
>> this is legal, but is a bit inconsistent and just confuses simplistic
>> tools like find_gt_lt
>
> The cases you show don't appear to be terribly useful, but I think on
> occasion this can be necessary to work around some arcane whitespace
> rules in SGML or XML.  (Just look at the generated HTML; it uses this
> technique throughout.)

Hrm, well if the spurious whitespace isn't serving any purpose in
these cases, why not just fix it to match the rest of SGML style?

>>  * using single quotes instead of double quotes to surround a node
>> attribute, as in <orderedlist numeration='loweralpha'>
>
> It would be better if the tool could handle that, because sometimes you
> want to use single quotes if the value contains double quotes.

It's trivial to adjust the regex I was using to ignore such cases. I'm
just on about stylistic consistency here. If there's a reason to use
single quotes, such as when the value contains double quotes, then
that's fine -- but I don't think any of the cases I pointed out fall
under that category.
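
As an illustration of the kind of adjustment meant here (not the regex from the actual script, which is not shown in this thread), a pattern can accept either quote style and flag single quotes only when the value contains no double quote. The names ATTR, parse_attributes and needless_single_quotes are hypothetical.

```python
import re

# Attribute name, '=', then a value in double OR single quotes.
# Group 2 holds a double-quoted value, group 3 a single-quoted one.
ATTR = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_attributes(tag):
    """Return {name: value} for a tag, whichever quote style is used."""
    attrs = {}
    for m in ATTR.finditer(tag):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        attrs[m.group(1)] = value
    return attrs


def needless_single_quotes(tag):
    """Names of attributes that use single quotes although the value
    contains no double quote, i.e. purely stylistic inconsistencies."""
    return [
        m.group(1)
        for m in ATTR.finditer(tag)
        if m.group(3) is not None and '"' not in m.group(3)
    ]
```

With this, <orderedlist numeration='loweralpha'> is reported as a style issue, while a single-quoted value that itself contains double quotes is left alone.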

>> as well as seemingly-invalid SGML, such as using '>' unescaped inside
>> normal SGML entries.
>
> Unescaped > is valid, AFAIK.

Oh, that's interesting. I took a quick look at "The SGML FAQ book",
page 73 [1], which supports this claim.

But I notice we've been fixing such issues in the recent past (e.g.
commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
continue doing so? Not to mention the fact that we have
./src/tools/find_gt_lt, which while somewhat broken, has the
ostensible goal of finding such problems in the SGML. Or do we want to
stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
instead?

Josh

[1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On mån, 2011-08-29 at 18:22 -0500, Josh Kupershmidt wrote:
> >> The rewritten version picked up a few stylistic inconsistencies in the
> >> SGML, such as:
> >>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
> >> this is legal, but is a bit inconsistent and just confuses simplistic
> >> tools like find_gt_lt
> >
> > The cases you show don't appear to be terribly useful, but I think on
> > occasion this can be necessary to work around some arcane whitespace
> > rules in SGML or XML.  (Just look at the generated HTML; it uses this
> > technique throughout.)
>
> Hrm, well if the spurious whitespace isn't serving any purpose in
> these cases, why not just fix it to match the rest of SGML style?
>
> >>  * using single quotes instead of double quotes to surround a node
> >> attribute, as in <orderedlist numeration='loweralpha'>
> >
> > It would be better if the tool could handle that, because sometimes you
> > want to use single quotes if the value contains double quotes.
>
> It's trivial to adjust the regex I was using to ignore such cases. I'm
> just on about stylistic consistency here. If there's a reason to use
> single quotes, such as when the value contains double quotes, then
> that's fine -- but I don't think any of the cases I pointed out fall
> under that category.

I have committed your fixes relevant to these two points.

> >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> >> normal SGML entries.
> >
> > Unescaped > is valid, AFAIK.
>
> Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> page 73 [1], which supports this claim.
>
> But I notice we've been fixing such issues in the recent past (e.g.
> commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> continue doing so? Not to mention the fact that we have
> ./src/tools/find_gt_lt, which while somewhat broken, has the
> ostensible goal of finding such problems in the SGML. Or do we want to
> stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> instead?

> [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false

I don't know what the rationale for this tool is.  I have never used it.
Clearly, the reference shows, and the tools we use confirm, that it is
not necessary to use it.




Re: sgml cleanup: unescaped '>' characters

From: Bruce Momjian
Peter Eisentraut wrote:
> > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > >> normal SGML entries.
> > >
> > > Unescaped > is valid, AFAIK.
> >
> > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > page 73 [1], which supports this claim.
> >
> > But I notice we've been fixing such issues in the recent past (e.g.
> > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > continue doing so? Not to mention the fact that we have
> > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > ostensible goal of finding such problems in the SGML. Or do we want to
> > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > instead?
>
> > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
>
> I don't know what the rationale for this tool is.  I have never used it.
> Clearly, the reference shows, and the tools we use confirm, that it is
> not necessary to use it.

I have updated the scripts and instructions accordingly.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On tor, 2011-09-01 at 10:17 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > > >> normal SGML entries.
> > > >
> > > > Unescaped > is valid, AFAIK.
> > >
> > > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > > page 73 [1], which supports this claim.
> > >
> > > But I notice we've been fixing such issues in the recent past (e.g.
> > > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > > continue doing so? Not to mention the fact that we have
> > > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > > ostensible goal of finding such problems in the SGML. Or do we want to
> > > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > > instead?
> >
> > > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
> >
> > I don't know what the rationale for this tool is.  I have never used it.
> > Clearly, the reference shows, and the tools we use confirm, that it is
> > not necessary to use it.
>
> I have updated the scripts and instructions accordingly.

That still leaves open why we bother about escaping <.


Re: sgml cleanup: unescaped '>' characters

From: Bruce Momjian
Peter Eisentraut wrote:
> On tor, 2011-09-01 at 10:17 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > > > >> normal SGML entries.
> > > > >
> > > > > Unescaped > is valid, AFAIK.
> > > >
> > > > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > > > page 73 [1], which supports this claim.
> > > >
> > > > But I notice we've been fixing such issues in the recent past (e.g.
> > > > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > > > continue doing so? Not to mention the fact that we have
> > > > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > > > ostensible goal of finding such problems in the SGML. Or do we want to
> > > > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > > > instead?
> > >
> > > > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
> > >
> > > I don't know what the rationale for this tool is.  I have never used it.
> > > Clearly, the reference shows, and the tools we use confirm, that it is
> > > not necessary to use it.
> >
> > I have updated the scripts and instructions accordingly.
>
> That still leaves open why we bother about escaping <.

The problem is that I often add SGML that has:

    if (1 < 0) ...

I need something to warn me about those, especially in the release
notes.

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > That still leaves open why we bother about escaping <.
>
> The problem is that I often add SGML that has:
>
>         if (1 < 0) ...
>
> I need something to warn me about those, especially in the release
> notes.

Why do you need to be warned about that?



Re: sgml cleanup: unescaped '>' characters

From: Bruce Momjian
Peter Eisentraut wrote:
> On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > That still leaves open why we bother about escaping <.
> >
> > The problem is that I often add SGML that has:
> >
> >         if (1 < 0) ...
> >
> > I need something to warn me about those, especially in the release
> > notes.
>
> Why do you need to be warned about that?

If I have:

    if (1 < fred)

it will think "fred" is a SGML tag, no?

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On tor, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > That still leaves open why we bother about escaping <.
> > >
> > > The problem is that I often add SGML that has:
> > >
> > >         if (1 < 0) ...
> > >
> > > I need something to warn me about those, especially in the release
> > > notes.
> >
> > Why do you need to be warned about that?
>
> If I have:
>
>     if (1 < fred)
>
> it will think "fred" is a SGML tag, no?

No, a < followed by a space is not a tag, it's character data.  If it
thought it were a tag, it would complain.
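
The rule can be sketched roughly as follows. This is a simplified model for illustration, not the parser's actual logic: it assumes that '<' opens markup only when the next character is a letter (start tag), '/' plus a letter (end tag), '!' (declaration or comment) or '?' (processing instruction), and it ignores rarer forms such as empty tags. The names MARKUP_OPEN and classify_lt are hypothetical.

```python
import re

# Assumed simplification: '<' is markup only in these contexts.
MARKUP_OPEN = re.compile(r"<(?:[A-Za-z]|/[A-Za-z]|!|\?)")


def classify_lt(text):
    """Return (offset, kind) for each '<' in text, where kind is
    'markup' if the following character makes it a delimiter under
    the assumed rule, and 'data' otherwise."""
    result = []
    for m in re.finditer("<", text):
        kind = "markup" if MARKUP_OPEN.match(text, m.start()) else "data"
        result.append((m.start(), kind))
    return result
```

Under this model the '<' in "if (1 < fred)" is classified as data because a space follows it, whereas "1 <fred" would be read as the start of a tag named fred, and that is the case the parser itself would complain about.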


Re: sgml cleanup: unescaped '>' characters

From: Bruce Momjian
Peter Eisentraut wrote:
> On tor, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > That still leaves open why we bother about escaping <.
> > > >
> > > > The problem is that I often add SGML that has:
> > > >
> > > >         if (1 < 0) ...
> > > >
> > > > I need something to warn me about those, especially in the release
> > > > notes.
> > >
> > > Why do you need to be warned about that?
> >
> > If I have:
> >
> >     if (1 < fred)
> >
> > it will think "fred" is a SGML tag, no?
>
> No, a < followed by a space is not a tag, it's character data.  If it
> thought it were a tag, it would complain.

Sometimes it is '<' (in single quotes), which I thought would be a
problem.

Re: sgml cleanup: unescaped '>' characters

From: Peter Eisentraut
On lör, 2011-09-03 at 16:47 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > On tor, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > > Peter Eisentraut wrote:
> > > > On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > > That still leaves open why we bother about escaping <.
> > > > >
> > > > > The problem is that I often add SGML that has:
> > > > >
> > > > >         if (1 < 0) ...
> > > > >
> > > > > I need something to warn me about those, especially in the release
> > > > > notes.
> > > >
> > > > Why do you need to be warned about that?
> > >
> > > If I have:
> > >
> > >     if (1 < fred)
> > >
> > > it will think "fred" is a SGML tag, no?
> >
> > No, a < followed by a space is not a tag, it's character data.  If it
> > thought it were a tag, it would complain.
>
> Sometimes it is '<' (in single quotes), which I thought would be a
> problem.

The bottom line is, the SGML parser can figure that out itself, and if
it has a problem, it will complain.  We don't need to second guess it
with regular expressions that are handcrafted out of thin air.

I was hoping you would remember whether you initially put this in
because of some tool problem.  But if we are not finding any supporting
evidence, I would suggest that we just scrap this thing entirely.


Re: sgml cleanup: unescaped '>' characters

From: Bruce Momjian
Peter Eisentraut wrote:
> On lör, 2011-09-03 at 16:47 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > On tor, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > > > Peter Eisentraut wrote:
> > > > > On tor, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > > > That still leaves open why we bother about escaping <.
> > > > > >
> > > > > > The problem is that I often add SGML that has:
> > > > > >
> > > > > >         if (1 < 0) ...
> > > > > >
> > > > > > I need something to warn me about those, especially in the release
> > > > > > notes.
> > > > >
> > > > > Why do you need to be warned about that?
> > > >
> > > > If I have:
> > > >
> > > >     if (1 < fred)
> > > >
> > > > it will think "fred" is a SGML tag, no?
> > >
> > > No, a < followed by a space is not a tag, it's character data.  If it
> > > thought it were a tag, it would complain.
> >
> > Sometimes it is '<' (in single quotes), which I thought would be a
> > problem.
>
> The bottom line is, the SGML parser can figure that out itself, and if
> it has a problem, it will complain.  We don't need to second guess it
> with regular expressions that are handcrafted out of thin air.
>
> I was hoping you would remember whether you initially put this in
> because of some tool problem.  But if we are not finding any supporting
> evidence, I would suggest that we just scrap this thing entirely.

I put it in to warn about release.sgml markup problems, so I properly
escaped all non-tag '>' and '<' characters.

I have removed the tool.  We can always re-add it if we find it is
needed.
