Thread: func.sgml

func.sgml

From: Andrew Dunstan
Date: 04 October 2021, 14:33:36

At
<https://www.postgresql.org/message-id/543620.1629899413%40sss.pgh.pa.us>
Tom noted:

> You have to be very careful these days when applying stale patches to
> func.sgml --- there's enough duplicate boilerplate that "patch" can easily
> be fooled into dumping an addition into the wrong place. 

This is yet another indication to me that there's probably a good case
for breaking func.sgml up into sections. It is by a very large margin
the biggest file in our document sources (the next largest is less than
half the number of lines).


thoughts?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: func.sgml

From: Tom Lane
Date: 04 October 2021, 14:52:13

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom noted:
>> You have to be very careful these days when applying stale patches to
>> func.sgml --- there's enough duplicate boilerplate that "patch" can easily
>> be fooled into dumping an addition into the wrong place.

> This is yet another indication to me that there's probably a good case
> for breaking func.sgml up into sections. It is by a very large margin
> the biggest file in our document sources (the next largest is less than
> half the number of lines).

What are you envisioning ... a file per <sect1>, or something else?

I'm not sure that a split-up would really fix the problem I mentioned;
but at least it'd reduce the scope for things to go into *completely*
the wrong place.

I think to make things safer for "patch", we'd have to give up a lot
of vertical space around function-table entries.  For example,
instead of

      <row>
       <entry role="func_table_entry"><para role="func_signature">
        <indexterm>
         <primary>num_nonnulls</primary>
        </indexterm>
        <function>num_nonnulls</function> ( <literal>VARIADIC</literal> <type>"any"</type> )
        <returnvalue>integer</returnvalue>
        ...
       </para></entry>
      </row>

maybe

      <row><entry role="func_table_entry"><para role="func_signature">
        <indexterm><primary>num_nonnulls</primary></indexterm>
        <function>num_nonnulls</function> ( <literal>VARIADIC</literal> <type>"any"</type> )
        <returnvalue>integer</returnvalue>
        ...
       </para></entry></row>

In this way, there'd be something at least a little bit unique within
the first couple of lines of an entry, so that the standard amount of
context in a diff would provide some genuine indication of where a
new entry is supposed to go.

The main problem with this formatting is that I'm not sure that
anybody's editors' SGML modes would be on board with it.
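
The point about context lines can be illustrated with a small Python sketch. It is only a simplified model, not what "patch" actually does: it assumes the usual 3 lines of unified-diff context, looks only at the lines that begin each entry, and uses shortened entries. The helper names (verbose_entry, compact_entry, ambiguous_starts) are made up for this illustration.

```python
from collections import Counter

CONTEXT = 3  # usual number of context lines in a unified diff


def verbose_entry(name):
    """One (shortened) function-table entry in the current, spread-out layout."""
    return [
        "<row>",
        '<entry role="func_table_entry"><para role="func_signature">',
        "<indexterm>",
        f"<primary>{name}</primary>",
        "</indexterm>",
        f"<function>{name}</function> ( ... )",
        "<returnvalue>integer</returnvalue>",
        "</para></entry>",
        "</row>",
    ]


def compact_entry(name):
    """The same entry in the proposed compact layout."""
    return [
        '<row><entry role="func_table_entry"><para role="func_signature">',
        f"<indexterm><primary>{name}</primary></indexterm>",
        f"<function>{name}</function> ( ... )",
        "<returnvalue>integer</returnvalue>",
        "</para></entry></row>",
    ]


def ambiguous_starts(layout, names):
    """Count entries whose first CONTEXT lines also occur elsewhere in the file."""
    lines = [line for name in names for line in layout(name)]
    # Tally every CONTEXT-line window in the whole file.
    windows = Counter(
        tuple(lines[i:i + CONTEXT]) for i in range(len(lines) - CONTEXT + 1)
    )
    size = len(layout(names[0]))  # every entry has the same length here
    starts = range(0, len(lines), size)
    return sum(1 for s in starts if windows[tuple(lines[s:s + CONTEXT])] > 1)


names = ["num_nonnulls", "num_nulls", "pg_typeof", "version"]
print(ambiguous_starts(verbose_entry, names))  # 4: every entry starts identically
print(ambiguous_starts(compact_entry, names))  # 0: the function name is in the context
```

In the spread-out layout the first three lines of every entry are pure boilerplate, so they cannot tell one entry from another. In the compact layout the second line already carries the function name.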

            regards, tom lane

Re: func.sgml

From: Dagfinn Ilmari Mannsåker
Date: 04 October 2021, 15:06:48

Andrew Dunstan <andrew@dunslane.net> writes:

> At
> <https://www.postgresql.org/message-id/543620.1629899413%40sss.pgh.pa.us>
> Tom noted:
>
>> You have to be very careful these days when applying stale patches to
>> func.sgml --- there's enough duplicate boilerplate that "patch" can easily
>> be fooled into dumping an addition into the wrong place. 
>
> This is yet another indication to me that there's probably a good case
> for breaking func.sgml up into sections. It is by a very large margin
> the biggest file in our document sources (the next largest is less than
> half the number of lines).
>
> thoughts?

It would make sense to follow a similar pattern to datatype.sgml and
break out the largest sections.  I whipped up a quick awk script to get
an idea of the sizes of the sections in the file:

$ awk '$1 == "<sect1" { start = NR; name = $2 }
       $1 == "</sect1>" { print NR-start, name }' \
   func.sgml | sort -rn
3076 id="functions-info">
2506 id="functions-admin">
2463 id="functions-json">
2352 id="functions-matching">
2028 id="functions-datetime">
1672 id="functions-string">
1466 id="functions-math">
1263 id="functions-geometry">
1252 id="functions-xml">
1220 id="functions-aggregate">
1165 id="functions-formatting">
1053 id="functions-textsearch">
1049 id="functions-range">
785 id="functions-binarystring">
625 id="functions-comparison">
591 id="functions-net">
552 id="functions-array">
357 id="functions-bitstring">
350 id="functions-comparisons">
348 id="functions-subquery">
327 id="functions-event-triggers">
284 id="functions-conditional">
283 id="functions-window">
282 id="functions-srf">
181 id="functions-sequence">
145 id="functions-logical">
134 id="functions-trigger">
120 id="functions-enum">
84 id="functions-statistics">
31 id="functions-uuid">

Tangentially, running the same on datatype.sgml indicates that the
datetime section might do with splitting out:

$ awk '$1 == "<sect1" { start = NR; name = $2 }
       $1 == "</sect1>" { print NR-start, name }' \
   datatype.sgml | sort -rn
1334 id="datatype-datetime">
701 id="datatype-numeric">
374 id="datatype-net-types">
367 id="datatype-oid">
320 id="datatype-geometric">
310 id="datatype-pseudo">
295 id="datatype-binary">
256 id="datatype-character">
245 id="datatype-textsearch">
197 id="datatype-xml">
160 id="datatype-enum">
119 id="datatype-boolean">
81 id="datatype-money">
74 id="datatype-bit">
51 id="domains">
49 id="datatype-uuid">
30 id="datatype-pg-lsn">

The existing split-out sections of datatype.sgml are:

$ wc -l json.sgml array.sgml rowtypes.sgml rangetypes.sgml | grep -v total | sort -rn
  1006 json.sgml
   797 array.sgml
   592 rangetypes.sgml
   540 rowtypes.sgml

The names are also rather inconsistent and vague, especially "json" and
"array". If we split the json section out of func.sgml, we might want to
rename these datatype-foo.sgml instead of foo(types).sgml, or go the
whole hog and create subdirectories and move all the sections into
separate files in them, like with reference.sgml.
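
As a rough sketch of what a file-per-<sect1> split could look like, the Python below cuts a chapter into one part per <sect1> and leaves an entity reference in its place. Everything here is an assumption for illustration: the naming scheme (section id plus ".sgml"), the helper names, and the idea that the parts would be pulled back in through system entities. It also assumes each <sect1 ...> and </sect1> tag sits on its own line, as the awk script above does.

```python
import re

# Matches the opening tag of a top-level section and captures its id.
SECT1_OPEN = re.compile(r'^\s*<sect1\s+id="([^"]+)"')


def split_sect1(lines):
    """Return (skeleton, parts).

    skeleton is the input with each <sect1> block replaced by an entity
    reference; parts maps a (hypothetical) file name to that block's lines.
    """
    skeleton, parts = [], {}
    current = None  # file name of the section being collected, if any
    for line in lines:
        match = SECT1_OPEN.match(line)
        if current is None and match:
            current = f"{match.group(1)}.sgml"  # hypothetical naming scheme
            parts[current] = []
        if current is None:
            skeleton.append(line)  # outside any <sect1>: stays in the main file
            continue
        parts[current].append(line)
        if line.strip() == "</sect1>":
            skeleton.append(f"&{current[:-len('.sgml')]};")
            current = None
    return skeleton, parts


def entity_declarations(parts):
    """Entity declarations that would make the references in skeleton resolve."""
    return [
        f'<!ENTITY {name[:-len(".sgml")]} SYSTEM "{name}">' for name in sorted(parts)
    ]


# Tiny made-up sample, not the real contents of func.sgml.
sample = [
    '<chapter id="functions">',
    ' <sect1 id="functions-logical">',
    '  <title>Logical Operators</title>',
    ' </sect1>',
    ' <sect1 id="functions-uuid">',
    '  <title>UUID Functions</title>',
    ' </sect1>',
    '</chapter>',
]
skeleton, parts = split_sect1(sample)
print("\n".join(skeleton))
print("\n".join(entity_declarations(parts)))
```

Whether the resulting files should be named after the section ids, follow a datatype-foo.sgml style pattern, or live in a subdirectory is exactly the naming question raised above; the sketch does not settle it.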

- ilmari

Re: func.sgml

From: Tatsuo Ishii
Date: 05 October 2021, 05:40:35

>> You have to be very careful these days when applying stale patches to
>> func.sgml --- there's enough duplicate boilerplate that "patch" can easily
>> be fooled into dumping an addition into the wrong place. 
> 
> This is yet another indication to me that there's probably a good case
> for breaking func.sgml up into sections. It is by a very large margin
> the biggest file in our document sources (the next largest is less than
> half the number of lines).

I welcome this for a different reason. I have been involved in a
translation (to Japanese) project for a long time. For this work we
use GitHub, and translations are submitted as pull requests. With
large SGML files (not only func.sgml, but also config.sgml, catalogs.sgml
and libpq.sgml), GitHub's UI cannot handle them correctly. Sometimes
it does not show certain lines, which makes the review process
significantly harder.  Because of this, we have to split those large
SGML files into smaller files, typically 4 to 5 segments for each large
SGML file.

Splitting those large SGML files upstream would greatly help us,
because we would no longer need to split them ourselves.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp