Thread: Uh-oh: documentation PDF output no longer builds in HEAD

Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

08 November 2015, 21:34:24

$ cd doc/src/sgml
$ make postgres-US.pdf
... lots of crap later ...

[3253.0.51
! TeX capacity exceeded, sorry [number of strings=245828].
<to be read again>                   \endgroup \set@typeset@protect 
l.1879198 {1}}             \Node%
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on postgres-US.log.
make: *** [postgres-US.pdf] Error 1
rm postgres-US.tex-pdf

The A4-format PDF still builds, which implies that this has something to
do with the number of pages produced.  The 9.5 branch also still builds,
but it seems highly likely that it is within a few pages of failing as
well.  (HEAD is only about a dozen pages longer than 9.5 in A4 format,
and presumably the difference is around that for US format.)

We ran into a very similar issue back around 9.0, and solved it with an
ugly style-sheet hack, see thread here:
http://www.postgresql.org/message-id/flat/1270189232.5018.7.camel@hp-laptop2.gunduz.org

As noted then, and as I reconfirmed just now, you can *not* fix this by
hacking TeX's parameters: there is a hard-wired limit of 2^18 strings
regardless of what you try to set in texmf.cnf.

Thoughts?
        regards, tom lane

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

andres@anarazel.de (Andres Freund)

Date:

09 November 2015, 00:17:47

On 2015-11-08 13:34:18 -0500, Tom Lane wrote:
> $ cd doc/src/sgml
> $ make postgres-US.pdf
> ... lots of crap later ...
> ! TeX capacity exceeded, sorry [number of strings=245828].

> We ran into a very similar issue back around 9.0, and solved it with an
> ugly style-sheet hack, see thread here:
> http://www.postgresql.org/message-id/flat/1270189232.5018.7.camel@hp-laptop2.gunduz.org
>
> As noted then, and as I reconfirmed just now, you can *not* fix this by
> hacking TeX's parameters: there is a hard-wired limit of 2^18 strings
> regardless of what you try to set in texmf.cnf.

While it takes just short of forever, postgres-US.pdf seems to build on
my Debian unstable as of 8d7396e509 + some additional docs. Is this
dependent on what version of TeX you're using (plain tex, pdftex,
xetex, whatnot)?

postgres-US.log contains:
360764 strings out of 481710
2617927 string characters out of 6028023
857532 words of memory out of 5085000
252961 multiletter control sequences out of 15000+600000
101035 words of font info for 156 fonts, out of 8000000 for 9000
36 hyphenation exceptions out of 8191

So at least Debian's version of TeX seems to have worked around the
limit somehow. I found only one interesting looking setting in the
relevant config files:

%%
%% jacking up TeX settings for the unique uses of jadetex
%%
extra_mem_bot.jadetex = 85000
extra_mem_bot.pdfjadetex = 85000

Andres
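
The usage summary quoted from postgres-US.log can be checked mechanically rather than by eye. A minimal sketch, assuming the summary keeps the "N strings out of M" wording shown in the excerpt; the name `string_usage` and the sample text are illustrative only and are not part of the PostgreSQL build tooling.

```python
import re

# Matches the string-pool line of TeX's end-of-run summary, e.g.
# "360764 strings out of 481710". The "string characters" line does not match.
USAGE_RE = re.compile(r"^\s*(\d+) strings out of (\d+)\s*$", re.MULTILINE)


def string_usage(log_text):
    """Return (used, capacity) from a TeX log, or None if no summary is found."""
    match = USAGE_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Sample taken from the log excerpt quoted in this thread.
SAMPLE_LOG = """\
 360764 strings out of 481710
 2617927 string characters out of 6028023
"""

if __name__ == "__main__":
    used, capacity = string_usage(SAMPLE_LOG)
    print(f"{used} of {capacity} strings used ({used / capacity:.1%})")
```

Run against a real log, this would show how much headroom a given TeX installation has left before the "TeX capacity exceeded" error.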

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

09 November 2015, 00:30:13

andres@anarazel.de (Andres Freund) writes:
> While it takes just short of forever, postgres-US.pdf seems to build on
> my Debian unstable as of 8d7396e509 + some additional docs. Is this
> dependent on what version of TeX you're using (plain tex, pdftex,
> xetex, whatnot)?

> postgres-US.log contains:
> 360764 strings out of 481710

Interesting.  They must have boosted the strings limit from 2^18 to 2^19.
According to what I've read, this is doable when compiling TeX from
source, but we can hardly expect users (or packagers) to do that if their
distribution hasn't built it that way.  (The 2^18 limit I'm seeing is with
RHEL6's tex package.  I'm currently downloading the Fedora rawhide package
to see if it's any better, but man that is one large package...)

BTW, I realized after poking around that the hack I put in back in 9.0
probably only eliminates about 5000 strings from the pool, because
it should save one string per \pagelabel entry added to the .aux
file, and there are less than 5000 such entries after a successful
build.  So that was a good quick-n-dirty fix but it's really only
scratching the surface of the problem: there are ~240000 other strings
getting made somewhere.  I wonder if a better answer is possible.
        regards, tom lane
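
The two capacities quoted so far are consistent with the 2^18 versus 2^19 reading: each sits a little below the next power of two. The sketch below only does that arithmetic. That the reported capacity is the compiled-in maximum minus the strings already taken by the preloaded format is an assumption here, not something stated in the thread, and `implied_limit_exponent` is an illustrative name.

```python
def implied_limit_exponent(reported_capacity):
    """Smallest k such that 2**k >= reported_capacity."""
    k = 0
    while 2 ** k < reported_capacity:
        k += 1
    return k


# Capacities reported in the thread: RHEL6's TeX and Debian unstable's TeX.
for name, capacity in [("RHEL6", 245828), ("Debian unstable", 481710)]:
    k = implied_limit_exponent(capacity)
    print(f"{name}: {capacity} <= 2^{k} = {2 ** k}")
```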

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

andres@anarazel.de (Andres Freund)

Date:

09 November 2015, 00:55:11

On 2015-11-08 16:29:56 -0500, Tom Lane wrote:
> andres@anarazel.de (Andres Freund) writes:
> > While it takes just short of forever, postgres-US.pdf seems to build on
> > my Debian unstable as of 8d7396e509 + some additional docs. Is this
> > dependent on what version of TeX you're using (plain tex, pdftex,
> > xetex, whatnot)?
> 
> > postgres-US.log contains:
> > 360764 strings out of 481710
> 
> Interesting.  They must have boosted the strings limit from 2^18 to 2^19.
> According to what I've read, this is doable when compiling TeX from
> source, but we can hardly expect users (or packagers) to do that if their
> distribution hasn't built it that way.  (The 2^18 limit I'm seeing is with
> RHEL6's tex package.

Debian uses pdflatex from texlive-base (2015.20151016-1). Maybe that's
the relevant difference.

> I'm currently downloading the Fedora rawhide package
> to see if it's any better, but man that is one large package...)

It indeed is. I've not found any relevant patches in debian's
package. Lots of changing paths and defaults, but afaics nothing else.

> BTW, I realized after poking around that the hack I put in back in 9.0
> probably only eliminates about 5000 strings from the pool, because
> it should save one string per \pagelabel entry added to the .aux
> file, and there are less than 5000 such entries after a successful
> build.  So that was a good quick-n-dirty fix but it's really only
> scratching the surface of the problem: there are ~240000 other strings
> getting made somewhere.  I wonder if a better answer is possible.

Debian's pdfjadetex package has the following comment in
README.jadetex.cfg:

> * Not Labelling Elements
> 
> In some cases, it is possible for pdfjadetex to error out even with
> expanded texmf.cnf settings.  The sign of this is that jadetex is able
> to process the file, but pdfjadetex isn't.  The upstream maintainer,
> Sebastian Rahtz, had this to say:
> 
> | pdfjadetex _can_ go over a string limit in TeX
> | which *isn't* changeable in texmf.cnf. The workaround is to write a
> | file called jadetex.cfg, containing just the line
> |
> | \LabelElementsfalse
> |
> | and see if that helps. it stops jadetex from using up a string for
> | every element. If that leaves unsatisfied cross-references, try
> | "jadetex" instead of "pdfjadetex", and create your PDF in
> | another way (ie via Distiller).

Might be worthwhile to see whether \LabelElementsfalse makes a huge
difference.

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

09 November 2015, 01:24:30

andres@anarazel.de (Andres Freund) writes:
>> In some cases, it is possible for pdfjadetex to error out even with
>> expanded texmf.cnf settings.  The sign of this is that jadetex is able
>> to process the file, but pdfjadetex isn't.  The upstream maintainer,
>> Sebastian Rahtz, had this to say:
>> 
>> | pdfjadetex _can_ go over a string limit in TeX
>> | which *isn't* changeable in texmf.cnf. The workaround is to write a
>> | file called jadetex.cfg, containing just the line
>> |
>> | \LabelElementsfalse

Interesting.  That seems to be a slightly more aggressive version of my
9.0-era hack: it effectively turns FlowObjectSetup into a no-op that won't
generate page labels at all, saving *both* of the strings it would create,
not just one.  However, I'm afraid that's not gonna do: it looks like it
turns a large fraction of the index entries from page numbers into "??".
And some of the table-of-contents entries as well.  (It looks like maybe
only things with explicit id= entries get correct page number data with
this setting.  We could maybe live with that if the tool threw an error
about missing ids; but it doesn't, it just emits "??" ...)

Curiously though, that gets us down to this:
 30615 strings out of 245828
 397721 string characters out of 1810780

which implies that indeed FlowObjectSetup *is* the cause of most of
the strings being entered.  I'm not sure how that squares with the
observation that there are less than 5000 \pagelabel entries in the
postgres-US.aux file.  Time for more digging.
        regards, tom lane
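
The mismatch described in that last paragraph can be put in numbers. This is a back-of-the-envelope sketch using only figures quoted in this thread. The two-strings-per-label figure is taken from the remark that \LabelElementsfalse saves both strings FlowObjectSetup would create, and the results are lower bounds because the unmodified build ran out of strings instead of reporting a total.

```python
CAPACITY = 245828         # string capacity reported by the failing build
WITH_LABELS_OFF = 30615   # strings used with \LabelElementsfalse
MAX_PAGELABELS = 5000     # upper bound on \pagelabel entries in the .aux file
STRINGS_PER_LABEL = 2     # strings FlowObjectSetup is said to create per label

# The stock build overflowed, so it needed more than CAPACITY strings.
# This is therefore only a lower bound on what FlowObjectSetup accounts for.
saved_at_least = CAPACITY - WITH_LABELS_OFF

# The most that the page labels alone could explain.
explained_by_pagelabels = MAX_PAGELABELS * STRINGS_PER_LABEL

unexplained_at_least = saved_at_least - explained_by_pagelabels
print(saved_at_least, explained_by_pagelabels, unexplained_at_least)
```

On these numbers, more than 200000 strings are created by something in FlowObjectSetup other than the page labels themselves.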

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

10 November 2015, 03:46:46

I wrote:
> Curiously though, that gets us down to this:
>  30615 strings out of 245828
>  397721 string characters out of 1810780
> which implies that indeed FlowObjectSetup *is* the cause of most of
> the strings being entered.  I'm not sure how that squares with the
> observation that there are less than 5000 \pagelabel entries in the
> postgres-US.aux file.  Time for more digging.

Well, after much digging, I've found what seems a workable answer.
It turns out that the original form of FlowObjectSetup is just
unbelievably awful when it comes to handling of hyperlink anchors:
it will put a hyperlink anchor into the PDF for every "flow object",
that is, everything in the document that could possibly have a link
to it, whether or not it actually is linked to.  And aside from bloating
the PDF file, it turns out that the hyperlink stuff also consumes some
control sequence names, which is why we're running out of strings.

There already is logic (probably way older than the hyperlink code)
in jadetex to avoid generating page-number labels for objects that have
no cross-references.  So what I did to fix this was to piggyback on
that code: with the attached jadetex.cfg, both a page-number label
and a hyperlink anchor will be generated for all and only those flow
objects that have either a page-number reference or a hyperlink reference.
(We could try to separate those things, but then we'd need two control
sequence names not one per object for tracking purposes, and anyway many
objects will have both kinds of reference if they have either.)

This gets us down to ~135000 strings to build HEAD, and not incidentally,
the resulting PDF is about half the size it was before.  I think I've
also fixed a number of formerly unexplainable broken hyperlinks in the
PDF; some are still broken, but they were that way before.  (It looks
like <xref> with endterm doesn't work very well in jadetex; all the
remaining bad links seem to be associated with uses of that.)

Barring objection I'll commit this tomorrow.  I'm inclined to back-patch
it at least into 9.5, maybe further, because I'm afraid we may be closer
than we realized to exceeding the strings limit in the back branches too.

            regards, tom lane

% doc/src/sgml/jadetex.cfg
%
% This file redefines FlowObjectSetup and some related macros to greatly
% reduce the number of control sequence names created, and also to avoid
% creation of many useless hyperlink anchors in PDF files.
%
% The original coding of FlowObjectSetup defined a control sequence x@LABEL
% for pretty nearly every flow object in the file, whether that object was
% cross-referenced or not.  Worse yet, it created a hyperlink anchor for
% every such object, which not only bloated the output PDF with useless
% anchors but consumed additional control sequence names internally.
%
% To fix, extend PageLabel's already-existing mechanism whereby a p@LABEL
% control sequence is filled in only for labels that are referenced by at
% least one \Pageref call.  We now also fill in p@LABEL for labels that are
% referenced by a \Link.  Then, we can drop x@LABEL entirely, and use
% p@LABEL to control emission of both a hyperlink anchor and a \PageLabel.
% Now, both of those things are emitted for all and only the flow objects
% that have either a hyperlink reference or a page-number reference.
%
% (With a more invasive patch, we could track the need for an anchor and a
% page-number label separately, but that would probably require more control
% sequences than this way does.)
%
%
% In addition to checking p@LABEL not x@LABEL, this version of FlowObjectSetup
% is fixed to clear \Label and \Element whether or not it emits an anchor
% and page label.  Failure to do that seems to explain some pre-existing bugs
% in which certain SGML constructs weren't correctly cross-referenced.
%
\def\FlowObjectSetup#1{%
\ifDoFOBSet
  \ifLabelElements
     \ifx\Label\@empty\let\Label\Element\fi
  \fi
  \ifx\Label\@empty\else
      \expandafter\ifx\csname p@\Label\endcsname\relax
      \else
       \bgroup
         \ifNestedLink
         \else
           \hyper@anchorstart{\Label}\hyper@anchorend
           \PageLabel{\Label}%
         \fi
       \egroup
      \fi
      \let\Label\@empty
      \let\Element\@empty
  \fi
\fi
}
%
% Adjust PageLabel so that the p@NAME control sequence acquires a correct
% value immediately; this seems to be needed to avoid scenarios wherein
% additional TeX runs are needed to reach a stable state of the .aux file.
%
\def\PageLabel#1{%
  \@bsphack
  \expandafter\ifx\csname p@#1\endcsname\relax
  \else
  \protected@write\@auxout{}%
         {\string\pagelabel{#1}{\thepage}}%
  % Ensure the p@NAME control sequence acquires correct value immediately
  \expandafter\xdef\csname p@#1\endcsname{\thepage}%
  \fi
  \@esphack}
%
% In \Link, add code to emit an aux-file entry if the p@NAME sequence isn't
% defined.  Much as in @Setref, this ensures we'll process the referenced
% item correctly on the next TeX run.
%
\def\Link#1{%
  \begingroup
  \SetupICs{#1}%
  \ifx\Label\@empty\let\Label\Element\fi
%  \typeout{Made a Link at \the\inputlineno, to \Label}%
  \hyper@linkstart{\LinkType}{\Label}%
  \NestedLinktrue
  % If p@NAME control sequence isn't defined, emit dummy def to aux file
  % so it will get defined properly on next run, much as in @Setref
  \expandafter\ifx\csname p@\Label\endcsname\relax
    \immediate\write\@mainaux{\string\pagelabel{\Label}{qqq}}%
  \fi
}
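
The control flow in the attached macros can be hard to follow in TeX, so the sketch below models it in Python purely as an illustration. `run_tex`, the `(label, page)` tuples, and the dictionary standing in for the .aux file are all hypothetical, and real jadetex does considerably more. The model shows only the two properties the comments describe: anchors are emitted solely for labels that something references, and a label referenced for the first time is picked up on the following run through the dummy aux entry.

```python
def run_tex(objects, references, aux):
    """Model one TeX run.

    objects:    list of (label, page) for every flow object in the document
    references: labels targeted by some \\Link or \\Pageref
    aux:        dict label -> page left behind by the previous run
    Returns (labels that got an anchor, aux contents for the next run).
    """
    defined = set(aux)  # labels whose p@LABEL is defined when the run starts
    anchors, new_aux = [], {}

    # FlowObjectSetup: act only when p@LABEL is already defined.
    for label, page in objects:
        if label in defined:
            anchors.append(label)
            new_aux[label] = page  # PageLabel writes the real page number

    # Link: if p@LABEL is undefined, write a dummy entry for the next run.
    for label in references:
        if label not in defined:
            new_aux.setdefault(label, "qqq")

    return anchors, new_aux


if __name__ == "__main__":
    objects = [("intro", 1), ("install", 7), ("sql-select", 42)]
    references = ["sql-select"]
    aux = {}
    for run in (1, 2, 3):
        anchors, aux = run_tex(objects, references, aux)
        print(run, anchors, aux)
```

In this model the first run emits no anchors and records only the dummy entry, the second run emits an anchor for the one referenced object, and the third run reproduces the second, which corresponds to the .aux file reaching a stable state.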

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Andres Freund

Date:

10 November 2015, 03:53:08

On 2015-11-09 19:46:37 -0500, Tom Lane wrote:
> Well, after much digging, I've found what seems a workable answer.
> It turns out that the original form of FlowObjectSetup is just
> unbelievably awful [...].
>
> This gets us down to ~135000 strings to build HEAD, and not incidentally,
> the resulting PDF is about half the size it was before.  I think I've
> also fixed a number of formerly unexplainable broken hyperlinks in the
> PDF; some are still broken, but they were that way before.  (It looks
> like <xref> with endterm doesn't work very well in jadetex; all the
> remaining bad links seem to be associated with uses of that.)

Nice work. On an ugly subject.

> Barring objection I'll commit this tomorrow.  I'm inclined to back-patch
> it at least into 9.5, maybe further, because I'm afraid we may be closer
> than we realized to exceeding the strings limit in the back branches too.

+1 for doing this in 9.5+. I think we will probably want this in all
branches at some point. I don't have a strong opinion on whether we want
to let this mature in 9.5 or not.

Greetings,

Andres Freund

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Robert Haas

Date:

10 November 2015, 15:50:52

On Mon, Nov 9, 2015 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Curiously though, that gets us down to this:
>>  30615 strings out of 245828
>>  397721 string characters out of 1810780
>> which implies that indeed FlowObjectSetup *is* the cause of most of
>> the strings being entered.  I'm not sure how that squares with the
>> observation that there are less than 5000 \pagelabel entries in the
>> postgres-US.aux file.  Time for more digging.
>
> Well, after much digging, I've found what seems a workable answer.
> It turns out that the original form of FlowObjectSetup is just
> unbelievably awful when it comes to handling of hyperlink anchors:
> it will put a hyperlink anchor into the PDF for every "flow object",
> that is, everything in the document that could possibly have a link
> to it, whether or not it actually is linked to.  And aside from bloating
> the PDF file, it turns out that the hyperlink stuff also consumes some
> control sequence names, which is why we're running out of strings.
>
> There already is logic (probably way older than the hyperlink code)
> in jadetex to avoid generating page-number labels for objects that have
> no cross-references.  So what I did to fix this was to piggyback on
> that code: with the attached jadetex.cfg, both a page-number label
> and a hyperlink anchor will be generated for all and only those flow
> objects that have either a page-number reference or a hyperlink reference.
> (We could try to separate those things, but then we'd need two control
> sequence names not one per object for tracking purposes, and anyway many
> objects will have both kinds of reference if they have either.)
>
> This gets us down to ~135000 strings to build HEAD, and not incidentally,
> the resulting PDF is about half the size it was before.  I think I've
> also fixed a number of formerly unexplainable broken hyperlinks in the
> PDF; some are still broken, but they were that way before.  (It looks
> like <xref> with endterm doesn't work very well in jadetex; all the
> remaining bad links seem to be associated with uses of that.)
>
> Barring objection I'll commit this tomorrow.  I'm inclined to back-patch
> it at least into 9.5, maybe further, because I'm afraid we may be closer
> than we realized to exceeding the strings limit in the back branches too.

I am in awe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Magnus Hagander

Date:

10 November 2015, 19:59:08

On Tue, Nov 10, 2015 at 1:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>> Curiously though, that gets us down to this:
>> 30615 strings out of 245828
>> 397721 string characters out of 1810780
>> which implies that indeed FlowObjectSetup *is* the cause of most of
>> the strings being entered. I'm not sure how that squares with the
>> observation that there are less than 5000 \pagelabel entries in the
>> postgres-US.aux file. Time for more digging.
>
> Well, after much digging, I've found what seems a workable answer.
> It turns out that the original form of FlowObjectSetup is just
> unbelievably awful when it comes to handling of hyperlink anchors:
> it will put a hyperlink anchor into the PDF for every "flow object",
> that is, everything in the document that could possibly have a link
> to it, whether or not it actually is linked to. And aside from bloating
> the PDF file, it turns out that the hyperlink stuff also consumes some
> control sequence names, which is why we're running out of strings.
>
> There already is logic (probably way older than the hyperlink code)
> in jadetex to avoid generating page-number labels for objects that have
> no cross-references. So what I did to fix this was to piggyback on
> that code: with the attached jadetex.cfg, both a page-number label
> and a hyperlink anchor will be generated for all and only those flow
> objects that have either a page-number reference or a hyperlink reference.
> (We could try to separate those things, but then we'd need two control
> sequence names not one per object for tracking purposes, and anyway many
> objects will have both kinds of reference if they have either.)
>
> This gets us down to ~135000 strings to build HEAD, and not incidentally,
> the resulting PDF is about half the size it was before. I think I've
> also fixed a number of formerly unexplainable broken hyperlinks in the
> PDF; some are still broken, but they were that way before. (It looks
> like <xref> with endterm doesn't work very well in jadetex; all the
> remaining bad links seem to be associated with uses of that.)
>
> Barring objection I'll commit this tomorrow. I'm inclined to back-patch
> it at least into 9.5, maybe further, because I'm afraid we may be closer
> than we realized to exceeding the strings limit in the back branches too.

Impressive, indeed.

When you say it's half the size - is that half the size of the preprocessed PDF or is it also after the stuff we do on the website PDFs using jpdftweak? IIRC that tweak is only there to deal with the size, and specifically it deals with "bookmarks" which sounds a lot like this...

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

10 November 2015, 20:15:50

Magnus Hagander <magnus@hagander.net> writes:
> When you say it's half the size - is that half the size of the preprocessed
> PDF or is it also after the stuff we do on the website PDFs using
> jpdftweak? IIRC that tweak is only there to deal with the size, and
> specifically it deals with "bookmarks" which sounds a lot like this...

I'm just looking at the size of the file produced by "make postgres-A4.pdf".

I don't know anything about jpdftweak, but if it's being used to get rid
of unreferenced hyperlink anchors, maybe we could dispense with that step
after this goes in.
        regards, tom lane

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Magnus Hagander

Date:

10 November 2015, 21:01:55

On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Magnus Hagander <magnus@hagander.net> writes:
>> When you say it's half the size - is that half the size of the preprocessed
>> PDF or is it also after the stuff we do on the website PDFs using
>> jpdftweak? IIRC that tweak is only there to deal with the size, and
>> specifically it deals with "bookmarks" which sounds a lot like this...
>
> I'm just looking at the size of the file produced by "make postgres-A4.pdf".
>
> I don't know anything about jpdftweak, but if it's being used to get rid
> of unreferenced hyperlink anchors, maybe we could dispense with that step
> after this goes in.

Yeah, that's what I was hoping. You can see how it's used in the
mk-release-pdfs script on borka. It doesn't explain why we're doing it,
that's probably in the list archives somewhere, but it does show what we
do. So it should be easy enough to see if the benefit goes away :) (it
would also be nice for build times - that jpdftweak step is very very slow)

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Alvaro Herrera

Date:

10 November 2015, 21:22:17

Magnus Hagander wrote:
> On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > Magnus Hagander <magnus@hagander.net> writes:
> > > When you say it's half the size - is that half the size of the
> > > preprocessed PDF or is it also after the stuff we do on the website
> > > PDFs using jpdftweak? IIRC that tweak is only there to deal with the
> > > size, and specifically it deals with "bookmarks" which sounds a lot
> > > like this...
> >
> > I'm just looking at the size of the file produced by "make
> > postgres-A4.pdf".
> >
> > I don't know anything about jpdftweak, but if it's being used to get rid
> > of unreferenced hyperlink anchors, maybe we could dispense with that step
> > after this goes in.
>
> Yeah, that's what I was hoping.  You can see how it's used in the
> mk-release-pdfs script on borka. It doesn't explain why we're doing it,
> that's probably in the list archives somewhere, but it does show what we
> do. So it should be easy enough to see if the benefit goes away :) (it
> would also be nice for build times - that jpdftweak step is very very slow)


http://www.postgresql.org/message-id/flat/1284678175.2459.21.camel@hp-laptop2.gunduz.org#1284678175.2459.21.camel@hp-laptop2.gunduz.org


http://www.postgresql.org/message-id/flat/AANLkTi=3bkqc3ScM5Y==NPeY0_4uLFy+yGD9=GJ-NMZB@mail.gmail.com#AANLkTi=3bkqc3ScM5Y==NPeY0_4uLFy+yGD9=GJ-NMZB@mail.gmail.com

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Uh-oh: documentation PDF output no longer builds in HEAD

From

Tom Lane

Date:

10 November 2015, 23:55:55

Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Nov 10, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't know anything about jpdftweak, but if it's being used to get rid
>> of unreferenced hyperlink anchors, maybe we could dispense with that step
>> after this goes in.

> Yeah, that's what I was hoping.  You can see how it's used in the
> mk-release-pdfs script on borka.

Hmm ... building current HEAD in A4 format, I get

                            HEAD        with my patch
initially generated PDF:    26.30MB     13.25MB
after jpdftweak:            7.24MB      7.24MB

Evidently, jpdftweak *is* removing unreferenced bookmarks --- the output
file sizes would not be so close to identical otherwise.  But it's
evidently doing more than that.  The initially generated PDFs are
fairly compressible by "gzip", while jpdftweak's outputs are not, so
I suppose that the additional savings come from applying compression.
jpdftweak's help output indicates that the "-ocs" options are selecting
aggressive compression.

I tried removing the load/save bookmarks steps from mk-release-pdfs,
but what I get is files that are a little smaller yet and again almost the
same size for either starting point; that probably means that jpdftweak's
default behavior is to strip all bookmarks :-(.

Maybe we could look around for another tool that just does PDF compression
and not the other stuff ...
        regards, tom lane
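
The gzip observation above is a quick way to tell whether a file's contents are already compressed. The following sketch runs the same kind of check with Python's `zlib` module, which implements the DEFLATE compression that gzip uses. `deflate_ratio` is an illustrative name, and the two byte strings are synthetic stand-ins, not real PDF data.

```python
import random
import zlib


def deflate_ratio(data):
    """Compressed size divided by original size (lower means more compressible)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)


# Stand-in for uncompressed PDF content: highly repetitive text.
repetitive = b"BT /F1 10 Tf (hello) Tj ET\n" * 5000

# Stand-in for already-compressed streams: pseudo-random bytes.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(100_000))

if __name__ == "__main__":
    print(f"repetitive: {deflate_ratio(repetitive):.3f}")
    print(f"noisy:      {deflate_ratio(noisy):.3f}")
```

To apply this to real files, read each PDF in binary mode and compare the ratios. A ratio close to 1 suggests the file is already compressed, which is what the thread reports for jpdftweak's output.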