Thread: BUG #8532: postgres fails to start with timezone-data >=2013e

BUG #8532: postgres fails to start with timezone-data >=2013e

From
timo.gurr@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      8532
Logged by:          Timo Gurr
Email address:      timo.gurr@gmail.com
PostgreSQL version: 9.1.10
Operating system:   Gentoo Linux (64bit, kernel 3.11.0, glibc 2.17)
Description:

From the timezone-data NEWS:


Release 2013e - 2013-09-19 23:50:04 -0700
  Changes affecting the build procedure
    When building the 'posix' or 'right' subdirectories, if the
    subdirectory would be a copy of the default subdirectory, it is
    now made a symbolic link if that is supported.  This saves about
    2 MB of file system space.


This change breaks postgres: with a recent enough timezone-data package
installed on the system, postgres fails to start:


/var/lib/postgresql/9.1/data/postmaster.log
FATAL:  exceeded maxAllocatedDescs (16) while trying to open directory
"/usr/share/zoneinfo"


# ls -la /usr/share/zoneinfo/
lrwxrwxrwx  1 root root     1 Oct 16 12:06 posix -> .


Gentoo has a downstream bug report about it stating the problem should be
fixed on the postgres side:
https://bugs.gentoo.org/show_bug.cgi?id=486556
Also found on the net:
http://blog.endpoint.com/2013/06/debugging-obscure-postgres-problems.html


The workaround mentioned there, manually removing the symlink, lets
postgres start fine again.

Re: BUG #8532: postgres fails to start with timezone-data >=2013e

From
Heikki Linnakangas
Date:
On 16.10.2013 13:09, timo.gurr@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      8532
> Logged by:          Timo Gurr
> Email address:      timo.gurr@gmail.com
> PostgreSQL version: 9.1.10
> Operating system:   Gentoo Linux (64bit, kernel 3.11.0, glibc 2.17)
> Description:
>
> From the timezone-data NEWS:
>
>
> Release 2013e - 2013-09-19 23:50:04 -0700
>    Changes affecting the build procedure
>      When building the 'posix' or 'right' subdirectories, if the
>      subdirectory would be a copy of the default subdirectory, it is
>      now made a symbolic link if that is supported.  This saves about
>      2 MB of file system space.
>
>
> This change breaks postgres: with a recent enough timezone-data package
> installed on the system, postgres fails to start:
>
>
> /var/lib/postgresql/9.1/data/postmaster.log
> FATAL:  exceeded maxAllocatedDescs (16) while trying to open directory
> "/usr/share/zoneinfo"
>
>
> # ls -la /usr/share/zoneinfo/
> lrwxrwxrwx  1 root root     1 Oct 16 12:06 posix ->  .
>
>
> Gentoo has a downstream bugreport about it stating the problem should be
> fixed on the postgres side:
> https://bugs.gentoo.org/show_bug.cgi?id=486556

When you download the vanilla timezone sources and install, the
directory layout looks different:

~/tz ((2013e))$ make -s install DESTDIR=foo TZDIR=/usr/share/zoneinfo
ar: creating foo/usr/local/lib/libtz.a
mkdir: cannot create directory 'foo/usr/local': File exists
mkdir: cannot create directory 'foo/usr/local': File exists
~/tz ((2013e))$ ls -l foo/usr/share/
total 8
drwxr-xr-x 19 heikki heikki 4096 Oct 21 12:48 zoneinfo
drwxr-xr-x 19 heikki heikki 4096 Oct 21 12:48 zoneinfo-leaps
lrwxrwxrwx  1 heikki heikki    8 Oct 21 12:48 zoneinfo-posix -> zoneinfo

There is no 'posix' symlink inside 'zoneinfo'. The zoneinfo git
repository says that this layout was adopted in the upstream library a
long time ago:

> commit 77e3dfe1a7b7e14e9f252fc628a5d405c35b6444
> Author: Arthur David Olson <ado@elsie>
> Date:   Mon May 25 13:04:43 1998 -0400
>
>     Eggert mod
>
>     SCCS-file: Makefile
>     SCCS-SID: 7.66
>
> diff --git a/Makefile b/Makefile
> index a8d8067..ae414b4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -293,10 +293,19 @@ posix_only:    zic $(TDATA)
>  right_only:    zic leapseconds $(TDATA)
>          $(ZIC) -y $(YEARISTYPE) -d $(TZDIR) -L leapseconds $(TDATA)
>
> +# In earlier versions of this makefile, the other two directories were
> +# subdirectories of $(TZDIR).  However, this led to configuration errors.
> +# For example, with posix_right under the earlier scheme,
> +# TZ='right/Australia/Adelaide' got you localtime with leap seconds,
> +# but gmtime without leap seconds, which led to problems with applications
> +# like sendmail that subtract gmtime from localtime.
> +# Therefore, the other two directories are now siblings of $(TZDIR).
> +# You must replace all of $(TZDIR) to switch from not using leap seconds
> +# to using them, or vice versa.
>  other_two:    zic leapseconds $(TDATA)
> -        $(ZIC) -y $(YEARISTYPE) -d $(TZDIR)/posix -L /dev/null $(TDATA)
> +        $(ZIC) -y $(YEARISTYPE) -d $(TZDIR)-posix -L /dev/null $(TDATA)
>          $(ZIC) -y $(YEARISTYPE) \
> -            -d $(TZDIR)/right -L leapseconds $(TDATA)
> +            -d $(TZDIR)-right -L leapseconds $(TDATA)
>
>  posix_right:    posix_only other_two
>

However, Gentoo seems to carry a patch that reverts that commit:


http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/sys-libs/timezone-data/files/timezone-data-2013f-makefile.patch?revision=1.1&view=markup

That patch conflicts with the upstream Makefile change to create the
"other" directory as a symlink. With the vanilla zoneinfo sources, the
symlink is fine, but by putting 'posix' inside 'zoneinfo' directory, the
Gentoo-specific patch creates that infinite recursion situation.
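
To make the difference between the two layouts concrete, here is a minimal Python sketch (not PostgreSQL code). It builds a toy tree in either layout and walks it while following symlinks. The function names and the depth_limit cap are hypothetical and exist only so the demo terminates; symlink support on the host is assumed.

```python
import os
import tempfile


def build_tree(base, nested):
    """Create a toy 'zoneinfo' tree under base and return its path.

    nested=True  puts 'posix -> .' inside zoneinfo (the patched layout).
    nested=False creates 'zoneinfo-posix -> zoneinfo' as a sibling (vanilla layout).
    """
    zoneinfo = os.path.join(base, "zoneinfo")
    os.makedirs(os.path.join(zoneinfo, "Europe"))
    if nested:
        os.symlink(".", os.path.join(zoneinfo, "posix"))
    else:
        os.symlink("zoneinfo", os.path.join(base, "zoneinfo-posix"))
    return zoneinfo


def max_depth_reached(root, depth_limit):
    """Walk root, following symlinks, and return the deepest level visited.

    depth_limit is a hypothetical safety cap, not a PostgreSQL setting.
    """
    deepest = 0
    stack = [(root, 0)]
    while stack:
        path, depth = stack.pop()
        deepest = max(deepest, depth)
        if depth >= depth_limit:
            continue
        for name in os.listdir(path):
            child = os.path.join(path, name)
            # os.path.isdir() follows symlinks, so 'posix -> .' looks like
            # an ordinary subdirectory at every level.
            if os.path.isdir(child):
                stack.append((child, depth + 1))
    return deepest


if __name__ == "__main__":
    for nested in (True, False):
        with tempfile.TemporaryDirectory() as base:
            zoneinfo = build_tree(base, nested)
            print(nested, max_depth_reached(zoneinfo, depth_limit=25))
```

With the nested symlink the walk only stops at the artificial cap, while the sibling layout bottoms out after one level, because the symlink is never reached from inside zoneinfo.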

Gentoo isn't alone in doing this: my Debian system has a similar layout,
with 'posix' directory inside /usr/share/zoneinfo, rather than as a
sibling. I'm not sure if Debian will have this problem, though. If I'm
reading the Debian rules file correctly, they're not relying on the
upstream "make install" to create the 'posix' and 'right' directories,
but call zic directly.

In summary, I'd call this a packaging bug.

That said, the error message you get from PostgreSQL isn't very
user-friendly. There is a check on recursion depth in the timezone
traversing code, but apparently it trips on another limit first, on the
number of directory handles that can be open at a time.

Also, I don't understand how this is preventing PostgreSQL from starting
up. AFAICS the traversal of the timezones is only done when you query
the pg_timezone_names system view. Not at startup.

- Heikki

Re: BUG #8532: postgres fails to start with timezone-data >=2013e

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 16.10.2013 13:09, timo.gurr@gmail.com wrote:
>> # ls -la /usr/share/zoneinfo/
>> lrwxrwxrwx  1 root root     1 Oct 16 12:06 posix ->  .

> That patch conflicts with the upstream Makefile change to create the
> "other" directory as a symlink. With the vanilla zoneinfo sources, the
> symlink is fine, but by putting 'posix' inside 'zoneinfo' directory, the
> Gentoo-specific patch creates that infinite recursion situation.

I agree, this is an egregious packaging bug.  Programs should be able to
enumerate the timezone database without running into infinite recursion.

> That said, the error message you get from PostgreSQL isn't very
> user-friendly. There is a check on recursion depth in the timezone
> traversing code, but apparently it trips on another limit first, on the
> number of directory handles that can be open at a time.

> Also, I don't understand how this is preventing PostgreSQL from starting
> up. AFAICS the traversal of the timezones is only done when you query
> the pg_timezone_names system view. Not at startup.

Keep in mind that in 9.2 and later, we traverse the timezone tree in
initdb to set the timezone GUC.  Before that, we would do it in postmaster
startup --- but only if we didn't find a TZ variable in the environment
nor a setting in postgresql.conf.

I tried to reproduce this in HEAD by inserting a bogus symlink into
the installation timezone tree and running initdb.  Curiously, it did not
fail, though initdb took rather longer than expected.  After debugging
I realized that scan_available_timezones() was in fact recursing deeper
and deeper into the posix/posix/posix/posix/... nest --- but eventually,
the constructed filename exceeds MAXPGPATH, and we truncate it, and
fail to open the truncated filename, so the recursion stops.
(And you don't get any error message, unless you compiled with
DEBUG_IDENTIFY_TIMEZONE.)  Also, the implementation in initdb isn't
vulnerable to running out of descriptors because it sucks in an entire
directory at a time with pgfnames(), so it doesn't have a descriptor
open when it recurses.  (Instead, it consumes a lot of memory --- but
it looks like still only about 10MB worth.)
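
The behaviour described above can be imitated with a rough Python sketch. It is only an analogy for the initdb scan, not a translation of it: the function name scan is hypothetical, MAX_PATH_LEN stands in for MAXPGPATH with an assumed value, and over-long paths are skipped rather than truncated.

```python
import os

# Stand-in for MAXPGPATH; the value is an assumption made for this demo.
MAX_PATH_LEN = 1024


def scan(path, found, max_path_len=MAX_PATH_LEN):
    """Collect file paths under path into found, following symlinks."""
    try:
        # os.listdir() reads the whole directory and closes it before
        # returning, so no directory handle is held while recursing.
        names = sorted(os.listdir(path))
    except OSError:
        return
    for name in names:
        child = os.path.join(path, name)
        if len(child) >= max_path_len:
            # The description above has the name truncated and the open
            # failing; this sketch simply skips the entry. Either way the
            # recursion ends here.
            continue
        if os.path.isdir(child):
            scan(child, found, max_path_len)
        else:
            found.append(child)
```

Run against a tree containing 'posix -> .', this terminates without any explicit cycle check, but it records every file once per level of posix/posix/... it manages to reach, which is where the extra time and memory go.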

In 9.1, the reason you see the maxAllocatedDescs complaint is that the
postmaster tries to set the timezone before it's increased max_safe_fds,
so it won't increase maxAllocatedDescs past 16.  Enumerating the zones
in a regular backend would almost certainly report the timezone recursion
error instead.  (I am kinda wondering why maxAllocatedDescs == 16 isn't
enough to get to a recursion error at depth 10, but maybe there are a
few other files open when this happens.)
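
The interaction between the two limits can be modelled in a few lines of Python. This is a toy model, not the backend's accounting: the names first_limit_hit and already_open are hypothetical, the order of the two checks is an assumption, and the figures 16 and 10 are simply the ones mentioned above.

```python
import os


class RecursionTooDeep(Exception):
    pass


class TooManyDescriptors(Exception):
    pass


def _walk(path, budget, max_depth, depth, open_count):
    if depth > max_depth:
        raise RecursionTooDeep(path)
    if open_count + 1 > budget:
        raise TooManyDescriptors(path)
    # The directory stays open for the whole loop, so every level of
    # recursion holds one more handle than its parent.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():  # follows symlinks by default
                _walk(entry.path, budget, max_depth, depth + 1, open_count + 1)


def first_limit_hit(root, budget=16, max_depth=10, already_open=0):
    """Return which limit stops the walk: 'depth', 'descriptors' or 'none'.

    already_open models handles used for other purposes (hypothetical).
    """
    try:
        _walk(root, budget, max_depth, 0, already_open)
    except RecursionTooDeep:
        return "depth"
    except TooManyDescriptors:
        return "descriptors"
    return "none"
```

In this model a budget of 16 with nothing else open reaches the depth check first, while the same budget with eight handles already in use runs out of descriptors at depth 8, before the depth check can fire.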

Basically, I don't think we should do anything about this.  Packaging
the TZ database like that is completely brain-dead, and Gentoo needs
to fix it, not tell us we're doing something wrong.  The consequences
of their bug aren't too serious in modern PG releases anyway.  (Given
what I know of their packaging policies, I have to wonder why they're
still shipping 9.1 rather than the bleeding edge...)

            regards, tom lane

Re: BUG #8532: postgres fails to start with timezone-data >=2013e

From
"Aaron W. Swenson"
Date:
On 2013-11-10 20:57, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 16.10.2013 13:09, timo.gurr@gmail.com wrote:
> >> # ls -la /usr/share/zoneinfo/
> >> lrwxrwxrwx  1 root root     1 Oct 16 12:06 posix ->  .
>
> > That patch conflicts with the upstream Makefile change to create the
> > "other" directory as a symlink. With the vanilla zoneinfo sources, the
> > symlink is fine, but by putting 'posix' inside 'zoneinfo' directory, the
> > Gentoo-specific patch creates that infinite recursion situation.
>
> I agree, this is an egregious packaging bug.  Programs should be able to
> enumerate the timezone database without running into infinite recursion.

I agree. This looks like it's rather recent. I'm waiting for a reply
from the toolchain herd as to why they're effectively reverting that
patch.

> Basically, I don't think we should do anything about this.  Packaging
> the TZ database like that is completely brain-dead, and Gentoo needs
> to fix it, not tell us we're doing something wrong.  The consequences
> of their bug aren't too serious in modern PG releases anyway.  (Given
> what I know of their packaging policies, I have to wonder why they're
> still shipping 9.1 rather than the bleeding edge...)
>
>             regards, tom lane
>

Respectfully, I disagree. PostgreSQL should be able to handle cyclic
directory structures gracefully. More basic utilities, like ls and du,
won't die because of it.
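
For illustration, one common way to make a symlink-following traversal tolerate cycles is to remember the device and inode numbers of directories already visited. The Python sketch below shows that technique under the assumption of a POSIX file system; it is not what PostgreSQL does, and it makes no claim about how ls or du are implemented. The name list_files_once is hypothetical.

```python
import os


def list_files_once(root):
    """Return the files under root, visiting each real directory only once."""
    seen = set()   # (st_dev, st_ino) of directories already visited
    files = []

    def visit(path):
        st = os.stat(path)  # follows symlinks, so aliases resolve to the target
        key = (st.st_dev, st.st_ino)
        if key in seen:
            return  # the same directory reached again through a symlink
        seen.add(key)
        for name in sorted(os.listdir(path)):
            child = os.path.join(path, name)
            if os.path.isdir(child):
                visit(child)
            else:
                files.append(child)

    visit(root)
    return files
```

With 'posix -> .' inside the tree, the symlink resolves to a directory that is already in the seen set, so it is skipped and each file is reported exactly once.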

And, we've been keeping pace rather well. We had 9.3.1 in the tree
before you wrote your email. :p

--
Mr. Aaron W. Swenson
Gentoo Linux Developer
PostgreSQL Herd Bull
Email : titanofold@gentoo.org
GnuPG FP : 2C00 7719 4F85 FB07 A49C 0E31 5713 AA03 D1BB FDA0
GnuPG ID : D1BBFDA0