Thread: PG 7.1.2 Crash: cannot read xlog dir

From: kay

My situation: PG 7.1.2, Redhat 7.2, running in a chroot jail on a "VDS"
server at my new ISP. I can't recompile anything, can't upgrade PG
(basically, I'm stuck with 7.1.2).

This issue came up in a thread in late 2002. The message in which Tom Lane
suggests it might be a permissions issue is missing from the archive, but I
found it in Google's cache for two Webcrawler docs:
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22cannot+read+xlog+dir%22+&btnG=Google+Search
As to why they aren't on archives.postgresql.org ... ya got me.

I changed permissions to the most permissive setting I know (0777), plus I
own the directory, I own the files, and I own the postmaster process, so the
only thing I can think is that 'readdir' is badly linked or has some freaky
kernel interaction. I have Python, perl and PHP on the system, and they all
use 'opendir' and 'readdir' and 'closedir' just fine on the pg_xlog
directory.

My problem: I've deduced that the 'readdir' call is broken in my PG. I
examined the source code for 7.1 very thoroughly (see MoveOfflineLogs in
http://developer.postgresql.org/cvsweb.cgi/pgsql-server/src/backend/access/transam/xlog.c?rev=1.65.2.1&content-type=text/x-cvsweb-markup&only_with_tag=REL7_1_STABLE
). What I've found is that 'opendir' seems to open the directory fine (it
does not return NULL), but when 'readdir' tries to grab a filename,
something fails with the file system error 'No such file or directory':
'readdir' returns NULL and 'errno' gets set. The strange thing is that it
gets in there ONCE and removes ONE file (0000000000000000), and then it
won't process any more, ever again, until I stop the server and run initdb
again.

At this point I know that there's nothing wrong with the XLOG directory or
the files in it, because PG has been writing transactions fine for 7-8 hours
up to this point. It can only be a bad 'readdir' call.

My question: Is there some runtime setting I can use to prevent
MoveOfflineLogs() from ever being called? I would MUCH rather have a couple
of old XLOGs lying around than a fatal crash. Maybe by CHECKPOINTing every
hour or something ... I've tried playing with a bunch of different WAL
settings and ... I can't stop MoveOfflineLogs from being called.

Please keep in mind my hands are tied, and I can't recompile and I can't
upgrade. Even if I could upgrade, I imagine that 'readdir' would still be
broken, and I'd still have this issue.

If anybody can think of a workaround I'd really appreciate it. I've been
racking my brain on this for a week.

Thanks

-Keith


==================

Here's the log.

/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24626 exited with status 0
XLogFlush: rqst 0/12259528; wrt 0/0; flsh 0/0
XLogFlush: rqst 0/17078212; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/17078152; wrt 0/17078248; flsh 0/17078248
XLogFlush: rqst 0/0; wrt 0/17078248; flsh 0/17078248
INSERT @ 0/17078248: prev 0/17078212; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078248; undo 0/0; sui 28; xid 3495; oid 36603; online
XLogFlush: rqst 0/17078312; wrt 0/17078248; flsh 0/17078248
DEBUG:  MoveOfflineLogs: remove 0000000000000000
FATAL 2:  MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG:  proc_exit(2)
DEBUG:  shmem_exit(2)
DEBUG:  exit(2)
/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 24736 exited with status 512
Server process (pid 24736) exited with status 512 at Sat May 31 09:57:57 2003
Terminating any active server processes...
Server processes were terminated at Sat May 31 09:57:57 2003
Reinitializing shared memory and semaphores
invoking IpcMemoryCreate(size=1236992)
DEBUG:  database system was interrupted at 2003-05-31 09:57:57 EDT
DEBUG:  CheckPoint record at (0, 17078248)
DEBUG:  Redo record at (0, 17078248); Undo record at (0, 0); Shutdown FALSE
DEBUG:  NextTransactionId: 3495; NextOid: 36603
DEBUG:  database system was not properly shut down; automatic recovery in progress...
DEBUG:  ReadRecord: record with zero len at (0, 17078312)
DEBUG:  redo is not required
INSERT @ 0/17078312: prev 0/17078248; xprev 0/0; xid 0: XLOG - checkpoint: redo 0/17078312; undo 0/0; sui 28; xid 3495; oid 36603; shutdown
XLogFlush: rqst 0/17078376; wrt 0/17078312; flsh 0/17078312
FATAL 2:  MoveOfflineLogs: cannot read xlog dir: No such file or directory
DEBUG:  proc_exit(2)
DEBUG:  shmem_exit(2)
DEBUG:  exit(2)

=========================

Here's the code from 7.1.

static void
MoveOfflineLogs(uint32 log, uint32 seg)
{
        DIR            *xldir;
        struct dirent  *xlde;
        char            lastoff[32];
        char            path[MAXPGPATH];

        Assert(XLOG_archive_dir[0] == 0);        /* ! implemented yet */

        xldir = opendir(XLogDir);
        if (xldir == NULL)
                elog(STOP, "MoveOfflineLogs: cannot open xlog dir: %m");

        sprintf(lastoff, "%08X%08X", log, seg);

        errno = 0;
        while ((xlde = readdir(xldir)) != NULL)
        {
                if (strlen(xlde->d_name) == 16 &&
                        strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
                        strcmp(xlde->d_name, lastoff) <= 0)
                {
                        elog(LOG, "MoveOfflineLogs: %s %s",
                                 (XLOG_archive_dir[0]) ? "archive" : "remove",
                                 xlde->d_name);
                        sprintf(path, "%s%c%s", XLogDir, SEP_CHAR, xlde->d_name);
                        if (XLOG_archive_dir[0] == 0)
                                unlink(path);
                }
                errno = 0;
        }
        if (errno)
                elog(STOP, "MoveOfflineLogs: cannot read xlog dir: %m");
        closedir(xldir);
}



Re: PG 7.1.2 Crash: cannot read xlog dir

From: Tom Lane

kay <efesar@nmia.com> writes:
> Please keep in mind my hands are tied, and I can't recompile and I can't
> upgrade.
> If anybody can think of a workaround I'd really appreciate it.

Find another ISP.  You should not be having to work around broken system
software, and you certainly aren't going to be able to do so without
recompiling.

            regards, tom lane

Re: PG 7.1.2 Crash: cannot read xlog dir

From: kay

Tom,

Thanks for the advice. I understand your concern, but it turns out there was
a simple solution. I received it off-list, so I'd like to share it here.
Since "VDS" servers are becoming more popular by the day, this might be
useful to other admins in the future.

> Please contact your ISP. There is a bug in the "VDS" jail that
> causes this behavior.
> The solution is to create a soft link in the VDS's /dev directory
> that links console to null:
> cd /dev; ln -s null console
> You cannot do this yourself; your ISP has to do it for you.

It worked perfectly. Thanks to Jefim Matskin for this solution.

-Keith


> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Sunday, June 01, 2003 9:30 AM
> To: kay
> Cc: pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] PG 7.1.2 Crash: cannot read xlog dir
>
>
> kay <efesar@nmia.com> writes:
> > Please keep in mind my hands are tied, and I can't recompile and I can't
> > upgrade.
> > If anybody can think of a workaround I'd really appreciate it.
>
> Find another ISP.  You should not be having to work around broken system
> software, and you certainly aren't going to be able to do so without
> recompiling.
>
>             regards, tom lane
>
>