Thread: Could not read directory "pg_xlog": Invalid argument (on SSD Raid)

Could not read directory "pg_xlog": Invalid argument (on SSD Raid)

From: Data Growth Pty Ltd

I'm frequently getting these errors in my console:

4/11/09 2:25:04 PM    org.postgresql.postgres[192]    ERROR:  could not read directory "pg_xlog": Invalid argument
4/11/09 2:25:56 PM    org.postgresql.postgres[192]    ERROR:  could not read directory "pg_xlog": Invalid argument
4/11/09 2:36:03 PM    org.postgresql.postgres[192]    ERROR:  could not read directory "pg_xlog": Invalid argument

and rarely:

3/11/09 10:32:31 PM    org.postgresql.postgres[217]    ERROR:  could not read directory "pg_clog": Invalid argument

It is clearly not failing all the time, as the pg_xlog directory is full of files that keep being touched and updated.  I have not experienced data loss (yet), but large queries are taking orders of magnitude longer than I would like.


System:

Mac Pro Quad Nehalem 2.93GHz, 16GB RAM running Snow Leopard OS X 10.6.1 in 64-bit mode

Postgres 8.4.1 (Intel 64 bit) from http://www.kyngchaos.com/software:postgres
    ( I have also tried compiling from source - I have the same problems plus a few extra installation issues.  The "official" postgresql binary from http://www.enterprisedb.com/ is not 64 bit)

The postgres data directory is on an SSD RAID 0 array.  In other applications the array sustains around 10K random read I/Os per second, or 5K random write I/Os per second.  pg_xlog and pg_clog are on the same SSD RAID array as the postgres DB.



Under postgres, the array does several thousand I/Os per second for about 1-2 seconds, then drops back to only about 50 I/Os per second for about 10 seconds, before repeating the cycle.  CPU usage is usually only a couple of percent.  The console often records the error message could not read directory "pg_xlog": Invalid argument during those infrequent bursts of activity.

I've looked at the source code in src/port/dirmod.c:

pgfnames(const char *path)
{
....
        while ((file = readdir(dir)) != NULL)
        {
....
                errno = 0;
        }
....
        if (errno)
        {
....
                fprintf(stderr, _("could not read directory \"%s\": %s\n"),
                                path, strerror(errno));
....
        }


So it seems that readdir() is occasionally failing with errno set to EINVAL ("Invalid argument").  But I do not understand how this error could possibly occur in this location.
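
For reference, the errno handling that the excerpt above relies on can be sketched in isolation. readdir() returns NULL both at end-of-directory and on failure, so the only way to tell the two apart is to clear errno before each call and inspect it after the loop. The helper name count_dir_entries is hypothetical and is not part of the PostgreSQL source; this is a minimal sketch of the idiom, not the actual pgfnames() code.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from PostgreSQL): count the entries in `path`,
 * excluding "." and "..".  Returns the count, or -1 with errno set if
 * opendir() or readdir() failed.
 */
static int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *file;
    int count = 0;
    int saved_errno;

    if (dir == NULL)
        return -1;              /* errno was set by opendir() */

    errno = 0;                  /* so a NULL from readdir() can be classified */
    while ((file = readdir(dir)) != NULL)
    {
        if (strcmp(file->d_name, ".") != 0 && strcmp(file->d_name, "..") != 0)
            count++;
        errno = 0;              /* reset before the next readdir() call */
    }

    saved_errno = errno;        /* 0 means clean end-of-directory */
    closedir(dir);

    if (saved_errno != 0)
    {
        fprintf(stderr, "could not read directory \"%s\": %s\n",
                path, strerror(saved_errno));
        errno = saved_errno;
        return -1;
    }
    return count;
}
```

Calling count_dir_entries("pg_xlog") from the data directory would, on an affected system, presumably print the same message as in the console log whenever readdir() leaves errno set to EINVAL.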

I've searched for "pg_xlog": Invalid argument, and the only other mention I have found was on Linux running on a ram disk.

Could this be a race condition?  Suggestions?

Stephen

Re: Could not read directory "pg_xlog": Invalid argument (on SSD Raid)

From: Tom Lane

Data Growth Pty Ltd <datagrowth@gmail.com> writes:
> I'm frequently getting these errors in my console:
> 4/11/09 2:25:04 PM    org.postgresql.postgres[192]    ERROR:  could not read
> directory "pg_xlog": Invalid argument
> Mac Pro Quad Nehalem 2.93GHz, 16GB RAM running Snow Leopard OS X 10.6.1 in
> 64bit mode

This is a known bug in Snow Leopard --- readdir() calls fail after
having deleted a file in the directory.  We are hoping that Apple fixes
it in 10.6.2, because trying to kluge around it seems like a mess.
Your example is actually in a different place from the known case in
DROP TABLESPACE, which just reinforces that trying to avoid the bug at
the application level would be difficult.

I'm running 10.6 myself on my laptop, but I think it ought to be
regarded as not quite stable enough for production servers yet :-(

            regards, tom lane