BUG #7624: Misleading Log Message & Inconsistent Configuration Design - Mailing list pgsql-bugs

From tixu@cs.ucsd.edu
Subject BUG #7624: Misleading Log Message & Inconsistent Configuration Design
Date
Msg-id E1TSNPE-0003BZ-Sb@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #7624: Misleading Log Message & Inconsistent Configuration Design  (Craig Ringer <ringerc@ringerc.id.au>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      7624
Logged by:          Tianyin Xu
Email address:      tixu@cs.ucsd.edu
PostgreSQL version: 9.2.1
Operating system:   Linux (actually doesn't matter)
Description:        =


Hi, Postgresql,

I'm here to report a very misleading log message and inconsistent
configuration design. It confuses users, unnecessarily complicating
configuration error diagnosis and causing frustration. As a computer system
grad student familiar with C code and GDB, it took me ~1.8 hours to figure
out the problem.

--------------------------

So, in the postgresql.conf, I had the following entries:

data_directory =3D 'local/pgsql/data/'
hba_file =3D 'local/pgsql/data/pg_hba.conf'
ident_file =3D 'local/pgsql/data/pg_ident.conf'
unix_socket_directory =3D 'local/pgsql/data/'

Without the =E2=80=9Cunix_socket_directory=E2=80=9D setting, when I started=
 the server,
everything was fine. =


However, with the =E2=80=9Cunix_socket_directory=E2=80=9D setting, the serv=
er refused to
start with the following message:

FATAL:  could not create lock file "local/pgsql/data/.s.PGSQL.5432.lock": No
such file or directory

Since the sever definitely could find =E2=80=9Cdata_directory=E2=80=9D, =E2=
=80=9Chba_file=E2=80=9D, and
=E2=80=9Cident_file=E2=80=9D, the =E2=80=9Clocal/pgsql/data=E2=80=9D was de=
finitely considered as
***valid*** path.

The weird thing to me is that typing =E2=80=9C#touch
local/pgsql/data/.s.PGSQL.5432.lock=E2=80=9D could generate the file, and e=
ven I
generated the file, the server still complained with the same log message.

Then, I used GDB to trace the error point and found the message was printed
because of the failure of =E2=80=9Cfopen=E2=80=9D,

/* src/backend/utils/init/miscinit.c */
static void
CreateLockFile(const char *filename, bool amPostmaster,
                           bool isDDLock, const char *refName)
{
    =E2=80=A6.
    fd =3D open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
    if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 100)
        ereport(FATAL,
                                        (errcode_for_file_access(),
                                         errmsg("could not create lock file
\"%s\": %m",
 filename)));
    =E2=80=A6.
}

I copied this piece of code out into a test.c file, compiled, and ran. It
did generate the =E2=80=9Clocal/pgsql/data/.s.PGSQL.5432.lock=E2=80=9D!!!, =
which means the
function =E2=80=9Copen=E2=80=9D definitely works. =


Then, I print the errno and found it's 2, i.e., ENOENT:
 =

O_CREAT is not set and the named file does not exist; or O_CREAT is set and
either the path prefix does not exist or the path argument points to an
empty string. =


But I thought (I was stupid) this's bullshit. Because path was not an empty
string, and it definitely existed! Otherwise how could =E2=80=9Cdata_direct=
ory=E2=80=9D,
=E2=80=9Chba_file=E2=80=9D, =E2=80=9Cident_file=E2=80=9D and the other simi=
lar file/directory paths work?

No idea...

I installed the newest postgres-9.2.1 again, but nothing changed. I
suspected it's because of some weird settings on my machine, so I installed
postgres again on a new server machine, but still nothing changed! =


No idea...

Found a senior student and explained this problem. He tried the same thing
as I did and also had no idea. =


What the hell...

Go back to the code, trace more carefully (patient, patient, patient), and
finally found the problem!!!

On the execution path, the server first called:

/*src/backend/postmaster/postmaster.c*/
757         ChangeToDataDir();

And then, called

/*src/backend/utils/init/miscinit.c*/
965         CreateLockFile(lockfile, amPostmaster, false, socketfile);

So, when it called open(1), the working directory is changed to the data
directory!

Actually the errno makes sense. I was stupid!!

-----------------------------

But how can users know this without reading the code... The only interface
exposed to users are the configuration file which is not consistent at all.
And the log message does not help but mislead users (the directory **does**
exist...).

I really suggest to make the configuration file consistent in this case. But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path like

FATAL:  58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.=
5432.lock":
No such file or directory

In this case, the users can definitely figure out the problem by themselves.


Here is the patch:

diff --git a/miscinit_old.c b/miscinit.c
index fb376a0..8df83a8 100644
--- a/miscinit_old.c
+++ b/miscinit.c
@@ -736,11 +736,13 @@ CreateLockFile(const char *filename, bool
amPostmaster,
                /*
                 * Couldn't create the pid file. Probably it already
exists.
                 */
-               if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 10=
0)
+               if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 10=
0) {
+                       char* abs_filename =3D make_absolute_path(filename);
                        ereport(FATAL,
                                        (errcode_for_file_access(),
                                         errmsg("could not create lock file
\"%s\": %m",
-                                                       filename)));
+                                                       abs_filename)));
+                }

                /*
                 * Read the file to get the old owner's PID.  Note race
condition


Thanks a lot!
Tianyin

pgsql-bugs by date:

Previous
From: Craig Ringer
Date:
Subject: Re: BUG #7620: array_to_json doesn't support heterogeneous arrays
Next
From: tixu@cs.ucsd.edu
Date:
Subject: BUG #7625: corrupted pid file prevents postgreSQL from normal starting