Thread: BUG #7624: Misleading Log Message & Inconsistent Configuration Design

BUG #7624: Misleading Log Message & Inconsistent Configuration Design

From
tixu@cs.ucsd.edu
Date:
The following bug has been logged on the website:

Bug reference:      7624
Logged by:          Tianyin Xu
Email address:      tixu@cs.ucsd.edu
PostgreSQL version: 9.2.1
Operating system:   Linux (actually doesn't matter)
Description:        =


Hi, Postgresql,

I'm here to report a very misleading log message and inconsistent
configuration design. It confuses users, unnecessarily complicating
configuration error diagnosis and causing frustration. As a computer system
grad student familiar with C code and GDB, it took me ~1.8 hours to figure
out the problem.

--------------------------

So, in the postgresql.conf, I had the following entries:

data_directory =3D 'local/pgsql/data/'
hba_file =3D 'local/pgsql/data/pg_hba.conf'
ident_file =3D 'local/pgsql/data/pg_ident.conf'
unix_socket_directory =3D 'local/pgsql/data/'

Without the =E2=80=9Cunix_socket_directory=E2=80=9D setting, when I started=
 the server,
everything was fine. =


However, with the =E2=80=9Cunix_socket_directory=E2=80=9D setting, the serv=
er refused to
start with the following message:

FATAL:  could not create lock file "local/pgsql/data/.s.PGSQL.5432.lock": No
such file or directory

Since the sever definitely could find =E2=80=9Cdata_directory=E2=80=9D, =E2=
=80=9Chba_file=E2=80=9D, and
=E2=80=9Cident_file=E2=80=9D, the =E2=80=9Clocal/pgsql/data=E2=80=9D was de=
finitely considered as
***valid*** path.

The weird thing to me is that typing =E2=80=9C#touch
local/pgsql/data/.s.PGSQL.5432.lock=E2=80=9D could generate the file, and e=
ven I
generated the file, the server still complained with the same log message.

Then, I used GDB to trace the error point and found the message was printed
because of the failure of =E2=80=9Cfopen=E2=80=9D,

/* src/backend/utils/init/miscinit.c */
static void
CreateLockFile(const char *filename, bool amPostmaster,
                           bool isDDLock, const char *refName)
{
    =E2=80=A6.
    fd =3D open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
    if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 100)
        ereport(FATAL,
                                        (errcode_for_file_access(),
                                         errmsg("could not create lock file
\"%s\": %m",
 filename)));
    =E2=80=A6.
}

I copied this piece of code out into a test.c file, compiled, and ran. It
did generate the =E2=80=9Clocal/pgsql/data/.s.PGSQL.5432.lock=E2=80=9D!!!, =
which means the
function =E2=80=9Copen=E2=80=9D definitely works. =


Then, I print the errno and found it's 2, i.e., ENOENT:
 =

O_CREAT is not set and the named file does not exist; or O_CREAT is set and
either the path prefix does not exist or the path argument points to an
empty string. =


But I thought (I was stupid) this's bullshit. Because path was not an empty
string, and it definitely existed! Otherwise how could =E2=80=9Cdata_direct=
ory=E2=80=9D,
=E2=80=9Chba_file=E2=80=9D, =E2=80=9Cident_file=E2=80=9D and the other simi=
lar file/directory paths work?

No idea...

I installed the newest postgres-9.2.1 again, but nothing changed. I
suspected it's because of some weird settings on my machine, so I installed
postgres again on a new server machine, but still nothing changed! =


No idea...

Found a senior student and explained this problem. He tried the same thing
as I did and also had no idea. =


What the hell...

Go back to the code, trace more carefully (patient, patient, patient), and
finally found the problem!!!

On the execution path, the server first called:

/*src/backend/postmaster/postmaster.c*/
757         ChangeToDataDir();

And then, called

/*src/backend/utils/init/miscinit.c*/
965         CreateLockFile(lockfile, amPostmaster, false, socketfile);

So, when it called open(1), the working directory is changed to the data
directory!

Actually the errno makes sense. I was stupid!!

-----------------------------

But how can users know this without reading the code... The only interface
exposed to users are the configuration file which is not consistent at all.
And the log message does not help but mislead users (the directory **does**
exist...).

I really suggest to make the configuration file consistent in this case. But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path like

FATAL:  58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.=
5432.lock":
No such file or directory

In this case, the users can definitely figure out the problem by themselves.


Here is the patch:

diff --git a/miscinit_old.c b/miscinit.c
index fb376a0..8df83a8 100644
--- a/miscinit_old.c
+++ b/miscinit.c
@@ -736,11 +736,13 @@ CreateLockFile(const char *filename, bool
amPostmaster,
                /*
                 * Couldn't create the pid file. Probably it already
exists.
                 */
-               if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 10=
0)
+               if ((errno !=3D EEXIST && errno !=3D EACCES) || ntries > 10=
0) {
+                       char* abs_filename =3D make_absolute_path(filename);
                        ereport(FATAL,
                                        (errcode_for_file_access(),
                                         errmsg("could not create lock file
\"%s\": %m",
-                                                       filename)));
+                                                       abs_filename)));
+                }

                /*
                 * Read the file to get the old owner's PID.  Note race
condition


Thanks a lot!
Tianyin

Re: BUG #7624: Misleading Log Message & Inconsistent Configuration Design

From
Craig Ringer
Date:
On 10/28/2012 03:35 PM, tixu@cs.ucsd.edu wrote:
> I really suggest to make the configuration file consistent in this case. But
> I understand it might not be easy. But at least I think we should print a
> better log message which pinpoints to the absolute path like
>
> FATAL:  58P01: could not create lock file
> "/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
> No such file or directory

Essentially, you want PostgreSQL to print absolute paths in error
messages, instead of paths relative to the server datadir?

I'd prefer to do it slightly differently, but I like the general idea.
I'd instead want to write:

FATAL:  58P01: could not create lock file
"local/pgsql/data/.s.PGSQL.5432.lock"
(cwd="/home/tianyin/postgresql-9.2.1/local/pgsql/data/"): No such file
or directory

By spelling out the CWD explicitly there's no impression given that the
user ever actually specified the whole path in the configuration
anywhere, so it's clearer how the path came to be. It also shows which
part of the path is known-good (since it's the CWD) and which could be
at issue.

--
Craig Ringer

Re: BUG #7624: Misleading Log Message & Inconsistent Configuration Design

From
Tianyin Xu
Date:
Hi, Craig,

Thanks a lot for your response. Sorry to make the bug report so long... I
was a little pissed off by myself for my stupidness after spending around 2
hours on this issue.

Yes, your message is definitely better and explicit by pointing out CWD.

The only concern to me is whether or not CWD is a common abbreviation,
especially for non-native speakers. Searching "CWD" on google, I didn't
find it refers to Current Working Directory in the first 3 returned pages.

To me, printing an explicit log message is the easiest way to solve the
problem. But the configuration file design is still not consistent (maybe
I'm too picky). In the following settings, we have to tell users that the
first 3 works and the last 1 does not work because of chdir blah blah blah.

data_directory = 'local/pgsql/data/'
hba_file = 'local/pgsql/data/pg_hba.conf'
ident_file = 'local/pgsql/data/pg_ident.conf'
unix_socket_directory = 'local/pgsql/data/'

Thanks,
Tianyin



On Mon, Oct 29, 2012 at 7:46 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:

> On 10/28/2012 03:35 PM, tixu@cs.ucsd.edu wrote:
> > I really suggest to make the configuration file consistent in this case.
> But
> > I understand it might not be easy. But at least I think we should print a
> > better log message which pinpoints to the absolute path like
> >
> > FATAL:  58P01: could not create lock file
> >
> "/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
> > No such file or directory
>
> Essentially, you want PostgreSQL to print absolute paths in error
> messages, instead of paths relative to the server datadir?
>
> I'd prefer to do it slightly differently, but I like the general idea.
> I'd instead want to write:
>
> FATAL:  58P01: could not create lock file
> "local/pgsql/data/.s.PGSQL.5432.lock"
> (cwd="/home/tianyin/postgresql-9.2.1/local/pgsql/data/"): No such file
> or directory
>
> By spelling out the CWD explicitly there's no impression given that the
> user ever actually specified the whole path in the configuration
> anywhere, so it's clearer how the path came to be. It also shows which
> part of the path is known-good (since it's the CWD) and which could be
> at issue.
>
> --
> Craig Ringer
>



--
Tianyin XU,
http://cseweb.ucsd.edu/~tixu/