Thread: 2nd update on TOAST

2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Uh oh -
   on  one side I'm a happy camper. pg_dump already ignored pg_*   tables and since the TOAST  tables  are  named
pg_toast_...,  nothing  to  be done. I loaded my test DB with TOAST entries,   dumped and restored it.  Anything is
there,works  perfectly.
 
   But  then  I added the ALTER TABLE for unlimited rewrite rule   size to initdb and the problems started. I can
createa table   with  500+  attributes.  Also  I can create a view on it (the   rules size is 170K - whow). Anything
workspretty well,  just   a pg_dump output is garbage.
 
   Seems  the  dynamic  string buffers used in pg_dump aren't as   bullet proof as they  should.   I'm  still  busy
with other   things, so can someone please take a look on it?  Attached is   an SQL script that creates the table  and
the view  that  I   cannot dump.
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: 2nd update on TOAST

From
Tom Lane
Date:
JanWieck@t-online.de (Jan Wieck) writes:
>     size to initdb and the problems started. I can create a table
>     with  500+  attributes.  Also  I can create a view on it (the
>     rules size is 170K - whow). Anything works pretty well,  just
>     a pg_dump output is garbage.

Are you using current sources?  Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...).  Just want to know which pg_dump we're talking about.
        regards, tom lane


Re: 2nd update on TOAST

From
Philip Warner
Date:
At 22:01 5/07/00 +0200, Jan Wieck wrote:
>
>    Seems  the  dynamic  string buffers used in pg_dump aren't as
>    bullet proof as they  should.   I'm  still  busy  with  other
>    things, so can someone please take a look on it?  Attached is
>    an SQL script that creates the table  and  the  view  that  I
>    cannot dump.
>

If the TOAST patch on the FTP site the most recent?


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
Philip Warner
Date:
At 17:18 5/07/00 -0400, Tom Lane wrote:
>JanWieck@t-online.de (Jan Wieck) writes:
>>     size to initdb and the problems started. I can create a table
>>     with  500+  attributes.  Also  I can create a view on it (the
>>     rules size is 170K - whow). Anything works pretty well,  just
>>     a pg_dump output is garbage.
>
>Are you using current sources?  Bruce committed Philip Warner's
>pg_dump rewrite a day or so ago (which I thought was way premature,
>but anyway...).  Just want to know which pg_dump we're talking about.

It's good that any bugs had a chnace to come out ASAP.

However, I'd be interested to know what actually happens: I do recall that
pg_dump (both versions) have seomething like '#define COPY_BUFFER_SIZE
8192'; if I were to start looking, that's where I'd go first.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
Philip Warner
Date:
At 12:50 6/07/00 +1000, Philip Warner wrote:
>>
>>Are you using current sources?  Bruce committed Philip Warner's
>>pg_dump rewrite a day or so ago (which I thought was way premature,
>>but anyway...).  Just want to know which pg_dump we're talking about.
>
>It's good that any bugs had a chnace to come out ASAP.
>
>However, I'd be interested to know what actually happens: I do recall that
>pg_dump (both versions) have seomething like '#define COPY_BUFFER_SIZE
>8192'; if I were to start looking, that's where I'd go first.
>

Further to this, looking at the code, it now uses 'archputs', which
replaces 'fputs', to output the copy buffer (assuming the error is in the
copy, not while dumping the definitions); if PQgetline fills the entire
buffer (no trailing \0), then I could imagine both would croak. The
simplest test would be to tell PQgetline that the buffer size if 1 byte
smaller, and see if it fixes the problem. 

I'd be interested to hear from someone...at least to know a little more of
the circumstaces of the error.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
Philip Warner
Date:
At 13:15 6/07/00 +1000, Philip Warner wrote:
>At 12:50 6/07/00 +1000, Philip Warner wrote:
>>>
>>>Are you using current sources?  Bruce committed Philip Warner's
>>>pg_dump rewrite a day or so ago (which I thought was way premature,
>>>but anyway...).  Just want to know which pg_dump we're talking about.
>>

OK, I've built from the latest CVS, run initdb etc, created a new DB,
created the megatable & megaview, and done a pg_dump. It works, except that
I introduced a but that caused the view to be dumped as a table as well. I
will keep looking, and submit a patch shortly. In the mean time, if anyone
can reproduce Jan's problem, that would help...


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Philip Warner wrote:
> At 22:01 5/07/00 +0200, Jan Wieck wrote:
> >
> >    Seems  the  dynamic  string buffers used in pg_dump aren't as
> >    bullet proof as they  should.   I'm  still  busy  with  other
> >    things, so can someone please take a look on it?  Attached is
> >    an SQL script that creates the table  and  the  view  that  I
> >    cannot dump.
> >
> 
> If the TOAST patch on the FTP site the most recent?
   No. It's all in the current CVS tree. I removed that patch   already.


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: 2nd update on TOAST

From
Philip Warner
Date:
At 11:09 6/07/00 +0200, Jan Wieck wrote:
>
>    No. It's all in the current CVS tree. I removed that patch
>    already.
>

OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
'make distclean' then a 'make' & 'make install' again, and now postmaster
wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
process dies so I cant get a backtrace.

I have now rebuilt with SYSLOG support, and get the following:

Jul  6 20:54:54 Cerberus2 kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000038
Jul  6 20:54:54 Cerberus2 kernel: current->tss.cr3 = 02467000, %cr3 = 02467000
Jul  6 20:54:54 Cerberus2 kernel: *pde = 00000000
Jul  6 20:54:54 Cerberus2 kernel: Oops: 0000
Jul  6 20:54:54 Cerberus2 kernel: CPU:    0
Jul  6 20:54:54 Cerberus2 kernel: EIP:    0010:[fcntl_setlk+327/404]
Jul  6 20:54:54 Cerberus2 kernel: EFLAGS: 00000202
Jul  6 20:54:54 Cerberus2 kernel: eax: 00000000   ebx: c15485b0   ecx:
c2494000   edx: c0d49a50
Jul  6 20:54:54 Cerberus2 kernel: esi: bffff574   edi: 00000004   ebp:
fffffff7   esp: c2495f34
Jul  6 20:54:54 Cerberus2 kernel: ds: 0018   es: 0018   ss: 0018
Jul  6 20:54:54 Cerberus2 kernel: Process postmaster (pid: 5661, process
nr: 50, stackpage=c2495000)
Jul  6 20:54:54 Cerberus2 kernel: Stack: 00000000 c15485b0 c2495f40
00000001 00000000 00000000 4000bc74 00000000
Jul  6 20:54:54 Cerberus2 kernel:        00000000 00000000 00000000
00000000 c0d49a50 0000161d 00000000 c15485b0
Jul  6 20:54:54 Cerberus2 kernel:        00000101 00000000 7fffffff
00000000 00000000 00000000 c012c687 00000004
Jul  6 20:54:54 Cerberus2 kernel: Call Trace: [sys_fcntl+595/772]
[sys_open+94/124] [system_call+52/56]
Jul  6 20:54:54 Cerberus2 kernel: Code: 8b 50 38 85 d2 74 15 8d 44 24 24 50
ff 74 24 6c 53 ff d2 89


If anyone can give me some tips on tracking this down, I would appreciate
it....


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Philip Warner wrote:
> At 11:09 6/07/00 +0200, Jan Wieck wrote:
> >
> >    No. It's all in the current CVS tree. I removed that patch
> >    already.
> >
>
> OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
> 'make distclean' then a 'make' & 'make install' again, and now postmaster
> wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
> process dies so I cant get a backtrace.
   Have the same symptom with a completely fresh cvs checkout.

> If anyone can give me some tips on tracking this down, I would appreciate
> it....
   Bruce  applied  a  patch  to configure.in yesterday. Read the   comments from the cvslog. It tells that it triggers
abug  in   the  Linux  kernels  fcntl(SETLK)  code  when  used with unix   domain sockets, and that the bug is present
inLinux  kernels   <=  2.2.16. I'm running a 2.2.12 here, and so it exactly dies   in pqcomm.c line 229 on fcntl()
againstthe socket.
 
   Undefining  HAVE_FCNTL_SETLK  in  config.h  did  it  for   me   temporary. Don't know how to deal with it finally.
   With this setup I did
       initdb       createdb       psql <megaview.sql       pg_dump pgsql >megaview.dump
   In  the  dump  file,  the  first  2183  bytes look OK. What's   following then looks like internal tables where
pg_dumpholds   the info of the schema analyzing.
 
   And don't worry that the view is dumped as table with a later   CREATE RULE.  That's correct this way.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: 2nd update on TOAST

From
Philip Warner
Date:
At 14:04 6/07/00 +0200, Jan Wieck wrote:
>Philip Warner wrote:
>
>    In  the  dump  file,  the  first  2183  bytes look OK. What's
>    following then looks like internal tables where pg_dump holds
>    the info of the schema analyzing.

Any chance you could mail it direct to me? 


>    And don't worry that the view is dumped as table with a later
>    CREATE RULE.  That's correct this way.

I figured this out the hard way!



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
Philip Warner
Date:
At 14:04 6/07/00 +0200, Jan Wieck wrote:
>
>    With this setup I did
>
>        initdb
>        createdb
>        psql <megaview.sql
>        pg_dump pgsql >megaview.dump
>
>    In  the  dump  file,  the  first  2183  bytes look OK. What's
>    following then looks like internal tables where pg_dump holds
>    the info of the schema analyzing.
>

Just in case there is some other factor, can you let me know what your
choices in 'configure' were?


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: 2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Philip Warner wrote:
> At 14:04 6/07/00 +0200, Jan Wieck wrote:
> >Philip Warner wrote:
> >
> >    In  the  dump  file,  the  first  2183  bytes look OK. What's
> >    following then looks like internal tables where pg_dump holds
> >    the info of the schema analyzing.
> 
> Any chance you could mail it direct to me? 
   Attached.

> 
> 
> >    And don't worry that the view is dumped as table with a later
> >    CREATE RULE.  That's correct this way.
> 
> I figured this out the hard way!
   :-)
   Will be off after this until approx. 1:00 UCT. 


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: 2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Philip Warner wrote:
> At 14:04 6/07/00 +0200, Jan Wieck wrote:
> >
> >    With this setup I did
> >
> >        initdb
> >        createdb
> >        psql <megaview.sql
> >        pg_dump pgsql >megaview.dump
> >
> >    In  the  dump  file,  the  first  2183  bytes look OK. What's
> >    following then looks like internal tables where pg_dump holds
> >    the info of the schema analyzing.
> >
> 
> Just in case there is some other factor, can you let me know what your
> choices in 'configure' were?
   --with-tcl   --enable-cassert   --enable-debug

Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: 2nd update on TOAST

From
Thomas Lockhart
Date:
>     Bruce  applied  a  patch  to configure.in yesterday. Read the
>     comments from the cvslog. It tells that it triggers a bug  in
>     the  Linux  kernels  fcntl(SETLK)  code  when  used with unix
>     domain sockets, and that the bug is present in Linux  kernels
>     <=  2.2.16. I'm running a 2.2.12 here, and so it exactly dies
>     in pqcomm.c line 229 on fcntl() against the socket.
>     Undefining  HAVE_FCNTL_SETLK  in  config.h  did  it  for   me
>     temporary. Don't know how to deal with it finally.

Thanks Jan for the workaround. I'll see if this gets me up and going
again.
                       - Thomas


Re: 2nd update on TOAST

From
Bruce Momjian
Date:
> JanWieck@t-online.de (Jan Wieck) writes:
> >     size to initdb and the problems started. I can create a table
> >     with  500+  attributes.  Also  I can create a view on it (the
> >     rules size is 170K - whow). Anything works pretty well,  just
> >     a pg_dump output is garbage.
> 
> Are you using current sources?  Bruce committed Philip Warner's
> pg_dump rewrite a day or so ago (which I thought was way premature,
> but anyway...).  Just want to know which pg_dump we're talking about.

I committed pg_dump so people could see his changes and start making
additions.  I can then have him supply incremental patches.  Keeping
stuff out of the tree usually causes the author to get frustrated.

Of course, if it comes in too quickly, it can cause chaos if he needs to
make major changes.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: 2nd update on TOAST

From
Bruce Momjian
Date:
> > OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
> > 'make distclean' then a 'make' & 'make install' again, and now postmaster
> > wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
> > process dies so I cant get a backtrace.
> 
>     Have the same symptom with a completely fresh cvs checkout.
> 
> > If anyone can give me some tips on tracking this down, I would appreciate
> > it....
> 
>     Bruce  applied  a  patch  to configure.in yesterday. Read the
>     comments from the cvslog. It tells that it triggers a bug  in
>     the  Linux  kernels  fcntl(SETLK)  code  when  used with unix
>     domain sockets, and that the bug is present in Linux  kernels
>     <=  2.2.16. I'm running a 2.2.12 here, and so it exactly dies
>     in pqcomm.c line 229 on fcntl() against the socket.

I thought when he said flock() bug, he meant only on the new IA64
platform, not on all Linux platforms.  Yikes, I enable flock(), and it
breaks initdb for all the Linux users.  This is a problem!

Tom was mentioning the configure check for flock() was broken recently,
so I was glad to fix it.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: 2nd update on TOAST

From
JanWieck@t-online.de (Jan Wieck)
Date:
Bruce Momjian wrote:
> > > OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
> > > 'make distclean' then a 'make' & 'make install' again, and now postmaster
> > > wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
> > > process dies so I cant get a backtrace.
> >
> >     Have the same symptom with a completely fresh cvs checkout.
> >
> > > If anyone can give me some tips on tracking this down, I would appreciate
> > > it....
> >
> >     Bruce  applied  a  patch  to configure.in yesterday. Read the
> >     comments from the cvslog. It tells that it triggers a bug  in
> >     the  Linux  kernels  fcntl(SETLK)  code  when  used with unix
> >     domain sockets, and that the bug is present in Linux  kernels
> >     <=  2.2.16. I'm running a 2.2.12 here, and so it exactly dies
> >     in pqcomm.c line 229 on fcntl() against the socket.
>
> I thought when he said flock() bug, he meant only on the new IA64
> platform, not on all Linux platforms.  Yikes, I enable flock(), and it
> breaks initdb for all the Linux users.  This is a problem!
   Not initdb, but postmaster. That's the one who tries (after a   successful initdb) to  do  the  fcntl(F_SETLK)  on
the unix   domain  socket.   Causing  the  kernel saying "go to hell, go   directly, don't write a core,  don't  leave
useful info  in   gdb".
 
   The  only reason I see for the entire section is to detect if   it would be safe to unlink the socket because  it's
left by   another  postmaster  in case of abnormal termination. Tell me   if I've misread it.   So  why  not  doing  it
on  the  Linux   platform    different,    using    a   separate   file   like   .s.PGSQL.5432.LCK?
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: 2nd update on TOAST

From
Bruce Momjian
Date:
>     Not initdb, but postmaster. That's the one who tries (after a
>     successful initdb) to  do  the  fcntl(F_SETLK)  on  the  unix
>     domain  socket.   Causing  the  kernel saying "go to hell, go
>     directly, don't write a core,  don't  leave  useful  info  in
>     gdb".
> 
>     The  only reason I see for the entire section is to detect if
>     it would be safe to unlink the socket because  it's  left  by
>     another  postmaster  in case of abnormal termination. Tell me
>     if I've misread it.   So  why  not  doing  it  on  the  Linux
>     platform    different,    using    a   separate   file   like
>     .s.PGSQL.5432.LCK?

But how do you know if that file still belongs to an active postmaster? 
What if it exited before removing the file.  Seems we would have to
write the PID into the file, and do a kill() to see if it is running.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tom Lane
Date:
>> The  only reason I see for the entire section is to detect if
>> it would be safe to unlink the socket because  it's  left  by
>> another  postmaster  in case of abnormal termination. Tell me
>> if I've misread it.

That's exactly what it's for.  We need to tell whether there is
still another postmaster running on the same port number.  Too
bad the kernel is not bright enough to unlink the socket file
automatically when it's no longer in use...

>> So  why  not  doing  it  on  the  Linux
>> platform    different,    using    a   separate   file   like
>> .s.PGSQL.5432.LCK?

I think it's a bad idea to do it differently on Linux than other
platforms.  If we fix this (other than by just disabling the fcntl
call again on old Linuxen) we should use the new method everywhere.

> But how do you know if that file still belongs to an active postmaster? 
> What if it exited before removing the file.  Seems we would have to
> write the PID into the file, and do a kill() to see if it is running.

Well, if we wanted to continue to depend on fcntl(SETLK) then we could
use an empty plain file.  I read the bug report as being that old Linux
kernels fail if fcntl(SETLK) is applied to a Unix-socket file.  They'd
surely have noticed long before if the feature didn't work on plain
files.

But if we are going to change this at all, I'd vote for storing pids
in the lock files the way we are now doing in the data-directory pid
lock files.  Then we wouldn't have to depend on fcntl at all, which
would be a Good Thing from a portability point of view.

However, I think it would be a really bad idea to keep the lock files
in /tmp --- that's way too open to accidental removals, not to mention
deliberate denial-of-service attacks.  They need to be in a more secure
directory; but where?  See the past discussions summarized in the
TODO.detail file.
        regards, tom lane


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Peter Eisentraut
Date:
Tom Lane writes:

> However, I think it would be a really bad idea to keep the lock files
> in /tmp --- that's way too open to accidental removals, not to mention
> deliberate denial-of-service attacks.  They need to be in a more secure
> directory; but where?  See the past discussions summarized in the
> TODO.detail file.

Quoth the file system standard:

`sharedstatedir'    The directory for installing architecture-independent data files    which the programs modify while
theyrun.  This should normally be    `/usr/local/com', but write it as `$(prefix)/com'.  (If you are    using Autoconf,
writeit as `@sharedstatedir@'.)
 

The problem with this approach is making that directory writeable by the
server account. Solutions:

1) Making the postmaster executable as root but later drop root  privileges. (This looks to be the cleanest solution,
butit is  probably a security problem waiting to happen.)
 

2) Making initdb executable as root but with some --user switch. Have it  create a subdirectory of $sharedstatedir
writableby the server  account, possibly with sticky bit and whatnot. Use `su' to invoke  `postgres'.
 
  This approach might be convenient also in terms of creating the data  directory.

3) Making "initialize lock file area" a separate initialization step,  possibly encapsulated into a shell script.


Btw., what would happen if we did start a second postmaster at the same
TCP port? Or more interestingly, what happens if some completely different
program already runs at that port? How do we protect against that? This
has something to do with SO_REUSEADDR, but I don't understand those things
too well.


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Quoth the file system standard:

> `sharedstatedir'
>      The directory for installing architecture-independent data files
>      which the programs modify while they run.  This should normally be
>      `/usr/local/com', but write it as `$(prefix)/com'.  (If you are
>      using Autoconf, write it as `@sharedstatedir@'.)

> The problem with this approach is making that directory writeable by the
> server account.

The lock directory should certainly be one used only for Postgres locks,
owned by postgres user and writable only by postgres user.

> 2) Making initdb executable as root but with some --user switch. Have it
>    create a subdirectory of $sharedstatedir writable by the server
>    account, possibly with sticky bit and whatnot. Use `su' to invoke
>    `postgres'.

>    This approach might be convenient also in terms of creating the data
>    directory.

We could do that, or we could just say "you must have arranged for
creation of these directories before you run initdb".  For the truly
lazy, a small script that could be executed as root could be provided.

Personally I'd be unwilling to run a script as complex as initdb as
root; what if it goes wrong?  Keep the stuff that requires root
permission separate, and as small as possible.

BTW, regardless of where exactly the lock directory lives (and IIRC
there were several schools of thought on that), I believe that the
lock directory pathname has to be wired in at configure time.  It
can't be an initdb argument because the whole locking thing is useless
unless all the PG installations on a machine agree on where the port
locks are.

> Btw., what would happen if we did start a second postmaster at the same
> TCP port? Or more interestingly, what happens if some completely different
> program already runs at that port? How do we protect against that? This
> has something to do with SO_REUSEADDR, but I don't understand those things
> too well.

SO_REUSEADDR solves the problem for TCP sockets.  The problem with Unix
sockets is that the kernel's detection of conflicts is pretty braindead:
if there is an existing socket file of the same name, you get an
"address in use" failure from bind(), regardless of whether anyone else
is actually using the socket.  So, if the previous postmaster died
ungracefully and didn't delete its socket file, a new postmaster cannot
be started up until the old socket file is removed.  What we're trying
to do here is automate that removal so the admin doesn't have to do it.
The trouble is we can't just unlink() the old socket file because
that'll succeed even if there is a postmaster actively using the socket!
So we need to find out whether the old postmaster is still alive
to decide whether it's OK to remove the old socket file or whether we
should abort startup.

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster!  In other words, pretend to
be a client and see if your connection attempt is answered.  (You don't
have to try to log in, just see if you get a connection.)  This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded?  But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file.  Comments anyone?
        regards, tom lane


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
JanWieck@t-online.de (Jan Wieck)
Date:
Tom Lane wrote:
> 
> Bruce and I were just talking by phone about this, and we realized that
> there is a completely different approach to making that decision: if you
> want to know whether there's an old postmaster connected to a socket
> file, try to connect to the old postmaster!  In other words, pretend to
> be a client and see if your connection attempt is answered.  (You don't
> have to try to log in, just see if you get a connection.)  This might
> also answer Peter's concern about socket files that belong to
> non-Postgres programs, although I doubt that's really a big issue.
> 
> There are some potential pitfalls here, like what if the old postmaster
> is there but overloaded?  But on the whole it seems like it might be
> a cleaner answer than fooling around with lockfiles, and certainly safer
> than relying on fcntl(SETLK) to work on a socket file.  Comments anyone?
   Like it.


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Alfred Perlstein
Date:
* Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:
> Tom Lane wrote:
> > 
> > Bruce and I were just talking by phone about this, and we realized that
> > there is a completely different approach to making that decision: if you
> > want to know whether there's an old postmaster connected to a socket
> > file, try to connect to the old postmaster!  In other words, pretend to
> > be a client and see if your connection attempt is answered.  (You don't
> > have to try to log in, just see if you get a connection.)  This might
> > also answer Peter's concern about socket files that belong to
> > non-Postgres programs, although I doubt that's really a big issue.
> > 
> > There are some potential pitfalls here, like what if the old postmaster
> > is there but overloaded?  But on the whole it seems like it might be
> > a cleaner answer than fooling around with lockfiles, and certainly safer
> > than relying on fcntl(SETLK) to work on a socket file.  Comments anyone?
> 
>     Like it.

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)       or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))       and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Mike Mascari
Date:
Alfred Perlstein wrote:
> 
> * Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:
> > Tom Lane wrote:
> > >
> > > Bruce and I were just talking by phone about this, and we realized that
> > > there is a completely different approach to making that decision: if you
> > > want to know whether there's an old postmaster connected to a socket
> > > file, try to connect to the old postmaster!  In other words, pretend to
> > > be a client and see if your connection attempt is answered.  (You don't
> > > have to try to log in, just see if you get a connection.)  This might
> > > also answer Peter's concern about socket files that belong to
> > > non-Postgres programs, although I doubt that's really a big issue.
> > >
> > > There are some potential pitfalls here, like what if the old postmaster
> > > is there but overloaded?  But on the whole it seems like it might be
> > > a cleaner answer than fooling around with lockfiles, and certainly safer
> > > than relying on fcntl(SETLK) to work on a socket file.  Comments anyone?
> >
> >     Like it.
> 
> my $pgsocket = "/tmp/.s.PGSQL.5432";
> 
> # try to connect to the postmaster
> socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
>         or die "unable to create unix domain socket: $!";
> 
> connect(SOCK, sockaddr_un($pgsocket))
>         and errexit("postmaster is running you must shut it down");
> 
> oh yeah... :)
> 
> -Alfred

I don't get this. Isn't there a race condition here?

Just curious,

Mike Mascari


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Alfred Perlstein
Date:
* Mike Mascari <mascarm@mascari.com> [000708 05:55] wrote:
> Alfred Perlstein wrote:
> > 
> > * Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:
> > > Tom Lane wrote:
> > > >
> > > > Bruce and I were just talking by phone about this, and we realized that
> > > > there is a completely different approach to making that decision: if you
> > > > want to know whether there's an old postmaster connected to a socket
> > > > file, try to connect to the old postmaster!  In other words, pretend to
> > > > be a client and see if your connection attempt is answered.  (You don't
> > > > have to try to log in, just see if you get a connection.)  This might
> > > > also answer Peter's concern about socket files that belong to
> > > > non-Postgres programs, although I doubt that's really a big issue.
> > > >
> > > > There are some potential pitfalls here, like what if the old postmaster
> > > > is there but overloaded?  But on the whole it seems like it might be
> > > > a cleaner answer than fooling around with lockfiles, and certainly safer
> > > > than relying on fcntl(SETLK) to work on a socket file.  Comments anyone?
> > >
> > >     Like it.
> > 
> > my $pgsocket = "/tmp/.s.PGSQL.5432";
> > 
> > # try to connect to the postmaster
> > socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
> >         or die "unable to create unix domain socket: $!";
> > 
> > connect(SOCK, sockaddr_un($pgsocket))
> >         and errexit("postmaster is running you must shut it down");
> > 
> > oh yeah... :)
> > 
> > -Alfred
> 
> I don't get this. Isn't there a race condition here?
> 
> Just curious,

Sure but it's handled, if there's a postmaster starting at this
exact instant, however since the script just runs postmaster
afterwards the conflict will make postmaster abort and I'll get an
error return from my invocation of postmaster.

-Alfred


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Bruce Momjian
Date:
> > my $pgsocket = "/tmp/.s.PGSQL.5432";
> > 
> > # try to connect to the postmaster
> > socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
> >         or die "unable to create unix domain socket: $!";
> > 
> > connect(SOCK, sockaddr_un($pgsocket))
> >         and errexit("postmaster is running you must shut it down");
> > 
> > oh yeah... :)
> > 
> > -Alfred
> 
> I don't get this. Isn't there a race condition here?

That's a good point.  I don't think so because the socket will only
create for one user.  Basically, we don't need something bulletproof
here.  We just need something to prevent admins from accidentally
starting two postmasters on the same port.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [000708 06:02] wrote:
> > > my $pgsocket = "/tmp/.s.PGSQL.5432";
> > > 
> > > # try to connect to the postmaster
> > > socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
> > >         or die "unable to create unix domain socket: $!";
> > > 
> > > connect(SOCK, sockaddr_un($pgsocket))
> > >         and errexit("postmaster is running you must shut it down");
> > > 
> > > oh yeah... :)
> > > 
> > > -Alfred
> > 
> > I don't get this. Isn't there a race condition here?
> 
> That's a good point.  I don't think so because the socket will only
> create for one user.  Basically, we don't need something bulletproof
> here.  We just need something to prevent admins from accidentally
> starting two postmasters on the same port.

Actually I just remebered the issue here, if one wants to start
postmaster on an alternate port there will be no conflict and 
all hell may break loose.

-Alfred


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Bruce Momjian
Date:
> > That's a good point.  I don't think so because the socket will only
> > create for one user.  Basically, we don't need something bulletproof
> > here.  We just need something to prevent admins from accidentally
> > starting two postmasters on the same port.
> 
> Actually I just remebered the issue here, if one wants to start
> postmaster on an alternate port there will be no conflict and 
> all hell may break loose.

We already lock the /data directory.  This is for the port lock.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tatsuo Ishii
Date:
> > But how do you know if that file still belongs to an active postmaster? 
> > What if it exited before removing the file.  Seems we would have to
> > write the PID into the file, and do a kill() to see if it is running.

I believe we already do this (SetPidFile() in
utils/init/miscinit.c). Isn't it sufficient (1) to prevent starting a
new postmaster on the same data dir and (2) to unlink the accidently
left socket file?
--
Tatsuo Ishii


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tatsuo Ishii
Date:
> > > But how do you know if that file still belongs to an active postmaster? 
> > > What if it exited before removing the file.  Seems we would have to
> > > write the PID into the file, and do a kill() to see if it is running.
> 
> I believe we already do this (SetPidFile() in
> utils/init/miscinit.c). Isn't it sufficient (1) to prevent starting a
> new postmaster on the same data dir and (2) to unlink the accidently
> left socket file?

I noticed what I was missing after sending the mail. Sorry for the
confusion. Seems the idea trying to connect a postmaster looks good.
--
Tatsuo Ishii


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Peter Eisentraut
Date:
Tom Lane writes:

> Bruce and I were just talking by phone about this, and we realized that
> there is a completely different approach to making that decision: if you
> want to know whether there's an old postmaster connected to a socket
> file, try to connect to the old postmaster!

It seems that that would completely reverse the assumption of risk.
Currently, the postmaster may fail to start because there's a stale socket
file lying around, out of respect to a running colleague. With this idea
it would be the running postmaster's job to "defend" his socket against
newly starting colleagues. That doesn't seem fair.

What are our problems?

There's a possible DoS attack when someone else comes first and creates a
file /tmp/.s.PGSQL.5432. But detecting whether there's another program
running on that socket (if it's a socket) isn't going to help because you
most likely won't be able to delete it anyway. The solution to this is to
make the path of the socket file configurable more easily so that the
administrator has the choice of putting it a safer place that he prepared
appropriately.

A complementary solution is of course to add an option to run without Unix
socket, since we don't rely on the socket file for data directory locking
anymore. In fact, does anybody mind if I add such an option? We can have

tcpip_socket = yes|no
unix_socket = yes|no

(Security-conscious users may choose to turn off both. :-))

The other problem is a socket file left behind by a crashed postmaster. I
don't consider this such a big problem; a crashed postmaster is not the
normal mode of operation. The friendly message we have right now seems
alright to me. And it's a way of tell that the postmaster crashed at all.

One idea to get the pid in there somewhere is creating a socket file
"/tmp/.s.PGSQL.port.pid" and making /tmp/.s.PGSQL.port a symlink to it.
Then clients don't know the difference, but the server knows the pid and
can take appropriate action. Or make the symlink the other way around, not
sure.


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Bruce Momjian
Date:
> The other problem is a socket file left behind by a crashed postmaster. I
> don't consider this such a big problem; a crashed postmaster is not the
> normal mode of operation. The friendly message we have right now seems
> alright to me. And it's a way of tell that the postmaster crashed at all.
> 
> One idea to get the pid in there somewhere is creating a socket file
> "/tmp/.s.PGSQL.port.pid" and making /tmp/.s.PGSQL.port a symlink to it.
> Then clients don't know the difference, but the server knows the pid and
> can take appropriate action. Or make the symlink the other way around, not
> sure.

The symlink is an interesting idea.  lstat() on the normal name can give
the file name with pid.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Chris Bitmead
Date:
> There's a possible DoS attack when someone else comes first and creates a
> file /tmp/.s.PGSQL.5432. But detecting whether there's another program
> running on that socket (if it's a socket) isn't going to help because you
> most likely won't be able to delete it anyway. The solution to this is to
> make the path of the socket file configurable more easily so that the
> administrator has the choice of putting it a safer place that he prepared
> appropriately.

If you are worried about DoS, I think the only solution is to figure out
a way to be using one of the reserved <1000 ports. I don't think there's
any way around that is there? Also presumably not using a reserved port
is a security risk. Not that I'm worried.


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tom Lane
Date:
Mike Mascari <mascarm@mascari.com> writes:
> I don't get this. Isn't there a race condition here?

Strictly speaking, there is, but the race window is only a couple
of kernel calls wide, and as Bruce pointed out we do not need something
that is absolutely gold-plated bulletproof.  We are just trying to
prevent dbadmins from accidentally starting two postmasters on the
same port number.

The way this would work is that pqcomm.c would do something like
if (socketFileAlreadyExists) {    try to open connection to existing postmaster;    if (successful) {        report
portconflict and die;    }    delete existing socket file;}bind(socket);  // kernel creates new socket file
herelisten();

The race condition here is that if newly-started postmaster A has
executed bind() but not yet listen(), then newly-started postmaster B
could come along, observe the existing socket file, try to open
connection, fail, delete socket file, proceed.  AFAIK B will be allowed
to bind() and create a new socket file, and A ends up listening to a
port that's lost in hyperspace --- no one else can ever connect to it
because it has no visible representative in the filesystem.

But as soon as A has executed listen() it's safe --- even though it's
not really ready to accept connections yet, the attempted connect from
B will wait till it does.  (We should, therefore, use a plain vanilla
connect attempt for the probe --- no non-blocking connect or anything
fancy.)

The bind-to-listen delay in pqcomm.c is currently several lines long,
but there's no reason they couldn't be successive kernel calls with
nothing but a test for bind() failure between.

That strikes me as plenty close enough...
        regards, tom lane


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> It seems that that would completely reverse the assumption of risk.
> Currently, the postmaster may fail to start because there's a stale socket
> file lying around, out of respect to a running colleague. With this idea
> it would be the running postmaster's job to "defend" his socket against
> newly starting colleagues. That doesn't seem fair.

True, it would reverse the most probable failure mode, but I'm not sure
that's a bad thing.

> The other problem is a socket file left behind by a crashed postmaster. I
> don't consider this such a big problem; a crashed postmaster is not the
> normal mode of operation. The friendly message we have right now seems
> alright to me. And it's a way of tell that the postmaster crashed at all.

No, actually this is a *big* problem.  That friendly message is no help
to a system boot script that can't read it (the same point you've made
repeatedly w.r.t configure issues; surprised you don't see it here).

If I do a fast shutdown of my Unix system (the kind where shutdown does
a 'kill -9' on all user processes --- on HPUX systems this is invoked by
hitting the power switch or by the power supply overtemperature sensor)
then the postmaster doesn't get a chance to clean out its socket file.
After reboot, the postmaster fails to start up until I manually
intervene by removing the socket file.  That's not robust and not
acceptable.

The way I currently get around this (and I believe it's a pretty popular
thing to do) is that my postmaster-start script unconditionally deletes
the socket file before launching the postmaster.  That's actually far
riskier than what we are discussing, because there is *no* safety check
for an already-started postmaster.  A connection check would be a big
improvement.

I consider failure-to-start during normal system bootup to be a far
graver risk than the possibility that a second postmaster will usurp
a first postmaster's Unix socket --- especially since the latter could
only happen if the first postmaster isn't answering connections, in
which case allowing it to keep the socket is of dubious value anyhow.
So reversing the presumption of innocence seems like a good idea to me.

> ... The solution to this is to make the path of the socket file
> configurable more easily so that the administrator has the choice of
> putting it a safer place that he prepared appropriately.

We talked about that in the original discussion (you might want to
review the flock pghackers thread from late August '98).  The trouble is
that the socket file path is a critical part of the client-to-postmaster
protocol: change the path, and existing clients don't know where to
connect.  Oops.  So even though /tmp is obviously a pretty bogus place
to keep the socket, the compatibility headaches of moving it are so
great that no one really wants to bite the bullet.

We talked about compromises like keeping the real socket in some safer
directory, with a symlink from /tmp for old clients, and I think that's
what will happen eventually.  But please note that if the socket file
path is "easily configurable" then the same problem comes right back
to bite you again. It's *not* "easy" to change your mind about where
the socket files live; on any given platform that decision had better be
graven on stone tablets, because you want all your clients of whatever
vintage to be able to find your postmaster(s).  I'm inclined to think
that a configure option might be counterproductive --- nailing it down
in the per-OS template file seems much less likely to get screwed up.

The major problem with a hard-wired socket path that's not /tmp is
that you can't install the socket directory if you're not root, so the
ability to fire up a postmaster with no root privs whatever would no
longer exist.  We could get around that if it were possible to run with
only TCP connection support, making Unix-domain connections an option
instead of the base requirement.

> A complementary solution is of course to add an option to run without Unix
> socket, since we don't rely on the socket file for data directory locking
> anymore. In fact, does anybody mind if I add such an option? We can have
> tcpip_socket = yes|no
> unix_socket = yes|no

Yup, it would make a lot of sense to have an option for no Unix socket
connections (we already have that as an #ifdef for a couple of platforms
with no Unix socket support, but not as a postmaster start-time choice).

> (Security-conscious users may choose to turn off both. :-))

Uh, not at the moment, because we use the port interlock(s) as a proxy
for a shared-memory interlock.  Really there are three resources that
we must prevent concurrent postmasters from sharing:* data directory;* listen port number;* shared-memory blocks (and
semaphoresets).
 
We have a good solution in place now for locking the data directory, but
the port interlock still needs work.  Currently we use the port number
to assign shmem/sema keys, and there is no separate interlock to guard
against shmem conflicts.  I believe we had a discussion a few months ago
about rejiggering the shmem key assignment method so that shmem
conflicts would be detected and dealt with cleanly --- might be a good
idea to make that happen before we go too far with port interlock
changes.
        regards, tom lane


Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [000708 06:40] wrote:
> > > That's a good point.  I don't think so because the socket will only
> > > create for one user.  Basically, we don't need something bulletproof
> > > here.  We just need something to prevent admins from accidentally
> > > starting two postmasters on the same port.
> > 
> > Actually I just remebered the issue here, if one wants to start
> > postmaster on an alternate port there will be no conflict and 
> > all hell may break loose.
> 
> We already lock the /data directory.  This is for the port lock.

The whole process could be locked by an fcntl lock on a seperate file,
which I think was already mentioned, however I've deleted most of the
thread unfortunatly.

/tmp/.l.PGSQL.5432 <- fcntl lockfile, aquired first.
/tmp/.s.PGSQL.5432 <- socket.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."