Thread: Restart after poweroutage
I recently had another electrical power outage that left my machine unable to restart postgresql. I had previously reported this a while ago: http://archives.postgresql.org/pgsql-general/2005-04/msg01286.php Anyway, because I have seen this problem before, I knew exactly what the solution to the problem was (delete the postmaster.pid file), but for yucks I decided to imagine I did not know. What would I need to do to figure this out. [root@bilbo ~]# cat /etc/issue Fedora Core release 5 (Bordeaux) [root@bilbo ~]# rpm -q postgresql postgresql-8.1.4-1.FC5.1 Looking in /etc/init.d/postgresql I found that this file exists: /var/lib/pgsql/pgstartup.log [root@bilbo ~]# grep PGLOG /etc/init.d/postgresql PGLOG=/var/lib/pgsql/pgstartup.log Looking at $PGLOG, I discovered: FATAL: pre-existing shared memory block (key 5432001, ID 65536) is still in use HINT: If you're sure there are no old server processes still running, remove the shared memory block with the command "ipcclean", "ipcrm", or just delete the file "postmaster.pid". From this commentary, to find "postmaster.pid" I ran: [root@bilbo ~]# find /var | grep postmaster.pid /var/lib/pgsql/data/postmaster.pid [root@bilbo ~]# rm /var/lib/pgsql/data/postmaster.pid [root@bilbo ~]# service postgresql start Starting postgresql service: [ OK ] I am a long time pgsql user, and it still took me a while (and some hunting) to find the solution to this problem. This may be a real stumbling block of a new user. According to commentary on my report from April 2005, Tom mentions that there was an automated solution to this issue in v8.0.2: http://archives.postgresql.org/pgsql-general/2005-04/msg01289.php I can reproduce this problem in v8.1.4 by these simple steps: [root@bilbo ~]# service postgresql start Starting postgresql service: [ OK ] [root@bilbo ~]# service postgresql stop Stopping postgresql service: [ OK ] [root@bilbo ~]# touch /var/lib/pgsql/data/postmaster.pid [root@bilbo ~]# service postgresql start Starting postgresql service: [FAILED] [root@bilbo ~]# rm /var/lib/pgsql/data/postmaster.pid [root@bilbo ~]# service postgresql start Starting postgresql service: [ OK ] PS: This email is just a FYI, not a complaint. Maybe it is just a right of passage that users of postgresql will just have to learn about this. -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
> ...but for > yucks I decided to imagine I did not know. What would I need to do to > figure this out. > Maybe it is just a right > of passage that users of postgresql will just have to learn about this. I would imagine that I am one of the yucks like you. ;-) Anyway, everytime I see a useful (for a yuck) email like this I always flag it for future reference. Thanks, Richard Broersma Jr.
On Sun, Sep 24, 2006 at 10:11:00AM -0300, Jon Lapham wrote: > I recently had another electrical power outage that left my machine > unable to restart postgresql. I had previously reported this a while ago: FWIW, I've crashed my machine a lot of times and never run into this problem. However, I run Debian, maybe they do something different. > Looking at $PGLOG, I discovered: > FATAL: pre-existing shared memory block (key 5432001, ID 65536) is > still in use > HINT: If you're sure there are no old server processes still running, > remove the shared memory block with the command "ipcclean", "ipcrm", or > just delete the file "postmaster.pid". This doesn't make sense to me. A reboot will absolutly kill any existing shared memory blocks, how can it possibly be complaining about it? What does ipcs show after the failure to start postgres? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
Jon Lapham <lapham@jandr.org> writes: > I recently had another electrical power outage that left my machine > unable to restart postgresql. I had previously reported this a while ago: > http://archives.postgresql.org/pgsql-general/2005-04/msg01286.php > Anyway, because I have seen this problem before, I knew exactly what the > solution to the problem was (delete the postmaster.pid file), As was pointed out to you in the discussion subsequent to that message, this is not a good automatic response, and it should not be necessary at all with a post-8.0 postmaster. > FATAL: pre-existing shared memory block (key 5432001, ID 65536) is > still in use This is extremely odd, because a shared memory block could not possibly have survived a reboot. Too bad you have destroyed the evidence, because I would like to know what really happened there. Is it possible that you have somehow managed to try to start the postmaster twice during your system boot cycle? If you do have two postmasters running in that data directory right now, you are in deep trouble :-( > I can reproduce this problem in v8.1.4 by these simple steps: This is not "reproducing the problem", this is merely demonstrating that the postmaster will fail to overwrite a root-owned postmaster.pid file. regards, tom lane
> This doesn't make sense to me. A reboot will absolutly kill any > existing shared memory blocks, how can it possibly be complaining about > it? PostgreSQL complains if it finds a postmaster.pid. As far as I can tell it doesn't have anything to do with shared memory except that we are tracking info in the postmaster.pid. Info that is no longer valid. What would make sense to me, would be to read the pid file, check the process list. If the process that is in the pid file is not in the process list, we remove the pid file and start. Sincerely, Joshua D. Drake
Tom Lane wrote: > Jon Lapham <lapham@jandr.org> writes: >> I recently had another electrical power outage that left my machine >> unable to restart postgresql. I had previously reported this a while ago: > >> http://archives.postgresql.org/pgsql-general/2005-04/msg01286.php > >> Anyway, because I have seen this problem before, I knew exactly what the >> solution to the problem was (delete the postmaster.pid file), > > As was pointed out to you in the discussion subsequent to that message, > this is not a good automatic response, and it should not be necessary at > all with a post-8.0 postmaster. Okay, yes, I forgot to mention that I also checked to make sure there was no postmaster running (via "ps"). > >> FATAL: pre-existing shared memory block (key 5432001, ID 65536) is >> still in use > > This is extremely odd, because a shared memory block could not possibly > have survived a reboot. Too bad you have destroyed the evidence, > because I would like to know what really happened there. Is it possible > that you have somehow managed to try to start the postmaster twice > during your system boot cycle? If you do have two postmasters running > in that data directory right now, you are in deep trouble :-( Ugh, I should have sent the email before "fixing" the problem. Sorry about that. If it happens again, I will not be so hasty. Luckily (for you, not me) we have frequent power outages where this computer resides. :) Maybe it will happen again. I do not *think* I am running 2 postmasters. [root@bilbo ~]# service postgresql stop Stopping postgresql service: [ OK ] [root@bilbo ~]# ps -A | grep -i post [root@bilbo ~]# service postgresql start Starting postgresql service: [ OK ] [root@bilbo ~]# ps -A | grep -i post 30760 ? 00:00:00 postmaster 30762 ? 00:00:00 postmaster 30764 ? 00:00:00 postmaster 30765 ? 00:00:00 postmaster 30766 ? 00:00:00 postmaster ...is that normal to see 5 of them running? I'm running just the standard (up to date) Fedora Core 5 version of postgresql, init scripts and all. >> I can reproduce this problem in v8.1.4 by these simple steps: > > This is not "reproducing the problem", this is merely demonstrating that > the postmaster will fail to overwrite a root-owned postmaster.pid file. Okay. -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
Jon Lapham wrote: [...] > > I do not *think* I am running 2 postmasters. > > [root@bilbo ~]# service postgresql stop > Stopping postgresql service: [ OK ] > [root@bilbo ~]# ps -A | grep -i post > [root@bilbo ~]# service postgresql start > Starting postgresql service: [ OK ] > [root@bilbo ~]# ps -A | grep -i post > 30760 ? 00:00:00 postmaster > 30762 ? 00:00:00 postmaster > 30764 ? 00:00:00 postmaster > 30765 ? 00:00:00 postmaster > 30766 ? 00:00:00 postmaster > > ....is that normal to see 5 of them running? Yes it is, and even more, depending on the number of clients currently connected... akar# ps ax | grep postma 18626 ?? Is 0:36.48 /usr/local/bin/postmaster (postgres) 18627 ?? S 0:19.54 postmaster: logger process (postgres) 18629 ?? S 3:50.58 postmaster: writer process (postgres) 18630 ?? S 1:51.48 postmaster: stats buffer process (postgres) 18632 ?? I 7:17.39 postmaster: stats collector process (postgres) 18683 ?? I 0:47.33 postmaster: ipfwpglogger ipfw 192.168.0.1(59189) idle (postgres) 18685 ?? I 0:00.94 postmaster: ipfwpglogger ipfw 192.168.0.1(54261) idle (postgres) 49188 ?? I 0:04.37 postmaster: snortpguser ipfw 192.168.0.1(56173) idle (postgres) 80722 ?? I 0:00.11 postmaster: pgsql template1 192.168.0.250(49421) idle (postgres) 80723 ?? I 0:00.75 postmaster: pgsql ipfw 192.168.0.250(50624) idle (postgres) 80724 ?? I 0:01.67 postmaster: pgsql ipfw 192.168.0.250(60737) idle (postgres) 81216 ?? I 0:00.48 postmaster: perl ipfw [local] idle (postgres) 81253 ?? I 0:01.43 postmaster: webpguser ipfw [local] idle (postgres) HTH, -- MaXX
On Sunday 24 September 2006 09:17 am, Tom Lane wrote: > Jon Lapham <lapham@jandr.org> writes: > > I recently had another electrical power outage that left my machine > > unable to restart postgresql. I had previously reported this a while > > ago: > > > > http://archives.postgresql.org/pgsql-general/2005-04/msg01286.php > > > > Anyway, because I have seen this problem before, I knew exactly what the > > solution to the problem was (delete the postmaster.pid file), > > As was pointed out to you in the discussion subsequent to that message, > this is not a good automatic response, and it should not be necessary at > all with a post-8.0 postmaster. > > > FATAL: pre-existing shared memory block (key 5432001, ID 65536) is > > still in use > > This is extremely odd, because a shared memory block could not possibly > have survived a reboot. Too bad you have destroyed the evidence, > because I would like to know what really happened there. Is it possible > that you have somehow managed to try to start the postmaster twice > during your system boot cycle? If you do have two postmasters running > in that data directory right now, you are in deep trouble :-( > <Snip> > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster In the its a small world department I experienced the same problem shortly after reading this message. The particulars Postgres 8.1.4, Kubuntu 6.06 on a laptop. My laptop sometimes experiences issues with ACPI and has to be powered off. After the most recent event I saw a message similar to that reported above. I checked and there were no other Postgres instances running. What information I could collect is included in the attached file. -- Adrian Klaver aklaver@comcast.net
Attachment
Jon Lapham wrote: > [root@bilbo ~]# ps -A | grep -i post > 30760 ? 00:00:00 postmaster > 30762 ? 00:00:00 postmaster > 30764 ? 00:00:00 postmaster > 30765 ? 00:00:00 postmaster > 30766 ? 00:00:00 postmaster > > ...is that normal to see 5 of them running? Yes, because they are not really postmasters; they are child processes, which can be backends, the logger process, the background writer, etc. Try with this: ps u -C postmaster That should show more detail, and save you the "grep". Or try something like this: $ ps -w -C postmaster -o pid,ppid,args PID PPID COMMAND 15812 15808 /pgsql/install/00orig/bin/postmaster 15814 15812 postgres: writer process 15815 15812 postgres: stats collector process 15830 15812 postgres: alvherre alvherre [local] idle in transaction Here you can see that there is a postmaster with PID 15812, and several processes which are children of that one. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Adrian Klaver <aklaver@comcast.net> writes: > On Sunday 24 September 2006 09:17 am, Tom Lane wrote: >> Jon Lapham <lapham@jandr.org> writes: >>> FATAL: pre-existing shared memory block (key 5432001, ID 65536) is >>> still in use >> >> This is extremely odd, because a shared memory block could not possibly >> have survived a reboot. > In the its a small world department I experienced the same problem shortly > after reading this message. I spent quite some time today trying to duplicate this failure (by pulling the plug on an up-to-date Fedora Core 5 machine). No luck. I suppose there is some contributing factor on your machines that we haven't identified yet ... regards, tom lane
On Monday 25 September 2006 02:48 pm, Tom Lane wrote: > Adrian Klaver <aklaver@comcast.net> writes: > > On Sunday 24 September 2006 09:17 am, Tom Lane wrote: > >> Jon Lapham <lapham@jandr.org> writes: > >>> FATAL: pre-existing shared memory block (key 5432001, ID 65536) is > >>> still in use > >> > >> This is extremely odd, because a shared memory block could not possibly > >> have survived a reboot. > > > > In the its a small world department I experienced the same problem > > shortly after reading this message. > > I spent quite some time today trying to duplicate this failure (by > pulling the plug on an up-to-date Fedora Core 5 machine). No luck. > I suppose there is some contributing factor on your machines that > we haven't identified yet ... > > regards, tom lane Is there something I could do to help capture useful information should I see a repeat of this behavior? -- Adrian Klaver aklaver@comcast.net
Adrian Klaver <aklaver@comcast.net> writes: > On Monday 25 September 2006 02:48 pm, Tom Lane wrote: >> I spent quite some time today trying to duplicate this failure (by >> pulling the plug on an up-to-date Fedora Core 5 machine). No luck. > Is there something I could do to help capture useful information > should I see a repeat of this behavior? What I'd really like to have is an strace of the postmaster when it produces this error. If you're lucky, the failure will be repeatable (ie, try to start the postmaster again and it fails again with same error) and you can just strace it at that point. If it isn't repeatable in that sense, that in itself is useful information, but capturing an strace will be painful :-( Also, if it is repeatable, "ipcs -m" output will be useful context. regards, tom lane
I wrote: > Also, if it is repeatable, "ipcs -m" output will be useful context. I forgot to mention: on Linux it's important to run ipcs as root (eg "sudo ipcs -m") else it will lie to you. An incomplete listing is worse than useless. regards, tom lane
Tom Lane wrote: > Adrian Klaver <aklaver@comcast.net> writes: >> On Sunday 24 September 2006 09:17 am, Tom Lane wrote: >>> Jon Lapham <lapham@jandr.org> writes: >>>> FATAL: pre-existing shared memory block (key 5432001, ID 65536) is >>>> still in use >>> This is extremely odd, because a shared memory block could not possibly >>> have survived a reboot. > >> In the its a small world department I experienced the same problem shortly >> after reading this message. > > I spent quite some time today trying to duplicate this failure (by > pulling the plug on an up-to-date Fedora Core 5 machine). No luck. > I suppose there is some contributing factor on your machines that > we haven't identified yet ... For what it is worth, I created a FC5 VMware installation and loaded my database data into it. I simulated a bunch of power outages by telling VMware to power off the vm. Is this a good simulation of a power outage, or is there something inherently flawed about using a VM to test this? To stress test, I turned off the power while actively processing db operations, such as: loading data into a database, create a new database, delete contents of a large table within a transaction. Anyway, in every case I could not reproduce the issue, postgresql came back flawlessly. I am willing to run any power outage simulation tests people may have. -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
Jon,
It is inherently flawed. VMware really "powers down", that is, the operating system has time to shut down. Or, in other incarnations, VMware freezes the system state.
It's nothing near a real power outage, which gives no time for anything.
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.
-- Harald A. Massa, December 2005
http://groups.google.de/group/comp.lang.python/browse_thread/thread/285b9adeec188fb2
For what it is worth, I created a FC5 VMware installation and loaded my
database data into it. I simulated a bunch of power outages by telling
VMware to power off the vm. Is this a good simulation of a power
outage, or is there something inherently flawed about using a VM to test
this?
It's nothing near a real power outage, which gives no time for anything.
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.
-- Harald A. Massa, December 2005
http://groups.google.de/group/comp.lang.python/browse_thread/thread/285b9adeec188fb2
Harald Armin Massa wrote: > It is inherently flawed. VMware really "powers down", that is, the > operating system has time to shut down. Or, in other incarnations, > VMware freezes the system state. > > It's nothing near a real power outage, which gives no time for anything. On my VMware window, there are "shut down guest", "suspend" and "power off" options in the menu. Each acts differently. Using "power off" pops up a warning window saying that this is dangerous as the guest OS will not shut down properly. When selecting "power off", it acts instantly, the VM instantly closes. Using "suspend" takes a minute or so, while the VM RAM state is saved to disk. Bringing the VM back from this state brings the VM back to exactly how it was before the suspend. Using "shut down guest" causes the FC5 (in this case) VM instance to run through its shutdown procedure, and takes 15 seconds or so to shutdown. I am using "power off". -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
Jon Lapham <lapham@jandr.org> writes: > To stress test, I turned off the power while actively processing db > operations, such as: loading data into a database, create a new > database, delete contents of a large table within a transaction. > Anyway, in every case I could not reproduce the issue, postgresql came > back flawlessly. I am willing to run any power outage simulation tests > people may have. It seems unlikely to me that this issue has anything to do with what the database was doing at the time of the system crash. More likely the critical variable is some other software on your system. What other shared memory segments besides Postgres' exist when your system is operating normally? regards, tom lane
Tom Lane wrote: > Jon Lapham <lapham@jandr.org> writes: >> To stress test, I turned off the power while actively processing db >> operations, such as: loading data into a database, create a new >> database, delete contents of a large table within a transaction. >> Anyway, in every case I could not reproduce the issue, postgresql came >> back flawlessly. I am willing to run any power outage simulation tests >> people may have. > > It seems unlikely to me that this issue has anything to do with what > the database was doing at the time of the system crash. More likely > the critical variable is some other software on your system. What > other shared memory segments besides Postgres' exist when your system > is operating normally? So, you want the output of "sudo ipcs -m"? Just looking at the information, I can't imagine that this is useful to you... but, it is attached below. Is it important for you to know that at the time of the power outage, I *did* have 2 closed source kernel modules loaded, vmware's and NVidia's. (This is a development machine, not production...). Could one of these modules screwed up somehow and trampled on postgres's shared memory space? Anyway, just thought I'd mention it. lapham@bilbo > sudo ipcs -m ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x00000000 0 root 777 94208 0 0x00000000 98305 gdm 600 393216 0 0x00000000 41517058 lapham 600 393216 2 dest 0x6a6b6cbd 53411843 lapham 600 384 0 0x12ac1925 53444612 lapham 600 131072 0 0x000a1de2 41451525 lapham 600 1 0 0x6499077f 41484294 lapham 600 1 0 0x00000000 41549831 lapham 600 393216 2 dest 0x00000000 41582600 lapham 600 393216 2 dest 0x00000000 41615369 lapham 600 393216 2 dest 0x00000000 41648138 lapham 600 393216 2 dest 0x00000000 41680907 lapham 600 393216 2 dest 0x00000000 41713676 lapham 600 393216 2 dest 0x00000000 41746445 lapham 600 393216 2 dest 0x00000000 41779214 lapham 600 393216 2 dest 0x00000000 41844751 lapham 600 393216 2 dest 0x00000000 41877520 lapham 600 393216 2 dest 0x00000000 41910289 lapham 600 393216 2 dest 0x00000000 59539474 lapham 666 28800 1 dest 0x00000000 43450387 lapham 600 393216 2 dest 0x00000000 44990484 lapham 600 393216 2 dest 0x00000000 48234517 lapham 600 12288 2 dest 0x00000000 43483158 lapham 600 393216 2 dest 0x00000000 59277335 lapham 600 393216 2 dest 0x0052e2c1 8192026 postgres 600 10461184 2 0x00000000 49086493 lapham 600 393216 2 dest 0x00000000 48267296 lapham 600 393216 2 dest 0x00000000 48300065 lapham 600 12288 2 dest -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
Jon Lapham <lapham@jandr.org> writes: > Is it important for you to know that at the time of the power outage, I > *did* have 2 closed source kernel modules loaded, vmware's and NVidia's. > (This is a development machine, not production...). Could one of > these modules screwed up somehow and trampled on postgres's shared > memory space? Anyway, just thought I'd mention it. Well, that's potentially interesting. Do you have things configured to load those automatically on boot? > ------ Shared Memory Segments -------- > key shmid owner perms bytes nattch status > 0x00000000 0 root 777 94208 0 > 0x6a6b6cbd 53411843 lapham 600 384 0 > 0x12ac1925 53444612 lapham 600 131072 0 > 0x000a1de2 41451525 lapham 600 1 0 > 0x6499077f 41484294 lapham 600 1 0 Anyone have any idea what the above entries are from? The root-owned, world-writable one seems a bit odd in particular. regards, tom lane
Tom Lane wrote: > Jon Lapham <lapham@jandr.org> writes: >> Is it important for you to know that at the time of the power outage, I >> *did* have 2 closed source kernel modules loaded, vmware's and NVidia's. >> (This is a development machine, not production...). Could one of >> these modules screwed up somehow and trampled on postgres's shared >> memory space? Anyway, just thought I'd mention it. > > Well, that's potentially interesting. Do you have things configured to > load those automatically on boot? Yes, both. -- -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil Personal: http://www.jandr.org/ ***-*--*----*-------*------------*--------------------*---------------
Tom Lane wrote: > Jon Lapham <lapham@jandr.org> writes: [...] >> ------ Shared Memory Segments -------- >> key shmid owner perms bytes nattch status >> 0x00000000 0 root 777 94208 0 >> 0x6a6b6cbd 53411843 lapham 600 384 0 >> 0x12ac1925 53444612 lapham 600 131072 0 >> 0x000a1de2 41451525 lapham 600 1 0 >> 0x6499077f 41484294 lapham 600 1 0 > > Anyone have any idea what the above entries are from? The root-owned, > world-writable one seems a bit odd in particular. I think it is the Nvidia module... I have an nvidia kernel module loaded on my FreeBSD 6.1 laptop and I also have that root owned world writable stuff. From my laptop (sorry for the line wrap): maxx@fireblade:/home/maxx #ipcs -a Shared Memory: T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME m 65536 5432001 --rw------- pgsql pgsql pgsql pgsql 2 9519104 651 651 15:53:58 19:05:43 15:53:58 m 65537 0 --rwarwarwa root wheel root wheel 1 94208 794 794 15:54:04 15:55:04 15:54:04 m 44630018 0 --rw------- maxx wheel maxx wheel 2 393216 33046 794 19:11:01 19:11:01 19:11:01 m 43778051 0 --rw------- maxx wheel maxx wheel 2 393216 33046 794 19:11:01 19:11:01 19:11:01 maxx@fireblade:/home/maxx #ps ax |grep 794 794 ?? R 99:46.07 /usr/X11R6/bin/X -br -nolisten tcp :0 -auth /var/run/xauth/A:0-ZlO0QR (Xorg) HTH, -- MaXX