Thread: pg crash shortly after 9.1.1 -> 9.1.2 upgrade
All was fine until: LOG: statement: select "_devel".cleanupEvent('10 minutes'::interval, 'false'::boolean); ERROR: could not open file "base/16406/2072097_fsm": Permission denied STATEMENT: select "_devel".cleanupEvent('10 minutes'::interval, 'false'::boolean); WARNING: AbortTransaction while in COMMIT state PANIC: cannot abort transaction 7108311, it was already committed LOG: unexpected EOF on client connection LOG: unexpected EOF on client connection LOG: unexpected EOF on client connection LOG: server process (PID 21090) was terminated by signal 6: Aborted LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. ... DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2011-12-07 14:51:37 EST LOG: database system was not properly shut down; automatic recovery in progress LOG: consistent recovery state reached at 5/C9A28BE4 LOG: redo starts at 5/C9A26DBC FATAL: could not open file "base/16406/2072097_fsm": Permission denied CONTEXT: xlog redo commit: 2011-12-07 14:56:04.305309-05; rels: base/16406/2072101 base/16406/2072100 base/16406/2072099 base/16406/2072098 base/16406/2072097 LOG: startup process (PID 25977) exited with exit code 1 LOG: aborting startup due to startup process failure I was unable to restart pg until I did a chmod u+rw on that file. There was nothing interesting in the log before the crash, just a bunch of activity that was obviously slony. version is: PostgreSQL 9.1.2 on i686-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 32-bit
Joseph Shraibman <jks@selectacast.net> writes: > All was fine until: > LOG: statement: select "_devel".cleanupEvent('10 minutes'::interval, > 'false'::boolean); > ERROR: could not open file "base/16406/2072097_fsm": Permission denied That's pretty weird. What were the permissions on that file? Was it properly owned by the postgres user? > WARNING: AbortTransaction while in COMMIT state > PANIC: cannot abort transaction 7108311, it was already committed And that's even weirder, since it implies that we tried to open the FSM file during post-commit cleanup. The commit-record printout suggests that we probably were trying to delete the file due to a committed DROP TABLE operation, but I don't understand why we tried to open the file rather than just unlink it. Can you reproduce this? Can you show us what cleanupEvent() does? regards, tom lane
On 12/08/2011 12:54 AM, Tom Lane wrote: > Joseph Shraibman<jks@selectacast.net> writes: >> All was fine until: >> LOG: statement: select "_devel".cleanupEvent('10 minutes'::interval, >> 'false'::boolean); >> ERROR: could not open file "base/16406/2072097_fsm": Permission denied > > That's pretty weird. What were the permissions on that file? Was it > properly owned by the postgres user? It had no permissions at all ---------- 1 postgres postgres 0 Feb 14 2005 2072097_fsm I actually didn't notice the old date until now. This was an 8.4.x database that I upgraded to 9.1.1 a while ago using pg_upgrade (using the hardlink option). I still have a backup of the 8.4 database from when I did the upgrade, and that file doesn't appear in it. > >> WARNING: AbortTransaction while in COMMIT state >> PANIC: cannot abort transaction 7108311, it was already committed > > And that's even weirder, since it implies that we tried to open the FSM > file during post-commit cleanup. The commit-record printout suggests > that we probably were trying to delete the file due to a committed DROP > TABLE operation, but I don't understand why we tried to open the file > rather than just unlink it. > > Can you reproduce this? No. Nothing was going on at the time, and it hasn't happened again (yet). > Can you show us what cleanupEvent() does? It's from slony 2.0.7. The server wasn't in heavy use at the time, and all that appeared in the logs as far back as I cared to look was slony activity. That was that last line before the error.
Joseph Shraibman <jks@selectacast.net> writes: > On 12/08/2011 12:54 AM, Tom Lane wrote: >> Joseph Shraibman<jks@selectacast.net> writes: >>> All was fine until: >>> LOG: statement: select "_devel".cleanupEvent('10 minutes'::interval, >>> 'false'::boolean); >>> ERROR: could not open file "base/16406/2072097_fsm": Permission denied >> That's pretty weird. What were the permissions on that file? Was it >> properly owned by the postgres user? > It had no permissions at all > ---------- 1 postgres postgres 0 Feb 14 2005 2072097_fsm > I actually didn't notice the old date until now. This was an 8.4.x > database that I upgraded to 9.1.1 a while ago using pg_upgrade (using > the hardlink option). I still have a backup of the 8.4 database from > when I did the upgrade, and that file doesn't appear in it. It turns out that the crash at commit + failure to restart was a bug introduced in 8.4; it should have just unlinked the file without complaint. I've now applied a patch for that misbehavior. It's still mighty weird that the file was there with that mod date in the first place, since PG certainly wasn't using FSM forks in 2005, even if this database is that old. I'm guessing that the mod date and lack of permissions are both artifacts of some non-PG operation, perhaps a failed rsync or some such? I would suspect pg_upgrade except I'm pretty sure it does not mess with either permissions or mod date. Anyway, future releases should handle such a thing a bit more gracefully. regards, tom lane