Thread: pg crash shortly after 9.1.1 -> 9.1.2 upgrade

pg crash shortly after 9.1.1 -> 9.1.2 upgrade

From
Joseph Shraibman
Date:
All was fine until:

LOG:  statement: select "_devel".cleanupEvent('10 minutes'::interval,
'false'::boolean);
ERROR:  could not open file "base/16406/2072097_fsm": Permission denied
STATEMENT:  select "_devel".cleanupEvent('10 minutes'::interval,
'false'::boolean);
WARNING:  AbortTransaction while in COMMIT state
PANIC:  cannot abort transaction 7108311, it was already committed
LOG:  unexpected EOF on client connection
LOG:  unexpected EOF on client connection
LOG:  unexpected EOF on client connection
LOG:  server process (PID 21090) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

...

DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2011-12-07
14:51:37 EST
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  consistent recovery state reached at 5/C9A28BE4
LOG:  redo starts at 5/C9A26DBC
FATAL:  could not open file "base/16406/2072097_fsm": Permission denied
CONTEXT:  xlog redo commit: 2011-12-07 14:56:04.305309-05; rels:
base/16406/2072101 base/16406/2072100 base/16406/2072099
base/16406/2072098 base/16406/2072097
LOG:  startup process (PID 25977) exited with exit code 1
LOG:  aborting startup due to startup process failure


I was unable to restart pg until I did a chmod u+rw on that file.  There
was nothing interesting in the log before the crash, just a bunch of
activity that was obviously slony.

version is:
PostgreSQL 9.1.2 on i686-pc-linux-gnu, compiled by gcc (GCC) 4.1.2
20080704 (Red Hat 4.1.2-48), 32-bit

Re: pg crash shortly after 9.1.1 -> 9.1.2 upgrade

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> All was fine until:
> LOG:  statement: select "_devel".cleanupEvent('10 minutes'::interval,
> 'false'::boolean);
> ERROR:  could not open file "base/16406/2072097_fsm": Permission denied

That's pretty weird.  What were the permissions on that file?  Was it
properly owned by the postgres user?

> WARNING:  AbortTransaction while in COMMIT state
> PANIC:  cannot abort transaction 7108311, it was already committed

And that's even weirder, since it implies that we tried to open the FSM
file during post-commit cleanup.  The commit-record printout suggests
that we probably were trying to delete the file due to a committed DROP
TABLE operation, but I don't understand why we tried to open the file
rather than just unlink it.

Can you reproduce this?  Can you show us what cleanupEvent() does?

            regards, tom lane

Re: pg crash shortly after 9.1.1 -> 9.1.2 upgrade

From
Joseph Shraibman
Date:
On 12/08/2011 12:54 AM, Tom Lane wrote:
> Joseph Shraibman<jks@selectacast.net>  writes:
>> All was fine until:
>> LOG:  statement: select "_devel".cleanupEvent('10 minutes'::interval,
>> 'false'::boolean);
>> ERROR:  could not open file "base/16406/2072097_fsm": Permission denied
>
> That's pretty weird.  What were the permissions on that file?  Was it
> properly owned by the postgres user?

It had no permissions at all

---------- 1 postgres postgres         0 Feb 14  2005 2072097_fsm

I actually didn't notice the old date until now.  This was an 8.4.x
database that I upgraded to 9.1.1 a while ago using pg_upgrade (using
the hardlink option).  I still have a backup of the 8.4 database from
when I did the upgrade, and that file doesn't appear in it.
>
>> WARNING:  AbortTransaction while in COMMIT state
>> PANIC:  cannot abort transaction 7108311, it was already committed
>
> And that's even weirder, since it implies that we tried to open the FSM
> file during post-commit cleanup.  The commit-record printout suggests
> that we probably were trying to delete the file due to a committed DROP
> TABLE operation, but I don't understand why we tried to open the file
> rather than just unlink it.
>
> Can you reproduce this?

No.  Nothing was going on at the time, and it hasn't happened again (yet).

>  Can you show us what cleanupEvent() does?

It's from slony 2.0.7.  The server wasn't in heavy use at the time, and
all that appeared in the logs as far back as I cared to look was slony
activity.  That was that last line before the error.

Re: pg crash shortly after 9.1.1 -> 9.1.2 upgrade

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> On 12/08/2011 12:54 AM, Tom Lane wrote:
>> Joseph Shraibman<jks@selectacast.net>  writes:
>>> All was fine until:
>>> LOG:  statement: select "_devel".cleanupEvent('10 minutes'::interval,
>>> 'false'::boolean);
>>> ERROR:  could not open file "base/16406/2072097_fsm": Permission denied

>> That's pretty weird.  What were the permissions on that file?  Was it
>> properly owned by the postgres user?

> It had no permissions at all

> ---------- 1 postgres postgres         0 Feb 14  2005 2072097_fsm

> I actually didn't notice the old date until now.  This was an 8.4.x
> database that I upgraded to 9.1.1 a while ago using pg_upgrade (using
> the hardlink option).  I still have a backup of the 8.4 database from
> when I did the upgrade, and that file doesn't appear in it.

It turns out that the crash at commit + failure to restart was a bug
introduced in 8.4; it should have just unlinked the file without
complaint.  I've now applied a patch for that misbehavior.

It's still mighty weird that the file was there with that mod date in
the first place, since PG certainly wasn't using FSM forks in 2005, even
if this database is that old.  I'm guessing that the mod date and lack
of permissions are both artifacts of some non-PG operation, perhaps a
failed rsync or some such?  I would suspect pg_upgrade except I'm pretty
sure it does not mess with either permissions or mod date.

Anyway, future releases should handle such a thing a bit more
gracefully.

            regards, tom lane