Re: Getting rid of duplicate tables. - Mailing list pgsql-general

From Tom Lane
Subject Re: Getting rid of duplicate tables.
Date
Msg-id 25637.1074625739@sss.pgh.pa.us
Whole thread Raw
In response to Getting rid of duplicate tables.  (Jared Carr <jared@89glass.com>)
Responses Re: Getting rid of duplicate tables.
List pgsql-general
Jared Carr <jared@89glass.com> writes:
> Tom Lane wrote:
>>> Yes, it does appear that there was a backend crash/(operator stupidly
>>> kill -9 'ing possibly) on the 29th.
>>
>> Hmm ... could you send me that area of the log?

> Dec 29 16:31:54 penguin postgres[1714]: [3-1] LOG:  received smart
> shutdown request

What it looks like to me is that this shutdown never completed, probably
because some client was hanging onto an open connection ("smart
shutdown" doesn't forcibly abort active sessions).  The subsequent log
entries show a couple of failed attempts to start a postmaster.  I would
guess the sequence was roughly this:

    pg_ctl stop
    ... wait a couple minutes, get bored of waiting ...
    pg_ctl start
    ... fails with

> Dec 29 16:33:44 penguin postgres[5379]: [1-1] FATAL:  lock file
> "/var/lib/pgsql/data74/postmaster.pid" already exists
> Dec 29 16:33:44 penguin postgres[5379]: [1-2] HINT:  Is another
> postmaster (PID 1714) running in data directory "/var/lib/pgsql/data74"?

    kill -9 old postmaster
    pg_ctl start
    ... fails with

> Dec 29 16:34:12 penguin postgres[5395]: [1-1] FATAL:  pre-existing
> shared memory block (key 5432001, ID 0) is still in use
> Dec 29 16:34:12 penguin postgres[5395]: [1-2] HINT:  If you're sure
> there are no old server processes still running, remove the shared
> memory block with the command "ipcrm",
> Dec 29 16:34:12 penguin postgres[5395]: [1-3]  or just delete the file
> "/var/lib/pgsql/data74/postmaster.pid".

    rm /var/lib/pgsql/data74/postmaster.pid
    pg_ctl start

If those were the only steps taken, then the old postmaster's child
backends would still have been running.  At that point you're pretty
well hosed because there are two independent sets of backends with
separate shared memory blocks, and any subsequent transaction status
updates from the old backends could overwrite pg_clog data from the
new ones.

So I now think it was operator error and not a software bug at all.
You need to educate whoever did this about proper database shutdown
procedures.

As a Postgres maintainer, the only thing that troubles me about this
is that the error messages from the failed postmaster start attempts
could be read as having encouraged the operator to do exactly the
worst possible things.  I'm cc'ing this back to pgsql-general to see
if anyone has any thoughts about rewording these messages.  In
particular it seems like the HINT for the second failure is really
disastrous; it should tell you to kill off the old backends, not to
zap the lockfile.

            regards, tom lane

PS: do you know why the database was being shut down in the first place?
Was there a pre-existing problem?

PPS: at this point I think we've learned all we can from your database,
so you can go ahead and repair the damage:
    delete from pg_class where ctid = '(27,43)';
should do it, at least for the one known problem.  I'd encourage you
to do what you can to look for other inconsistencies that may have been
introduced.

pgsql-general by date:

Previous
From: Andrew Sullivan
Date:
Subject: Re: erServer (was: Re: Postgress and MYSQL)
Next
From: "Esteban Kemp"
Date:
Subject: PostgreSQL and Datawarehouse tools