Re: Getting rid of duplicate tables. - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: Getting rid of duplicate tables. |
Date | |
Msg-id | 25637.1074625739@sss.pgh.pa.us Whole thread Raw |
In response to | Getting rid of duplicate tables. (Jared Carr <jared@89glass.com>) |
Responses |
Re: Getting rid of duplicate tables.
|
List | pgsql-general |
Jared Carr <jared@89glass.com> writes: > Tom Lane wrote: >>> Yes, it does appear that there was a backend crash/(operator stupidly >>> kill -9 'ing possibly) on the 29th. >> >> Hmm ... could you send me that area of the log? > Dec 29 16:31:54 penguin postgres[1714]: [3-1] LOG: received smart > shutdown request What it looks like to me is that this shutdown never completed, probably because some client was hanging onto an open connection ("smart shutdown" doesn't forcibly abort active sessions). The subsequent log entries show a couple of failed attempts to start a postmaster. I would guess the sequence was roughly this: pg_ctl stop ... wait a couple minutes, get bored of waiting ... pg_ctl start ... fails with > Dec 29 16:33:44 penguin postgres[5379]: [1-1] FATAL: lock file > "/var/lib/pgsql/data74/postmaster.pid" already exists > Dec 29 16:33:44 penguin postgres[5379]: [1-2] HINT: Is another > postmaster (PID 1714) running in data directory "/var/lib/pgsql/data74"? kill -9 old postmaster pg_ctl start ... fails with > Dec 29 16:34:12 penguin postgres[5395]: [1-1] FATAL: pre-existing > shared memory block (key 5432001, ID 0) is still in use > Dec 29 16:34:12 penguin postgres[5395]: [1-2] HINT: If you're sure > there are no old server processes still running, remove the shared > memory block with the command "ipcrm", > Dec 29 16:34:12 penguin postgres[5395]: [1-3] or just delete the file > "/var/lib/pgsql/data74/postmaster.pid". rm /var/lib/pgsql/data74/postmaster.pid pg_ctl start If those were the only steps taken, then the old postmaster's child backends would still have been running. At that point you're pretty well hosed because there are two independent sets of backends with separate shared memory blocks, and any subsequent transaction status updates from the old backends could overwrite pg_clog data from the new ones. So I now think it was operator error and not a software bug at all. You need to educate whoever did this about proper database shutdown procedures. As a Postgres maintainer, the only thing that troubles me about this is that the error messages from the failed postmaster start attempts could be read as having encouraged the operator to do exactly the worst possible things. I'm cc'ing this back to pgsql-general to see if anyone has any thoughts about rewording these messages. In particular it seems like the HINT for the second failure is really disastrous; it should tell you to kill off the old backends, not to zap the lockfile. regards, tom lane PS: do you know why the database was being shut down in the first place? Was there a pre-existing problem? PPS: at this point I think we've learned all we can from your database, so you can go ahead and repair the damage: delete from pg_class where ctid = '(27,43)'; should do it, at least for the one known problem. I'd encourage you to do what you can to look for other inconsistencies that may have been introduced.
pgsql-general by date: