Thread: DROP TABLESPACE causes panic during recovery
In CVS tip, try running the regression tests against an installed postmaster (ie, make installcheck); then as soon as the tests are done, kill -9 the bgwriter process to force a database restart. Most of the time you'll get a PANIC during recovery:

LOG: background writer process (PID 2493) was terminated by signal 9
LOG: server process (PID 2493) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2004-08-04 14:26:23 EDT
LOG: checkpoint record is at 0/4C1CA28
LOG: redo record is at 0/4BFD510; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 11269; next OID: 294376
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/4BFD510
PANIC: could not create directory "/home/postgres/testversion/data/pg_tblspc/301180/163304": No such file or directory
LOG: startup process (PID 4560) was terminated by signal 6
LOG: aborting startup due to startup process failure

The panic is here:

(gdb) bt
#0 0xc0141220 in ?? () from /usr/lib/libc.1
#1 0xc00aa7ec in ?? () from /usr/lib/libc.1
#2 0xc008c2b8 in ?? () from /usr/lib/libc.1
#3 0xc0086d9c in ?? () from /usr/lib/libc.1
#4 0x2c6080 in errfinish (dummy=1) at elog.c:454
#5 0x185984 in TablespaceCreateDbspace (spcNode=1074100592, dbNode=0, isRedo=1 '\001') at tablespace.c:140
#6 0x23c90c in smgrcreate (reln=0x400a1d80, isTemp=0 '\000', isRedo=1 '\001') at smgr.c:327
#7 0x23d6cc in smgr_redo (lsn={xlogid = 0, xrecoff = 86455912}, record=0x40067be8) at smgr.c:876
#8 0x115714 in StartupXLOG () at xlog.c:4229
#9 0x11dc5c in BootstrapMain (argc=4, argv=0x7b03b630) at bootstrap.c:426
#10 0x20b7dc in StartChildProcess (xlop=2) at postmaster.c:3233

and of course the problem is that log replay is not prepared to cope with a reference to a table that's in a tablespace that no longer exists.
The regression tests trigger the problem because they do a DROP TABLESPACE near the end. This is impossible to fix nicely because the information to reconstruct the tablespace is simply not available. We could make an ordinary directory (not a symlink) under pg_tblspc and then limp along in the expectation that it would get removed before we finish replay. Or we could just skip logged operations on files within the tablespace, but that feels pretty uncomfortable to me --- it amounts to deliberately discarding data ... Any thoughts? regards, tom lane
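The first option above (an ordinary directory in place of the missing symlink) can be sketched roughly as follows. This is only an illustration in Python, not the backend's C code: the helper name ensure_dbspace is hypothetical, and the OIDs are the ones from the PANIC message.

```python
import os
import tempfile


def ensure_dbspace(pgdata, spc_oid, db_oid):
    """Return pg_tblspc/<spc_oid>/<db_oid>, creating missing parents.

    Hypothetical helper: if pg_tblspc/<spc_oid> is gone (the symlink was
    removed by DROP TABLESPACE), a plain directory is created in its place
    so that replay can continue.
    """
    path = os.path.join(pgdata, "pg_tblspc", str(spc_oid), str(db_oid))
    os.makedirs(path, exist_ok=True)
    return path


def demo():
    # Use a throwaway directory as a stand-in for $PGDATA.
    with tempfile.TemporaryDirectory() as pgdata:
        os.mkdir(os.path.join(pgdata, "pg_tblspc"))
        path = ensure_dbspace(pgdata, 301180, 163304)
        spc = os.path.join(pgdata, "pg_tblspc", "301180")
        # The per-database directory exists, and the tablespace entry is a
        # real directory rather than a symlink.
        return os.path.isdir(path), os.path.islink(spc)
```

The sketch says nothing about what should happen to such a directory if replay ends without the DROP TABLESPACE being reached; that question is what the rest of the thread is about.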
Tom Lane wrote: > In CVS tip, try running the regression tests against an installed > postmaster (ie, make installcheck); then as soon as the tests are > done, kill -9 the bgwriter process to force a database restart. > Most of the time you'll get a PANIC during recovery: [...] > This is impossible to fix nicely because the information to reconstruct > the tablespace is simply not available. We could make an ordinary > directory (not a symlink) under pg_tblspc and then limp along in the > expectation that it would get removed before we finish replay. Or we > could just skip logged operations on files within the tablespace, but > that feels pretty uncomfortable to me --- it amounts to deliberately > discarding data ... > > Any thoughts? How is a dropped table handled by the recovery code? Doesn't it present the same sort of issues (though on a smaller scale)? -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > Tom Lane wrote: >> This is impossible to fix nicely because the information to reconstruct >> the tablespace is simply not available. We could make an ordinary >> directory (not a symlink) under pg_tblspc and then limp along in the >> expectation that it would get removed before we finish replay. Or we >> could just skip logged operations on files within the tablespace, but >> that feels pretty uncomfortable to me --- it amounts to deliberately >> discarding data ... > How is a dropped table handled by the recovery code? Doesn't it present > the same sort of issues (though on a smaller scale)? Not really. If the replay code encounters an update to a table file that's not there, it simply creates the file and plows ahead. The thing that I'm stuck on about tablespaces is that if the symlink in $PGDATA/pg_tblspc isn't there, there's no evident way to recreate it correctly --- we have no idea where it was supposed to point. regards, tom lane
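The difference between the two cases can be shown with a small sketch. It is an assumption-laden illustration, not the actual redo code: replay_write is a hypothetical function, and the relation file names 16384 and 16385 are made up. A missing file is created on demand, but a missing parent directory (the dropped tablespace) makes the same call fail with "No such file or directory".

```python
import os
import tempfile


def replay_write(path, offset, data):
    """Apply one logged write, creating the file if it is not there."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Case 1: the table file is missing but its directory exists.
        table = os.path.join(root, "16384")
        replay_write(table, 0, b"page")
        created = os.path.exists(table)

        # Case 2: the whole tablespace path is missing.
        orphan = os.path.join(root, "pg_tblspc", "301180", "163304", "16385")
        try:
            replay_write(orphan, 0, b"page")
            failed = False
        except FileNotFoundError:
            failed = True
        return created, failed
```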
On Wed, 4 Aug 2004, Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > Tom Lane wrote: > >> This is impossible to fix nicely because the information to reconstruct > >> the tablespace is simply not available. We could make an ordinary > >> directory (not a symlink) under pg_tblspc and then limp along in the > >> expectation that it would get removed before we finish replay. Or we > >> could just skip logged operations on files within the tablespace, but > >> that feels pretty uncomfortable to me --- it amounts to deliberately > >> discarding data ... > > > How is a dropped table handled by the recovery code? Doesn't it present > > the same sort of issues (though on a smaller scale)? > > Not really. If the replay code encounters an update to a table file > that's not there, it simply creates the file and plows ahead. The thing > that I'm stuck on about tablespaces is that if the symlink in > $PGDATA/pg_tblspc isn't there, there's no evident way to recreate it > correctly --- we have no idea where it was supposed to point. I don't think we have any choice but to log the symlink creation. Will this solve the problem? Gavin
Gavin Sherry <swm@linuxworld.com.au> writes:
> On Wed, 4 Aug 2004, Tom Lane wrote:
>> Not really. If the replay code encounters an update to a table file
>> that's not there, it simply creates the file and plows ahead. The thing
>> that I'm stuck on about tablespaces is that if the symlink in
>> $PGDATA/pg_tblspc isn't there, there's no evident way to recreate it
>> correctly --- we have no idea where it was supposed to point.

> I don't think we have any choice but to log the symlink creation. Will
> this solve the problem?

We do need to do that, but it will *not* solve this problem. The scenario that causes the problem is

    CREATE TABLESPACE
    ...
    much time passes
    ...
    CHECKPOINT
    ...
    modify tables in tablespace
    drop tables in tablespace
    DROP TABLESPACE
    ...
    system crash

Now the system needs to replay from the last checkpoint. It's going to hit updates to tables that aren't there anymore in a tablespace that's not there anymore. There will not be anything in the replayed part of the log that will give a clue where that tablespace was physically.

regards, tom lane
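The scenario can be modelled with a toy log to show why logging the symlink creation does not help here. Everything in this sketch is an assumption for illustration: the record layout, the location "/mnt/fast", and the relation number 16385 are invented, and the sketch assumes CREATE TABLESPACE is already being logged.

```python
# Toy write-ahead log: (record kind, payload). Not the real WAL format.
WAL = [
    ("CREATE TABLESPACE", {"spc": 301180, "location": "/mnt/fast"}),
    ("CHECKPOINT", {}),
    ("UPDATE", {"spc": 301180, "rel": 16385}),
    ("DROP TABLE", {"spc": 301180, "rel": 16385}),
    ("DROP TABLESPACE", {"spc": 301180}),
]


def records_to_replay(wal):
    """Return the records that follow the last checkpoint."""
    last_ckpt = max(i for i, (kind, _) in enumerate(wal) if kind == "CHECKPOINT")
    return wal[last_ckpt + 1:]


def unknown_tablespaces(wal):
    """Tablespaces referenced during replay whose location is not in the
    replayed part of the log."""
    known = set()
    unknown = set()
    for kind, info in records_to_replay(wal):
        if kind == "CREATE TABLESPACE":
            known.add(info["spc"])
        elif "spc" in info and info["spc"] not in known:
            unknown.add(info["spc"])
    return unknown
```

Because the CREATE TABLESPACE record sits before the checkpoint, replay never sees it, and tablespace 301180 comes back as unknown even though its creation was logged.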
On Wed, 4 Aug 2004, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > On Wed, 4 Aug 2004, Tom Lane wrote: > >> Not really. If the replay code encounters an update to a table file > >> that's not there, it simply creates the file and plows ahead. The thing > >> that I'm stuck on about tablespaces is that if the symlink in > >> $PGDATA/pg_tblspc isn't there, there's no evident way to recreate it > >> correctly --- we have no idea where it was supposed to point. > > > I don't think we have any choice but to log the symlink creation. Will > > this solve the problem? > > We do need to do that, but it will *not* solve this problem. The > scenario that causes the problem is > > CREATE TABLESPACE > ... > much time passes > ... > CHECKPOINT > ... > modify tables in tablespace > drop tables in tablespace > DROP TABLESPACE > ... > system crash > > Now the system needs to replay from the last checkpoint. It's going to > hit updates to tables that aren't there anymore in a tablespace that's > not there anymore. There will not be anything in the replayed part of > the log that will give a clue where that tablespace was physically. Ahh, yes of course. Seems like the best way would be to create the path under pg_tblspc as directories and plough ahead, like you said. The only alternative that comes to mind is that we could keep all the directory structure and symlinks around until the next checkpoint. But that would be messy and may well not solve the problem anyway for things like PITR. Gavin
> We do need to do that, but it will *not* solve this problem. The > scenario that causes the problem is > > CREATE TABLESPACE > ... > much time passes > ... > CHECKPOINT > ... > modify tables in tablespace > drop tables in tablespace > DROP TABLESPACE > ... > system crash > > Now the system needs to replay from the last checkpoint. It's going to > hit updates to tables that aren't there anymore in a tablespace that's > not there anymore. There will not be anything in the replayed part of > the log that will give a clue where that tablespace was physically. Maybe we need to create a new system tablespace: pg_recovery. Then when this situation occurs, if the tablespace cannot be located, we recreate the objects in the system 'pg_recovery' tablespace or something. I dunno :) Chris
Tom Lane said: >The > scenario that causes the problem is > > CREATE TABLESPACE > ... > much time passes > ... > CHECKPOINT > ... > modify tables in tablespace > drop tables in tablespace > DROP TABLESPACE > ... > system crash > > Now the system needs to replay from the last checkpoint. It's going to > hit updates to tables that aren't there anymore in a tablespace that's > not there anymore. There will not be anything in the replayed part of > the log that will give a clue where that tablespace was physically. > Could we create the tables in the default tablespace? Or create a dummy tablespace (since it's not there we expect it to be removed anyway, don't we?) I guess the big danger would be running out of disk space, but maybe that is a lower risk than this one. cheers andrew
Gavin Sherry <swm@linuxworld.com.au> writes: > > CREATE TABLESPACE > > ... > > much time passes > > ... > > CHECKPOINT > > ... > > modify tables in tablespace > > drop tables in tablespace > > DROP TABLESPACE > > ... > > system crash What happens here if no table spaces are involved? It just creates bogus tables with partial data counting on the restore to see the drop table command later and delete the corrupt tables? Does that pose any danger with PITR? The scenario above seems ok since if the PITR starting point is after the drop table/tablespace then presumably the recovery target has to be after that as well? Is there any other scenario where the partial data files could escape the recovery process? -- greg
Andrew Dunstan wrote: > Tom Lane said: > >The > > scenario that causes the problem is > > > > CREATE TABLESPACE > > ... > > much time passes > > ... > > CHECKPOINT > > ... > > modify tables in tablespace > > drop tables in tablespace > > DROP TABLESPACE > > ... > > system crash > > > > Now the system needs to replay from the last checkpoint. It's going to > > hit updates to tables that aren't there anymore in a tablespace that's > > not there anymore. There will not be anything in the replayed part of > > the log that will give a clue where that tablespace was physically. > > > > Could we create the tables in the default tablespace? Or create a dummy > tablespace (since it's not there we expect it to be removed anyway, don't > we?) I guess the big danger would be running out of disk space, but maybe > that is a lower risk than this one. Uh, why is the symlink not going to be there already? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Uh, why is the symlink not going to be there already? Because we removed it at the DROP TABLESPACE. regards, tom lane
>>Uh, why is the symlink not going to be there already? > > > Because we removed it at the DROP TABLESPACE. Maybe we could avoid removing it until the next checkpoint? Or is that not enough. Maybe it could stay there forever :/ Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > Maybe we could avoid removing it until the next checkpoint? Or is that > not enough. Maybe it could stay there forever :/ Part of the problem here is that this code has to serve several purposes. We have different scenarios to worry about: * crash recovery from the most recent checkpoint * PITR replay over a long interval (many checkpoints) * recovery in the face of a partially corrupt filesystem It's the last one that is mostly bothering me at the moment. I don't want us to throw away data simply because the filesystem forgot an inode. Yeah, we might not have enough data in the WAL log to completely reconstruct a table, but we should push out what we do have, *not* toss it into the bit bucket. In the first case (straight crash recovery) I think it is true that any reference to a missing file is a reference to a file that will get deleted before recovery finishes. But I don't think that holds for PITR (we might be asked to stop short of where the table gets deleted) nor for the case where there's been filesystem damage. regards, tom lane
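The distinction between straight crash recovery and PITR can be sketched with a toy model. The record format and relation numbers are assumptions for illustration only. Replaying to the end of the log leaves nothing behind, while stopping early leaves files that were recreated during replay and never dropped.

```python
# Toy log following the last checkpoint: (record kind, relation number).
RECORDS = [
    ("UPDATE", 16385),
    ("UPDATE", 16386),
    ("DROP TABLE", 16385),
    ("DROP TABLE", 16386),
]


def orphans_after_replay(records, stop=None):
    """Relations touched during replay and not dropped before replay stops.

    stop=None replays everything (crash recovery); an integer stops after
    that many records (a PITR target that falls short of the drops).
    """
    live = set()
    for kind, rel in records[:stop]:
        if kind == "UPDATE":
            live.add(rel)
        elif kind == "DROP TABLE":
            live.discard(rel)
    return live
```

For example, orphans_after_replay(RECORDS) is empty, whereas orphans_after_replay(RECORDS, stop=2) still contains both relations. The filesystem-damage case is not modelled at all, since there the missing file is not explained by anything in the log.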
Tom Lane wrote: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > > Maybe we could avoid removing it until the next checkpoint? Or is that > > not enough. Maybe it could stay there forever :/ > > Part of the problem here is that this code has to serve several > purposes. We have different scenarios to worry about: > > * crash recovery from the most recent checkpoint > > * PITR replay over a long interval (many checkpoints) > > * recovery in the face of a partially corrupt filesystem > > It's the last one that is mostly bothering me at the moment. I don't > want us to throw away data simply because the filesystem forgot an > inode. Yeah, we might not have enough data in the WAL log to completely > reconstruct a table, but we should push out what we do have, *not* toss > it into the bit bucket. I like the idea tossed out by one of the others the most: create a "recovery" system tablespace, and use it to resolve issues like this. The question is: what do you do with the tables in that tablespace once recovery is complete? Leave them there? That's certainly a possibility (in fact, it seems the best option, especially now that we're doing PITR), but it means that the DBA would have to periodically clean up that tablespace so that it doesn't run out of space during a later recovery. Actually, it seems to me to be the only option that isn't the equivalent of throwing away the data... > In the first case (straight crash recovery) I think it is true that any > reference to a missing file is a reference to a file that will get > deleted before recovery finishes. But I don't think that holds for PITR > (we might be asked to stop short of where the table gets deleted) nor > for the case where there's been filesystem damage. But doesn't PITR assume that a full filesystem-level restore of the database as it was prior to the events in the first event log being replayed has been done? In that event, wouldn't the PITR process Just Work? 
-- Kevin Brown kevin@sysexperts.com
Did we resolve this? --------------------------------------------------------------------------- Tom Lane wrote: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > > Maybe we could avoid removing it until the next checkpoint? Or is that > > not enough. Maybe it could stay there forever :/ > > Part of the problem here is that this code has to serve several > purposes. We have different scenarios to worry about: > > * crash recovery from the most recent checkpoint > > * PITR replay over a long interval (many checkpoints) > > * recovery in the face of a partially corrupt filesystem > > It's the last one that is mostly bothering me at the moment. I don't > want us to throw away data simply because the filesystem forgot an > inode. Yeah, we might not have enough data in the WAL log to completely > reconstruct a table, but we should push out what we do have, *not* toss > it into the bit bucket. > > In the first case (straight crash recovery) I think it is true that any > reference to a missing file is a reference to a file that will get > deleted before recovery finishes. But I don't think that holds for PITR > (we might be asked to stop short of where the table gets deleted) nor > for the case where there's been filesystem damage. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Did we resolve this? No, it's an open issue. regards, tom lane
Added to open items: * fix recovery of DROP TABLESPACE after checkpoint -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Is this fixed? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Is this fixed? Yes. regards, tom lane